Jon Karrer

Daily Linear Classifier

I wanted to establish a baseline for a machine learning model that predicts whether a stock will go up or down in the next period. The period in this case is daily, so the question is whether the stock will go up the next day. Building a baseline model is the first step in all my projects; there is magic in the 80/20 approach to everything. Here I start with a two-layer linear neural network that tries to classify a single row of data as buy or sell.

Code Repo

Data

A single row of data holds the candle for that day plus some technical indicators. Here is a raw example of a row:

| Column Name | Value |
|-------------|-------|
| id | 200 |
| event_datetime | 2016-10-17 04:00:00 |
| event_unix_timestamp | 1476676800000 |
| open_price | 17.7999992370605 |
| close_price | 17.7700004577637 |
| high_price | 18.2000007629395 |
| low_price | 17.7049999237061 |
| volume | 4385696.0 |
| volume_weighted_price | 17.8098182678223 |
| stock_symbol | JBLU |
| timeframe | 1D |
| bar_trend | bearish |
| buy_or_sell | 1 |
| next_frame_price | 17.7800006866455 |
| next_frame_trend | bearish |
| next_frame_unix_timestamp | 1476763200000 |
| next_frame_event_datetime | 2016-10-18 04:00:00 |
| hundred_day_sma | 17.1679515838623 |
| hundred_day_ema | 17.1679515838623 |
| fifty_day_sma | 16.9481010437012 |
| fifty_day_ema | 16.9481010437012 |
| twenty_day_sma | 17.5162487030029 |
| twenty_day_ema | 17.5162487030029 |
| nine_day_ema | 17.8033351898193 |
| nine_day_sma | 17.8033351898193 |
| hundred_day_high | 18.9400005340576 |
| hundred_day_low | 14.7600002288818 |
| fifty_day_high | 18.4699993133545 |
| fifty_day_low | 15.6999998092651 |
| ten_day_high | 18.4699993133545 |
| ten_day_low | 17.1499996185303 |
| fourteen_day_rsi | 57.9687461853027 |
| top_bollinger_band | 18.182430267334 |
| middle_bollinger_band | 17.5162487030029 |
| bottom_bollinger_band | 16.8500671386719 |

Data processing

The framework I use is called Burn. It provides a number of utilities that help create neural nets and train them, including a SQLite dataset utility that I use for this model. Burn's opinions on table format are linked here.
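Loading rows then looks roughly like this; a minimal sketch, assuming Burn's `SqliteDataset` from the `burn-dataset` crate, with an illustrative item struct whose fields must match the column names (only a few of the 25 features are shown):

```rust
use burn::data::dataset::{Dataset, SqliteDataset};
use serde::{Deserialize, Serialize};

// One row of the processed table; field names mirror the sqlite columns.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct DailyBarItem {
    pub open_price: f32,
    pub close_price: f32,
    pub volume: f32,
    pub fourteen_day_rsi: f32,
    pub bar_trend: i64,
    pub buy_or_sell: i64, // the label: 1 = buy, 0 = sell
    // ...the remaining indicator columns follow the same pattern
}

pub fn load_split(db_file: &str, split: &str) -> SqliteDataset<DailyBarItem> {
    // `split` is the table name inside the sqlite file, e.g. "train" or "valid"
    SqliteDataset::from_db_file(db_file, split).expect("failed to open sqlite dataset")
}

pub fn row_count(dataset: &SqliteDataset<DailyBarItem>) -> usize {
    dataset.len() // needs the Dataset trait in scope
}
```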

Along with formatting the data for the framework, I need to split it into training and validation sets. There are 1,000,000 rows in the dataset, split 80% training and 20% validation. Also, since this is time series data, the validation set must come after the training data chronologically. Here is an example row from one of those tables after the split:

| Column Name | Value |
|-------------|-------|
| row_id | 1 |
| open_price | 17.7999992370605 |
| close_price | 17.7700004577637 |
| high_price | 18.2000007629395 |
| low_price | 17.7049999237061 |
| volume | 4385696.0 |
| volume_weighted_price | 17.8098182678223 |
| bar_trend | 1 |
| buy_or_sell | 1 |
| hundred_day_sma | 17.1679515838623 |
| hundred_day_ema | 17.1679515838623 |
| fifty_day_sma | 16.9481010437012 |
| fifty_day_ema | 16.9481010437012 |
| twenty_day_sma | 17.5162487030029 |
| twenty_day_ema | 17.5162487030029 |
| nine_day_ema | 17.8033351898193 |
| nine_day_sma | 17.8033351898193 |
| hundred_day_high | 18.9400005340576 |
| hundred_day_low | 14.7600002288818 |
| fifty_day_high | 18.4699993133545 |
| fifty_day_low | 15.6999998092651 |
| ten_day_high | 18.4699993133545 |
| ten_day_low | 17.1499996185303 |
| fourteen_day_rsi | 57.9687461853027 |
| top_bollinger_band | 18.182430267334 |
| middle_bollinger_band | 17.5162487030029 |
| bottom_bollinger_band | 16.8500671386719 |
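The chronological cut itself is the important part; a minimal sketch, assuming the rows are already sorted ascending by `event_unix_timestamp`:

```rust
/// Chronological 80/20 split: every validation row comes strictly after
/// every training row in time, so the model never trains on the future.
fn chronological_split<T: Clone>(rows: &[T], train_frac: f64) -> (Vec<T>, Vec<T>) {
    let cut = (rows.len() as f64 * train_frac) as usize;
    (rows[..cut].to_vec(), rows[cut..].to_vec())
}

// With 1,000,000 rows and train_frac = 0.8:
// rows [0, 800_000) go to the train table, rows [800_000, 1_000_000) to valid.
```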

Training

Linear classifiers are usually the starting point for classification tasks on tabular data. I will walk through my experiment configs for each run and see how they perform. Hopefully adjusting hyperparameters will improve performance with each run, but that's why it's called experimenting. First, though, I will start with the simplest setup I can think of, sketched below.
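Roughly, that module looks like this in Burn; this is a reconstruction to match the Run 1 config below, not the repo's exact code, and `init` signatures shift a little between Burn versions:

```rust
use burn::{
    module::Module,
    nn::{Linear, LinearConfig, Relu},
    tensor::{backend::Backend, Tensor},
};

#[derive(Module, Debug)]
pub struct Model<B: Backend> {
    input_layer: Linear<B>,
    output_layer: Linear<B>,
    activation: Relu,
}

impl<B: Backend> Model<B> {
    pub fn new(device: &B::Device) -> Self {
        Self {
            // 25 input features -> 64 hidden units -> 2 output logits
            input_layer: LinearConfig::new(25, 64).with_bias(true).init(device),
            output_layer: LinearConfig::new(64, 2).with_bias(true).init(device),
            activation: Relu::new(),
        }
    }

    /// A (batch, 25) tensor of features in, a (batch, 2) tensor of raw
    /// logits out; CrossEntropyLoss is applied "with logits" downstream.
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        let x = self.input_layer.forward(x);
        let x = self.activation.forward(x);
        self.output_layer.forward(x)
    }
}
```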

Run 1

| Hyperparameters | Value |
|-----------------|-------|
| epochs | 10 |
| learning_rate | 1e-4 |
| weight_decay | 5e-5 |
| batch_size | 64 |
| num_workers | 4 |
| seed | 42 |
| device | wgpu |
| loss | CrossEntropyLoss |
| optimizer | Adam |
| input_size | 25 |
| hidden_layers | 1 |
| hidden_layer_size | 64 |
| output_size | 2 |
| hidden_layer_activation | Relu |
| output_activation | with logits |
| shuffle_batch | true |
| bias | true |
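Wiring those values into Burn's training loop looks roughly like this. A sketch, assuming a `model`, `train_loader`, `valid_loader`, and `ARTIFACT_DIR` defined elsewhere; builder method names can differ between Burn versions:

```rust
use burn::{
    optim::{decay::WeightDecayConfig, AdamConfig},
    train::{
        metric::{AccuracyMetric, LossMetric},
        LearnerBuilder,
    },
};

// Adam with the weight decay from the table above.
let optim = AdamConfig::new().with_weight_decay(Some(WeightDecayConfig::new(5e-5)));

let learner = LearnerBuilder::new(ARTIFACT_DIR)
    .metric_train_numeric(LossMetric::new())
    .metric_valid_numeric(LossMetric::new())
    .metric_train_numeric(AccuracyMetric::new())
    .metric_valid_numeric(AccuracyMetric::new())
    .num_epochs(10)
    .build(model, optim.init(), 1e-4); // model, optimizer, learning rate

let trained_model = learner.fit(train_loader, valid_loader);
```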

| Epoch | Loss | Accuracy |
|-------|------|----------|
| 0 | 51.0 | 50.0 |
| 1 | 52.0 | 51.0 |
| 2 | 52.0 | 51.0 |

Early stop... no improvement.

Run 2

| Hyperparameters | Value |
|-----------------|-------|
| epochs | 10 |
| learning_rate | 1e-3 |
| weight_decay | 5e-5 |
| batch_size | 256 |
| num_workers | 4 |
| seed | 42 |
| device | wgpu |
| loss | CrossEntropyLoss |
| optimizer | Adam |
| input_size | 25 |
| hidden_layers | 1 |
| hidden_layer_size | 128 |
| output_size | 2 |
| hidden_layer_activation | Relu |
| output_activation | with logits |
| shuffle_batch | true |
| bias | true |

```
Model {
  input_layer: Linear {d_input: 25, d_output: 128, bias: true, params: 3328}
  output_layer: Linear {d_input: 128, d_output: 2, bias: true, params: 258}
  activation: Relu
  params: 3586
}
Total Epochs: 5
```

| Split | Metric | Min. | Epoch | Max. | Epoch |
|-------|--------|------|-------|------|-------|
| Train | CPU Usage | 52.270 | 3 | 56.542 | 1 |
| Train | CPU Memory | 19.332 | 5 | 19.482 | 2 |
| Train | Loss | 0.692 | 5 | 0.692 | 1 |
| Train | Accuracy | 51.839 | 1 | 52.129 | 5 |
| Valid | CPU Usage | 50.215 | 4 | 55.582 | 1 |
| Valid | CPU Memory | 18.967 | 5 | 19.404 | 2 |
| Valid | Loss | 0.693 | 3 | 0.695 | 2 |
| Valid | Accuracy | 51.061 | 1 | 51.369 | 5 |

Run 3

| Hyperparameters | Value |
|-----------------|-------|
| epochs | 10 |
| learning_rate | 1e-5 |
| weight_decay | 5e-5 |
| batch_size | 256 |
| num_workers | 4 |
| seed | 42 |
| device | wgpu |
| loss | CrossEntropyLoss |
| optimizer | Adam |
| input_size | 25 |
| hidden_layers | 1 |
| hidden_layer_size | 256 |
| output_size | 2 |
| hidden_layer_activation | Relu |
| output_activation | with logits |
| shuffle_batch | true |
| bias | true |

```
Model {
  input_layer: Linear {d_input: 25, d_output: 256, bias: true, params: 6656}
  output_layer: Linear {d_input: 256, d_output: 2, bias: true, params: 514}
  activation: Relu
  params: 7170
}
Total Epochs: 3
```

| Split | Metric | Min. | Epoch | Max. | Epoch |
|-------|--------|------|-------|------|-------|
| Train | Loss | 0.692 | 3 | 0.692 | 1 |
| Train | CPU Memory | 19.209 | 3 | 19.498 | 1 |
| Train | Accuracy | 51.713 | 1 | 51.851 | 3 |
| Train | CPU Usage | 53.681 | 1 | 54.714 | 3 |
| Valid | Loss | 0.693 | 1 | 0.693 | 2 |
| Valid | CPU Memory | 19.228 | 2 | 19.481 | 1 |
| Valid | Accuracy | 51.387 | 3 | 51.414 | 1 |
| Valid | CPU Usage | 52.290 | 1 | 53.637 | 3 |

Run 4

Tried taking the log of the volume column and removing the min-max normalization. Spoiler alert: fail.
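A sketch of the tweak and of what I suspect broke it; the NaN loss below is consistent with at least one zero-volume bar hitting `ln(0) = -inf`. The safe variant is a hypothetical fix, not what this run used:

```rust
// This run's feature change: feed log(volume) instead of min-max-scaled volume.
fn log_volume(volume: f32) -> f32 {
    volume.ln() // volume == 0.0 yields -inf, which turns the loss into NaN
}

// A safer alternative for a later run: ln(1 + v) stays finite at v == 0.
fn log_volume_safe(volume: f32) -> f32 {
    volume.ln_1p()
}
```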

| Hyperparameters | Value |
|-----------------|-------|
| epochs | 10 |
| learning_rate | 1e-5 |
| weight_decay | 5e-5 |
| batch_size | 256 |
| num_workers | 4 |
| seed | 42 |
| device | wgpu |
| loss | CrossEntropyLoss |
| optimizer | Adam |
| input_size | 25 |
| hidden_layers | 1 |
| hidden_layer_size | 256 |
| output_size | 2 |
| hidden_layer_activation | Relu |
| output_activation | with logits |
| shuffle_batch | true |
| bias | true |

```
Model {
  input_layer: Linear {d_input: 25, d_output: 256, bias: true, params: 6656}
  output_layer: Linear {d_input: 256, d_output: 2, bias: true, params: 514}
  activation: Relu
  params: 7170
}
Total Epochs: 3
```

| Split | Metric | Min. | Epoch | Max. | Epoch |
|-------|--------|------|-------|------|-------|
| Train | CPU Usage | 54.374 | 3 | 55.855 | 1 |
| Train | Loss | NaN | 1 | NaN | 3 |
| Train | CPU Memory | 19.632 | 1 | 19.952 | 3 |
| Train | Accuracy | 48.180 | 2 | 48.184 | 1 |
| Valid | CPU Usage | 50.668 | 3 | 53.622 | 1 |
| Valid | Loss | NaN | 1 | NaN | 3 |
| Valid | CPU Memory | 19.915 | 2 | 19.958 | 3 |
| Valid | Accuracy | 48.584 | 1 | 48.584 | 3 |

Run 5

Going to add a dropout layer.
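The change is one extra field and one extra call in the forward pass; a sketch that extends the module from earlier:

```rust
use burn::nn::{Dropout, DropoutConfig};

// Added to the Model struct:      dropout: Dropout,
// and initialized in Model::new:  dropout: DropoutConfig::new(0.5).init(),

impl<B: Backend> Model<B> {
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        let x = self.input_layer.forward(x);
        let x = self.activation.forward(x);
        // Zeroes activations with probability 0.5 while training;
        // Burn makes this a no-op once autodiff is disabled for inference.
        let x = self.dropout.forward(x);
        self.output_layer.forward(x)
    }
}
```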

| Hyperparameters | Value |
|-----------------|-------|
| epochs | 10 |
| learning_rate | 1e-5 |
| weight_decay | 5e-5 |
| batch_size | 256 |
| num_workers | 4 |
| seed | 42 |
| device | wgpu |
| loss | CrossEntropyLoss |
| optimizer | Adam |
| input_size | 25 |
| hidden_layers | 1 |
| hidden_layer_size | 256 |
| output_size | 2 |
| hidden_layer_activation | Relu |
| output_activation | with logits |
| shuffle_batch | true |
| bias | true |
| dropout | 0.5 |

```
Model {
  input_layer: Linear {d_input: 25, d_output: 256, bias: true, params: 6656}
  output_layer: Linear {d_input: 256, d_output: 2, bias: true, params: 514}
  dropout: Dropout {prob: 0.5}
  activation: Relu
  params: 7170
}
Total Epochs: 3
```

| Split | Metric | Min. | Epoch | Max. | Epoch |
|-------|--------|------|-------|------|-------|
| Train | Accuracy | 50.903 | 1 | 51.391 | 3 |
| Train | Loss | 0.693 | 3 | 0.694 | 1 |
| Train | CPU Memory | 19.722 | 1 | 19.949 | 2 |
| Train | CPU Usage | 55.119 | 1 | 56.131 | 2 |
| Valid | Accuracy | 51.397 | 3 | 51.417 | 1 |
| Valid | Loss | 0.693 | 1 | 0.693 | 2 |
| Valid | CPU Memory | 19.340 | 3 | 20.021 | 2 |
| Valid | CPU Usage | 52.641 | 2 | 54.060 | 3 |

Run 6

Added 2 more hidden layers.
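The forward pass now chains through the two new 256-wide layers; again a reconstruction, not the repo's exact code:

```rust
impl<B: Backend> Model<B> {
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        // 25 -> 256 -> 256 -> 256 -> 2, with ReLU after every hidden Linear
        let x = self.activation.forward(self.input_layer.forward(x));
        let x = self.activation.forward(self.ln1.forward(x));
        let x = self.activation.forward(self.ln2.forward(x));
        let x = self.dropout.forward(x);
        self.output_layer.forward(x)
    }
}
```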

```
Model {
  input_layer: Linear {d_input: 25, d_output: 256, bias: true, params: 6656}
  ln1: Linear {d_input: 256, d_output: 256, bias: true, params: 65792}
  ln2: Linear {d_input: 256, d_output: 256, bias: true, params: 65792}
  output_layer: Linear {d_input: 256, d_output: 2, bias: true, params: 514}
  dropout: Dropout {prob: 0.5}
  activation: Relu
  params: 138754
}
Total Epochs: 3
```

| Split | Metric | Min. | Epoch | Max. | Epoch |
|-------|--------|------|-------|------|-------|
| Train | Accuracy | 50.975 | 1 | 51.648 | 3 |
| Train | CPU Usage | 51.280 | 2 | 51.635 | 1 |
| Train | Loss | 0.693 | 3 | 0.693 | 1 |
| Train | CPU Memory | 19.638 | 3 | 19.773 | 2 |
| Valid | Accuracy | 51.416 | 1 | 51.416 | 3 |
| Valid | CPU Usage | 48.600 | 3 | 49.028 | 1 |
| Valid | Loss | 0.693 | 1 | 0.693 | 3 |
| Valid | CPU Memory | 19.627 | 2 | 19.733 | 1 |

Run 7

Taking out the bias.
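In Burn that is just the bias flag on each `LinearConfig`; a one-line sketch, assuming the same generic backend `B` and `device` as in the earlier module sketch:

```rust
use burn::nn::{Linear, LinearConfig};

// Each layer becomes y = Wx instead of y = Wx + b.
let input_layer: Linear<B> = LinearConfig::new(25, 256).with_bias(false).init(device);
```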

| Split | Metric | Min. | Epoch | Max. | Epoch |
|-------|--------|------|-------|------|-------|
| Train | CPU Usage | 55.270 | 1 | 61.351 | 2 |
| Train | CPU Memory | 19.492 | 2 | 19.718 | 1 |
| Train | Loss | 0.693 | 3 | 0.693 | 1 |
| Train | Accuracy | 50.980 | 1 | 51.621 | 3 |
| Valid | CPU Usage | 50.155 | 1 | 56.421 | 3 |
| Valid | CPU Memory | 19.342 | 2 | 19.663 | 3 |
| Valid | Loss | 0.693 | 1 | 0.693 | 3 |
| Valid | Accuracy | 51.416 | 1 | 51.416 | 3 |

Run 8

Seems I am stuck at a loss of 0.693. That number is telling: 0.693 ≈ ln(2), which is exactly the cross-entropy of a model that predicts 50/50 on every binary example, so the network is learning nothing at all. My learning rate or initialization is probably off. Going to up my learning rate.

| Hyperparameters | Value |
|-----------------|-------|
| epochs | 10 |
| learning_rate | 5e-1 |
| weight_decay | 5e-5 |
| batch_size | 256 |
| num_workers | 4 |
| seed | 42 |
| device | wgpu |
| loss | CrossEntropyLoss |
| optimizer | Adam |
| input_size | 25 |
| hidden_layers | 3 |
| hidden_layer_size | 256 |
| output_size | 2 |
| hidden_layer_activation | Relu |
| output_activation | with logits |
| shuffle_batch | true |
| bias | true |

```
Model {
  input_layer: Linear {d_input: 25, d_output: 256, bias: true, params: 6656}
  ln1: Linear {d_input: 256, d_output: 256, bias: true, params: 65792}
  ln2: Linear {d_input: 256, d_output: 256, bias: true, params: 65792}
  output_layer: Linear {d_input: 256, d_output: 2, bias: true, params: 514}
  dropout: Dropout {prob: 0.5}
  activation: Relu
  params: 138754
}
Total Epochs: 6
```

| Split | Metric | Min. | Epoch | Max. | Epoch |
|-------|--------|------|-------|------|-------|
| Train | Loss | 0.719 | 2 | NaN | 6 |
| Train | CPU Usage | 54.249 | 6 | 66.030 | 4 |
| Train | Accuracy | 50.465 | 4 | 50.626 | 1 |
| Train | CPU Memory | 19.801 | 2 | 20.048 | 4 |
| Valid | Loss | 0.693 | 4 | NaN | 6 |
| Valid | CPU Usage | 51.482 | 5 | 57.649 | 2 |
| Valid | Accuracy | 48.584 | 1 | 51.416 | 5 |
| Valid | CPU Memory | 19.713 | 1 | 20.040 | 6 |

Run 9

The learning rate increase was fine, but I am still only getting my loss to around 0.7. Let's add more layers.

| Hyperparameters | Value |
|-----------------|-------|
| epochs | 10 |
| learning_rate | 5e-1 |
| weight_decay | 5e-5 |
| batch_size | 256 |
| num_workers | 4 |
| seed | 42 |
| device | wgpu |
| loss | CrossEntropyLoss |
| optimizer | Adam |
| input_size | 25 |
| hidden_layers | 7 |
| hidden_layer_size | 256 |
| output_size | 2 |
| hidden_layer_activation | Relu |
| output_activation | with logits |
| shuffle_batch | true |
| bias | true |

| Split | Metric | Min. | Epoch | Max. | Epoch |
|-------|--------|------|-------|------|-------|
| Train | Accuracy | 51.278 | 2 | 51.310 | 3 |
| Train | CPU Usage | 60.536 | 1 | 61.883 | 2 |
| Train | CPU Memory | 19.877 | 3 | 20.124 | 2 |
| Train | Loss | 0.694 | 2 | 0.698 | 1 |
| Valid | Accuracy | 48.584 | 1 | 51.416 | 3 |
| Valid | CPU Usage | 54.317 | 2 | 57.741 | 3 |
| Valid | CPU Memory | 19.761 | 3 | 20.103 | 1 |
| Valid | Loss | 0.694 | 1 | 0.756 | 2 |

Run 10

Not budging. Going to take out the shuffle, add two more workers, and lower the weight decay.

| Hyperparameters | Value |
|-----------------|-------|
| epochs | 10 |
| learning_rate | 5e-2 |
| weight_decay | 2e-5 |
| batch_size | 256 |
| num_workers | 6 |
| seed | 42 |
| device | wgpu |
| loss | CrossEntropyLoss |
| optimizer | Adam |
| input_size | 25 |
| hidden_layers | 7 |
| hidden_layer_size | 256 |
| output_size | 2 |
| hidden_layer_activation | Relu |
| output_activation | with logits |
| shuffle_batch | false |
| bias | true |

```
Model {
  input_layer: Linear {d_input: 25, d_output: 256, bias: true, params: 6656}
  ln1: Linear {d_input: 256, d_output: 256, bias: true, params: 65792}
  ln2: Linear {d_input: 256, d_output: 256, bias: true, params: 65792}
  ln3: Linear {d_input: 256, d_output: 256, bias: true, params: 65792}
  ln4: Linear {d_input: 256, d_output: 256, bias: true, params: 65792}
  ln5: Linear {d_input: 256, d_output: 256, bias: true, params: 65792}
  ln6: Linear {d_input: 256, d_output: 256, bias: true, params: 65792}
  output_layer: Linear {d_input: 256, d_output: 2, bias: true, params: 514}
  dropout: Dropout {prob: 0.5}
  activation: Relu
  params: 401922
}
Total Epochs: 5
```

| Split | Metric | Min. | Epoch | Max. | Epoch |
|-------|--------|------|-------|------|-------|
| Train | Accuracy | 51.259 | 5 | 51.362 | 1 |
| Train | CPU Memory | 19.808 | 3 | 20.172 | 4 |
| Train | CPU Usage | 74.068 | 1 | 77.307 | 4 |
| Train | Loss | 0.693 | 5 | 0.721 | 1 |
| Valid | Accuracy | 51.416 | 1 | 51.416 | 5 |
| Valid | CPU Memory | 19.788 | 2 | 20.195 | 4 |
| Valid | CPU Usage | 72.220 | 4 | 78.686 | 3 |
| Valid | Loss | 0.693 | 3 | 0.693 | 1 |

Run 11

Going to add gradient clipping and reduce the number of layers, to combat a possible exploding or vanishing gradient problem. The sketch below shows the optimizer change.
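A sketch of the change, assuming Burn's `GradientClippingConfig`; the exact import path and builder method name vary across Burn versions, and the clipping threshold here is an illustrative value, not one from the run:

```rust
use burn::{grad_clipping::GradientClippingConfig, optim::SgdConfig};

// Plain SGD with gradients clipped by L2 norm, so one bad batch
// cannot blow the weights up. The 1.0 threshold is illustrative.
let optim = SgdConfig::new()
    .with_gradient_clipping(Some(GradientClippingConfig::Norm(1.0)))
    .init();
```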

| Hyperparameters | Value |
|-----------------|-------|
| epochs | 10 |
| learning_rate | 1e-2 |
| weight_decay | 5e-5 |
| batch_size | 512 |
| num_workers | 4 |
| seed | 42 |
| device | wgpu |
| loss | CrossEntropyLoss |
| optimizer | SGD |
| input_size | 25 |
| hidden_layers | 2 |
| hidden_layer_size | 512 |
| output_size | 2 |
| hidden_layer_activation | Relu |
| output_activation | with logits |
| shuffle_batch | true |
| bias | true |

```
Model {
  input_layer: Linear {d_input: 25, d_output: 512, bias: true, params: 13312}
  ln1: Linear {d_input: 512, d_output: 512, bias: true, params: 262656}
  output_layer: Linear {d_input: 512, d_output: 2, bias: true, params: 1026}
  dropout: Dropout {prob: 0.5}
  activation: Relu
  params: 276994
}
Total Epochs: 5
```

| Split | Metric | Min. | Epoch | Max. | Epoch |
|-------|--------|------|-------|------|-------|
| Train | CPU Usage | 49.157 | 1 | 51.136 | 4 |
| Train | Accuracy | 51.389 | 5 | 51.486 | 2 |
| Train | CPU Memory | 20.042 | 4 | 20.456 | 3 |
| Train | Loss | 0.693 | 3 | 0.705 | 1 |
| Valid | CPU Usage | 48.439 | 2 | 51.192 | 3 |
| Valid | Accuracy | 51.416 | 1 | 51.416 | 5 |
| Valid | CPU Memory | 19.971 | 4 | 20.622 | 3 |
| Valid | Loss | 0.696 | 3 | 0.701 | 2 |

Run 12

Added two more features: previous bar trend and MACD signal. Result-wise, this was more of the same.
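For reference, here is the MACD signal under its standard definition; a sketch, assuming the feature uses the usual 12/26/9 parameters:

```rust
/// Exponential moving average, seeded with the first value.
/// Panics on an empty series; fine for a sketch.
fn ema(series: &[f32], period: usize) -> Vec<f32> {
    let k = 2.0 / (period as f32 + 1.0);
    let mut prev = series[0];
    series
        .iter()
        .map(|&p| {
            prev = p * k + prev * (1.0 - k);
            prev
        })
        .collect()
}

/// Standard MACD: macd = EMA(12) - EMA(26); signal = EMA(9) of the macd line.
fn macd_signal(closes: &[f32]) -> Vec<f32> {
    let ema12 = ema(closes, 12);
    let ema26 = ema(closes, 26);
    let macd: Vec<f32> = ema12.iter().zip(&ema26).map(|(a, b)| a - b).collect();
    ema(&macd, 9)
}
```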

| Hyperparameters | Value |
|-----------------|-------|
| epochs | 10 |
| learning_rate | 1e-2 |
| weight_decay | 5e-5 |
| batch_size | 512 |
| num_workers | 4 |
| seed | 42 |
| device | wgpu |
| loss | CrossEntropyLoss |
| optimizer | SGD |
| input_size | 27 |
| hidden_layers | 2 |
| hidden_layer_size | 512 |
| output_size | 2 |
| hidden_layer_activation | Relu |
| output_activation | with logits |
| shuffle_batch | true |
| bias | true |

```
Model {
  input_layer: Linear {d_input: 27, d_output: 512, bias: true, params: 14336}
  ln1: Linear {d_input: 512, d_output: 512, bias: true, params: 262656}
  output_layer: Linear {d_input: 512, d_output: 2, bias: true, params: 1026}
  dropout: Dropout {prob: 0.5}
  activation: Relu
  params: 278018
}
Total Epochs: 8
```

| Split | Metric | Min. | Epoch | Max. | Epoch |
|-------|--------|------|-------|------|-------|
| Train | CPU Memory | 20.956 | 7 | 21.549 | 2 |
| Train | CPU Usage | 53.669 | 8 | 58.540 | 3 |
| Train | Loss | 0.693 | 2 | 0.709 | 1 |
| Train | Accuracy | 51.283 | 6 | 51.380 | 4 |
| Valid | CPU Memory | 20.979 | 7 | 21.453 | 2 |
| Valid | CPU Usage | 52.584 | 8 | 59.560 | 6 |
| Valid | Loss | 0.692 | 6 | 0.698 | 3 |
| Valid | Accuracy | 48.155 | 1 | 51.845 | 8 |

Conclusion

Still not moving the needle. This was expected, as predicting stocks is hard. I may need to rework my data, but a model this simple was never likely to be very accurate. Random guessing is about all it can do, and my assumption is that there is not much predictive power in the raw data. So my next step is to do some feature engineering and try to improve the dataset.