Backtesting an algorithmic trading strategy using Machine Learning and Sentiment Analysis.

Renato Votto

Last update: Nov 17, 2022

Overview

Trading Tesla with Machine Learning and Sentiment Analysis

An interactive program that trains a Random Forest Classifier to predict Tesla daily prices from technical indicators and sentiment scores of Twitter posts, then backtests the trading strategy and produces performance metrics.

The project leverages techniques, paradigms and data structures such as:

Functional and Object-Oriented Programming
Machine Learning
Sentiment Analysis
Concurrency and Parallel Processing
Directed Acyclic Graph (DAG)
Data Pipeline
Idempotence

Scope

The aim of this project was to implement the end-to-end backtesting workflow of an algorithmic trading strategy in a program with a clean interface and enough automation that the user can tailor the details of the strategy and the output of the program by entering a minimal amount of data, partly interactively. This makes the program reusable: the same backtest can easily be carried out on a different asset. Furthermore, the modular design should make it easier to adapt the program to different requirements (e.g. different data or ML models).

Strategy Backtesting Results

The Random Forest classifier was trained and optimised with scikit-learn's GridSearchCV. After computing the trading signal predictions and backtesting the strategy, the following performance was recorded:

| Indicator | Performance Indicators Summary |
| --- | --- |
| Return Buy and Hold (%) | 273.94 |
| Return Buy and Hold Ann. (%) | 91.5 |
| Return Trading Strategy (%) | 1555.54 |
| Return Trading Strategy Ann. (%) | 298.53 |
| Sharpe Ratio | 0.85 |
| Hit Ratio (%) | 93.0 |
| Average Trades Profit (%) | 3.99 |
| Average Trades Loss (%) | -1.15 |
| Max Drawdown (%) | -7.69 |
| Days Max Drawdown Recovery | 2 |
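
For readers unfamiliar with these indicators, the sketch below shows one common way to compute a few of them from a series of daily strategy returns. It is an illustration only, not the project's implementation: the function names, the 252-trading-day annualisation, the zero risk-free rate and the sample returns are all assumptions, and the project may define some metrics differently (for example, a hit ratio per trade rather than per day).

```python
import math
import statistics


def total_return_pct(daily_returns):
    """Compound simple daily returns into a total return, in percent."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return (growth - 1.0) * 100.0


def sharpe_ratio(daily_returns, periods_per_year=252):
    """Annualised Sharpe ratio, assuming a zero risk-free rate."""
    mean = statistics.mean(daily_returns)
    stdev = statistics.stdev(daily_returns)
    return mean / stdev * math.sqrt(periods_per_year)


def hit_ratio_pct(returns):
    """Share of non-zero returns that are positive, in percent."""
    non_zero = [r for r in returns if r != 0]
    wins = [r for r in non_zero if r > 0]
    return 100.0 * len(wins) / len(non_zero)


def max_drawdown_pct(daily_returns):
    """Largest peak-to-trough fall of the equity curve, in percent (<= 0)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst * 100.0


# Made-up daily returns, for illustration only.
sample = [0.02, -0.01, 0.03, -0.02, 0.01]
print(round(total_return_pct(sample), 2))
print(round(sharpe_ratio(sample), 2))
print(round(hit_ratio_pct(sample), 1))
print(round(max_drawdown_pct(sample), 2))
```

With only a handful of observations the Sharpe ratio is not meaningful; the sample is there just to show the calls.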

Running the Program

Running the program is straightforward: only a few variables and methods need to be initialised and called.

Let me illustrate it in the steps below:

Provide the variables in download_params, a dictionary containing all the strategy and data downloading details.

download_params = {'ticker' : 'TSLA',
                   'since' : '2010-06-29', 
                   'until' : '2021-06-02',
                   'twitter_scrape_by_account' : {'elonmusk': {'search_keyword' : '',
                                                               'by_hashtag' : False},
                                                  'tesla': {'search_keyword' : '',
                                                            'by_hashtag' : False},
                                                  'WSJ' : {'search_keyword' : 'Tesla',
                                                           'by_hashtag' : False},
                                                  'Reuters' : {'search_keyword' : 'Tesla',
                                                               'by_hashtag' : False},
                                                  'business': {'search_keyword' : 'Tesla',
                                                               'by_hashtag' : False},
                                                  'CNBC': {'search_keyword' : 'Tesla',
                                                           'by_hashtag' : False},
                                                  'FinancialTimes' : {'search_keyword' : 'Tesla',
                                                                      'by_hashtag' : True}},
                   'twitter_scrape_by_most_popular' : {'all_twitter_1': {'search_keyword' : 'Tesla',
                                                                       'max_tweets_per_day' : 30,
                                                                       'by_hashtag' : True}},
                   'language' : 'en'                                      
                   }
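
Because a full download can take hours, it may be worth checking the dictionary before starting. The sketch below is one possible sanity check, not part of the project: `check_download_params` is a hypothetical helper, and the rules it applies (required keys, ISO dates in order, boolean `by_hashtag`) are assumptions inferred from the example above.

```python
from datetime import date


def check_download_params(params):
    """Return a list of problems found in a download_params-style dict."""
    problems = []
    for key in ("ticker", "since", "until"):
        if key not in params:
            problems.append(f"missing key: {key}")
    if not problems:
        # date.fromisoformat raises ValueError on malformed dates.
        since = date.fromisoformat(params["since"])
        until = date.fromisoformat(params["until"])
        if since >= until:
            problems.append("'since' must be earlier than 'until'")
    for account, options in params.get("twitter_scrape_by_account", {}).items():
        if not isinstance(options.get("by_hashtag"), bool):
            problems.append(f"{account}: 'by_hashtag' should be True or False")
    return problems


# A cut-down version of the dictionary above, for illustration.
example_params = {
    "ticker": "TSLA",
    "since": "2010-06-29",
    "until": "2021-06-02",
    "twitter_scrape_by_account": {
        "elonmusk": {"search_keyword": "", "by_hashtag": False},
        "WSJ": {"search_keyword": "Tesla", "by_hashtag": False},
    },
    "language": "en",
}
print(check_download_params(example_params))
```

An empty list means no problems were found by these particular checks.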

Initialise an instance of the Pipeline class:
```
TSLA_data_pipeline = Pipeline()
```
Call the run method on the Pipeline instance:
```
TSLA_pipeline_outputs = TSLA_data_pipeline.run()
```
This returns a dictionary with the outputs of the Pipeline functions, which in this example is assigned to TSLA_pipeline_outputs. It also prints messages about the status and operations of the data downloading and manipulation process.
Retrieve the path to the aggregated data to feed into the Backtest_Strategy class:
```
import glob
data = glob.glob('data/prices_TI_sentiment_scores/*')[0]
```
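
Note that indexing the glob result with `[0]` raises an `IndexError` when the folder is empty, and glob does not guarantee any particular ordering. A slightly more defensive variant could look like the sketch below; `first_data_file` is a hypothetical helper, not something the project provides.

```python
import glob
import os
import tempfile


def first_data_file(folder):
    """Return the alphabetically first file in `folder`, or raise a clear error."""
    matches = sorted(glob.glob(os.path.join(folder, "*")))
    if not matches:
        raise FileNotFoundError(f"no data files found in {folder!r}; run the Pipeline first")
    return matches[0]


# Demonstration in a throwaway folder, so nothing in the repo is touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.csv", "a.csv"):
        open(os.path.join(tmp, name), "w").close()
    print(os.path.basename(first_data_file(tmp)))
```

In the project this would be called as `first_data_file('data/prices_TI_sentiment_scores')`, assuming that folder holds the aggregated file.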
Initialise an instance of the Backtest_Strategy class with the data variable assigned in the previous step.
```
TSLA_backtest_strategy = Backtest_Strategy(data)
```
Call the preprocess_data method on the Backtest_Strategy instance:
```
TSLA_backtest_strategy.preprocess_data()
```
This method will show a summary of the data preprocessing results such as missing values, infinite values and features statistics.

From this point the program becomes interactive, and the user is able to input data, save and delete files related to the training and testing of the Random Forest model, and proceed to display the strategy backtesting summary and graphs.

Call the train_model method on the Backtest_Strategy instance:
```
TSLA_backtest_strategy.train_model()
```
Here you can train the model with scikit-learn's GridSearchCV: create your own parameter grid, and save or delete the files containing the parameter grid and the best set of parameters found.
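
To get a feel for how quickly a grid grows, the sketch below counts the candidate combinations and the number of cross-validation fits an exhaustive search over a single grid would perform (excluding any final refit on the best candidate). The grid values and the 5-fold setting are made up for illustration; only the parameter names are real `RandomForestClassifier` arguments.

```python
from itertools import product

# Hypothetical grid; the keys are real RandomForestClassifier parameters,
# but the values are made up for illustration.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}


def count_fits(grid, cv_folds):
    """Number of model fits an exhaustive grid search performs during cross-validation."""
    combinations = 1
    for values in grid.values():
        combinations *= len(values)
    return combinations * cv_folds


# Every combination an exhaustive search over this grid would try.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

print(len(candidates))
print(count_fits(param_grid, cv_folds=5))
```

Adding one more value to any parameter multiplies the total, so it pays to start with a small grid and widen it gradually.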
Call the test_model method on the Backtest_Strategy instance:
```
TSLA_backtest_strategy.test_model()
```
This method will allow you to test the model by selecting one of the model's best parameters files saved during the training process (or the "default_best_param.json" file created by default by the program, if no other file was saved by the user).

Once the process is complete, it will display the testing summary metrics and graphs.
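
The best-parameter files mentioned above are JSON. As a rough idea of what saving and reloading such a file involves, here is a minimal sketch using only the standard library. The helper names, the file name and the parameter values are hypothetical; the project's own files may be structured differently.

```python
import json
import tempfile
from pathlib import Path


def save_params(params, path):
    """Write a parameter dictionary to `path` as JSON."""
    Path(path).write_text(json.dumps(params, indent=2))


def load_params(path):
    """Read a parameter dictionary back from a JSON file."""
    return json.loads(Path(path).read_text())


# Illustrative values only; these are not the project's tuned parameters.
best_params = {"n_estimators": 200, "max_depth": 5, "criterion": "gini"}

with tempfile.TemporaryDirectory() as tmp:
    params_file = Path(tmp) / "example_best_param.json"  # hypothetical file name
    save_params(best_params, params_file)
    print(load_params(params_file))
```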

If you are satisfied with the testing results, you can display the backtesting summary from here, which is equivalent to calling the next and last method below. In this case, the program will also save a csv file with the data needed to compute the strategy performance metrics.
Call the strategy_performance method on the Backtest_Strategy instance:
```
TSLA_backtest_strategy.strategy_performance()
```
This is the method to display the backtesting summary shown above in this document. Assuming a testing session has been completed and there is a csv file for computing the performance metrics, the program will display the backtesting results straight away using the existing csv file, which in turn is overwritten every time a testing process is completed. Otherwise, it will prompt you to run a training/testing session first.

Tips

If the required data (historical prices and Twitter posts) have already been downloaded, the only long execution time you may encounter is during model training: the larger the parameter grid, the longer the search. I recommend getting familiar with the program by first using the data already provided in the repo (backtesting on Tesla stock).

This is because downloading new data over a period long enough to be reliable for model training will likely take many hours, mainly because of the Twitter scraping process.

That said, please also be aware that as soon as you change the variables in the download_params dictionary and run the Pipeline instance, all the existing data files will be overwritten. This is a deliberate design choice: the program works out on its own which data need to be downloaded according to the parameters passed in download_params.
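
The general idea of an idempotent download step can be sketched as follows: record the parameters of the last run, skip the work when the same request is repeated, and download again (overwriting) when the request changes. This is only an illustration of the concept under assumed names (`needs_download`, `run_download`, a `last_download.json` manifest); it is not how the project itself decides what to download.

```python
import json
import tempfile
from pathlib import Path


def needs_download(params, manifest_path):
    """True when `params` differ from those recorded by the previous run."""
    manifest = Path(manifest_path)
    if not manifest.exists():
        return True
    return json.loads(manifest.read_text()) != params


def run_download(params, manifest_path, download):
    """Call `download(params)` only when required, then record the parameters."""
    if not needs_download(params, manifest_path):
        return False  # same request as last time: nothing to do
    download(params)  # hypothetical callable that overwrites the data files
    Path(manifest_path).write_text(json.dumps(params, sort_keys=True))
    return True


calls = []  # stands in for the real download work
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "last_download.json"  # hypothetical manifest file
    params = {"ticker": "TSLA", "since": "2010-06-29", "until": "2021-06-02"}
    print(run_download(params, manifest, calls.append))   # first run downloads
    print(run_download(params, manifest, calls.append))   # repeat run is a no-op
    changed = dict(params, until="2021-12-31")
    print(run_download(changed, manifest, calls.append))  # changed range downloads again
```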

That's all! Clone the repository and play with it. Any feedback welcome.

Disclaimer

Please be aware that the content and results of this project do not represent financial advice. You should conduct your own research before trading or investing in the markets. Your capital is at risk.

Comments

cross-platform file paths
Instead of this path parsing one-liner...

existing_files = [file.split('/')[2] for file in glob.glob(prices_folder + '*')]

I'm running on Windows, not Mac or Linux, so file paths are returned with "\" and not "/".

I solved this by using "from pathlib import Path" and replacing this one-liner with:

existing_files = [Path(file).name for file in glob.glob(prices_folder + '*')]

I replaced about 20 different file path instances relying on the '/' split. Using pathlib's .name and .parts should solve all of these issues.

folder_name = subdirectory.split('/')[1][:-1]
description = subdirectory.split('/')[2]

For these, I used "from pathlib import PurePath" like so:

folder_name = PurePath(subdirectory).parts[1][:-1]
description = PurePath(subdirectory).parts[2]

Thought you might be interested in changing this, too.
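
As a quick illustration of the difference described in this comment, the sketch below parses the same hypothetical path in both flavours. `PurePosixPath` and `PureWindowsPath` are used so that the example behaves identically on any operating system; the path strings themselves are made up.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical paths shaped like the ones discussed above.
posix_path = "data/prices/TSLA.csv"
windows_path = "data\\prices\\TSLA.csv"

# Splitting on '/' only works for the POSIX-style string.
print(posix_path.split("/")[2])
print(len(windows_path.split("/")))

# The pathlib classes parse each flavour's own separator.
print(PurePosixPath(posix_path).name)
print(PureWindowsPath(windows_path).name)
print(PureWindowsPath(windows_path).parts)
```

On a real system, `Path` and `PurePath` pick the flavour of the machine they run on, which is why they remove the dependency on a hard-coded separator.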

Also, I've trained the model and now I'm testing. The "test" has been running for over 30+ and is still running. I have a powerful computer with 32 GB of RAM. Is this normal?

opened by peteawest

Next day forecast

Hi Renato, thanks for sharing your work, it’s a great concept. I have a question: can you advise me on how I can get the next day’s forecast? I tried to fit the model and predict, but wasn’t able to. I appreciate your help. Thanks.

opened by saeed7733
