Both social media sentiment and stock market data are crucial for stock price prediction

Vishal Singh Parmar

Last update: Oct 29, 2022

Related tags

Machine Learning Relating-Social-Media-to-Stock-Movement-Public

Overview

Relating Social Media to Stock Movements_DA-31st-December

Both social media sentiment and stock market data are crucial for stock price prediction. In this project, we analyze the dynamics of the stock market using both social media news (text data) and stock prices (numerical data).

Understanding the Dataset

The dataset we are working with combines Wallstreetbets (Reddit) news with Standard & Poor's 500 (S&P 500) stock prices from 2013 to 2018.

The news dataset contains the top 25 news items from Reddit for each day from 2013 to 2018.
The S&P 500 data contains the core stock market information for each day, such as Open, Close, and Volume.
The SCORE of the dataset indicates whether the stock price increased (labeled 1) or decreased (labeled 0) on that day.
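As a minimal sketch of how such a label could be derived, the snippet below marks a day as 1 when the close is above the open and 0 otherwise. The comparison rule, the `label_day` helper, and the sample rows are assumptions made for illustration; the project does not state exactly how "increase" is measured.

```python
# Hypothetical daily rows: (date, open, close). Values are made up for illustration.
rows = [
    ("2013-01-02", 100.0, 101.5),
    ("2013-01-03", 101.5, 100.9),
    ("2013-01-04", 100.9, 100.9),
]


def label_day(open_price, close_price):
    """Return 1 if the close is above the open, else 0 (assumed labeling rule)."""
    return 1 if close_price > open_price else 0


# Map each date to its binary label.
labels = {date: label_day(open_price, close_price) for date, open_price, close_price in rows}
print(labels)
```

Under this assumed rule, a day whose close equals its open is labeled 0.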

EDA

Introduction:

The dataset comprises 5698 rows and 8 columns.
The dataset consists of continuous variables of float data type.
The columns 'Open', 'Close', 'High', 'Low', and 'Volume' are the stock variables from the historical dataset; the other columns give the polarity of the news and are derived through sentiment analysis, as described in the Preprocessing and Sentiment Analysis section.

Descriptive Statistics:

Using describe(), we get the following result for the numerical features:

              open         high          low        close        volume
count  5697.000000  5697.000000  5697.000000  5698.000000  5.698000e+03
mean     88.139399    89.012936    87.245609    88.146015  1.718703e+06
std      32.666995    32.960833    32.363413    32.660301  1.248357e+06
min      30.380000    31.090000    29.730000    29.940000  1.000000e+02
25%      64.650000    65.310000    64.053300    64.672500  9.880475e+05
50%      80.750000    81.490000    79.990000    80.750000  1.460298e+06
75%     105.270000   106.270000   104.350000   105.345000  2.135991e+06
max     201.240000   201.240000   198.160000   200.380000  3.378024e+07

Preprocessing and Sentiment Analysis

We filled the NaN values in the three missing topics, then computed the polarity and subjectivity of the news topics. Polarity is of 'float' type and lies in the range [-1, 1], where 1 means a highly positive sentiment and -1 means a highly negative sentiment.

These values are therefore helpful in determining whether the stock market will increase or decrease.

Next, we checked the stock market information for missing values and found it complete. We then merged the sentiment information (polarity) by date with the stock market information (Open, High, Low, Close, Volume, Adj Close) into the merged_data dataframe.

After splitting the data and before modeling, we scaled it using standardization, which shifts the distribution of each feature to have a mean of zero and a standard deviation of one.

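from sklearn.preprocessing import StandardScaler

# Learn the per-feature mean and standard deviation from the training split only
scaler = StandardScaler().fit(X_train)
# Apply the same learned parameters to both the training and validation splits
rescaledX = scaler.transform(X_train)
rescaledValidationX = scaler.transform(X_valid)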

fit() is called on the training data so that the scaler learns the scaling parameters of that data, namely the mean and variance of each feature in the training set. (fit_transform() would learn these parameters and scale the training data in a single call.) The learned parameters are then used to scale the held-out data.

transform() uses the mean and variance calculated from the training data to transform the held-out (validation or test) data. We do not refit the scaler on the held-out data, because that would leak information and bias the evaluation; we want this data to be completely new to the model.
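To make the fit/transform split concrete, here is a small pure-Python sketch of standardization for a single feature. The helper names `fit_scaler` and `transform` and the sample prices are hypothetical; the sketch uses the population standard deviation, which is what StandardScaler computes.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Scale values using parameters that were learned elsewhere."""
    return [(x - mean) / std for x in column]


# Hypothetical open prices for the training and validation splits.
train_open = [80.0, 90.0, 100.0, 110.0, 120.0]
valid_open = [95.0, 130.0]

# Parameters come from the training split only.
mean, std = fit_scaler(train_open)
scaled_train = transform(train_open, mean, std)
# The validation split reuses the training parameters; it is never refit.
scaled_valid = transform(valid_open, mean, std)

print(scaled_train)
print(scaled_valid)
```

The scaled training column has mean zero and unit standard deviation by construction, while the validation column generally does not, which is expected.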

Preprocessing Again

After observing outliers in the polarity of many topics, we decided to concatenate all 14 topics into one paragraph so that we get a single polarity column. We then merged this data with the numerical stock market information again to obtain the merged_data dataframe, and scaled it.
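The concatenation step can be sketched as follows: group the topics by date and join them into one paragraph per day, so that a single polarity score can later be computed for that paragraph. The sample headlines and variable names are hypothetical, and the sentiment scoring itself is left out.

```python
from collections import defaultdict

# Hypothetical (date, topic) pairs; the real data has several topics per day.
news = [
    ("2013-01-02", "Markets open higher"),
    ("2013-01-02", "Tech stocks rally"),
    ("2013-01-03", "Investors cautious ahead of report"),
]

# Collect the topics of each day in their original order.
topics_by_day = defaultdict(list)
for date, topic in news:
    topics_by_day[date].append(topic)

# One paragraph per day, ready to be scored once for polarity.
paragraphs = {date: " ".join(topics) for date, topics in topics_by_day.items()}
print(paragraphs)
```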

Model Building

Metrics considered for Model Evaluation

Accuracy, Precision, Recall, and F1 Score

Accuracy: What proportion of actual positives and negatives is correctly classified?
Precision: What proportion of predicted positives is truly positive?
Recall: What proportion of actual positives is correctly classified?
F1 Score: Harmonic mean of Precision and Recall
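The four metrics above can be computed directly from the confusion-matrix counts. The sketch below is an illustration with made-up labels; the function name `classification_metrics` is hypothetical, and class 1 is treated as the positive class.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For these labels there are 2 true positives, 3 true negatives, 1 false positive, and 2 false negatives, giving an accuracy of 0.625, a precision of 2/3, a recall of 0.5, and an F1 score of 4/7.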

Logistic Regression

Logistic regression models how the probability of an outcome changes with the input features.
The function is defined as P(y) = 1 / (1 + e^-(A + Bx))
Logistic regression involves finding the best-fit S-curve, where A is the intercept and B is the regression coefficient. The output of logistic regression is a probability score.
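As a small illustration of the S-curve, the function below evaluates the logistic formula for a single feature x. The function name `logistic` and the sample values of A and B are hypothetical choices, not fitted parameters from this project.

```python
import math


def logistic(x, a, b):
    """Return P(y) = 1 / (1 + e^-(a + b*x)) for intercept a and coefficient b."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


# With a = 0 and b = 1 the curve is centered at x = 0.
for x in (-4.0, 0.0, 4.0):
    print(x, round(logistic(x, a=0.0, b=1.0), 4))
```

The output always lies between 0 and 1, equals 0.5 where A + Bx = 0, and approaches 0 or 1 as A + Bx moves far from zero, which is why it can be read as a probability score.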

Choosing the features

After choosing the model based on the confusion matrix, we chose the features with the deployment phase in mind.

We know from the EDA that all the features are highly correlated and follow almost the same trend over time. So, along with polarity and subjectivity, we chose the open price, on the assumption that the user knows the open price but not the close price and wants to find out whether the stock price will increase or decrease.

When we applied the logistic regression model with this feature set, accuracy dropped from 80% to 55%. So, we will use both Open and Close and exclude High, Low, Volume, and Adj Close.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00   2563950
           1       0.00      0.00      0.00       968

    accuracy                           1.00   2564918
   macro avg       0.50      0.50      0.50   2564918
weighted avg       1.00      1.00      1.00   2564918
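The report above is consistent with a classifier that predicts class 0 for every sample. The sketch below reproduces the headline numbers from the support counts alone; it is an illustration of why accuracy is misleading here, not a rerun of the project's model.

```python
# Support counts taken from the classification report above.
support = {0: 2563950, 1: 968}
total = sum(support.values())

# A baseline that always predicts the majority class.
majority = max(support, key=support.get)
accuracy = support[majority] / total

# Recall is 1.0 for the majority class and 0.0 for the ignored class.
recall = {label: (1.0 if label == majority else 0.0) for label in support}
macro_recall = sum(recall.values()) / len(recall)

print(f"accuracy     = {accuracy:.4f}")
print(f"macro recall = {macro_recall:.2f}")
```

Accuracy rounds to 1.00 even though no positive sample is ever identified, while the macro-averaged recall of 0.50 exposes the problem.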

Comments

Arrange a pipeline to simplify work and implement different algorithms, also plot visualizations for better understanding of dataset.
Feature description

Hello there, I am Prajwal Waykos, Data Science and ML enthusiast. I also am very keen on the stock market and have 2 years of stock trading experience. You can visit my shared links for more information.

I have the following suggestions.

Re-arrange files and codes so that everything runs in a single click.

Have a better understanding of data by the means of detailed, in-depth EDA and some good Data Visualizations.

Try a variety of other algorithms as discussed in a previous issue and then do a comparative study on them.

Also, as rightly mentioned in the PDF, nothing is certain in the stock market, so we will also need to keep enlarging our datasets while maintaining their quality.

Work on some EDA for better understanding.

Work on Data Preprocessing.

I am a GSSOC '22 Participant and want to work on this project for its betterment.

THE LINK HUB

Email 1 - [email protected] 2- [email protected]

Resume https://drive.google.com/file/d/1XG5AX_hk46MBHciLqpEn3w2j_HqlpOev/view?usp=sharing

LinkedIn https://www.linkedin.com/in/prajwal-waykos-a78105207/

My Website http://prajtech.xyz/

Twitter https://twitter.com/waykos_prajwal

Git Hub https://github.com/Praj-17

Drive Sharable https://drive.google.com/drive/u/2/folders/1WTgYq3rXPaE8_qHWLkPcmdkUkz_-ghWP

Tableau https://public.tableau.com/app/profile/prajwal.waykos

Hacker Rank https://www.hackerrank.com/prajwal_22010591

Phone - 1.- +917249542810 2. - +919405398736

Facebook https://www.facebook.com/prajwal.waykos

Instagram https://www.instagram.com/the_resurrection17/

👀 Have you spent some time to check if this issue has been raised before?

[X] I checked and didn't find similar issue

🏢 Have you read the Code of Conduct?

[X] I agree to follow this project's Code of Conduct

enhancement GSSoc22
opened by Praj-17 4
Issue and PR Template Done
Closes #5 .

Changes :

[x] 🐛 Bug Report Issue template done

[x] 📄 Documentation Issue template done

[x] 💡 Feature Issue template done

[x] PR template done.

Overview of the changes:

3 issue templates and 1 PR template:

1. Bug Report Issue template

2. Documentation Issue template

3. Feature Issue template

4. PR template

[x] I agree to follow this project's Code of Conduct .

Amit Maity - GSSoC'22 Contributor.
GSSoc22 Level1
opened by maityamit 3
Better Model
I am a GSSoC '22 participant. I would like to suggest some changes and work on them:

Implementing GridSearchCV for better tuning and increasing prediction metric scores.

Implementing a Pipeline for better management and streamlining the workflow.

Try to implement other algorithms like SVC and KNN models to check for better prediction and comparison.

I would also like to work on these suggestions. Please assign these issues to me.
opened by Krish2208 2
Model overfit on unbalanced outcomes

The model trained in the notebook LOGISTIC REGRESSION MODEL.ipynb overfits because of imbalanced classes. I can fix the model if this issue is assigned to me.

opened by visheshks04 1
Multiple Classifiers & Ensemble Method & Learning Curve
⚙️Related Issue

The issue was to implement an ensemble method.

Closes: #[10]

📝Describe the changes you've made

I implemented the following classifiers :

Decision Tree

Random Forest

AdaBoost

Bagging Classifier

Voting Classifier, which is the ensemble method

Checklist:

[Y] My code follows the guidelines of this project.

[Y] I have performed a self-review of my own code.

[Y] I have commented my code, particularly wherever it was hard to understand.

[N] My changes generate no new warnings.

[N] I have verified/tested my code by running it locally.

Additional Info (optional)

The data analyzed in this project is stored locally and was not available to me, so I could not test my code on it; I tested it on another dataset instead.
opened by reemabdelrazek30 0
Feature:
Feature description

I want to apply the Random Forests model in this project for better accuracy

bug enhancement
opened by VishalSinghParmar2001 0
[Bug]:
Contact Details

[email protected]

What happened?

A bug happened! Duplicate dates present in the dataset affect the model accuracy. Please assign me this task; I want to work on it.

bug
opened by VishalSinghParmar2001 0
Documentation: Update Readme file
Description

I will update the README file so that contributors get a clearer idea of the project, which will help them resolve issues without confusion.

documentation
opened by AyushJain001 2
Ensemble Method
Feature description

I would like to use Ensemble methods and compare the accuracy afterwards

enhancement
opened by reemabdelrazek30 0
[Bug]: Imbalanced Classes
Contact Details

[email protected]

What happened?

As of now, the model is being trained on unbalanced classes which causes it to overfit the majority class and ignore the other one. I can fix this. @ricardoprins can you assign this to me?

Relevant log output

#y_train[y_train==1].count()
target    1444
dtype: int64

#y_train[y_train==0].count()
target    3845932
dtype: int64

              precision    recall  f1-score   support

           0       1.00      1.00      1.00   2563950
           1       0.00      0.00      0.00       968
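One common way to counter this imbalance is to weight each class inversely to its frequency. The sketch below applies the n_samples / (n_classes * class_count) heuristic, which is the one scikit-learn documents for class_weight="balanced", to the training counts in the log above. It illustrates one possible fix and is not the change that was actually made in this project.

```python
# Training counts taken from the log output above.
counts = {0: 3845932, 1: 1444}

n_samples = sum(counts.values())
n_classes = len(counts)

# Rare classes receive proportionally larger weights.
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    print(f"class {label}: weight = {weight:.4f}")
```

With these counts the minority class gets a weight of roughly 1332 against roughly 0.5 for the majority class. In scikit-learn the same effect can be requested with LogisticRegression(class_weight="balanced"); resampling the training set is another option.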

bug GSSoc22 Level2
opened by visheshks04 5
Finding the best model.
Feature description

Full Name: Suvodeep Das
GitHub Profile Link: https://github.com/Suvodeep-Das
Objective: Creating a model and testing accuracy using different algorithms to find out the best one for the model.
I am a GSSoC'22 participant and would like to work on this issue.

enhancement GSSoc22 Level2
opened by Suvodeep-Das 1

Owner

Vishal Singh Parmar

I am Vishal Singh Parmar, I have been pursuing B.Tech in Computer Science Engineering from Shivaji Rao Kadam Institute of Technology,

