Both social media sentiment and stock market data are crucial for stock price prediction

Overview

Relating Social Media to Stock Movements_DA-31st-December

Both social media sentiment and stock market data are crucial for stock price prediction. In this project, we analyze stock market dynamics using both social media news (text data) and stock prices (numerical data).

Understanding the Dataset

The dataset we are working on combines Wallstreetbets-Reddit news with Standard & Poor's 500 (S&P 500) stock prices from 2013 to 2018.

  • The news dataset contains the top 25 news items from Reddit for each day from 2013 to 2018.

  • The S&P 500 dataset contains the core stock market information for each day, such as Open, Close, and Volume.

  • The SCORE (label) of the dataset indicates whether the stock price increased (labeled as 1) or decreased (labeled as 0) on that day.
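As a minimal sketch of how such a label could be derived, the snippet below marks a day as 1 when the close is above the open. The sample rows are made up, and the rule "Close strictly greater than Open" is an assumption; the project may define an increase differently (for example, against the previous day's close).

```python
import pandas as pd

# Hypothetical sample rows; only the Open and Close column names come from
# the dataset description above.
prices = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04"]),
        "Open": [100.0, 102.0, 101.0],
        "Close": [102.0, 101.5, 101.0],
    }
)

# Assumption: "increase" means Close is strictly above Open on the same day.
# Flat days are labeled 0 under this rule.
prices["label"] = (prices["Close"] > prices["Open"]).astype(int)
labels = prices["label"].tolist()
print(labels)
```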

EDA

Introduction:

  • The dataset comprises 5698 rows and 8 columns.
  • All variables are continuous and of float data type.
  • The columns 'Open', 'Close', 'High', 'Low', and 'Volume' are the stock variables from the historical dataset; the other variables hold the polarity of the news and are derived using sentiment analysis, as described in the Preprocessing and Sentiment Analysis section.

Descriptive Statistics:

Using describe(), we get the following result for the numerical features:

              open         high          low        close        volume
count  5697.000000  5697.000000  5697.000000  5698.000000  5.698000e+03
mean     88.139399    89.012936    87.245609    88.146015  1.718703e+06
std      32.666995    32.960833    32.363413    32.660301  1.248357e+06
min      30.380000    31.090000    29.730000    29.940000  1.000000e+02
25%      64.650000    65.310000    64.053300    64.672500  9.880475e+05
50%      80.750000    81.490000    79.990000    80.750000  1.460298e+06
75%     105.270000   106.270000   104.350000   105.345000  2.135991e+06
max     201.240000   201.240000   198.160000   200.380000  3.378024e+07

Preprocessing and Sentiment Analysis

We filled the NaN values in the three topics that had missing entries, then computed the polarity and subjectivity of the news topics. Polarity is of 'float' type and lies in the range [-1, 1], where 1 means a highly positive sentiment and -1 means a highly negative sentiment.

These scores should therefore help in determining whether the stock market will increase or decrease.
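To make the [-1, 1] polarity range concrete, here is a toy scorer. It is not the sentiment tool used in this project (the README does not name it); the word lists and the `toy_polarity` function are hypothetical and exist only to illustrate how a score of 1, -1, or 0 can arise.

```python
# Tiny hand-made word lists, purely for illustration (not a real lexicon).
POSITIVE = {"gain", "gains", "rally", "surge", "beat", "strong"}
NEGATIVE = {"loss", "losses", "crash", "plunge", "miss", "weak"}


def toy_polarity(text: str) -> float:
    """Return (positives - negatives) / matched words, a value in [-1, 1]."""
    words = [w.strip(".,!?'\"").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    # Text with no sentiment-bearing words is treated as neutral.
    return 0.0 if matched == 0 else (pos - neg) / matched


print(toy_polarity("Stocks rally on strong earnings"))
print(toy_polarity("Markets crash after weak report"))
print(toy_polarity("Fed meeting scheduled for Tuesday"))
```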

We then checked the stock market information for missing values and found it complete. Next, we merged the sentiment information (polarity) by date with the stock market information (Open, High, Low, Close, Volume, Adj Close) into the merged_data dataframe.
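The merge by date could look roughly like the sketch below. The sample frames and the `Date` column name are assumptions, as is the inner join; the README does not say which join type was used.

```python
import pandas as pd

# Hypothetical sentiment and stock tables sharing a Date column.
sentiment = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-07"]),
        "polarity": [0.12, -0.05, 0.30],
    }
)
stocks = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04"]),
        "Open": [100.0, 102.0, 101.0],
        "Close": [102.0, 101.5, 101.0],
    }
)

# Inner join keeps only dates present in both tables.
# validate="one_to_one" raises an error if either side has duplicate dates.
merged_data = pd.merge(
    sentiment, stocks, on="Date", how="inner", validate="one_to_one"
)
print(merged_data)
```

The `validate` argument is optional, but it is a cheap guard against silently duplicated rows when a date appears more than once in either table.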

After splitting and before modeling, we scaled the data using standardization, so that each feature has a mean of zero and a standard deviation of one.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
rescaledValidationX = scaler.transform(X_valid)

fit() is called on the training data so that the scaler learns the scaling parameters, that is, the mean and variance of each feature in the training set. (Calling fit_transform() on the training data would learn the same parameters and scale it in a single step.) These learned parameters are then used to scale the validation and test data.

transform() reuses the mean and variance calculated from the training data to transform the validation and test data. We do this because we do not want to bias the model: the held-out data should stay completely new and a surprise set for it.
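What the scaler does can be written out by hand. This is only a sketch with made-up arrays; it reproduces standardization with NumPy to show that the mean and standard deviation come from the training split alone and are then reused on the held-out split.

```python
import numpy as np

# Hypothetical feature matrices: two features, three training rows,
# one validation row.
X_train = np.array([[100.0, 0.10], [110.0, -0.20], [120.0, 0.40]])
X_valid = np.array([[130.0, 0.00]])

# Learn the scaling parameters from the training split only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same parameters to both splits.
train_scaled = (X_train - mean) / std
valid_scaled = (X_valid - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(valid_scaled)
```

The scaled training data has zero mean and unit standard deviation per feature; the validation row generally does not, which is expected because it was never used to compute the parameters.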

Preprocessing Again

After observing outliers in the polarity of many topics, we decided to concatenate all 14 topics into one paragraph so that we get a single polarity column. We then merged this data with the numerical stock market information again to get the merged_data dataframe, and scaled it.
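Concatenating the topic columns into one paragraph per day could be done along these lines. The column names `Top1` and `Top2` and the sample headlines are hypothetical; the real dataset has 14 topic columns.

```python
import pandas as pd

# Hypothetical news table with two topic columns and one missing topic.
news = pd.DataFrame(
    {
        "Date": ["2013-01-02", "2013-01-03"],
        "Top1": ["Markets open higher", "Tech shares slide"],
        "Top2": ["Oil prices steady", None],
    }
)

topic_cols = ["Top1", "Top2"]

# Replace missing topics with empty strings so the join never sees NaN,
# then join each row's topics into a single paragraph.
news["combined"] = (
    news[topic_cols]
    .fillna("")
    .apply(lambda row: " ".join(row).strip(), axis=1)
)
print(news["combined"].tolist())
```

A single polarity score can then be computed on the `combined` column instead of one score per topic.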

Model Building

Metrics considered for Model Evaluation

Accuracy, Precision, Recall, and F1 Score

  • Accuracy: What proportion of all samples (actual positives and negatives) is correctly classified?
  • Precision: What proportion of predicted positives are truly positive?
  • Recall: What proportion of actual positives is correctly classified?
  • F1 Score: Harmonic mean of Precision and Recall
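The four metrics above can be computed directly from the confusion-matrix counts. The helper `classification_metrics` and the label lists below are illustrative only and are not taken from the project's notebooks.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Looking at precision and recall alongside accuracy matters when classes are imbalanced: a model that always predicts the majority class can reach high accuracy while its recall on the minority class is zero.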

Logistic Regression

  • Logistic regression models how the probability of an outcome changes with the input features.
  • The function is defined as P(y) = 1 / (1 + e^(-(A + Bx)))
  • Logistic regression involves finding the best-fit S-curve, where A is the intercept and B is the regression coefficient. The output of logistic regression is a probability score.
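The logistic function itself is short enough to write out. In this sketch, `A` and `B` follow the intercept and coefficient named in the list above, and the input values are arbitrary examples chosen to show the S-curve saturating toward 0 and 1.

```python
import math


def logistic(x: float, A: float, B: float) -> float:
    """P(y) = 1 / (1 + e^(-(A + B*x))), with intercept A and coefficient B."""
    return 1.0 / (1.0 + math.exp(-(A + B * x)))


# With A = 0 and B = 1, the curve is centered at x = 0.
for x in (-10.0, 0.0, 10.0):
    print(x, logistic(x, A=0.0, B=1.0))
```

A classifier typically turns this probability into a 0/1 prediction by comparing it with a threshold such as 0.5.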

Choosing the features

After choosing the model based on the confusion matrix, we chose the features with the deployment phase in mind.

We know from the EDA that all the stock features are highly correlated and follow almost the same trend over time. So, along with polarity and subjectivity, we chose the open price, on the assumption that the user knows the open price but not the close price and wants to find out whether the stock price will increase or decrease.

When we applied the logistic regression model with this feature set, accuracy dropped from 80% to 55%. So we will use both Open and Close and exclude High, Low, Volume, and Adj Close.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00   2563950
           1       0.00      0.00      0.00       968

    accuracy                           1.00   2564918
   macro avg       0.50      0.50      0.50   2564918
weighted avg       1.00      1.00      1.00   2564918






