TLA - Twitter Linguistic Analysis

Tool for linguistic analysis of communities

TLA is built using PyTorch, Transformers and several other State-of-the-Art machine learning techniques and it aims to expedite and structure the cumbersome process of collecting, labeling, and analyzing data from Twitter for a corpus of languages while providing detailed labeled datasets for all the languages. The analysis provided by TLA will also go a long way in understanding the sentiments of different linguistic communities and come up with new and innovative solutions for their problems based on the analysis. List of languages our library provides support for are listed as follows:

Language	Code	Language	Code
English	en	Hindi	hi
Swedish	sv	Thai	th
Dutch	nl	Japanese	ja
Turkish	tr	Urdu	ur
Indonesian	id	Portuguese	pt
French	fr	Chinese	zn-ch
Spanish	es	Persian	fa
Romainain	ro	Russian	ru

Features

Provides 16 labeled Datasets for different languages for analysis.
Implements Bert based architecture to identify languages.
Provides Functionalities to Extract,process and label tweets from twitter.
Provides a Random Forest classifier to implement sentiment analysis on any string.

Installation :

pip install --upgrade https://github.com/tusharsarkar3/TLA.git

Overview

Extract data

from TLA.Data.get_data import store_data
store_data('en',False)

This will extract and store the unlabeled data in a new directory inside data named datasets.

Label data

from TLA.Datasets.get_lang_data import language_data
df = language_data('en')
print(df)

This will print the labeled data that we have already collected.

Classify languages

Training

Training can be done in the following way:

from TLA.Lang_Classify.train import train_lang
train_lang(path_to_dataset,epochs)

Prediction

Inference is done in the following way:

from TLA.Lang_Classify.predict import predict
model = get_model(path_to_weights)
preds = predict(dataframe_to_be_used,model)

Analyse

Training

Training can be done in the following way:

from TLA.Analyse.train_rf import train_rf
train_rf(path_to_dataset)

This will store all the vectorizers and models in a seperate directory named saved_rf and saved_vec and they are present inside Analysis directory. Further instructions for training multiple languages is given in the next section which shows how to run the commands using CLI

Final Analysis

Analysis is done in the following way:

from TLA.Analysis.analyse import analyse_data 
analyse_data(path_to_weights)

This will store the final analysis as .csv inside a new directory named analysis.

Overview with Git

Installation another method

git clone https://github.com/tusharsarkar3/TLA.git

Extract data

Navigate to the required directory

cd Data

Run the following command:

python get_data.py --lang en --process True

Lang flag is used to input the language of the dataset that is required and process flag shows where pre-processing should be done before returning the data. Give the following codes in the lang flag wrt the required language:

Loading Dataset

To load a dataset run the following command in python.

df= pd.read_csv("TLA/TLA/Datasets/get_data_en.csv")

The command will return a dataframe consisting of the data for the specific language requested.

In the phrase get_data_en, en can be sunstituted by the desired language code to load the dataframe for the specific language.

Pre-Processing

To preprocess a given string run the following command.

In your terminal use code

cd Data

then run the command in python

from TLA.Data import Pre_Process_Tweets

df=Pre_Process_Tweets.pre_process_tweet(df)

Here the function pre_process_tweet takes an input as a dataframe of tweets and returns an output of a dataframe with the list of preprocessed words for a particular tweet next to the tweet in the dataframe.

Analysis

Training

To train a random forest classifier for the purpose of sentiment analysis run the following command in your terminal.

cd Analysis

then

python train.rf --path "path to your datafile" --train_all_datasets False

here the --path flag represents the path to the required dataset you want to train the Random Forest Classifier on the --train_all_datasets flag is a boolean which can be used to train the model on multiple datasets at once.

The output is a file with the a .pkl file extention saved in the folder at location "TLA\Analysis\saved_rf{}.pkl" The output for vectorization of is stored in a .pkl file in the directory "TLA\Analysis\saved_vec{}.pkl"

Get Sentiment

To get the sentiment of any string use the following code.

In your terminal type

cd Analysis

then in your terminal type

python get_sentiment.py --prediction "Your string for prediction to be made upon" --lang "en"

here the --prediction flag collects the string for which you want to get the sentiment for. the --lang represents the language code representing the language you typed your string in.

The output is a sentiment which is either positive or negative depending on your string.

Statistics

To get a comprehensive statistic on sentiment of datasets run the following command.

In your terminal type

cd Analysis

then

python analyse.py

This will give you an output of a table1.csv file at the location 'TLA\Analysis\analysis\table1.csv' comprising of statistics relating to the percentage of positive or negative tweets for a given language dataset.

It will also give a table2.csv file at 'TLA\Analysis\analysis\table2.csv' comprising of statistics for all languages combined.

Language Classification

Training

To train a model for language classfication on a given dataset run the following commands.

In your terminal run

cd Lang_Classify

then run

python train.py --data "path for your dataset" --model "path to weights if pretrained" --epochs 4

The --data flag requires the path to your training dataset.

The --model flag requires the path to the model you want to implement

The --epoch flag represents the epochs you want to train your model for.

The output is a file with a .pt extention named saved_wieghts_full.pt where your trained wieghst are stored.

Prediction

To make prediction on any given string Us ethe following code.

In your terminal type

cd Lang_Classify

then run the code

python predict.py --predict "Text/DataFrame for language to predicted" --weights " Path for the stored weights of your model "

The --predict flag requires the string you want to get the language for.

The --wieghts flag is the path for the stored wieghts you want to run your model on to make predictions.

The outputs is the language your string was typed in.

Results:

Performance of TLA ( Loss vs epochs)

Language	Total tweets	Positive Tweets Percentage	Negative Tweets Percentage
English	500	66.8	33.2
Spanish	500	61.4	38.6
Persian	50	52	48
French	500	53	47
Hindi	500	62	38
Indonesian	500	63.4	36.6
Japanese	500	85.6	14.4
Dutch	500	84.2	15.8
Portuguese	500	61.2	38.8
Romainain	457	85.55	14.44
Russian	213	62.91	37.08
Swedish	420	80.23	19.76
Thai	424	71.46	28.53
Turkish	500	67.8	32.2
Urdu	42	69.04	30.95
Chinese	500	80.6	19.4

Reference:

@misc{sarkar2021tla,
     title={TLA: Twitter Linguistic Analysis}, 
     author={Tushar Sarkar and Nishant Rajadhyaksha},
     year={2021},
     eprint={2107.09710},
     archivePrefix={arXiv},
     primaryClass={cs.CL}
}

@misc{640cba8b-35cb-475e-ab04-62d079b74d13,
 title = {TLA: Twitter Linguistic Analysis},
 author = {Tushar Sarkar and Nishant Rajadhyaksha},
  journal = {Software Impacts},
 doi = {10.24433/CO.6464530.v1}, 
 howpublished = {\url{https://www.codeocean.com/}},
 year = 2021,
 month = {6},
 version = {v1}
}

Features to be added :

Access to more language
Creating GUI based system for better accesibility
Improving performance of the baseline model

Developed by Tushar Sarkar and Nishant Rajadhyaksha

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet 🐦 🇮🇩 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

40 Nov 30, 2022

A number of methods in order to perform Natural Language Processing on live data derived from Twitter

1 Nov 24, 2021

This project uses unsupervised machine learning to identify correlations between daily inoculation rates in the USA and twitter sentiment in regards to COVID-19.

4 Oct 15, 2022

An attempt to map the areas with active conflict in Ukraine using open source twitter data.

Live Action Map (LAM) An attempt to use open source data on Twitter to map areas with active conflict. Right now it is used for the Ukraine-Russia con

171 Nov 21, 2022

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group

8.4k Dec 30, 2022

Python bindings to the dutch NLP tool Frog (pos tagger, lemmatiser, NER tagger, morphological analysis, shallow parser, dependency parser)

Frog for Python This is a Python binding to the Natural Language Processing suite Frog. Frog is intended for Dutch and performs part-of-speech tagging

46 Dec 14, 2022

Various capabilities for static malware analysis.

Malchive The malchive serves as a compendium for a variety of capabilities mainly pertaining to malware analysis, such as scripts supporting day to da

64 Nov 22, 2022

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

TextBlob: Simplified Text Processing Homepage: https://textblob.readthedocs.io/ TextBlob is a Python (2 and 3) library for processing textual data. It

8.4k Dec 26, 2022

An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)

VizSeq is a Python toolkit for visual analysis on text generation tasks like machine translation, summarization, image captioning, speech translation

409 Oct 28, 2022

TLA - Twitter Linguistic Analysis

Related tags

Overview

TLA - Twitter Linguistic Analysis

Tool for linguistic analysis of communities

Features

Installation :

Overview

Overview with Git

Results:

Reference:

Features to be added :

Developed by Tushar Sarkar and Nishant Rajadhyaksha

You might also like...

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

A number of methods in order to perform Natural Language Processing on live data derived from Twitter

This project uses unsupervised machine learning to identify correlations between daily inoculation rates in the USA and twitter sentiment in regards to COVID-19.

An attempt to map the areas with active conflict in Ukraine using open source twitter data.

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Python bindings to the dutch NLP tool Frog (pos tagger, lemmatiser, NER tagger, morphological analysis, shallow parser, dependency parser)

Various capabilities for static malware analysis.

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)

Owner

Tushar Sarkar

Twitter-NLP-Analysis - Twitter Natural Language Processing Analysis

Twitter bot that uses NLP models to summarize news articles referenced in a user's twitter timeline

Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources (NAACL-2021).

This repository contains Python scripts for extracting linguistic features from Filipino texts.

This code is the implementation of Text Emotion Recognition (TER) with linguistic features

Twitter Sentiment Analysis using #tag, words and username

VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

Download videos from YouTube/Twitch/Twitter right in the Windows Explorer, without installing any shady shareware apps