This is the offline-training-pipeline for our project.

Last update: Apr 22, 2022


Overview

offline-training-pipeline


The system follows an offline-training, online-prediction design: the model is trained and updated offline, and the trained model is then used for prediction online.

We took the pre-trained DistilBERT language model and fine-tuned it for the downstream fake news classification task.
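The split between offline training and online prediction can be sketched as follows. This is a minimal illustration, not the project's actual code: a toy word-count classifier stands in for the fine-tuned DistilBERT model, and the names `train_offline`, `OnlinePredictor`, `model.json`, and the example texts are all hypothetical. The point is only that the offline step writes a model artifact, and the online step loads that artifact and never touches the training data.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def train_offline(examples, artifact_path):
    """Offline step: fit a toy word-count model and save it as an artifact."""
    counts = {"fake": Counter(), "real": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    # Persist the learned state so the online side never needs the training data.
    Path(artifact_path).write_text(json.dumps(counts))


class OnlinePredictor:
    """Online step: load a saved artifact once, then answer prediction requests."""

    def __init__(self, artifact_path):
        self.counts = json.loads(Path(artifact_path).read_text())

    def predict(self, text):
        words = text.lower().split()
        # Score each label by how often its training texts contained these words.
        scores = {
            label: sum(word_counts.get(w, 0) for w in words)
            for label, word_counts in self.counts.items()
        }
        return max(scores, key=scores.get)


# Hypothetical labelled examples, for illustration only.
train_examples = [
    ("miracle cure eliminates virus overnight", "fake"),
    ("secret miracle cure hidden from public", "fake"),
    ("health ministry reports new vaccination figures", "real"),
    ("ministry publishes weekly case figures", "real"),
]

artifact = Path(tempfile.mkdtemp()) / "model.json"
train_offline(train_examples, artifact)  # would run on a schedule, offline
predictor = OnlinePredictor(artifact)    # would be loaded by the serving process
prediction = predictor.predict("new miracle cure announced")
print(prediction)
```

In a setup like this, retraining only has to replace the artifact file; the online side picks up the new model the next time it loads it.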

The initial fine-tuning dataset is taken from the CONSTRAINT workshop at AAAI-21. For routine offline training and updates in the future, we will adopt FakeNewsNet, a data repository with news content, social context, and spatiotemporal information for studying fake news on social media. FakeNewsNet offers up-to-date datasets and is updated on a regular basis. By retraining our model offline, we hope to track the latest trends in popular fake news, as well as broader fake news topics, and to achieve better performance in online prediction.

References:

@misc{patwa2020fighting, title={Fighting an Infodemic: COVID-19 Fake News Dataset}, author={Parth Patwa and Shivam Sharma and Srinivas PYKL and Vineeth Guptha and Gitanjali Kumari and Md Shad Akhtar and Asif Ekbal and Amitava Das and Tanmoy Chakraborty}, year={2020}, eprint={2011.03327}, archivePrefix={arXiv}, primaryClass={cs.CL} }

@article{sanh2019distilbert, title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter}, author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas}, journal={arXiv preprint arXiv:1910.01108}, year={2019} }

@article{shu2020fakenewsnet, title={Fakenewsnet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media}, author={Shu, Kai and Mahudeswaran, Deepak and Wang, Suhang and Lee, Dongwon and Liu, Huan}, journal={Big data}, volume={8}, number={3}, pages={171--188}, year={2020}, publisher={Mary Ann Liebert, Inc., publishers 140 Huguenot Street, 3rd Floor New~…} }
