NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework

Overview

NLP From Scratch Without Large-Scale Pretraining

This repository contains the code, pre-trained model checkpoints and curated datasets for our paper: NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework.

In our proposed framework, named TLM (Task-driven Language Modeling), instead of training a language model over the entire general corpus and then finetuning it on task data, we first use task data as queries to retrieve a tiny subset of the general corpus, and then perform joint learning on both the task objective and the self-supervised language modeling objective.
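To make the retrieval step concrete, the following is a minimal sketch of the idea in Python using the rank_bm25 package; the repository's own data_selection.py performs this step with ElasticSearch, and the corpus, queries, and top-k value below are placeholders.

    # Sketch only: each task example acts as a BM25 query over the general
    # corpus, and the union of the top-k neighbours becomes the training subset.
    from rank_bm25 import BM25Okapi

    general_corpus = ["first general-domain document ...",
                      "second general-domain document ..."]    # placeholder corpus
    task_examples = ["a sentence from the task training set"]  # placeholder queries

    bm25 = BM25Okapi([doc.split() for doc in general_corpus])
    selected = set()
    for query in task_examples:
        for doc in bm25.get_top_n(query.split(), general_corpus, n=50):
            selected.add(doc)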

Requirements

We implement our models and training loops on top of the open-source libraries from HuggingFace. The core dependencies of this repository are listed in requirements.txt and can be installed through:

pip install -r requirements.txt

All our experiments are conducted on a node with 8 A100 40GB SXM GPUs. Different computational devices may yield results slightly different from the reported ones.

Models and Datasets

We release the trained models on 8 tasks at 3 different scales, together with the task datasets and the selected external data. Our released model checkpoints, datasets, and the performance of each model on each task are listed in the following table.

Scale    AGNews  Hyp.   Help.  IMDB   ACL.   SciERC  Chem.  RCT
Small    93.74   93.53  70.54  93.08  69.84  80.51   81.99  86.99
Medium   93.96   94.05  70.90  93.97  72.37  81.88   83.24  87.28
Large    94.36   95.16  72.49  95.77  72.19  83.29   85.12  87.50

The released models and datasets are compatible with HuggingFace's Transformers and Datasets. We provide an example script to evaluate a model checkpoint on a given task. Run

bash example_scripts/evaluate.sh

to get the evaluation results for SciERC with a small-scale model.
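Because the released checkpoints are standard Transformers models, they can also be loaded directly from Python. The snippet below is a minimal sketch; the model identifier is a placeholder, and the exact checkpoint names are those referenced in example_scripts/evaluate.sh.

    # Sketch only: load a released checkpoint and classify one sentence.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_name = "path-or-hub-id-of-a-released-tlm-checkpoint"  # placeholder
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    inputs = tokenizer("An example SciERC sentence.", return_tensors="pt")
    prediction = model(**inputs).logits.argmax(dim=-1).item()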

Training

We provide two example scripts to train a model from scratch. Run

bash example_scripts/train.sh && bash example_scripts/finetune.sh

to train a small-scale model for SciERC. Here example_scripts/train.sh corresponds to the first training stage, where the external data ratio and the MLM weight are non-zero, and example_scripts/finetune.sh corresponds to the second training stage, where the model sees no external data and no self-supervised loss.
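To illustrate how the two stages differ, here is a minimal sketch of the per-step objective, under the assumption that the joint loss is a weighted sum of the task loss and the MLM loss; task_loss and mlm_loss are hypothetical helpers, not functions from this repository.

    # Sketch only: stage one uses a non-zero mlm_weight and batches that mix
    # task data with retrieved external data; stage two sets mlm_weight to 0
    # and feeds task data only.
    def training_step(model, batch, mlm_weight):
        loss = task_loss(model, batch["task_inputs"], batch["labels"])      # hypothetical helper
        if mlm_weight > 0:
            loss = loss + mlm_weight * mlm_loss(model, batch["lm_inputs"])  # hypothetical helper
        return loss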

Citation

Please cite our paper if you use TLM in your work:

@misc{yao2021tlm,
    title={NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework},
    author={Yao, Xingcheng and Zheng, Yanan and Yang, Xiaocong and Yang, Zhilin},
    year={2021}
}
Comments
  • Trying to execute the data_selection script

    Hi yaoxingcheng, your work is great. I tried to execute your data_selection script with your example source and target data. After the execution finished, a selected data file was created. It has three columns: "text", "id", and "rank".

    However, when I accessed your released dataset, I found that it has one different column (the "label" column).

    I wonder if the "rank" column created by your original script is similar to the “label” column in your released datasets. Sincerely, baominhlt

    opened by baominhlt 4
  • I don't notice the task-driven function anywhere.

    Hi yaoxingcheng,

    Thanks for releasing the code and paper for your task-driven language modeling approach.

    I tried running your code, but I don't notice the task-driven functionality anywhere. Can you explain a little more about the difference between task-driven and traditional LM training? Thank you very much.

    Best, ChinhH

    opened by Huynh-Chinh 4
  • What GPUs did you use for training TLM?

    Hello,

    Great work! I am quite interested in your work.

    I would like to know what kind of GPUs you used for training TLM. From Table 1, I see it was 8 GPUs for 42 hours. Are they 8 NVIDIA V100 GPUs with 32 GB, or something else?

    Looking forward to your answer.

    Thanks in advance.

    opened by hitchhicker 2
  • How to make a target dataset from a source database?

    Hi yaoxingcheng,

    Thanks for releasing the code and paper for your task-driven language modeling approach. I have a basic question about dataset construction: I want to train a new model, specifically RoBERTa, on my dataset. I already have my data as "my_source.csv", but I don't know how to build "my_target.csv", and I don't understand how to construct such data yet. Can you help me solve this problem?

    Best, ChinhH

    opened by Huynh-Chinh 1
  • How can we create selected.csv?

    I have installed your requirements. However, when I ran your data selection code with your example source and target data, the selected.csv was created but it was empty. Besides, can you explain to me what the index name is (in line 144 of data_selection.py)?

    opened by baominhlt 0
  • fix variables referenced before assignment

    The current code raises UnboundLocalError: local variable 'tr_loss' referenced before assignment; the same happens for the variable loss_step. This PR fixes the bug by assigning the two variables before they are referenced (a sketch of such a fix appears after the comment list below).

    opened by zmzhang2000 0
  • “AttributeError: No huggingface_hub attribute hf_api”

    Hi, Xingcheng! When I tried to run "bash example_scripts/evaluate.sh" on Python 3.9.15, I got the error “AttributeError: No huggingface_hub attribute hf_api”. I was wondering if this is caused by a version mismatch among the related packages. I then updated both transformers (4.25.1) and huggingface_hub (0.11.1), but the problem is still unresolved. Could you help me with this problem? Thank you!

    opened by Smurflyiaa 0
  • add web demo/models/datasets to ICML organization on Hugging Face

    Hi, I see you have already submitted models for this paper on Hugging Face at https://huggingface.co/yxchar; congrats on the acceptance at ICML. We are holding an event on Hugging Face for ICML 2022, where you can submit Spaces (web demos), models, and datasets for papers for a chance to win prizes. For the existing models, you can clone them and push them to the ICML 2022 organization here: https://huggingface.co/ICML2022, after joining the organization using this link: https://huggingface.co/organizations/ICML2022/share/BpynfJtfsOTktlmXYoKNqqCnyufKLFXuay. Let me know if you need any help with the above steps. Thanks!

    opened by AK391 0
  • "the client failed to establish a connection"

    Hi Xingcheng, I tried to use the example script you provided, that is, "curl -H ... ...ngrok.io/search", but it returned something like "Failed to complete tunnel connection". Could you give me some help? Thanks!!!

    opened by Jesse1eung 1
  • How to define small-scale

    Hello! Can I ask what the difference is between small-scale and medium-scale? Is it different because of the different values of k? I also get F1 = 0.7687 instead of 0.8051 on small-scale SciERC. Why would that happen? Btw, this is a great framework! Thank you!

    opened by ocmykr2 3
  • About data_selection

    Thanks for your work; it has been very inspiring to me. However, I have some questions about ElasticSearch: no matter how many entries my target.csv contains (and source.csv has plenty of data), the generated selected.csv always has at most 5000 entries. While debugging, I found that the value of query_neighbours is also always below 5000. Do I need to adjust some ElasticSearch parameter settings?

    opened by sunyilgdx 1
  • encounter ConnectionError when executing data_selection.sh

    Hi Xingcheng, thanks for the great work and releasing the code.

    I first downloaded and ran ./bin/elasticsearch, then executed bash example_scripts/data_selection.sh. But I encountered an error: elasticsearch.exceptions.ConnectionError: ConnectionError(('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))) caused by: ProtocolError(('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')))

    Could you help me fix this error? Is there anything I missed about ElasticSearch?

    opened by fernando9torres 3
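
Regarding the UnboundLocalError reported in the pull request above ("fix variables referenced before assignment"), a minimal sketch of that kind of fix follows; the function and variable roles are assumptions for illustration, not the repository's actual training loop.

    # Sketch only: bind the accumulators before the loop so that referencing
    # them afterwards cannot raise UnboundLocalError, even when the loop body
    # never runs (e.g. an empty dataloader).
    def train(batches):
        tr_loss = 0.0   # assumed role: running training loss
        loss_step = 0   # assumed role: number of accumulated steps
        for batch in batches:
            tr_loss += batch["loss"]
            loss_step += 1
        return tr_loss, loss_step

    print(train([]))  # (0.0, 0) instead of an UnboundLocalError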
Owner
Xingcheng Yao
Undergraduate student at IIIS, Tsinghua University