SentAugment is a data augmentation technique for semi-supervised learning in NLP.

Overview

SentAugment

SentAugment is a data augmentation technique for semi-supervised learning in NLP. It uses state-of-the-art sentence embeddings to structure the information of a very large bank of sentences. The large-scale sentence embedding space is then used to retrieve in-domain unannotated sentences for any language understanding task, so that semi-supervised learning techniques like self-training and knowledge distillation can be leveraged. This means you do not need to assume that unannotated in-domain sentences are already available in order to use semi-supervised learning. In our paper Self-training Improves Pre-training for Natural Language Understanding, we show that SentAugment provides strong gains on multiple language understanding tasks when used in combination with self-training or knowledge distillation.

Dependencies

The pipeline relies on PyTorch (sentence embeddings are saved as torch files), FAISS (fast nearest-neighbor indexes, section IV) and XLM (the Transformer sentence encoder, section II).

I. The large-scale bank of sentences

Our approach is based on a large bank of CommonCrawl web sentences. We use SentAugment to filter domain-specific unannotated data for semi-supervised NLP methods. The data is available at http://www.statmt.org/cc-english/ and can be recovered from CommonCrawl using the ccnet repository. It consists of 5 billion sentences, split into files of 100M sentences each. As an example, we are going to use the 100M sentences from the first file:

mkdir data && cd data
wget http://www.statmt.org/cc-english/x01.cc.5b.tar.gz

Then untar the files and put all sentences into a single file:

tar -xvf *.tar.gz
cat *.5b > keys.txt

Then, for fast indexing, create a memory map (mmap) of this text file:

python src/compress_text.py --input data/keys.txt &

We will use this data as the bank of sentences.
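
For intuition, the memory map pairs the raw text of keys.txt with a side table of byte offsets, so any single sentence can be read without loading the whole file into RAM. The sketch below only illustrates that idea: the helper function, the offset file name and the assumption that offsets are stored as int64 starting positions are illustrative and may not match the exact format written by src/compress_text.py.

import numpy as np

def read_sentence(txt_path, offsets_path, k):
    # Memory-map the raw sentence bank and load the per-sentence start offsets.
    txt = np.memmap(txt_path, dtype=np.uint8, mode="r")
    offsets = np.fromfile(offsets_path, dtype=np.int64)
    start = int(offsets[k])
    end = start
    while end < len(txt) and txt[end] != 10:   # scan to the next newline (byte 10)
        end += 1
    return bytes(txt[start:end]).decode("utf-8")

# Example (paths are assumptions): print the first sentence of the bank.
print(read_sentence("data/keys.txt", "data/keys.txt.ref.bin64", 0))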

II. The SentAugment sentence embedding space (SASE)

Our sentence encoder is based on the Transformer implementation of XLM. It obtains state-of-the-art performance on several STS benchmarks. To use it, first clone XLM:

git clone https://github.com/facebookresearch/XLM

Then, download the SentAugment sentence encoder (SASE), and its sentencepiece model:

cd data
wget https://dl.fbaipublicfiles.com/sentaugment/sase.pth
wget https://dl.fbaipublicfiles.com/sentaugment/sase.spm

Then, to embed sentences, you can run, for instance:

input=data/keys.txt  # input text file
output=data/keys.pt  # output pytorch file

# Encode sentence from $input file and save it to $output
python src/sase.py --input $input --model data/sase.pth --spm_model data/sase.spm --batch_size 64 --cuda "True" --output $output

This will output a torch file containing sentence embeddings (dim=256).
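
To sanity-check the output, you can load it back with torch. The sketch below assumes src/sase.py saves a single 2D float tensor with one 256-dimensional row per input sentence; the exact storage layout is an assumption here.

import torch

emb = torch.load("data/keys.pt", map_location="cpu")   # assumed to be an [N, 256] tensor
print(emb.shape)     # expected: (number_of_sentences, 256)
print(emb[0][:5])    # first few dimensions of the first sentence embedding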

III. Retrieving nearest neighbor sentences from a query

Now that you have constructed a sentence embedding space by encoding many sentences from CommonCrawl, you can leverage that "bank of sentences" with similarity search. From an input query sentence, you can retrieve nearest neighbors from the bank by running:

bank=data/keys.txt.ref.bin64  # compressed text file (bank)
emb=data/keys.pt  # embeddings of sentences (keys)
K=10000  # number of sentences to retrieve per query

## encode input sentences as sase embedding
input=sentence.txt  # input file containing a few (query) sentences
python src/sase.py --input $input --model data/sase.pth --spm_model data/sase.spm --batch_size 64 --cuda "True" --output $input.pt

## use embedding to retrieve nearest neighbors
input=sentence.txt  # input file containing a few (query) sentences
python src/flat_retrieve.py --input $input.pt --bank $bank --emb data/keys.pt --K $K > nn.txt &

Sentences in nn.txt can be used for semi-supervised learning as unannotated in-domain data. They also provide good paraphrases (use the cosine similarity score to filter good paraphrase pairs).
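
As a sketch of that cosine-similarity filtering, assuming you also embed the retrieved nn.txt sentences with src/sase.py (the file names and the 0.8 threshold below are illustrative assumptions, not values from the paper):

import torch
import torch.nn.functional as F

# L2-normalize so the dot product equals cosine similarity.
queries = F.normalize(torch.load("sentence.txt.pt", map_location="cpu"), dim=1)  # Q x 256
neighbors = F.normalize(torch.load("nn.txt.pt", map_location="cpu"), dim=1)      # N x 256

scores = queries @ neighbors.t()                 # Q x N cosine similarities
pairs = (scores > 0.8).nonzero(as_tuple=False)   # keep pairs above the threshold
for q, n in pairs.tolist():
    print(q, n, round(scores[q, n].item(), 3))   # candidate paraphrase pair and its score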

In the next part, we provide fast nearest-neighbor indexes for faster retrieval of similar sentences.

IV. Fast K-nearest neighbor search

Fast K-nearest neighbor search is particularly important when considering a large bank of sentences. We use FAISS indexes to optimize memory usage and query time.

IV.1 - The KNN index bestiary

For fast nearest-neighbor search, we provide pretrained FAISS indexes (see the table below). Each index enables fast NN search based on a different compression scheme. The embeddings are compressed using, for instance, scalar quantization (SQ4 or SQ8) and PCA dimensionality reduction (PCAR: 14, 40 or 256 dimensions), and search is sped up with k-means clustering (32k or 262k clusters). Please consider looking at the FAISS documentation for more information on indexes and how to train them.

FAISS index      #Sentences  #Clusters  Quantization  #PCAR  Machine  Size
100M_1GPU_16GB   100M        32768      SQ4           256    1GPU16   14GiB
100M_1GPU_32GB   100M        32768      SQ8           256    1GPU32   26GiB
1B_1GPU_16GB     1B          262144     SQ4           14     1GPU16   15GiB
1B_1GPU_32GB     1B          262144     SQ4           40     1GPU32   28GiB
1B_8GPU_32GB     1B          262144     SQ4           256    8GPU32   136GiB

We provide indexes that fit on a single GPU with 16GiB of memory (1GPU16), a larger index that fits on a single GPU with 32GiB of memory (1GPU32), and one that fits on 8 GPUs with 32GiB each (8GPU32). The 100M-sentence indexes are built from the first file "x01.cc.5b.tar.gz", and the 1B-sentence indexes use the first ten files. All indexes are based on SASE embeddings.
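
As an illustration of how an index in this family could be built with the FAISS Python API, here is a sketch only: the factory string below mirrors the 100M_1GPU_16GB row of the table (PCA to 256 dimensions, 32768 IVF clusters, 4-bit scalar quantization), but the pretrained indexes above were not necessarily produced exactly this way.

import faiss
import numpy as np
import torch

# Bank embeddings produced by src/sase.py (assumed to be an [N, 256] float tensor).
emb = torch.load("data/keys.pt", map_location="cpu").numpy().astype("float32")
emb = np.ascontiguousarray(emb)

# PCA (with rotation) to 256 dims, 32768-cluster IVF, 4-bit scalar quantization.
index = faiss.index_factory(emb.shape[1], "PCAR256,IVF32768,SQ4")
index.train(emb)    # trains the PCA, the k-means clustering and the quantizer
index.add(emb)      # add the bank embeddings to the index
faiss.write_index(index, "data/my_index.faiss.idx")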

IV.2 - How to use an index to query nearest neighbors

You can get K nearest neighbors for each sentence of an input text file by running:

## encode input sentences as sase embedding
input=sentence.txt  # input file containing a few (query) sentences
python src/sase.py --input $input --model data/sase.pth --spm_model data/sase.spm --batch_size 64 --cuda "True" --output $input.pt

index=data/100M_1GPU_16GB.faiss.idx  # FAISS index path
input=sentence.txt.pt  # embeddings of the query sentences (output of the previous step)
bank=data/keys.txt  # text file with all the data (the compressed file keys.ref.bin64 should also be present in the same folder)
K=10  # number of sentences to retrieve per query
NPROBE=1024 # number of probes for querying the index

python src/faiss_retrieve.py --input $input --bank $bank --index $index --K $K --nprobe $NPROBE --gpu "True" > nn.txt &

This can also be used for paraphrase mining.
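
For reference, here is a hedged sketch of querying one of the pretrained indexes directly with the FAISS Python API, independently of src/faiss_retrieve.py; the paths, the number of neighbors and the commented-out GPU transfer are assumptions.

import faiss
import numpy as np
import torch

index = faiss.read_index("data/100M_1GPU_16GB.faiss.idx")
faiss.extract_index_ivf(index).nprobe = 1024      # clusters probed per query

# Optionally move the index to GPU 0 (needs enough free GPU memory):
# index = faiss.index_cpu_to_gpu(faiss.StandardGpuResources(), 0, index)

queries = torch.load("sentence.txt.pt", map_location="cpu").numpy().astype("float32")
distances, ids = index.search(np.ascontiguousarray(queries), 10)   # 10 neighbors per query
print(ids)   # row i holds the bank positions of the nearest neighbors of query i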

Reference

If you found the resources here useful, please consider citing our paper:

@article{du2020self,
  title={Self-training Improves Pre-training for Natural Language Understanding},
  author={Du, Jingfei and Grave, Edouard and Gunel, Beliz and Chaudhary, Vishrav and Celebi, Onur and Auli, Michael and Stoyanov, Ves and Conneau, Alexis},
  journal={arXiv preprint arXiv:2010.02194},
  year={2020}
}

License

See the LICENSE file for more details. The majority of SentAugment is licensed under CC-BY-NC. However, license information for PyTorch code is available at https://github.com/pytorch/pytorch/blob/master/LICENSE

Comments
  • GPU Memory issue in using Faiss index

    I am able to use flat_retrieve.py for smaller files, but not for the keys.txt file which has 100M sentences. For that I am trying to use a pre-trained FAISS index. I downloaded 100M_1GPU_16GB, which is supposed to fit on one 16GB GPU. I am using a GPU with 32GB of memory but still getting OOM errors. Here is the stack trace (some lines deleted to keep it short).

    Reading FAISS index
     - index: data/100M_1GPU_16GB.faiss.idx
     - found 100000000 sentences of dim 256
     - setting nbprobe to 32
     - transfer index to 1 GPUs 
    Traceback (most recent call last):
      File "src/faiss_retrieve.py", line 37, in <module>
        index = IndexLoad(args.index, args.nprobe, args.gpu)
    ...
    RuntimeError: Error in virtual void* faiss::gpu::StandardGpuResourcesImpl::allocMemory(const faiss::gpu::AllocRequest&) at /home/conda/feedstock_root/build_artifacts/faiss-split_1636459943780/work/faiss/gpu/StandardGpuResources.cpp:452: Error: 'err == cudaSuccess' failed: StandardGpuResources: alloc fail type IVFLists dev 0 space Device stream 0x5633af889030 size 606208 bytes (cudaMalloc error out of memory [2])
    
    opened by Amit-GH 1
  • The main branch seems to be out of date, referencing non-existent code

    See:

    from src.utils import AttrDict
    from src.data.dictionary import Dictionary, BOS_WORD, EOS_WORD, PAD_WORD, UNK_WORD, MASK_WORD
    from src.model.transformer import TransformerModel
    
    opened by munael 1
  • indexTextQuery

    when I run "flat_retrieve.py" , I got an error in indexing line 95 b[0:i].decode('utf-8'), have a error " utf-8" codec can't decode byte Oxca in position 8: invalid continuation byte.

    I read the code, CompressText stored offset using np.int64 to keys.ref.bin64 file. But IndexTextOpen read the file( keys.ref.bin64) use np.uint32. because "if os.path.isfile(fname)" is always true.

    if I change the IndexTextOpen read the file (keys.ref.bin64) use np.unit64, I got another errror for index is out bounds for axis 0

    opened by luolanfeixue 1
  • some codes are missing

    sase.py, lines 19-21:

    from src.utils import AttrDict
    from src.data.dictionary import Dictionary, BOS_WORD, EOS_WORD, PAD_WORD, UNK_WORD, MASK_WORD
    from src.model.transformer import TransformerModel

    The GitHub repository doesn't have src.utils, src.data or src.model. Could you push the code?

    opened by luolanfeixue 1
  • Question about Roberta-small size

    Hello,

    As mentioned in the caption of Table 4, "We distill a RoBERTa-Large model of 24 layers into a RoBERTa-Small model with 100× less parameters." Does that mean the size of your RoBERTa-Small is 3.5M parameters? I cannot find the model in the official RoBERTa repo. May I ask if you can share the pretrained RoBERTa-Small model? Thanks.

    Yiming

    opened by MatthewCYM 0
  • Facing multiple issues while running src/flat_retrieve.py

    Hello All, I am trying to run SentAugment as part of my project, for clustering purposes, but I am facing multiple issues trying to run it. I am using a part of the CommonCrawl data for this purpose.

    Issue 1:

    File "src/flat_retrieve.py", line 37, in <module>
        _, indices = torch.topk(scores, params.k, dim=0)  # K x Q
    NameError: name 'params' is not defined

    File "SentAugment/src/flat_retrieve.py", line 42, in <module>
        for k in range(K):
    NameError: name 'K' is not defined

    Proposed Solution:

    Is it a bug? Should it be args.K instead of params.k and just K?

    Issue 2:

    File "src/flat_retrieve.py", line 43, in <module>
        print(IndexTextQuery(txt_mmap, ref_mmap, indices[k][qeury_idx]))
    File "/home/username/FAIRCluster/SentAugment/src/indexing.py", line 95, in IndexTextQuery
        return b[0:i].decode('utf-8')
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 8: invalid continuation byte

    I followed all the steps mentioned but am getting this error when running step 3. Can somebody help with that?

    Issue 3: After removing the decode and trying to run it,

    File "src/flat_retrieve.py", line 43, in print(IndexTextQuery(txt_mmap, ref_mmap, indices[k][qeury_idx])) File "/home/username/FAIRCluster/SentAugment/src/indexing.py", line 92, in IndexTextQuery while txt_mmap[p+i] != 10 and i < dim: File "/home/username/anaconda3/envs/envConda6/lib/python3.7/site-packages/numpy/core/memmap.py", line 331, in getitem res = super(memmap, self).getitem(index) IndexError: index 25580 is out of bounds for axis 0 with size 8000

    Any guess on what's wrong here?

    Thanks in advance. Any help appreciated!

    opened by karthickpgunasekaran 1
  • High RAM usage during sentence encoding

    Hello, I've been running the example in the readme with 100M sentences, and I've been trying to run the following code:

    python src/sase.py --input $input --model data/sase.pth --spm_model data/sase.spm --batch_size 64 --cuda "True" --output $output

    Within sase.py, the program reaches the loop at line 70 but never completes it. It processes about 12 million sentences before using all the available RAM (24GB), and then the program quits. I've tried different batch sizes, but since the portion of code in question doesn't reference the batch size, that clearly can't be the issue.

    Is there any workaround for this (besides working with fewer sentences)?

    opened by LukasDBaker 0
  • Kernel dies during embedding of sentences

    Hi, when I try to embed the 100M sentences using python src/sase.py --input $input --model data/sase.pth --spm_model data/sase.spm --batch_size 64 --cuda "True" --output $output, the kernel dies automatically.

    What are the hardware requirements for embedding the sentences?

    opened by avinashsai 5
Owner: Meta Research