Shared code for training sentence embeddings with Flax / JAX

Nils Reimers

Last update: Dec 30, 2022


Overview

flax-sentence-embeddings

This repository will be used to share code for the Flax / JAX community event to train sentence embeddings on 1B+ training pairs.

You can add your code by creating a pull request.

Dataloading

Download data

You can download the data with this basic Python script, run from the root of the project. The download should take about 20 minutes, depending on your connection speed. The total size on disk is around 25 GB.

python dataset/download_data.py --dataset_list=datasets_list.tsv --data_path=PATH_TO_STORE_DATASETS

Dataloading

The first implementation of the dataloader takes a single jsonl.gz file as input. It keeps a pointer into the file so that samples are loaded one by one. The implementation is based on the standard torch DataLoader and Dataset classes. The class supports num_workers > 0, so data loading runs in a background process on the CPU, i.e. the data is loaded and tokenized in parallel with training the network. This avoids an I/O and tokenization bottleneck. The implementation currently returns {'anchor': '...', 'positive': '...'}.
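The line-by-line reading described above can be illustrated with a minimal standard-library sketch. This is not the repository's IterableCorpusDataset; the function name stream_pairs is hypothetical, and the on-disk layout (each line being either a JSON array [anchor, positive] or a JSON object with 'anchor' and 'positive' keys) is an assumption.

```python
import gzip
import json
from typing import Dict, Iterator


def stream_pairs(file_path: str) -> Iterator[Dict[str, str]]:
    """Yield one {'anchor', 'positive'} dict per line of a gzipped JSON-lines file."""
    # Text mode lets us iterate line by line, so the whole file is
    # never decompressed into memory at once.
    with gzip.open(file_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if isinstance(record, dict):
                # Assumed layout 1: {"anchor": ..., "positive": ...}
                yield {"anchor": record["anchor"], "positive": record["positive"]}
            else:
                # Assumed layout 2: [anchor, positive]
                yield {"anchor": record[0], "positive": record[1]}
```

Because the function is a generator, calling next() on it reads only as much of the file as is needed for one sample, which is the same access pattern the dataloader relies on.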

import os

from dataset.dataset import IterableCorpusDataset

# PATH_TO_STORE_DATASETS is the directory passed as --data_path to the download script.
corpus_dataset = IterableCorpusDataset(
  file_path=os.path.join(PATH_TO_STORE_DATASETS, 'stackexchange_duplicate_questions_title_title.json.gz'),
  batch_size=2,
  num_workers=2,
  transform=None)

corpus_dataset_itr = iter(corpus_dataset)
next(corpus_dataset_itr)

# {'anchor': 'Can anyone explain all these Developer Options?',
#  'positive': 'what is the advantage of using the GPU rendering options in Android?'}

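from torch.utils.data import DataLoader
from transformers import BertTokenizer

# Load the tokenizer once instead of on every batch.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def collate(batch_input_str):
    # Tokenize anchors and positives separately; padding=True pads to the longest sequence in the batch.
    batch = {'anchor': tokenizer.batch_encode_plus([b['anchor'] for b in batch_input_str], padding=True),
             'positive': tokenizer.batch_encode_plus([b['positive'] for b in batch_input_str], padding=True)}
    return batch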

corpus_dataloader = DataLoader(
  corpus_dataset,
  batch_size=2,
  num_workers=2,
  collate_fn=collate,
  pin_memory=False,
  drop_last=True,
  shuffle=False)

print(next(iter(corpus_dataloader)))

# {'anchor': {'input_ids': [[101, 4531, 2019, 2523, 2090, 2048, 4725, 1997, 2966, 8830, 1998, 1037, 7142, 8023, 102, 0, 0, 0], [101, 1039, 1001, 10463, 5164, 1061, 2100, 2100, 24335, 26876, 11927, 4779, 4779, 2102, 2000, 3058, 7292, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}, 'positive': {'input_ids': [[101, 1045, 2031, 2182, 2007, 2033, 1010, 2048, 4725, 1997, 8830, 1025, 1037, 3115, 2729, 4118, 1010, 1998, 1037, 17009, 8830, 1012, 2367, 3633, 4374, 2367, 4118, 1010, 2049, 2035, 18154, 11095, 1012, 1045, 2572, 2667, 2000, 2424, 1996, 2523, 1997, 1996, 17009, 8830, 1998, 1037, 1005, 2092, 2108, 3556, 1005, 2029, 2003, 1037, 15973, 3643, 1012, 2054, 2003, 1996, 2190, 2126, 2000, 2424, 2151, 8924, 1029, 1041, 1012, 1043, 1012, 8833, 6553, 26237, 2944, 1029, 102], [101, 1045, 2572, 2667, 2000, 10463, 1037, 5164, 3058, 2046, 1037, 4289, 2005, 29296, 3058, 7292, 1012, 1996, 4289, 2003, 2066, 1024, 1000, 2297, 2692, 20958, 2620, 17134, 19317, 19317, 1000, 1045, 2228, 2023, 1041, 16211, 4570, 2000, 1061, 2100, 2100, 24335, 26876, 11927, 4779, 4779, 2102, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}
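The printed batch shows each group padded to its longest sequence, with an attention mask marking real tokens as 1 and padding as 0. As a rough, library-free sketch of that step (the helper name pad_batch and the toy token ids are hypothetical; pad_id=0 is chosen to match the zeros visible in the output above):

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad token-id lists to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Anchors and positives are padded independently in the collate function, which is why the two groups in the output have different lengths.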


Installation

Poetry

A Poetry pyproject.toml is provided to manage dependencies in a virtualenv. See https://python-poetry.org/

Once you have installed Poetry, you can activate the virtual environment and update the dependencies:

poetry shell
poetry update
poetry install

requirements.txt

Someone on your platform should generate it once with the following command:

poetry export -f requirements.txt --output requirements.txt

Rust compiler for Hugging Face tokenizers

Hugging Face tokenizers require a Rust compiler, so install one.

custom libs

If you want a specific version of a library, edit pyproject.toml: add the library with that version, or replace its "*" with the version you need.

Comments

Performance numbers

Hello Nils,

Is there a performance chart of all flax-sentence-embeddings models just like sentence-transformers models (https://www.sbert.net/docs/pretrained_models.html)?

opened by vamsibanda 3
Adding FastApi + Streamlit demo using flax-sentence-transformers

Adding a demo

First pull of the demo. Tomorrow I will better document the code and also add other models from the ones in flax-sentence-transformers.

Also instructions on how to run the demo locally will be added.

opened by omarespejel 0
Add more pooling options, and consolidate utils code to be re-used across various projects
Extended https://github.com/nreimers/flax-sentence-embeddings/pull/5 with more pooling options, and tests

Made the util functions ready to be used by the training scripts of various projects

Will publish a separate PR for integrating the above in the training scripts (so that everyone can re-use the same core modeling code, with the main delta being the datasets)
opened by navjotts 0
Training script for multi context training from ConveRT
Added a training script for the multiple-context conversational model described in the ConveRT paper. The code is adapted from @vasudevgupta7's code search. Updated the following:

Losses for the 3 objectives mentioned in the paper:

- the interaction between the immediate context and its accompanying response,
- the interaction of the response with up to N past contexts from the conversation history,
- the interaction of the full context with the response.

However, the paper doesn't mention how the three losses are combined (weighted or simple average). I have used a simple average for now. If there is a better way to do this, please let me know and it can be updated as needed.
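For illustration only, the simple average mentioned above, and a weighted alternative, could be written as follows. The function name combine_losses and the example values are hypothetical; this is not the code from the PR.

```python
from typing import Optional, Sequence


def combine_losses(losses: Sequence[float], weights: Optional[Sequence[float]] = None) -> float:
    """Weighted mean of per-objective losses; equal weights give the simple average."""
    if weights is None:
        weights = [1.0] * len(losses)
    if len(weights) != len(losses):
        raise ValueError("need exactly one weight per loss")
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Three objectives: immediate context, past contexts, full context (toy values).
print(combine_losses([1.0, 2.0, 3.0]))                   # 2.0
print(combine_losses([1.0, 2.0, 3.0], [2.0, 1.0, 1.0]))  # 1.75
```

Dividing by the sum of the weights keeps the combined loss on the same scale as the individual losses, so switching from the simple to a weighted average does not implicitly change the effective learning rate.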

Past contexts are concatenated (instead of separated by a [SEP] token), as mentioned in the paper and as implemented here. Contexts are sorted so that the most recent context comes first, followed by progressively older ones.

I have tested this on GPU and the script works. Will update this with multi-context evaluation and sync with other recent changes done to the code-search training script.

Suggestions or feedback on this PR are welcome.
opened by infinitylogesh 0
Added code for creating different combinations for StackExchange

This code saves the following combinations that pass certain quality checks for the StackExchange dataset:

- title, body combination
- title, highest_score_answer combination
- title + body, highest_score_answer combination
- title + body, highly_score_answer and low answer combinations

opened by manandey 0
add generic training script

This PR adds a training script for code-search-net. The script is written in a general fashion, so it should be useful for other datasets as well.

@nreimers, feel free to merge if everything looks alright to you. I will create another PR to add more features like evaluation & logging stuff by tomorrow.

EDITED: I have tested it on TPUs. It works without any errors.

opened by thevasudevgupta 0
MultipleNegativesRankingLoss with hard negative support
Enables handling additional negatives as output. Positives are handled the same way as before.

Padded version of cross entropy loss where non label-specified samples are treated as negatives.
opened by trent-dev 0
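As a rough illustration of the idea behind a multiple negatives ranking loss with extra hard negatives: every anchor is scored against all positives in the batch plus any hard negatives, and cross-entropy is taken with the anchor's own positive as the correct class. The sketch below uses plain Python; the cosine similarity, the scale of 20.0, and the function names are assumptions and not necessarily what the PR implements. Embeddings are assumed to be non-zero vectors.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mnr_loss(anchors: Sequence[Vector], positives: Sequence[Vector],
             hard_negatives: Optional[Sequence[Vector]] = None,
             scale: float = 20.0) -> float:
    """Mean cross-entropy where anchor i must rank positive i above all other candidates."""
    # Candidates: all in-batch positives first, then the optional hard negatives.
    candidates = list(positives) + list(hard_negatives or [])
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[i]  # negative log-probability of the true positive
    return total / len(anchors)
```

Hard negatives only enlarge the set of candidates in the denominator, so positives are treated exactly as in the version without them, which mirrors the description above.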
added stackexchange posts+tags parser

Copied from @dscripka's code https://github.com/dscripka/flax-sentence-embeddings/blob/main/data_processing/stackexchange/stackexchange_processing.py and turned into an API with tag management.

opened by mandubian 0
MultipleNegativesRankingLoss with dummy training script
MultipleNegativesRankingLoss with dummy training script.

Will publish additional PRs for the following:

- supporting additional negatives,
- integration with the real training script,
- a custom loss similar to CLIP.
opened by trent-dev 0
First version of multiple datasets dataloader

I implemented a version of IterableCorpusDataset that can manage multiple datasets. This version samples examples from the datasets according to the 'T5' approach (a probability distribution obtained by capping the dataset lengths and applying a temperature transformation). I removed the option to use the 'start' variable for now. I have a doubt about lines 31 and 32 (why are these required?). Maybe @AntoineSimoulin can help clarify. Looking forward to your feedback.

opened by mmuffo94 0
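The sampling distribution described above (cap each dataset size, then apply a temperature) can be sketched as follows. The function names, dataset names, sizes, and cap value are hypothetical, and this is not the code from the PR; it only follows the general recipe of making the probability of a dataset proportional to min(size, cap) raised to the power 1 / temperature.

```python
import random
from typing import Dict


def sampling_probabilities(sizes: Dict[str, int], cap: int, temperature: float = 1.0) -> Dict[str, float]:
    """Probability of drawing from each dataset: min(size, cap) ** (1 / T), normalized."""
    scaled = {name: min(n, cap) ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


def sample_dataset(probs: Dict[str, float], rng: random.Random) -> str:
    """Pick the dataset that the next training example is read from."""
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


sizes = {"large": 1000, "medium": 100, "small": 10}  # hypothetical dataset sizes
probs = sampling_probabilities(sizes, cap=100, temperature=2.0)
next_source = sample_dataset(probs, random.Random(0))
```

The cap keeps very large datasets from dominating, and a temperature above 1 flattens the distribution further so that small datasets are seen more often than their size alone would suggest.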
Modularizing LR Scheduler

Separated the LR scheduler so that it accepts different arguments (i.e. whether it should include a warmup). This basic implementation only allows for a constant schedule with an optional warmup. More complex schedules can be added later. Complementary tests (and coverage) have been added.

opened by zanussbaum 1
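A constant schedule with an optional warmup, as described above, could look roughly like this. The function name and the linear shape of the warmup are assumptions; the PR does not state how the warmup ramps up.

```python
from typing import Callable


def constant_schedule_with_warmup(base_lr: float, warmup_steps: int = 0) -> Callable[[int], float]:
    """Return a function mapping a step count to a learning rate."""
    def schedule(step: int) -> float:
        if warmup_steps > 0 and step < warmup_steps:
            # Assumed linear ramp from 0 up to base_lr over warmup_steps.
            return base_lr * step / warmup_steps
        return base_lr  # constant after (or without) warmup
    return schedule


schedule = constant_schedule_with_warmup(base_lr=1e-3, warmup_steps=100)
print(schedule(0), schedule(100), schedule(1000))  # 0.0 0.001 0.001
```

Returning a step-to-rate callable keeps the schedule independent of the optimizer, so more complex schedules can later be swapped in behind the same interface.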
