Shared code for training sentence embeddings with Flax / JAX

Overview

flax-sentence-embeddings

This repository will be used to share code for the Flax / JAX community event to train sentence embeddings on 1B+ training pairs.

You can add your code by creating a pull request.

Dataloading

Download data

You can download the data using the basic Python script at the root of the project. The download should complete in about 20 minutes, depending on your connection speed. Total size on disk is around 25 GB.

python dataset/download_data.py --dataset_list=datasets_list.tsv --data_path=PATH_TO_STORE_DATASETS

Dataloading

The first implementation of the dataloader takes a single jsonl.gz file as input. It creates a pointer to the file so that samples are loaded one by one. The implementation is based on the standard torch Dataset and DataLoader classes. The class supports num_workers > 0, so data loading is done in background processes on the CPU, i.e. the data is loaded and tokenized in parallel with training the network. This avoids creating a bottleneck from I/O and tokenization. The implementation currently returns {'anchor': '...', 'positive': '...'}.

import os

from dataset.dataset import IterableCorpusDataset

corpus_dataset = IterableCorpusDataset(
  file_path=os.path.join(PATH_TO_STORE_DATASETS, 'stackexchange_duplicate_questions_title_title.json.gz'), 
  batch_size=2,
  num_workers=2, 
  transform=None)

corpus_dataset_itr = iter(corpus_dataset)
next(corpus_dataset_itr)

# {'anchor': 'Can anyone explain all these Developer Options?',
#  'positive': 'what is the advantage of using the GPU rendering options in Android?'}

from torch.utils.data import DataLoader
from transformers import BertTokenizer

def collate(batch_input_str):
    # Tokenize anchors and positives, padding each side to the longest sequence in the batch
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    batch = {'anchor': tokenizer.batch_encode_plus([b['anchor'] for b in batch_input_str], pad_to_max_length=True),
             'positive': tokenizer.batch_encode_plus([b['positive'] for b in batch_input_str], pad_to_max_length=True)}
    return batch

corpus_dataloader = DataLoader(
  corpus_dataset,
  batch_size=2,
  num_workers=2,
  collate_fn=collate,
  pin_memory=False,
  drop_last=True,
  shuffle=False)

print(next(iter(corpus_dataloader)))

# {'anchor': {'input_ids': [[101, 4531, 2019, 2523, 2090, 2048, 4725, 1997, 2966, 8830, 1998, 1037, 7142, 8023, 102, 0, 0, 0], [101, 1039, 1001, 10463, 5164, 1061, 2100, 2100, 24335, 26876, 11927, 4779, 4779, 2102, 2000, 3058, 7292, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}, 'positive': {'input_ids': [[101, 1045, 2031, 2182, 2007, 2033, 1010, 2048, 4725, 1997, 8830, 1025, 1037, 3115, 2729, 4118, 1010, 1998, 1037, 17009, 8830, 1012, 2367, 3633, 4374, 2367, 4118, 1010, 2049, 2035, 18154, 11095, 1012, 1045, 2572, 2667, 2000, 2424, 1996, 2523, 1997, 1996, 17009, 8830, 1998, 1037, 1005, 2092, 2108, 3556, 1005, 2029, 2003, 1037, 15973, 3643, 1012, 2054, 2003, 1996, 2190, 2126, 2000, 2424, 2151, 8924, 1029, 1041, 1012, 1043, 1012, 8833, 6553, 26237, 2944, 1029, 102], [101, 1045, 2572, 2667, 2000, 10463, 1037, 5164, 3058, 2046, 1037, 4289, 2005, 29296, 3058, 7292, 1012, 1996, 4289, 2003, 2066, 1024, 1000, 2297, 2692, 20958, 2620, 17134, 19317, 19317, 1000, 1045, 2228, 2023, 1041, 16211, 4570, 2000, 1061, 2100, 2100, 24335, 26876, 11927, 4779, 4779, 2102, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}
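To feed these batches into the Flax / JAX side, the token ids can be converted to jnp arrays and passed through a Flax encoder. Below is a minimal sketch, assuming transformers' FlaxBertModel with simple mean pooling; the checkpoint name and the pooling choice are illustrative, not the event's final training setup.

import jax.numpy as jnp
from transformers import FlaxBertModel

model = FlaxBertModel.from_pretrained('bert-base-uncased')

def embed(encoded):
    # Convert the tokenizer output (lists of ints) to jnp arrays
    input_ids = jnp.array(encoded['input_ids'])
    attention_mask = jnp.array(encoded['attention_mask'])
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # Mean pooling over the non-padding tokens
    mask = attention_mask[..., None]
    return (outputs.last_hidden_state * mask).sum(axis=1) / mask.sum(axis=1)

batch = next(iter(corpus_dataloader))
anchor_embeddings = embed(batch['anchor'])      # shape (batch_size, hidden_size)
positive_embeddings = embed(batch['positive'])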


Installation

Poetry

A Poetry pyproject.toml is provided to manage dependencies in a virtualenv. See https://python-poetry.org/

Once you've installed Poetry, you can enter the virtual environment and install or update the dependencies:

poetry shell
poetry update
poetry install

requirements.txt

Someone on your platform should generate it once with the following command:

poetry export -f requirements.txt --output requirements.txt

Rust compiler for Hugging Face tokenizers

  • Hugging Face tokenizers require a Rust compiler, so install one.

Custom libs

  • If you want a specific version of any library, edit pyproject.toml: add the library and/or replace "*" with the desired version.
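
For example, a hypothetical dependency entry pinning a package to an exact version instead of "*" (the package name and version below are purely illustrative):

[tool.poetry.dependencies]
python = "^3.8"
transformers = "4.8.2"  # pinned instead of "*"
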
Comments
  • Performance numbers

    Hello Nils,

    Is there a performance chart of all flax-sentence-embeddings models just like sentence-transformers models (https://www.sbert.net/docs/pretrained_models.html)?

    opened by vamsibanda 3
  • Adding FastApi + Streamlit demo using flax-sentence-transformers

    Adding a demo

    First pull of the demo. Tomorrow I will better document the code and also add other models from the ones in flax-sentence-transformers.

    Also instructions on how to run the demo locally will be added.

    opened by omarespejel 0
  • Add more pooling options, and consolidate utils code to be re-used across various projects

    1. Extended https://github.com/nreimers/flax-sentence-embeddings/pull/5 with more pooling options, and tests
    2. Made the util functions ready to be used by the training scripts of various projects

    Will publish a separate PR for integrating the above in the training scripts (so that everyone can re-use the same core modeling code, with the main delta being the datasets)

    opened by navjotts 0
  • Training script for multi context training from ConveRT

    Added a training script for the multiple-context conversational model described in the ConveRT paper. The code is adapted from @vasudevgupta7's code-search script. Updated the following:

    • losses for the 3 objectives mentioned in the paper,

      • the interaction between the immediate context and its accompanying response,
      • the interaction of the response with up to N past contexts from the conversation history,
      • the interaction of the full context with the response

      However, the paper doesn't mention how the three losses are combined (weighted or simple average). I have used a simple average for now; if there is a better way to do this, please let me know and it can be updated as needed.

    • Past contexts are concatenated (instead of being separated by a [SEP] token), as mentioned in the paper and as implemented here. Contexts are sorted with the most recent context first, and so on.

    I have tested this on a GPU and the script works. I will update this with multi-context evaluation and sync it with other recent changes to the code-search training script.

    Suggestions or feedback on this PR are welcome.

    opened by infinitylogesh 0
  • Added code for creating different combinations for StackExchange

    This code saves the following combinations that pass certain quality checks for the StackExchange dataset:

      • title, body
      • title, highest_score_answer
      • title + body, highest_score_answer
      • title + body, highest_score_answer and low-scored answer

    opened by manandey 0
  • add generic training script

    This PR adds a training script for code-search-net. The script is written in a general fashion, so it should be useful for other datasets as well.

    @nreimers, feel free to merge if everything looks alright to you. I will create another PR to add more features like evaluation & logging by tomorrow.

    EDITED: I have tested it on TPUs; it works without any errors.

    opened by thevasudevgupta 0
  • MultipleNegativesRankingLoss with hard negative support

    1. Enable handling additional negatives in the output; positives are handled the same way.
    2. Padded version of the cross-entropy loss where samples without a specified label are treated as negatives (see the sketch after this list).
    opened by trent-dev 0
  • added stackexchange posts+tags parser

    Copied from @dscripka's code https://github.com/dscripka/flax-sentence-embeddings/blob/main/data_processing/stackexchange/stackexchange_processing.py, API-fied with tag management.

    opened by mandubian 0
  • MultipleNegativesRankingLoss with dummy training script

    Will publish additional PRs for the following.

    1. supporting additional negatives.
    2. integration with real training script.
    3. custom loss similar to CLIP.
    opened by trent-dev 0
  • First version of multiple datasets dataloader

    I implemented a version of IterableCorpusDataset able to manage multiple datasets. Moreover, this version samples examples from the multiple datasets according to the 'T5' approach (a probability distribution obtained by capping the dataset lengths plus a temperature transformation). I removed the option to use the 'start' variable for now. I have a doubt about lines 31 and 32 (why are these required?); maybe @AntoineSimoulin can help clarify. Looking forward to your feedback.

    opened by mmuffo94 0
  • Modularizing LR Scheduler

    Separated the LR scheduler to accept different arguments (e.g. whether it should include a warmup). This basic implementation only allows a constant schedule with an optional warmup; more complex schedules can be added later. Complementary tests (and coverage) have been added.

    opened by zanussbaum 1
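
Several of the pull requests above revolve around MultipleNegativesRankingLoss (in-batch negatives, optionally extended with hard negatives). The following is a minimal, illustrative JAX sketch of that general technique, assuming L2-normalised embeddings and dot-product scoring; it is not the repository's actual implementation, and the function name and scale value are placeholders.

import jax.numpy as jnp
from jax.nn import log_softmax

def multiple_negatives_ranking_loss(anchors, positives, hard_negatives=None, scale=20.0):
    # anchors, positives: (batch, dim) L2-normalised embeddings; hard_negatives: optional (n_neg, dim)
    candidates = positives if hard_negatives is None else jnp.concatenate([positives, hard_negatives], axis=0)
    # Score every anchor against every candidate; the true pairs sit on the diagonal
    scores = scale * anchors @ candidates.T          # (batch, n_candidates)
    log_probs = log_softmax(scores, axis=-1)
    labels = jnp.arange(anchors.shape[0])            # anchor i's positive is candidate i
    return -jnp.take_along_axis(log_probs, labels[:, None], axis=1).mean()
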
Owner
Nils Reimers
🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.

pySBD: Python Sentence Boundary Disambiguation (SBD) pySBD - python Sentence Boundary Disambiguation (SBD) - is a rule-based sentence boundary detecti

Nipun Sadvilkar 549 Jan 6, 2023
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.

What is MUSE? MUSE stands for Multilingual Universal Sentence Encoder - multilingual extension (16 languages) of Universal Sentence Encoder (USE). MUS

Dani El-Ayyass 47 Sep 5, 2022
Sentence Embeddings with BERT & XLNet

Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch This framework provides an easy method t

Ubiquitous Knowledge Processing Lab 9.1k Jan 2, 2023
SimCSE: Simple Contrastive Learning of Sentence Embeddings

SimCSE: Simple Contrastive Learning of Sentence Embeddings This repository contains the code and pre-trained models for our paper SimCSE: Simple Contr

Princeton Natural Language Processing 2.5k Jan 7, 2023
InferSent sentence embeddings

InferSent InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language in

Facebook Research 2.2k Dec 27, 2022
Improving Elasticsearch for semantic search using Sentence Embeddings

Improving Elasticsearch for semantic search using Sentence Embeddings: in this article I will use the pretrained model SimCS

Vo Van Phuc 18 Nov 25, 2022
Korean Simple Contrastive Learning of Sentence Embeddings using SKT KoBERT and kakaobrain KorNLU dataset

KoSimCSE Korean Simple Contrastive Learning of Sentence Embeddings implementation using pytorch SimCSE Installation git clone https://github.com/BM-K/

null 34 Nov 24, 2022
NAACL 2022: MCSE: Multimodal Contrastive Learning of Sentence Embeddings

MCSE: Multimodal Contrastive Learning of Sentence Embeddings This repository contains code and pre-trained models for our NAACL-2022 paper MCSE: Multi

Saarland University Spoken Language Systems Group 39 Nov 15, 2022
Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP

Pretrain and Fine-tune a T5 model with Flax on GCP This tutorial details how pretrain and fine-tune a FlaxT5 model from HuggingFace using a TPU VM ava

Gabriele Sarti 41 Nov 18, 2022
Source code for CsiNet and CRNet using Fully Connected Layer-Shared feedback architecture.

FCS-applications Source code for CsiNet and CRNet using the Fully Connected Layer-Shared feedback architecture. Introduction This repository contains

Boyuan Zhang 4 Oct 7, 2022
A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

Chimera: Learning Shared Semantic Space for Speech-to-Text Translation This is a Pytorch implementation for the "Chimera" paper Learning Shared Semant

Chi Han 43 Dec 28, 2022
Winner system (DAMO-NLP) of SemEval 2022 MultiCoNER shared task over 10 out of 13 tracks.

KB-NER: a Knowledge-based System for Multilingual Complex Named Entity Recognition The code is for the winner system (DAMO-NLP) of SemEval 2022 MultiC

null 116 Dec 27, 2022
Shared, streaming Python dict

UltraDict Sychronized, streaming Python dictionary that uses shared memory as a backend Warning: This is an early hack. There are only few unit tests

Ronny Rentner 192 Dec 23, 2022
source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

WhiteningBERT Source code and data for paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. Preparation git clone https://github.com

null 49 Dec 17, 2022
Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

ConSERT Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer Requirements torch==1.6.0

Yan Yuanmeng 478 Dec 25, 2022
A sentence aligner for comparable corpora

About Yalign is a tool for extracting parallel sentences from comparable corpora. Statistical Machine Translation relies on parallel corpora (eg.. eur

Machinalis 128 Aug 24, 2022