Sentence Embeddings with BERT & XLNet

Ubiquitous Knowledge Processing Lab

Last update: Jan 2, 2023


Overview

Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch

This framework provides an easy method to compute dense vector representations for sentences and paragraphs (also known as sentence embeddings). The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and are tuned specifically to produce meaningful sentence embeddings, such that sentences with similar meanings are close in vector space.

We provide an increasing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use-cases.

Further, this framework makes it easy to fine-tune custom embedding models to achieve maximal performance on your specific task.

For the full documentation, see www.SBERT.net, as well as our publications:

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019)
Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation (EMNLP 2020)
Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks (arXiv 2020)
The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes (arXiv 2020)

Installation

We recommend Python 3.6 or higher, PyTorch 1.6.0 or higher and transformers v3.1.0 or higher. The code does not work with Python 2.7.

Install with pip

Install the sentence-transformers with pip:

pip install -U sentence-transformers

Install from sources

Alternatively, you can clone the latest version from the repository and install it directly from the source code (run this from the root of the cloned repository):

pip install -e .

PyTorch with CUDA

If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA version. Follow PyTorch - Get Started for further details on how to install PyTorch.
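
To check whether the installed PyTorch build can actually see a GPU, a small helper along the following lines may be useful. This is a minimal sketch: the function name pick_device is hypothetical and not part of this framework, and it falls back to "cpu" when PyTorch is missing or reports no CUDA device.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and reports a GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so there is nothing to query.
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print("Selected device:", pick_device())
```

The returned string can then be passed as the device argument when constructing a SentenceTransformer.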

Getting Started

See Quickstart in our documentation.

This example shows you how to use an already trained Sentence Transformer model to embed sentences for another task.

First download a pretrained model.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-distilroberta-base-v1')

Then provide some sentences to the model.

sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

And that's it. We now have the embeddings as numpy arrays, one per input sentence.

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
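
To make "close in vector space" concrete, the sketch below compares embeddings with cosine similarity. The three-dimensional vectors are made-up stand-ins for the output of model.encode (real embeddings have hundreds of dimensions), and the helper is written in plain Python purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not produced by any real model).
toy_embeddings = {
    "A cat sits on the mat.": [0.9, 0.1, 0.0],
    "A kitten rests on the rug.": [0.8, 0.2, 0.1],
    "Stock prices fell sharply.": [0.0, 0.2, 0.9],
}

query = "A cat sits on the mat."
for sentence, vector in toy_embeddings.items():
    score = cosine_similarity(toy_embeddings[query], vector)
    print(f"{score:.3f}  {sentence}")
```

With real embeddings, the library's own similarity utilities are the practical choice; the loop above only shows the idea that paraphrases score higher than unrelated sentences.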

Pre-Trained Models

We provide a large list of Pretrained Models for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: SentenceTransformer('model_name').

» Full list of pretrained models

Training

This framework allows you to fine-tune your own sentence embedding models, so that you get task-specific sentence embeddings. You have various options to choose from in order to get embeddings that fit your specific task.

See Training Overview for an introduction to training your own embedding models. We provide examples of how to train models on various datasets.

Some highlights are:

Support of various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...
Multi-Lingual and multi-task learning
Evaluation during training to find optimal model
10+ loss functions that allow tuning models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, contrastive loss.
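
As an illustration of one of the loss types named above, here is a minimal sketch of a triplet loss on plain Python lists. It is not the framework's implementation: the Euclidean distance and the margin of 1.0 are assumptions chosen for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]
near = [0.0, 1.0]   # distance 1 from the anchor
far = [3.0, 4.0]    # distance 5 from the anchor

print(triplet_loss(anchor, positive=near, negative=far))  # already well separated
print(triplet_loss(anchor, positive=far, negative=near))  # positive is too far away
```

The loss only pushes on triplets that violate the margin, which is why well-separated examples contribute nothing to training.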

Performance

Our models are evaluated extensively and achieve state-of-the-art performance on various tasks. Further, the code is tuned to provide the highest possible speed.

Model	STS benchmark	SentEval
Avg. GloVe embeddings	58.02	81.52
BERT-as-a-service avg. embeddings	46.35	84.04
BERT-as-a-service CLS-vector	16.50	84.66
InferSent - GloVe	68.03	85.59
Universal Sentence Encoder	74.92	85.10
Sentence Transformer Models
nli-bert-base	77.12	86.37
nli-bert-large	79.19	87.78
stsb-bert-base	85.14	86.07
stsb-bert-large	85.29	86.66
stsb-roberta-base	85.44	-
stsb-roberta-large	86.39	-
stsb-distilbert-base	85.16	-

Application Examples

You can use this framework for:

and many more use-cases.

For all examples, see examples/applications.

Citing & Authors

If you find this repository helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

If you use one of the multilingual models, feel free to cite our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation:

@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}

If you use the code for data augmentation, feel free to cite our publication Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks:

@article{thakur-2020-AugSBERT,
    title = "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
    author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and  Gurevych, Iryna", 
    journal= "arXiv preprint arXiv:2010.08240",
    month = "10",
    year = "2020",
    url = "https://arxiv.org/abs/2010.08240",
}

The main contributors of this repository are:

Contact person: Nils Reimers, [email protected]

https://www.ukp.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Comments

Is it Multilingual?

Hello,

This might be a stupid question, but I wanted to know if I can use the clustering on German sentences. Will it work with the pre-trained model or do I need to train it on German data first?

Thanks.

opened by SouravDutta91 44
Fine-tune multilingual model for domain specific vocab

Thanks for the repository and for continuous updates.

I wanted to check whether I understood it correctly: is it possible to continue fine-tuning one of the multilingual models for a specific domain? For example, can I take 'xlm-r-distilroberta-base-paraphrase-v1' and fine-tune it on domain-related parallel data (English-other languages) with MultipleNegativesRankingLoss?

opened by langineer 30
Is it possible to encode by using multi-GPU?

Thanks for this beautiful package, it saves a lot of work to do semantic embedding. I am running a large size data base trying to transform docs into embedding matrix. When I was running with the code, it seemed only using single GPU to encode the sentence. Is there any way that I could do this by multi-GPU?

opened by z307287280 30
public.ukp.informatik.tu-darmstadt.de Unreachable

It looks like the server which hosts the pre-trained models (https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/) has been unavailable for a few hours now.

opened by Ganners 20
ModuleNotFoundError: No module named 'sentence_transformers.evaluation'

After pip installing and trying to import SentenceTransformer I get this error: ModuleNotFoundError: No module named 'sentence_transformers.evaluation'

When I look into the source code the only folder I have is models. I am missing evaluation, etc. Any Idea why?

opened by DavidBegert 20
Fine-tune underlying language model for SBERT
Hi,

I'd like to use the SBERT model architecture for document similarity and topic modelling tasks. However, my data corpus is fairly domain-specific, and I suspect that SBERT will underperform as it was trained on generic Wiki/library corpora. So, I wonder if there are any recommendations for fine-tuning the underlying language model for SBERT.

I envision that the overall process will be following:

Take pre-trained BERT model

Fine tune Language Model on domain-specific corpus

Then retrain SBERT model architecture on specific tasks (e.g. SNLI dataset/task)

Curious to hear thought on the approach and problem definition.
opened by vdabravolski 18

ModuleNotFoundError: No module named 'setuptools.command.build'

I am trying to pip install sentence transformers on my Macbook Pro with M1 chip. I am using:

pip install -U sentence-transformers

When I run this, I get this error saying:

ModuleNotFoundError: No module named 'setuptools.command.build'

Full output:

Defaulting to user installation because normal site-packages is not writeable
Collecting sentence-transformers
  Using cached sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0
  Using cached transformers-4.21.0-py3-none-any.whl (4.7 MB)
Collecting tqdm
  Using cached tqdm-4.64.0-py2.py3-none-any.whl (78 kB)
Requirement already satisfied: torch>=1.6.0 in ./Library/Python/3.8/lib/python/site-packages (from sentence-transformers) (1.12.0)
Collecting torchvision
  Using cached torchvision-0.13.0-cp38-cp38-macosx_11_0_arm64.whl (1.2 MB)
Requirement already satisfied: numpy in ./Library/Python/3.8/lib/python/site-packages (from sentence-transformers) (1.23.1)
Collecting scikit-learn
  Using cached scikit_learn-1.1.1-cp38-cp38-macosx_12_0_arm64.whl (7.6 MB)
Collecting scipy
  Using cached scipy-1.8.1-cp38-cp38-macosx_12_0_arm64.whl (28.6 MB)
Collecting nltk
  Using cached nltk-3.7-py3-none-any.whl (1.5 MB)
Collecting sentencepiece
  Using cached sentencepiece-0.1.96.tar.gz (508 kB)
  Preparing metadata (setup.py) ... done
Collecting huggingface-hub>=0.4.0
  Using cached huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting requests
  Using cached requests-2.28.1-py3-none-any.whl (62 kB)
Collecting pyyaml>=5.1
  Using cached PyYAML-6.0.tar.gz (124 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: typing-extensions>=3.7.4.3 in ./Library/Python/3.8/lib/python/site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (4.3.0)
Requirement already satisfied: filelock in ./Library/Python/3.8/lib/python/site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (3.7.1)
Requirement already satisfied: packaging>=20.9 in ./Library/Python/3.8/lib/python/site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (21.3)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Using cached tokenizers-0.12.1.tar.gz (220 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [20 lines of output]
      Traceback (most recent call last):
        File "/Users/joeyoneill/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
          main()
        File "/Users/joeyoneill/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/Users/joeyoneill/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 146, in get_requires_for_build_wheel
          return self._get_build_requires(
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 127, in _get_build_requires
          self.run_setup()
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 142, in run_setup
          exec(compile(code, __file__, 'exec'), locals())
        File "setup.py", line 2, in <module>
          from setuptools_rust import Binding, RustExtension
        File "/private/var/folders/bg/ncfh283n4t39vqhvbd5n9ckh0000gn/T/pip-build-env-vjj6eow8/overlay/lib/python3.8/site-packages/setuptools_rust/__init__.py", line 1, in <module>
          from .build import build_rust
        File "/private/var/folders/bg/ncfh283n4t39vqhvbd5n9ckh0000gn/T/pip-build-env-vjj6eow8/overlay/lib/python3.8/site-packages/setuptools_rust/build.py", line 20, in <module>
          from setuptools.command.build import build as CommandBuild  # type: ignore[import]
      ModuleNotFoundError: No module named 'setuptools.command.build'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

	note: This error originates from a subprocess, and is likely not a problem with pip.

Can anybody tell me what I should do or what is wrong with what I am currently doing? I factory reset my Mac and re-downloaded everything but I still get this same error. I am stumped.

opened by joeyoneill 15

HTTPError: 403 Client Error:

I get a request error and I do not know why.


[W 2021-02-02 18:43:15,951] Trial 0 failed because of the following error: HTTPError('403 Client Error: Forbidden for url: https://sbert.net/models/bert-base-german-dbmdz-uncased.zip',)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/optuna/_optimize.py", line 211, in _run_trial
    value_or_values = func(trial)
  File "<ipython-input-6-af5cb77f5b44>", line 40, in objective
    model = SentenceTransformer(model_name)  # distiluse-base-multilingual-cased-v2  distilbert-multilingual-nli-stsb-quora-ranking
  File "/usr/local/lib/python3.6/dist-packages/sentence_transformers/SentenceTransformer.py", line 92, in __init__
    raise e
  File "/usr/local/lib/python3.6/dist-packages/sentence_transformers/SentenceTransformer.py", line 75, in __init__
    http_get(model_url, zip_save_path)
  File "/usr/local/lib/python3.6/dist-packages/sentence_transformers/util.py", line 201, in http_get
    req.raise_for_status()
  File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://sbert.net/models/bert-base-german-dbmdz-uncased.zip

HTTPError: 403 Client Error: Forbidden for url: https://sbert.net/models/bert-base-german-dbmdz-uncased.zip

opened by tide90 15

Batch cos_sim for community_detection?

I've been experimenting with the community_detection method but noticed I quickly get OOM errors if the set of embeddings is too large.

Seeing how it uses cos_sim to compute all the embedding distances, do you think it would make sense to have the option for batching? I believe you will find other bottlenecks when iterating over the entries, but at least it will complete on larger sets of embeddings.

opened by mmaybeno 13
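
The batching idea raised in this issue can be sketched as follows: instead of building the full n x n similarity matrix, only a block of batch_size rows is held in memory at a time. This is a plain-Python illustration with hypothetical names (batched_neighbors, threshold, batch_size); it is not the library's community_detection code, and it assumes no embedding is the zero vector.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def batched_neighbors(embeddings, threshold, batch_size=2):
    """For each embedding, list the indices with cosine similarity >= threshold."""
    unit = [normalize(v) for v in embeddings]
    neighbors = []
    for start in range(0, len(unit), batch_size):
        batch = unit[start:start + batch_size]
        # Similarity block of shape (len(batch), n); discarded after this iteration.
        block = [[sum(x * y for x, y in zip(q, c)) for c in unit] for q in batch]
        for row in block:
            neighbors.append([j for j, score in enumerate(row) if score >= threshold])
    return neighbors


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(batched_neighbors(vectors, threshold=0.9))
```

Peak memory then scales with batch_size x n rather than n x n, at the cost of keeping only the thresholded neighbor lists rather than all scores.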

'torch._C.PyTorchFileReader' object has no attribute 'seek'

Hello,

I am using the following model for sentence similarity

https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual/tree/main

word_embedding_model = models.Transformer(bert_model_dir)  # , max_seq_length=512
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model], device=device_str)

But, I get this error:

Traceback (most recent call last):
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 306, in _check_seekable
    f.seek(f.tell())
AttributeError: 'torch._C.PyTorchFileReader' object has no attribute 'seek'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/work/anaconda/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1205, in from_pretrained
    state_dict = torch.load(resolved_archive_file, map_location="cpu")
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 584, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/work/anaconda/lib/python3.6/site-packages/moxing/framework/file/file_io_patch.py", line 200, in _load
    _check_seekable(f)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 309, in _check_seekable
    raise_err_msg(["seek", "tell"], e)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 302, in raise_err_msg
    raise type(e)(msg)
AttributeError: 'torch._C.PyTorchFileReader' object has no attribute 'seek'. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "code/similarity.py", line 118, in <module>
    word_embedding_model = models.Transformer(bert_model_dir) #, max_seq_length=512
  File "/home/work/anaconda/lib/python3.6/site-packages/sentence_transformers/models/Transformer.py", line 30, in __init__
    self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
  File "/home/work/anaconda/lib/python3.6/site-packages/transformers/models/auto/auto_factory.py", line 381, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/home/work/anaconda/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1208, in from_pretrained
    f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
OSError: Unable to load weights from pytorch checkpoint file for '/home/work/user-job-dir/input/pretrained_models/stsb-xlm-r-multilingual/' at '/home/work/user-job-dir/input/pretrained_models/stsb-xlm-r-multilingual/pytorch_model.bin'If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

I checked on web but could not find any solution. What could be the problem? Thank you.

opened by deadsoul44 13

Getting SSL Error in downloading "distilroberta-base-paraphrase-v1" model embeddings

I am using Google Colab with PyTorch version 1.7.0+cu101. I am getting an SSL error when I try to download the "distilroberta-base-paraphrase-v1" model.

Code

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilroberta-base-paraphrase-v1')

Error

SSLError Traceback (most recent call last) /usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw) 599 body=body, headers=headers, --> 600 chunked=chunked) 601

24 frames SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)

During handling of the above exception, another exception occurred:

MaxRetryError Traceback (most recent call last) MaxRetryError: HTTPSConnectionPool(host='public.ukp.informatik.tu-darmstadt.de', port=443): Max retries exceeded with url: /reimers/sentence-transformers/v0.2/distilroberta-base-paraphrase-v1.zip (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))

During handling of the above exception, another exception occurred:

SSLError Traceback (most recent call last) SSLError: HTTPSConnectionPool(host='public.ukp.informatik.tu-darmstadt.de', port=443): Max retries exceeded with url: /reimers/sentence-transformers/v0.2/distilroberta-base-paraphrase-v1.zip (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))

During handling of the above exception, another exception occurred:

FileNotFoundError Traceback (most recent call last) /usr/lib/python3.6/shutil.py in rmtree(path, ignore_errors, onerror) 473 # lstat()/open()/fstat() trick. 474 try: --> 475 orig_st = os.lstat(path) 476 except Exception: 477 onerror(os.lstat, path, sys.exc_info())

FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/torch/sentence_transformers/sbert.net_models_distilroberta-base-paraphrase-v1'

opened by rahuliitkgp31 13
model.fit results in nan

Hi,

I want to fine-tune SBERT with pre-trained weights of 'bert-base-uncased'. I follow this tutorial: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py using MultipleNegativesRankingLoss loss function.

When I do model.fit , the results are 'nan' everywhere.

Here is my code:

root_model = AutoModel.from_pretrained('bert-base-uncased')
output_dir = "/root/Automated_Assessment_(ETS)/Model/DRAFT/DRAFT_Bert_base_uncased"
BERT_model = root_model.save_pretrained(output_dir)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # ('onlplab/alephbert-base')
tokenizer.save_pretrained(output_dir)

learning_rate, batch_size, epochs = 2e-5, 8, 1

train_dataloader = datasets.NoDuplicatesDataLoader(train_data, batch_size=batch_size)
word_embedding_model = models.Transformer(output_dir, max_seq_length=512)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_loss = losses.MultipleNegativesRankingLoss(model)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(val_data, batch_size=batch_size)

warmup_steps = math.ceil(len(train_dataloader) * epochs * 0.1)  # 10% of train data for warm-up
logging.info("Warmup-steps: {}".format(warmup_steps))

output_file = 'output/sentence_similarity'+MODEL_NAME.replace("/", "-")+'-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
sb_output_path = os.path.join(ref_saved_models_path, output_file)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=val_evaluator,
          epochs=epochs,
          evaluation_steps=int(len(train_dataloader)*0.1),
          warmup_steps=warmup_steps,
          output_path=sb_output_path,
          use_amp=False  # Set to True, if your GPU supports FP16 operations
          )

here is a screenshot of the log:

I don't understand what I am doing wrong. Could you please help me?

opened by Abigail-gs 0

Dtype error when using Pooling + Dense layers with half precision

The models.Pooling layer seems to always output a 32-bit float as its sentence_embedding. This leads to a dtype error when using a dense layer after the pooling layer when the model is in half precision mode via model.half().

Here is a minimal example:

from sentence_transformers import SentenceTransformer,models
from torch import nn

word_embedding_model = models.Transformer("sentence-transformers/all-MiniLM-L6-v2")
polling = models.Pooling(word_embedding_model.get_word_embedding_dimension(),"mean")
dense = models.Dense(word_embedding_model.get_word_embedding_dimension(), out_features=64, activation_function=nn.Tanh())

#This works as expected
sentence_transformer_without_dense = SentenceTransformer(modules=[word_embedding_model,polling])
sentence_transformer_without_dense.half()

print(sentence_transformer_without_dense.encode("Hello World"))

#This will throw an error
sentence_transformer_with_dense = SentenceTransformer(modules=[word_embedding_model,polling,dense])
sentence_transformer_with_dense.half()

print(sentence_transformer_with_dense.encode("Hello World"))

Is this the expected behaviour or a bug?

opened by LLukas22 0

How can I use models.Dense() layer with DenoisingAutoEncoderLoss()?

When creating a SentenceTransformer as follows:

word_embedding_model = Transformer(
  model_name_or_path=model_name_or_path, # "bert-base-uncased"
  max_seq_length=max_seq_length, # 384
  cache_dir=cache_dir,
  tokenizer_args=tokenizer_args, # {"truncation": True, "padding": "max_length", "max_length": 384}
  do_lower_case=do_lower_case, # True
  tokenizer_name_or_path=tokenizer_name_or_path # "bert-base-uncased"
 )

word_embedding_dimension = word_embedding_model.get_word_embedding_dimension()
pooling_mode = "cls"
pooling_model = Pooling(
    word_embedding_dimension=word_embedding_dimension,
    pooling_mode=pooling_mode,
)

in_features = pooling_model.get_sentence_embedding_dimension()
out_features = config["parameters"]["num_dense_dimensions"] # 256
dense_model = Dense(
    in_features=in_features,
    out_features=out_features,
    activation_function=nn.Tanh(),
)

modules = [
    word_embedding_model,
    pooling_model,
    dense_model,
]
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cache_folder = os.path.join(cache_location)
model = SentenceTransformer(
    modules=modules,
    device=device,
    cache_folder=cache_folder,
)

And creating the following DenoisingAutoEncoderLoss:

train_loss = DenoisingAutoEncoderLoss(
    model=model,
    tie_encoder_decoder=tie_encoder_decoder, # True
)

With this training setting:

train_objectives = [
    (train_dataloader, train_loss)
]
evaluator = MSEEvaluator(
    source_sentences=source_sentences,
    target_sentences=target_sentences,
    teacher_model=model,
    show_progress_bar=True,
    batch_size=batch_size, # batch_size = 16
    name="job2vec",
    write_csv=True,
)

def free_memory(score, epoch, steps):
    torch.cuda.empty_cache()
    gc.collect()

epochs = config["hyperparameters"]["num_epochs"]
warmup_steps = config["hyperparameters"]["warmup_steps"]
evaluation_steps = batch_size * 32  # batch_size = 16
output_path = os.path.join(cache_location, "job2vec")
save_best_model = True
use_amp = True
callback = free_memory
show_progress_bar = True
checkpoint_path = os.path.join(cache_location, "job2vec/checkpoints")
checkpoint_save_steps = len(train_dataloader)
model.fit(
    train_objectives=train_objectives,
    evaluator=evaluator,
    epochs=epochs,
    warmup_steps=warmup_steps,
    evaluation_steps=evaluation_steps,
    output_path=output_path,
    save_best_model=save_best_model,
    show_progress_bar=show_progress_bar,
    use_amp=use_amp,
    callback=callback,
    checkpoint_path=checkpoint_path,
    checkpoint_save_steps=checkpoint_save_steps,
)

Then the following error occurs:

Traceback (most recent call last):
  File "src/denoising_autoencoder.py", line 216, in <module> 
    main()
  File "src/denoising_autoencoder.py", line 213, in main
    train()
  File "src/denoising_autoencoder.py", line 209, in train
    checkpoint_save_steps=checkpoint_save_steps,
  File "venv/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 710, in fit
    loss_value = loss_model(features, labels)
  File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "venv/lib/python3.7/site-packages/sentence_transformers/losses/DenoisingAutoEncoderLoss.py", line 119, in forward
    use_cache=False
  File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 1250, in forward
    return_dict=return_dict,
  File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 1031, in forward
    return_dict=return_dict,
  File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 617, in forward
    output_attentions,
  File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 529, in forward
    output_attentions,
  File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 433, in forward
    output_attentions,
  File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 298, in forward
    key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
  File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "venv/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (16x256 and 768x768)

How can I use the models.Dense() layer with the DenoisingAutoEncoderLoss?

opened by niquet 0
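
One way to read the shapes in the error message above: a 16x256 input (a batch of 16 embeddings with the 256 dimensions configured for the Dense layer) reaches a linear layer whose weights are 768x768, and the inner dimensions do not match. The sketch below reproduces only that shape rule in plain Python. The function name matmul_shape is hypothetical, and this reading of the traceback is an interpretation rather than a confirmed diagnosis.

```python
def matmul_shape(left, right):
    """Return the shape of a 2-D matrix product, or raise ValueError on a mismatch."""
    (rows, inner_left), (inner_right, cols) = left, right
    if inner_left != inner_right:
        raise ValueError(
            "mat1 and mat2 shapes cannot be multiplied "
            f"({rows}x{inner_left} and {inner_right}x{cols})"
        )
    return (rows, cols)


# A 768-dimensional sentence embedding is compatible with 768x768 weights.
print(matmul_shape((16, 768), (768, 768)))

# A 256-dimensional embedding is not, which mirrors the RuntimeError above.
try:
    matmul_shape((16, 256), (768, 768))
except ValueError as error:
    print(error)
```

Under this reading, the embedding passed on to the decoder would need to keep the transformer's hidden size (768 here) for the product to be defined.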

How to distill model with different tokenizer?

I am trying to train word embedding models to match embeddings from a sentence transformer, and using model_distillation won't cut it, because when running student_model.fit the model uses the student's smart_batching_collate, so the teacher model gets the wrong tokens.

Has anybody worked on something similar? I don't see any workaround other than rewriting the SentenceTransformer.fit method, but maybe there's easier way to do this?

opened by lambdaofgod 0
Community detection algorithm can loop forever
If only one vector is passed, the community detection algorithm will loop forever.

I suggest adding the following checks (I can open a pull request for this):

assert embeddings.shape[0] >= 2, "Embeddings should contain at least two vectors"
assert embeddings.shape[0] >= min_community_size, "Number of vectors is less than specified min_community_size"
opened by maiiabocharova 0
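
The checks proposed in this issue could be wrapped in a small guard that runs before clustering. The sketch below is illustrative only: the function name check_community_detection_input is hypothetical, it works on any sequence of vectors, and it raises ValueError instead of using assert so the checks survive Python's -O flag.

```python
def check_community_detection_input(embeddings, min_community_size):
    """Raise ValueError if the input cannot form a community of the requested size."""
    count = len(embeddings)
    if count < 2:
        raise ValueError("Embeddings should contain at least two vectors")
    if count < min_community_size:
        raise ValueError("Number of vectors is less than specified min_community_size")


# Passes silently: three vectors, communities of at least two members.
check_community_detection_input([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8]], min_community_size=2)

# A single vector is rejected up front instead of looping forever later.
try:
    check_community_detection_input([[0.1, 0.2]], min_community_size=1)
except ValueError as error:
    print(error)
```
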
Override tokenizer args of sentencetransformer

How can we apply a sliding window to the SentenceTransformer tokenizer? I want to be able to override return_overflowing_tokens=True and stride in the default tokenizer to enable the sliding window.

opened by datashinobi 0
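
For readers unfamiliar with the sliding-window idea behind this question, the sketch below splits a list of token ids into overlapping chunks. It is a plain-Python illustration, not the tokenizer's implementation: the names sliding_windows, max_length and overlap are hypothetical, and overlap is assumed to play the role that stride plays in the tokenizer call (the number of tokens shared by consecutive chunks), which is worth verifying against the tokenizer documentation.

```python
def sliding_windows(token_ids, max_length, overlap):
    """Split token_ids into chunks of at most max_length sharing `overlap` tokens."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    step = max_length - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the last window already reaches the end of the sequence
        start += step
    return windows


for window in sliding_windows(list(range(10)), max_length=4, overlap=2):
    print(window)
```

Each chunk could then be embedded separately and the chunk embeddings combined (for example by averaging), which is one common way to handle inputs longer than the model's maximum sequence length.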

Releases(v2.2.2)

v2.2.2(Jun 26, 2022)

huggingface_hub dropped support in version 0.5.0 for Python 3.6

This release fixes the issue so that huggingface_hub with version 0.4.0 and Python 3.6 can still be used.
v2.2.1(Jun 23, 2022)
Version 0.8.1 of huggingface_hub introduces several changes that resulted in errors and warnings. This version of sentence-transformers fixes these issues.

Further, several improvements have been added / merged:

util.community_detection was improved: 1) It works in a batched mode to save memory, 2) Overlapping clusters are no longer dropped but removed by overlapping items, 3) The parameter init_max_size was removed and replaced by a heuristic to estimate the max size of clusters

#1581 the training dataset names can be saved in the model card

#1426 fix the text summarization example

#1487 Recursive sentence-transformers models are now possible

#1522 Private models can now be loaded

#1551 DataLoaders can now have workers

#1565 Models are only checked on the hub if they don't exist in the cache. This fixes problems caused by connectivity issues

#1591 Example added how to stream encode larger datasets

v2.2.0(Feb 10, 2022)
T5

You can now use the encoder from T5 to learn text embeddings. You can use it like any other transformer model:

from sentence_transformers import SentenceTransformer, models
word_embedding_model = models.Transformer('t5-base', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

See T5-Benchmark results - the T5 encoder is not the best model for learning text embedding models. It requires quite a lot of training data and training steps. Other models perform much better, at least in the given experiment with 560k training triplets.

New Models

The models from the papers Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models and Large Dual Encoders Are Generalizable Retrievers have been added:

gtr-t5-base

gtr-t5-large

gtr-t5-xl

gtr-t5-xxl

sentence-t5-base

sentence-t5-large

sentence-t5-xl

sentence-t5-xxl

For benchmark results, see https://seb.sbert.net

Private Models

Thanks to #1406 you can now load private models from the hub:

model = SentenceTransformer("your-username/your-model", use_auth_token=True)
v2.1.0(Oct 1, 2021)
This is a smaller release with some new features

MarginMSELoss

MarginMSELoss is a great method to train embedding models with the help of a cross-encoder model. The details are explained here: MSMARCO - MarginMSE Training

You pass your training data in the format:

InputExample(texts=[query, positive, negative], label=cross_encoder.predict([query, positive])-cross_encoder.predict([query, negative]))
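
To make the role of that label concrete, here is a minimal sketch in plain Python of the quantity such a loss compares: the margin between the positive and negative score of the student (bi-encoder) against the same margin from the teacher (cross-encoder). The function name `margin_mse` and the toy scores are illustrative assumptions; this is not the library's implementation.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins."""
    errors = []
    for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg):
        teacher_margin = tp - tn  # corresponds to the `label` of the InputExample
        student_margin = sp - sn  # margin produced by the model being trained
        errors.append((student_margin - teacher_margin) ** 2)
    return sum(errors) / len(errors)


# Toy scores for two (query, positive, negative) triplets
loss = margin_mse([0.9, 0.4], [0.2, 0.3], [8.0, 2.0], [1.0, 1.5])
print(loss)
```

Because only the margin is matched, the student does not have to reproduce the absolute score range of the teacher.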

MultipleNegativesSymmetricRankingLoss

MultipleNegativesRankingLoss computes the loss just in one way: Find the correct answer for a given question.

MultipleNegativesSymmetricRankingLoss also computes the loss in the other direction: Find the correct question for a given answer.
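
The difference between the two directions can be sketched with a plain-Python cross-entropy over a similarity matrix. This is an illustration under simplifying assumptions (a square matrix where `scores[i][j]` is the similarity of question i and answer j, no extra hard negatives, no scaling factor); the function names are hypothetical and the code is not the library's implementation.

```python
import math


def cross_entropy_rows(scores):
    """Mean cross-entropy where row i should rank column i highest."""
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_sum_exp - row[i]
    return total / len(scores)


def symmetric_ranking_loss(scores):
    """Average of the question->answer and answer->question losses."""
    transposed = [list(col) for col in zip(*scores)]
    forward = cross_entropy_rows(scores)       # find the answer for a question
    backward = cross_entropy_rows(transposed)  # find the question for an answer
    return (forward + backward) / 2


scores = [[5.0, 1.0], [4.0, 3.0]]
print(cross_entropy_rows(scores), symmetric_ranking_loss(scores))
```

In this toy matrix the second answer is also fairly similar to the first question's competitor, which only the backward direction penalizes.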

Breaking Change: CLIPModel

The CLIPModel is now based on the transformers model.

You can still load it like this:

model = SentenceTransformer('clip-ViT-B-32')

Older SentenceTransformers versions are no longer able to load and use the 'clip-ViT-B-32' model.

Added files on the hub are automatically downloaded

PR #1116 checks if you have all files in your local cache or if there are added files on the hub. If this is the case, it will automatically download them.

SentenceTransformers.encode() can return all values

When you set output_value=None for the encode method, all values (token_ids, token_embeddings, sentence_embedding) will be returned.
v2.0.0(Jun 24, 2021)
Models hosted on the hub

All pre-trained models are now hosted on the Huggingface Models hub.

Our pre-trained models can be found here: https://huggingface.co/sentence-transformers

You can also easily share your own sentence-transformer model on the hub and have other people easily access it. Simply upload the folder and have people load it via:

model = SentenceTransformer('[your_username]/[model_name]')

For more information, see: Sentence Transformers in the Hugging Face Hub

Breaking changes

There should be no breaking changes. Old models can still be loaded from disc. However, if you use one of the provided pre-trained models, it will be downloaded again in version 2 of sentence transformers as the cache path has slightly changed.

Find sentence-transformer models on the Hub

You can filter the hub for sentence-transformers models: https://huggingface.co/models?filter=sentence-transformers

Add the sentence-transformers tag to your model card so that others can find your model.

Widget & Inference API

A widget was added to sentence-transformers models on the hub that lets you interact directly on the models website: https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2

Further, models can now be used with the Accelerated Inference API: Send your sentences to the API and get back the embeddings from the respective model.

Save Model to Hub

A new method was added to the SentenceTransformer class: save_to_hub.

Provide the model name and the model is saved on the hub.

Here you find the explanation from transformers how the hub works: Model sharing and uploading

Automatic Model Card

When you save a model with save or save_to_hub, a README.md (also known as model card) is automatically generated with basic information about the respective SentenceTransformer model.

New Models

Several new sentence embedding models have been added, which are much better than the previous models: Sentence Embedding Models

Some new models for semantic search based on MS MARCO have been added: MSMARCO Models

The training script for these MS MARCO models has been released as well: Train MS MARCO Bi-Encoder v3

v1.2.1(Jun 24, 2021)

Final release of version 1: Makes v1 of sentence-transformers forward compatible with models from version 2 of sentence-transformers.
v1.2.0(May 12, 2021)
Unsupervised Sentence Embedding Learning

New methods have been integrated to train sentence embedding models without labeled data. See Unsupervised Learning for an overview of all existing methods.

New methods:

CT: Integration of Semantic Re-Tuning With Contrastive Tension (CT) to tune models without labeled data

CT_In-Batch_Negatives: A modification of CT using in-batch negatives

SimCSE: An unsupervised sentence embedding learning method by Gao et al.

Pre-Training Methods

MLM: An example script to run Masked-Language-Modeling (MLM). Running MLM on your custom data before supervised training can significantly improve performance. Further, MLM also works well for domain transfer: You first train on your custom data, and then train with e.g. NLI or STS data.

Training Examples

Paraphrase Data: In our paper Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation we have shown that training on paraphrase data is powerful. In that folder we provide collections of different paraphrase datasets and scripts to train on it.

NLI with MultipleNegativesRankingLoss: A dedicated example of how to use MultipleNegativesRankingLoss for training with NLI data, which leads to a significant performance boost.

New models

New NLI & STS models: Following the Paraphrase Data training example we published new models trained on NLI and NLI+STS data. Training code is available: training_nli_v2.py.

| Model-Name | STSb-test performance |
| --- | :---: |
| Previous best models | |
| nli-bert-large | 79.19 |
| stsb-roberta-large | 86.39 |
| New v2 models | |
| nli-mpnet-base-v2 | 86.53 |
| stsb-mpnet-base-v2 | 88.57 |

New MS MARCO model for Semantic Search: Hofstätter et al. optimized the training procedure on the MS MARCO dataset. The resulting model is integrated as msmarco-distilbert-base-tas-b and improves the performance on the MS MARCO dataset from 33.13 to 34.43 MRR@10
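
For readers unfamiliar with the metric quoted above, MRR@10 averages, over all queries, the reciprocal rank of the first relevant passage within the top 10 results (0 if none appears). The sketch below is a plain-Python illustration with a hypothetical function name; the figures in the release note appear to be reported on a 0-100 scale, which is an assumption on our side.

```python
def mrr_at_k(ranked_relevance, k=10):
    """Mean reciprocal rank of the first relevant hit within the top k.

    ranked_relevance: one list per query, holding booleans in ranked order.
    """
    total = 0.0
    for hits in ranked_relevance:
        for rank, is_relevant in enumerate(hits[:k], start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant passage counts
    return total / len(ranked_relevance)


# Query 1: relevant passage at rank 1. Query 2: relevant passage at rank 4.
print(mrr_at_k([[True], [False, False, False, True]]))
```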

New Functions

SentenceTransformer.fit() Checkpoints: The fit() method now allows saving checkpoints during training at a fixed number of steps.

Pooling-mode as string: You can now pass the pooling-mode to models.Pooling() as string:
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')

Valid values are mean/max/cls.

NoDuplicatesDataLoader: When using the MultipleNegativesRankingLoss, one should avoid having duplicate sentences in the same batch. This data loader simplifies this task and ensures that no duplicate entries are in the same batch.
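
Duplicates matter for this loss because the other entries of a batch serve as negatives, so a repeated sentence would be treated as a negative for its own match. The batching idea can be sketched in plain Python as follows; `no_duplicate_batches` is a hypothetical helper that defers conflicting examples to a later batch, and it is not how the library's data loader is implemented.

```python
def no_duplicate_batches(examples, batch_size):
    """Yield batches of text tuples in which no sentence occurs twice.

    Examples that would introduce a duplicate are deferred to a later batch.
    """
    pending = list(examples)
    while pending:
        batch, seen, deferred = [], set(), []
        for texts in pending:
            if len(batch) < batch_size and not any(t in seen for t in texts):
                batch.append(texts)
                seen.update(texts)
            else:
                deferred.append(texts)
        yield batch
        pending = deferred


pairs = [("q1", "a1"), ("q1", "a2"), ("q2", "a3")]
for batch in no_duplicate_batches(pairs, batch_size=2):
    print(batch)
```

In this sketch the last batches can be smaller than `batch_size`; a production loader would have to decide whether to drop or keep them.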

v1.1.0(Apr 21, 2021)
Unsupervised Sentence Embedding Learning

This release integrates methods that allow learning sentence embeddings without labeled data:

TSDAE: TSDAE uses a denoising auto-encoder to learn sentence embeddings. The method has been presented in our recent paper and achieves state-of-the-art performance for several tasks.

GenQ: GenQ uses a pre-trained T5 system to generate queries for a given passage. It was presented in our recent BEIR paper and works well for domain adaptation for semantic search (https://www.sbert.net/examples/applications/semantic-search/README.html)

New Models - SentenceTransformer

MSMARCO Dot-Product Models: We trained models using the dot-product instead of cosine similarity as similarity function. As shown in our recent BEIR paper, models with cosine-similarity prefer the retrieval of short documents, while models with dot-product prefer retrieval of longer documents. Now you can choose what is most suitable for your task.

MSMARCO MiniLM Models: We uploaded some models based on MiniLM: They use just 384 dimensions, are faster than previous models and achieve nearly the same performance

New Models - CrossEncoder

MSMARCO Re-ranking-Models v2: We trained new, significantly faster and significantly better CrossEncoder re-ranking models on the MSMARCO dataset. They outperform BERT-large models in terms of accuracy while being 18 times faster. Training code is available

New Features

You can now pass a default_activation_function to the CrossEncoder class, which is applied on top of the output logits generated by the class.

You can now pre-process images for the CLIP Model. Soon I will release a tutorial how to fine-tune the CLIP Model with your data.

v1.0.4(Apr 1, 2021)

It was not possible to fine-tune and save the CLIPModel. This release fixes it. CLIPModel can now be saved like any other model by calling model.save(path)
v1.0.3(Mar 22, 2021)

v1.0.3 - Patch for util.paraphrase_mining method
v1.0.2(Mar 19, 2021)
v1.0.2 - Patch for CLIPModel, new Image Examples

Bugfix in CLIPModel: Too long inputs raised a RuntimeError. Now they are truncated.

New util function: util.paraphrase_mining_embeddings, to find most similar embeddings in a matrix

Image Clustering and Duplicate Image Detection examples added: more info

v1.0.0(Mar 18, 2021)
This release brings many new improvements and new features. Also, the version number scheme is updated. Now we use the format x.y.z with x: for major releases, y: smaller releases with new features, z: bugfixes

Text-Image-Model CLIP

You can now encode text and images in the same vector space using the OpenAI CLIP Model. You can use the model like this:

from sentence_transformers import SentenceTransformer, util
from PIL import Image
# Load CLIP model
model = SentenceTransformer('clip-ViT-B-32')
# Encode an image:
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))
# Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])
# Compute cosine similarities
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)

More Information IPython Demo Colab Demo

Examples how to train the CLIP model on your data will be added soon.

New Models

Add v3 models trained for semantic search on MS MARCO: MS MARCO Models v3

First models trained on Natural Questions dataset for Q&A Retrieval: Natural Questions Models v1

Add DPR Models from Facebook for Q&A Retrieval: DPR-Models

New Features

The Asym Model can now be used as the first model in a SentenceTransformer modules list.

Sorting when encoding changes: Previously, we encoded from short to long sentences. Now we encode from long to short sentences. Out-of-memory errors will then happen at the start rather than late in the run. Also, the estimate of the duration of the encode process is more precise

Improvement of the util.semantic_search method: It now uses the much faster torch.topk function. Further, you can define which scoring function should be used

New util methods: util.dot_score computes the dot product of two embedding matrices. util.normalize_embeddings will normalize embeddings to unit length

New parameter for SentenceTransformer.encode method: normalize_embeddings if set to true, it will normalize embeddings to unit length. In that case the faster util.dot_score can be used instead of util.cos_sim to compute cosine similarity scores.

If you specify do_lower_case=True in models.Transformer() when creating a new SentenceTransformer, then all input will be lower cased.
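
The claim about normalize_embeddings in the list above rests on a simple identity: for unit-length vectors, the dot product equals the cosine similarity. The following plain-Python sketch checks this on toy vectors; the helper names are illustrative and do not mirror the library's util functions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cos_sim(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
# After normalization, the cheaper dot product gives the cosine similarity
assert math.isclose(dot(normalize(a), normalize(b)), cos_sim(a, b))
```

Normalizing once at encoding time moves the division out of every later comparison, which is why the dot product is the faster choice afterwards.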

New Examples

Add example for model quantization on CPUs (smaller models, faster run-time): model_quantization.py

Started adding examples of how to train SBERT models without training data: unsupervised learning. We start with an example for Query Generation to train a semantic search model.

Bugfixes

Encode method now correctly returns token_embeddings if output_value='token_embeddings' is defined

Bugfix of the LabelAccuracyEvaluator

Bugfix: tensors were moved to the CPU even if you specified encode(sent, convert_to_tensor=True). They now stay on the GPU

Breaking changes:

SentenceTransformer.encode method: Removed deprecated parameters is_pretokenized and num_workers

v0.4.1(Jan 4, 2021)
Refactored Tokenization

Faster tokenization speed: Using batched tokenization for training & inference - Now, all sentences in a batch are tokenized simultaneously.

Usage of the SentencesDataset is no longer needed for training. You can pass your train examples directly to the DataLoader:

from torch.utils.data import DataLoader
from sentence_transformers import InputExample
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

If you use a custom torch DataSet class: The dataset class must now return InputExample objects instead of tokenized texts

Class SentenceLabelDataset has been updated to the new tokenization flow: It always returns two or more InputExamples with the same label

Asymmetric Models: Added a new models.Asym class that allows different encoding of sentences based on some tag (e.g. query vs paragraph). Minimal example:

word_embedding_model = models.Transformer(base_model, max_seq_length=250)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
d1 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
d2 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
asym_model = models.Asym({'QRY': [d1], 'DOC': [d2]})
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])
# Your input examples have to look like this:
inp_example = InputExample(texts=[{'QRY': 'your query'}, {'DOC': 'your document text'}], label=1)
# Encoding (Note: Mixed inputs are not allowed)
model.encode([{'QRY': 'your query1'}, {'QRY': 'your query2'}])

Inputs that have the key 'QRY' will be passed through the d1 dense layer, while inputs with the key 'DOC' will be passed through the d2 dense layer. More documentation on how to design asymmetric models will follow soon.

New Namespace & Models for Cross-Encoder: Cross-Encoders are now hosted at https://huggingface.co/cross-encoder. Also, new pre-trained models have been added for: NLI & QNLI.

Logging: Log messages now use a custom logger from logging thanks to PR #623. This allows you to choose which log messages you want to see from which components.
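
Selecting log output per component can be sketched with Python's standard logging module. The snippet assumes that the framework's loggers are named after their modules and therefore share the "sentence_transformers" prefix; that naming is an assumption here, so check the logger names in your installed version.

```python
import logging

# Show INFO messages from your own code and other libraries
logging.basicConfig(level=logging.INFO)

# Assumption: loggers are named after modules under "sentence_transformers"
logging.getLogger("sentence_transformers").setLevel(logging.WARNING)

# Re-enable INFO output for one sub-component only (name is illustrative)
logging.getLogger("sentence_transformers.evaluation").setLevel(logging.INFO)
```

Child loggers inherit the level of their closest configured ancestor, which is why setting the package-level logger is enough to quieten all components at once.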

Unit tests: A lot more unit tests have been added, which test the different components of the framework.
v0.4.0(Dec 22, 2020)
Updated the dependencies so that it works with Huggingface Transformers version 4. Sentence-Transformers still works with huggingface transformers version 3, but an update to version 4 of transformers is recommended. Future changes might break with transformers version 3.

New naming of pre-trained models. Models will be named: {task}-{transformer_model}. So 'bert-base-nli-stsb-mean-tokens' becomes 'stsb-bert-base'. Models will still be available under their old names, but newer models will follow the updated naming scheme.

New application example for information retrieval and question answering retrieval. Together with respective pre-trained models

v0.3.9(Nov 18, 2020)
This release only includes some smaller updates:

Code was tested with transformers 3.5.1, requirement was updated so that it works with transformers 3.5.1

As some parts and models require Pytorch >= 1.6.0, requirement was updated to require at least pytorch 1.6.0. Most of the code and models will work with older pytorch versions.

model.encode() stored the embeddings on the GPU, which required quite a lot of GPU memory when encoding millions of sentences. The embeddings are now moved to CPU once they are computed.

The CrossEncoder-Class now accepts a max_length parameter to control the truncation of inputs

The Cross-Encoder predict method now has an apply_softmax parameter that applies softmax on top of a multi-class output.
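
As a reminder of what applying softmax to a multi-class output does, the sketch below converts one row of class logits into probabilities in plain Python. It illustrates the operation only, with made-up logits, and is not the code path used by the CrossEncoder class.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a three-class output
probs = softmax([2.0, 0.5, -1.0])
print(probs)
```

Softmax preserves the ranking of the classes, so it changes how scores are read (as probabilities that sum to 1), not which class wins.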

v0.3.8(Oct 19, 2020)
Add support for training and using CrossEncoder

Data Augmentation method AugSBERT added

New models trained on large-scale paraphrase data. The models work much better on our internal benchmark than previous models: distilroberta-base-paraphrase-v1 and xlm-r-distilroberta-base-paraphrase-v1

New model for Information Retrieval trained on MS Marco: distilroberta-base-msmarco-v1

Improved MultipleNegativesRankingLoss loss function: Similarity function can be changed and is now cosine similarity (was dot-product before), further, similarity scores can be multiplied by a scaling factor. This allows the usage of NTXentLoss / InfoNCE loss.

New MegaBatchMarginLoss, inspired by the ParaNMT paper.

Smaller changes:

Update InformationRetrievalEvaluator, so that it can work with large corpora (Millions of entries). Removed the query_chunk_size parameter from the evaluator

SentenceTransformer.encode method detaches tensors from compute graph

SentenceTransformer.fit() method - Parameter output_path_ignore_not_empty deprecated. No longer checks that target folder must be empty

v0.3.7(Sep 29, 2020)
Upgrade transformers dependency, transformers 3.1.0, 3.2.0 and 3.3.1 are working

Added example code for model distillation: Sentence Embeddings models can be drastically reduced to e.g. only 2-4 layers while keeping 98+% of their performance. Code can be found in examples/training/distillation

Transformer models can now accept two inputs ['sentence 1', 'context for sent1'], which are encoded as the two inputs for BERT.

Minor changes:

Tokenization in the multi-processes encoding setup now happens in the child processes, not in the parent process.

Added models.Normalize() to allow the normalization of embeddings to unit length

v0.3.6(Sep 11, 2020)

Huggingface Transformers version 3.1.0 had a breaking change with previous version 3.0.2

This release fixes the issue so that Sentence-Transformers is compatible with Huggingface Transformers 3.1.0. Note that this and future versions will not be compatible with transformers < 3.1.0.
v0.3.5(Sep 1, 2020)
The old FP16 training code in model.fit() was replaced by using Pytorch 1.6.0 automatic mixed precision (AMP). When setting model.fit(use_amp=True), AMP will be used. On suitable GPUs, this leads to a significant speed-up while requiring less memory.

Performance improvements in paraphrase mining & semantic search by replacing np.argpartition with torch.topk

If a sentence-transformer model is not found, it will fall back to huggingface transformers repository and create it with mean pooling.

Fixing huggingface transformers to version 3.0.2. Next release will make it compatible with huggingface transformers 3.1.0

Several bugfixes: Downloading of files, multi-GPU encoding

v0.3.4(Aug 24, 2020)
The documentation is substantially improved and can be found at: www.SBERT.net - Feedback welcome

The dataset to hold training InputExamples (dataset.SentencesDataset) now uses lazy tokenization, i.e., examples are tokenized once they are needed for a batch. If you set num_workers to a positive integer in your DataLoader, tokenization will happen in a background thread. This substantially reduces the start-up time for training.

model.encode() also uses a PyTorch DataSet + DataLoader. If you set num_workers to a positive integer, tokenization will happen in the background, leading to faster encoding speed for large corpora.

Added functions and an example for multi-GPU encoding - This method can be used to encode a corpus with multiple GPUs in parallel. No multi-GPU support for training yet.

Removed parallel_tokenization parameters from encode & SentencesDatasets - No longer needed with lazy tokenization and DataLoader worker threads.

Smaller bugfixes

Breaking changes:

Renamed evaluation.BinaryEmbeddingSimilarityEvaluator to evaluation.BinaryClassificationEvaluator

v0.3.3(Aug 6, 2020)
New Functions

Multi-process tokenization (Linux only) for the model encode function. Significant speed-up when encoding large sets

Tokenization of datasets for training can now run in parallel (Linux Only)

New example for Quora Duplicate Questions Retrieval: See examples-folder

Many small improvements for training better models for Information Retrieval

Fixed LabelSampler (can be used to get batches with certain number of matching labels. Used for BatchHardTripletLoss). Moved it to DatasetFolder

Added new Evaluators for ParaphraseMining and InformationRetrieval

evaluation.BinaryEmbeddingSimilarityEvaluator no longer assumes a 50-50 split of the dataset. It computes the optimal threshold and measures accuracy

model.encode - When the convert_to_numpy parameter is set, the method returns a numpy matrix instead of a list of numpy vectors

New function: util.paraphrase_mining to perform paraphrase mining in a corpus. For an example see examples/training_quora_duplicate_questions/

New function: util.information_retrieval to perform information retrieval / semantic search in a corpus. For an example see examples/training_quora_duplicate_questions/
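
The threshold search described above for BinaryEmbeddingSimilarityEvaluator can be illustrated with a brute-force sketch in plain Python: try a threshold between every pair of neighbouring scores and keep the one with the highest accuracy. The function name and the toy data are assumptions for illustration; the evaluator's actual implementation may differ.

```python
def best_accuracy_threshold(scores, labels):
    """Return (threshold, accuracy) that maximizes accuracy.

    A pair is predicted positive when its score is >= threshold.
    """
    ordered = sorted(set(scores))
    # Candidates: midpoints between neighbouring scores, plus both extremes
    candidates = (
        [ordered[0] - 1.0]
        + [(lo + hi) / 2 for lo, hi in zip(ordered, ordered[1:])]
        + [ordered[-1] + 1.0]
    )
    best = (candidates[0], -1.0)
    for threshold in candidates:
        correct = sum((s >= threshold) == bool(l) for s, l in zip(scores, labels))
        accuracy = correct / len(scores)
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best


# Imbalanced toy data: one positive pair, three negative pairs
threshold, accuracy = best_accuracy_threshold([0.9, 0.2, 0.1, 0.05], [1, 0, 0, 0])
print(threshold, accuracy)
```

Because the threshold is chosen from the data, the result does not depend on the dataset having equally many positive and negative pairs.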

Breaking Changes

The evaluators (like EmbeddingSimilarityEvaluator) no longer accept a DataLoader as argument. Instead, the sentences and scores are passed directly. Old code that uses the previous evaluators needs to be changed. It can use the class method from_input_examples(). See examples/training_transformers/training_nli.py for how to use the new evaluators.

v0.3.2(Jul 23, 2020)
This is a minor release. There should be no breaking changes.

ParallelSentencesDataset: Datasets are tokenized on-the-fly, saving some start-up time

util.pytorch_cos_sim - Method. New method to compute cosine similarity with pytorch. About 100 times faster than scipy cdist. semantic_search.py example has been updated accordingly.

SentenceTransformer.encode: New parameter: convert_to_tensor. If set to true, encode returns one large pytorch tensor with your embeddings

v0.3.1(Jul 22, 2020)
This is a minor update that changes some classes for training & evaluating multilingual sentence embedding methods.

The examples for training multi-lingual sentence embeddings models have been significantly extended. See docs/training/multilingual-models.md for details. An automatic script that downloads suitable data and extends sentence embeddings to multiple languages has been added.

The following classes/files have been changed:

datasets/ParallelSentencesDataset.py: The dataset with parallel sentences is encoded on-the-fly, reducing the start-up time for extending a sentence embedding model to new languages. An embedding cache can be configured to store previously computed sentence embeddings during training.

New evaluation files:

evaluation/MSEEvaluator.py - breaking change. Now, this class expects lists of strings with parallel (translated) sentences. The old class has been renamed to MSEEvaluatorFromDataLoader.py

evaluation/EmbeddingSimilarityEvaluatorFromList.py - Semantic Textual Similarity data can be passed as lists of strings & scores

evaluation/MSEEvaluatorFromDataFrame.py - MSE Evaluation of teacher and student embeddings based on data in a data frame

evaluation/MSEEvaluatorFromDataLoader.py - MSE Evaluation if data is passed as a data loader

Bugfixes:

model.encode() failed to sort sentences by length. This function has been fixed to boost encoding speed by reducing overhead of padding tokens.
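
Why sorting by length reduces padding overhead can be seen by counting padding tokens per batch. The plain-Python sketch below assumes each batch is padded to its longest sequence; the helper name and the token counts are made up for illustration.

```python
def padding_tokens(lengths, batch_size):
    """Count padding tokens when each batch is padded to its longest sequence."""
    wasted = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        wasted += sum(max(batch) - n for n in batch)
    return wasted


lengths = [5, 40, 6, 38, 7, 41]  # made-up token counts per sentence
unsorted_cost = padding_tokens(lengths, batch_size=2)
sorted_cost = padding_tokens(sorted(lengths), batch_size=2)
print(unsorted_cost, sorted_cost)
```

Grouping sentences of similar length means fewer positions in each batch are spent on padding, which is the overhead the fix refers to.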

v0.3.0(Jul 9, 2020)

This release updates HuggingFace transformers to v3.0.2. Transformers made some breaking changes to the tokenization API. This (and future) versions will not be compatible with HuggingFace transformers v2.

There are no known breaking changes for existing models or existing code. Models trained with version 2 can be loaded without issues.

New Loss Functions

Thanks to PR #299 and #176, several new loss functions were added: Different triplet loss functions and ContrastiveLoss
v0.2.6(Apr 16, 2020)
This release updates huggingface/transformers to the release v2.8.0.

New Features

models.Transformer: The Transformer-Model can now load any huggingface transformers model, like BERT, RoBERTa, XLNet, XLM-R, ELECTRA... It is based on the AutoModel from HuggingFace. You no longer need the architecture-specific models (like models.BERT, models.RoBERTa). It also works with the community models.

Multilingual Training: Code is released for making mono-lingual sentence embeddings models multi-lingual. See training_multilingual.py for an example. More documentation and details will follow soon.

WKPooling: Adding a pytorch implementation of SBERT-WK. Note, due to an inefficient implementation in pytorch of QR decomposition, WKPooling can only be run on the CPU, which makes it about 40 slower than mean pooling. For some models WKPooling improves the performance, for others it does not.

WeightedLayerPooling: A new pooling layer that uses representations from all transformer layers and learns a weighted sum of them. So far no improvement compared to only averaging the last layer.

New pre-trained models released. Every available model is documented in a Google spreadsheet for an easier overview.

Minor changes

Clean-up of the examples folder.

Model and tokenizer arguments can now be passed to the according transformers models.

Previous versions had an issue with RoBERTa and XLM-RoBERTa where the wrong special characters were added. Everything is fixed now and relies on huggingface transformers for the correct addition of special characters to the input sentences.

Breaking changes

STSDataReader: The default parameter values have been changed, so that it expects the sentences in the first two columns and the score in the third column. If you want to load the STS benchmark dataset, you can use the STSBenchmarkDataReader.

v0.2.5(Jan 10, 2020)
huggingface/transformers was updated to version 2.3.0

Changes:

ALBERT works (bug was fixed in transformers). Does not yield improvements compared to BERT / RoBERTa

T5 added (does not run on GPU due to a bug in transformers). Does not yield improvements compared to BERT / RoBERTa

CamemBERT added

XLM-RoBERTa added

v0.2.4(Dec 6, 2019)
This version updates the underlying HuggingFace Transformer package to v2.2.1.

Changes:

DistilBERT and ALBERT modules added

Pre-trained models for RoBERTa and DistilBERT uploaded

Some smaller bug-fixes

v0.2.3(Aug 20, 2019)
No breaking changes. Just update with pip install -U sentence-transformers

Bugfixes:

SentenceTransformers can now be used with Windows (it previously threw an exception about invalid tensor types)

Outputs a warning if seq. length for BERT / RoBERTa is too long

Improvements:

A flag can be set to hide the progress bar when a dataset is converted or an evaluator is executed

v0.2.2(Aug 19, 2019)
Updated pytorch-transformers to v1.1.0. Adding support for RoBERTa model.

Bugfixes:

Critical bugfix for SoftmaxLoss: Classifier weights were not optimized in previous version

Minor fix for including the timestamp of the output folders

v0.2.1(Aug 16, 2019)

This is a minor fix: Packages were not correctly defined for pypi