Language-Agnostic SEntence Representations

Overview

LASER is a library to calculate and use multilingual sentence embeddings.

NEWS

  • 2019/11/08 CCMatrix is available: Mining billions of high-quality parallel sentences on the WEB [8]
  • 2019/07/31 Gilles Bodard and Jérémy Rapin provided a Docker environment to use LASER
  • 2019/07/11 WikiMatrix is available: bitext extraction for 1620 language pairs in Wikipedia [7]
  • 2019/03/18 switch to BSD license
  • 2019/02/13 The code to perform bitext mining is now available

CURRENT VERSION:

  • We now provide an encoder which was trained on 93 languages, written in 23 different alphabets [6]. This includes all European languages, many Asian and Indian languages, Arabic, Persian, Hebrew, ..., as well as various minority languages and dialects.
  • We provide a test set for more than 100 languages based on the Tatoeba corpus.
  • Switch to PyTorch 1.0

All these languages are encoded by the same BiLSTM encoder, and there is no need to specify the input language (tokenization, however, is language-specific). In our experience, the sentence encoder also supports code-switching, i.e. the same sentence can contain words in several different languages.

We also have some evidence that the encoder generalizes to languages which have not been seen during training, as long as they belong to a language family that is covered by other training languages.

A detailed description of how the multilingual sentence embeddings are trained can be found in [6], together with an extensive experimental evaluation.

Dependencies

  • Python 3.6
  • PyTorch 1.0
  • NumPy, tested with 1.15.4
  • Cython, needed by Python wrapper of FastBPE, tested with 0.29.6
  • Faiss, for fast similarity search and bitext mining
  • transliterate 1.10.2, only used for Greek (pip install transliterate)
  • jieba 0.39, Chinese segmenter (pip install jieba)
  • mecab 0.996, Japanese segmenter
  • tokenization scripts from the Moses toolkit (installed automatically)
  • FastBPE, fast C++ implementation of byte-pair encoding (installed automatically)

Installation

  • set the environment variable 'LASER' to the root of the installation, e.g. export LASER="${HOME}/projects/laser"
  • download the encoders from Amazon S3: bash ./install_models.sh
  • download third-party software: bash ./install_external_tools.sh
  • download the data used in the example tasks (see description for each task)

Applications

We showcase several applications of multilingual sentence embeddings with code to reproduce our results (in the directory "tasks").

For all tasks, we use exactly the same multilingual encoder, without any task-specific optimization or fine-tuning.
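
For example, here is a minimal sketch of computing embeddings with the SentenceEncoder from embed.py (the import location, paths and example input are illustrative assumptions, and the input lines must already be tokenized and BPE-encoded as in the task scripts):

    # Sketch only: paths and import location are assumptions; the input must
    # already be tokenized and BPE-encoded.
    from embed import SentenceEncoder

    encoder = SentenceEncoder('models/bilstm.93langs.2018-12-26.pt',
                              max_tokens=3000, cpu=False, verbose=True)
    bpe_lines = ['hel@@ lo world .', 'bon@@ jour le monde .']
    embeddings = encoder.encode_sentences(bpe_lines)  # one 1024-dim vector per line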

License

LASER is BSD-licensed, as found in the LICENSE file in the root directory of this source tree.

Supported languages

Our model was trained on the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.

We have also observed that the model seems to generalize well to other (minority) languages or dialects, e.g.

Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, Swiss German or Western Frisian.

References

[1] Holger Schwenk and Matthijs Douze, Learning Joint Multilingual Sentence Representations with Neural Machine Translation, ACL workshop on Representation Learning for NLP, 2017.

[2] Holger Schwenk and Xian Li, A Corpus for Multilingual Document Classification in Eight Languages, LREC, pages 3548-3551, 2018.

[3] Holger Schwenk, Filtering and Mining Parallel Data in a Joint Multilingual Space, ACL, July 2018.

[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, XNLI: Cross-lingual Sentence Understanding through Inference, EMNLP, 2018.

[5] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings, arXiv, Nov 3 2018.

[6] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, arXiv, Dec 26 2018.

[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia, arXiv, July 11 2019.

[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin, CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB, arXiv, 2019.

Comments
  • The script to download cc_matrix is not usable

    In this file (https://github.com/facebookresearch/LASER/blob/master/tasks/CCMatrix/dl_cc_matrix.py), line 9 (from cc_net.process_wet_file import CCSegmentsReader) raises an error.

    It seems there is no CCSegmentsReader to import. (https://github.com/facebookresearch/cc_net/blob/master/cc_net/process_wet_file.py)

    Am I doing something wrong?

    opened by Morizeyao 33
  • Python subprocess.Popen “OSError: [Errno 12] Cannot allocate memory”

    This problem happens when running multiple concurrent calls to the embedding-calculation process in TokenLine:

    # From LASER's text_processing.py: builds a shell pipeline (Moses
    # normalization/tokenization, optional jieba/mecab segmentation) and runs
    # it via subprocess; REM_NON_PRINT_CHAR etc. are command strings defined
    # in the same module.
    from subprocess import check_output

    def TokenLine(line, lang='en', lower_case=True, romanize=False):
        assert lower_case, 'lower case is needed by all the models'
        roman = lang if romanize else 'none'
        tok = check_output(
                REM_NON_PRINT_CHAR
                + '|' + NORM_PUNC + lang
                + '|' + DESCAPE
                + '|' + MOSES_TOKENIZER + lang
                + ('| python3.6 -m jieba -d ' if lang == 'zh' else '')
                + ('|' + MECAB + '/bin/mecab -O wakati -b 50000 ' if lang == 'ja' else '')
                + '|' + ROMAN_LC + roman,
                input=line,
                encoding='UTF-8',
                shell=True)
        return tok.strip()
    

    Here is the detailed stack trace:

      File "/usr/lib/python3.6/subprocess.py", line 1275, in _execute_child
        restore_signals, start_new_session, preexec_fn)
    OSError: [Errno 12] Cannot allocate memory
    INFO:default:called LaserEmbeddingHandler...
    ERROR:default:[Errno 12] Cannot allocate memory
    Traceback (most recent call last):
      File "/tornado_api/handlers/embeddingLaserHandler.py", line 195, in post
        embeddings = vector_embedding.embedding_line(model=model_laser,lang=lang,bpe_codes=QUALITY_MODEL_PATH + "/93langs.fcodes",input_text=text)
      File "/tornado_api/deeplearning/vector_embedding.py", line 50, in embedding_line
        embeddings.append(t.result()[0].tolist() )
      File "/usr/lib/python3.6/concurrent/futures/_base.py", line 432, in result
        return self.__get_result()
      File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
        raise self._exception
      File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
        result = self.fn(*self.args, **self.kwargs)
      File "/tornado_api/deeplearning/similarity_search.py", line 99, in pipeline
        lower_case=lower_case)
      File "/tornado_api/deeplearning/lib/text_processing.py", line 62, in TokenLine
        shell=True)
      File "/usr/lib/python3.6/subprocess.py", line 336, in check_output
        **kwargs).stdout
      File "/usr/lib/python3.6/subprocess.py", line 403, in run
        with Popen(*popenargs, **kwargs) as process:
      File "/usr/lib/python3.6/subprocess.py", line 709, in __init__
        restore_signals, start_new_session)
      File "/usr/lib/python3.6/subprocess.py", line 1275, in _execute_child
        restore_signals, start_new_session, preexec_fn)
    OSError: [Errno 12] Cannot allocate memory
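
    A likely cause: Errno 12 means fork() could not allocate memory for the child process, and every TokenLine call shell-forks the whole parent. A minimal sketch of one mitigation (illustrative, not from the repo): bound the number of simultaneous forks with a small thread pool.

    # Sketch: cap concurrency so only a few tokenizer pipelines fork at once;
    # 'lines' is an illustrative list of input sentences.
    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor(max_workers=2) as executor:
        tokenized = list(executor.map(lambda l: TokenLine(l, lang='en'), lines))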
    
    opened by loretoparisi 14
  • Similarity between Embeddings

    Hi, I am trying to use LASER embeddings to calculate similarity between sentences. I tried the inner product and the results are not bad, but maybe other measures are better. What similarity would you recommend? Thanks
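
    A minimal sketch of cosine similarity over LASER embeddings (assuming the raw float32 output format with 1024-dimensional vectors; the file path is illustrative):

    # Sketch: cosine similarity is the inner product of L2-normalized vectors.
    import numpy as np

    emb = np.fromfile('sentences.emb', dtype=np.float32).reshape(-1, 1024)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize each row
    similarity = float(emb[0] @ emb[1])  # cosine similarity of first two sentences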

    opened by simonefrancia 14
  • Looking for clear instructions on building the mecab package ...

    The "pip install" failed but the external tools installer said it was better to build from source. Unfortunately, I haven't been able to find clear documentation for how to do this.

    Any help would be appreciated.

    opened by ohmeow 11
  • Wget with 429 Errors

    While trying to download the corpora, we are receiving 429 errors:

    wget https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiPedia.wuu-zh.tsv
    --2019-07-15 20:47:56--  https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiPedia.wuu-zh.tsv
    Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.20.22.166, 104.20.6.166, 2606:4700:10::6814:6a6, ...
    Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.20.22.166|:443... connected.
    HTTP request sent, awaiting response... 429 Too Many Requests
    2019-07-15 20:47:56 ERROR 429: Too Many Requests.
    
    opened by alvations 10
  • Mecab not installed in docker -- JAPANESE issue

    Hi, I am trying to get sentence embeddings for Japanese via the Docker image. However, the output is empty, since MeCab is not installed.

    Output in python: {'content': JAPANESE_SENTENCE, 'embedding': []}

    Output in shell:

     - Encoder: loading /app/LASER/models/bilstm.93langs.2018-12-26.pt
     - Tokenizer: content.txt in language ja
    /bin/sh: 1: /app/LASER/tools-external/mecab/bin/mecab: not found
    WARNING: No known abbreviations for language 'ja', attempting fall-back to English version...
     - fast BPE: processing tok
     - Encoder: bpe to out.raw
     - Encoder: 0 sentences in 0s
    

    Any idea on how to deal with that? :)

    opened by MastafaF 9
  • cannot install fastbpe

    Hi,

    I noticed that "FastBPE, fast C++ implementation of byte-pair encoding (installed automatically)",

    but it seems that there is no fastBPE.

    I then tried to install it manually, but it said: g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast clang: error: no such file or directory: 'fastBPE/main.cc' clang: error: no input files

    Does anyone know how to solve this problem? Thanks!

    opened by guangyuli-uoe 7
  • CCMatrix : The downloaded data from the script doesn't match the expected data sizes

    I have downloaded the "v1.0.beta" version of the data for all the language pairs, and I got these numbers:

    [screenshot: downloaded sentence counts per language pair]

    I don't know why the numbers aren't the same. Does this mean that "v1.0" will include the missing data?

    Also when will "v1.0" be available?

    Also, leaving aside the missing languages like English, Japanese and Chinese, the other languages' sentences are not fully aligned. For example, in "ar-fr" we have:

    • 3,195,877 sentences in fr
    • 3,134,825 sentences in ar
    • only 680,546 aligned sentences, which is only 20% of the data size. Why is the full data not aligned? Will this be solved in "v1.0"?

    @gwenzek

    opened by EL-SHREIF 6
  • No English sentence is provided in CC Matrix data and some language pairs are missing.

    Hi

    I've downloaded files listed in https://dl.fbaipublicfiles.com/laser/CCMatrix/v1.0.beta/list.txt and extracted the entries in the language information column (e.g., cs-ko/ko) in there.

    Then, I've found that there is no English sentence (i.e., "bla-en/en" or "en-bla/en").

    Furthermore, some popular language pairs like es-pt (Spanish and Portuguese) are missing.

    Is there something wrong, or are these pairs intentionally not provided?

    opened by TomokiMatsuno 6
  • What is the best way to compare short sentences (1-4 words) in different languages?

    I use LASER for this in combination with cosine similarity, and it works fine for near or exact translations, but I get medium scores (40-50%) between irrelevant translations. I use TokenLine to tokenize each sentence, then BPEfastApplyLine, then SentenceEncoder to calculate the embeddings, and finally cosine similarity.

    opened by vadal 6
  • Embedding tasks stop at pre-processing phase.

    I'm currently using LASER to embed my documents. My command is:

    ./embed.sh ~/source/samsung/vecalign/bleualign_data/overlaps.vi ~/source/samsung/vecalign/bleualign_data/overlaps.vi.emb [vi] 
    

    But I got an error: [error screenshot]

    Has anyone faced this problem? How can I fix it?

    opened by Vietdung113 5
  • add padded indices in max_token count in embed.py

    The original implementation does not accurately count the number of tokens in a batch, since it does not take the padded tokens into account; this is fixed in this PR.
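
    To illustrate the distinction (illustrative numbers, not the actual embed.py code): a padded batch allocates batch_size * max_len positions, so counting only real tokens underestimates what the max_tokens limit must guard against.

    # Illustrative only: the padded batch tensor has len(lengths) * max(lengths)
    # entries, not sum(lengths).
    lengths = [5, 9, 2]                          # true token counts per sentence
    real_tokens = sum(lengths)                   # 16
    padded_tokens = len(lengths) * max(lengths)  # 27: what memory actually scales with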

    CLA Signed 
    opened by JiajunBao 0
  • requirements.txt with compatible third-party versions

    Please add a requirements.txt with compatible third-party versions, or some means to create conda environments. Installing the latest versions breaks the bucc.sh script, and it's tedious to debug and figure out the mismatches between installed libraries.

    I have run these commands:

    pip install torch==1.0
    pip install faiss faiss-gpu
    pip install scipy numpy
    pip install cython==0.29.6
    pip install fairseq==0.12.1
    pip install tabulate
    pip install pandas
    pip install jieba
    pip install transliterate==1.10.2
    pip install tensorboardX
    

    but I still get an AttributeError when running

    from faiss._swigfaiss import delete_FloatVector
    
    opened by senisioi 2
  • Question related to paper cc-matrix + code

    Hi, at the end of section 4.3 it says there is a special procedure for high-resource languages where only forward (fwd) scores are calculated.

    In the code, if we set "fwd" instead of "max", it is supposed to calculate only the forward scores, but the scoring needs both the x2y_mean AND the y2x_mean, and the latter is not being calculated.

    So what should the scoring formula be?

    Many thanks.
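
    For reference, the ratio margin criterion of [5], which the mining scores follow (a reconstruction from the paper, not the exact code), scores a candidate pair (x, y) as:

    score(x, y) = cos(x, y) / ( sum_{z in NNk(x)} cos(x, z) / 2k + sum_{z in NNk(y)} cos(y, z) / 2k )

    where NNk(x) are the k nearest neighbors of x in the other language; with forward-only scoring, only the x-side neighborhood term is computed.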

    opened by vince62s 1
  • mine_bitexts results

    I tried to mine from two WMT news files (en, de): I generated the embeddings with bilstm.93langs.2018-12-26.pt, then ran mine_bitexts.py with default settings ("mine", "ratio"), but the results are very bad even though the margin scores seem good. Top lines below; any clue?

    1.6953904090950678 Affordable and flexible fee payment structure. Inzwischen sieht das schon ganz anders aus.

    1.6196460840402447 Having toured as Beyoncé's bassist and assistant musical director, she also has her own career as a soloist and songwriter, not to mention a new, much younger audience. Er schreibt: "Mit unglaublicher Trauer muss ich die Nachricht über den Tod meiner wunderschönen Tochter Maia teilen.

    1.6135828150470815 As their popularity continues to increase, we will probably end up more focused on what happens after our favourite reality shows than what happens on them. Der Kreativität seien mithin keine Grenzen gesetzt, betont er.

    1.566564394910487 With steroid medication making little difference, Karina realised that she was suffering from topical steroid addiction (TSA) and topical steroid withdrawal (TSW), when the skin reacts adversely after long-term use of topical steroids is stopped, Timo Werner kam hingegen nur auf sechs Tore und wurde mit seiner Chancenverschwendung in England phasenweise zur Witzfigur.

    1.555322152863677 "I would like to thank the management and backroom team for their unwavering support and commitment and the clubs, supporters, Club Iarmhi and Westmeath County Board. SN/www.picturedesk.com Vizekanzler Werner Kogler und Klubchefin Sigrid Maurer.

    1.5525301272930254 The next two corners of finance to feel the invasion of quants will be the corporate bond market - where systematic strategies are now beginning to spread - and private equity, Rattray predicts. Ismaning - Die Dramatik mit den Ismaninger Abschlussversuchen zeichnete sich in der ersten Hälfte noch nicht ab, denn die war nur wenig aufregend. Unter dem Strich hatten die Dachauer etwas mehr vom Spiel, Ismaning lauerte auf Konter und de facto neutralisierten sich alle. Der FCI präsentierte sich im defensiven Verhalten deutlich besser als zuletzt gegen Hallbergmoos und ließ nur Schüsse zu, die Torwart Radic sicher hatte.

    1.5299819314567646 He's not a typical freshman, though. Das teilten Polizei und Staatsanwaltschaft am Sonntag gemeinsam mit.

    1.5084004481504976 Move over hard seltzer, a new beverage is poised to become the drink of summer. Die gesellschaftliche Spaltung Israels, so viel steht fest, kann mit dieser Wahl nicht überwunden werden.

    1.5046071137197825 The store is advising customers to visit its website in the first instance, before making a trip to store, where there is also an opportunity to utilise its click and collect service. If you choose this option, the team will have your items ready for you to collect within one hour. Extreme Änderungen wolle er aber nicht vornehmen: "Jetzt noch mal alles über den Haufen zu schmeißen, wäre auch nicht zielführend". Freilich werde er weiterhin jungen Spielern "eine Plattform geben, sich zu zeigen.

    1.4632964789078255 We are psychologists, data scientists and HR consultants who screen, select, develop, and engage talent worldwide. Wem in der Pandemie etwas langweilig geworden und das Rasenmähen oder andere laute Gartenarbeiten eine willkommene Abwechslung ist, sollte nicht übermütig an die Sache herangehen. Denn bei Geräten wie einem motorbetriebenen Rasenmäher müssen die Ruhezeiten eingehalten sowie die Sonntags- und Feiertagsruhe respektiert werden. Haben die Nachbarn es sich nach dem Mittagessen gerade draußen bequem gemacht, sollte man besser noch etwas warten, bevor man den Rasenmäher aufheulen lässt.

    opened by vince62s 5
  • [Help Wanted] Generate LASER embeddings for a large number of sentences (15.7 million)

    For my university FYP project on text simplification, I need to generate LASER embeddings for a large number of sentences (15.7 million). However, when I try to generate them using the SentenceEncoder in embed.py, the program stays fully utilized for around 12 hours and then exits without any error (I assume because of the high CPU and GPU utilization). I'm using the SentenceEncoder in the following way.

    Initialize the SentenceEncoder with the following params; I'm using the pretrained encoder (models/bilstm.93langs.2018-12-26.pt):

    SentenceEncoder(encoder_path, max_tokens=3000, cpu=False, verbose=True)
    

    And then generate LASER embeddings as follows.

    embeddings = encoder.encode_sentences(read_lines(bpe_filepath))
    

    I tried to run the setup with the above params on a GCP compute engine with 16 cores, 102 GB of memory and 1 Nvidia Tesla T4 GPU. The CPU utilization reaches 100% while the GPU utilization is around 90%. It stays like that for around 12 hours and exits without any error (no error in nohup.out).

    Any idea about what could go wrong? I've been stuck at this point for several weeks and would really appreciate it if someone could help me.
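
    One possible workaround, as a sketch (the chunk size, helper name and float32 output handling are assumptions, not the project's API): encode in fixed-size chunks and append each chunk's vectors to disk, so neither the 15.7M input lines nor the full embedding matrix has to fit in memory at once.

    # Sketch: stream the BPE file through the encoder in chunks, appending
    # each chunk's embeddings to the output file as raw float32.
    import numpy as np

    CHUNK = 10000  # sentences per encode_sentences() call (assumption)

    def embed_in_chunks(encoder, bpe_filepath, out_path):
        buf = []
        with open(bpe_filepath, encoding='utf-8') as fin, open(out_path, 'wb') as fout:
            for line in fin:
                buf.append(line.rstrip('\n'))
                if len(buf) == CHUNK:
                    np.asarray(encoder.encode_sentences(buf), dtype=np.float32).tofile(fout)
                    buf = []
            if buf:  # trailing partial chunk
                np.asarray(encoder.encode_sentences(buf), dtype=np.float32).tofile(fout)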

    cc @hoschwenk

    opened by NomadXD 0