A single model that parses Universal Dependencies across 75 languages.

Overview

UDify

MIT License

UDify is a single model that parses Universal Dependencies (UPOS, UFeats, Lemmas, Deps) jointly, accepting any of 75 supported languages as input (trained on UD v2.3 with 124 treebanks). This repository accompanies the paper "75 Languages, 1 Model: Parsing Universal Dependencies Universally," providing tools to train a multilingual model capable of parsing any Universal Dependencies treebank with high accuracy. The project also supports training and evaluation for the SIGMORPHON 2019 Shared Task #2, where it achieved 1st place in morphology tagging (paper can be found here).

Integration with spaCy is supported by Camphr.

UDify Model Architecture

The project is built using AllenNLP and PyTorch.

Getting Started

Install the Python packages in requirements.txt. UDify depends on AllenNLP and PyTorch. On Windows, use WSL. Optionally, install TensorFlow to enable TensorBoard, which gives a rich visualization of model performance on each UD task.

pip install -r ./requirements.txt

Download the UD corpus by running the script

bash ./scripts/download_ud_data.sh

or, alternatively, download the data from universaldependencies.org, extract it into data/ud-treebanks-v2.3/, and then run scripts/concat_ud_data.sh to generate the multilingual UD dataset.

Training the Model

Before training, make sure the dataset is downloaded and extracted into the data directory and the multilingual dataset is generated with scripts/concat_ud_data.sh. To train the multilingual model (fine-tune UD on BERT), run the command

python train.py --config config/ud/multilingual/udify_bert_finetune_multilingual.json --name multilingual

which will begin loading the dataset and model before training the network. The model metrics, vocab, and weights will be saved under logs/multilingual. Note that this process is highly memory intensive: it requires 16+ GB of RAM and 12+ GB of GPU memory (the requirements are roughly halved if fp16 is enabled in AllenNLP, but this requires custom changes to the library). Depending on your GPU, training may take 20 or more days to complete all 80 epochs.

Training on Other Datasets

An example config is provided for fine-tuning on English EWT only. Just run:

python train.py --config config/ud/en/udify_bert_finetune_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/

To run your own dataset, copy config/ud/multilingual/udify_bert_finetune_multilingual.json and modify the following json parameters (a sketch of the modified values is shown after this list):

  • train_data_path, validation_data_path, and test_data_path to the paths of the dataset conllu files. These can optionally be null.
  • directory_path to data/vocab/<dataset_name>/vocabulary.
  • warmup_steps and start_step to be equal to the number of steps in the first epoch. A good initial value is in the range 100-1000. Alternatively, run the training script first to see the number of steps to the right of the progress bar.
  • If using just one treebank, optionally add xpos to the tasks list.
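
For concreteness, here is a rough sketch of the values to override. The keys are listed flat for brevity and should each stay wherever they already appear in the copied config; the English EWT paths, vocab directory name, and step count are only illustrative:

    {
      "train_data_path": "data/ud-treebanks-v2.3/UD_English-EWT/en_ewt-ud-train.conllu",
      "validation_data_path": "data/ud-treebanks-v2.3/UD_English-EWT/en_ewt-ud-dev.conllu",
      "test_data_path": "data/ud-treebanks-v2.3/UD_English-EWT/en_ewt-ud-test.conllu",
      "directory_path": "data/vocab/en_ewt/vocabulary",
      "warmup_steps": 400,
      "start_step": 400
    }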

Viewing Model Performance

One can view how well the models are performing by running TensorBoard

tensorboard --logdir logs

This should show the currently trained model as well as any other previously trained models. The model will be stored in a folder specified by the --name parameter as well as a date stamp, e.g., logs/multilingual/2019.07.03_11.08.51.

Pretrained Models

Pretrained models can be found here. They can be used for predicting conllu annotations or for fine-tuning. The link contains the following:

  • udify-model.tar.gz - The full UDify model archive that can be used for prediction with predict.py. Note that this model has been trained for extra epochs, and may differ slightly from the model shown in the original research paper.
  • udify-bert.tar.gz - The extracted BERT weights from the UDify model, in huggingface transformers (pytorch-pretrained-bert) format.
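
If you only need the BERT encoder weights, the following is a rough, untested sketch of loading them with the pytorch-pretrained-bert library. It assumes udify-bert.tar.gz has been downloaded to the working directory and follows the standard pytorch-pretrained-bert archive layout (bert_config.json plus pytorch_model.bin); if your copy differs, extract the archive and point from_pretrained at the resulting directory instead:

    from pytorch_pretrained_bert import BertModel

    # Load the fine-tuned multilingual BERT encoder extracted from UDify.
    # The archive path and its internal layout are assumptions; adjust as needed.
    bert = BertModel.from_pretrained("udify-bert.tar.gz")
    print(bert.config)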

Predicting Universal Dependencies from a Trained Model

To predict UD annotations, one can supply the path to the trained model and an input conllu-formatted file:

python predict.py <archive> <input.conllu> <output.conllu> [--eval_file results.json]

For instance, predicting the dev set of English EWT with the trained model saved under logs/model.tar.gz and UD treebanks at data/ud-treebanks-v2.3 can be done with

python predict.py logs/model.tar.gz  data/ud-treebanks-v2.3/UD_English-EWT/en_ewt-ud-dev.conllu logs/pred.conllu --eval_file logs/pred.json

and will save the output predictions to logs/pred.conllu and the evaluation to logs/pred.json.
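
The archive can also be loaded programmatically. The snippet below is a rough sketch, not code from the repository: it assumes the AllenNLP version pinned in requirements.txt, the "udify_predictor" name mentioned in the comments below, and that your UDify checkout includes the UdifyPredictor.predict raw-text fix also discussed there:

    from allennlp.common.util import import_submodules
    from allennlp.models.archival import load_archive
    from allennlp.predictors.predictor import Predictor

    # Register UDify's custom dataset readers, models, and predictors with AllenNLP.
    import_submodules("udify")

    # Load the trained archive and construct the UDify predictor.
    archive = load_archive("logs/model.tar.gz")
    predictor = Predictor.from_archive(archive, "udify_predictor")

    # Predict UD annotations for a raw sentence.
    print(predictor.predict("Hello world!"))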

Configuration Options

  1. One can specify the type of device to run on. For a single GPU, use the flag --device 0, or --device -1 for CPU.
  2. To skip waiting for the dataset to be fully loaded into memory, use the flag --lazy. Note that the dataset won't be shuffled.
  3. Resume an existing training run with --resume <archive_dir>.
  4. Specify a config file with --config <config_file>. An example command combining these flags is shown after this list.
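
For example, an illustrative command combining these options with the multilingual config:

python train.py --config config/ud/multilingual/udify_bert_finetune_multilingual.json --name multilingual --device 0 --lazy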

SIGMORPHON 2019 Shared Task

A modification to the basic UDify model is available for parsing morphology in the SIGMORPHON 2019 Shared Task #2. The following paper describes the model in more detail: "Cross-Lingual Lemmatization and Morphology Tagging with Two-Stage Multilingual BERT Fine-Tuning".

Training is similar to UD: run download_sigmorphon_data.sh and then use the configuration file under config/sigmorphon/multilingual, e.g.,

python train.py --config config/sigmorphon/multilingual/udify_bert_sigmorphon_multilingual.json --name sigmorphon

FAQ

  1. When fine-tuning, my scores/metrics show poor performance.

It should take about 10 epochs to start seeing good scores coming from all the metrics, and 80 epochs to be competitive with UDPipe Future.

One caveat: if you fine-tune on a subset of treebanks instead of all 124 UD v2.3 treebanks, you must modify the configuration file and, in particular, tune the learning rate scheduler to the number of training steps. Copy the udify_bert_finetune_multilingual.json config and modify the "warmup_steps" and "start_step" values. A good initial choice is to set both equal to the number of training batches in one epoch (run the training script first to see the batches remaining, to the right of the progress bar). For example, a treebank with roughly 10,000 training sentences and a batch size of 32 yields about 313 batches per epoch, so both values would be set to around 313.

Have a question not listed here? Open a GitHub Issue.

Citing This Research

If you use UDify for your research, please cite this work as:

@inproceedings{kondratyuk-straka-2019-75,
    title = {75 Languages, 1 Model: Parsing Universal Dependencies Universally},
    author = {Kondratyuk, Dan and Straka, Milan},
    booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
    year = {2019},
    address = {Hong Kong, China},
    publisher = {Association for Computational Linguistics},
    url = {https://www.aclweb.org/anthology/D19-1279},
    pages = {2779--2795}
}
Comments
  • predict.py to work with raw text files

    Hello!

    First of all, thank you for the research and shared code, it's immensely helpful.

    I wanted to know if there's an easy way to make predict.py work with raw text files, since this seems like the purpose of the architecture. Is there a reason my input files have to conform to the CoNLL-U format besides calculating evaluation metrics?

    opened by drunkinlove 6
  • Updating conllu library

    Hi Dan, I see in the code and in #5 that updating the conllu library is on the agenda.

    I have made a few modifications on my forked version of UDify. From what I understand, parser.py contains some source code from the conllu library with a few modifications, mainly to handle multi-word tokens, where the desired output (example from fr_gsd-ud-train.conllu) looks like:

    multiword_ids ['3-4', '72-73', '87-88', '105-106', '110-111', '121-122']
    multiword_forms ['du', 'des', 'des', 'des', 'du', 'du']
    

    In my forked version, I am still using the conllu library to return the annotation but do the MWT processing in a subsequent step in a process_MWTs function. In this version, I confirmed that the outputs are the same:

    multiword_ids ['3-4', '72-73', '87-88', '105-106', '110-111', '121-122']
    multiword_forms ['du', 'des', 'des', 'des', 'du', 'du']
    

    I have done a few more checks to make sure the data is the same, where updated is the forked version and original is the current version, e.g.:

    cat fr_gsd_original/vocabulary/tokens.txt | md5sum
    e80f1f1e341fc5734c8f3a3d1c779c55 
    cat fr_gsd_updated/vocabulary/tokens.txt | md5sum
    e80f1f1e341fc5734c8f3a3d1c779c55
    

    There are a few benefits I can see from this:

    1. Supports the most recent conllu library.
    2. Reduces the amount of code needed in parser.py.

    There are probably more elegant ways of going about MWT processing, but I just thought I'd post it here in case you find it helpful. If you do, I can run more tests and, once I've confirmed the behaviour is exactly the same, submit a PR.

    opened by jbrry 5
  • Training udify model for Russian.

    Hello, I am training a udify model only for Russian where I train only on the Russian data from UD2.3. However, I am running into the following issue. The same code runs fine on other languages from UD.

    Traceback (most recent call last):
      File "train.py", line 69, in <module>
        train_model(train_params, serialization_dir, recover=bool(args.resume))
      File "/usr1/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/allennlp/commands/train.py", line 226, in train_model
        cache_prefix)
      File "/usr1/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/allennlp/training/trainer_pieces.py", line 65, in from_params
        model = Model.from_params(vocab=vocab, params=params.pop('model'))
      File "/usr1/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/allennlp/common/from_params.py", line 365, in from_params
        return subclass.from_params(params=params, **extras)
      File "/usr1/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/allennlp/common/from_params.py", line 386, in from_params
        kwargs = create_kwargs(cls, params, **extras)
      File "/usr1/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/allennlp/common/from_params.py", line 133, in create_kwargs
        kwargs[name] = construct_arg(cls, name, annotation, param.default, params, **extras)
      File "/usr1/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/allennlp/common/from_params.py", line 257, in construct_arg
        value_dict[key] = value_cls.from_params(params=value_params, **subextras)
      File "/usr1/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/allennlp/common/from_params.py", line 365, in from_params
        return subclass.from_params(params=params, **extras)
      File "/usr1/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/allennlp/common/from_params.py", line 388, in from_params
        return cls(**kwargs)  # type: ignore
      File "/usr1/home/user/udify/udify/models/tag_decoder.py", line 106, in __init__
        div_value=4.0)
      File "/usr1/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/torch/nn/modules/adaptive.py", line 116, in __init__
        raise ValueError("cutoffs should be a sequence of unique, positive "
    ValueError: cutoffs should be a sequence of unique, positive integers sorted in an increasing order, where each value is between 1 and n_classes-1

    opened by Aditi138 3
  • Can I load the model using HuggingFace AutoModel?

    Hello,

    Is it possible to load the UDify BERT-based model (udify-bert.tar.gz) using the AutoModel class of the HuggingFace library? When downloading the model, the vocab.txt file was missing; is it the same as bert-multilingual-base?

    Thanks in advance

    opened by Hadjerkhd 2
  • Training seems not to begin

    I am trying to reproduce the experiment, but it looks as if the training process stays stuck at the start:

    2019-07-09 17:15:45,951 - INFO - allennlp.training.trainer - Beginning training.
    2019-07-09 17:15:45,951 - INFO - allennlp.training.trainer - Epoch 0/79
    2019-07-09 17:15:45,951 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 19202.28
    2019-07-09 17:15:46,225 - INFO - allennlp.training.trainer - GPU 0 memory usage MB: 1694
    2019-07-09 17:15:46,226 - INFO - allennlp.training.trainer - GPU 1 memory usage MB: 37
    2019-07-09 17:15:46,231 - INFO - allennlp.training.trainer - Training
    0%|          | 0/46617 [00:00<?, ?it/s]

    After a night, the progress bar has not moved at all.

    CPU usage is 100% on one core, memory use is increasing slightly, and the GPUs are not doing any work.

    Could you please indicate which versions of python, allennlp and pytorch you are using?

    Mine are python=3.6, allennlp==0.8.4, pytorch-pretrained-bert==0.6.1, pytorch=1.0.0.

    opened by ghpu 2
  • predict.py to work with .conllu files NOT annotated for dependencies?

    Hi there,

    I was wondering whether there is a way for me to use predict.py with my corpus data (.conllu), which is not annotated for dependencies but is annotated for POS. My goal is not to calculate evaluation metrics at the moment, but rather to have my pretrained model predict dependencies, to get a head start with dependency annotation. I am working on an underdocumented language and would like a first pass of dependency predictions that I would then go back to, verify, and update to create the gold standard for my language.

    Is there a reason my input file has to conform to the CoNLL-U format other than for evaluation metrics? My issue seems to be that my "head" and "deprel" columns are not integers but simply "_", because they're empty. I would prefer to keep the .conllu format of my input file, as it already contains POS information which could give better predictions.

    Thank you for the research, it's super helpful, especially for underdocumented languages.

    Here is my error message (screenshot attached to the original issue).

    opened by lmompela 1
  • "Training on other datasets" directions missing required flag

    This command in the "Training on other datasets" section of the README causes the following error:

    (python3) gneubig@ogma:~/work/udify$ python train.py --config config/ud/en/udify_bert_train_en_ewt.json --name en_ewt
    Traceback (most recent call last):
      File "train.py", line 46, in <module>
        train_path = glob.glob(pathname).pop()
    IndexError: pop from empty list
    

    Adding the --dataset_dir flag resolves the error.

    python train.py --config config/ud/en/udify_bert_train_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/
    
    opened by neubig 1
  • Continuing training on new data

    I have a UDify model trained on one dataset and I want to continue training on a new dataset. I used the --resume option, giving the serialization directory of the model trained on the first dataset. However, that didn't work; even after the first epoch the model seemed to have reset its parameters and started training from scratch. I also used a lower learning rate in the same config file, but it didn't work. Is there anything I am doing wrong?

    opened by Aditi138 1
  • Utilize conllu python library

    This PR addresses #12 and uses the upstream conllu library to retrieve conllu annotations. In a post-processing step, the token ids of multi-word tokens and elided tokens are set to None so that these annotations won't be used for prediction. The multiword token forms and multiword token ids are stored as normal so that behaviour is the same in the predictor.

    opened by jbrry 1
  • Fix UdifyPredictor.predict

    Hi, thanks for this great repo! I modified a few lines of code to make UdifyPredictor.predict work. For example:

    arc = load_archive("logs/udify-model.tar.gz")
    p = util.Predictor.from_archive(arc, "udify_predictor")
    p.predict("Hello world!")
    
    opened by tamuhey 1
  • Scalar mix

    I was not able to use the scalar mix option by changing combine_layers from all to mix. mix_embedding is set to 12. Is there anything else that needs to change in the config file?

    opened by niless 1
  • How to run the UDify+Lang experiments?

    Is there an example config somewhere showing how to fine-tune on a specific treebank using BERT weights saved from fine-tuning on all UD treebanks combined (using the saved pretrained models)? This corresponds to the UDify+Lang experiments in Table 2 of the paper.

    opened by Lguyogiro 0
  • Issue with AllenNLP integration causes predict to not work (ArrayField.empty_field)

    When I try to run a clean checkout of UDify, I get the following error:

    (udify-venv) fran@tlazolteotl /var/lib/home/fran/source/udify $ python predict.py udify-model.tar.gz  data/UD_Kiche-IU/quc_iu-ud-test.conllu logs/pred.conllu --eval_file logs/pred.json
    Traceback (most recent call last):
      File "predict.py", line 14, in <module>
        from allennlp.models.archival import archive_model
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/allennlp/models/__init__.py", line 6, in <module>
        from allennlp.models.model import Model
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/allennlp/models/model.py", line 16, in <module>
        from allennlp.data import Instance, Vocabulary
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/allennlp/data/__init__.py", line 1, in <module>
        from allennlp.data.dataset_readers.dataset_reader import DatasetReader
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/allennlp/data/dataset_readers/__init__.py", line 10, in <module>
        from allennlp.data.dataset_readers.ccgbank import CcgBankDatasetReader
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/allennlp/data/dataset_readers/ccgbank.py", line 9, in <module>
        from allennlp.data.dataset_readers.dataset_reader import DatasetReader
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/allennlp/data/dataset_readers/dataset_reader.py", line 8, in <module>
        from allennlp.data.instance import Instance
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/allennlp/data/instance.py", line 3, in <module>
        from allennlp.data.fields.field import DataArray, Field
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/allennlp/data/fields/__init__.py", line 7, in <module>
        from allennlp.data.fields.array_field import ArrayField
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/allennlp/data/fields/array_field.py", line 10, in <module>
        class ArrayField(Field[numpy.ndarray]):
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/allennlp/data/fields/array_field.py", line 50, in ArrayField
        @overrides
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/overrides/overrides.py", line 88, in overrides
        return _overrides(method, check_signature, check_at_runtime)
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/overrides/overrides.py", line 114, in _overrides
        _validate_method(method, super_class, check_signature)
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/overrides/overrides.py", line 135, in _validate_method
        ensure_signature_is_compatible(super_method, method, is_static)
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/overrides/signature.py", line 93, in ensure_signature_is_compatible
        ensure_return_type_compatibility(super_type_hints, sub_type_hints, method_name)
      File "/mnt/partuuid-46caa556-c2c4-eb47-907a-5d2092050724/var/lib/home/fran/source/udify-venv/lib/python3.7/site-packages/overrides/signature.py", line 288, in ensure_return_type_compatibility
        f"{method_name}: return type `{sub_return}` is not a `{super_return}`."
    TypeError: ArrayField.empty_field: return type `None` is not a `<class 'allennlp.data.fields.field.Field'>`.
    
    opened by ftyers 3
  • Using other transformer models

    Hello,

    I am trying to use the XLMRoberta model instead of BERT, and I made the following changes to bert_pretrained.py:

    from transformers import XLMRobertaTokenizer
    from transformers import XLMRobertaModel, XLMRobertaConfig
    

    However, I get the following error:

        super().__init__(vocab=bert_tokenizer.vocab,
    AttributeError: 'XLMRobertaTokenizer' object has no attribute 'vocab'

    Any guidance would be much appreciated!

    opened by VasilisTz1 2
  • training a udify model only for Korean

    Hello, I am training a udify model only for Korean where I train only on the Korean data from UD2.3. However, I am running into the following issue. The same code runs fine on other languages from UD.

    Traceback (most recent call last):
      File "train.py", line 113, in <module>
        train_model(train_params, serialization_dir, recover=bool(args.resume))
      File "/home/user/anaconda3/envs/dependency_parse/lib/python3.7/site-packages/allennlp/commands/train.py", line 226, in train_model
        cache_prefix)
      File "/home/user/anaconda3/envs/dependency_parse/lib/python3.7/site-packages/allennlp/training/trainer_pieces.py", line 65, in from_params
        model = Model.from_params(vocab=vocab, params=params.pop('model'))
      File "/home/user/anaconda3/envs/dependency_parse/lib/python3.7/site-packages/allennlp/common/from_params.py", line 365, in from_params
        return subclass.from_params(params=params, **extras)
      File "/home/user/anaconda3/envs/dependency_parse/lib/python3.7/site-packages/allennlp/common/from_params.py", line 386, in from_params
        kwargs = create_kwargs(cls, params, **extras)
      File "/home/user/anaconda3/envs/dependency_parse/lib/python3.7/site-packages/allennlp/common/from_params.py", line 133, in create_kwargs
        kwargs[name] = construct_arg(cls, name, annotation, param.default, params, **extras)
      File "/home/user/anaconda3/envs/dependency_parse/lib/python3.7/site-packages/allennlp/common/from_params.py", line 257, in construct_arg
        value_dict[key] = value_cls.from_params(params=value_params, **subextras)
      File "/home/user/anaconda3/envs/dependency_parse/lib/python3.7/site-packages/allennlp/common/from_params.py", line 365, in from_params
        return subclass.from_params(params=params, **extras)
      File "/home/user/anaconda3/envs/dependency_parse/lib/python3.7/site-packages/allennlp/common/from_params.py", line 388, in from_params
        return cls(**kwargs)  # type: ignore
      File "/home/user/udify-master/udify/models/tag_decoder.py", line 105, in __init__
        div_value=4.0)
      File "/home/user/anaconda3/envs/dependency_parse/lib/python3.7/site-packages/torch/nn/modules/adaptive.py", line 133, in __init__
        raise ValueError("cutoffs should be a sequence of unique, positive "
    ValueError: cutoffs should be a sequence of unique, positive integers sorted in an increasing order, where each value is between 1 and n_classes-1

    opened by euhkim 2
  • Prediction of multi-word expression

    Is it possible to predict multi-word expressions (MWEs) from raw text? I ran predict.py with the --raw_text option and found that MWEs are not predicted.

    For example, in Italian, "della" is a contraction of "di la", and UD annotates such tokens as follows:

    31-32	della	_	_	_	_	_	_	_	_
    31	di	di	ADP	E	_	35	case	35:case	_
    32	la	il	DET	RD	Definite=Def|Gender=Fem|Number=Sing|PronType=Art	35	det	35:det	_
    

    However, the output of UDify is something like this:

    31	della	della	ADP	_	_	3	case	_	_
    

    I would like to obtain the CoNLL-U output with proper MWEs. Is there any way to achieve this?

    opened by gifdog97 0