Code and models used in "MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases".

Facebook Research

Last update: Dec 29, 2022

Overview

Multilingual Unsupervised Sentence Simplification

Code and pretrained models to reproduce experiments in "MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases".

Prerequisites

Linux with Python 3.6 or above.

Installing

git clone git@github.com:facebookresearch/muss.git
cd muss/
pip install -e .

How to use

Some scripts might still contain a few bugs. If you notice anything wrong, feel free to open an issue or submit a pull request.

Simplify sentences from a file using pretrained models

# English
python scripts/simplify.py scripts/examples.en --model-name muss_en_wikilarge_mined
# French
python scripts/simplify.py scripts/examples.fr --model-name muss_fr_mined
# Spanish
python scripts/simplify.py scripts/examples.es --model-name muss_es_mined

Pretrained models should be downloaded automatically, but you can also find them here:
muss_en_wikilarge_mined
muss_en_mined
muss_fr_mined
muss_es_mined

Mine the data

python scripts/mine_sequences.py

Train the models

python scripts/train_model.py

Evaluate simplifications

Please head over to EASSE for Sentence Simplification evaluation.

License

The MUSS license is CC-BY-NC. See the LICENSE file for more details.

Authors

Louis Martin ([email protected])

Citation

If you use MUSS in your research, please cite MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases

@article{martin2021muss,
  title={MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases},
  author={Martin, Louis and Fan, Angela and de la Clergerie, {\'E}ric and Bordes, Antoine and Sagot, Beno{\^\i}t},
  journal={arXiv preprint arXiv:2005.00352},
  year={2021}
}

Comments

Could you provide the spm_tokenizer and kenlm model?

@lru_cache(maxsize=10)
def get_spm_tokenizer(model_dir):
    merges_file = model_dir / 'spm_tokenizer-merges.txt'
    vocab_file = model_dir / 'spm_tokenizer-vocab.json'
    return SentencePieceBPETokenizer(vocab_file=str(vocab_file), merges_file=str(merges_file))


@lru_cache(maxsize=10)
def get_kenlm_model(model_dir):
    model_file = model_dir / 'kenlm_model.arpa'
    return kenlm.Model(str(model_file))

I found that dataset generation needs spm_tokenizer and kenlm_model.arpa. Could you provide them?

opened by akafen 11

train model failed
I changed the cluster from "local" to "debug" in scripts/train_model.py and ran the command "python3 scripts/train_models.py", but it failed. The error:

fairseq-train /home/liuyijiao/muss/resources/datasets/_d41b33752d58c3fa688aef596b98df2b/fairseq_preprocessed_complex-simple --task translation --source-lang complex --target-lang simple --save-dir /home/liuyijiao/muss/experiments/fairseq/slurmjob_DEBUG_139908269653632/checkpoints --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 2500 --update-freq 16 --arch mbart_large --dropout 0.3 --weight-decay 0.0 --clip-norm 0.1 --share-all-embeddings --no-epoch-checkpoints --save-interval 999999 --validate-interval 999999 --max-update 50000 --save-interval-updates 100 --keep-interval-updates 1 --patience 10 --max-sentences 64 --seed 708 --distributed-world-size 8 --distributed-port 11733 --fp16 --restore-file '/home/liuyijiao/muss/resources/models/mbart/model.pt' --task 'translation_from_pretrained_bart' --source-lang 'complex' --target-lang 'simple' --encoder-normalize-before --decoder-normalize-before --label-smoothing 0.2 --dataset-impl 'mmap' --optimizer 'adam' --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' --min-lr -1 --total-num-update 40000 --attention-dropout 0.1 --weight-decay 0.0 --max-tokens 1024 --update-freq 2 --log-format 'simple' --log-interval 2 --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler --langs 'ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN' --layernorm-embedding --ddp-backend 'no_c10d' usage: train_models.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format {json,none,simple,tqdm}] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu] [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE] [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] 
[--min-loss-scale MIN_LOSS_SCALE] [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE] [--model-parallel-size MODEL_PARALLEL_SIZE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT] [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile] [--criterion {sentence_ranking,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,sentence_prediction,cross_entropy,ctc,legacy_masked_lm_loss,masked_lm,adaptive_loss,nat_loss,composite_loss,wav2vec,vocab_parallel_cross_entropy}] [--tokenizer {nltk,moses,space}] [--bpe {byte_bpe,subword_nmt,sentencepiece,gpt2,characters,bert,hf_byte_bpe,bytes,fastbpe}] [--optimizer {sgd,adagrad,nag,adadelta,lamb,adafactor,adamax,adam}] [--lr-scheduler {inverse_sqrt,tri_stage,reduce_lr_on_plateau,triangular,polynomial_decay,cosine,fixed}] [--scoring {sacrebleu,bleu,wer,chrf}] [--task TASK] [--num-workers NUM_WORKERS] [--skip-invalid-size-inputs-valid-test] [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE] [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE] [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE] [--dataset-impl {raw,lazy,cached,mmap,fasta}] [--data-buffer-size DATA_BUFFER_SIZE] [--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET] [--validate-interval VALIDATE_INTERVAL] [--validate-interval-updates VALIDATE_INTERVAL_UPDATES] [--validate-after-updates VALIDATE_AFTER_UPDATES] [--fixed-validation-seed FIXED_VALIDATION_SEED] [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID] [--batch-size-valid BATCH_SIZE_VALID] [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS] [--shard-id SHARD_ID] [--distributed-world-size DISTRIBUTED_WORLD_SIZE] [--distributed-rank DISTRIBUTED_RANK] [--distributed-backend DISTRIBUTED_BACKEND] [--distributed-init-method DISTRIBUTED_INIT_METHOD] [--distributed-port DISTRIBUTED_PORT] [--device-id 
DEVICE_ID] [--distributed-no-spawn] [--ddp-backend {c10d,no_c10d}] [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus] [--find-unused-parameters] [--fast-stat-sync] [--broadcast-buffers] [--distributed-wrapper {DDP,SlowMo}] [--slowmo-momentum SLOWMO_MOMENTUM] [--slowmo-algorithm SLOWMO_ALGORITHM] [--localsgd-frequency LOCALSGD_FREQUENCY] [--nprocs-per-node NPROCS_PER_NODE] [--pipeline-model-parallel] [--pipeline-balance PIPELINE_BALANCE] [--pipeline-devices PIPELINE_DEVICES] [--pipeline-chunks PIPELINE_CHUNKS] [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE] [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES] [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE] [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES] [--pipeline-checkpoint {always,never,except_last}] [--zero-sharding {none,os}] [--arch ARCH] [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE] [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM] [--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR] [--min-lr MIN_LR] [--use-bmuf] [--save-dir SAVE_DIR] [--restore-file RESTORE_FILE] [--finetune-from-model FINETUNE_FROM_MODEL] [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters] [--reset-optimizer] [--optimizer-overrides OPTIMIZER_OVERRIDES] [--save-interval SAVE_INTERVAL] [--save-interval-updates SAVE_INTERVAL_UPDATES] [--keep-interval-updates KEEP_INTERVAL_UPDATES] [--keep-last-epochs KEEP_LAST_EPOCHS] [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save] [--no-epoch-checkpoints] [--no-last-checkpoints] [--no-save-optimizer-state] [--best-checkpoint-metric BEST_CHECKPOINT_METRIC] [--maximize-best-checkpoint-metric] [--patience PATIENCE] [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}] [--dropout D] [--attention-dropout D] [--activation-dropout D] [--encoder-embed-path STR] [--encoder-embed-dim N] [--encoder-ffn-embed-dim N] [--encoder-layers N] [--encoder-attention-heads N] [--encoder-normalize-before] [--encoder-learned-pos] [--decoder-embed-path STR] 
[--decoder-embed-dim N] [--decoder-ffn-embed-dim N] [--decoder-layers N] [--decoder-attention-heads N] [--decoder-learned-pos] [--decoder-normalize-before] [--decoder-output-dim N] [--share-decoder-input-output-embed] [--share-all-embeddings] [--no-token-positional-embeddings] [--adaptive-softmax-cutoff EXPR] [--adaptive-softmax-dropout D] [--layernorm-embedding] [--no-scale-embedding] [--no-cross-attention] [--cross-self-attention] [--encoder-layerdrop D] [--decoder-layerdrop D] [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP] [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP] [--quant-noise-pq D] [--quant-noise-pq-block-size D] [--quant-noise-scalar D] [--pooler-dropout D] [--pooler-activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}] [--spectral-norm-classification-head] [--label-smoothing D] [--report-accuracy] [--ignore-prefix-size IGNORE_PREFIX_SIZE] [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS] [--weight-decay WEIGHT_DECAY] [--use-old-adam] [--force-anneal N] [--warmup-updates N] [--end-learning-rate END_LEARNING_RATE] [--power POWER] [--total-num-update TOTAL_NUM_UPDATE] [-s SRC] [-t TARGET] [--load-alignments] [--left-pad-source BOOL] [--left-pad-target BOOL] [--max-source-positions N] [--max-target-positions N] [--upsample-primary UPSAMPLE_PRIMARY] [--truncate-source] [--num-batch-buckets N] [--eval-bleu] [--eval-bleu-detok EVAL_BLEU_DETOK] [--eval-bleu-detok-args JSON] [--eval-tokenized-bleu] [--eval-bleu-remove-bpe [EVAL_BLEU_REMOVE_BPE]] [--eval-bleu-args JSON] [--eval-bleu-print-samples] --langs LANG [--prepend-bos] data train_models.py: error: unrecognized arguments: --max-sentences 64 fairseq_prepare_and_train failed after 0.87s. fairseq_train_and_evaluate_with_parametrization failed after 0.87s.

The code:

for exp_name, kwargs in tqdm(kwargs_dict.items()):
    executor = get_executor(
        cluster='debug',
        slurm_partition='priority',
        submit_decorators=[print_function_name, print_args, print_job_id, print_result, print_running_time],
        timeout_min=2 * 24 * 60,
        slurm_comment='EMNLP Arxiv deadline May 1st',
        gpus_per_node=kwargs['train_kwargs']['ngpus'],
        nodes=1,
        slurm_constraint='volta32gb',
        name=exp_name,
    )
    for i in range(5):
        job = executor.submit(fairseq_train_and_evaluate_with_parametrization, **kwargs)
        jobs_dict[exp_name].append(job)
[job.result() for jobs in jobs_dict.values() for job in jobs]

When the cluster is "local", training fails too.
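
The usage text in the log lists --batch-size but not --max-sentences, which suggests that the installed fairseq version does not accept the older flag name. The following is a minimal sketch of renaming the flag before the arguments reach the parser. It assumes the rename is the only change needed, which is not confirmed in this thread; rename_flag is a hypothetical helper and is not part of muss or fairseq, and the parser is a toy stand-in.

```python
import argparse


def rename_flag(argv, old, new):
    """Return a copy of argv with every occurrence of the flag `old` replaced by `new`."""
    renamed = []
    for token in argv:
        if token == old:
            renamed.append(new)
        elif token.startswith(old + "="):
            # Handle the --flag=value spelling as well.
            renamed.append(new + token[len(old):])
        else:
            renamed.append(token)
    return renamed


# Toy parser standing in for a fairseq version that only knows --batch-size.
parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int)

old_style_args = ["--max-sentences", "64"]
new_style_args = rename_flag(old_style_args, "--max-sentences", "--batch-size")
args = parser.parse_args(new_style_args)
print(args.batch_size)
```

In practice the simpler option may be to change the place where the command is built so that it emits --batch-size directly.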
opened by akafen 11
Generate multiple Output sentences using the simplify.py script.

Hello, I am trying to generate multiple simplifications using simplify.py. I understand that simplify.py uses the _fairseq_generate function, where you can specify num_hypothesis and nbest. I increased num_hypothesis to 12 and nbest to 5, but I am still getting a single simplification, whereas it should output multiple simplifications since nbest = 5.

Can you guide me on how to generate more simplifications? @louismartin
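
One possible explanation, not verified against the muss code, is that the step reading the generation output keeps a single hypothesis per input even when nbest is greater than 1. The sketch below shows how all hypotheses could be grouped per input, assuming the output follows the fairseq-generate convention of tab-separated lines of the form H-<id>, score, text. collect_hypotheses and the sample lines are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def collect_hypotheses(generate_output_lines):
    """Group (score, text) hypotheses by sample id.

    Assumes hypothesis lines look like 'H-<id>\t<score>\t<text>';
    every other line is ignored.
    """
    hypotheses = defaultdict(list)
    for line in generate_output_lines:
        if not line.startswith("H-"):
            continue
        tag, score, text = line.rstrip("\n").split("\t", 2)
        sample_id = int(tag[2:])
        hypotheses[sample_id].append((float(score), text))
    return dict(hypotheses)


# Made-up output lines for two inputs, the first with two hypotheses.
sample_output = [
    "S-0\tA complex sentence .",
    "H-0\t-0.31\tA simple sentence .",
    "H-0\t-0.52\tAn easy sentence .",
    "S-1\tAnother one .",
    "H-1\t-0.20\tOne more .",
]
grouped = collect_hypotheses(sample_output)
print(len(grouped[0]))
```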

opened by Atharva-Phatak 9
mining paraphrases fails with time-out error

Hi,

I'm currently trying to generate paraphrase corpora from cc_net using mine_sequences.py script. As a test run, I was hoping to mine sentence pairs from just a couple of cc_net corpus files (e.g. 0000/en_head.json.gz, 0001/en_head.json.gz). However, the job fails during mining.

I haven't been able to find any answers in the issues so far and would appreciate any guidance on solving this! I've attached the script's output and relevant log/error files.

Disclaimer: I was trying to run this on a single NVIDIA GeForce GTX TITAN X (12GB). Not sure if that would make a difference. Are there any minimum hardware/system requirements?

Thanks in advance!

mine_sequences.out.txt 31066_0_log.out.txt 31066_0_log.err.txt

opened by tannonk 8

get_easse_report_from_exp_dir Failed

Hello, I am running train_model.py but get_easse_report_from_exp_dir fails. This is the error:


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-15-5dc913ca0d14> in <module>()
----> 1 result = fairseq_train_and_evaluate_with_parametrization(**kwargs)

13 frames
/content/drive/MyDrive/muss/muss/fairseq/main.py in fairseq_train_and_evaluate_with_parametrization(dataset, **kwargs)
    228     kwargs['preprocessor_kwargs'] = recommended_preprocessors_kwargs
    229     # Evaluation
--> 230     scores = print_running_time(fairseq_evaluate_and_save)(exp_dir, **kwargs)
    231     score = combine_metrics(scores['bleu'], scores['sari'], scores['fkgl'], kwargs.get('metrics_coefs', [0, 1, 0]))
    232     # TODO: This is a redundant hack with what happens in fairseq_evaluate_and_save (predict_files and evaluate_kwargs), it should be fixed

/content/drive/MyDrive/muss/muss/utils/helpers.py in wrapped_func(*args, **kwargs)
    468         function_name = getattr(func, '__name__', repr(func))
    469         with log_action(function_name):
--> 470             return func(*args, **kwargs)
    471 
    472     return wrapped_func

/content/drive/MyDrive/muss/muss/fairseq/main.py in fairseq_evaluate_and_save(exp_dir, **kwargs)
    104     print(f'scores={scores}')
    105     report_path = exp_dir / 'easse_report.html'
--> 106     shutil.move(get_easse_report_from_exp_dir(exp_dir, **kwargs), report_path)
    107     print(f'report_path={report_path}')
    108     predict_files = kwargs.get(

/content/drive/MyDrive/muss/muss/fairseq/main.py in get_easse_report_from_exp_dir(exp_dir, **kwargs)
     97 def get_easse_report_from_exp_dir(exp_dir, **kwargs):
     98     simplifier = fairseq_get_simplifier(exp_dir, **kwargs)
---> 99     return get_easse_report(simplifier, **kwargs.get('evaluate_kwargs', {'test_set': 'asset_valid'}))
    100 
    101 

/content/drive/MyDrive/muss/muss/evaluation/general.py in get_easse_report(simplifier, test_set, orig_sents_path, refs_sents_paths)
     40         orig_sents_path=orig_sents_path,
     41         refs_sents_paths=refs_sents_paths,
---> 42         report_path=report_path,
     43     )
     44     return report_path

/usr/local/lib/python3.7/dist-packages/easse/cli.py in report(test_set, sys_sents_path, orig_sents_path, refs_sents_paths, report_path, tokenizer, lowercase, metrics)
    302         lowercase=lowercase,
    303         tokenizer=tokenizer,
--> 304         metrics=metrics,
    305     )
    306 

/usr/local/lib/python3.7/dist-packages/easse/report.py in write_html_report(filepath, *args, **kwargs)
    477 def write_html_report(filepath, *args, **kwargs):
    478     with open(filepath, 'w') as f:
--> 479         f.write(get_html_report(*args, **kwargs) + '\n')
    480 
    481 

/usr/local/lib/python3.7/dist-packages/easse/report.py in get_html_report(orig_sents, sys_sents, refs_sents, test_set, lowercase, tokenizer, metrics)
    471             doc.stag('hr')
    472             with doc.tag('div', klass='container-fluid'):
--> 473                 doc.asis(get_qualitative_examples_html(orig_sents, sys_sents, refs_sents))
    474     return indent(doc.getvalue())
    475 

/usr/local/lib/python3.7/dist-packages/easse/report.py in get_qualitative_examples_html(orig_sents, sys_sents, refs_sents)
    154             sample_generator = sorted(
    155                 zip(orig_sents, sys_sents, zip(*refs_sents)),
--> 156                 key=lambda args: sort_key(*args),
    157             )
    158             # Samples displayed by default

/usr/local/lib/python3.7/dist-packages/easse/report.py in <lambda>(args)
    154             sample_generator = sorted(
    155                 zip(orig_sents, sys_sents, zip(*refs_sents)),
--> 156                 key=lambda args: sort_key(*args),
    157             )
    158             # Samples displayed by default

/usr/local/lib/python3.7/dist-packages/easse/report.py in <lambda>(c, s, refs)
     91         (
     92             'Best simplifications according to SARI',
---> 93             lambda c, s, refs: -corpus_sari([c], [s], [refs]),
     94             lambda value: f'SARI={-value:.2f}',
     95         ),

/usr/local/lib/python3.7/dist-packages/easse/sari.py in corpus_sari(*args, **kwargs)
    264 
    265 def corpus_sari(*args, **kwargs):
--> 266     add_score, keep_score, del_score = get_corpus_sari_operation_scores(*args, **kwargs)
    267     return (add_score + keep_score + del_score) / 3

/usr/local/lib/python3.7/dist-packages/easse/sari.py in get_corpus_sari_operation_scores(orig_sents, sys_sents, refs_sents, lowercase, tokenizer, legacy, use_f1_for_deletion, use_paper_version)
    254     refs_sents = [[utils_prep.normalize(sent, lowercase, tokenizer) for sent in ref_sents] for ref_sents in refs_sents]
    255 
--> 256     stats = compute_ngram_stats(orig_sents, sys_sents, refs_sents)
    257 
    258     if not use_paper_version:

/usr/local/lib/python3.7/dist-packages/easse/sari.py in compute_ngram_stats(orig_sents, sys_sents, refs_sents)
    110     assert all(
    111         len(ref_sents) == len(orig_sents) for ref_sents in refs_sents
--> 112     ), "Reference sentences don't have the shape (n_references, n_samples)"
    113     add_sys_correct = [0] * NGRAM_ORDER
    114     add_sys_total = [0] * NGRAM_ORDER

AssertionError: Reference sentences don't have the shape (n_references, n_samples)

I printed out where the error occurs and it showed that

len(refs_sents)=1
len(ref_sents)=10
len(orig_sents)=1

which I suppose should be like this?

len(refs_sents)=10
len(ref_sents)=1
len(orig_sents)=1

I am not sure how to make this change without affecting the outcome of the code. I'd appreciate any advice. Thank you in advance!
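
The assertion in the traceback checks that every inner list of refs_sents has as many entries as orig_sents, that is, shape (n_references, n_samples). The sketch below reproduces that check on toy data and shows one way to turn ten references for a single sample into that shape. check_refs_shape is a hypothetical helper written for this illustration, not an easse function, and whether this matches the right fix inside easse is an assumption.

```python
def check_refs_shape(orig_sents, refs_sents):
    """Return True if refs_sents has shape (n_references, n_samples)."""
    return all(len(ref_sents) == len(orig_sents) for ref_sents in refs_sents)


orig_sents = ["The original sentence."]
# Ten references for the single sample.
refs_for_sample = tuple(f"Reference {i}." for i in range(10))

# Wrapping the tuple in a list gives shape (1, 10):
# one reference set that claims to cover ten samples.
wrong_shape = [refs_for_sample]
# One single-sample list per reference gives shape (10, 1).
right_shape = [[ref] for ref in refs_for_sample]

print(check_refs_shape(orig_sents, wrong_shape))
print(check_refs_shape(orig_sents, right_shape))
```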

opened by pelican9 8

AttributeError: module 'faiss' has no attribute 'METRIC_L2_DIST'
I'm getting the following error when I execute python scripts/mine_sequences.py.

Traceback (most recent call last):
  File "scripts/mine_sequences.py", line 118, in <module>
    train_sentences, get_index_name(), get_embeddings, faiss.METRIC_L2_DIST, base_index_dir
AttributeError: module 'faiss' has no attribute 'METRIC_L2_DIST'

I get this error with both faiss-gpu and faiss-cpu. However, when I use faiss.METRIC_L2 (without DIST), it works fine. Any idea what causes this?
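
Since the attribute name appears to differ between faiss builds, a script can look up whichever name exists. The sketch below uses a stand-in object instead of the real faiss module so that it runs without faiss installed. first_available_attr is a hypothetical helper, and the value 1 is a placeholder chosen for the sketch.

```python
from types import SimpleNamespace


def first_available_attr(module, names):
    """Return the first attribute in `names` that exists on `module`."""
    for name in names:
        if hasattr(module, name):
            return getattr(module, name)
    raise AttributeError(f"none of {names} found on {module!r}")


# Stand-in for a faiss build that only exposes METRIC_L2.
fake_faiss = SimpleNamespace(METRIC_L2=1)
metric = first_available_attr(fake_faiss, ["METRIC_L2_DIST", "METRIC_L2"])
print(metric)
```

With the real library, the same call would take the imported faiss module as its first argument.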
opened by NomadXD 8
Add Portuguese language to muss

In this PR I adapt muss to the Portuguese language. It also adds documentation (PortugueseModel_en.md) describing how the code was adapted to the new language, which steps were taken, and the time and computational cost required.
CLA Signed

opened by AssisRaphael 7
[Help wanted] Fine tune pre-trained muss(mbart-large-cc25) with mined paraphrases (si_LK)
For my final-year research project, I'm using this approach as my baseline and trying to do text simplification for the Sinhala language (the native language used in Sri Lanka). I don't have enough infrastructure to run the cc_net pipeline to crawl language data, but I used several existing sources (15.7 million plain sentences) and mined them to get paraphrases. As the next stage, I'm trying to fine-tune mBART using the mined paraphrases. I went through train_model.py and train_paper_models.py and implemented a similar script for Sinhala by making changes to the related methods.

from muss.fairseq.main import fairseq_train_and_evaluate_with_parametrization
from muss.mining.training import get_mbart_kwargs

sin15M = 'sin15M'
kwargs = get_mbart_kwargs(dataset=sin15M, language='si', use_access=False)
kwargs['train_kwargs']['ngpus'] = 1  # Set this from 8 to 1 for local training
kwargs['train_kwargs']['max_tokens'] = 512  # Lower this number to prevent OOM
result = fairseq_train_and_evaluate_with_parametrization(**kwargs)

I tried to run this on a small sample of 6000 training sentences and 750 test and validation sentences on an NVIDIA Tesla T4 with 16 GB of GPU memory. I get OOM errors when I set kwargs['train_kwargs']['max_tokens'] to 512, 256, 64, 32, 16, or 8. With 4 it says no dataset found. Any idea what could be going wrong here?

PS - I did not change anything internally related to the models. Only changed the helper methods to get datasets.
opened by NomadXD 7
simplify.py: error: unrecognized arguments: Notebooks/muss/resources/models/muss_fr_mined/model.pt

Hello when I run this command

!python '/content/drive/MyDrive/Colab Notebooks/muss/scripts/simplify.py' '/content/drive/MyDrive/Colab Notebooks/muss/scripts/examples.fr' --model-name muss_fr_mined

in Google Colab I get this error

INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/Grammar.txt INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/PatternGrammar.txt Downloading... ... 100% - 6204 MB - 10.12 MB/s - 612s Extracting... usage: simplify.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format {json,none,simple,tqdm}] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu] [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE] [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE] [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE] [--model-parallel-size MODEL_PARALLEL_SIZE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT] [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile] [--criterion {wav2vec,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,adaptive_loss,nat_loss,sentence_ranking,legacy_masked_lm_loss,composite_loss,cross_entropy,ctc,sentence_prediction,masked_lm,vocab_parallel_cross_entropy}] [--tokenizer {nltk,moses,space}] [--bpe {gpt2,subword_nmt,hf_byte_bpe,fastbpe,sentencepiece,characters,bert,bytes,byte_bpe}] [--optimizer {adagrad,adafactor,sgd,lamb,adamax,adadelta,nag,adam}] [--lr-scheduler {fixed,reduce_lr_on_plateau,triangular,cosine,tri_stage,polynomial_decay,inverse_sqrt}] [--scoring {wer,chrf,sacrebleu,bleu}] [--task TASK] [--num-workers NUM_WORKERS] [--skip-invalid-size-inputs-valid-test] [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE] [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE] [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE] [--dataset-impl {raw,lazy,cached,mmap,fasta}] [--data-buffer-size DATA_BUFFER_SIZE] [--train-subset TRAIN_SUBSET] [--valid-subset 
VALID_SUBSET] [--validate-interval VALIDATE_INTERVAL] [--validate-interval-updates VALIDATE_INTERVAL_UPDATES] [--validate-after-updates VALIDATE_AFTER_UPDATES] [--fixed-validation-seed FIXED_VALIDATION_SEED] [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID] [--batch-size-valid BATCH_SIZE_VALID] [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS] [--shard-id SHARD_ID] [--distributed-world-size DISTRIBUTED_WORLD_SIZE] [--distributed-rank DISTRIBUTED_RANK] [--distributed-backend DISTRIBUTED_BACKEND] [--distributed-init-method DISTRIBUTED_INIT_METHOD] [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID] [--distributed-no-spawn] [--ddp-backend {c10d,no_c10d}] [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus] [--find-unused-parameters] [--fast-stat-sync] [--broadcast-buffers] [--distributed-wrapper {DDP,SlowMo}] [--slowmo-momentum SLOWMO_MOMENTUM] [--slowmo-algorithm SLOWMO_ALGORITHM] [--localsgd-frequency LOCALSGD_FREQUENCY] [--nprocs-per-node NPROCS_PER_NODE] [--pipeline-model-parallel] [--pipeline-balance PIPELINE_BALANCE] [--pipeline-devices PIPELINE_DEVICES] [--pipeline-chunks PIPELINE_CHUNKS] [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE] [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES] [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE] [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES] [--pipeline-checkpoint {always,never,except_last}] [--zero-sharding {none,os}] [--path PATH] [--remove-bpe [REMOVE_BPE]] [--quiet] [--model-overrides MODEL_OVERRIDES] [--results-path RESULTS_PATH] [--beam N] [--nbest N] [--max-len-a N] [--max-len-b N] [--min-len N] [--match-source-len] [--no-early-stop] [--unnormalized] [--no-beamable-mm] [--lenpen LENPEN] [--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]] [--sacrebleu] [--score-reference] [--prefix-size PS] [--no-repeat-ngram-size N] [--sampling] [--sampling-topk PS] [--sampling-topp PS] [--constraints [{ordered,unordered}]] [--temperature N] [--diverse-beam-groups 
N] [--diverse-beam-strength N] [--diversity-rate N] [--print-alignment] [--print-step] [--lm-path PATH] [--lm-weight N] [--iter-decode-eos-penalty N] [--iter-decode-max-iter N] [--iter-decode-force-max-iter] [--iter-decode-with-beam N] [--iter-decode-with-external-reranker] [--retain-iter-history] [--retain-dropout] [--retain-dropout-modules RETAIN_DROPOUT_MODULES [RETAIN_DROPOUT_MODULES ...]] [--decoding-format {unigram,ensemble,vote,dp,bs}] [--force-anneal N] [--lr-shrink LS] [--warmup-updates N] [-s SRC] [-t TARGET] [--load-alignments] [--left-pad-source BOOL] [--left-pad-target BOOL] [--max-source-positions N] [--max-target-positions N] [--upsample-primary UPSAMPLE_PRIMARY] [--truncate-source] [--num-batch-buckets N] [--eval-bleu] [--eval-bleu-detok EVAL_BLEU_DETOK] [--eval-bleu-detok-args JSON] [--eval-tokenized-bleu] [--eval-bleu-remove-bpe [EVAL_BLEU_REMOVE_BPE]] [--eval-bleu-args JSON] [--eval-bleu-print-samples] --langs LANG [--prepend-bos] data simplify.py: error: unrecognized arguments: Notebooks/muss/resources/models/muss_fr_mined/model.pt
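
The rejected argument starts at "Notebooks/...", right after the space in "Colab Notebooks", which suggests that the model path was split at that space when the command line was assembled. This is an inference from the error text, not a confirmed diagnosis. The sketch below shows the effect with Python's shlex module; the full model path is reconstructed from the command and the error message and is an assumption.

```python
import shlex

# Assumed full path, pieced together from the command and the error message.
model_path = "/content/drive/MyDrive/Colab Notebooks/muss/resources/models/muss_fr_mined/model.pt"

# Built by plain string formatting, the path is split at the space.
unquoted = shlex.split(f"--path {model_path}")
# Quoting the path keeps it as a single argument.
quoted = shlex.split(f"--path {shlex.quote(model_path)}")

print(unquoted)
print(quoted)
```

Cloning the repository into a directory without spaces in its path would avoid the problem without changing any code.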

opened by oskrmiguel 6

UnicodeEncodeError: 'ascii' codec can't encode character '\u2010' in position 48: ordinal not in range(128)

Hi @louismartin,

Thanks for publishing this work, really nice!

Regarding my inquiry, this is similar to the issue I've posted on easse. On this model, I'm also getting encoding errors when simplifying a list of sentences. I've added the same encoding='utf-8' and it stopped reporting this error.

Here are my diff files:

diff --git a/muss/preprocessors.py b/muss/preprocessors.py
index ee3dd86..6d438d5 100644
--- a/muss/preprocessors.py
+++ b/muss/preprocessors.py
@@ -131 +131 @@ class AbstractPreprocessor(ABC):
-        with open(output_filepath, 'w') as f:
+        with open(output_filepath, 'w', encoding='utf-8') as f:
@@ -139 +139 @@ class AbstractPreprocessor(ABC):
-        with open(output_filepath, 'w') as f:
+        with open(output_filepath, 'w', encoding='utf-8') as f:

diff --git a/muss/utils/helpers.py b/muss/utils/helpers.py
index 25210d8..78a5f41 100644
--- a/muss/utils/helpers.py
+++ b/muss/utils/helpers.py
@@ -91 +91 @@ def open_files(filepaths, mode='r'):
-        files = [Path(filepath).open(mode) for filepath in filepaths]
+        files = [Path(filepath).open(mode, encoding='utf-8') for filepath in filepaths]
@@ -137 +137 @@ def write_lines(lines, filepath=None):
-    with filepath.open('w') as f:
+    with filepath.open('w', encoding='utf-8') as f:
@@ -148 +148 @@ def yield_lines(filepath, gzipped=False, n_lines=None):
-    with open_function(filepath, 'rt') as f:
+    with open_function(filepath, 'rt', encoding='utf-8') as f:
@@ -325 +325 @@ def log_std_streams(filepath):
-    log_file = open(filepath, 'w')
+    log_file = open(filepath, 'w', encoding='utf-8')

diff --git a/scripts/simplify.py b/scripts/simplify.py
index 464129c..95a6231 100644
--- a/scripts/simplify.py
+++ b/scripts/simplify.py
@@ -26,0 +27,2 @@ if __name__ == '__main__':
+        s = s.encode('utf-8')
+        c = c.encode('utf-8')

I'd appreciate it if you could add these fixes :)

Thanks,

Laura
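
The effect of passing encoding='utf-8' can be reproduced in isolation. The sketch below encodes a string containing U+2010 with the ASCII codec, which fails as in the reported error, and then writes and reads the same string with an explicit UTF-8 encoding. The sample text and file name are made up for the illustration.

```python
import os
import tempfile

text = "well\u2010known"  # contains U+2010 HYPHEN

# Encoding with the ASCII codec fails, as in the reported error.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Writing and reading with an explicit UTF-8 encoding round-trips the text.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "out.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

print(ascii_ok, round_trip == text)
```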

opened by lmvasque 6

Training error: hydra error. Parameter lr_scheduler.total_num_update=null

Hi, I'm trying to train muss, but I got a Hydra error:

fairseq_prepare_and_train... exp_dir=/scratch1/fer201/muss/muss-git/experiments/fairseq/local_1634083552326 fairseq-train /scratch1/fer201/muss/muss-git/resources/datasets/_9585ac127caca9d7160a28f1d8180050/fairseq_preprocessed_complex-simple --task translation --source-lang complex --target-lang simple --save-dir /scratch1/fer201/muss/muss-git/experiments/fairseq/local_1634083552326/checkpoints --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 500 --update-freq 128 --arch bart_large --dropout 0.1 --weight-decay 0.0 --clip-norm 0.1 --share-all-embeddings --no-epoch-checkpoints --save-interval 999999 --validate-interval 999999 --max-update 20000 --save-interval-updates 100 --keep-interval-updates 1 --patience 10 --batch-size 64 --seed 917 --distributed-world-size 1 --distributed-port 15798 --fp16 --restore-file /scratch1/fer201/muss/muss-git/resources/models/bart.large/model.pt --max-tokens 512 --truncate-source --layernorm-embedding --share-all-embeddings --share-decoder-input-output-embed --reset-optimizer --reset-dataloader --reset-meters --required-batch-size-multiple 1 --label-smoothing 0.1 --attention-dropout 0.1 --weight-decay 0.01 --optimizer 'adam' --adam-betas '(0.9, 0.999)' --adam-eps 1e-08 --clip-norm 0.1 --skip-invalid-size-inputs-valid-test --find-unused-parameters fairseq_prepare_and_train failed after 4.45s. 
Traceback (most recent call last): File "/scratch1/fer201/muss/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py", line 513, in _apply_overrides_to_config OmegaConf.update(cfg, key, value, merge=True) File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/omegaconf.py", line 613, in update root.setattr(last_key, value) File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 285, in setattr raise e File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 282, in setattr self.__set_impl(key, value) File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 266, in __set_impl self._set_item_impl(key, value) File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/basecontainer.py", line 398, in _set_item_impl self._validate_set(key, value) File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 143, in _validate_set self._validate_set_merge_impl(key, value, is_assign=True) File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 156, in _validate_set_merge_impl self._format_and_raise( File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/base.py", line 95, in _format_and_raise format_and_raise( File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/_utils.py", line 694, in format_and_raise _raise(ex, cause) File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/_utils.py", line 610, in _raise raise ex # set end OC_CAUSE=1 for full backtrace omegaconf.errors.ValidationError: child 'lr_scheduler.total_num_update' is not Optional full_key: lr_scheduler.total_num_update reference_type=Optional[PolynomialDecayLRScheduleConfig] object_type=PolynomialDecayLRScheduleConfig

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch1/fer201/muss/muss-git/scripts/train_model.py", line 21, in <module>
    result = fairseq_train_and_evaluate_with_parametrization(**kwargs)
  File "/scratch1/fer201/muss/muss-git/muss/fairseq/main.py", line 224, in fairseq_train_and_evaluate_with_parametrization
    exp_dir = print_running_time(fairseq_prepare_and_train)(dataset, **kwargs)
  File "/scratch1/fer201/muss/muss-git/muss/utils/helpers.py", line 470, in wrapped_func
    return func(*args, **kwargs)
  File "/scratch1/fer201/muss/muss-git/muss/fairseq/main.py", line 74, in fairseq_prepare_and_train
    fairseq_train(preprocessed_dir, exp_dir=exp_dir, **train_kwargs)
  File "/scratch1/fer201/muss/muss-git/muss/utils/training.py", line 60, in wrapped_func
    return func(*args, **kwargs)
  File "/scratch1/fer201/muss/muss-git/muss/fairseq/base.py", line 127, in fairseq_train
    train.cli_main()
  File "/scratch1/fer201/muss/fairseq-git/fairseq_cli/train.py", line 496, in cli_main
    cfg = convert_namespace_to_omegaconf(args)
  File "/scratch1/fer201/muss/fairseq-git/fairseq/dataclass/utils.py", line 389, in convert_namespace_to_omegaconf
    composed_cfg = compose("config", overrides=overrides, strict=False)
  File "/scratch1/fer201/muss/lib/python3.9/site-packages/hydra/experimental/compose.py", line 31, in compose
    cfg = gh.hydra.compose_config(
  File "/scratch1/fer201/muss/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 507, in compose_config
    cfg = self.config_loader.load_configuration(
  File "/scratch1/fer201/muss/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py", line 151, in load_configuration
    return self._load_configuration(
  File "/scratch1/fer201/muss/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py", line 277, in _load_configuration
    ConfigLoaderImpl._apply_overrides_to_config(config_overrides, cfg)
  File "/scratch1/fer201/muss/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py", line 520, in _apply_overrides_to_config
    raise ConfigCompositionException(
hydra.errors.ConfigCompositionException: Error merging override lr_scheduler.total_num_update=null

opened by odigab 5
CVE-2007-4559 Patch

Patching CVE-2007-4559

Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.
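
The kind of check described in this issue can be sketched roughly as follows. This is not the actual Trellix patch; it is a minimal illustration, and the helper names `is_within_directory` and `safe_extractall` are assumptions made for this example. It only rejects members whose names resolve outside the destination directory and does not inspect symlink or hard-link members. Recent Python versions also provide extraction filters (for example `extractall(path, filter="data")`), which may be preferable where available.

```python
import os
import tarfile


def is_within_directory(directory, target):
    """Return True if `target` resolves to a path inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    # commonpath equals the directory only when target lies inside it.
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extractall(tar, path="."):
    """Extract all members of an open TarFile, refusing path traversal.

    Hypothetical helper: every member name is checked before anything
    is written, and ValueError is raised on the first unsafe member.
    """
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"Blocked path traversal in tar member: {member.name}")
    tar.extractall(path)
```

With this sketch, a call such as `tar.extractall(dest)` would be replaced by `safe_extractall(tar, dest)`, so an archive containing a member like `../evil.txt` is rejected before any file is written.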

opened by TrellixVulnTeam 1

Need help getting better performance

Hello,

This is the first time I have used this project. I tried an example from the README, and I have a question about execution speed. Execution takes about 40 to 60 seconds. I put timers in the muss code to find which part of the code accounts for this time. It seems to be the call to generate.cli_main() at line 188 of the muss/fairseq/base.py file.

Could you tell me whether this duration is normal, or whether I should expect better performance? Is there anything I can do to speed this up?
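
For readers who want to repeat this kind of measurement, the timers mentioned above could look something like the sketch below. It is an illustration only, not the code used in this issue: the `timed` helper and the `timings` dictionary are hypothetical names, and `time.sleep` stands in for the slow call (for example `generate.cli_main()`).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Measure wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the value for later comparison
        print(f"{label}: {elapsed:.3f}s")


# Hypothetical usage: wrap the suspected slow section.
timings = {}
with timed("generate", timings):
    time.sleep(0.05)  # stand-in for the real call being measured
```

Wrapping model loading and generation in separate `timed` blocks would show whether the time goes to one-off startup work or to the generation itself.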

The example I have tried is: time python scripts/simplify.py scripts/examples.fr --model-name muss_fr_mined

To test muss, I created a Docker image that is deployed on a GPU node (T1-45 from OVH) of a Kubernetes cluster. The code is available here: muss-docker-debug and the image is pushed here: https://hub.docker.com/r/cleyrop/muss-debug. The GPU node characteristics are:

45 GB RAM
8 vCores (2.1 GHz)
400 GB SSD
2,000 Mbit/s
Tesla V100

Here are the traces from the execution:

~/muss$ time python scripts/simplify.py scripts/examples.fr --model-name muss_fr_mined
  0%|          | 0/1 [00:00<?, ?it/s]
/home/muss/.local/lib/python3.7/site-packages/fairseq/search.py:140: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  beams_buf = indices_buf // vocab_size
/home/muss/.local/lib/python3.7/site-packages/fairseq/sequence_generator.py:651: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  unfin_idx = idx // beam_size
--------------------------------------------------------------------------------
Original:   Cette phrase est extrêmement compliquée à comprendre.
Simplified: Cette phrase est très difficile à comprendre.
--------------------------------------------------------------------------------
Original:   La souris est mangée par le chat.
Simplified: La souris est mangée par le chien.
--------------------------------------------------------------------------------
Original:   Facile à lire et à comprendre (FALC) désigne un ensemble de règles ayant pour finalité de rendre l'information facile à lire et à comprendre.
Simplified: Facile à lire et à comprendre (FALC) est un ensemble de règles visant à rendre l'information facile à comprendre et à lire.
--------------------------------------------------------------------------------
Original:   L'altruisme efficace vise à adopter une démarche analytique afin d’identifier les meilleurs moyens d’avoir un impact positif sur le monde.
Simplified: L'altruisme efficace est une démarche analytique visant à identifier les meilleurs moyens d'avoir un impact positif sur le monde.

real   0m48.031s
user   0m22.117s
sys    0m25.155s
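
The UserWarning in the trace above distinguishes rounding toward zero ("trunc") from rounding toward negative infinity ("floor"). A small plain-Python sketch of that distinction follows; it does not use PyTorch, and the function names `trunc_div` and `floor_div` are illustrative assumptions. The two modes differ only when the quotient is negative, so for non-negative operands they give the same result.

```python
import math


def trunc_div(a, b):
    """Divide and round toward zero (the 'trunc' behaviour in the warning)."""
    return math.trunc(a / b)


def floor_div(a, b):
    """Divide and round toward negative infinity (the 'floor' behaviour)."""
    return a // b


# Compare both modes for a positive and a negative numerator.
for numerator in (7, -7):
    print(numerator, trunc_div(numerator, 2), floor_div(numerator, 2))
```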

opened by plugandplay 18

