Code for our paper "Mask-Align: Self-Supervised Neural Word Alignment" in ACL 2021

THUNLP-MT

Last update: Dec 15, 2022

Mask-Align: Self-Supervised Neural Word Alignment

This is the implementation of our work Mask-Align: Self-Supervised Neural Word Alignment.

@inproceedings{chen2021maskalign,
   title={Mask-Align: Self-Supervised Neural Word Alignment},
   author={Chi Chen and Maosong Sun and Yang Liu},
   booktitle={Association for Computational Linguistics (ACL)},
   year={2021}
}

The implementation is built on top of THUMT.

Introduction
Prerequisites
Usage
Configs
Visualization
Contact

Introduction

Mask-Align is a self-supervised neural word aligner. It masks out each target token in parallel and predicts it conditioned on both the source and the remaining target tokens. The source token that contributes most to recovering a masked target token is aligned to that target token.
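
The alignment rule described above can be illustrated with a small sketch. It assumes a matrix of contribution weights is already available, with one row per target token and one column per source token; the weights below are made up for illustration, and the function name induce_alignment is hypothetical rather than part of thualign. The actual model may post-process its weights differently.

```python
from typing import List, Tuple


def induce_alignment(weights: List[List[float]]) -> List[Tuple[int, int]]:
    """Link each target token to the source token with the largest weight.

    weights[j][i] is the (assumed) contribution of source token i
    to recovering the masked target token j.
    """
    links = []
    for j, row in enumerate(weights):
        # Index of the source token that contributes most to target token j.
        i = max(range(len(row)), key=row.__getitem__)
        links.append((i, j))
    return links


src = ["wiederaufnahme", "der", "sitzungsperiode"]
tgt = ["resumption", "of", "the", "session"]

# Made-up weights for illustration only (rows: target, columns: source).
weights = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
    [0.1, 0.1, 0.8],
]

for i, j in induce_alignment(weights):
    print(f"{src[i]} -> {tgt[j]}")
```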

Prerequisites

PyTorch
NLTK
remi *
pyecharts *
pandas *
matplotlib *
seaborn *

*: optional, only used for Visualization.

Usage

Data Preparation

To get the data used in our paper, you can follow the instructions in https://github.com/lilt/alignment-scripts.

To train an aligner with your own data, you should pre-process it yourself. Usually this includes tokenization, BPE, etc. You can find a simple guide here.

Now we have the pre-processed parallel training data (train.src, train.tgt), validation data (optional) (valid.src, valid.tgt) and test data (test.src, test.tgt). An example 3-sentence German–English parallel training corpus is:

# train.src
wiederaufnahme der sitzungsperiode
frau präsidentin , zur geschäfts @@ordnung .
ich bitte sie , sich zu einer schweigeminute zu erheben .

# train.tgt
resumption of the session
madam president , on a point of order .
please rise , then , for this minute ' s silence .

The next step is to shuffle the training set, which proves to be helpful for improving the results.

python thualign/scripts/shuffle_corpus.py --corpus train.src train.tgt

The resulting files train.src.shuf and train.tgt.shuf rearrange the sentence pairs randomly.
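
For readers who want to see what shuffling a parallel corpus involves, here is a minimal sketch. It is not the code of shuffle_corpus.py; the function name shuffle_parallel and the seed parameter are assumptions made for this example. The key point is that both sides are reordered with the same permutation, so sentence pairs stay together.

```python
import random
from typing import List, Optional, Sequence, Tuple


def shuffle_parallel(
    src_lines: Sequence[str],
    tgt_lines: Sequence[str],
    seed: Optional[int] = None,
) -> Tuple[List[str], List[str]]:
    """Shuffle two aligned lists of sentences with one shared permutation."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    order = list(range(len(src_lines)))
    random.Random(seed).shuffle(order)
    return [src_lines[k] for k in order], [tgt_lines[k] for k in order]


src = [
    "wiederaufnahme der sitzungsperiode",
    "frau präsidentin , zur geschäfts @@ordnung .",
    "ich bitte sie , sich zu einer schweigeminute zu erheben .",
]
tgt = [
    "resumption of the session",
    "madam president , on a point of order .",
    "please rise , then , for this minute ' s silence .",
]

src_shuf, tgt_shuf = shuffle_parallel(src, tgt, seed=0)
for s, t in zip(src_shuf, tgt_shuf):
    print(s, "|||", t)
```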

Then we need to generate vocabulary from the training set.

python thualign/scripts/build_vocab.py train.src.shuf vocab.train.src
python thualign/scripts/build_vocab.py train.tgt.shuf vocab.train.tgt

The resulting files vocab.train.src.txt and vocab.train.tgt.txt are final source and target vocabularies used for model training.
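
As a rough illustration of what generating a vocabulary from tokenized text means, the sketch below counts whitespace-separated tokens and lists them by frequency. It is only an assumption-laden example: the real build_vocab.py may add special tokens, apply frequency cutoffs or use a different file format, and the function name build_vocab here is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def build_vocab(lines: Iterable[str]) -> List[str]:
    """Return the distinct tokens, most frequent first."""
    counts = Counter(token for line in lines for token in line.split())
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(counts, key=lambda tok: (-counts[tok], tok))


corpus = [
    "resumption of the session",
    "madam president , on a point of order .",
    "please rise , then , for this minute ' s silence .",
]

vocab = build_vocab(corpus)
print(vocab[:3])
```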

Training

All experiments are configured via config files in thualign/configs; see Configs for more details. We provide an example config file, thualign/configs/user/example.config. You can easily use it by making three changes:

change device_list, update_cycle and batch_size to match your machine configuration;
change exp_dir and output to your own experiment directory;
change train/valid/test_input and vocab to your data paths.

When properly configured, you can use the following command to train the alignment model described in the config file:

bash thualign/bin/train.sh -s thualign/configs/user/example.config

or more simply

bash thualign/bin/train.sh -s example

The configuration file is an INI file and is parsed with configparser. By adding a new section, you can easily customize some configs while keeping the others unchanged.

[DEFAULT]
...

[small_budget]
batch_size = 4500
update_cycle = 8
device_list = [0]
half = False

Use the -e option to run with this small_budget section:

bash thualign/bin/train.sh -s example -e small_budget
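
The way a named section overrides [DEFAULT] while inheriting everything else is standard configparser behavior, and can be checked with a short standalone sketch. The values in [DEFAULT] below are copied from the example config for illustration; how thualign itself loads and converts these values is not shown here.

```python
import configparser

CONFIG_TEXT = """
[DEFAULT]
model = mask_align
batch_size = 9000
update_cycle = 1
device_list = [0,1,2,3]
half = True

[small_budget]
batch_size = 4500
update_cycle = 8
device_list = [0]
half = False
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

section = parser["small_budget"]
# Options set in [small_budget] take precedence over [DEFAULT].
print("batch_size:", section["batch_size"])
# Options missing from [small_budget] fall back to [DEFAULT].
print("model:", section["model"])
```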

You can also monitor the training process through tensorboard

tensorboard --logdir=[output]

Test

After training, the following command can be used to generate attention weights (-g), generate data for attention visualization (-v), and test its AER (-t) if test_ref is provided.

bash thualign/bin/test.sh -s [CONFIG] -e [EXP] -gvt

For example, to test the model trained with the configs in example.config

bash thualign/bin/test.sh -s example -gvt

You might get the following output

alignment-soft.txt: 14.4% (87.7%/83.5%/9467)

The alignment results (alignment.txt) along with other test results are stored in [output]/test by default.
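
For reference, AER (alignment error rate) is commonly defined over a set of predicted links A, sure gold links S and possible gold links P as 1 - (|A∩S| + |A∩P|) / (|A| + |S|), where P includes S. The sketch below implements this standard definition on toy data; it is not the evaluation code used by test.sh, and the function name and the link sets are made up for illustration.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (source index, target index)


def alignment_error_rate(hyp: Set[Link], sure: Set[Link], possible: Set[Link]) -> float:
    """Standard AER; sure links are always counted as possible links too."""
    possible = possible | sure
    denominator = len(hyp) + len(sure)
    if denominator == 0:
        return 0.0
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / denominator


# Toy example: three predicted links, two sure links, one extra possible link.
hyp = {(0, 0), (1, 1), (2, 3)}
sure = {(0, 0), (2, 3)}
possible = {(1, 2)}

aer = alignment_error_rate(hyp, sure, possible)
print(f"AER = {aer:.1%}")
```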

Configs

Most of the configuration of Mask-Align is done through configuration files in thualign/configs. The model reads the basic configs first, followed by the user-defined configs.

Basic Config

Predefined configs for experiments to use.

base.config: basic configs for training, validation and test
model.config: define different models with their hyperparameters

User Config

Customized configs that must specify the following settings, and optionally other experiment-specific parameters:

train/valid/test_input: paths of the input parallel corpora
vocab: paths of the vocabulary files generated by thualign/scripts/build_vocab.py
output: path to save the model outputs
model: which model to use
batch_size: the batch size (number of tokens) used in the training stage.
update_cycle: the number of iterations over which gradients are accumulated before each parameter update. The default value is 1. If you have only 1 GPU and want to obtain the same performance as with 4 GPUs, simply set this parameter to 4. Note that the training time will also be prolonged.
device_list: the list of GPUs to be used in training. Use the nvidia-smi command to find unused GPUs. If the unused GPUs are gpu0 and gpu1, set this parameter as device_list=[0,1].
half: set this to True if you wish to use half-precision training. This speeds up the training procedure. Make sure that your GPUs support half precision.
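
The trade-off between update_cycle and the number of GPUs can be made concrete with a little arithmetic. The sketch below assumes that batch_size is counted per device and that gradients from update_cycle iterations are accumulated before each update; both are assumptions, and the helper name is hypothetical.

```python
def effective_batch_tokens(batch_size: int, update_cycle: int, num_devices: int) -> int:
    """Tokens contributing to one parameter update (assuming per-device batch_size)."""
    return batch_size * update_cycle * num_devices


# Settings from the example config: 4 GPUs, no accumulation.
full = effective_batch_tokens(batch_size=9000, update_cycle=1, num_devices=4)
# Settings from the small_budget section: 1 GPU, 8 accumulation steps.
small = effective_batch_tokens(batch_size=4500, update_cycle=8, num_devices=1)

print(full, small)
```

Under these assumptions both settings use the same number of tokens per update, which is why a smaller machine can compensate with a larger update_cycle at the cost of longer training.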

Here is a minimal experiment config:

### thualign/configs/user/example.config
[DEFAULT]

train_input = ['train.src', 'train.tgt']
valid_input = ['valid.src', 'valid.tgt']
vocab = ['vocab.src.txt', 'vocab.tgt.txt']
test_input = ['test.src', 'test.tgt']
test_ref = test.talp

exp_dir = exp
label = agree_deen
output = ${exp_dir}/${label}

model = mask_align

batch_size = 9000
update_cycle = 1
device_list = [0,1,2,3]
half = True
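
The ${exp_dir}/${label} syntax matches configparser's ExtendedInterpolation, and the list-valued options look like Python literals. The sketch below shows one way such a file could be read with the standard library; whether thualign uses exactly this interpolation class and ast.literal_eval is an assumption.

```python
import ast
import configparser

CONFIG_TEXT = """
[DEFAULT]
train_input = ['train.src', 'train.tgt']
exp_dir = exp
label = agree_deen
output = ${exp_dir}/${label}
device_list = [0,1,2,3]
half = True
"""

parser = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
parser.read_string(CONFIG_TEXT)
defaults = parser["DEFAULT"]

# ${...} references are resolved when the value is read.
output = defaults["output"]
# List-valued options are stored as strings and parsed as Python literals here.
train_input = ast.literal_eval(defaults["train_input"])
device_list = ast.literal_eval(defaults["device_list"])
half = defaults.getboolean("half")

print(output, train_input, device_list, half)
```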

Visualization

To better understand and analyze the model, Mask-Align supports the following two types of visualizations.

Training Visualization

Add eval_plot = True in your config file to turn on visualization during training. This will plot 5 attention maps from evaluation in the tensorboard.

These packages are required for training visualization:

pandas
matplotlib
seaborn

Attention Visualization

Use -v in the test command to generate alignment_vizdata.pt first. It is stored in [output]/test by default. To visualize it, use this script:

python thualign/scripts/visualize.py [output]/test/alignment_vizdata.pt [--port PORT]

This will start a local service that plots the attention weights for all the test sentence pairs. You can access it through a web browser.

These packages are required for attention visualization:

remi
pyecharts

Contact

If you have questions, suggestions or bug reports, please email [email protected].

Comments

Question about the analysis in your paper

I noticed the "Predict and Alignment" part in your paper. You divided tokens into four categories: cPcA, wPcA, cPwA and wPwA. Can you explain how they are calculated?

opened by sdongchuanqi 1

Chinese-to-English result issue

I have a problem reproducing the Chinese-to-English result in the paper (13.8%). The best result I got is 14.6%. I use the LDC dataset, apply BPE to train.ch and train.en, then shuffle and remove the single-word sentences. I use the Chinese-English evaluation set for test and dev; I preprocess the Chinese side with BPE and do nothing with the English side. I use the example config, modifying the paths and setting batch_size to 5000 and update_cycle to 2. I would be grateful if you could help me reproduce the result.

opened by sdongchuanqi 1

TypeError: Can't instantiate abstract class MapDataset with abstract methods _inputs, set_inputs

When I run bash thualign/bin/train.sh -s thualign/configs/user/example.config, I get this error message:

Traceback (most recent call last):
  File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/bin/trainer.py", line 390, in <module>
    cli_main()
  File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/bin/trainer.py", line 384, in cli_main
    nprocs=world_size)
  File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/bin/trainer.py", line 363, in process_fn
    main(local_args)
  File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/bin/trainer.py", line 280, in main
    dataset = data.AlignmentPipeline.get_train_dataset(params.train_input, params)
  File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/data/pipeline.py", line 351, in get_train_dataset
    dataset = dataset.map(map_obj)
  File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/data/dataset.py", line 82, in map
    return MapDataset(self, fn)
  File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/typing.py", line 1223, in __new__
    return _generic_new(cls.__next_in_mro__, cls, *args, **kwds)
  File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/typing.py", line 1184, in _generic_new
    return base_cls.__new__(cls)
TypeError: Can't instantiate abstract class MapDataset with abstract methods _inputs, set_inputs

My Python version is 3.6 and my torch version is 1.8.1. Do you have any solution? Thanks!

opened by lisasiyu 1

Issue reproducing the Ro-en result

I got the Ro-en data from 'https://github.com/lilt/alignment-scripts/tree/master/preprocess'. The original training set was split into a training set and a validation set. Then I applied joint BPE with 40k merges, shuffled the training set, and removed the sentences of length 1 from it. Using a 36K-token batch size and the settings in your paper, I got the same results for ch-en, en-de and en-fr, but I got 20.4 (19.5 in your paper) on Ro-en. Is there anything wrong?

Many thanks for your reply.

opened by sdongchuanqi 0

Torch issues

I'm wondering what environment was used for this project. I used CUDA 11.0 and torch 1.7.1, and some errors occurred.

2022-03-22 18:49:37.663 -- Process 2 terminated with the following error:
2022-03-22 18:49:37.663 Traceback (most recent call last):
2022-03-22 18:49:37.663 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
2022-03-22 18:49:37.663 fn(i, *args)
2022-03-22 18:49:37.663 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 367, in process_fn
2022-03-22 18:49:37.663 main(local_args)
2022-03-22 18:49:37.663 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 325, in main
2022-03-22 18:49:37.663 loss, log_info = model(features)
2022-03-22 18:49:37.663 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 18:49:37.663 result = self.forward(*input, **kwargs)
2022-03-22 18:49:37.663 File "/jizhi/jizhi2/worker/trainer/thualign/models/agreement_wrapper.py", line 62, in forward
2022-03-22 18:49:37.663 b_loss, b_log_output = self.b_model.cal_loss(b_logits, inverse_features["target"], inverse_features["target_mask"])
2022-03-22 18:49:37.663 File "/jizhi/jizhi2/worker/trainer/thualign/models/mask_align.py", line 182, in cal_loss
2022-03-22 18:49:37.663 loss = self.criterion(net_output, labels)
2022-03-22 18:49:37.663 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 18:49:37.663 result = self.forward(*input, **kwargs)
2022-03-22 18:49:37.663 File "/jizhi/jizhi2/worker/trainer/thualign/modules/losses.py", line 26, in forward
2022-03-22 18:49:37.663 loss = log_probs[batch_idx, labels]
2022-03-22 18:49:37.663 IndexError: tensors used as indices must be long, byte or bool tensors

2022-03-22 16:27:17.919 -- Process 2 terminated with the following error:
2022-03-22 16:27:17.919 Traceback (most recent call last):
2022-03-22 16:27:17.919 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
2022-03-22 16:27:17.919 fn(i, *args)
2022-03-22 16:27:17.919 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 367, in process_fn
2022-03-22 16:27:17.919 main(local_args)
2022-03-22 16:27:17.919 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 325, in main
2022-03-22 16:27:17.919 loss, log_info = model(features)
2022-03-22 16:27:17.919 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 16:27:17.919 result = self.forward(*input, **kwargs)
2022-03-22 16:27:17.919 File "/jizhi/jizhi2/worker/trainer/thualign/models/agreement_wrapper.py", line 53, in forward
2022-03-22 16:27:17.919 f_state = self.f_model.encode(features, f_state)
2022-03-22 16:27:17.919 File "/jizhi/jizhi2/worker/trainer/thualign/models/mask_align.py", line 108, in encode
2022-03-22 16:27:17.919 inputs = torch.nn.functional.embedding(src_seq, self.src_embedding)
2022-03-22 16:27:17.919 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
2022-03-22 16:27:17.919 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
2022-03-22 16:27:17.919 RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)

I fixed them by adding .long() and it works.

opened by chrisgao7 0

Cuda issues

I'm wondering what environment was used for this project. I ran this project with the default config on 4 V100 GPUs. I used CUDA 11.0 and torch 1.7.1, and some errors occurred.

> bash thualign/bin/train.sh -s thualign/configs/user/example.config

2022-03-22 17:53:44.193 -- Process 3 terminated with the following error:
2022-03-22 17:53:44.193 Traceback (most recent call last):
2022-03-22 17:53:44.193 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
2022-03-22 17:53:44.193 fn(i, *args)
2022-03-22 17:53:44.193 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 367, in process_fn
2022-03-22 17:53:44.193 main(local_args)
2022-03-22 17:53:44.193 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 325, in main
2022-03-22 17:53:44.193 loss, log_info = model(features)
2022-03-22 17:53:44.193 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 17:53:44.193 result = self.forward(*input, **kwargs)
2022-03-22 17:53:44.193 File "/jizhi/jizhi2/worker/trainer/thualign/models/agreement_wrapper.py", line 62, in forward
2022-03-22 17:53:44.193 b_loss, b_log_output = self.b_model.cal_loss(b_logits, inverse_features["target"], inverse_features["target_mask"])
2022-03-22 17:53:44.193 File "/jizhi/jizhi2/worker/trainer/thualign/models/mask_align.py", line 182, in cal_loss
2022-03-22 17:53:44.193 loss = self.criterion(net_output, labels)
2022-03-22 17:53:44.193 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 17:53:44.193 result = self.forward(*input, **kwargs)
2022-03-22 17:53:44.193 File "/jizhi/jizhi2/worker/trainer/thualign/modules/losses.py", line 40, in forward
2022-03-22 17:53:44.193 sum_probs = torch.sum(log_probs.to(torch.float32), dim=-1)
2022-03-22 17:53:44.193 RuntimeError: CUDA out of memory. Tried to allocate 13.96 GiB (GPU 3; 31.75 GiB total capacity; 22.23 GiB already allocated; 7.34 GiB free; 23.07 GiB reserved in total by PyTorch)

Later I reduced the batch size, but in the middle of training my process was terminated.

2022-03-22 20:22:44.202 Traceback (most recent call last):
2022-03-22 20:22:44.202 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 395, in <module>
2022-03-22 20:22:44.202 cli_main()
2022-03-22 20:22:44.202 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 388, in cli_main
2022-03-22 20:22:44.202 torch.multiprocessing.spawn(process_fn, args=(parsed_args,),
2022-03-22 20:22:44.202 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
2022-03-22 20:22:44.202 return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
2022-03-22 20:22:44.202 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
2022-03-22 20:22:44.202 while not context.join():
2022-03-22 20:22:44.202 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 105, in join
2022-03-22 20:22:44.202 raise Exception(
2022-03-22 20:22:44.202 Exception: process 0 terminated with signal SIGSEGV

opened by chrisgao7 0

Question about BPE

Did you apply BPE to your training data? What do you mean by "We used a joint source and target Byte Pair Encoding (BPE) (Sennrich et al., 2016) with 40k merge operations." in Section 3.1 of your article?

opened by chrisgao7 1

Issue with namespace using train.sh

Hi,

I'm trying to run the training script with Python 3.8.10 and torch==1.10.2+cu113, and I obtain the following error:

>> bash thualign/bin/train.sh -s mask_align -e agree_deen
running mask_align
Traceback (most recent call last):
  File "/net/aistaff/sarti/Mask-Align/thualign/bin/trainer.py", line 21, in <module>
    import thualign.data as data
  File "/net/aistaff/sarti/Mask-Align/thualign/data/__init__.py", line 5, in <module>
    from thualign.data.dataset import Dataset, TextLineDataset
  File "/net/aistaff/sarti/Mask-Align/thualign/data/dataset.py", line 51, in <module>
    class Dataset(IterableDataset):
  File "/net/aistaff/sarti/Mask-Align/venv/lib/python3.8/site-packages/torch/utils/data/_typing.py", line 273, in __new__
    return super().__new__(cls, name, bases, namespace, **kwargs)  # type: ignore[call-overload]
  File "/usr/lib/python3.8/abc.py", line 85, in __new__
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
  File "/net/aistaff/sarti/Mask-Align/venv/lib/python3.8/site-packages/torch/utils/data/_typing.py", line 373, in _dp_init_subclass
    raise TypeError("Expected 'Iterator' as the return annotation for `__iter__` of {}"
TypeError: Expected 'Iterator' as the return annotation for `__iter__` of Dataset, but found thualign.data.iterator.Iterator

Do you have a specific pinned version of torch to make the script work?

opened by gsarti 2

Training issue

An error occurred during training when I used my own training data. However, the training steps didn't stop. Do you know what is going on?

opened by michelleqyhqyh 0

Model cannot converge

I tried to train a mask_align model with the default config in the repo (only changing the data paths) and the DE-EN training data from https://github.com/lilt/alignment-scripts. In some training steps the losses are nan, and at the end of training the loss increases from about 7 to 70.

epoch = 5, step = 49980, loss: nan, f_loss: nan, b_loss: nan, agree_loss: nan, entropy_loss: nan (0.246 sec)
epoch = 5, step = 49990, loss: 64.210, f_loss: 67.750, b_loss: 60.188, agree_loss: 0.000, entropy_loss: 0.241 (0.507 sec)
epoch = 5, step = 50000, loss: 69.115, f_loss: 72.500, b_loss: 65.312, agree_loss: 0.000, entropy_loss: 0.240 (0.652 sec)

opened by theoqian 1

Owner

THUNLP-MT

Machine Translation Group, Natural Language Processing Lab at Tsinghua University (THUNLP). Please refer to https://github.com/thunlp for more NLP resources.
