WECHSEL

Overview

Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

arXiv: https://arxiv.org/abs/2112.06598

Models from the paper are available on the HuggingFace Hub.

Installation

We distribute a Python Package via PyPI:

pip install wechsel

Alternatively, clone the repository, install the dependencies from requirements.txt, and use the code in wechsel/ directly.

Example usage

Transferring English roberta-base to Swahili:

import torch
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset
from wechsel import WECHSEL, load_embeddings

source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

target_tokenizer = source_tokenizer.train_new_from_iterator(
    load_dataset("oscar", "unshuffled_deduplicated_sw", split="train")["text"],
    vocab_size=len(source_tokenizer)
)

wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("sw"),
    bilingual_dictionary="swahili"
)

target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)

model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)

# use `model` and `target_tokenizer` to continue training in Swahili!
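
To continue pretraining on Swahili text, the initialized model and the new tokenizer can be saved together and then trained with any standard masked-language-modeling setup. A minimal sketch (the output directory name is illustrative):

output_dir = "roberta-base-wechsel-sw"  # illustrative path
model.save_pretrained(output_dir)
target_tokenizer.save_pretrained(output_dir)

# Note: the example loads `AutoModel`, which has no LM head; for masked-language-model
# pretraining the checkpoint would typically be reloaded with `AutoModelForMaskedLM`
# (its LM head is newly initialized) and trained, e.g. with the `run_mlm.py` example script.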

Bilingual dictionaries

We distribute 3276 bilingual dictionaries from English to other languages for use with WECHSEL in dicts/.
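
The example above selects a dictionary by language name ("swahili"). If `bilingual_dictionary` also accepts a path to a local dictionary file of word-pair lines, a dictionary from dicts/ can be passed directly; the sketch below assumes this, and the file path is hypothetical:

wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("sw"),
    bilingual_dictionary="dicts/swahili.txt",  # hypothetical local path
)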

Citation

Please cite WECHSEL as

@misc{minixhofer2021wechsel,
      title={WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models}, 
      author={Benjamin Minixhofer and Fabian Paischer and Navid Rekabsaz},
      year={2021},
      eprint={2112.06598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Comments
  • Exact training corpus

    Hi @bminixhofer

    thanks for sharing your work. Could you provide more details on the training corpus?

    In the paper, you write

    we restrict the amount of training data to subsets of 4GiB from the OSCAR corpus (Ortiz Suárez et al., 2019)

    What exact subset do you use? The unshuffled dedup versions (e.g., unshuffled_deduplicated_de)? Any random n samples with a specific seed? Or just the first/last n rows?

    Best, Malte

    opened by malteos 9
  • RuntimeError: expected scalar type Double but found Float

    Behavior

    • After converting the model with the WECHSEL method, a RuntimeError: expected scalar type Double but found Float is raised.
    • The README's en -> Swahili example produces the error, as does an en -> Korean conversion.

    Replicating error

    """ Example Code on README.md from WECHSEL"""
    import torch
    from transformers import AutoModel, AutoTokenizer
    from datasets import load_dataset
    from wechsel import WECHSEL, load_embeddings
    
    source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained("roberta-base")
    
    # check whether model accepts the input
    source_input = source_tokenizer("Checking functionality of original model", return_tensors="pt")
    model(**source_input)
    
    BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.0444,  0.0719, -0.0131,  ..., -0.0682, -0.0579,  0.0079],
    [ 0.0839, -0.0649,  0.0547,  ...,  0.2830,  0.1913,  0.3524],
    [ 0.1558,  0.1616,  0.1473,  ..., -0.0187,  0.1893,  0.4051],
    ...
    
    # converting the model with WECHSEL class
    target_tokenizer = source_tokenizer.train_new_from_iterator(
        load_dataset("oscar", "unshuffled_deduplicated_sw", split="train")["text"],
        vocab_size=len(source_tokenizer)
    )
    
    wechsel = WECHSEL(
        load_embeddings("en"),
        load_embeddings("sw"),
        bilingual_dictionary="swahili"
    )
    
    target_embeddings, info = wechsel.apply(
        source_tokenizer,
        target_tokenizer,
        model.get_input_embeddings().weight.detach().numpy(),
    )
    
    model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)
    
    Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
    Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
    100%|██████████| 50/50 [00:34<00:00,  1.43it/s]
    
    # use `model` and `target_tokenizer` to continue training in Swahili!
    inputs = target_tokenizer("سَوَاحِلِىّ", return_tensors='pt')
    print(inputs)
    
    {'input_ids': tensor([[    0, 25945,   144,   182,  9465,   144,   182,  5796,   201,   144,
               184,  4191,   144,   184,  9708,   144,   185,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
    
    # put swahili tensor inputs to the model
    model(**inputs)
    
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    /var/folders/f8/9hn0rsx125vf87jp8_skr1l40000gn/T/ipykernel_11153/2767806960.py in <module>
          3 
          4 # assign double for all inputs
    ----> 5 model(**inputs)
    
    ~/.pyenv/versions/3.8.3/envs/wechsel/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102             return forward_call(*input, **kwargs)
       1103         # Do not call functions when jit is used
       1104         full_backward_hooks, non_full_backward_hooks = [], []
    
    ~/.pyenv/versions/3.8.3/envs/wechsel/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
        842         head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
        843 
    --> 844         embedding_output = self.embeddings(
        845             input_ids=input_ids,
        846             position_ids=position_ids,
    
    ~/.pyenv/versions/3.8.3/envs/wechsel/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102             return forward_call(*input, **kwargs)
       1103         # Do not call functions when jit is used
       1104         full_backward_hooks, non_full_backward_hooks = [], []
    
    ~/.pyenv/versions/3.8.3/envs/wechsel/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
        136             position_embeddings = self.position_embeddings(position_ids)
        137             embeddings += position_embeddings
    --> 138         embeddings = self.LayerNorm(embeddings)
        139         embeddings = self.dropout(embeddings)
        140         return embeddings
    
    ~/.pyenv/versions/3.8.3/envs/wechsel/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102             return forward_call(*input, **kwargs)
       1103         # Do not call functions when jit is used
       1104         full_backward_hooks, non_full_backward_hooks = [], []
    
    ~/.pyenv/versions/3.8.3/envs/wechsel/lib/python3.8/site-packages/torch/nn/modules/normalization.py in forward(self, input)
        187 
        188     def forward(self, input: Tensor) -> Tensor:
    --> 189         return F.layer_norm(
        190             input, self.normalized_shape, self.weight, self.bias, self.eps)
        191 
    
    ~/.pyenv/versions/3.8.3/envs/wechsel/lib/python3.8/site-packages/torch/nn/functional.py in layer_norm(input, normalized_shape, weight, bias, eps)
       2345             layer_norm, (input, weight, bias), input, normalized_shape, weight=weight, bias=bias, eps=eps
       2346         )
    -> 2347     return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
       2348 
       2349 
    
    RuntimeError: expected scalar type Double but found Float
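
    The traceback suggests that the transferred embeddings come back as a float64 NumPy array, so `torch.from_numpy(target_embeddings)` produces a double tensor while the rest of the model stays float32. A minimal workaround (a sketch, assuming this is the cause) is to cast the new embeddings to float32 before assigning them:

    # cast the transferred embeddings to float32 so the embedding weights
    # match the dtype of the rest of the model
    model.get_input_embeddings().weight.data = torch.from_numpy(
        target_embeddings
    ).float()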
    
    opened by snoop2head 2
  • Questions about the use of the model

    [image not shown]

    Can the model produced by the code above be used directly, or does it need fine-tuning or further training? For example:

    from transformers import pipeline
    from transformers import AutoModelForMaskedLM, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("new")
    model = AutoModelForMaskedLM.from_pretrained("new")
    
    
    unmasker = pipeline("fill-mask",model=model,tokenizer=tokenizer,device = 0)
    from pprint import pprint
    
    pprint(unmasker(f"I come for {unmasker.tokenizer.mask_token} in last time!"))
    

    In your experience, would code like the above give good results?

    opened by ScottishFold007 1