WECHSEL

Overview

Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

arXiv: https://arxiv.org/abs/2112.06598

Models from the paper are available on the HuggingFace Hub.

Installation

We distribute a Python Package via PyPI:

pip install wechsel

Alternatively, clone the repository, install the dependencies from requirements.txt, and use the code in wechsel/ directly.

Example usage

Transferring English roberta-base to Swahili:

import torch
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset
from wechsel import WECHSEL, load_embeddings

source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

target_tokenizer = source_tokenizer.train_new_from_iterator(
    load_dataset("oscar", "unshuffled_deduplicated_sw", split="train")["text"],
    vocab_size=len(source_tokenizer)
)

wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("sw"),
    bilingual_dictionary="swahili"
)

target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)

model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)

# use `model` and `target_tokenizer` to continue training in Swahili!
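
To continue pretraining on Swahili text, the initialized model and the new tokenizer can be saved together and then trained with any standard masked-language-modeling setup. A minimal sketch (the output directory name is illustrative):

output_dir = "roberta-base-wechsel-sw"  # illustrative path
model.save_pretrained(output_dir)
target_tokenizer.save_pretrained(output_dir)

# Note: the example loads `AutoModel`, which has no LM head; for masked-language-model
# pretraining the checkpoint would typically be reloaded with `AutoModelForMaskedLM`
# (its LM head is newly initialized) and trained, e.g. with the `run_mlm.py` example script.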

Bilingual dictionaries

We distribute 3276 bilingual dictionaries from English to other languages for use with WECHSEL in dicts/.
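
The example above selects a dictionary by language name ("swahili"). If `bilingual_dictionary` also accepts a path to a local dictionary file of word-pair lines, a dictionary from dicts/ can be passed directly; the sketch below assumes this, and the file path is hypothetical:

wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("sw"),
    bilingual_dictionary="dicts/swahili.txt",  # hypothetical local path
)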

Citation

Please cite WECHSEL as

@misc{minixhofer2021wechsel,
      title={WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models}, 
      author={Benjamin Minixhofer and Fabian Paischer and Navid Rekabsaz},
      year={2021},
      eprint={2112.06598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Comments
  • Exact training corpus

    Hi @bminixhofer

    thanks for sharing your work. Could you provide more details on the training corpus?

    In the paper, you write

    we restrict the amount of training data to subsets of 4GiB from the OSCAR corpus (Ortiz Suárez et al., 2019)

    What exact subset do you use? The unshuffled dedup versions (e.g., unshuffled_deduplicated_de)? Any random n samples with a specific seed? Or just the first/last n rows?

    Best, Malte

    opened by malteos 9
  • RuntimeError: expected scalar type Double but found Float

    Behavior

    • After converting the model with the WECHSEL method, a RuntimeError: expected scalar type Double but found Float is raised.
    • The README's en -> Swahili example produces the error, as does an en -> Korean conversion.

    Replicating error

    """ Example Code on README.md from WECHSEL"""
    import torch
    from transformers import AutoModel, AutoTokenizer
    from datasets import load_dataset
    from wechsel import WECHSEL, load_embeddings
    
    source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained("roberta-base")
    
    # check whether model accepts the input
    source_input = source_tokenizer("Checking functionality of original model", return_tensors="pt")
    model(**source_input)
    
    BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.0444,  0.0719, -0.0131,  ..., -0.0682, -0.0579,  0.0079],
    [ 0.0839, -0.0649,  0.0547,  ...,  0.2830,  0.1913,  0.3524],
    [ 0.1558,  0.1616,  0.1473,  ..., -0.0187,  0.1893,  0.4051],
    ...
    
    # converting the model with WECHSEL class
    target_tokenizer = source_tokenizer.train_new_from_iterator(
        load_dataset("oscar", "unshuffled_deduplicated_sw", split="train")["text"],
        vocab_size=len(source_tokenizer)
    )
    
    wechsel = WECHSEL(
        load_embeddings("en"),
        load_embeddings("sw"),
        bilingual_dictionary="swahili"
    )
    
    target_embeddings, info = wechsel.apply(
        source_tokenizer,
        target_tokenizer,
        model.get_input_embeddings().weight.detach().numpy(),
    )
    
    model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)
    
    Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
    Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
    100%|██████████| 50/50 [00:34<00:00,  1.43it/s]
    
    # use `model` and `target_tokenizer` to continue training in Swahili!
    inputs = target_tokenizer("سَوَاحِلِىّ", return_tensors='pt')
    print(inputs)
    
    {'input_ids': tensor([[    0, 25945,   144,   182,  9465,   144,   182,  5796,   201,   144,
               184,  4191,   144,   184,  9708,   144,   185,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
    
    # put swahili tensor inputs to the model
    model(**inputs)
    
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    /var/folders/f8/9hn0rsx125vf87jp8_skr1l40000gn/T/ipykernel_11153/2767806960.py in <module>
          3 
          4 # assign double for all inputs
    ----> 5 model(**inputs)
    
    ~/.pyenv/versions/3.8.3/envs/wechsel/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102             return forward_call(*input, **kwargs)
       1103         # Do not call functions when jit is used
       1104         full_backward_hooks, non_full_backward_hooks = [], []
    
    ~/.pyenv/versions/3.8.3/envs/wechsel/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
        842         head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
        843 
    --> 844         embedding_output = self.embeddings(
        845             input_ids=input_ids,
        846             position_ids=position_ids,
    
    ~/.pyenv/versions/3.8.3/envs/wechsel/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102             return forward_call(*input, **kwargs)
       1103         # Do not call functions when jit is used
       1104         full_backward_hooks, non_full_backward_hooks = [], []
    
    ~/.pyenv/versions/3.8.3/envs/wechsel/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
        136             position_embeddings = self.position_embeddings(position_ids)
        137             embeddings += position_embeddings
    --> 138         embeddings = self.LayerNorm(embeddings)
        139         embeddings = self.dropout(embeddings)
        140         return embeddings
    
    ~/.pyenv/versions/3.8.3/envs/wechsel/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102             return forward_call(*input, **kwargs)
       1103         # Do not call functions when jit is used
       1104         full_backward_hooks, non_full_backward_hooks = [], []
    
    ~/.pyenv/versions/3.8.3/envs/wechsel/lib/python3.8/site-packages/torch/nn/modules/normalization.py in forward(self, input)
        187 
        188     def forward(self, input: Tensor) -> Tensor:
    --> 189         return F.layer_norm(
        190             input, self.normalized_shape, self.weight, self.bias, self.eps)
        191 
    
    ~/.pyenv/versions/3.8.3/envs/wechsel/lib/python3.8/site-packages/torch/nn/functional.py in layer_norm(input, normalized_shape, weight, bias, eps)
       2345             layer_norm, (input, weight, bias), input, normalized_shape, weight=weight, bias=bias, eps=eps
       2346         )
    -> 2347     return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
       2348 
       2349 
    
    RuntimeError: expected scalar type Double but found Float
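
    The traceback suggests that the transferred embeddings come back as a float64 NumPy array, so `torch.from_numpy(target_embeddings)` produces a double tensor while the rest of the model stays float32. A minimal workaround (a sketch, assuming this is the cause) is to cast the new embeddings to float32 before assigning them:

    # cast the transferred embeddings to float32 so the embedding weights
    # match the dtype of the rest of the model
    model.get_input_embeddings().weight.data = torch.from_numpy(
        target_embeddings
    ).float()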
    
    opened by snoop2head 2
  • Questions about the use of the model

    [image not shown]

    Can the model produced by the code above be used directly, or does it need fine-tuning or further training? For example:

    from transformers import pipeline
    from transformers import AutoModelForMaskedLM, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("new")
    model = AutoModelForMaskedLM.from_pretrained("new")
    
    
    unmasker = pipeline("fill-mask",model=model,tokenizer=tokenizer,device = 0)
    from pprint import pprint
    
    pprint(unmasker(f"I come for {unmasker.tokenizer.mask_token} in last time!"))
    

    In your experience, would code like the above give good results?

    opened by ScottishFold007 1