Transformer Embedder
A Word Level Transformer layer based on PyTorch and Hugging Face Transformers.
How to use
Install the library from PyPI:
pip install transformer-embedder
It offers a PyTorch layer and a tokenizer that support almost every pretrained model from the Hugging Face Transformers library. Here is a quick example:
import transformer_embedder as tre
tokenizer = tre.Tokenizer("bert-base-cased")
model = tre.TransformerEmbedder("bert-base-cased", subtoken_pooling="mean", output_layer="sum")
example = "This is a sample sentence"
inputs = tokenizer(example, return_tensors=True)
{
'input_ids': tensor([[ 101, 1188, 1110, 170, 6876, 5650, 102]]),
'offsets': tensor([[[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]]]),
'attention_mask': tensor([[True, True, True, True, True, True, True]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
'sentence_length': 7 # with special tokens included
}
outputs = model(**inputs)
# outputs[:, 1:-1].shape   # remove [CLS] and [SEP]
torch.Size([1, 5, 768])
# len(example.split())
5
Info
One of the annoyances of working with transformer-based models is that it is not trivial to compute word embeddings from the sub-token embeddings they output. With this library, it is as easy as the quick example above: tokenize the input, then run a single forward pass.
Model
The TransformerEmbedder class offers 4 ways to retrieve the word embeddings, controlled by the subtoken_pooling parameter:
- first: uses only the embedding of the first sub-token of each word
- last: uses only the embedding of the last sub-token of each word
- mean: computes the mean of the embeddings of the sub-tokens of each word
- none: returns the raw output of the transformer model, without sub-token pooling
There are also multiple types of output you can get through the output_layer parameter:
- last: returns the last hidden state of the transformer model
- concat: returns the concatenation of the last four hidden states of the transformer model
- sum: returns the sum of the last four hidden states of the transformer model
- pooled: returns the output of the pooling layer
If you also want all the outputs from the Hugging Face model, you can set return_all=True to get them. A usage sketch follows the class signature below.
class TransformerEmbedder(torch.nn.Module):
    def __init__(
        self,
        model: Union[str, tr.PreTrainedModel],
        subtoken_pooling: str = "first",
        output_layer: str = "last",
        fine_tune: bool = True,
        return_all: bool = False,
    )
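For instance, here is a minimal sketch of a few common configurations, using only the parameters documented above (model and variable names are illustrative):

import transformer_embedder as tre

# defaults: first sub-token of each word, last hidden layer
model = tre.TransformerEmbedder("bert-base-cased")

# mean of the sub-tokens, sum of the last four hidden layers,
# with the transformer weights kept frozen
frozen_model = tre.TransformerEmbedder(
    "bert-base-cased",
    subtoken_pooling="mean",
    output_layer="sum",
    fine_tune=False,
)

# also return all the outputs of the underlying Hugging Face model
full_output_model = tre.TransformerEmbedder("bert-base-cased", return_all=True)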
Tokenizer
The Tokenizer class provides the tokenize method to preprocess the input for the TransformerEmbedder layer. You can pass raw sentences, pre-tokenized sentences, and batches of sentences. It preprocesses them and returns a dictionary with the inputs for the model. By passing return_tensors=True, it returns the inputs as torch.Tensor.
By default, if you pass text (or batch) as strings, it splits them on spaces
text = "This is a sample sentence"
tokenizer(text)
text = ["This is a sample sentence", "This is another sample sentence"]
tokenizer(text)
You can also use SpaCy to pre-tokenize the inputs into words first, using use_spacy=True
text = "This is a sample sentence"
tokenizer(text, use_spacy=True)
text = ["This is a sample sentence", "This is another sample sentence"]
tokenizer(text, use_spacy=True)
or you can pass a pre-tokenized sentence (or batch of sentences) by setting is_split_into_words=True
text = ["This", "is", "a", "sample", "sentence"]
tokenizer(text, is_split_into_words=True)
text = [
["This", "is", "a", "sample", "sentence", "1"],
["This", "is", "sample", "sentence", "2"],
]
tokenizer(text, is_split_into_words=True) # here is_split_into_words is redundant
Examples
First, initialize the tokenizer
import transformer_embedder as tre
tokenizer = tre.Tokenizer("bert-base-cased")
- You can pass a single sentence as a string:
text = "This is a sample sentence"
tokenizer(text)
{
'input_ids': [101, 1188, 1110, 170, 6876, 5650, 102],
'offsets': [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)],
'attention_mask': [True, True, True, True, True, True, True],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0],
'sentence_length': 7
}
- A sentence pair:
text = "This is a sample sentence A"
text_pair = "This is a sample sentence B"
tokenizer(text, text_pair)
{
'input_ids': [101, 1188, 1110, 170, 6876, 5650, 138, 102, 1188, 1110, 170, 6876, 5650, 139, 102],
'offsets': [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11), (12, 12), (13, 13), (14, 14)],
'attention_mask': [True, True, True, True, True, True, True, True, True, True, True, True, True, True, True],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
'sentence_length': 15
}
- A batch of sentences or sentence pairs. Using padding=True and return_tensors=True, the tokenizer returns the inputs ready for the model:
batch = [
["This", "is", "a", "sample", "sentence", "1"],
["This", "is", "sample", "sentence", "2"],
["This", "is", "a", "sample", "sentence", "3"],
# ...
["This", "is", "a", "sample", "sentence", "n", "for", "batch"],
]
tokenizer(batch, padding=True, return_tensors=True)
batch_pair = [
["This", "is", "a", "sample", "sentence", "pair", "1"],
["This", "is", "sample", "sentence", "pair", "2"],
["This", "is", "a", "sample", "sentence", "pair", "3"],
# ...
["This", "is", "a", "sample", "sentence", "pair", "n", "for", "batch"],
]
tokenizer(batch, batch_pair, padding=True, return_tensors=True)
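The resulting dictionaries can be fed directly to the TransformerEmbedder layer. A minimal sketch, reusing the model and tokenizer created in the quick example above:

inputs = tokenizer(batch, padding=True, return_tensors=True)
outputs = model(**inputs)  # word-level embeddings, one row per sentence in the batch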
Custom fields
It is possible to add custom fields to the model input and tell the tokenizer how to pad them using add_padding_ops. Start by simply tokenizing the input (without padding or tensor mapping):
import transformer_embedder as tre
tokenizer = tre.Tokenizer("bert-base-cased")
text = [
["This", "is", "a", "sample", "sentence"],
["This", "is", "another", "example", "sentence", "just", "make", "it", "longer"]
]
inputs = tokenizer(text)
Then add the custom fields to the result:
custom_fields = {
"custom_filed_1": [
[0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
]
}
inputs.update(custom_fields)
Now we can add the padding logic for our custom field custom_field_1. The add_padding_ops method takes as input:
- key: the name of the field in the tokenizer input
- value: the value to use for padding
- length: the length to pad to. It can be an int, or one of two string values: subtoken, where the element is padded to the batch max length in terms of sub-tokens, and word, where the element is padded to the batch max length in terms of original words (the alternatives are sketched after the example output below)
tokenizer.add_padding_ops("custom_field_1", 0, "word")
Finally, pad the input and convert it to a tensor:
# manual processing
inputs = tokenizer.pad_batch(inputs)
inputs = tokenizer.to_tensor(inputs)
The inputs are now ready for the model, including the custom field.
>>> inputs
{
"input_ids": tensor(
[
[101, 1188, 1110, 170, 6876, 5650, 102, 0, 0, 0, 0],
[101, 1188, 1110, 1330, 1859, 5650, 1198, 1294, 1122, 2039, 102],
]
),
"offsets": tensor(
[
[[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [-1, -1], [-1, -1], [-1, -1], [-1, -1]],
[[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10]],
]
),
"attention_mask": tensor(
[
[True, True, True, True, True, True, True, False, False, False, False],
[True, True, True, True, True, True, True, True, True, True, True],
]
),
"word_mask": tensor(
[
[True, True, True, True, True, True, True, False, False, False, False],
[True, True, True, True, True, True, True, True, True, True, True],
]
),
"token_type_ids": tensor(
[[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
),
"sentence_length": tensor([7, 11]),
"custom_filed_1": tensor(
[[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]]
),
}
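As noted above, the length argument of add_padding_ops also accepts a fixed int or the string subtoken. A minimal sketch of the alternatives, using the same illustrative field:

# pad the custom field to the batch max length in sub-tokens
tokenizer.add_padding_ops("custom_field_1", 0, "subtoken")

# or pad it to a fixed length of 16
tokenizer.add_padding_ops("custom_field_1", 0, 16)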
SpaCy Tokenizer
By default, it uses the multilingual model xx_sent_ud_sm. You can change it with the language parameter when initializing the tokenizer. For example, if you prefer an English tokenizer:
tokenizer = tre.Tokenizer("bert-base-cased", language="en_core_web_sm")
For a complete list of languages and models, see the spaCy models page (https://spacy.io/models).
To-Do
Future developments
- Add an optional word tokenizer, maybe using SpaCy
- Add an add_special_tokens wrapper
- Make the pad_batch function more general
- Add logic (like how to pad, etc.) for custom fields
- Documentation
- Include all model outputs
- Documentation
- A TensorFlow version (improbable)
Acknowledgements
Most of the code in the TransformerEmbedder class is taken from the AllenNLP library. The pretrained models and the core of the tokenizer come from the Hugging Face Transformers library.