A word-level Transformer layer based on PyTorch and 🤗 Transformers.
Install the library from PyPI:

```bash
pip install transformers-embedder
```

or from Conda:

```bash
conda install -c riccorl transformers-embedder
```
It offers a PyTorch layer and a tokenizer that support almost every pretrained model from the Hugging Face 🤗 Transformers library. Here is a quick example:
```python
import transformers_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")

model = tre.TransformersEmbedder(
    "bert-base-cased", subword_pooling_strategy="sparse", layer_pooling_strategy="mean"
)

example = "This is a sample sentence"
inputs = tokenizer(example, return_tensors=True)
```
```text
{
    'input_ids': tensor([[ 101, 1188, 1110, 170, 6876, 5650, 102]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]),
    'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
    'scatter_offsets': tensor([[0, 1, 2, 3, 4, 5, 6]]),
    'sparse_offsets': {
        'sparse_indices': tensor(
            [
                [0, 0, 0, 0, 0, 0, 0],
                [0, 1, 2, 3, 4, 5, 6],
                [0, 1, 2, 3, 4, 5, 6]
            ]
        ),
        'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1.]),
        'sparse_size': torch.Size([1, 7, 7])
    },
    'sentence_length': 7  # with special tokens included
}
```
```python
outputs = model(**inputs)

# outputs.word_embeddings[:, 1:-1].shape  # remove [CLS] and [SEP]
# torch.Size([1, 5, 768])

# len(example.split())  # number of words in the sentence
# 5
```
One of the annoyances of working with transformer-based models is that it is not trivial to compute word embeddings from the sub-token embeddings they output. With this API it's as easy as using 🤗 Transformers to get word-level embeddings from theoretically every transformer model it supports.
The `TransformersEmbedder` class offers three ways to get the embeddings:

- `subword_pooling_strategy="sparse"`: computes the mean of the embeddings of the sub-tokens of each word (i.e. the embeddings of the sub-tokens are pooled together) using a sparse matrix multiplication (see the sketch after the feature table below). This strategy is the default one.
- `subword_pooling_strategy="scatter"`: computes the mean of the embeddings of the sub-tokens of each word using a scatter-gather operation. It is not deterministic, but it works with ONNX export.
- `subword_pooling_strategy="none"`: returns the raw output of the transformer model without sub-token pooling.
Here is a small feature table:
| Strategy | Pooling | Deterministic | ONNX |
|----------|---------|---------------|------|
| Sparse   | ✅      | ✅            | ❌   |
| Scatter  | ✅      | ❌            | ✅   |
| None     | ❌      | ✅            | ✅   |
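To make the sparse strategy concrete, here is a minimal sketch of the underlying idea (an illustration only, not the library's actual implementation): a sparse word-by-sub-token matrix whose rows sum to 1 averages the sub-token embeddings of each word with a single matrix multiplication. In the library the matrix is batched (see the `sparse_size` of `[1, 7, 7]` in the example above); the sketch below uses a single sentence with one multi-piece word.

```python
import torch

# Hypothetical mini-example: one word split into 4 sub-tokens
# (e.g. "embeddings" -> ["em", "##bed", "##ding", "##s"]).
subword_embeddings = torch.randn(4, 768)  # 4 sub-token vectors, hidden size 768

# Sparse (num_words x num_subwords) pooling matrix: each row stores
# 1 / n_subwords at the positions of the sub-tokens belonging to that word.
indices = torch.tensor([[0, 0, 0, 0],   # word index of each sub-token
                        [0, 1, 2, 3]])  # sub-token position
values = torch.full((4,), 1.0 / 4)
pooling_matrix = torch.sparse_coo_tensor(indices, values, size=(1, 4))

# One sparse matmul yields the word-level embedding, shape (1, 768).
word_embedding = torch.sparse.mm(pooling_matrix, subword_embeddings)
```

This is also why the tokenizer returns `sparse_indices` and `sparse_values`: words made of several sub-tokens get fractional values (for instance the `0.5` entries in the batched example further below), so the multiplication averages their pieces.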
There are also multiple types of output you can get using the `layer_pooling_strategy` parameter:

- `layer_pooling_strategy="last"`: returns the last hidden state of the transformer model
- `layer_pooling_strategy="concat"`: returns the concatenation of the selected `output_layers` of the transformer model
- `layer_pooling_strategy="sum"`: returns the sum of the selected `output_layers` of the transformer model
- `layer_pooling_strategy="mean"`: returns the average of the selected `output_layers` of the transformer model
- `layer_pooling_strategy="scalar_mix"`: returns the output of a parameterised scalar mixture layer of the selected `output_layers` of the transformer model (a rough sketch of the idea follows this list)
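If you are not familiar with scalar mixtures, the snippet below sketches the general idea, assuming an ELMo-style softmax-weighted sum of layers with a learned global scale; it is an illustration of the concept, not the code used inside `TransformersEmbedder`.

```python
import torch

class ScalarMixSketch(torch.nn.Module):
    """Softmax-weighted sum of transformer layers with a learned global scale."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = torch.nn.Parameter(torch.zeros(num_layers))
        self.gamma = torch.nn.Parameter(torch.ones(1))

    def forward(self, layers: torch.Tensor) -> torch.Tensor:
        # layers: (num_layers, batch_size, sequence_length, hidden_size)
        weights = torch.softmax(self.layer_weights, dim=0)
        return self.gamma * (weights.view(-1, 1, 1, 1) * layers).sum(dim=0)

# e.g. mix four hidden states of a BERT-base model for a batch of 2 sentences
layers = torch.randn(4, 2, 7, 768)
mixed = ScalarMixSketch(num_layers=4)(layers)  # shape: (2, 7, 768)
```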
If you also want all the outputs from the Hugging Face model, you can set `return_all=True` to get them.
```python
class TransformersEmbedder(torch.nn.Module):
    def __init__(
        self,
        model: Union[str, tr.PreTrainedModel],
        subword_pooling_strategy: str = "sparse",
        layer_pooling_strategy: str = "last",
        output_layers: Tuple[int] = (-4, -3, -2, -1),
        fine_tune: bool = True,
        return_all: bool = True,
    )
```
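For example, to average the last four layers of the model and also keep the full Hugging Face output around:

```python
import transformers_embedder as tre

# pool the last four hidden layers instead of using only the last one,
# and return all the outputs of the underlying Hugging Face model too
model = tre.TransformersEmbedder(
    "bert-base-cased",
    subword_pooling_strategy="sparse",
    layer_pooling_strategy="mean",
    output_layers=(-4, -3, -2, -1),
    return_all=True,
)
```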
The `Tokenizer` class provides the `tokenize` method to preprocess the input for the `TransformersEmbedder` layer. You can pass raw sentences, pre-tokenized sentences, and batches of either; it preprocesses them and returns a dictionary with the inputs for the model. By passing `return_tensors=True` it returns the inputs as `torch.Tensor`s.
By default, if you pass the text (or batch of texts) as strings, it uses the HuggingFace tokenizer to tokenize them.
text = "This is a sample sentence"
tokenizer(text)
text = ["This is a sample sentence", "This is another sample sentence"]
tokenizer(text)
You can pass a pre-tokenized sentence (or batch of sentences) by setting `is_split_into_words=True`:
text = ["This", "is", "a", "sample", "sentence"]
tokenizer(text, is_split_into_words=True)
text = [
["This", "is", "a", "sample", "sentence", "1"],
["This", "is", "sample", "sentence", "2"],
]
tokenizer(text, is_split_into_words=True)
First, initialize the tokenizer:

```python
import transformers_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")
```
- You can pass a single sentence as a string:

```python
text = "This is a sample sentence"
tokenizer(text)
```
```text
{
    'input_ids': [[101, 1188, 1110, 170, 6876, 5650, 102]],
    'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]],
    'attention_mask': [[1, 1, 1, 1, 1, 1, 1]],
    'scatter_offsets': [[0, 1, 2, 3, 4, 5, 6]],
    'sparse_offsets': {
        'sparse_indices': tensor(
            [
                [0, 0, 0, 0, 0, 0, 0],
                [0, 1, 2, 3, 4, 5, 6],
                [0, 1, 2, 3, 4, 5, 6]
            ]
        ),
        'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1.]),
        'sparse_size': torch.Size([1, 7, 7])
    },
    'sentence_lengths': [7],
}
```
- A sentence pair:

```python
text = "This is a sample sentence A"
text_pair = "This is a sample sentence B"
tokenizer(text, text_pair)
```
```text
{
    'input_ids': [[101, 1188, 1110, 170, 6876, 5650, 138, 102, 1188, 1110, 170, 6876, 5650, 139, 102]],
    'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]],
    'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
    'scatter_offsets': [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]],
    'sparse_offsets': {
        'sparse_indices': tensor(
            [
                [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
                [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14],
                [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]
            ]
        ),
        'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
        'sparse_size': torch.Size([1, 15, 15])
    },
    'sentence_lengths': [15],
}
```
- A batch of sentences or sentence pairs. Using `padding=True` and `return_tensors=True`, the tokenizer returns the text ready for the model:
```python
batch = [
    ["This", "is", "a", "sample", "sentence", "1"],
    ["This", "is", "sample", "sentence", "2"],
    ["This", "is", "a", "sample", "sentence", "3"],
    # ...
    ["This", "is", "a", "sample", "sentence", "n", "for", "batch"],
]
tokenizer(batch, padding=True, return_tensors=True)

batch_pair = [
    ["This", "is", "a", "sample", "sentence", "pair", "1"],
    ["This", "is", "sample", "sentence", "pair", "2"],
    ["This", "is", "a", "sample", "sentence", "pair", "3"],
    # ...
    ["This", "is", "a", "sample", "sentence", "pair", "n", "for", "batch"],
]
tokenizer(batch, batch_pair, padding=True, return_tensors=True)
```
It is possible to add custom fields to the model input and tell the tokenizer how to pad them using `add_padding_ops`. Start by initializing the tokenizer with the model name:
```python
import transformers_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")
```
Then add the custom fields to it:
```python
custom_fields = {
    "custom_field_1": [
        [0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
    ]
}
```
Now we can add the padding logic for our custom field `custom_field_1`. The `add_padding_ops` method takes in input:

- `key`: name of the field in the tokenizer input
- `value`: value to use for padding
- `length`: length to pad. It can be an `int`, or one of two string values: `subword`, in which the element is padded to match the length of the sub-words, and `word`, where the element is padded relative to the length of the batch after the sub-words are merged.
```python
tokenizer.add_padding_ops("custom_field_1", 0, "word")
```
Finally, we can tokenize the input with the custom field:
```python
text = [
    "This is a sample sentence",
    "This is another example sentence just make it longer, with a comma too!"
]
inputs = tokenizer(text, padding=True, return_tensors=True, additional_inputs=custom_fields)
```
The inputs are ready for the model, including the custom field.
```text
>>> inputs
{
    'input_ids': tensor(
        [
            [ 101, 1188, 1110,  170, 6876, 5650,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0],
            [ 101, 1188, 1110, 1330, 1859, 5650, 1198, 1294, 1122, 2039,  117, 1114,  170, 3254, 1918, 1315,  106,  102]
        ]
    ),
    'token_type_ids': tensor(
        [
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        ]
    ),
    'attention_mask': tensor(
        [
            [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
        ]
    ),
    'scatter_offsets': tensor(
        [
            [ 0,  1,  2,  3,  4,  5,  6, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
            [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 13, 14, 15, 16]
        ]
    ),
    'sparse_offsets': {
        'sparse_indices': tensor(
            [
                [ 0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
                [ 0,  1,  2,  3,  4,  5,  6,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 13, 14, 15, 16],
                [ 0,  1,  2,  3,  4,  5,  6,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17]
            ]
        ),
        'sparse_values': tensor(
            [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
             1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
             1.0000, 1.0000, 0.5000, 0.5000, 1.0000, 1.0000, 1.0000]
        ),
        'sparse_size': torch.Size([2, 17, 18])
    },
    'sentence_lengths': [7, 17],
}
```
Some code in the `TransformersEmbedder` class is taken from the PyTorch Scatter library. The pretrained models and the core of the tokenizer are from 🤗 Transformers.