Feature request: using BERT and ELMo embedding in TextInputter #422

Open
atebbifakhr opened this issue Apr 29, 2019 · 5 comments

Comments

@atebbifakhr
Contributor

Hi,

Do you have any plans to support contextualized embeddings such as BERT and ELMo in TextInputter?

@guillaumekln
Contributor

Hi,

I will probably not work on that directly but I'm interested in making sure that users can integrate it without too much pain. Right now, it seems possible to extend TextInputter and override the make_inputs method. All this could be done directly in the user model definition file without changing the OpenNMT-tf code.

@atebbifakhr
Contributor Author

Thanks for the information.

@atebbifakhr
Contributor Author

Hi,

I want to use the BERT representation in my model. To do so, I subclassed WordEmbedder as follows:

# Imports assumed by this snippet (not shown in the original post).
import opennmt as onmt
import tensorflow as tf

# bert_tokenizer is created elsewhere (also not shown in the original post).

class MyEmbedder(onmt.inputters.WordEmbedder):
    def make_features(self, element=None, features=None, training=None):
        features = super(MyEmbedder, self).make_features(
            element=element, features=features, training=training)

        def _python_wrapper(element):
            # Runs eagerly: decode the raw bytes, then tokenize with BERT.
            element = tf.compat.as_text(element.numpy())
            bert_tokenized = bert_tokenizer.encode_plus(
                element,
                add_special_tokens=True,     # add [CLS] and [SEP]
                max_length=128,              # maximum input length for BERT
                pad_to_max_length=True,      # pad with [PAD] tokens
                return_attention_mask=True,  # mask so attention ignores padding
            )
            return (bert_tokenized["input_ids"],
                    bert_tokenized["attention_mask"],
                    bert_tokenized["token_type_ids"])

        input_ids, attention_mask, token_type_ids = tf.py_function(
            _python_wrapper, [element], [tf.int32, tf.int32, tf.int32])
        features["bert_input_ids"] = input_ids
        features["bert_token_type_ids"] = token_type_ids
        features["bert_attention_mask"] = attention_mask
        return features

But I get this exception when building the training dataset:

/usr/local/lib/python3.6/dist-packages/opennmt/inputters/inputter.py in make_training_dataset(self, features_file, labels_file, batch_size, batch_type, batch_multiplier, batch_size_multiple, shuffle_buffer_size, length_bucket_width, maximum_features_length, maximum_labels_length, single_pass, num_shards, shard_index, num_threads, prefetch_buffer_size, cardinality_multiple, weights)
577 shuffle_buffer_size=shuffle_buffer_size,
578 prefetch_buffer_size=prefetch_buffer_size,
--> 579 cardinality_multiple=cardinality_multiple)(dataset)
580 return dataset

/usr/local/lib/python3.6/dist-packages/opennmt/data/dataset.py in _pipeline(dataset)
554 batch_size_multiple=batch_size_multiple,
555 length_bucket_width=length_bucket_width,
--> 556 length_fn=[features_length_fn, labels_length_fn]))
557 dataset = dataset.apply(filter_irregular_batches(batch_multiplier))
558 if not single_pass:

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py in apply(self, transformation_func)
1741 dataset.
1742 """
-> 1743 dataset = transformation_func(self)
1744 if not isinstance(dataset, DatasetV2):
1745 raise TypeError(

/usr/local/lib/python3.6/dist-packages/opennmt/data/dataset.py in <lambda>(dataset)
324 """
325 return lambda dataset: dataset.padded_batch(
--> 326 batch_size, padded_shapes=padded_shapes or _get_output_shapes(dataset))
327
328 def batch_sequence_dataset(batch_size,

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py in padded_batch(self, batch_size, padded_shapes, padding_values, drop_remainder)
1479 """
1480 return PaddedBatchDataset(self, batch_size, padded_shapes, padding_values,
-> 1481 drop_remainder)
1482
1483 def map(self, map_func, num_parallel_calls=None):

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py in __init__(self, input_dataset, batch_size, padded_shapes, padding_values, drop_remainder)
3811 nest.flatten(input_shapes), flat_padded_shapes):
3812 flat_padded_shapes_as_tensors.append(
-> 3813 _padded_shape_to_tensor(padded_shape, input_component_shape))
3814
3815 self._padded_shapes = nest.pack_sequence_as(input_shapes,

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py in _padded_shape_to_tensor(padded_shape, input_component_shape)
3721 # tf.TensorShape, so fall back on the conversion to tensor
3722 # machinery.
-> 3723 ret = ops.convert_to_tensor(padded_shape, preferred_dtype=dtypes.int64)
3724 if ret.shape.dims is not None and len(ret.shape.dims) != 1:
3725 six.reraise(ValueError, ValueError(

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
1312
1313 if ret is None:
-> 1314 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
1315
1316 if ret is NotImplemented:

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/constant_op.py in _tensor_shape_tensor_conversion_function(s, dtype, name, as_ref)
332 if not s.is_fully_defined():
333 raise ValueError(
--> 334 "Cannot convert a partially known TensorShape to a Tensor: %s" % s)
335 s_list = s.as_list()
336 int64_value = 0

ValueError: Cannot convert a partially known TensorShape to a Tensor:

Do you know what the problem is?

Thanks a lot!

atebbifakhr reopened this May 24, 2020
@atebbifakhr
Contributor Author

I should mention that I'm using TensorFlow 2.2 and the latest version of OpenNMT-tf.

@guillaumekln
Contributor

guillaumekln commented May 25, 2020

Can you check if the shape of the tensors returned by tf.py_function is defined? You may need to set it manually, see for example:

https://github.com/OpenNMT/OpenNMT-tf/blob/v2.9.3/opennmt/tokenizers/tokenizer.py#L163-L164
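
For illustration, here is a minimal sketch of that suggestion outside of OpenNMT-tf. It assumes the cause is that tensors returned by tf.py_function carry no static shape, so a padded batch cannot derive its padded shapes from them. The `_encode` function is a hypothetical stand-in for a real BERT tokenizer, and the feature names are made up for the example. The padded shapes are taken from the dataset's own element shapes, which is an approximation of what the traceback shows OpenNMT-tf doing, not its actual code.

```python
import tensorflow as tf


def _encode(text):
    # Hypothetical stand-in for a real tokenizer: one id per
    # whitespace-separated token (its character count) and a mask of ones.
    tokens = tf.compat.as_text(text.numpy()).split()
    ids = [len(token) for token in tokens]
    mask = [1] * len(ids)
    return ids, mask


def make_features(text):
    input_ids, attention_mask = tf.py_function(
        _encode, [text], [tf.int32, tf.int32])
    # The outputs of tf.py_function have an unknown static shape, so
    # declare that each one is a 1-D vector of unknown length.
    input_ids.set_shape([None])
    attention_mask.set_shape([None])
    return {"input_ids": input_ids, "attention_mask": attention_mask}


dataset = tf.data.Dataset.from_tensor_slices(["hello world", "a bc def"])
dataset = dataset.map(make_features)

# Derive the padded shapes from the element shapes. This only works
# because the rank of every feature is now known.
padded_shapes = {
    name: spec.shape for name, spec in dataset.element_spec.items()}
dataset = dataset.padded_batch(2, padded_shapes=padded_shapes)
batch = next(iter(dataset))
```

If the tokenizer always pads to a fixed length (128 in the snippet above), the shape could instead be declared as `[128]`; `[None]` is the more general choice when lengths vary.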
