Wrong training procedure? #237

Closed
repodiac opened this issue Sep 28, 2021 · 6 comments

repodiac commented Sep 28, 2021

I trained an extension of model sentence-transformers/paraphrase-multilingual-mpnet-base-v2 (see #235).

After training, I used the script save_pretrained_hf.py to convert it to a Hugging Face Transformers-compatible format.

When I now run the example code for mean-pooling embeddings, I get the following warning (output_bs32_ep20_export is my exported model):

Some weights of the model checkpoint at /tf/data/output_bs32_ep20_export were not used when initializing XLMRobertaModel: ['lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaModel were not initialized from the model checkpoint at /tf/data/output_bs32_ep20_export and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Any idea why this occurs? Is what the warning says true, or can I ignore it?
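
For reference, a minimal sketch of the loading step that triggers this warning (the path is the exported directory from above; the actual example script may differ slightly):

    from transformers import AutoModel, AutoTokenizer

    # AutoModel resolves to XLMRobertaModel for this checkpoint. The plain
    # encoder has no lm_head module, so the lm_head.* weights stored in the
    # exported checkpoint are reported as unused, and the freshly added
    # pooler weights are reported as newly initialized.
    model_path = "/tf/data/output_bs32_ep20_export"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModel.from_pretrained(model_path)  # emits the warning above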

JohnGiorgi (Owner) commented Sep 28, 2021

Hmm, your pretrained model does not have weights for ['lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias']

Did you set masked_language_modeling to True in the config? If so, the model would have been loaded with AutoModelForMaskedLM (see here) and I would have expected those weights to have been trained.

Still, maybe I am wrong and lm_head is not used by your particular model. I think it is still worth evaluating the model you have trained and seeing whether it performs well on your downstream tasks.

repodiac (Author) commented

"model": {
        "type": "declutr",
        "text_field_embedder": {
            "type": "mlm",
            "token_embedders": {
                "tokens": {
                    "type": "pretrained_transformer_mlm",
                    "model_name": transformer_model,
                    "masked_language_modeling": true
                },
            },
        },
        "loss": {
            "type": "nt_xent",
            "temperature": 0.05,
        },
        // There was a small bug in the original implementation that caused gradients derived from
        // the contrastive loss to be scaled by 1/N, where N is the number of GPUs used during
        // training. This has been fixed. To reproduce results from the paper, set this to false.
        // Note that this will have no effect if you are not using distributed training with more
        // than 1 GPU.
        "scale_fix": false
    },

However, as I wrote in #118 (comment), in the continued/restarted runs I used the first model via from_archive (see the config below). Is that the problem?

    "model": {
        "type": "from_archive",
        "archive_file": "/notebooks/DeCLUTR/output_bs32_ep10/model.tar.gz"
    },
  • The underlying model, according to Hugging Face, seems to be XLMRobertaModel - does it not use the referenced lm_head parameters in training? I doubt it...

  • Something has been trained for sure :) The embeddings are significantly different from the base model (sentence-transformers/paraphrase-multilingual-mpnet-base-v2) when used for semantic textual similarity (see the comparison sketch below), but I wonder whether I am missing something here if the model "complains" in such a manner?

Any clarification is highly appreciated!
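
A rough sketch of the similarity comparison mentioned in the second bullet above, assuming mean pooling over the last hidden states (the sentences are placeholders):

    import torch
    from transformers import AutoModel, AutoTokenizer

    def embed(model_name_or_path, sentences):
        # Mean-pool the final-layer token embeddings, ignoring padding tokens.
        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
        model = AutoModel.from_pretrained(model_name_or_path)
        inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

    sentences = ["A sample sentence.", "Ein Beispielsatz."]
    base = embed("sentence-transformers/paraphrase-multilingual-mpnet-base-v2", sentences)
    tuned = embed("/tf/data/output_bs32_ep20_export", sentences)

    # Compare the similarity of the sentence pair under each model.
    print(torch.nn.functional.cosine_similarity(base[0], base[1], dim=0))
    print(torch.nn.functional.cosine_similarity(tuned[0], tuned[1], dim=0))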

JohnGiorgi (Owner) commented

I think you are free to ignore these messages. I imagine this happens because somewhere during loading of the model, AutoModel.from_pretrained is used, so the weights of lm_head are not initialized, which is OK because we don't use them to produce sentence embeddings.
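
A minimal sketch of that distinction, using the exported directory from above: the same checkpoint loaded with AutoModelForMaskedLM keeps the lm_head weights, while AutoModel (the class used to produce embeddings) has no lm_head module at all, hence the warning.

    from transformers import AutoModel, AutoModelForMaskedLM

    path = "/tf/data/output_bs32_ep20_export"
    encoder = AutoModel.from_pretrained(path)         # XLMRobertaModel, no lm_head
    mlm = AutoModelForMaskedLM.from_pretrained(path)  # XLMRobertaForMaskedLM, with lm_head

    # The lm_head.* parameters exist only in the MaskedLM variant.
    print(any(n.startswith("lm_head") for n, _ in encoder.named_parameters()))  # False
    print(any(n.startswith("lm_head") for n, _ in mlm.named_parameters()))      # True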

repodiac (Author) commented

I have to admit that I am not particularly familiar with the underlying XLMRobertaModel, but lm_head sounds to me like the last hidden layer (in general, you put a task-specific head on top, e.g. a softmax for classification tasks). So for embeddings, I would expect lm_head to be used as the last layer?

JohnGiorgi (Owner) commented

The example code you cited uses mean pooling on the token embeddings from the model's last transformer block. This doesn't require lm_head.
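
To illustrate, a small sketch using the public xlm-roberta-base checkpoint (only to show the module layout; the exported model has the same structure):

    from transformers import AutoModelForMaskedLM

    mlm = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
    print(type(mlm).__name__)                          # XLMRobertaForMaskedLM
    print([name for name, _ in mlm.named_children()])  # ['roberta', 'lm_head']

    # The token embeddings used for mean pooling are the encoder's last hidden
    # states, i.e. mlm.roberta(...).last_hidden_state; mlm.lm_head only projects
    # those hidden states to vocabulary logits for masked language modeling.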

JohnGiorgi (Owner) commented

Closing this, feel free to re-open if you are still having issues.
