add rope alibi to encoder #1687
base: master
Conversation
@vince62s I updated CTranslate2 and the conversion went through without errors.
Without this change, it still gave the error.
Yes, you're correct. Were you able to run inference in CT2 without any issue? I am not merging yet because for AliBi it requires additional changes in C
So far I have tried inference only with gated-gelu activation on CTranslate2 v3.20.0, with no issues. I plan to try RoPE output next month, after training the corresponding model.
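For context, a minimal sketch of running such an inference test through the CTranslate2 Python API; the model directory and tokens below are placeholders, and tokenization is assumed to be done beforehand:

```python
import ctranslate2

# Placeholder model directory; this would be the converted gated-gelu model
# mentioned above.
translator = ctranslate2.Translator("ct2_model_dir", device="cpu")

# translate_batch expects pre-tokenized input (lists of subword tokens).
results = translator.translate_batch([["▁Hello", "▁world"]])
print(results[0].hypotheses[0])
```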
Hello,
I would like to insert a line between lines 8 and 9 to support gated-gelu activation:
"gated-gelu": common_spec.Activation.GELU,
Also, gated-gelu does not appear in the transformers.py script. You might want to add it after line 30 and modify line 1308 to include gated-gelu.
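For illustration, here is roughly what that mapping could look like once the suggested line is added; the dict name and the surrounding entries are assumptions, not the exact converter source:

```python
# Hedged sketch of the activation mapping in the OpenNMT-py converter.
# The name _SUPPORTED_ACTIVATIONS and the other entries are illustrative.
from ctranslate2.specs import common_spec

_SUPPORTED_ACTIVATIONS = {
    "relu": common_spec.Activation.RELU,
    "gelu": common_spec.Activation.GELU,
    "silu": common_spec.Activation.SWISH,
    # Suggested addition: map OpenNMT-py's gated-gelu onto the same GELU enum;
    # the gating itself is presumably handled by the gated FFN spec rather
    # than by the activation value.
    "gated-gelu": common_spec.Activation.GELU,
}
```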
Forget the transformers.py script: the Gemma transformers implementation uses GeGLU, but with a GELUTanh approximation (I just read an article about it), so there is no need to update it.
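As an aside on the GELUTanh remark, a small sketch comparing exact GELU with the tanh approximation (standard formulas, not CTranslate2 code):

```python
import math

def gelu_exact(x: float) -> float:
    # Exact GELU: x times the Gaussian CDF of x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation used by GELUTanh-style implementations.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

print(gelu_exact(1.0), gelu_tanh(1.0))  # the two values agree to about 3-4 decimals
```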
Yes, I confirm. I successfully trained a model with gated-GELU and RoPE and ran inference with LibreTranslate (CTranslate2 v3.20.0).
I tried updating CT2 to 4.2.1 and launching a training run. It breaks at the validation step with errors. At first I thought it was the "None" values set in the fix dated March 12, so I downgraded onmt to 3.5.0 and then further down to the original pinned version (3.4.1). But in the best case I get the same errors (at worst, training aborts at step 1 or doesn't start at all). I tried pretty much every version of CT2 4.x and onmt 3.5.x with compatible torch/CUDA. I also tried different data to check this out, and a different population method. Do you have any idea? From the errors, I think this is related to a "filtertoolong" transform that is systematically inserted in the data, but I am not sure.
Well, what I had not tried was installing two seemingly incompatible versions of CT2 (4.2.1) and OpenNMT-py (3.4.3). @LynxPDA had me update this way, and now it does work (at least with absolute PE, and probably RPE). Although... when I tried RoPE, converting the checkpoint to a CT2 model gave an "unexpected argument" error, explicit enough for me to comment out lines 343 and 344 of the transformer_spec.py script and get through with a seemingly working model (it was a toy config though; I still have to run a full training). The point is, I do not know whether these specs really work, or even whether RoPE is sufficiently implemented in OpenNMT-py 3.4.3 to perform as intended. So we'll have to work on a fix to use onmt-py 3.5.x, I guess. Does the above-mentioned bug come from the applied transforms? There are also empty "tgt_prefix" prefixes on top of the filtertoolong; could that be the issue? I attach the corresponding config file; since you have developed a lot of code for OpenNMT-py as well, you surely know better.
I think I've found the bug... it happens while calculating validation metrics at the first validation step. At that point, the training loop repeatedly calls the scoring utils, which manipulate "example" strings. The split() function was updated in the commit you pulled on Jan. 18 this year: fix "\n" tokenization + phi-2 new layer names (OpenNMT/OpenNMT-py#2552). I do not understand exactly how it fixes tokenizing "\n", but could you tell me whether inserting the ... code in the misc.py file would help? This is something I can pull out.
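For what it's worth, the difference boils down to standard str.split() behaviour; a minimal illustration (plain Python, not the actual misc.py code):

```python
# str.split() with no argument collapses runs of whitespace (including "\n")
# and never yields empty strings; split(" ") splits only on single spaces,
# so "\n" stays inside tokens and double spaces produce empty tokens.
example = "a  b\nc"

print(example.split())     # ['a', 'b', 'c']
print(example.split(" "))  # ['a', '', 'b\nc']
```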
@lecoqnicolas sorry for not responding in a timely manner but we switched to https://github.com/eole-nlp/eole |
OK, I'll follow eole to see when we can switch. Too much focus on LLMs in OpenNMT-py lately. As for the transform, there's no real need for it at validation, but I guess the bug will also appear during training and may break it should an empty token appear. I'm not a code guru, so I have surely missed something, but I'm thinking of switching back to the split() default as a fix for this issue. Is that possible without further bugs?
I don't think the double space is your issue.
I tried removing it earlier this morning, then launched a test and went on an errand; you are right, it's something else. I'll remove the filtertoolong altogether, because if there's any such call during training, it will abort too.
You still need it for training, otherwise you risk going OOM.
Isn't the risk of aborting training on some empty token bigger? I've implemented it, but I'm keeping my fingers crossed... Then, you might want to rebase your PR on version 4.3.1 of CTranslate2: the error on the "original_max_position_embeddings" argument was due to incomplete specs in the attention layer, which were updated in merged commit #1700. About the eole project: are you completely abandoning OpenNMT and CTranslate2, or will you still contribute to CT2? Last question: should I specify the "max_position_embeddings" and "original_max_position_embeddings" arguments, or is the default fine for a seq2seq transformer? Su specified 512 in his article, if I'm not mistaken, but that was a BERT-style model.
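For readers following along, a minimal NumPy sketch of plain rotary embeddings with the usual base of 10000 (Su et al.), using the split-half pairing convention; this is an illustration, not CTranslate2's kernel, and it ignores the scaling parameters discussed above, which, as far as I can tell, only matter for extended-context variants:

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Rotate channel pairs of x (shape [seq, dim]) by position-dependent angles.
    Sketch only; the function name and layout are not CTranslate2's."""
    seq, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)      # theta_i = base^(-2i/dim)
    angles = positions[:, None] * inv_freq[None, :]   # [seq, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(8, 64).astype(np.float32)
q_rot = apply_rope(q, np.arange(8, dtype=np.float32))
```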
Actually, this is a flash_attn2 issue: I (carefully) merged your PR's modifications into a CT2 version prior to the flash_attention2 implementation, reverted to onmt-py 3.4.1 which doesn't use it either, and successfully, if slowly, trained a RoPE-encoded model with gated-gelu activation. Flash_attn2 may be all the hype, but there are lots of legacy GPUs still out there working.