get_wikitext2 has bug #2020
Comments
@SunMarc is there a reason why …

Not sure. This was something TheBloke coded back then. Maybe this is because this does not happen, as we are slicing the tokenized data after:

```python
i = random.randint(0, enc.input_ids.shape[1] - seqlen - 1)
j = i + seqlen
inp = enc.input_ids[:, i:j]
attention_mask = torch.ones_like(inp)
```
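The slicing quoted above can be illustrated without torch; a minimal sketch using plain lists, where `enc_ids` and `sample_block` are stand-ins (not optimum names) for the tokenized corpus and the sampling step:

```python
import random

def sample_block(enc_ids, seqlen):
    """Pick a random contiguous window of seqlen tokens from the full
    tokenized corpus (mirrors the slicing quoted above)."""
    i = random.randint(0, len(enc_ids) - seqlen - 1)
    j = i + seqlen
    inp = enc_ids[i:j]
    attention_mask = [1] * len(inp)  # plain-list equivalent of torch.ones_like
    return inp, attention_mask

# The corpus is far longer than the model context (e.g. 73218 > 2048),
# but every sampled block is exactly seqlen tokens long.
corpus = list(range(73218))  # stand-in for enc.input_ids[0]
inp, mask = sample_block(corpus, seqlen=2048)
assert len(inp) == 2048 and len(mask) == 2048
```

This is why the warning is arguably harmless: the over-long sequence is never fed to the model, only seqlen-sized slices of it are.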
System Info
optimum version 1.21.4 (latest)

```dockerfile
# Use the official Python image from the Docker Hub
FROM public.ecr.aws/docker/library/python:3.10-slim
```
Who can help?
No response
Information

Tasks

examples folder (such as GLUE/SQuAD, ...)

Reproduction (minimal, reproducible, runnable)
Produces warning:

```
Token indices sequence length is longer than the specified maximum sequence length for this model (73218 > 2048). Running this sequence through the model will result in indexing errors
```
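The warning is triggered because `get_wikitext2` joins the whole dataset and tokenizes it in a single call. A minimal sketch of that pattern, using a hypothetical `encode` stub (not the real transformers tokenizer) that warns past a fixed `MODEL_MAX_LENGTH`, as Hugging Face tokenizers do:

```python
# Hypothetical stub mimicking a tokenizer that warns when one encoded
# sequence exceeds the model's maximum context length.
MODEL_MAX_LENGTH = 2048

def encode(text):
    ids = text.split()  # stand-in for real tokenization
    if len(ids) > MODEL_MAX_LENGTH:
        print(
            "Token indices sequence length is longer than the specified "
            f"maximum sequence length for this model ({len(ids)} > {MODEL_MAX_LENGTH})."
        )
    return ids

# Joining every document and tokenizing the whole corpus at once yields a
# single sequence far longer than the context window, hence the warning.
corpus = " ".join(["tok"] * 73218)
ids = encode(corpus)
assert len(ids) == 73218
```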
Expected behavior
This is the proposed fix, inspired by `get_c4` and `get_c4_new`. No warning is produced.
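A sketch of the `get_c4`-style approach: tokenize one randomly chosen document at a time (never the whole joined corpus), retrying until a document is long enough to slice a seqlen window from. The `tokenize` stub and `get_calibration_sample` name are illustrative assumptions, not optimum's actual code:

```python
import random

def tokenize(text):
    # Hypothetical stand-in; real code would call a Hugging Face
    # tokenizer on a single document, so no one call exceeds the limit.
    return text.split()

def get_calibration_sample(docs, seqlen):
    """Sample a seqlen-token window from one individually tokenized document."""
    while True:
        doc = docs[random.randint(0, len(docs) - 1)]
        ids = tokenize(doc)
        if len(ids) > seqlen:  # long enough to hold one full window
            break
    i = random.randint(0, len(ids) - seqlen - 1)
    return ids[i:i + seqlen]

docs = ["short text", " ".join(["tok"] * 3000)]
sample = get_calibration_sample(docs, seqlen=2048)
assert len(sample) == 2048
```

Because each tokenizer call sees only one document, no single encoded sequence exceeds the model maximum, which is why the warning disappears.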