[BUG] Loading big dataset with 2 million examples is causing an error #844
Comments
Using the nightly build Docker image.

Thank you for reporting @Stealthwriter

I just trained a model with 2 million samples without any errors. Could you please attach a config or add more information about the settings this was failing with for you, so we can pinpoint the issue?
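The traceback below points at tokenizer instantiation rather than dataset size, so one quick diagnostic is to check whether sentencepiece is importable inside the container. A minimal sketch (assuming you run it in the same environment as the failing experiment):

```python
# The ValueError in the traceback says a slow tokenizer could not be
# converted to a fast one because sentencepiece is missing. This probes
# for the relevant packages without loading any model weights.
from importlib.util import find_spec

for pkg in ("sentencepiece", "tokenizers", "transformers"):
    status = "installed" if find_spec(pkg) is not None else "MISSING"
    print(f"{pkg}: {status}")
```

If sentencepiece shows as MISSING, the failure is independent of the 2-million-row dataset and would reproduce with any dataset using that tokenizer.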
🐛 Bug
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-08-29 08:24:57,540] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:57,912] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:57,938] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,081] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,189] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,378] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,477] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,481] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,481] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-08-29 08:24:58,694] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,695] [INFO] [comm.py:637:init_distributed] cdb=None
2024-08-29 08:24:59,389 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total: 10 local rank: 0.
2024-08-29 08:25:02,431 - INFO: Problem Type: text_causal_classification_modeling
2024-08-29 08:25:02,431 - INFO: Global random seed: 309881
2024-08-29 08:25:02,431 - INFO: Preparing the data...
2024-08-29 08:25:02,431 - INFO: Setting up automatic validation split...
2024-08-29 08:25:05,211 - INFO: Preparing train and validation data
2024-08-29 08:25:05,211 - INFO: Loading train dataset...
ERROR:root:Exception occurred during H2O LLM Studio run:
Traceback (most recent call last):
File "/workspace/train_wave.py", line 104, in <module>
run(cfg=cfg)
File "/workspace/train.py", line 557, in run
train_dataset = get_train_dataset(train_df=train_df, cfg=cfg)
File "/workspace/llm_studio/src/utils/data_utils.py", line 388, in get_train_dataset
train_dataset: Dataset = cfg.dataset.dataset_class(
File "/workspace/llm_studio/src/datasets/text_causal_classification_ds.py", line 18, in __init__
super().__init__(df=df, cfg=cfg, mode=mode)
File "/workspace/llm_studio/src/datasets/text_causal_language_modeling_ds.py", line 30, in __init__
self.tokenizer = get_tokenizer(self.cfg)
File "/workspace/llm_studio/src/datasets/text_utils.py", line 44, in get_tokenizer
tokenizer_class = AutoTokenizer.from_pretrained(
File "/workspace/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 896, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/workspace/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2291, in from_pretrained
return cls._from_pretrained(
File "/workspace/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2525, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/workspace/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 134, in __init__
raise ValueError(
ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a `tokenizers` library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
[2024-08-29 08:25:16,678] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1192
[2024-08-29 08:25:16,852] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1193
[2024-08-29 08:25:17,025] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1194
[2024-08-29 08:25:17,198] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1195
[2024-08-29 08:25:17,371] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1196
[2024-08-29 08:25:17,544] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1197
[2024-08-29 08:25:17,717] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1198
[2024-08-29 08:25:17,890] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1199
[2024-08-29 08:25:17,890] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1200
[2024-08-29 08:25:18,063] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1201
[2024-08-29 08:25:18,236] [ERROR] [launch.py:325:sigkill_handler] ['/workspace/.venv/bin/python', '-u', 'train_wave.py', '--local_rank=9', '-Y', '/home/llmstudio/mount/output/user/classifier-cudo-llama31-lower-900k.1.1/cfg.yaml'] exits with return code = -9
INFO: 127.0.0.1:58570 - "POST / HTTP/1.1" 200 OK
2024-08-29 08:31:22,777 - INFO: {'settings/content', 'home/compute_stats', 'home/disk_usage', 'experiment/list', 'init_app', 'home/experiments_stats', 'dataset/list', 'home/gpu_stats', 'dataset/display/footer', 'experiment/start', 'settings/footer', 'dataset/import/footer', 'dataset/import', 'experiment/start/footer'}
2024-08-29 08:31:22,795 - INFO: Experiment path /home/llmstudio/mount/output/user/classifier-cudo-llama31-lower-900k.1.1/charts.db not found.
2024-08-29 08:31:22,807 - INFO: Experiment path /home/llmstudio/mount/output/user/classifier-cudo-llama31-lower-900k.1/charts.db not found.
2024-08-29 08:31:22,818 - INFO: Experiment path /home/llmstudio/mount/output/user/classifier-cudo-llama31-lower-900k/charts.db not found.
To Reproduce
Try to train Llama 3.1 8B on a dataset of 2 million examples for text classification; the dataset is over 300 MB.
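To separate the dataset's size from the tokenizer error above, a sketch that builds a comparably sized classification frame (column names and contents are hypothetical stand-ins, not the actual schema):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a 2M-row text-classification dataset.
# If this loads fine while training still fails at get_tokenizer,
# the row count is unlikely to be the cause.
n = 2_000_000
df = pd.DataFrame({
    "text": ["example text"] * n,
    "label": np.random.randint(0, 2, size=n),
})
print(len(df))
```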
LLM Studio version