
[BUG] Loading big dataset with 2 million examples is causing an error #844

Open
Stealthwriter opened this issue Aug 29, 2024 · 3 comments
Labels
type/bug Bug in code

Comments

@Stealthwriter

🐛 Bug

[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-08-29 08:24:57,540] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:57,912] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:57,938] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,081] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,189] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,378] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,477] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,481] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,481] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-08-29 08:24:58,694] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,695] [INFO] [comm.py:637:init_distributed] cdb=None
2024-08-29 08:24:59,389 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total: 10 local rank: 0.
2024-08-29 08:25:02,431 - INFO: Problem Type: text_causal_classification_modeling
2024-08-29 08:25:02,431 - INFO: Global random seed: 309881
2024-08-29 08:25:02,431 - INFO: Preparing the data...
2024-08-29 08:25:02,431 - INFO: Setting up automatic validation split...
2024-08-29 08:25:05,211 - INFO: Preparing train and validation data
2024-08-29 08:25:05,211 - INFO: Loading train dataset...
ERROR:root:Exception occurred during H2O LLM Studio run:
Traceback (most recent call last):
File "/workspace/train_wave.py", line 104, in
run(cfg=cfg)
File "/workspace/train.py", line 557, in run
train_dataset = get_train_dataset(train_df=train_df, cfg=cfg)
File "/workspace/llm_studio/src/utils/data_utils.py", line 388, in get_train_dataset
train_dataset: Dataset = cfg.dataset.dataset_class(
File "/workspace/llm_studio/src/datasets/text_causal_classification_ds.py", line 18, in init
super().init(df=df, cfg=cfg, mode=mode)
File "/workspace/llm_studio/src/datasets/text_causal_language_modeling_ds.py", line 30, in init
self.tokenizer = get_tokenizer(self.cfg)
File "/workspace/llm_studio/src/datasets/text_utils.py", line 44, in get_tokenizer
tokenizer_class = AutoTokenizer.from_pretrained(
File "/workspace/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 896, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/workspace/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2291, in from_pretrained
return cls._from_pretrained(
File "/workspace/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2525, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/workspace/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 134, in init
raise ValueError(
ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a tokenizers library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
[2024-08-29 08:25:16,678] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1192
[2024-08-29 08:25:16,852] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1193
[2024-08-29 08:25:17,025] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1194
[2024-08-29 08:25:17,198] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1195
[2024-08-29 08:25:17,371] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1196
[2024-08-29 08:25:17,544] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1197
[2024-08-29 08:25:17,717] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1198
[2024-08-29 08:25:17,890] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1199
[2024-08-29 08:25:17,890] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1200
[2024-08-29 08:25:18,063] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1201
[2024-08-29 08:25:18,236] [ERROR] [launch.py:325:sigkill_handler] ['/workspace/.venv/bin/python', '-u', 'train_wave.py', '--local_rank=9', '-Y', '/home/llmstudio/mount/output/user/classifier-cudo-llama31-lower-900k.1.1/cfg.yaml'] exits with return code = -9
INFO: 127.0.0.1:58570 - "POST / HTTP/1.1" 200 OK
2024-08-29 08:31:22,777 - INFO: {'settings/content', 'home/compute_stats', 'home/disk_usage', 'experiment/list', 'init_app', 'home/experiments_stats', 'dataset/list', 'home/gpu_stats', 'dataset/display/footer', 'experiment/start', 'settings/footer', 'dataset/import/footer', 'dataset/import', 'experiment/start/footer'}
2024-08-29 08:31:22,795 - INFO: Experiment path /home/llmstudio/mount/output/user/classifier-cudo-llama31-lower-900k.1.1/charts.db not found.
2024-08-29 08:31:22,807 - INFO: Experiment path /home/llmstudio/mount/output/user/classifier-cudo-llama31-lower-900k.1/charts.db not found.
2024-08-29 08:31:22,818 - INFO: Experiment path /home/llmstudio/mount/output/user/classifier-cudo-llama31-lower-900k/charts.db not found.
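
The ValueError at the end of the traceback indicates the fast tokenizer backend could not be built, which usually points to a missing sentencepiece dependency (or missing tokenizer files) rather than anything related to the dataset size. A minimal sketch to check the tokenizer in isolation; the model id is an assumption, substitute the backbone from your cfg.yaml:

```python
# Hypothetical standalone check, outside of H2O LLM Studio, that the tokenizer
# backend can be instantiated. The model id below is an assumption; replace it
# with the backbone configured in your experiment's cfg.yaml.
# Prerequisite (assumption): pip install transformers sentencepiece

from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B"  # placeholder, not from the report

tokenizer = AutoTokenizer.from_pretrained(model_name)
# A *Fast tokenizer class here means the backend was built successfully.
print(type(tokenizer).__name__)
```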

To Reproduce

Try to train Llama 3.1 8B for text classification on a dataset of 2 million examples (300 MB+).
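
For reference, a minimal sketch that generates a comparably sized synthetic classification CSV for reproduction; column names, label values, and row texts are assumptions, not taken from the original dataset:

```python
# Hypothetical generator for a ~2M-row synthetic text classification CSV.
# Column names ("text", "label") are assumptions; align them with the columns
# configured for the dataset in LLM Studio.
import random
import string

import pandas as pd

def random_sentence(n_words: int = 25) -> str:
    # Build one pseudo-sentence from random lowercase "words".
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 10)))
        for _ in range(n_words)
    )

n_rows = 2_000_000
df = pd.DataFrame(
    {
        "text": [random_sentence() for _ in range(n_rows)],
        "label": [random.randint(0, 1) for _ in range(n_rows)],
    }
)
df.to_csv("synthetic_classification_2m.csv", index=False)  # roughly 300 MB
```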

LLM Studio version

@Stealthwriter added the type/bug label on Aug 29, 2024
@Stealthwriter
Author

Using the nightly Docker build.

@pascal-pfeiffer
Collaborator

Thank you for reporting, @Stealthwriter.
At first glance, the error does not appear to be related to the size of the dataset, but rather to the model and settings used for the training. To investigate further, would you be able to share the cfg.yaml that was used for your experiment? Does this issue only happen with DeepSpeed, and what is the specific reason for using DeepSpeed with an 8B model?

@pascal-pfeiffer
Collaborator

I just trained a model with 2 million samples without any errors. Could you please attach a config or add more information about the settings for which this was failing, so we can pinpoint the issue?
