
[BUG] Loading big dataset with 2 million examples is causing an error #844

Open
Stealthwriter opened this issue Aug 29, 2024 · 3 comments
Labels
type/bug Bug in code

Comments

@Stealthwriter

🐛 Bug

[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-08-29 08:24:57,540] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:57,912] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:57,938] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,081] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,189] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,378] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,477] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,481] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,481] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-08-29 08:24:58,694] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-29 08:24:58,695] [INFO] [comm.py:637:init_distributed] cdb=None
2024-08-29 08:24:59,389 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total: 10 local rank: 0.
2024-08-29 08:25:02,431 - INFO: Problem Type: text_causal_classification_modeling
2024-08-29 08:25:02,431 - INFO: Global random seed: 309881
2024-08-29 08:25:02,431 - INFO: Preparing the data...
2024-08-29 08:25:02,431 - INFO: Setting up automatic validation split...
2024-08-29 08:25:05,211 - INFO: Preparing train and validation data
2024-08-29 08:25:05,211 - INFO: Loading train dataset...
ERROR:root:Exception occurred during H2O LLM Studio run:
Traceback (most recent call last):
File "/workspace/train_wave.py", line 104, in
run(cfg=cfg)
File "/workspace/train.py", line 557, in run
train_dataset = get_train_dataset(train_df=train_df, cfg=cfg)
File "/workspace/llm_studio/src/utils/data_utils.py", line 388, in get_train_dataset
train_dataset: Dataset = cfg.dataset.dataset_class(
File "/workspace/llm_studio/src/datasets/text_causal_classification_ds.py", line 18, in init
super().init(df=df, cfg=cfg, mode=mode)
File "/workspace/llm_studio/src/datasets/text_causal_language_modeling_ds.py", line 30, in init
self.tokenizer = get_tokenizer(self.cfg)
File "/workspace/llm_studio/src/datasets/text_utils.py", line 44, in get_tokenizer
tokenizer_class = AutoTokenizer.from_pretrained(
File "/workspace/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 896, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/workspace/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2291, in from_pretrained
return cls._from_pretrained(
File "/workspace/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2525, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/workspace/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 134, in init
raise ValueError(
ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a tokenizers library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
[2024-08-29 08:25:16,678] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1192
[2024-08-29 08:25:16,852] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1193
[2024-08-29 08:25:17,025] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1194
[2024-08-29 08:25:17,198] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1195
[2024-08-29 08:25:17,371] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1196
[2024-08-29 08:25:17,544] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1197
[2024-08-29 08:25:17,717] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1198
[2024-08-29 08:25:17,890] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1199
[2024-08-29 08:25:17,890] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1200
[2024-08-29 08:25:18,063] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1201
[2024-08-29 08:25:18,236] [ERROR] [launch.py:325:sigkill_handler] ['/workspace/.venv/bin/python', '-u', 'train_wave.py', '--local_rank=9', '-Y', '/home/llmstudio/mount/output/user/classifier-cudo-llama31-lower-900k.1.1/cfg.yaml'] exits with return code = -9
INFO: 127.0.0.1:58570 - "POST / HTTP/1.1" 200 OK
2024-08-29 08:31:22,777 - INFO: {'settings/content', 'home/compute_stats', 'home/disk_usage', 'experiment/list', 'init_app', 'home/experiments_stats', 'dataset/list', 'home/gpu_stats', 'dataset/display/footer', 'experiment/start', 'settings/footer', 'dataset/import/footer', 'dataset/import', 'experiment/start/footer'}
2024-08-29 08:31:22,795 - INFO: Experiment path /home/llmstudio/mount/output/user/classifier-cudo-llama31-lower-900k.1.1/charts.db not found.
2024-08-29 08:31:22,807 - INFO: Experiment path /home/llmstudio/mount/output/user/classifier-cudo-llama31-lower-900k.1/charts.db not found.
2024-08-29 08:31:22,818 - INFO: Experiment path /home/llmstudio/mount/output/user/classifier-cudo-llama31-lower-900k/charts.db not found.
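
The ValueError at the end of the traceback indicates the fast tokenizer backend could not be built, which usually points to a missing sentencepiece dependency (or missing tokenizer files) rather than anything related to the dataset size. A minimal sketch to check the tokenizer in isolation; the model id is an assumption, substitute the backbone from your cfg.yaml:

```python
# Hypothetical standalone check, outside of H2O LLM Studio, that the tokenizer
# backend can be instantiated. The model id below is an assumption; replace it
# with the backbone configured in your experiment's cfg.yaml.
# Prerequisite (assumption): pip install transformers sentencepiece

from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B"  # placeholder, not from the report

tokenizer = AutoTokenizer.from_pretrained(model_name)
# A *Fast tokenizer class here means the backend was built successfully.
print(type(tokenizer).__name__)
```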

To Reproduce

Try to train Llama 3.1 8B for text classification on a dataset of 2 million examples (300 MB+).
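
For reference, a minimal sketch that generates a comparably sized synthetic classification CSV for reproduction; column names, label values, and row texts are assumptions, not taken from the original dataset:

```python
# Hypothetical generator for a ~2M-row synthetic text classification CSV.
# Column names ("text", "label") are assumptions; align them with the columns
# configured for the dataset in LLM Studio.
import random
import string

import pandas as pd

def random_sentence(n_words: int = 25) -> str:
    # Build one pseudo-sentence from random lowercase "words".
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 10)))
        for _ in range(n_words)
    )

n_rows = 2_000_000
df = pd.DataFrame(
    {
        "text": [random_sentence() for _ in range(n_rows)],
        "label": [random.randint(0, 1) for _ in range(n_rows)],
    }
)
df.to_csv("synthetic_classification_2m.csv", index=False)  # roughly 300 MB
```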

LLM Studio version

@Stealthwriter added the type/bug label on Aug 29, 2024
@Stealthwriter
Author

Using the nightly Docker build.

@pascal-pfeiffer
Collaborator

Thank you for reporting, @Stealthwriter.
At first glance, the error does not appear to be related to the size of the dataset, but rather to the model and settings used for the training. To investigate further, would you be able to share the cfg.yaml that was used for your experiment? Does this issue only happen with DeepSpeed, and what is the specific reason for using DeepSpeed with an 8B model?

@pascal-pfeiffer
Collaborator

I just trained a model with 2 million samples without any errors. Could you please attach a config or add more information about the settings for which this was failing, so we can pinpoint the issue?
