Not able to fine-tune BLOOM model for a downstream task #286

Open
pushkarraj opened this issue Apr 12, 2023 · 0 comments
Code:


import logging
from ludwig.api import LudwigModel
from ludwig.datasets import agnews

# Loads the dataset as a pandas.DataFrame
train_df, test_df, _ = agnews.load(split=True)

# Prints a preview of the first five rows.
train_df.head(5)

config = {
    "input_features": [
        {
            "name": "title",  # The name of the input column
            "type": "text",  # Data type of the input column
            "encoder": {
                "type": "auto_transformer",  # The model architecture to use
                "pretrained_model_name_or_path": "bigscience/bloom-3b",
                "trainable": True,
            },
        },
    ],
    "output_features": [
        {
            "name": "class",
            "type": "category",
        }
    ],
    "trainer": {
        "learning_rate": 0.00001,
        "epochs": 3,  # We'll train for three epochs. Training longer might give
                      # better performance.
    },
    "backend": {
        "type": "ray",
        "trainer": {
            "strategy": "fsdp",  # fsdp distributed strategy for using a multi-GPU cluster
        },
    },
}

model = LudwigModel(config, logging_level=logging.INFO)
train_stats, preprocessed_data, output_directory = model.train(dataset=train_df)
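
For context, the script assumes an already-running multi-node Ray cluster. Below is a minimal pre-flight sketch, assuming the script is launched from the Ray head node; the explicit ray.init call and the resource printout are illustrative additions, not part of the failing run.

import ray

# Attach to the already-running Ray cluster instead of starting a local one.
ray.init(address="auto", ignore_reinit_error=True)

# Print the CPU/GPU resources the cluster reports, to confirm the worker
# nodes are registered before Ludwig's Ray backend schedules training.
print(ray.cluster_resources())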


Error:


2023-04-12 02:50:23,717 WARNING read_api.py:330 -- ⚠️ The number of blocks in this dataset (1) limits its parallelism to 1 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use .repartition(n) to increase the number of dataset blocks.
Parquet Files Sample: 100%|██████████| 1/1 [00:00<00:00, 2.12it/s]
Parquet Files Sample: 0%| | 0/1 [00:00<?, ?it/s]
(_sample_piece pid=2075, ip=172.31.47.162) 2023-04-12 02:50:24,737 INFO worker.py:772 -- Task failed with retryable exception: TaskID(45b3d0fcab720f49ffffffffffffffffffffffff01000000).
(_sample_piece pid=2075, ip=172.31.47.162) Traceback (most recent call last):
(_sample_piece pid=2075, ip=172.31.47.162) File "python/ray/_raylet.pyx", line 857, in ray._raylet.execute_task
(_sample_piece pid=2075, ip=172.31.47.162) File "python/ray/_raylet.pyx", line 861, in ray._raylet.execute_task
(_sample_piece pid=2075, ip=172.31.47.162) File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 461, in _sample_piece
(_sample_piece pid=2075, ip=172.31.47.162) piece = piece.subset(row_group_ids=[0])
(_sample_piece pid=2075, ip=172.31.47.162) File "pyarrow/_dataset_parquet.pyx", line 424, in pyarrow._dataset_parquet.ParquetFileFragment.subset
(_sample_piece pid=2075, ip=172.31.47.162) File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
(_sample_piece pid=2075, ip=172.31.47.162) File "pyarrow/_fs.pyx", line 1179, in pyarrow._fs._cb_open_input_file
(_sample_piece pid=2075, ip=172.31.47.162) File "/home/ray/anaconda3/lib/python3.8/site-packages/pyarrow/fs.py", line 394, in open_input_file
(_sample_piece pid=2075, ip=172.31.47.162) raise FileNotFoundError(path)
(_sample_piece pid=2075, ip=172.31.47.162) FileNotFoundError: /home/ray/1e10f286d91711edbbf702820dcb34a8.validation.parquet/part.00000000.parquet
2023-04-12 02:50:25,232 WARNING read_api.py:330 -- ⚠️ The number of blocks in this dataset (1) limits its parallelism to 1 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use .repartition(n) to increase the number of dataset blocks.
Parquet Files Sample: 100%|██████████| 1/1 [00:01<00:00, 1.49s/it]
Parquet Files Sample: 0%| | 0/1 [00:00<?, ?it/s]
(_sample_piece pid=2075, ip=172.31.47.162) 2023-04-12 02:50:25,243 INFO worker.py:772 -- Task failed with retryable exception: TaskID(06f28617326374dbffffffffffffffffffffffff01000000).
(_sample_piece pid=2075, ip=172.31.47.162) Traceback (most recent call last):
(_sample_piece pid=2075, ip=172.31.47.162) File "python/ray/_raylet.pyx", line 857, in ray._raylet.execute_task
(_sample_piece pid=2075, ip=172.31.47.162) File "python/ray/_raylet.pyx", line 861, in ray._raylet.execute_task
(_sample_piece pid=2075, ip=172.31.47.162) File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 461, in _sample_piece
(_sample_piece pid=2075, ip=172.31.47.162) piece = piece.subset(row_group_ids=[0])
(_sample_piece pid=2075, ip=172.31.47.162) File "pyarrow/_dataset_parquet.pyx", line 424, in pyarrow._dataset_parquet.ParquetFileFragment.subset
(_sample_piece pid=2075, ip=172.31.47.162) File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
(_sample_piece pid=2075, ip=172.31.47.162) File "pyarrow/_fs.pyx", line 1179, in pyarrow._fs._cb_open_input_file
(_sample_piece pid=2075, ip=172.31.47.162) File "/home/ray/anaconda3/lib/python3.8/site-packages/pyarrow/fs.py", line 394, in open_input_file
(_sample_piece pid=2075, ip=172.31.47.162) raise FileNotFoundError(path)
(_sample_piece pid=2075, ip=172.31.47.162) FileNotFoundError: /home/ray/1e10f286d91711edbbf702820dcb34a8.test.parquet/part.00000000.parquet
2023-04-12 02:50:26,238 WARNING read_api.py:330 -- ⚠️ The number of blocks in this dataset (1) limits its parallelism to 1 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use .repartition(n) to increase the number of dataset blocks.
Parquet Files Sample: 100%|██████████| 1/1 [00:00<00:00, 1.00it/s]

Dataset Statistics
╒════════════╤═══════════════╤════════════════════╕
│ Dataset │ Size (Rows) │ Size (In Memory) │
╞════════════╪═══════════════╪════════════════════╡
│ Training │ 80626 │ 12.44 Mb │
├────────────┼───────────────┼────────────────────┤
│ Validation │ 11383 │ 1.76 Mb │
├────────────┼───────────────┼────────────────────┤
│ Test │ 22890 │ 3.53 Mb │
╘════════════╧═══════════════╧════════════════════╛

╒═══════╕
│ MODEL │
╘═══════╛


After this, the BLOOM model downloads, but training never uses the resources of the Ray cluster.
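
One hypothetical follow-up check (not part of the run above) is whether the parquet cache written during preprocessing is actually reachable from the worker nodes; the path below is copied verbatim from the FileNotFoundError in the traceback, and the helper is illustrative only.

import os
import ray

@ray.remote
def path_exists(path):
    # Executes on an arbitrary worker and reports whether the file is visible there.
    return os.path.exists(path)

# Path taken from the FileNotFoundError above. If /home/ray on the head node is
# not on shared storage, this is expected to print False when run on a worker.
cache_path = "/home/ray/1e10f286d91711edbbf702820dcb34a8.validation.parquet/part.00000000.parquet"
print(ray.get(path_exists.remote(cache_path)))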
