
How can I shut down automatically distributing data when using StreamingDataset? #368

Open
ygtxr1997 opened this issue Sep 12, 2024 · 3 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@ygtxr1997

🚀 Feature

Provide an option to turn off the automatic distributed data splitting when using StreamingDataset.

Motivation

When I use StreamingDataset in a DDP environment, the dataset length of StreamingDataset always seems to be original_len / world_size.
But I want the different processes (with different local_ranks) to share exactly the same StreamingDataset, without any data splitting.
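
For example (the input_dir below is just a placeholder for my optimized dataset), when I launch with torchrun, each rank only reports a fraction of the full length:

```python
import os
from litdata import StreamingDataset

dataset = StreamingDataset(input_dir="s3://my-bucket/optimized")  # placeholder path

# Launched with e.g. `torchrun --nproc_per_node=4 train.py`, each rank prints
# roughly original_len / 4 instead of the full dataset length.
print(os.environ.get("LOCAL_RANK"), len(dataset))
```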

Pitch

How can I stop the automatic data distribution when using StreamingDataset in DDP? Or could you provide a setting for this? Or could you explain why the distribution can't be turned off?

Alternatives

Additional context

ygtxr1997 added the enhancement (New feature or request) label Sep 12, 2024

Hi! Thanks for your contribution, great first issue!

bhimrazy added the question (Further information is requested) and enhancement (New feature or request) labels and removed the enhancement label Sep 13, 2024
@tchaton
Collaborator

tchaton commented Sep 14, 2024

Hey @ygtxr1997. You can override the distributed env on the dataset; it is inferred automatically from torch.
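
Something along these lines should do it. Untested sketch: I'm assuming here that the dataset keeps the detected env on a `distributed_env` attribute and that `_DistributedEnv` lives in `litdata.utilities.env`; please double-check the names against the litdata version you have installed.

```python
from litdata import StreamingDataset
from litdata.utilities.env import _DistributedEnv  # assumed import path

dataset = StreamingDataset(input_dir="s3://my-bucket/optimized")  # placeholder path

# Pretend this is a single-process run so no per-rank splitting is applied.
dataset.distributed_env = _DistributedEnv(world_size=1, global_rank=0, num_nodes=1)

# Every DDP rank now iterates over the full dataset instead of 1/world_size of it.
```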

What is your use case?

@ygtxr1997
Author

Hey @ygtxr1997. You can override the distributed env on the dataset; it is inferred automatically from torch.

What is your use case?

I think overriding the distributed env on the dataset should solve my issue above.

Initially, I wanted to use litdata to optimize my dataset, which consists of ~500k small files (each about 200 KB in size), all stored on a remote storage server. However, unlike typical image datasets, my dataloader needs to read from two distinct files per sample, and the gap between the two file indices varies within [20, 50], like a sliding window. For instance, the file indices in a data batch (batch_size=4) could look like this:

batch: [100,120], [1000,1030], [1020,1070], [500,550]
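
To make the pattern concrete, the wrapper I have in mind looks roughly like this (just a sketch; `PairedWindowDataset` and the gap bounds are made up, and I'm assuming StreamingDataset supports random access via `__getitem__`):

```python
import random
from torch.utils.data import Dataset

class PairedWindowDataset(Dataset):  # hypothetical wrapper around a StreamingDataset
    def __init__(self, streaming_dataset, min_gap=20, max_gap=50):
        self.ds = streaming_dataset
        self.min_gap = min_gap
        self.max_gap = max_gap

    def __len__(self):
        # Reserve room at the end so index + gap stays in range.
        return len(self.ds) - self.max_gap

    def __getitem__(self, index):
        gap = random.randint(self.min_gap, self.max_gap)
        # Two reads per sample that may land in different chunks,
        # which is why the access looks random-read-like to me.
        return self.ds[index], self.ds[index + gap]
```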

Judging from your usage example and the distributed data loading illustration GIF, litdata doesn't seem well suited to such a random-read-like access pattern, am I right? Maybe the performance depends on how the original files are merged into litdata chunks. Keeping the original file order (from small index to large index) could give faster loading, but might that affect the training of deep models?

Therefore, I'm not sure whether litdata can help me and speed up data loading in my case.
