
How can I shut down automatically distributing data when using StreamingDataset? #368

Open
ygtxr1997 opened this issue Sep 12, 2024 · 3 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@ygtxr1997

🚀 Feature

Provide an option to turn off the automatic distributed data splitting when using StreamingDataset.

Motivation

When I use StreamingDataset in a DDP environment, the dataset length of StreamingDataset always seems to be original_len / world_size.
But I want the different processes (with different local_ranks) to share exactly the same StreamingDataset, without any data splitting.
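
For example (the input_dir below is just a placeholder for my optimized dataset), when I launch with torchrun, each rank only reports a fraction of the full length:

```python
import os
from litdata import StreamingDataset

dataset = StreamingDataset(input_dir="s3://my-bucket/optimized")  # placeholder path

# Launched with e.g. `torchrun --nproc_per_node=4 train.py`, each rank prints
# roughly original_len / 4 instead of the full dataset length.
print(os.environ.get("LOCAL_RANK"), len(dataset))
```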

Pitch

How can I stop the automatic data distribution when using StreamingDataset in DDP? Or could you provide a setting for this? Or could you explain why the distribution can't be turned off?

Alternatives

Additional context

ygtxr1997 added the enhancement (New feature or request) label Sep 12, 2024

Hi! Thanks for your contribution, great first issue!

bhimrazy added the question (Further information is requested) and enhancement (New feature or request) labels and removed the enhancement label Sep 13, 2024
@tchaton
Collaborator

tchaton commented Sep 14, 2024

Hey @ygtxr1997. You can override the distributed env on the dataset; it is inferred automatically from torch.
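
Something along these lines should do it. Untested sketch: I'm assuming here that the dataset keeps the detected env on a `distributed_env` attribute and that `_DistributedEnv` lives in `litdata.utilities.env`; please double-check the names against the litdata version you have installed.

```python
from litdata import StreamingDataset
from litdata.utilities.env import _DistributedEnv  # assumed import path

dataset = StreamingDataset(input_dir="s3://my-bucket/optimized")  # placeholder path

# Pretend this is a single-process run so no per-rank splitting is applied.
dataset.distributed_env = _DistributedEnv(world_size=1, global_rank=0, num_nodes=1)

# Every DDP rank now iterates over the full dataset instead of 1/world_size of it.
```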

What is your use case?

@ygtxr1997
Author

Hey @ygtxr1997. You can override the distributed env on the dataset; it is inferred automatically from torch.

What is your use case?

I think overriding the distributed env on the dataset should solve my issue above.

Initially, I wanted to use litdata to optimize my dataset, which consists of ~500k small files (each about 200 KB in size), all stored on a remote storage server. However, unlike typical image datasets, my dataloader needs to read from two distinct files per sample, and the gap between the two file indices varies within [20, 50], like a sliding window. For instance, the file indices in a data batch (batch_size=4) could look like this:

batch: [100,120], [1000,1030], [1020,1070], [500,550]
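
To make the pattern concrete, the wrapper I have in mind looks roughly like this (just a sketch; `PairedWindowDataset` and the gap bounds are made up, and I'm assuming StreamingDataset supports random access via `__getitem__`):

```python
import random
from torch.utils.data import Dataset

class PairedWindowDataset(Dataset):  # hypothetical wrapper around a StreamingDataset
    def __init__(self, streaming_dataset, min_gap=20, max_gap=50):
        self.ds = streaming_dataset
        self.min_gap = min_gap
        self.max_gap = max_gap

    def __len__(self):
        # Reserve room at the end so index + gap stays in range.
        return len(self.ds) - self.max_gap

    def __getitem__(self, index):
        gap = random.randint(self.min_gap, self.max_gap)
        # Two reads per sample that may land in different chunks,
        # which is why the access looks random-read-like to me.
        return self.ds[index], self.ds[index + gap]
```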

Judging from your usage example and the distributed data loading illustration GIF, litdata doesn't seem well suited to such a random-read-like access pattern, am I right? Maybe the performance depends on how the original files are merged into litdata chunks. Keeping the original file order (from small index to large index) could give faster loading, but might that affect the training of deep models?

Therefore, I'm not sure whether litdata can help me and speed up data loading in my case.
