Iterable dataset map with explicit features causes slowdown for Sequence features #7215

Open
alex-hh opened this issue Oct 10, 2024 · 0 comments

Describe the bug

When calling map, it is useful to be able to pass the new feature types, and this is in fact required by interleave_datasets and concatenate_datasets.

However, this can cause a major slowdown for certain types of array features due to the features being re-encoded.
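To illustrate why re-encoding can be expensive for long arrays, here is a minimal NumPy-only sketch. It assumes (this is not confirmed in the issue) that the re-encoding step involves something like a per-element conversion of each array into a Python list. `roundtrip_via_list` and `time_call` are hypothetical helpers written for this illustration, not functions from `datasets`.

```python
import time

import numpy as np


def roundtrip_via_list(arr: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for re-encoding: array -> Python list -> array."""
    # list(arr) iterates element by element and creates one NumPy scalar per item.
    as_list = list(arr)
    return np.asarray(as_list, dtype=arr.dtype)


def time_call(fn, arr: np.ndarray, repeats: int = 20) -> float:
    """Return the mean wall-clock seconds per call of fn(arr)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arr)
    return (time.perf_counter() - start) / repeats


# Same shape and dtype as the longer arrays in the reproduction.
arr = np.zeros((10000,), dtype=np.float32)
passthrough = time_call(lambda a: a, arr)
roundtrip = time_call(roundtrip_via_list, arr)
print(f"pass-through:    {passthrough:.6f} s per call")
print(f"list round trip: {roundtrip:.6f} s per call")
```

The absolute numbers depend on the machine. The point is only that any step whose cost scales with the number of elements becomes noticeable when applied to every example.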

This is separate from the slowdown reported in #7206.

Steps to reproduce the bug

import time
import numpy as np
from datasets import Dataset, Features, Sequence, Value

# Two variable-length float32 sequence columns.
features = Features({"array0": Sequence(feature=Value("float32"), length=-1), "array1": Sequence(feature=Value("float32"), length=-1)})
# 50 rows per column, alternating between lengths 5000 and 10000.
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x,), dtype=np.float32) for x in [5000, 10000] * 25] for i in range(2)}, features=features)

# Baseline: identity map without explicit features.
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x)
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
print(f"{t1 - t0:.2f} s")

~1.5 s on main

# Same identity map, but with the features passed explicitly.
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x, features=features)
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
print(f"{t1 - t0:.2f} s")

~3 s on main
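For a less noisy comparison of the two timings, a small helper along these lines could be used. This is only a sketch: `time_iteration` is a hypothetical name, and it assumes the object passed in can be iterated more than once, which is the case for a `range` and should be for an iterable dataset, but not for a plain generator.

```python
import time
from typing import Iterable, Tuple


def time_iteration(iterable: Iterable, repeats: int = 3) -> Tuple[float, int]:
    """Return (best wall-clock seconds, item count) over `repeats` full passes."""
    best = float("inf")
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        # Consume the iterable fully, counting items without storing them.
        count = sum(1 for _ in iterable)
        best = min(best, time.perf_counter() - start)
    return best, count


# Stand-in iterable so the sketch runs on its own; in the reproduction above
# the two `ds` objects would be passed instead.
elapsed, n_items = time_iteration(range(50))
print(f"{n_items} items in {elapsed:.6f} s")
```

Taking the best of several passes reduces the influence of one-off costs such as warm-up, and `time.perf_counter` is better suited to measuring short intervals than `time.time`.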

Expected behavior

I'm not 100% sure whether passing new feature types to formatted outputs of map should be supported. Assuming it should, there ought to be a cost-free way to specify the new feature types, since interleave_datasets and concatenate_datasets, for example, need to know them.

Environment info

3.0.2
