Convert Array features to numpy arrays rather than lists by default #7210

Open
alex-hh opened this issue Oct 9, 2024 · 0 comments
Labels
enhancement New feature or request

Feature request

It is currently easy to cause massive slowdowns when using datasets without being familiar with the underlying data conversions, e.g. by choosing a poor output format.

Would it be more user-friendly to set defaults that avoid this as much as possible, e.g. formatting Array features as numpy arrays rather than Python lists?

Motivation

The default array formatting leads to slow iteration, e.g.:

import time

import numpy as np
from datasets import Dataset, Features, Array3D

features = Features(**{"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None, 10, 10), dtype="float32")})
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x, 10, 10), dtype=np.float32) for x in [2000, 1000] * 25] for i in range(2)}, features=features)
# Iterate with the default format (Array features come back as Python lists)
t0 = time.time()
for ex in dataset:
    pass
t1 = time.time()

~1.4 s

ds = dataset.to_iterable_dataset()
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

~10 s

ds = dataset.with_format("numpy")
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

~0.04 s

ds = dataset.to_iterable_dataset().with_format("numpy")
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

~0.04 s
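All four timings use the same empty loop, so the measurement could be wrapped in a small helper. This is only a sketch: `time_iteration` is a hypothetical name, and it uses `time.perf_counter` rather than the `time.time` calls in the snippets above. A plain `range` stands in for the dataset so the block runs on its own.

```python
import time
from typing import Iterable


def time_iteration(iterable: Iterable) -> float:
    """Return the seconds taken to iterate once over `iterable`."""
    start = time.perf_counter()
    for _ in iterable:
        pass
    return time.perf_counter() - start


# Stand-in data so the sketch runs without `datasets` installed.
elapsed = time_iteration(range(1000))
print(f"{elapsed:.6f} s")
```

With the `dataset` object from the snippets above, the comparison would then be `time_iteration(dataset)` against `time_iteration(dataset.with_format("numpy"))`, and likewise for the `to_iterable_dataset()` variants.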

Your contribution

I may be able to contribute.
