For a single record data frame train_test_split() sometimes assigns this single record to test set. #975

Open
KWiecko opened this issue Jun 29, 2023 · 4 comments

Comments

@KWiecko

KWiecko commented Jun 29, 2023

Describe the issue:

Disclaimer: I know the bug looks silly but I still wanted to give a heads up.

For a Dask DataFrame with only one record, train_test_split() sometimes returns an empty train set and a test set containing that single record. Is that the desired behavior?

Minimal Complete Verifiable Example:

import pandas as pd
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split


if __name__ == '__main__':

    for _ in range(20):

        df = pd.DataFrame({'x0': [0], 'x1': [1], 'y': [2]})

        ddf = dd.from_pandas(df, npartitions=1)
        x = ddf[['x0', 'x1']]
        y = ddf['y']

        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

        if x_train.shape[0].compute() == 0:
            print('x_train is empty!')
            break

Anything else we need to know?:

Nope

Environment:

  • Dask version: 2023.5.0
  • Dask ML version: 2023.3.24
  • Python version: 3.8.15
  • Operating System: Ubuntu 22.04
  • Install method (conda, pip, source): pip
@TomAugspurger
Member

TomAugspurger commented Jul 2, 2023

What's the behavior of scikit-learn here? We should match that, unless there's some reason not to.

One thing to note: we can't check the length of the DataFrame / array during graph construction. So if scikit-learn does any kind of length check, then we won't be able to (easily) match that behavior.
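For context on the "sometimes" in the report: without a length check, and assuming each row is assigned to the train or test side independently at random (an assumption about how the Dask split behaves, not something confirmed in this thread), a one-row frame would end up entirely in the test side roughly 30% of the time with test_size=0.3. A minimal pure-Python sketch of that reasoning, with hypothetical helper names:

```python
import random


def single_row_lands_in_test(rng, test_size=0.3):
    """Assign one row at random; True means it went to the test side."""
    return rng.random() < test_size


def empty_train_rate(trials=10_000, test_size=0.3, seed=0):
    """Fraction of simulated one-row splits that leave the train side empty."""
    rng = random.Random(seed)
    hits = sum(single_row_lands_in_test(rng, test_size) for _ in range(trials))
    return hits / trials


def chance_of_seeing_it(attempts=20, test_size=0.3):
    """Probability that at least one of `attempts` splits has an empty train set."""
    return 1 - (1 - test_size) ** attempts


if __name__ == "__main__":
    print(f"simulated empty-train rate: {empty_train_rate():.3f}")
    print(f"chance within 20 attempts:  {chance_of_seeing_it():.4f}")
```

Under this assumption, a loop of 20 attempts like the one in the report would almost always hit the empty train set at least once.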

@KWiecko
Author

KWiecko commented Jul 2, 2023

The following code (which should be equivalent to the dask code above):

import pandas as pd
from sklearn.model_selection import train_test_split

if __name__ == '__main__':
    for _ in range(20):

        df = pd.DataFrame({'x0': [0], 'x1': [1], 'y': [2]})

        x = df[['x0', 'x1']]
        y = df['y']

        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
        # line below throws identical error as line above
        # x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)

        if x_train.shape[0] == 0:
            print('x_train is empty!')
            break

throws the following error:

Traceback (most recent call last):
  File "/home/kw/Projects/upwork/gym/src/debug/fail_during_conversion.py", line 33, in <module>
    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)
  File "/home/kw/Projects/venvs/gym-test-venv/lib/python3.8/site-packages/sklearn/model_selection/_split.py", line 2562, in train_test_split
    n_train, n_test = _validate_shuffle_split(
  File "/home/kw/Projects/venvs/gym-test-venv/lib/python3.8/site-packages/sklearn/model_selection/_split.py", line 2236, in _validate_shuffle_split
    raise ValueError(
ValueError: With n_samples=1, test_size=None and train_size=0.7, the resulting train set will be empty. Adjust any of the aforementioned parameters.

So it looks like scikit-learn's default behavior in this case is to raise?
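The error above follows from how fractional sizes are rounded to row counts. The sketch below is an approximation of that arithmetic, not scikit-learn's actual code: it assumes the test count is rounded up and the train count is rounded down, which is consistent with the message in the traceback. `split_sizes` is a hypothetical helper name.

```python
import math


def split_sizes(n_samples, test_size=None, train_size=None):
    """Hypothetical helper: turn fractional sizes into (n_train, n_test)."""
    if test_size is None and train_size is None:
        test_size = 0.25  # assumed default when neither size is given

    # Assumption: test rows are rounded up, train rows are rounded down.
    n_test = math.ceil(test_size * n_samples) if test_size is not None else None
    n_train = math.floor(train_size * n_samples) if train_size is not None else None

    # Whichever side was not specified gets the remaining rows.
    if n_train is None:
        n_train = n_samples - n_test
    if n_test is None:
        n_test = n_samples - n_train

    if n_train == 0:
        raise ValueError(
            f"With n_samples={n_samples}, test_size={test_size} and "
            f"train_size={train_size}, the resulting train set will be empty."
        )
    return n_train, n_test


if __name__ == "__main__":
    print(split_sizes(10, test_size=0.5))
    try:
        split_sizes(1, test_size=0.3)
    except ValueError as exc:
        print(exc)
```

With one row, test_size=0.3 rounds up to one test row and train_size=0.7 rounds down to zero train rows, so both calls in the snippet above leave the train set empty.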

@narnia24

Hey, can I work on this issue?

@TomAugspurger
Member

Sure, thanks.
