Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Populate make_classification_df date functionality with random dates #851

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 17 additions & 4 deletions dask_ml/datasets.py
Original file line number Diff line number Diff line change
Expand Up @@ -381,10 +381,11 @@ def make_classification(
return X, y


def random_date(start, end):
def random_date(start, end, random_state=None):
rng_random_date = dask_ml.utils.check_random_state(random_state)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
rng_random_date = dask_ml.utils.check_random_state(random_state)
rng_random_date = sklearn.utils.check_random_state(random_state)

That way the .compute() can be avoided (especially relevant on repeated calls.).

delta = end - start
int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
random_second = np.random.randint(int_delta)
random_second = rng_random_date.randint(int_delta).compute().item()
return start + timedelta(seconds=random_second)


Expand Down Expand Up @@ -430,6 +431,13 @@ def make_classification_df(
The output values.

"""
if (
random_state is not None
or not isinstance(random_state, np.random.RandomState)
or not isinstance(random_state, int)
):
random_state = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the block necessary? None is already the default value for random_state.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was to address the issue you mentioned earlier, "what if random_state is not an integer or any of the accepted values"


X_array, y_array = make_classification(
n_samples=n_samples,
flip_y=(1 - predictability),
Expand All @@ -451,8 +459,13 @@ def make_classification_df(
[
X_df,
dd.from_array(
np.array([random_date(*dates)] * len(X_df)),
chunksize=chunks,
np.array(
[
random_date(*dates, random_state + i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when random_state isn't an integer? Scikit-learn allows for random_state to be an integer, None or an instance of np.random.RandomState (source).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I will have to raise a ValueError exception there, on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use this code?

rng = check_random_state(random_state)
dates = [random_date(*dates, rng) for i in range(len(X_df))]
...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code above will produce the same random number since the seed(rng) remains the same in subsequent calls.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main point: I think np.random.RandomState and None should be acceptable types for random_state. I'm fine expanding the for-loop, though I don't think that needs to happen:

[ins] In [193]: def random_dates(random_state):
           ...:     return random_state.randint(100)
           ...:

[ins] In [194]: rng = np.random.RandomState(42)

[ins] In [196]: [random_dates(rng) for _ in range(20)]
Out[196]: [51, 92, 14, 71, 60, 20, 82, 86, 74, 74, 87, 99, 23, 2, 21, 52, 1, 87, 29, 37]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can maybe just check if random_state is one of the accepted values like this and accordingly proceed -
if random_state is not None or not isinstance(random_state, np.random.RandomState) or not isinstance(random_state,int): print("random_state is not to be accepted")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That runs counter to the use of random_state in Scikit-learn. random_date is public, so it should accept all types of random_state that Scikit-learn accepts.

If random_date were a private function, I wouldn't really care

Copy link
Author

@BhuvanashreeM BhuvanashreeM Sep 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

random_state : int, RandomState instance or None, optional (default=None), these are values accepted by Scikit-Learn's random_state. I think I can check if the random_state is in neither of the accepted values(Scitkit and Numpy) and set is as the default None.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scikit-learn's check_random_state function will likely be useful: https://scikit-learn.org/stable/modules/generated/sklearn.utils.check_random_state.html

It takes those values and produces the correct random seed generator.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stsievert the accepted types of random_state in Scikit and Numpy appear to be the same.
Refer this: Scikit's version also accepts Numpy's accepted values. Refer this: https://scikit-learn.org/dev/glossary.html#term-random_state

for i in range(len(X_df))
]
),
chunksize=n_samples,
columns=["date"],
),
],
Expand Down
12 changes: 12 additions & 0 deletions tests/test_datasets.py
Original file line number Diff line number Diff line change
Expand Up @@ -73,10 +73,22 @@ def test_make_classification_df():
dates=(date(2014, 1, 1), date(2015, 1, 1)),
)

X_df1, y_series1 = dask_ml.datasets.make_classification_df(
n_samples=100,
n_features=5,
random_state=123,
chunks=100,
dates=(date(2014, 1, 1), date(2015, 1, 1)),
)
check_randomness = np.unique((X_df["date"] == X_df1["date"]).compute())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good, this checks the random state.

Shouldn't this also check that there's more than one unique value? That's what #845 is focused on.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the code I've written checks for repeatability, on account of the seed. Since the the numpy's randint function is a deterministic pseudo random number generator, we can be sure that it will produce a random set of numbers. I've ensured in the code that the function random_date is called multiple times, each with a different seed. Hence I feel the line np.unique(X["date"]).size >= threshold would be redundant. Open to any thoughts you might have here @stsievert

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same random_seed=123 is passed to both calls to make_classification_df; I would expect every value in X_df["date"] and X_df1["date"] to be the same.

I think X["date"].nunique() >= threshold would be a lot simpler.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that sounds good. Now I've to figure out, what would be a good threshold. Will n_samples/2 be good?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, threshold=n_samples/2 is more than good. I think threshold=2 would suffice; that'd make sure #845 is resolved.


assert X_df is not None
assert y_series is not None
assert "date" in X_df.columns
assert len(X_df.columns) == 6
assert len(X_df) == 100
assert len(y_series) == 100
assert isinstance(y_series, dask.dataframe.core.Series)
assert check_randomness.size == 1
assert check_randomness[0] is True
assert np.unique(X_df["date"]).size >= 2