Error with string features (pandas) #131

Open
marcoslbueno opened this issue Nov 1, 2021 · 2 comments

@marcoslbueno

I am using a classification dataset with a mixture of string and category features in a pandas DataFrame, and this causes GAMA to fail (see the MRE below).

import openml 
from sklearn.model_selection import train_test_split
import gama

if __name__ == '__main__':
    did = 42530
    data = openml.datasets.get_dataset(did)
    X, y, _, _ = data.get_data(dataset_format='dataframe', target=data.default_target_attribute)

    X = X[y.notnull()]
    y = y[y.notnull()]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    print("loaded data")
    
    time_fold = 5*60
    metric = 'accuracy'
    
    clf = gama.GamaClassifier(max_total_time=time_fold, 
                            random_state = 1,
                            scoring=metric, 
                            n_jobs=1, 
                            store='nothing')

    clf.fit(X_train, y_train)
    print("finished fit.")

    proba_predictions = clf.predict_proba(X_test)
    print("finished predictions test data.")

The error I get is:

loaded data
Traceback (most recent call last):
  File "mre_gama.py", line 39, in <module>
    clf.fit(X_train, y_train)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/gama/GamaClassifier.py", line 134, in fit
    super().fit(x, y, *args, **kwargs)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/gama/gama.py", line 549, in fit
    self.model = self._post_processing.post_process(
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/gama/postprocessing/best_fit.py", line 27, in post_process
    return self._selected_individual.pipeline.fit(x, y)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/joblib/memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/base.py", line 702, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/impute/_base.py", line 288, in fit
    X = self._validate_input(X, in_fit=True)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/impute/_base.py", line 260, in _validate_input
    raise new_ve from None
ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float: 'Midwest'

The problem goes away when I convert the string features (in this case, columns 0 and 22) to category. I think it would be best if GAMA did this automatically, since the conversion appears to be simple.
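For reference, the manual workaround described above can be sketched as follows. This is a minimal illustration on a toy frame, not GAMA code: the helper name `object_columns_to_category` and the toy data are assumptions made up for this example.

```python
import pandas as pd


def object_columns_to_category(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every object-dtype column cast to category.

    Hypothetical helper for illustration only; it is not part of GAMA.
    """
    object_columns = [name for name, dtype in df.dtypes.items() if dtype == object]
    # astype with a dict converts only the listed columns and returns a new frame.
    return df.astype({name: "category" for name in object_columns})


# Toy data standing in for the OpenML dataset; "region" mimics a string feature.
toy = pd.DataFrame(
    {
        "region": pd.Series(["Midwest", "South", "Midwest"], dtype=object),
        "income": [1.0, 2.5, 3.0],
    }
)
fixed = object_columns_to_category(toy)
print(fixed.dtypes)
```

Applying the conversion to `X` before `train_test_split` keeps the category levels identical in the train and test parts.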

@PGijsbers
Member

Thanks for raising the issue! This error stems from an assumption: since DataFrames carry type annotations (their dtypes), GAMA expects them to be correct (use unannotated numpy arrays otherwise). By providing a feature that is explicitly non-categorical (technically object), you go against this assumption. This raises an error (although a poor and late one, see #132) because GAMA can't work with an object-type Series.

If you want feature type inference, consider passing the data in numpy format:

- clf.fit(X_train, y_train)
+ clf.fit(X_train.values, y_train.values)

- proba_predictions = clf.predict_proba(X_test)
+ proba_predictions = clf.predict_proba(X_test.values)
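As a small illustration of what the change above does on the pandas side: converting a mixed DataFrame to numpy drops the per-column dtype annotations, so the consumer has to infer feature types itself. The toy frame below is an assumption for this sketch and says nothing about how GAMA performs that inference.

```python
import pandas as pd

# Toy frame with one string (object) column and one numeric column.
frame = pd.DataFrame(
    {
        "region": pd.Series(["Midwest", "South"], dtype=object),
        "income": [1.0, 2.5],
    }
)

# Each column carries its own dtype annotation in the DataFrame.
print(frame.dtypes)

# .values (or the equivalent .to_numpy()) yields a single array with one
# common dtype; for mixed columns that common dtype is object.
array = frame.values
print(array.dtype)
```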

By design, I think it is good to assume that the user is an expert on the data: they can help the AutoML system by annotating data types. However, expanding the interface to infer the types of pandas object Series when explicitly requested (e.g. infer_objects=True) sounds reasonable to me. What do you think?
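A rough sketch of how such an opt-in could behave, assuming a flag named `infer_objects` as floated above. The function `prepare_features` is hypothetical and not part of GAMA's interface; treating every object column as categorical is also a simplifying assumption.

```python
import pandas as pd


def prepare_features(x: pd.DataFrame, infer_objects: bool = False) -> pd.DataFrame:
    """Hypothetical pre-check, not GAMA's API.

    By default, trust the DataFrame's dtypes and fail early on object columns.
    With infer_objects=True, cast object columns to category instead.
    """
    object_columns = [name for name, dtype in x.dtypes.items() if dtype == object]
    if not object_columns:
        return x
    if not infer_objects:
        # Failing here gives an earlier, clearer message than an imputer error.
        raise TypeError(
            f"Columns {object_columns} have dtype object. Convert them to "
            "category or numeric, or pass infer_objects=True."
        )
    return x.astype({name: "category" for name in object_columns})


frame = pd.DataFrame(
    {"region": pd.Series(["Midwest", "South"], dtype=object), "age": [30, 41]}
)
inferred = prepare_features(frame, infer_objects=True)
print(inferred.dtypes)
```

Keeping the default at `False` preserves the current assumption that the user's dtypes are authoritative, while the early check addresses the late error noted in #132.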

@marcoslbueno
Author

Thanks for replying! Indeed, with your suggestion GAMA finished without errors.

I think that adding a parameter like infer_objects=True makes a lot of sense, since the user might be unsure about the column types of the dataset (even when using DataFrames) and/or might not want to check them.

@PGijsbers PGijsbers added this to the v22.1+ milestone Jul 27, 2022
@PGijsbers PGijsbers modified the milestones: v22.1+, v22.1 Sep 16, 2022