initial skorch hyperparam opt implementation #149

Open

wants to merge 6 commits into main
Conversation

@ToddMorrill (Author)

Hey guys, I want to get the conversation started on this. I have a v1 implementation of an example that uses PyTorch + Skorch for a text classification problem, and then uses Dask-ML's Hyperband search (HyperbandSearchCV) to find the best hyperparameters. It ran successfully once, but after some further changes it's now failing with a fairly cryptic error message.

If you have some pointers, I can run edits. Meanwhile, I'll keep looking at it for potential bugs.

Separately, I'd love to get your thoughts on how to make better use of torchtext in the current pipeline. The way I'm preparing training data is causing a lot of extra compute and totally breaks the batching semantics of torchtext and deep learning models in general.


@mrocklin (Member)

cc @stsievert

@stsievert (Member) left a comment

Thanks for the PR @ToddMorrill! I can see where I need to improve some of the documentation, and have fixed some other issues:

fairly cryptic error message.

Resolved in dask/dask-ml#670. Looks like you have a typo; I left a comment below on the appropriate line. For completeness, here's the full traceback:

  File "/Users/scott/anaconda3/envs/dask-ml-docs/lib/python3.6/site-packages/distributed/utils.py", line 665, in log_errors
    yield
  File "/Users/scott/Developer/stsievert/dask-ml/dask_ml/model_selection/_incremental.py", line 115, in _create_model
    model = clone(model).set_params(**params)
  File "/Users/scott/anaconda3/envs/dask-ml-docs/lib/python3.6/site-packages/skorch/net.py", line 1424, in set_params
    self.initialize_module()
  File "/Users/scott/anaconda3/envs/dask-ml-docs/lib/python3.6/site-packages/skorch/net.py", line 467, in initialize_module
    module = module(**kwargs)
TypeError: __init__() got an unexpected keyword argument 'filter_size'

"source": [
"# takes some time to numericalize the whole dataset\n",
"\n",
"# also notice that skorch and dask expect numpy arrays, which isn't ideal since it ties you to the cpu.\n",
Member

I know skorch accepts some other formats: https://skorch.readthedocs.io/en/stable/user/FAQ.html#faq-how-do-i-use-a-pytorch-dataset-with-skorch. Why doesn't this work here?

Author

The challenge is solving for variable length features. With images or tabular datasets, your feature set size is fixed. With text, your feature length varies by batch due to the varying sequence lengths in your batch.

For Skorch, I need to see if I can use a collate_fn somewhere. I want to pad to the longest sequence length in the batch instead of padding to the longest sequence length in the dataset, which would save significant compute time.

Here's another potential solution that I need to take a closer look at.

As for Dask, it's not clear to me how to get around the fixed shape of Dask arrays. Maybe instead of using the numericalized representation (from torchtext), you could work with raw text in the Dask array and then try to preprocess it somewhere else.

In any event, these approaches involve some tinkering, whereas torchtext has solved these problems.
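Roughly, the batch-level padding I have in mind is a collate_fn along these lines (just a sketch - the (token_ids, label) sample layout and the pad_idx value are assumptions, not what's in the notebook yet):

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_batch(batch, pad_idx=1):
    # batch is a list of (token_id_tensor, label) pairs produced by the Dataset
    sequences, labels = zip(*batch)
    # pad only to the longest sequence in *this* batch, not in the whole dataset
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=pad_idx)
    return padded, torch.tensor(labels, dtype=torch.float32)

# train_loader = DataLoader(train_dataset, batch_size=32, collate_fn=pad_batch)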

Member

Variable-length features sound like something that won't work out of the box. It sounds like you've got a handle on it with collate_fn and the Skorch issue. I'm not sure how the validation/chunk splitting works with Hyperband/etc. though. It'd be great to have a practical use case!

In the past, I've run into variable-length features with bag-of-words counts from Scikit-learn's CountVectorizer. To resolve this, I used the HashingVectorizer, which is an approximate version of CountVectorizer. I'm not sure if that's relevant.
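For reference, a minimal sketch of that approach (the documents and n_features are illustrative). HashingVectorizer is stateless and maps text of any length to a fixed-width row:

from sklearn.feature_extraction.text import HashingVectorizer

docs = ["a short review", "a much longer review with many more words"]
# n_features fixes the output width, so variable-length text becomes a fixed-size row
vectorizer = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)
X = vectorizer.transform(docs)  # sparse matrix of shape (2, 262144)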

Author

It sounds like you've got a handle on it with collate_fn and the Skorch issue

This is working now in Skorch.

I'm not sure how the validation/chunk splitting works with Hyperband/etc though

That's actually a question I have in my most recent commit. I'm not sure if my training collate_fn is being handled differently from my validation collate_fn.

I used the HashingVectorizer, which is an approximate version of CountVectorizer. I'm not sure if that's relevant.

It's totally relevant because it solves the same issue! I'm just trying to find a solution that follows the typical workflow of a deep learning practitioner (i.e. padding at the batch level).

@stsievert (Member) May 29, 2020

HashingVectorizer ... same issue

Does PyTorch have an equivalent to HashingVectorizer, or can it work with Scikit-learn's HashingVectorizer? I agree, it's useful to highlight the use of DataLoader but it'd also be nice to see an alternative approach that's better for distributed computation.

It sounds like you've got a handle on it with collate_fn and the Skorch issue

This is working now in Skorch.

👍

"\n",
"# it's not immediately obvious to beginners how all these parameters interact with each other\n",
"max_iter = n_params\n",
"chunk_size = n_examples // n_params"
@stsievert (Member) May 23, 2020

Would it help to add the following to the "Notes" section of the HyperbandSearchCV docstring?

One feature of Hyperband and the underlying mathematics is that the iteration count max_iter determines the number of parameters that need to be sampled.

  • add to docs

Author

Yes. That would be helpful. My frame of reference is sklearn's RandomizedSearchCV, which uses both n_iter and cv parameters. Separately, with Skorch + RandomizedSearchCV, I specify how many epochs to train for. With n_iter, cv, and epochs, it's clear to me how much computation will take place.

When I started looking at Hyperband, I was struggling to map those parameters above to Hyperband. My intuition was that n_params and n_iter were equivalent and that if they were both the same value, you would get an apples-to-apples comparison between RandomizedSearchCV and Hyperband on time-to-compute and accuracy of the model found.

Just so I'm clear, when we set n_params (which then flows into max_iter), it's only loosely related to n_iter in RandomizedSearchCV, is that right?
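The mental mapping I'm working from is roughly this (a sketch only; skorch_model and params stand for the estimator and parameter distribution in the notebook, and the numbers are illustrative):

from sklearn.model_selection import RandomizedSearchCV
from dask_ml.model_selection import HyperbandSearchCV

# scikit-learn: sample n_iter parameter settings, each trained for a fixed
# number of epochs with cv-fold cross validation
random_search = RandomizedSearchCV(skorch_model, params, n_iter=8, cv=3)

# Dask-ML: max_iter bounds both the number of parameters sampled and the
# number of partial_fit calls any single model receives
hyperband_search = HyperbandSearchCV(skorch_model, params, max_iter=8)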

I also may need to go reread your paper to develop some more intuition.

@stsievert (Member) May 24, 2020

n_params and n_iter were equivalent and that if they were both the same value, you would get an apples-to-apples comparison between RandomizedSearchCV and Hyperband on time-to-compute and accuracy of the model found.

n_params and n_iter have the same meaning: they both mean "sample (approximately) this many parameters/initialize this many models."

If n_params == n_iter, HyperbandSearchCV will find the same score as RandomizedSearchCV with high probability. However, HyperbandSearchCV will do a lot less work.

If RandomizedSearchCV and HyperbandSearchCV do the same amount of work, HyperbandSearchCV will find scores that are a lot higher.

Numbers/graphs behind these statements are in the Dask-ML docs at "Hyper Parameter Search > Hyperband performance." These are the same figures shown in the paper.

  • I think it'd help to rename n_models to n_params_actual in search.metadata. Is that accurate?

Author

That's all very helpful, thanks @stsievert. Whenever I see something like n_params = 299 # sample about 300 parameters I'm not sure what to make of it. Is there a reason you chose 299?

I wonder if n_params_searched (though I don't like that it's past tense) would be better, to convey that you're searching for that many unique hyperparameter configurations. Also, I'm not sure if there is an attribute that could show users which n hyperparameter configurations are actually planned for the search (or are these hyperparameters adaptively selected?).

Member

to convey that you're searching for that many unique hyperparameter configurations

Maybe the best option is to add a note in the rule of thumb saying n_params is approximate, and the true value can be found in search.metadata["n_params"].

Is there a reason you chose 299?

I chose 299 to make the Dask array chunk evenly. I think with 300 there was one chunk with few examples. With 299 all chunks were the same size.

"EPOCHS = 5\n",
"NUM_TRAINING_EXAMPLES = len(train)*.8\n",
"n_examples = EPOCHS * NUM_TRAINING_EXAMPLES\n",
"n_params = 8\n",
@stsievert (Member) May 23, 2020

I'm not sure Hyperband is relevant when n_params = 8. Hyperband is an early stopping scheme, and there's not much early stopping to be done when max_iter = n_params = 8.

@stsievert (Member) May 23, 2020

Actually I take that back. Hyperband is still (somewhat) relevant. It'll get more relevant if n_params is higher.

search.metadata["partial_fit_calls"] is the total number of calls to partial_fit, not the number of calls to any one model. No model will see more than max_iter = n_params = 8 calls to partial_fit.

With Hyperband, n_models = 5 parameters are sampled to see if they're the best parameters (n_models is in search.metadata). If RandomizedSearchCV were used instead with the same amount of work, only 2 parameters could be sampled and considered for the best.

>>> # Setup as in the notebook
>>> assert len(X_train) == 25000
>>> n_examples = 5 * len(X_train)
>>> n_params = 8
>>> chunk_size = n_examples // n_params
>>>
>>> # How much data will Hyperband feed to its models in total?
>>> hyperband_eg = 26 * chunk_size
>>> # How many models could randomized search fit with the same budget?
>>> # Randomized search gives every model an equal number of examples.
>>> hyperband_eg / n_examples
2.6
  • TODO: add this example to the docs, or to search.metadata.

Author

I think this is a really helpful comparison for people - to see how you're able to sample many more parameters with Hyperband.

I can increase n_params but my frame of reference was that n_params is functionally equivalent to n_iter in RandomizedSearchCV, where higher numbers just mean more compute. If n_params can be increased without increasing the compute cost, then fantastic! If that's the case, then this would be useful to include in the docs.

I need to understand how more params impact GPU memory utilization as well. If models are only initialized one at a time (assuming a single GPU), then this is likely fine. The key thing is that those models need to be unpersisted somehow, otherwise you will fill up GPU memory.
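For what it's worth, the pattern I'd expect to free memory is the standard one below (a sketch; whether the Dask workers keep extra references alive is exactly what I'm unsure about):

import gc
import torch

def release_model(model):
    # drop this reference and collect; the tensors are only freed once no
    # references remain anywhere (e.g. on the workers)
    del model
    gc.collect()
    # return the cached, now-unused blocks to the GPU driver
    torch.cuda.empty_cache()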

"outputs": [],
"source": [
"# define parameter grid\n",
"params = {'module__filter_size': [(1,2,3), (2, 3, 4), (3, 4, 5)], \n",
Member

Typo:

- {'module__filter_size': 
+ {'module__filter_sizes': 

Author

Thank you!!

@ToddMorrill (Author)

I fixed the typo you pointed out (thanks!) and noticed a couple other things along the way.

It turns out that I was running out of GPU memory because of the way that I was passing in pretrained embeddings (I was using all of GloVe instead of a 25k subset of vectors). So that's fixed now and memory utilization is staying much lower.

Another observation is that GPU memory utilization is monotonically increasing and I haven't been able to reduce it by deleting PyTorch or Skorch objects, which should garbage collect those objects and free memory. I'm wondering if this has something to do with working in a distributed environment, where deleting the object in the Jupyter Notebook doesn't delete references to GPU memory on the workers. When I keyboard interrupted the process, the workers got restarted and memory utilization dropped down to zero.

I'm also getting an error because my filter_sizes parameter is getting registered as a multi-dimensional array when stored in the search.cv_results_ dictionary. I converted search.cv_results_['param_module__filter_sizes'] to a list before passing it to pandas and that cleared it up, but I'm not sure if there's something better that can be done under the hood.
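Concretely, the workaround is just this (a sketch of what I did; search is the fitted HyperbandSearchCV):

import pandas as pd

cv_results = dict(search.cv_results_)
# the filter_sizes tuples get stored as a multi-dimensional array; coerce back to a plain list
cv_results["param_module__filter_sizes"] = list(cv_results["param_module__filter_sizes"])
results_df = pd.DataFrame(cv_results)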

The script now runs but it takes a long time on a single small-ish GPU (~60 minutes). I'm hoping to try this out on a GPU cluster soon. I suspect for big hyperparameter optimization jobs you'd want a fairly large cluster of GPUs (e.g. 4+) to get through these jobs in a reasonable amount of time, which does put up a bit of a barrier to entry for an example demo and any practitioners that can't afford that.

I could probably reduce the dataset size and the model would still converge. I'm just trying to create as "real" an example as possible.

@stsievert (Member)

Thanks for this use case. I've got some of the related fixes in dask/dask-ml#671 (which will remain a draft until this PR is merged). Please comment in that PR with your questions and/or suggestions.

I'm just trying to create as "real" an example as possible.

I'd be careful with excessive computation. These examples run on Binder, which has pretty serious limits on computation. GPUs are definitely out of scope, and I hesitate to do any computation that takes more than ~10 seconds.

I've included cells like this before:

# Make sure the computation isn't too excessive for this simple example.
max_iter = 9
# max_iter = 243  # uncomment this line for more realistic usage

reduce the dataset size and the model would still converge

I typically don't look for performance comparisons in examples like this. I tend to leave that for papers/documentation. Instead, I tend to run these examples to figure out how to use the tool. To me, the most salient questions this example answers are the following:

  1. How are Skorch models created from PyTorch models?
  2. How do I pass hyperparameters to Skorch models?
  3. How do I use a non-standard dataset with a Skorch model? What memory constraints does the dataset present, and how are they circumvented?

The last question might warrant another PR. The GPU usage is interesting and good to see. I'd definitely add a note saying this works on GPUs (and probably some code to put it on the GPU if available). Importantly, I don't think a GPU should be required.
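Something like the following cell would do it (a sketch; TextCNN stands in for the notebook's module and max_epochs is illustrative):

import torch
from skorch import NeuralNetClassifier

device = "cuda" if torch.cuda.is_available() else "cpu"

# skorch moves the module and each batch to `device`, so a GPU is used when
# available but never required
net = NeuralNetClassifier(TextCNN, device=device, max_epochs=5)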

@ToddMorrill (Author) commented May 25, 2020

Thanks for all the feedback so far. The example is coming together. I implemented the collate_fn that we discussed above and things are now working well in Skorch (significantly faster when padding at the batch level).

The custom Dataloader worked well with Skorch and I'm currently trying to make it work with Hyperband. I think I found a way around the variable length feature sizes - simply use the raw text as one single feature (i.e. "fixed size") in a Dask array and then let my collate function do the tokenization, padding, etc. This should seriously reduce the amount of computation performed (as it did in Skorch).

The crux of the issue is that Hyperband doesn't appear to be handling the validation data correctly (though frankly, I can't tell if it's handling my training data correctly either - not sure if this would be better in a .py where I might see more log output). It looks like Hyperband is passing my validation data through my collate_fn twice. I say this because Hyperband is erroring out with KeyError: tensor([0.]) and, if you look closely at the traceback, the key at this particular stage of the collate_fn should be something like pos or neg, which are the unprocessed labels for my dataset. The output of collate_fn should be a processed label of the form tensor([0.]), hence why I think collate_fn is being called twice on my validation data. Finally, I think it is a validation data issue (as opposed to a training data issue) because it only seems to arise when return accuracy_score(y, self.predict(X), sample_weight=sample_weight) is called. I left the error in the notebook I most recently pushed.

Do you have any thoughts on why the handling of the validation data in Hyperband would differ from the training data? Separately, is there any reason to think that a DataLoader wouldn't work here?

EDIT: do you think this has anything to do with skorch-dev/skorch#641?

@stsievert (Member) commented May 29, 2020

Sorry for the delayed response. I'll have more time to respond a week from now.

Hyperband doesn't appear to be handling the validation data correctly

Could the issue be with out-of-vocabulary words? There might be a word in the validation set that's not in the training set. That's the first idea that comes to mind, especially because you're passing in a list of strings. If that's the issue, use of HashingVectorizer would resolve it.

@stsievert (Member)

I can't tell if it's handling my training data correctly either

From what I can tell, you're handling the test data correctly: it appears you're only running the test data through the model once at the very end. How are you confused?

Do you have any thoughts on why the handling of the validation data in Hyperband would differ from the training data?

Yes, especially with text data. Do the train, validation and test sets all have the same vocabulary? If not, you could probably get around it with something like:

def pad_batch(batch, TEXT, LABEL):
    text, label = list(zip(*batch))
    # drop out-of-vocabulary tokens from each sequence before numericalizing
    text = [[word for word in seq if word in TRAIN_VOCAB] for seq in text]
    # ... rest of function untouched

for some appropriately defined TRAIN_VOCAB. Alternatively, you could use HashingVectorizer as mentioned in #149 (comment).

I suspect this is coming into play with these lines:

train_dataloader = DataLoader(..., collate_fn=pad_batch_partial)  # In[20]
test_dataloader = DataLoader(..., collate_fn=pad_batch_partial)  # In[32]

I wouldn't use a grid search with Hyperband. I prefer random searches.

@ToddMorrill (Author)

@stsievert, my apologies for the delay in picking this back up. The good news: we’re up and running!

After working with Skorch a bit more, it became quite obvious to me why this wasn’t going to work as it was written before. In short, in Skorch I was passing skorch_model.fit(train_dataset, y=None), while in HyperbandSearchCV I was passing search.fit(X, y). In Skorch, I am using torch.utils.data.Dataset and torch.utils.data.DataLoader, which aren't compatible with HyperbandSearchCV for two reasons: 1) X in skorch_model.fit(train_dataset, y=None) contains both the features and the label in one tuple, while in HyperbandSearchCV, I have to pass X and y explicitly, and 2) torch.utils.data.DataLoader (with a collate_fn) doesn’t accept dask arrays.

But hindsight is 20/20.

Most problems appear to stem from the need to use dask arrays.

In Skorch, you have full control over how your data is formatted when fed to model.fit() and full control over how data is preprocessed (e.g. tokenized, padded, etc.) before it is sent to the network. This flexibility is curtailed when you are required to use dask arrays AND prepare all of your data ahead of time.

Here are two options (as I see it) for using HyperbandSearchCV with variable length features:

  1. You can put raw text in a dask array (i.e. a “fixed-length feature”, since each element is a single string). Then you need some sort of preprocessing to occur. That could happen in the model, but it’s not good practice (see the sketch after this list).
    • torch.nn.utils.rnn.pad_sequence might be useful here, but you’d still need a tokenizer in the model and that’s just not a great design pattern.
  2. You can preprocess text outside of the model but then you must have a fixed-size feature set (i.e. pad to the longest sequence in your entire dataset) and this is what drives all the extra compute time.
    • You can fix your sequence length, which I’ve done, to reduce computation times but accuracy does suffer. You’re also still running a lot of extra computation for shorter length sequences that were heavily padded.
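A sketch of option 1 (referenced above), assuming raw_train_texts and train_labels are plain Python lists and chunk_size is defined as earlier in the notebook:

import dask.array as da
import numpy as np

# each review is a single string, so the array has a fixed shape of (n_samples, 1)
texts = np.array(raw_train_texts, dtype=object).reshape(-1, 1)
labels = np.array(train_labels, dtype="float32")

X = da.from_array(texts, chunks=(chunk_size, 1))
y = da.from_array(labels, chunks=chunk_size)

# tokenization and padding would then have to happen inside the estimator
# search.fit(X, y)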

Here’s how I’m thinking about the decision to use or not use HyperbandSearchCV. If your hyperparameter search is large AND Hyperband's algorithm is asymptotically faster than something like RandomizedSearchCV (e.g. O(n^2) vs. O(n^3)), then it makes sense to use HyperbandSearchCV. The question is: where is that tipping point in terms of the size of the parameter search? In my experiments, I’ve witnessed a 2-3x speedup by using proper deep learning batching semantics (see here). In other words, every batch that is fed through the network in Dask takes 2-3x longer than in vanilla PyTorch or Skorch. If you can overcome this slowdown by simply doing significantly less computation (i.e. less searching), then you come out ahead by using HyperbandSearchCV. If your search space isn’t arbitrarily large, then you would likely come out ahead by using Skorch + RandomizedSearchCV simply as a result of implementation details (i.e. not due to the asymptotic performance of RandomizedSearchCV or HyperbandSearchCV).

This analysis doesn’t rigorously address the time it takes for a model to converge under standard batching semantics (i.e. pad at the batch level) vs. padding to the longest example in the dataset. The model may take longer to converge and/or may not be able to achieve peak performance (e.g. accuracy, f1, etc.) by padding to the longest example in the dataset.

Sadly, I don’t know enough about Dask’s or HyperbandSearchCV’s internals to provide a well-reasoned recommendation. One possibility would be to consider how one might use HyperbandSearchCV without dask arrays and instead use numpy arrays or torch tensors. Another thing to consider is the possibility to do some preprocessing (e.g. tokenization, etc.) after data is submitted to search.fit() but before data makes it to the network.

I’m happy to discuss further and answer any questions that you have.

I’m also happy to run any edits on this example to polish it up but in terms of design patterns, I think we’ve explored most of the obvious options.

@stsievert (Member)

I have one more basic question:

  • Why does model convergence depend on "pad[ding] at the batch level vs. padding to the longest example in the dataset"? Is "convergence" in terms of optimization iterations?

I'd expect the model to converge at the same rate regardless of batching semantics. I'd expect the model to have identical output for identical inputs, regardless of whether the input is padded to the longest example in the batch or in the dataset. Why isn't that the case, or am I misunderstanding?

You can put raw text in a dask array (i.e. a “fixed-length feature” since it’s a singular string). Then you need some sort of preprocessing to occur. That could happen in the model but it’s not good practice.

I like this solution best because arrays of strings are passed between workers. Why is this implementation bad practice?

import numpy as np
import skorch
import torch


class PreprocessInputs(skorch.NeuralNetClassifier):
    def __init__(self, preprocessing, **kwargs):
        self.preprocessing = preprocessing
        super().__init__(**kwargs)

    def partial_fit(self, X: np.ndarray, y=None, **fit_params):
        # preprocess each chunk before handing it to skorch's training loop
        X_processed = self.preprocess(torch.from_numpy(X))
        return super().partial_fit(X_processed, y=y, **fit_params)

    def preprocess(self, X):
        # delegate to the user-supplied preprocessing callable
        with torch.no_grad():
            return self.preprocessing(X)

This implementation is more usable because the model is one atomic unit. That is, no outside knowledge is needed about specific methods to preprocess or normalize the input. We could fold this into Dask-ML, but it'd basically be doing the same thing as this implementation.
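Usage would look roughly like this (a sketch; tokenize_and_pad, TextCNN, params, and n_params are placeholders for the notebook's objects):

# tokenize_and_pad: a callable that numericalizes and pads a batch of raw text
net = PreprocessInputs(
    preprocessing=tokenize_and_pad,
    module=TextCNN,
    max_epochs=5,
)
search = HyperbandSearchCV(net, params, max_iter=n_params)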

@ToddMorrill (Author)

  • Why does model convergence depend on "pad[ding] at the batch level vs. padding to the longest example in the dataset"? Is "convergence" in terms of optimization iterations?

It's a bit hand wavy but I've observed very different loss metrics when training on data padded at the batch level (lower loss scores) vs. data padded at the dataset level (higher loss scores). Yes, when I say convergence, I mean that the model is actually training effectively and improving with each iteration. I'd have to spend some more time running experiments to determine if a model trained on data padded to the longest example in the dataset would be able to achieve the same level of performance (e.g. accuracy, f1, etc.) as one trained with batch level padding.

I'd expect the model to converge at the same rate regardless of batching semantics. I'd expect the model to have identical output for identical inputs, regardless of whether the input is padded to the longest example in the batch or in the dataset. Why isn't that the case, or am I misunderstanding?

Suppose you've trained a classification deep learning model. Further, let's suppose you prepare one single example that you want a prediction for. If you pad that example, you will get one predicted probability distribution. If you don't pad that example, you will get a different probability distribution. Without more experimentation, the question still remains, how big is that difference?

I agree with you, padding should not play a huge role, in general, especially when you might only see a dozen or so pad tokens in a typical batch. However, in our case, we're talking about 1000s of unnecessary pad tokens. It probably will have some sort of impact on the predicted probability distributions.

I like this solution best because arrays of strings are passed between workers. Why is this implementation bad practice?
This implementation is more usable because the model is one atomic unit. That is, no outside knowledge is needed about specific methods to preprocess or normalize the input. We could fold this into Dask-ML, but it'd basically be doing the same thing as this implementation.

There's nothing inherently wrong with it. It's just not the typical design pattern that you'll see in the wild, where people tend to experiment with model architectures much more frequently than they experiment with their preprocessing setup. For example, torchtext is decoupled from modeling, and Hugging Face follows the same pattern.

I implemented parts of that approach in a previous commit. We can pursue it further - I just worry about adoption of this pattern.

Base automatically changed from master to main January 27, 2021 16:07
@jacobtomlinson (Member)

It's been a while since there was any activity here. @ToddMorrill, is there still any interest in getting this into a mergeable state?
