Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CountVectorizer example #160

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

TomAugspurger
Copy link
Member

No description provided.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

Review Jupyter notebook visual diffs & provide feedback on notebooks.


Powered by ReviewNB

@mrocklin
Copy link
Member

mrocklin commented Jul 27, 2020 via email

@TomAugspurger
Copy link
Member Author

Merged them into a single "Working with text data" notebook that starts with different comparing different vectorizers (HashingVectorizer, CountVectorizer) and ends with the full pipeline.

" toolz.sliding_window(2, lengths)]\n",
"# Notice the persist here! More details later.\n",
"documents = db.from_delayed([load_news(x) for x in slices]).persist()\n",
"documents"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also call db.read_sequence(..., npartitions=10).persist() and then call client.rebalance()

Given that people are going to blindly copy-paste whatever we do anyway I'd personally rather that they see this. It's a bit more in line with ordinary behavior I think.

Comment on lines 96 to 100
"import dask_ml.feature_extraction.text"
]
},
{
"cell_type": "code",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend merging adjacent code cells, if only to cut down on Ctrl-Enter pressing.

Comment on lines 165 to 169
"remote_vocabulary, = client.scatter([vocabulary], broadcast=True)\n",
"\n",
"vectorizer2 = dask_ml.feature_extraction.text.CountVectorizer(\n",
" vocabulary=remote_vocabulary\n",
")"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a well defined vocabulary that we can use somewhere? Maybe in nltk? I'm concerned that people will see this, and think that they should copy the vocabulary off of one CountVectorizer and then pass it to another.

Also, do we need the scatter? Can you verify that if vocabulary is included directly in the vocabulary= keyword argument that it will occupy only a single task, and not be in many of them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about nltk, but probably not worth adding it to the environment just for this example. I noted that you'd probably get this from an external source in practice.

I think that as of dask/dask-ml#719, the answer to your question about user-provided vocabulary being in one task is "yes". But that change hasn't been released yet.

Base automatically changed from master to main January 27, 2021 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants