suggestion: Use a single file for labels and text #151
Comments
@shashi-netra you're right, having another option for loading files would be a reasonable feature. I think you're actually not the first to suggest that. It shouldn't be difficult to implement, but I can't promise I'll have time to do it in the near future. You're welcome to take a stab at it and open a PR!
@jstypka Can you please indicate what the input format looks like? Is it embedding arrays for the inputs and one-hot arrays for the labels?
@dorg-ekrolewicz the output is one-hot arrays and the input is a 2D array, each row being a word represented as a word2vec vector. A batch of several documents would make a 3D tensor. Does that help?
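To make the shapes above concrete, here is a minimal NumPy sketch. The sizes, the random stand-in vectors, and the `one_hot` helper are assumptions made for illustration, not code from this project, and the axis order simply follows the "each row is a word" description.

```python
import numpy as np


def one_hot(label_indices, num_classes):
    """Return a (num_examples, num_classes) array with a 1 at each class index."""
    encoded = np.zeros((len(label_indices), num_classes), dtype=np.float32)
    encoded[np.arange(len(label_indices)), label_indices] = 1.0
    return encoded


# Hypothetical sizes, chosen only for illustration.
embedding_dim = 4
num_words = 3

# Random stand-ins for word2vec lookups: one 2D array per document,
# one row per word.
rng = np.random.default_rng(0)
doc_a = rng.normal(size=(num_words, embedding_dim)).astype(np.float32)
doc_b = rng.normal(size=(num_words, embedding_dim)).astype(np.float32)

# Stacking equal-length documents gives the 3D batch tensor.
batch = np.stack([doc_a, doc_b])

# One class index per document, encoded as one-hot rows.
labels = one_hot([0, 1], num_classes=2)
```

Here `batch` has one entry per document along the first axis and `labels` has one row per document.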
Are you using padding? E.g. for classifying cats and dogs: num_classes = 2 Inputs:
Since we have m=2 examples, would the input dimensions be (m, embedding_dim, max_num_words)?
@dorg-ekrolewicz yes, that looks correct. We pad with 0s until Pretty much all the code is in this function.
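The padding target is cut off in the reply above, so the following is only a sketch of zero-padding under stated assumptions: a fixed `max_num_words` (the name is borrowed from the question), truncation of longer documents, and a hypothetical `pad_document` helper that is not the project's function.

```python
import numpy as np


def pad_document(word_vectors, max_num_words):
    """Zero-pad (or truncate) a (num_words, embedding_dim) array to max_num_words rows."""
    word_vectors = np.asarray(word_vectors, dtype=np.float32)
    embedding_dim = word_vectors.shape[1]
    padded = np.zeros((max_num_words, embedding_dim), dtype=np.float32)
    # Assumption: documents longer than max_num_words are truncated.
    kept = min(len(word_vectors), max_num_words)
    padded[:kept] = word_vectors[:kept]
    return padded


short_doc = np.ones((2, 3), dtype=np.float32)  # 2 words, embedding_dim = 3
long_doc = np.ones((5, 3), dtype=np.float32)   # 5 words, embedding_dim = 3

# After padding, every document has the same shape and can be stacked.
batch = np.stack([pad_document(doc, max_num_words=4) for doc in (short_doc, long_doc)])
```

With this layout the short document ends in all-zero rows, and the long one keeps only its first four words.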
In the current version you have .lab and .txt files, one of each per training row. Wouldn't it be easier to save these in a single file, or in one file for labels and another for text? Wouldn't this be more idiomatic (a la scikit-learn)?
Having a separate .lab and .txt file for every row is especially problematic when there are several million rows: the filesystem chokes up.
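One possible shape for the single-file idea is a JSON Lines file with one record per training row. The sketch below is not an existing feature of this project: the function names and the record fields are hypothetical, and it assumes each .lab file lists one label per line.

```python
import json
from pathlib import Path


def merge_pairs(source_dir, output_path):
    """Write one JSON line per (.txt, .lab) pair in source_dir; return the count."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for txt_path in sorted(Path(source_dir).glob("*.txt")):
            lab_path = txt_path.with_suffix(".lab")
            if not lab_path.exists():
                continue  # skip documents that have no label file
            text = txt_path.read_text(encoding="utf-8")
            # Assumption: a .lab file lists one label per line.
            lab_lines = lab_path.read_text(encoding="utf-8").splitlines()
            labels = [line.strip() for line in lab_lines if line.strip()]
            record = {"id": txt_path.stem, "text": text, "labels": labels}
            out.write(json.dumps(record) + "\n")
            count += 1
    return count


def iter_examples(path):
    """Yield (text, labels) pairs one at a time, without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                record = json.loads(line)
                yield record["text"], record["labels"]
```

Reading line by line keeps memory use flat, and millions of rows become one file rather than millions of directory entries.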