suggestion: Use a single file for labels and text #151

shashi-netra · 2018-08-04T11:20:57Z

In the current version you have .lab and .txt files, one of each per training row. Wouldn't it be easier to save these in a single file, or in one file for labels and another for text? Wouldn't that be more idiomatic (à la scikit-learn)?

Having a separate .lab and .txt file per row is especially problematic with several million rows, because the filesystem chokes on that many files.
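
As a rough illustration of the suggestion, the per-row files could be merged into one JSON Lines file with a short script. This is only a sketch, not part of the library: the function names `consolidate` and `iter_rows`, the JSON Lines layout, and the assumption that each .lab file holds one label per line are all hypothetical choices made for the example.

```python
import json
from pathlib import Path


def consolidate(data_dir, out_path):
    """Merge every <name>.txt / <name>.lab pair in data_dir into one JSON Lines file.

    Assumes (hypothetically) that each .lab file lists one label per line.
    Returns the number of rows written.
    """
    data_dir = Path(data_dir)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for txt_path in sorted(data_dir.glob("*.txt")):
            lab_path = txt_path.with_suffix(".lab")
            if not lab_path.exists():
                continue  # skip documents that have no label file
            text = txt_path.read_text(encoding="utf-8")
            labels = [
                line.strip()
                for line in lab_path.read_text(encoding="utf-8").splitlines()
                if line.strip()
            ]
            row = {"id": txt_path.stem, "text": text, "labels": labels}
            out.write(json.dumps(row) + "\n")
            count += 1
    return count


def iter_rows(path):
    """Stream rows back one at a time, so the whole file never sits in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

With this layout a dataset of several million rows is a single file on disk, and `iter_rows` can feed a training loop without listing a huge directory.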

jstypka · 2018-08-04T16:44:32Z

@shashi-netra you're right, having another option for loading files would be a reasonable feature. I think you're actually not the first to suggest that. It shouldn't be difficult to implement, but I can't promise I'll have time to do that in the near future. You're welcome to take a stab at it and open a PR!

dorg-ekrolewicz · 2018-10-04T17:21:40Z

@jstypka Can you please indicate what the input format looks like? Is it embedding arrays for the inputs and one-hot arrays for the labels?

jstypka · 2018-10-04T21:46:57Z

@dorg-ekrolewicz the output is one-hot arrays and the input is a 2D array, each row being a word represented as a word2vec vector. A batch of several documents would make a 3D tensor. Does that help?

dorg-ekrolewicz · 2018-10-04T21:54:13Z

Are you using padding?

Example for classifying cats and dogs: num_classes = 2
max_num_words = 10 (the maximum number of words in x, in this example)

Inputs:

x = "the dog is red", y = [0,1], where num_words = 4
x = "the cat and dog are blue", y = [1,1], where num_words = 6

Since we have m=2 examples, would the input dimensions be (m, embedding_dim, max_num_words)?

jstypka · 2018-10-04T21:57:29Z

@dorg-ekrolewicz yes, that looks correct. We pad with 0s until max_num_words and use a zero vector if we don't have a representation for a word (unfamiliar vocabulary).

Pretty much all the code is in this function.
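
The function linked above is not reproduced here. As a rough, library-independent sketch of the padding scheme described in this exchange, the following builds a fixed-size matrix per document, with zero rows for padding and for out-of-vocabulary words. The names `encode` and `TOY_VECTORS`, the toy 3-dimensional embeddings, and the truncation of over-long documents are assumptions made for illustration. With one row per word, as described earlier in the thread, the batch comes out as (m, max_num_words, embedding_dim); whether the library orders the last two axes this way is not confirmed here.

```python
EMBEDDING_DIM = 3   # toy size; real word2vec vectors are much larger
MAX_NUM_WORDS = 10  # taken from the example in this thread

# Hypothetical toy embeddings standing in for a trained word2vec model.
TOY_VECTORS = {
    "the": [0.1, 0.2, 0.3],
    "dog": [0.4, 0.5, 0.6],
    "cat": [0.7, 0.8, 0.9],
    "is": [0.2, 0.1, 0.0],
}


def encode(text, vectors=TOY_VECTORS, max_num_words=MAX_NUM_WORDS, dim=EMBEDDING_DIM):
    """Return a max_num_words x dim matrix with one row per word.

    Unknown words and padding positions both become zero vectors.
    Truncating documents longer than max_num_words is an assumption of this sketch.
    """
    zero = [0.0] * dim
    words = text.lower().split()[:max_num_words]
    rows = [list(vectors.get(word, zero)) for word in words]
    rows += [list(zero) for _ in range(max_num_words - len(rows))]
    return rows


# A batch of m=2 documents forms a 3D structure.
batch = [encode("the dog is red"), encode("the cat and dog are blue")]
shape = (len(batch), len(batch[0]), len(batch[0][0]))
```

In practice the nested lists would normally be a NumPy array or framework tensor; plain lists are used here only to keep the sketch free of dependencies.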

jstypka added the enhancement label Aug 14, 2018
