The parameters on large dataset #16
I find there are many parameters in the training phase. Have you run this project on large datasets, like SIFT1M (or even SIFT1B) and GIST1M? And how should I choose appropriate parameters? Thanks a lot!

Comments
Yes, we have run this on large datasets; the Spark version is useful when the dataset will not fit in a single machine's memory. I recall running it on a dataset of a few hundred million points. You should be able to run SIFT1M locally, but it may take an hour or so if I recall correctly. There are a number of parameters to decide on. I would refer you to this guide (particularly section 2). You may also be interested in these Python functions that can be used to compute computational cost as a function of parameter values. I'm happy to try to answer specific questions about parameter settings if you have them.
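For context, here is a minimal sketch of what training and searching look like with the Python API in this repo, adapted from the README. The parameter values are illustrative, not recommendations; `V`, `M`, and the number of subquantizer clusters are the main knobs the guide discusses:

```python
import numpy as np
from lopq import LOPQModel, LOPQSearcher

# Illustrative data: 100k 128-d vectors (SIFT-like); substitute a real dataset
data = np.random.rand(100000, 128)

# V coarse clusters per vector half (V * V multi-index cells), M subquantizers;
# illustrative values only -- larger datasets generally want larger V
model = LOPQModel(V=16, M=8, subquantizer_clusters=256)
model.fit(data)

# Index the data with the trained model
searcher = LOPQSearcher(model)
searcher.add_data(data)

# Retrieve ranked candidates for a query vector
query = np.random.rand(128)
results = searcher.search(query)
```

Roughly, `V` controls how finely the multi-index partitions the space (and thus how few points a query must touch), while `M` and the subquantizer cluster count control code size and reranking accuracy; the Python cost functions mentioned above quantify these tradeoffs.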
BTW, it is not well documented, but the library contains a function …
@pumpikano Did you run LOPQ on Spark? Could you please share your experience here? I have a large dataset that my single server can't handle.
Sorry, I haven't worked on this in years. At the time, we ran a Java implementation of LOPQ search (which was never part of this open-source project) and simply sharded the index across multiple machines. There is a branch of this repo with an implementation that uses Spark to accomplish the sharding and serving (https://github.com/yahoo/lopq/blob/spark-search-cluster/spark/spark_lopq_cluster.py). I would strongly recommend against this for any production use case, though: it was only intended to help test a large index within a Spark workflow, and we had a separate, battle-hardened index based on https://vespa.ai/.
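For anyone who wants to experiment with the sharding idea without standing up Spark, here is a hypothetical single-process sketch. The helper names `build_shards` and `sharded_search` are my own, and it assumes `LOPQSearcher.add_data` accepts an `ids` argument and that `search` returns a `(results, visited)` pair, as in the Python implementation in this repo:

```python
import numpy as np
from lopq import LOPQModel, LOPQSearcher

def build_shards(model, data, n_shards):
    # One LOPQSearcher per shard; in production each would live on its own machine
    shards = []
    offset = 0
    for part in np.array_split(data, n_shards):
        searcher = LOPQSearcher(model)
        # Assign global ids so candidates from different shards are comparable
        searcher.add_data(part, ids=range(offset, offset + len(part)))
        offset += len(part)
        shards.append(searcher)
    return shards

def sharded_search(shards, query, quota=100):
    # Fan the query out to every shard and concatenate the candidate ids;
    # a real system would rerank the merged candidates by exact distance
    candidates = []
    for searcher in shards:
        results, _ = searcher.search(query, quota=quota)  # assumed (results, visited) return
        candidates.extend(results)
    return candidates

# Usage: fit one model (here on a toy dataset), then index shards independently
data = np.random.rand(100000, 128)
model = LOPQModel(V=16, M=8)
model.fit(data)
shards = build_shards(model, data, n_shards=4)
candidates = sharded_search(shards, np.random.rand(128))
```

Note that this only tests the serving mechanics; it does not help with fitting the model itself on data too large for one machine, which is what the Spark training scripts in this repo are for.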