This repository has been archived by the owner on May 1, 2020. It is now read-only.

The parameters on large dataset #16

Open
kr11 opened this issue Mar 14, 2018 · 4 comments


kr11 commented Mar 14, 2018

I see that there are many parameters in the training phase. Have you run this project on large datasets, like SIFT1M (or even SIFT1B) and GIST1M? And how do you choose appropriate parameters? Thanks a lot!

kr11 changed the title from "The parameters on SIFT1M" to "The parameters on large dataset" on Mar 14, 2018
@pumpikano (Collaborator)

Yes, we have run this on large datasets; the Spark version is useful when the dataset will not fit in a single machine's memory. I recall running it on a dataset of a few hundred million points. You should be able to run SIFT1M locally, but it may take an hour or so if I recall correctly.

There are a number of parameters to decide on. I would refer you to this guide (particularly section 2). You may also be interested in these Python functions, which can be used to estimate computational cost as a function of parameter values.

I'm happy to try to answer specific questions about parameter settings if you have them.
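
For concreteness, here is a minimal local sketch of what those parameters look like in code. It roughly follows the usage shown in the repo README; treat the specific values (V=16, M=8, 256 subquantizer clusters) as placeholders rather than recommendations, and double-check the constructor and search signatures (including the `(results, cells_visited)` return) against the installed version of the library.

```python
import numpy as np
from lopq import LOPQModel, LOPQSearcher

# Stand-in for real training data, e.g. SIFT1M descriptors loaded as an (N, 128) array.
data = np.random.rand(10000, 128).astype(np.float32)

# V: coarse clusters per vector half, M: number of subquantizers,
# subquantizer_clusters: codewords per subquantizer (256 -> one byte per subquantizer code).
model = LOPQModel(V=16, M=8, subquantizer_clusters=256)
model.fit(data)

# Index the data and retrieve ranked candidates for a query vector.
searcher = LOPQSearcher(model)
searcher.add_data(data)
results, cells_visited = searcher.search(data[0], quota=100)
```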

@pumpikano (Collaborator)

BTW, it is not well documented, but the library contains a function lopq.utils.load_xvecs that can convert the binary format of SIFT1M to a numpy array.
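
For example, a small sketch of that workflow (I'm assuming load_xvecs only needs the path to the .fvecs file; check lopq/utils.py for the exact signature and any optional arguments):

```python
from lopq import LOPQModel
from lopq.utils import load_xvecs

# Read the SIFT1M base set (.fvecs binary format) into an (N, 128) numpy array.
data = load_xvecs('sift/sift_base.fvecs')

# Fit a model on the loaded vectors; the parameter values here are placeholders.
model = LOPQModel(V=16, M=8)
model.fit(data)
```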


xhappy commented Mar 1, 2019

@pumpikano Did you run LOPQ on Spark? Could you please share your experience here? I have a large dataset that my single server can't handle.

@pumpikano (Collaborator)

Sorry, I haven't worked on this in years. At the time, we ran a Java implementation of LOPQ search (which was never part of this open-source project) and simply sharded the index on multiple machines. There is a branch of this repo that has an implementation that uses Spark to accomplish the sharding and serving (https://github.com/yahoo/lopq/blob/spark-search-cluster/spark/spark_lopq_cluster.py). I would strongly recommend against this for any production use case though — it was only intended to help test a large index within a Spark workflow, and we had a separate, battle-hardened index based on https://vespa.ai/.
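
For anyone curious what sharding the index amounts to, here is a minimal, hypothetical sketch. It is not the Java implementation or the Vespa index mentioned above, and the add_data(ids=...) and (results, visited) signatures are assumptions about the Python library: every shard holds its own LOPQSearcher over a slice of the data, each query fans out to all shards, and the per-shard candidate lists are merged.

```python
from lopq import LOPQSearcher

def build_shards(model, partitions):
    # One searcher per data partition; all shards reuse the same trained model.
    shards = []
    for ids, vectors in partitions:
        searcher = LOPQSearcher(model)
        searcher.add_data(vectors, ids=ids)
        shards.append(searcher)
    return shards

def search_all_shards(shards, query, quota=100):
    # Fan the query out to every shard and concatenate the candidate lists.
    # A real system would re-rank the merged candidates (e.g. by reconstructed
    # distance) before returning them; that step is omitted here.
    candidates = []
    for searcher in shards:
        results, _ = searcher.search(query, quota=quota)
        candidates.extend(results)
    return candidates
```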
