The parameters on large dataset #16
I find there are many parameters in the training phase. Have you run this project on large datasets, like SIFT1M (or even SIFT1B) and GIST1M? And how should I choose appropriate parameters? Thanks a lot!

Comments
Yes, we have run this on large datasets; the Spark version is useful when the dataset will not fit in a single machine's memory. I recall running it on a dataset of a few hundred million points. You should be able to run SIFT1M locally, but it may take an hour or so if I recall correctly. There are a number of parameters to decide on. I would refer you to this guide (particularly section 2). You may also be interested in these Python functions that can be used to compute computational cost as a function of parameter values. I'm happy to try to answer specific questions about parameter settings if you have them.
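For context, here is a minimal sketch of what training and searching look like with the Python API in this repo, adapted from the README. The parameter values are illustrative, not recommendations; `V`, `M`, and the number of subquantizer clusters are the main knobs the guide discusses:

```python
import numpy as np
from lopq import LOPQModel, LOPQSearcher

# Illustrative data: 100k 128-d vectors (SIFT-like); substitute a real dataset
data = np.random.rand(100000, 128)

# V coarse clusters per vector half (V * V multi-index cells), M subquantizers;
# illustrative values only -- larger datasets generally want larger V
model = LOPQModel(V=16, M=8, subquantizer_clusters=256)
model.fit(data)

# Index the data with the trained model
searcher = LOPQSearcher(model)
searcher.add_data(data)

# Retrieve ranked candidates for a query vector
query = np.random.rand(128)
results = searcher.search(query)
```

Roughly, `V` controls how finely the multi-index partitions the space (and thus how few points a query must touch), while `M` and the subquantizer cluster count control code size and reranking accuracy; the Python cost functions mentioned above quantify these tradeoffs.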
BTW, it is not well documented, but the library contains a function …
@pumpikano Did you run LOPQ on Spark? Could you please share your experience here? I have a large dataset that my single server can't handle.
Sorry, I haven't worked on this in years. At the time, we ran a Java implementation of LOPQ search (which was never part of this open-source project) and simply sharded the index across multiple machines. There is a branch of this repo with an implementation that uses Spark to accomplish the sharding and serving (https://github.com/yahoo/lopq/blob/spark-search-cluster/spark/spark_lopq_cluster.py). I would strongly recommend against this for any production use case, though: it was only intended to help test a large index within a Spark workflow, and we had a separate, battle-hardened index based on https://vespa.ai/.
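For anyone who wants to experiment with the sharding idea without standing up Spark, here is a hypothetical single-process sketch. The helper names `build_shards` and `sharded_search` are my own, and it assumes `LOPQSearcher.add_data` accepts an `ids` argument and that `search` returns a `(results, visited)` pair, as in the Python implementation in this repo:

```python
import numpy as np
from lopq import LOPQModel, LOPQSearcher

def build_shards(model, data, n_shards):
    # One LOPQSearcher per shard; in production each would live on its own machine
    shards = []
    offset = 0
    for part in np.array_split(data, n_shards):
        searcher = LOPQSearcher(model)
        # Assign global ids so candidates from different shards are comparable
        searcher.add_data(part, ids=range(offset, offset + len(part)))
        offset += len(part)
        shards.append(searcher)
    return shards

def sharded_search(shards, query, quota=100):
    # Fan the query out to every shard and concatenate the candidate ids;
    # a real system would rerank the merged candidates by exact distance
    candidates = []
    for searcher in shards:
        results, _ = searcher.search(query, quota=quota)  # assumed (results, visited) return
        candidates.extend(results)
    return candidates

# Usage: fit one model (here on a toy dataset), then index shards independently
data = np.random.rand(100000, 128)
model = LOPQModel(V=16, M=8)
model.fit(data)
shards = build_shards(model, data, n_shards=4)
candidates = sharded_search(shards, np.random.rand(128))
```

Note that this only tests the serving mechanics; it does not help with fitting the model itself on data too large for one machine, which is what the Spark training scripts in this repo are for.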