No search method in Spark implementation #4
Comments
Thanks, I hope you find it useful. Yes, this is functionality that would be great to have. I can't say for sure if or when I will work on it, however. But I thought I might lay out a few approaches here for possible discussion.

Besides the convenience of building a large similarity index in Spark for testing, a tantalizing potential use case for a fast Spark implementation of LOPQ search might be providing fast near-neighbor sampling as a component of other ML pipelines or algorithms in Spark. If you explore this further I would love to hear how it goes and possibly get your work into this project!
Thanks for the answer and the alternatives! Some applications (e.g., entity disambiguation) need the nearest neighbors of the points already indexed, so approach 1 would be best in that scenario since you need all NN anyway. For other applications that only query a few points, solution 3 would work, but I'm not familiar with IndexedRDD. If you have very few points per cell (e.g., image search), then solution 2 would work, but in other applications (e.g., entity disambiguation) there might be big buckets. I was wondering whether putting (id, hash, data) into Parquet and using its built-in partitioning system would let you retrieve candidate cells without traversing the whole dataset. One solution off the top of my head would be to define the partitioning by the first couple of codebooks in the product. I'm interested in working on this, so I might try IndexedRDD or the Parquet solution and let you know. Thanks for the great work!
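To make the Parquet idea concrete, here is a minimal PySpark sketch. The column names, the output path, and the assumption that the two coarse codes have been flattened into columns `c0`/`c1` are all illustrative, not anything defined in lopq:

```python
# A minimal sketch of the Parquet partitioning idea; schema is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# (id, c0, c1, fine): point id, the two coarse codes, and the PQ fine codes.
rows = [(0, 12, 3, [5, 9, 1, 7]),
        (1, 4, 8, [2, 2, 6, 0]),
        (2, 12, 3, [1, 0, 3, 3])]
df = spark.createDataFrame(rows, ["id", "c0", "c1", "fine"])

# Hive-style partitioning: every LOPQ cell becomes a directory c0=12/c1=3/.
df.write.mode("overwrite").partitionBy("c0", "c1").parquet("/tmp/lopq_index")

# At query time, a filter on the coarse codes is answered by partition
# pruning: only the matching cell directories are scanned.
index = spark.read.parquet("/tmp/lopq_index")
candidates = index.where((index.c0 == 12) & (index.c1 == 3)).collect()
```

Because the coarse codes are partition columns, the final filter never touches the full dataset, which is exactly the "retrieve candidate cells without traversing everything" behavior described above.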
I recently needed better support for this myself, so I added some utilities to launch a search cluster on Spark. It is a bit experimental at the moment, but you can find it in the spark-search-cluster branch: https://github.com/yahoo/lopq/blob/spark-search-cluster/spark/spark_lopq_cluster.py. Usage example and documentation are in that file.

Given an RDD of lopq codes, it creates a mapPartitions job that launches a search server on every partition in the cluster. Those servers register themselves with the driver when they start up. After all partitions have registered, you can query them in parallel from the driver, bypassing Spark completely. There is also a small utility that wraps a server around this functionality so that the cluster can be queried from, e.g., a webpage. This is intended for experimentation and evaluation and certainly not for live production use.

I have found this approach to work very well for the cases I have tested. The largest index I have tested held 100 million points, and the largest cluster had 480 partitions. Queries that rank a few tens of thousands of points run in 2-3 seconds. It is based on a set of primitives I wrote to aid in launching clusters on Spark like this: https://github.com/pumpikano/spark-partition-server. If you find it useful, let me know!
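For readers who want the shape of that pattern without reading the branch, here is a heavily simplified, hypothetical sketch (it is not the spark_lopq_cluster.py API; every name here is invented for illustration): each partition starts a tiny server, registers its address with the driver over a socket, and the driver then talks to the partitions directly.

```python
# Hypothetical sketch of the partition-server pattern; the real
# spark-partition-server primitives differ, this just shows the idea.
import json
import socket
import threading

def launch_cluster(rdd, driver_host, reg_port):
    """Start one server per partition; return their (host, port) endpoints."""
    n = rdd.getNumPartitions()
    registry = []

    def accept_registrations():
        # Driver side: collect one registration message per partition.
        srv = socket.socket()
        srv.bind((driver_host, reg_port))
        srv.listen(n)
        for _ in range(n):
            conn, _ = srv.accept()
            registry.append(json.loads(conn.recv(1024).decode()))
            conn.close()
        srv.close()

    reg_thread = threading.Thread(target=accept_registrations)
    reg_thread.start()

    def serve_partition(it):
        # Executor side: hold this partition's codes in memory, serve them.
        data = list(it)
        srv = socket.socket()
        srv.bind(("0.0.0.0", 0))  # let the OS pick a free port
        srv.listen(1)
        endpoint = {"host": socket.gethostname(),
                    "port": srv.getsockname()[1]}

        # Tell the driver where we are.
        reg = socket.create_connection((driver_host, reg_port))
        reg.send(json.dumps(endpoint).encode())
        reg.close()

        # Serve queries until the Spark job is killed. A real server would
        # rank `data` against the query; this one just reports its size.
        while True:
            conn, _ = srv.accept()
            conn.recv(4096)
            conn.send(str(len(data)).encode())
            conn.close()
        yield  # unreachable; makes this a generator for mapPartitions

    # The mapPartitions job blocks while the servers run, so launch it from
    # a background thread and wait only for the registrations.
    threading.Thread(
        target=lambda: rdd.mapPartitions(serve_partition).count(),
        daemon=True,
    ).start()
    reg_thread.join()
    return registry  # endpoints the driver can now query in parallel
```

The key design point the sketch illustrates is that the mapPartitions job never finishes on its own: it blocks for the lifetime of the servers, which is why it is launched from a background thread while the driver proceeds once all partitions have registered.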
Hi there. This library looks cool - just wondering if it supports cosine similarity and dot-product (max inner product search) - it seems to only mention Euclidean distance?
Hi Nick,

you are right, it does not support MIPS in its current state, only min Euclidean distance search. If your data vectors are L2-normalized, though, the two are equivalent and you can use Euclidean distance instead of inner product.

Yannis
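For the record, the equivalence follows from expanding the squared distance: for L2-normalized vectors,

```math
\|q - x\|^2 = \|q\|^2 - 2\,q^\top x + \|x\|^2 = 2 - 2\,q^\top x \quad \text{when } \|q\| = \|x\| = 1,
```

so ranking by increasing Euclidean distance is identical to ranking by decreasing inner product (and hence cosine similarity, since for unit vectors the two coincide).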
Thanks very much for this implementation!
I was wondering if there was any plan to add a script to search the index in Spark. So far, we can create a model, save it, and compute the codes for a set of data points, but there is no search functionality.
Thanks!
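Until something like the approaches above lands, a naive interim sketch is possible: code the query on the driver, filter the coded RDD down to the query's coarse cell, and rank the survivors locally with the in-memory Python searcher. The RDD layout and the exact `add_codes`/`search` signatures below are assumptions to check against the module, not documented project API:

```python
# A rough sketch, not project code. Assumes `model` is a fitted
# lopq.LOPQModel and codes_rdd holds (id, code) pairs, where
# code = model.predict(vector) was computed in a previous Spark job.
from lopq import LOPQSearcher

def naive_search(sc, model, codes_rdd, query, quota=10):
    # Code the query once on the driver and broadcast it.
    q_code = sc.broadcast(model.predict(query))

    # Keep only points that fall in the same coarse cell as the query.
    # (A real searcher would visit cells in order of distance instead.)
    candidates = codes_rdd.filter(
        lambda pair: pair[1].coarse == q_code.value.coarse
    ).collect()

    # Rank the (hopefully small) candidate set locally with the existing
    # in-memory searcher; signatures assumed from the Python module.
    searcher = LOPQSearcher(model)
    searcher.add_codes([c for _, c in candidates],
                       ids=[i for i, _ in candidates])
    return searcher.search(query, quota=quota)
```

This collects all candidates to the driver, so it only makes sense when cells are small; the partition-server and Parquet approaches discussed above are the ways around that limitation.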