
No search method in Spark implementation #4

Open
daniel-acuna opened this issue Jan 15, 2016 · 5 comments

Comments

@daniel-acuna

Thanks very much for this implementation!

I was wondering if there was any plan to add a script to search the index in Spark. So far, we can create a model, save it, and compute the codes of a set of data points, but there is no search functionality.

Thanks!

@pumpikano
Collaborator

Thanks, I hope you find it useful.

Yes, this is functionality that would be great to have. I can't say for sure if or when I will work on it, but I thought I might lay out a few approaches here for possible discussion.

  1. The simplest solution would be to cache an RDD of (id, lopq code) pairs, generate some candidate cells for a query with the multisequence generator, filter the RDD for those cells, and do distance reconstruction for this subset of items (either locally or in a distributed fashion). The main drawback is that filtering for a set of cells requires a pass over the data - you will get the right results, but it will likely be slow. This could be mitigated by a partitioning scheme that causes points with the same cell to fall in the same partition. A rough sketch of this approach follows the list.
  2. Rather than filtering the whole dataset, you could first groupByKey to create buckets of points keyed by cell and filter this RDD. This could cause trouble with very large cell buckets, and grouping by key is sometimes nontrivial in terms of memory.
  3. I suspect the best solution would use something like an IndexedRDD, which can return cell buckets very quickly, perhaps iteratively until a result quota for the query is reached. Currently IndexedRDD does not have a Python interface.
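To make option 1 concrete, here is a minimal sketch of what that flow could look like in PySpark. It is only an illustration: `candidate_cells_for` and `reconstructed_distance` are hypothetical placeholders for the multisequence generator and distance-reconstruction utilities, and it assumes each code's first element is the coarse cell.

```python
# Sketch of option 1: filter a cached RDD of (id, lopq_code) pairs by candidate
# cells, then rank the survivors by approximate distance to the query.
# `candidate_cells_for` and `reconstructed_distance` are hypothetical helpers.
from itertools import islice

def search(codes_rdd, model, query, n_cells=32, top_k=100):
    # Take the n_cells coarse cells most likely to contain near neighbors.
    cells = set(islice(candidate_cells_for(query, model), n_cells))

    # Full pass over the data: keep only points whose coarse cell is a candidate.
    # Assumes each code is a (coarse_cell, fine_codes) pair.
    candidates = codes_rdd.filter(lambda kv: kv[1][0] in cells)

    # Rank the (hopefully small) candidate set by reconstructed distance.
    scored = candidates.map(
        lambda kv: (kv[0], reconstructed_distance(query, kv[1], model)))
    return scored.takeOrdered(top_k, key=lambda kv: kv[1])
```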

Besides the convenience of building a large similarity index in Spark for testing, a tantalizing potential use case for a fast Spark implementation of LOPQ search might be in providing fast near neighbor sampling as a component of other ML pipelines or algos in Spark. If you explore this further I would love to hear how it goes and possibly get your work into this project!

@daniel-acuna
Author

Thanks for the answer and the alternatives!

Some applications (e.g., entity disambiguation) need the nearest neighbors of all the points already indexed, so approach 1 would be best in that scenario since you need all NNs anyway. For applications that only query a few points, solution 3 would work, but I'm not familiar with IndexedRDD. If you have very few points per cell (e.g., image search), then solution 2 would work, but for other applications (e.g., entity disambiguation) there might be big buckets.

I was wondering whether putting (id, hash, data) records into a Parquet table, where you could use the built-in partitioning system, would let you retrieve candidate cells without traversing the whole dataset. One solution off the top of my head would be to define a partitioning by the first couple of codebooks in the product.
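For illustration only, here is roughly what I have in mind. The column names, path, and surrounding variables (`codes_rdd`, `sqlContext`, `candidate_cells`) are made up for the sketch:

```python
# Hypothetical sketch: store codes in Parquet partitioned by the first coarse
# subquantizer index so a query only reads partitions for its candidate cells.
from pyspark.sql import Row
from pyspark.sql.functions import col

rows = codes_rdd.map(lambda r: Row(
    id=r[0],
    coarse0=int(r[1][0][0]),            # first coarse codebook index
    coarse1=int(r[1][0][1]),            # second coarse codebook index
    fine=[int(c) for c in r[1][1]]))    # fine (product quantizer) codes
sqlContext.createDataFrame(rows).write.partitionBy('coarse0').parquet('/path/to/index')

# At query time, partition pruning avoids traversing the whole dataset:
candidates = (sqlContext.read.parquet('/path/to/index')
              .where(col('coarse0').isin([c[0] for c in candidate_cells])))
```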

I'm interested in working on this, so I might try IndexedRDD or the Parquet solution and let you know.

Thanks for the great work!

@pumpikano
Collaborator

I recently needed better support for this myself, so I added some utilities to launch a search cluster on Spark. It is a bit experimental at the moment, but you can find it in the spark-search-cluster branch.

https://github.com/yahoo/lopq/blob/spark-search-cluster/spark/spark_lopq_cluster.py

Usage example and documentation are in that file. Given an RDD of lopq codes, it creates a mapPartitions job that launches a search server on every partition in the cluster. Those servers register themselves with the driver when they start up. After all partitions have registered, you can query them in parallel from the driver, bypassing Spark completely. There is also a small utility that wraps a server around this functionality so that the cluster can be queried from, e.g., a webpage.

This is intended for experimentation and evaluation, and certainly not for live production use. I have found this approach to work very well for the cases I have tested. The largest index I have tested was 100 million items, and the largest cluster was 480 partitions. Queries that rank a few tens of thousands of points run in 2-3 seconds.

It is based on a set of primitives I wrote to aid in launching clusters on Spark like this: https://github.com/pumpikano/spark-partition-server

If you find it useful, let me know!

@MLnick

MLnick commented Mar 8, 2017

Hi there. This library looks cool - just wondering if it supports cosine similarity and dot product (max inner product search)? It seems to only mention Euclidean distance.

@skamalas

skamalas commented Mar 8, 2017 via email
