Save the search results like groundtruth_ivecs #9

zzzmm1 opened this issue Dec 7, 2022 · 11 comments

zzzmm1 commented Dec 7, 2022

I am not familiar with CUDA programming.
How can I minimally modify the code to save the search results like groundtruth_ivecs?
Any help would be greatly appreciated!
Thanks!

LukasRuppert (Collaborator)

Hi,

Once a query has been performed, all search results are stored on the CPU side for evaluating the accuracy.

You can see the layout of the data in the code.

To store the results, simply write out the data pointed to by ggnn_results.h_sorted_ids.
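
For illustration, here is a minimal sketch of dumping those ids in the same .ivecs layout as a ground-truth file (one int32 length followed by that many int32 ids per query). The names num_queries and KQuery are placeholders for your configuration, and the assumption that h_sorted_ids holds num_queries * KQuery consecutive int32 ids should be checked against the actual layout:

// Hedged sketch: write the sorted result ids in .ivecs format.
// Assumes h_sorted_ids holds num_queries * KQuery consecutive int32 ids.
#include <cstdint>
#include <cstdio>

void write_results_ivecs(const char* path, const int32_t* h_sorted_ids,
                         int32_t num_queries, int32_t KQuery) {
  FILE* f = std::fopen(path, "wb");
  if (!f) return;
  for (int32_t n = 0; n < num_queries; ++n) {
    std::fwrite(&KQuery, sizeof(int32_t), 1, f);                         // vector length
    std::fwrite(h_sorted_ids + n * KQuery, sizeof(int32_t), KQuery, f);  // ids
  }
  std::fclose(f);
}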

Best regards,
Lukas

zzzmm1 commented Dec 7, 2022

Thanks for your reply!
One more question.
Applying the default GIST-like search graph configuration (k=96, s=64, l=4, r=2, tau=0.4) to a dataset of 1 million points in 1024 dimensions with the cosine similarity metric, the search performance is not good enough (it cannot reach 99% R@1).
How should I tune the parameters, and would multiple GPUs help?

LukasRuppert (Collaborator) commented Dec 7, 2022

Well, there is tau_build and tau_query.
The higher tau_query, the better your search results should become, but it also increases the cost and there are diminishing returns on quality.
Try using tau_query somewhere between 1 and 2 for high accuracy.

Otherwise, increasing tau_build can get you somewhat better query performance in both speed and accuracy at the cost of construction time, and increasing k can increase accuracy but will slow everything down.

Multi-GPU can help a bit: because it shards the dataset, it reduces the number of important neighbors within each shard that we need to keep, so it has a similar effect to increasing k. But it's only a small improvement, I would guess.
Essentially, it just performs the search on all shards and then merges the search results at the end (so this also has the effect of having an increased k_query).
You can even run the multi-GPU mode on a single GPU, if you find that sharding produces better results for you.

Another thing you can try is changing the query parameters:
You can change the size of the k-best list of the search (k_query), which should have a similar effect to changing tau_query but at higher cost,
the size of the priority queue (which did not matter too much in our tests; see Table 6 in the arXiv paper),
and the number of search iterations (set the size of the visited list to a similar value, or the search might run in cycles).

If changing tau_query does not get you to 99% R@1,
the main parameters to boost performance are really k (the number of neighbors per point)
and the number of search iterations (with a matching visited-list size).
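
If you dump the result ids as described above, a quick way to check whether a parameter change gets you closer to 99% R@1 is to compare the first returned id of each query against the ground truth. This is only an illustrative helper, not GGNN's own evaluation code, and it treats R@1 as "first returned id equals first ground-truth id", ignoring ties by distance:

// Hedged sketch: approximate R@1 from result ids and ground-truth ids.
// results and groundtruth are row-major: one row of k ids per query.
#include <cstdint>

double recall_at_1(const int32_t* results, int32_t k_query,
                   const int32_t* groundtruth, int32_t k_gt,
                   int32_t num_queries) {
  int32_t hits = 0;
  for (int32_t n = 0; n < num_queries; ++n)
    if (results[n * k_query] == groundtruth[n * k_gt]) ++hits;
  return static_cast<double>(hits) / num_queries;
}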

zzzmm1 commented Dec 7, 2022

Thanks a lot!
I will try.

zzzmm1 commented Dec 12, 2022

Hi,
Is there any way to get the number of distances computed during the search to compare the acceleration ratio between GGNN and other graph search algorithms (like HNSW)?
Since one is GPU-based and one is CPU-based, it's not fair to compare the speed directly.
Or is there another metric to measure the acceleration ratio between GGNN and brute-force search?

LukasRuppert (Collaborator)

Hi,

That functionality exists:

d_dist_stats[n] = cache.get_dist_stats();

query_kernel.d_dist_stats = m_dist_statistics;

There is just currently no code that would print it out.
Also, it looks to me like the memory containing the number of distances computed is being freed before the computation is actually done (CUDA kernels run asynchronously in the background), so better not to rely on this one.
In any case, it is only computed if explicitly requested via the template parameter DIST_STATS: counting the number of distances computed is extra work that is not needed to actually execute the query, and resources are scarce on the GPU, so even adding a few bytes of state can in some cases slow things down considerably.

An alternative way to get statistics is running the "stats" query, which runs the same query but collects a lot of statistics as well, so it will be a bit slower:

typedef StatsQueryKernel<measure, ValueT, KeyT, D, KBuild, KF, KQuery, S, BLOCK_DIM_X, BaseT,

you can find an example of using it here:

m_ggnn.queryLayerDebug<32, 400, 448, 64>();

But you would again have to add your code to actually write out the number of distances per query.
It must have gotten lost during some round of refactoring we did on the code.

You should add that code at some point before this line:

cudaFree(m_dist_statistics);

Since it is "managed" memory, it can be accessed by both the GPU and CPU, so you can just use it (as opposed to device memory, which is only accessible by the GPU).
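
As a minimal sketch of what that could look like (num_queries is a placeholder for your actual query count, and m_dist_statistics is assumed to hold one integer counter per query, as above), something along these lines placed just before the cudaFree should work:

// Hedged sketch: print the per-query distance counters before freeing them.
// Synchronize first, since the query kernel runs asynchronously and may still
// be writing to the managed memory.
cudaDeviceSynchronize();
for (int n = 0; n < num_queries; ++n)
  printf("query %d: %lld distance computations\n", n,
         static_cast<long long>(m_dist_statistics[n]));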

In general, comparing the performance of different algorithms is always a bit tricky.
I would argue that the best way to compare would be to run both algorithms on the same hardware,
but, of course, one cannot do that when comparing CPU and GPU-based implementations.

We also have brute-force search modes available and can even import an HNSW graph for searching, if you want to test that:

HNSWLoader m_hnsw_loader(FLAGS_graph_filename);

When using the "no slack" query, the search should be quite similar to what HNSW does on the CPU side,
but that query is not as efficient on the GPU as our regular query implementation with tau_query and a significantly shorter k-best list.

Anyway, I hope this helps.
Best regards,
Lukas

zzzmm1 commented Dec 12, 2022

Hello.
I added code before line 469, both printing dist_stats and writing dist_stats to a binary file.
The results are all 0.
Also, the result_distance printed on line 459 is all 0, while the printed m_query_results are correct.
Is there something wrong? Where exactly is the code that counts the number of distance computations?

LukasRuppert (Collaborator)

Ah, I see.

The counter is initialized here:

if (DIST_STATS && !threadIdx.x) dist_calc_counter = 0;

but never incremented.

All distances are computed here:

const ValueT dist = rs_dist_calc.distance_synced(other_m);

So you have to add something like

if (DIST_STATS && !threadIdx.x) ++dist_calc_counter;

after that line to actually count the distances.
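
Pieced together, the DIST_STATS pattern then looks roughly like this (a sketch based on the lines quoted above, not a verbatim copy of the kernel):

// 1. initialize the counter once per block (already present):
if (DIST_STATS && !threadIdx.x) dist_calc_counter = 0;

// 2. increment it right after every distance computation (the missing line):
const ValueT dist = rs_dist_calc.distance_synced(other_m);
if (DIST_STATS && !threadIdx.x) ++dist_calc_counter;

// 3. write it out at the end of the query so it lands in the managed
//    statistics buffer (already present):
d_dist_stats[n] = cache.get_dist_stats();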

@zzzmm1
Copy link
Author

zzzmm1 commented Dec 13, 2022

I have a confusing problem: CHECK_CUDA(cudaPeekAtLastError()) fails on line 391 when building and querying with mydataset_stats (generated from mydataset_stats.cu), but mydataset (generated from mydataset.cu) works fine.
I have tried sift1m and sift1m_stats and both return similar results.
When using mydataset_stats with a pre-built graph file, it works, but returns worse results than the normal mydataset (c@1 0.70 vs. 0.93).
In addition, the average number of distance computations per query on a 1M dataset is only 20k; is that expected?
This is much less than HNSW.

LukasRuppert (Collaborator) commented Dec 13, 2022

the "stats" and "non-stats" versions run completely independent code.
when there is a CUDA error, that means one has to debug a bit more to figure out what kind of error this was and what might be the cause.
with just this information, I can hardly even begin to speculate what might be the issue.
I can also sadly not spend too much time on this.
But if you can give me the parameters for GGNN and your dataset, I might be able to try and reproduce it here.

The main reason for seeing different performance is probably different parameters during construction and query.
One has to be quite careful there, since there are a lot of parameters.
E.g., the stats version seems to have a different query configuration and also does not run the query with the same parameters (simply look at the diff between the two).
For the number of distance computations per query, I don't have any numbers right now, but we can compute the worst case:
For every visited point (one per iteration), we compute the distance to all of its neighbors (k); cached distances are skipped.
So if we have the query configured to at most 400 iterations and the search graph configured with k=24 for SIFT1M, then there should be no more than 400*24 = 9600 distances computed.
In addition to these iterations, we also compute the distances to all s = 32 top-layer points once, but that is almost negligible in comparison.
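
For a quick sanity check against your own configuration, the same bound can be written down directly (illustrative constants only; plug in your own iteration count, k, and s):

// Worst-case bound on distance computations per query, as derived above:
// k neighbor distances per visited point (one visited point per iteration),
// plus one pass over the s top-layer points.
constexpr int max_iterations = 400;  // query iterations (SIFT1M example)
constexpr int k_build = 24;          // neighbors per point in the search graph
constexpr int s = 32;                // top-layer points
constexpr int worst_case_distances = max_iterations * k_build + s;  // 9632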

Best regards,
Lukas

@zzzmm1
Copy link
Author

zzzmm1 commented Dec 15, 2022

Thanks for your reply!
I am using the same build and query parameters in both the "stats" and "non-stats" versions: the "stats" version gives CUDA errors when both use the <32, 400, 448, 64> parameters, and both work fine when both use the <128, 2000, 2048, 64> parameters.
I adjusted each of these parameters for comparison and found that it was BLOCK_DIM_X that matters, probably because my dataset has 1024 dimensions, closer to the GIST dataset, but uses the cosine distance metric.
Also, does each iteration refer to visiting one node? In my experiments, MAX_ITERATIONS had the most noticeable effect on search performance when searching on the same built graph.
How should I set "base, factor and shard" when using multiple GPUs if the dataset size N_base is not divisible by 2?
