Improve VCIO bulk API package lookup performance #1561

Open
Tracked by #1538
pombredanne opened this issue Aug 20, 2024 · 5 comments

Comments

@pombredanne
Collaborator

pombredanne commented Aug 20, 2024

From aboutcode-org/dejacode#94 (comment) by @tdruez

Could you tell me the PURL types from the list that are not supported (no data available) by VCIO? Excluding those will reduce the number of "useless" requests to the API.
['gem', 'autotools', 'sourceforge', 'bitbucket', 'rpm', 'gitlab', 'cran', 'windows-program', 'docker', 'bower', 'nuget', 'generic', 'cargo', 'npm', 'deb', 'golang', 'maven', 'composer', 'pypi', 'hackage', 'unknown', 'rubygems', 'about', 'github']

For example, we have about 300,000 sourceforge PURLs in the nexB Dataspace; looking those up is a total waste of time and resources.

More context: for about 133,000 packages in the nexB Dataspace, a lookup run currently takes about 1 hour and 2,674 HTTP requests to the VCIO API.

The result is only 1,235 vulnerabilities fetched and created.
Seems like there's a lot of wasted time and resources with our current approach.
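
To put the figures above in proportion, here is a quick back-of-the-envelope calculation. It uses only the numbers quoted in this comment; the per-request time is an average that assumes the whole hour is spent on requests made one after another, which may not be exactly how the process runs.

```python
# Figures quoted in the comment above.
packages = 133_000
http_requests = 2_674
elapsed_seconds = 60 * 60  # "about 1h"
vulnerabilities = 1_235

# Average batch size and average wall-clock cost of one request.
packages_per_request = packages / http_requests
seconds_per_request = elapsed_seconds / http_requests

# How many vulnerabilities the run yields per 1,000 packages looked up.
vulnerabilities_per_1000_packages = 1000 * vulnerabilities / packages

print(f"{packages_per_request:.1f} packages per request")
print(f"{seconds_per_request:.2f} seconds per request")
print(f"{vulnerabilities_per_1000_packages:.1f} vulnerabilities per 1,000 packages")
```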

I suggest these progressive steps:

  • use a hardcoded list of distinct existing PURL types in VCIO
  • expose this list of existing PURL types as an endpoint
  • expose a new special endpoint that would provide a highly-compressed data structure to download quickly from VCIO and that you can query to know if a PURL may exist in VCIO
    • this could be an automaton (Aho-Corasick or FST) leveraging the fact that many PURLs share a common prefix, or a Bloom filter.
    • it would be best cached for a few hours and should come with client code that uses it to filter a (long) list of PURLs, removing those that surely do not exist in VCIO
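
A minimal sketch of the Bloom filter option from the last step, in plain Python with only the standard library. This is an illustration, not VCIO code: the `BloomFilter` class, the `filter_candidate_purls` helper, the sizing defaults, and the choice to key the filter on full PURL strings are all assumptions. A Bloom filter never returns a false negative, so a PURL it rejects is surely absent, while a PURL it accepts only *may* exist and still needs a real lookup.

```python
import hashlib
import math


class BloomFilter:
    """Tiny Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing formulas for the bit count (m) and hash count (k).
        n = max(1, expected_items)
        self.size = max(8, math.ceil(-n * math.log(false_positive_rate) / math.log(2) ** 2))
        self.hash_count = max(1, round(self.size / n * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def filter_candidate_purls(purls, known):
    """Keep only PURLs that may exist according to the filter (hypothetical helper)."""
    return [purl for purl in purls if purl in known]


# Server side (assumed): build the filter from every PURL that has data.
known_purls = ["pkg:pypi/django@4.2", "pkg:npm/lodash@4.17.21"]
known = BloomFilter(expected_items=len(known_purls))
for purl in known_purls:
    known.add(purl)

# Client side (assumed): drop PURLs that surely do not exist before calling the API.
candidates = known_purls + ["pkg:sourceforge/example@1.0"]
to_look_up = filter_candidate_purls(candidates, known)
```

The bit array in `known.bits` is what a client would download and cache. An automaton (Aho-Corasick or FST) would exploit shared prefixes instead and is not shown here.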
@tdruez
Contributor

tdruez commented Aug 20, 2024

From aboutcode-org/dejacode#94 (comment)

@pombredanne Thanks, this sounds like it will require some work to make this happen.

In the short term, could VCIO expose a new "action" on the package endpoint to get this list of supported types? (Should be a very small and fast query)
On the DejaCode side, the process could start by fetching the available types, then limit the QuerySet to those types and drastically reduce the number of queries.

>>> unique_types = Package.objects.values_list("type", flat=True).distinct()
>>> unique_types
<PackageQuerySet ['about', 'cargo', 'cocoapods', 'composer', 'deb', 'github', ...
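
A rough sketch of the client-side restriction described above, written against plain PURL strings so it runs without Django. The `SUPPORTED_TYPES` values are placeholders, not the actual list of types VCIO has data for, and `purl_type` and `keep_supported` are hypothetical helper names. On a Django QuerySet the same restriction would presumably be a `type__in` filter.

```python
# Placeholder values for illustration only: the real set would be fetched from
# VCIO (or hardcoded from its data), and is NOT known from this thread.
SUPPORTED_TYPES = frozenset({"pypi", "npm", "maven"})


def purl_type(purl):
    """Return the lowercased type of a PURL string, e.g. 'pypi' for 'pkg:pypi/django@4.2'."""
    scheme, sep, remainder = purl.partition(":")
    if not sep or scheme.lower() != "pkg":
        raise ValueError(f"Not a PURL: {purl!r}")
    # The type is the first path segment after the scheme.
    return remainder.lstrip("/").split("/", 1)[0].lower()


def keep_supported(purls, supported_types=SUPPORTED_TYPES):
    """Drop PURLs whose type has no data upstream, before any API request is made."""
    return [purl for purl in purls if purl_type(purl) in supported_types]


purls = ["pkg:pypi/django@4.2", "pkg:sourceforge/example@1.0", "pkg:npm/lodash@4.17.21"]
print(keep_supported(purls))
```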

@tdruez
Contributor

tdruez commented Aug 23, 2024

Another example that takes over a minute to load: https://public.vulnerablecode.io/api/vulnerabilities?vulnerability_id=VCID-j2zf-12g6-aaag

TG1999 self-assigned this Sep 12, 2024
@pombredanne
Collaborator Author

pombredanne commented Sep 12, 2024

We need to entirely change the API data we return, with a new endpoint that does not include all the package details in a vulnerability. We care about packages first and about vulnerabilities second, so when querying by vulnerability we should not serialize so much package data.
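
As a rough illustration of that idea, the sketch below serializes a vulnerability with only the PURLs of its affected packages instead of the full package records. Everything here is hypothetical: the `Package` and `Vulnerability` classes, the field names, the `serialize_vulnerability_slim` function, and the sample identifier are made up for the example and do not reflect the actual VCIO models or API schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Package:
    purl: str
    # Stand-in for the bulky per-package data a full serializer would emit.
    details: dict = field(default_factory=dict)


@dataclass
class Vulnerability:
    vulnerability_id: str
    summary: str
    affected_packages: list = field(default_factory=list)


def serialize_vulnerability_slim(vuln):
    """Return a compact payload: package references only, no package details."""
    return {
        "vulnerability_id": vuln.vulnerability_id,
        "summary": vuln.summary,
        "affected_purls": sorted(pkg.purl for pkg in vuln.affected_packages),
    }


vuln = Vulnerability(
    vulnerability_id="VCID-0000-0000-0000",  # made-up identifier
    summary="Example summary",
    affected_packages=[
        Package("pkg:pypi/example@1.1", {"releases": ["..."], "notes": "large blob"}),
        Package("pkg:pypi/example@1.0", {"releases": ["..."], "notes": "large blob"}),
    ],
)
payload = serialize_vulnerability_slim(vuln)
print(json.dumps(payload, indent=2))
```

A client that needs the details of one package can then request that package by its PURL, which keeps the per-vulnerability response small.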

@TG1999
Contributor

TG1999 commented Sep 12, 2024

This is a related issue to restructure the API:

@pombredanne
Collaborator Author

See a first PR to improve the results:


3 participants