Support parallel enumeration of Git repositories #69

bradlarsen · 2023-07-27T15:29:26Z

Currently, the scan command runs in two main phases: input enumeration and content scanning. Each of these phases runs in parallel (but not concurrently; the input enumeration phase completes entirely before the content scanning phase begins).

However, within the input enumeration phase, when a Git repository is discovered on the filesystem, that repository is enumerated sequentially, by a single thread. This becomes noticeable when you are scanning just a single huge repository, such as the Linux kernel, which has over a million commits, several million objects, and can take over a hundred GB of space when uncompressed.

It would be better if Nosey Parker did not have this sequential bottleneck, and was instead able to enumerate a single Git repository in parallel, using all available cores.

The implementation of this will be a bit tricky, requiring rework of the parallelism mechanism in the input enumerator code. That currently uses the ignore crate to do parallel filesystem walking, but that does not seem to expose its thread pool. We would want the proposed parallel Git enumerator to not oversubscribe the system running scan; the total number of enumeration threads should be controllable.
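To illustrate what "controllable" could mean here, below is a minimal sketch that uses only the standard library and caps the number of worker threads explicitly. It is not Nosey Parker's code and does not use the ignore crate; `inspect_object`, `enumerate_objects`, and the use of plain `u64` values as stand-ins for Git object IDs are all hypothetical names chosen for illustration.

```rust
use std::thread;

/// Hypothetical stand-in for the per-object work done during enumeration.
fn inspect_object(id: u64) -> u64 {
    id.wrapping_mul(2)
}

/// Process `object_ids` using at most `num_threads` worker threads.
/// Results come back in the same order as the input.
fn enumerate_objects(object_ids: &[u64], num_threads: usize) -> Vec<u64> {
    // Treat a request for zero threads as a request for one.
    let num_threads = num_threads.max(1);
    // Ceiling division, so the input splits into at most `num_threads` chunks.
    // `chunks` panics on a length of zero, hence the final `max(1)`.
    let chunk_len = ((object_ids.len() + num_threads - 1) / num_threads).max(1);

    thread::scope(|scope| {
        // One scoped thread per chunk; scoped threads may borrow `object_ids`.
        let handles: Vec<_> = object_ids
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|&id| inspect_object(id))
                        .collect::<Vec<u64>>()
                })
            })
            .collect();

        // Join in spawn order to preserve the input ordering.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Static chunking like this is the simplest way to bound the thread count, but it balances poorly when objects vary a lot in size; a shared work queue would likely be a better fit for a real repository.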

A further complication will be figuring out how to build up the Git metadata graph that is being added in #66 (to address #16): the core graph data structure there is not designed for out-of-the-box mutation from many threads.
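One possible way around the mutation problem, sketched here under assumptions, is to keep the graph owned by a single thread and have the enumeration workers send edges to it over a channel. This is not the design from #66; the `Edge` type, the `build_graph` function, and the `BTreeMap` adjacency list are hypothetical placeholders for whatever the real graph structure is.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::thread;

/// Hypothetical edge type: (child commit, parent commit).
type Edge = (u32, u32);

/// Build an adjacency list from edges discovered by several worker threads.
/// Only the calling thread ever mutates the graph.
fn build_graph(edge_batches: Vec<Vec<Edge>>) -> BTreeMap<u32, Vec<u32>> {
    let (tx, rx) = mpsc::channel::<Edge>();

    thread::scope(|scope| {
        // Each worker owns one batch and a clone of the sender.
        for batch in edge_batches {
            let tx = tx.clone();
            scope.spawn(move || {
                for edge in batch {
                    tx.send(edge).expect("receiver dropped");
                }
            });
        }
        // Drop the original sender so the receive loop ends once
        // every worker has finished and dropped its clone.
        drop(tx);

        let mut graph: BTreeMap<u32, Vec<u32>> = BTreeMap::new();
        for (child, parent) in rx {
            graph.entry(child).or_default().push(parent);
        }
        // Edge arrival order depends on scheduling; sort for a stable result.
        for parents in graph.values_mut() {
            parents.sort_unstable();
        }
        graph
    })
}
```

The trade-off is that the single consumer can itself become a bottleneck if graph insertion is expensive relative to object decoding.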


bradlarsen added the performance, content discovery, and enhancement labels Jul 27, 2023

bradlarsen changed the title ~~Rework input enumeration to make it possible to enumerate Git repositories in parallel~~ Support parallel enumeration of Git repositories Oct 13, 2023
