Support parallel enumeration of Git repositories #69
Labels
content discovery
Related to enumerating or specifying content to scan
enhancement
New feature or request
performance
Related to runtime performance
Currently, the `scan` command runs in two main phases: input enumeration and content scanning. Each of these phases runs in parallel internally, but the two phases do not run concurrently: the input enumeration phase completes entirely before the content scanning phase begins.
However, within the input enumeration phase, when a Git repository is discovered on the filesystem, that repository is enumerated sequentially, by a single thread. This becomes noticeable when you are scanning just a single huge repository, such as the Linux kernel, which has over a million commits, several million objects, and can take over a hundred GB of space when uncompressed.
It would be better if Nosey Parker did not have this sequential bottleneck, and was instead able to enumerate a single Git repository in parallel, using all available cores.
The implementation of this will be a bit tricky, requiring rework of the parallelism mechanism in the input enumerator code. That code currently uses the `ignore` crate to do parallel filesystem walking, but that crate does not seem to expose its thread pool. We would want the proposed parallel Git enumerator not to oversubscribe the system running `scan`; the total number of enumeration threads should be controllable.
An additional complication will be figuring out how to build up the Git metadata graph that is being added in #66 (to address #16): the core graph data structure there is not designed for out-of-the-box mutation from many threads.
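One possible shape for this, as a rough sketch only: split the work across a bounded number of threads, have each thread collect graph edges into its own local buffer, and merge the buffers into the graph on a single thread afterwards, so the graph itself is never mutated concurrently. Everything named here (`ObjectId`, `Commit`, `enumerate_parallel`) is hypothetical and uses only the standard library; it is not Nosey Parker's actual enumerator or the graph type from #66, and real object reading is replaced by an in-memory slice.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical stand-in for a Git object ID (a real one would be a hash).
type ObjectId = u64;

/// Hypothetical stand-in for a commit: its ID plus the IDs of its parents.
#[derive(Clone, Debug)]
struct Commit {
    id: ObjectId,
    parents: Vec<ObjectId>,
}

/// Enumerate `commits` with at most `num_threads` worker threads and return
/// a child -> parents adjacency map.
fn enumerate_parallel(
    commits: &[Commit],
    num_threads: usize,
) -> HashMap<ObjectId, Vec<ObjectId>> {
    let num_threads = num_threads.max(1);
    // Ceiling division yields at most `num_threads` chunks; the `.max(1)`
    // keeps `chunks` from panicking on empty input.
    let chunk_size = ((commits.len() + num_threads - 1) / num_threads).max(1);

    // Each worker builds a private edge list, so no locking is needed.
    let partials: Vec<Vec<(ObjectId, ObjectId)>> = thread::scope(|s| {
        let handles: Vec<_> = commits
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    let mut edges = Vec::new();
                    for commit in chunk {
                        for &parent in &commit.parents {
                            edges.push((commit.id, parent));
                        }
                    }
                    edges
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("enumeration worker panicked"))
            .collect()
    });

    // Single-threaded merge: the graph is only ever mutated here.
    let mut graph: HashMap<ObjectId, Vec<ObjectId>> = HashMap::new();
    for commit in commits {
        graph.entry(commit.id).or_default();
    }
    for (child, parent) in partials.into_iter().flatten() {
        graph.entry(child).or_default().push(parent);
    }
    graph
}
```

The thread count is an explicit parameter, so a caller could default it from `std::thread::available_parallelism()` or from a command-line option. The trade-off in this sketch is extra memory for the per-thread edge buffers and a sequential merge step at the end; whether that is acceptable for a repository the size of the Linux kernel would need measuring.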