PECANN is an efficient algorithm for large-scale, high-dimensional density peaks clustering (DPC). This library contains PECANN's fast, parallel C++ implementation, as well as easy-to-use Python bindings.
To build PECANN, you can build either the Python bindings or the C++ library and executable. Instructions for both are below.

To build and install the Python bindings:

python3 -m pip install .

To build the C++ library and executable, first install nanobind and then build with CMake:

python3 -m pip install nanobind
mkdir build
cd build
cmake ..
make
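After installing the Python bindings, a quick sanity check is simply importing the module (a minimal sketch; it only verifies that the dpc_ann module used in the examples below is importable):

```python
# If the bindings were built and installed correctly, this import succeeds without errors.
import dpc_ann

print("dpc_ann imported successfully")
```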
Below are simple examples of how to run PECANN from Python and from the command line on one of the sample Gaussian datasets:
import numpy as np
import dpc_ann

# data can be any 2D float32 numpy array; here we load one of the sample Gaussian
# datasets (adjust the path to wherever your data lives).
data = np.load("data/gaussian_example/gaussian_4_1000.npy")

cluster_results = dpc_ann.dpc_numpy(
    center_finder=dpc_ann.ThresholdCenterFinder(dependant_dist_threshold=8.36),
    density_computer=dpc_ann.KthDistanceDensityComputer(),
    data=data,
)
# You could also do e.g. center_finder=dpc_ann.ProductCenterFinder(num_clusters=4)
clusters = cluster_results.clusters # 1 x N numpy array of cluster assignments
metadata = cluster_results.metadata # Dictionary of runtimes for various parts of the clustering process
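As a sketch of how these results might be inspected, assuming scikit-learn is installed and continuing from the example above (the ground-truth path is hypothetical and follows the .gt format described in the datasets section below):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Flatten the 1 x N assignment array into a plain label vector.
labels = np.asarray(cluster_results.clusters).ravel()

# Hypothetical ground-truth file: one cluster id per line, one line per point.
ground_truth = np.loadtxt("data/gaussian_example/gaussian_4_1000.gt")

print("ARI vs. ground truth:", adjusted_rand_score(ground_truth, labels))
print("Runtime breakdown:", cluster_results.metadata)
```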
Below are additional arguments this function takes, along with a short description of each.
Argument | Description |
---|---|
data | A 2D float32 numpy array. |
K | How many neighbors to use to estimate the density (default is 6). |
Lbuild | The beam search size to use during the underlying graph index construction (default is 12). |
L | The beam search size to use while finding the nearest K neighbors for each point (default is 12). |
Lnn | The initial beam search size to use when finding dependent points (default is 4). |
center_finder | The center finder to use to prune the DPC tree. Options include ProductCenterFinder or ThresholdCenterFinder, both of which can be directly constructed from the dpc_ann module. |
density_computer | The density computer to use to compute the density of each point. Options include KthDistanceDensityComputer, ExpSquaredDensityComputer, and the NormalizedDensityComputer (among others), all of which can be constructed directly from the dpc_ann module. Default is KthDistanceDensityComputer. |
output_path | An optional parameter specifying the path to write the cluster assignment to. |
decision_graph_path | An optional parameter specifying the path to write the decision graph to. |
max_degree | The maximum degree of the underlying graph index (default 16). |
alpha | The alpha to use for the underlying graph index, for indices that require it (default 1.2). |
num_clusters | The number of times to independently reconstruct the underlying graph index, for indices that require this (default 4). |
graph_type | Which underlying graph index to use (default "Vamana", choices are "Vamana", "HCNNG", "BruteForce", "pyNNDescent"). |
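For example, a call that sets several of these arguments explicitly might look like the sketch below, continuing from the example above; the specific values are illustrative, not tuned recommendations:

```python
cluster_results = dpc_ann.dpc_numpy(
    data=data,                           # 2D float32 numpy array
    K=8,                                 # neighbors used for the density estimate
    Lbuild=32,                           # beam width during graph index construction
    L=32,                                # beam width for the K-nearest-neighbor search
    Lnn=8,                               # initial beam width when finding dependent points
    max_degree=32,                       # maximum degree of the graph index
    alpha=1.2,                           # for indices that require it
    graph_type="Vamana",                 # or "HCNNG", "BruteForce", "pyNNDescent"
    density_computer=dpc_ann.KthDistanceDensityComputer(),
    center_finder=dpc_ann.ProductCenterFinder(num_clusters=4),
    output_path="./results/gaussian_4_1000.cluster",       # optional
    decision_graph_path="./results/gaussian_4_1000.dg",    # optional
)
```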
From the command line:

./build/dpc_framework_exe --query_file ./data/gaussian_example/gaussian_4_1000.data --decision_graph_path ./results/gaussian_4_1000.dg --output_file ./results/gaussian_4_1000.cluster --dist_cutoff 8.36
The command-line executable accepts most of the same options as the Python bindings; they can be seen by running it with no arguments.
All of the experiments from our paper can be easily reproduced.
Our experiments' dependencies are listed in requirements.txt. You can install them by running
pip3 install -r requirements.txt
As baselines for our experiments, we evaluate the K-means implementation from the excellent FAISS library as well as fastdp, a prior high-dimensional DPC algorithm. FAISS is included as a dependency in requirements.txt. To install fastdp, you need to clone their repo and build it:
git clone https://github.com/uef-machine-learning/fastdp
cd fastdp
pip3 install .
Our experiments expect each dataset to be in a folder named data/<dataset_name> (if you wish to use another location for disk space reasons, you can store the datasets there and add a symlink to the data folder). A dataset consists of two files: a <dataset_name>.npy file containing a 2D float32 numpy array, where each row is a point in the dataset, and a <dataset_name>.gt text file, where the ith line is a number giving the ground-truth cluster of the ith point in the dataset.
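For illustration, a small synthetic dataset in this format could be written out as in the sketch below (the dataset name my_dataset, the dimensions, and the labels are all hypothetical):

```python
import os
import numpy as np

os.makedirs("data/my_dataset", exist_ok=True)            # hypothetical dataset folder

points = np.random.rand(1000, 16).astype("float32")      # each row is a point
labels = np.random.randint(0, 4, size=1000)              # ground-truth cluster id per point

np.save("data/my_dataset/my_dataset.npy", points)                 # 2D float32 numpy array
np.savetxt("data/my_dataset/my_dataset.gt", labels, fmt="%d")     # ith line = cluster of ith point
```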
The datasets used in our experiments can be downloaded or generated as follows:

- bin/download_simple_datasets.sh downloads the s2, s3, and unbalance datasets.
- bin/download_mnist.sh downloads MNIST.
- python3 data_processors/create_imagenet.py --dataset_path <path to kaggle imagenet download> creates the ImageNet embedding dataset.
- python3 data_processors/create_birds.py --dataset_path <path to kaggle birds download train subset folder> creates the birds embedding dataset.
- python3 data_processors/embed_huggingface.py --dataset_name mteb/reddit-clustering --download_dir <path to store downloaded data> --data_dir <path to store embedded data> creates the reddit dataset.
- python3 data_processors/embed_huggingface.py --dataset_name mteb/arxiv-clustering-s2s --download_dir <path to store downloaded data> --data_dir <path to store embedded data> creates the arxiv dataset.
By default, all experiments except the thread scaling experiment run with all threads on your machine.
The first command below varies the dataset size and saves the ARI results to a log file in results (cluster_analysis_dataset_size_scaling_{timestamp}); the second command plots the results from that file and saves the resulting PDF to results/paper. Most of our experiments follow this pattern.
python3 experiments/paper/dataset_size_scaling/dataset_size_scaling.py
python3 experiments/paper/dataset_size_scaling/plot_dataset_size_scaling.py results/cluster_analysis_dataset_size_scaling_{fill_in_run_specific_value}
The following commands generate the data for the Pareto frontier of each method. Since they include brute-force runs of DPC and many hyperparameter settings, they may take a few days or longer to run, depending on how many cores your machine has.
python3 experiments/paper/brute_force.py
python3 experiments/paper/pareto/different_graph_methods_on_imagenet.py
python3 experiments/paper/pareto/vamana_pareto.py
bash experiments/paper/pareto/test_fastdp.sh
The following command plots the Pareto fronts:
python3 experiments/paper/pareto/plot_pareto.py results
The following commands run and plot the varying K experiment:

python3 experiments/paper/varying_k/varying_k.py
python3 experiments/paper/varying_k/plot_varying_k.py results
The following commands run and plot the varying number of clusters experiment:

python3 experiments/paper/varying_num_clusters/varying_num_clusters.py
python3 experiments/paper/varying_num_clusters/plot_varying_num_clusters.py results/cluster_analysis_varying_num_clusters.csv
The thread scaling experiment assumes a 30-core, 60-thread machine. If your machine has a different number of cores, you will need to change line 78 of experiments/paper/thread_scaling/thread_scaling.py accordingly:
if len(current_threads) in [1, 2, 4, 8, 15, 30, 60]:
Also, if your machine has multiple NUMA nodes, it is a good idea to change line 78 to contain only the thread counts that fit on a single NUMA node and run the experiment, then change it to contain only the multi-NUMA-node thread counts and run it again prefixed with numactl -i all, and finally combine the resulting CSV files.
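As a minimal sketch of that last combining step, assuming the two runs write plain CSV result files with identical headers (the file names below are hypothetical):

```python
import pandas as pd

# Hypothetical names for the single-NUMA-node and multi-NUMA-node result files.
single_node = pd.read_csv("results/thread_scaling_single_numa.csv")
multi_node = pd.read_csv("results/thread_scaling_multi_numa.csv")

combined = pd.concat([single_node, multi_node], ignore_index=True)
combined.to_csv("results/thread_scaling_combined.csv", index=False)
```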
The commands to run and plot the thread scaling experiment are:
python3 experiments/paper/thread_scaling/thread_scaling.py
python3 experiments/paper/thread_scaling/plot_thread_scaling.py results --threads