par-filtered-graph-clustering

To get the submodules:

git pull
git submodule update --init

Implementations

Each folder contains an implementation that we tested.

par_tmfg our parallel TMFG and DBHT implementation

hac the hierarchical agglomerative clustering (HAC) algorithm by Yu et al.

Aste Aste's MATLAB TMFG+DBHT implementation. This is modified from DBHT and PMFG. The modifications include adding timers for benchmarking and substitute some subroutines for better performance. Speficically, we changed Aste's TMFG+DBHT implementation (DBHTs.m file) to use boost library's all pair's shortest path and breadth-first search implementation, because this gives significant speedup. Aste's MATLAB PMFG+DBHT implementation also uses boost's implementation.

mpi-scalablekmeanspp the C++ implmentation of k-means++.

Installation Requirement

g++ = 7.5.0
make
C++ boost library
MATLAB
MATLAB BGL

After boost is installed, set the BOOST_ROOT variable in par_tmfg/Makefile to the address of boost folder

Input

The input to both implementations is a symmetric matrix. The format of the file is a binary file with $n^2$ double numbers. An example file is in the dataset folder.

Datasets

The UCR data sets can be downloaded from here. The stock data can be obtained using the Yahoo Finance API. Our data is obtained in Nov. 2021.

You can also download our data here. There is a readme.md in the data repository linked above that explains how to use the datasets.

Running Tests

For running time tests, we use numactl. It can be installed using apt install numactl.

To run HAC:

run make in hac/general_hac

PARLAY_NUM_THREADS=wk numactl -i all ./linkage dataset n outpout method round

wk is the number of workers to use
numactl -i all is optional
dataset is the file name of the input distance matrix (in binary format)
n is the number of data points
output is the file name of the output file for the resulting dendrogram
method can be "comp" or "avg" for complete linkage and average linkage respectively
round is the number of times to run the program

cd hac/general_hac
make
PARLAY_NUM_THREADS=${wk} numactl -i all ./linkage ../../datasets/CBF.dat 930 outputs/CBF_comp_dendro comp 1

To run parallel TMFG + DBHT:

run make in par_tmfg

PARLAY_NUM_THREADS=wk numactl -i all ./tmfg S output n D method prefix round

wk is the number of workers to use
numactl -i all is optional
S is the file name of the input similarity matrix (in binary format)
output is the file name prefix of the output file for the resulting dendrogram (-Z) and the resulting TMFG (-P). The outputs are going to be saved folders "par_tmfg/outputs/Ps/" and "par_tmfg/outputs/Zs/", so these two folders should be created in advance.
n is the number of data points
D is the file name of the input dissimilarity matrix. If D=0, will use D = sqrt(2(1-s))
method can be "exact" or "prefix".
prefix is the prefix size to insert in each round. it is ignored when method is exact
round is the number of times to run the program

cd par_tmfg
make
PARLAY_NUM_THREADS=${wk} numactl -i all ./tmfg ../datasets/CBF.dat outputs/CBF 930 0 prefix 2 1
PARLAY_NUM_THREADS=1 ./tmfg ../datasets/CBF.dat outputs/CBF 930 0 exact 0 1

To run Aste's implementation of TMFG/PMFG+DBHT:

UCR_PMFG(dataset, inputdir, outputdir)

UCR_TMFG(dataset, inputdir, outputdir)

dataset is the name of the dataset.
inputdir is the directory of the input dataset
outputdir is the output directory

cd Aste
matlab -nojvm -nosplash -nodesktop  -nodisplay  -r  'UCR_PMFG("iris", "../datasets/", "outputs"); exit'  -logfile outputs/iris_pmfg_timing.txt
matlab -nojvm -nosplash -nodesktop  -nodisplay  -r  'UCR_TMFG("iris", "../datasets/", "outputs"); exit'  -logfile outputs/iris_tmfg_timing.txt

To run KMeans:

The C++ k-means++ code is in mpi-scalablekmeanspp/ folder.

To run KMeans+Spectral:

from sklearn.cluster import SpectralClustering

SpectralClustering(n_clusters=k, affinity="nearest_neighbors",
                                n_neighbors=n_neighbor,
                                assign_labels='discretize',
                                random_state=1, n_jobs=worker).fit(X)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

par-filtered-graph-clustering

Implementations

Installation Requirement

Input

Datasets

Running Tests

To run HAC:

To run parallel TMFG + DBHT:

To run Aste's implementation of TMFG/PMFG+DBHT:

To run KMeans:

To run KMeans+Spectral:

Files

README.md

Latest commit

History

README.md

File metadata and controls

par-filtered-graph-clustering

Implementations

Installation Requirement

Input

Datasets

Running Tests

To run HAC:

To run parallel TMFG + DBHT:

To run Aste's implementation of TMFG/PMFG+DBHT:

To run KMeans:

To run KMeans+Spectral: