To get the submodules:
git pull
git submodule update --init
Each folder contains an implementation that we tested.
par_tmfg
our parallel TMFG and DBHT implementation
hac
the hierarchical agglomerative clustering (HAC) algorithm by Yu et al.
Aste
Aste's MATLAB TMFG+DBHT implementation, modified from DBHT and PMFG. The modifications add timers for benchmarking and substitute some subroutines for better performance. Specifically, we changed Aste's TMFG+DBHT implementation (the DBHTs.m file) to use the boost library's all-pairs shortest path and breadth-first search implementations, because this gives a significant speedup. Aste's MATLAB PMFG+DBHT implementation also uses boost's implementation.
mpi-scalablekmeanspp
the C++ implementation of k-means++.
- g++ = 7.5.0
- make
- C++ boost library
- MATLAB
- MATLAB BGL
After boost is installed, set the BOOST_ROOT variable in par_tmfg/Makefile to the path of the boost folder.
The input to both implementations is a symmetric matrix.
The format of the file is a binary file with dataset
folder.
The UCR data sets can be downloaded from here. The stock data can be obtained using the Yahoo Finance API. Our data was obtained in Nov. 2021.
You can also download our data here. There is a readme.md in the data repository linked above that explains how to use the datasets.
For running time tests, we use numactl. It can be installed using apt install numactl.
Run make in hac/general_hac.
PARLAY_NUM_THREADS=wk numactl -i all ./linkage dataset n output method round
- wk is the number of workers to use
- numactl -i all is optional
- dataset is the file name of the input distance matrix (in binary format)
- n is the number of data points
- output is the file name of the output file for the resulting dendrogram
- method can be "comp" or "avg" for complete linkage and average linkage, respectively
- round is the number of times to run the program
cd hac/general_hac
make
PARLAY_NUM_THREADS=${wk} numactl -i all ./linkage ../../datasets/CBF.dat 930 outputs/CBF_comp_dendro comp 1
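For running time tests over several worker counts, the command line above can be assembled programmatically. The following is only a sketch: the helper name `linkage_command` and its keyword arguments are hypothetical and not part of this repository, and it merely builds the environment and argument list in the order documented above without running anything.

```python
import os


def linkage_command(dataset, n, output, method, rounds=1, workers=1, interleave=True):
    """Return (env, argv) for one ./linkage run. Nothing is executed here.

    This helper is hypothetical; it only mirrors the documented argument
    order: dataset n output method round.
    """
    if method not in ("comp", "avg"):
        raise ValueError("method must be 'comp' or 'avg'")
    # PARLAY_NUM_THREADS sets the number of workers.
    env = dict(os.environ, PARLAY_NUM_THREADS=str(workers))
    argv = ["./linkage", dataset, str(n), output, method, str(rounds)]
    if interleave:
        # numactl -i all is optional.
        argv = ["numactl", "-i", "all"] + argv
    return env, argv


# Build the example command for a few worker counts (the counts are arbitrary).
for wk in (1, 2, 4):
    env, argv = linkage_command(
        "../../datasets/CBF.dat", 930, "outputs/CBF_comp_dendro", "comp", workers=wk
    )
    print("PARLAY_NUM_THREADS=" + env["PARLAY_NUM_THREADS"], " ".join(argv))
```

To actually launch a run, the pair could be passed to `subprocess.run(argv, env=env, check=True)` from inside hac/general_hac after running make.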
Run make in par_tmfg.
PARLAY_NUM_THREADS=wk numactl -i all ./tmfg S output n D method prefix round
- wk is the number of workers to use
- numactl -i all is optional
- S is the file name of the input similarity matrix (in binary format)
- output is the file name prefix of the output files for the resulting dendrogram (-Z) and the resulting TMFG (-P). The outputs are saved in the folders "par_tmfg/outputs/Ps/" and "par_tmfg/outputs/Zs/", so these two folders should be created in advance.
- n is the number of data points
- D is the file name of the input dissimilarity matrix. If D=0, the program uses D = sqrt(2(1-s)).
- method can be "exact" or "prefix".
- prefix is the prefix size to insert in each round. It is ignored when method is "exact".
- round is the number of times to run the program
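To illustrate the default dissimilarity D = sqrt(2(1-s)) used when D=0, here is a small pure-Python sketch that applies the formula entrywise to a toy similarity matrix. The function name and the toy values are made up for illustration, and the clamp at zero is an added guard against rounding error rather than something the tmfg program is known to do.

```python
import math


def similarity_to_dissimilarity(S):
    """Apply D = sqrt(2 * (1 - s)) to every entry s of a square matrix S."""
    # max(0.0, ...) is an assumption added here so that a similarity that is
    # slightly above 1 due to rounding does not produce a math domain error.
    return [[math.sqrt(max(0.0, 2.0 * (1.0 - s))) for s in row] for row in S]


# Toy symmetric similarity matrix (for example, correlations).
S = [
    [1.0, 0.5, -1.0],
    [0.5, 1.0, 0.0],
    [-1.0, 0.0, 1.0],
]
D = similarity_to_dissimilarity(S)
for row in D:
    print(row)
```

With this formula, a similarity of 1 maps to a dissimilarity of 0, and a similarity of -1 maps to the maximum dissimilarity of 2.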
cd par_tmfg
make
PARLAY_NUM_THREADS=${wk} numactl -i all ./tmfg ../datasets/CBF.dat outputs/CBF 930 0 prefix 2 1
PARLAY_NUM_THREADS=1 ./tmfg ../datasets/CBF.dat outputs/CBF 930 0 exact 0 1
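The example commands above write under outputs/, and the folders outputs/Ps/ and outputs/Zs/ must exist before tmfg runs. A minimal sketch for creating them is shown below; the helper name `ensure_tmfg_output_dirs` is hypothetical and not part of the repository.

```python
import os


def ensure_tmfg_output_dirs(par_tmfg_dir="."):
    """Create outputs/Ps and outputs/Zs under par_tmfg_dir if they are missing.

    Returns the list of directory paths so a caller can log or check them.
    """
    paths = []
    for sub in ("Ps", "Zs"):
        path = os.path.join(par_tmfg_dir, "outputs", sub)
        # exist_ok=True makes repeated calls harmless.
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths
```

Calling `ensure_tmfg_output_dirs()` from inside par_tmfg before the first run should be enough. The equivalent shell command is `mkdir -p outputs/Ps outputs/Zs`.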
UCR_PMFG(dataset, inputdir, outputdir)
UCR_TMFG(dataset, inputdir, outputdir)
- dataset is the name of the dataset
- inputdir is the directory of the input dataset
- outputdir is the output directory
cd Aste
matlab -nojvm -nosplash -nodesktop -nodisplay -r 'UCR_PMFG("iris", "../datasets/", "outputs"); exit' -logfile outputs/iris_pmfg_timing.txt
matlab -nojvm -nosplash -nodesktop -nodisplay -r 'UCR_TMFG("iris", "../datasets/", "outputs"); exit' -logfile outputs/iris_tmfg_timing.txt
The C++ k-means++ code is in the mpi-scalablekmeanspp/ folder.
from sklearn.cluster import SpectralClustering
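# k: number of clusters; n_neighbor: neighbors used to build the affinity graph;
# worker: number of parallel jobs; X: data array of shape (n_samples, n_features)
SpectralClustering(n_clusters=k, affinity="nearest_neighbors",
                   n_neighbors=n_neighbor,
                   assign_labels='discretize',
                   random_state=1, n_jobs=worker).fit(X)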