This repository contains supplemental materials, including an additional document
and a code set for applying our K-means clustering algorithm, AF-ICP,
to large-scale and high-dimensional sparse data sets such as
the 8.2M-sized PubMed data set, and for comparing it with the other algorithms,
ICP, TA-ICP, and CS-ICP, in `./Comparison`.
The code is implemented in C and requires the following environment:
- OS: CentOS 7.6 and later
- g++ (GCC): >= 8.2.0
- perl: >= 5.16
- perf: 3.10
- bzip2 (optional)
- Prepare the 8.2M-sized PubMed data set with the procedure in the `dataset` directory.
  This procedure creates `./dataset/pubmed.8_2M.db`, which is used by the code in this repository.
  If you fail to download the original data (`docword.pubmed.txt`) from the UCI Machine Learning Repository,
  you can download `pubmed.8_2M.db.bz2` instead. Then execute `bzip2 -d pubmed.8_2M.db.bz2`
  to extract `pubmed.8_2M.db` and move it to the `./dataset` directory (see the first sketch after this list).
- Execute `make -f Makefile_itr5_aficp` in `./src`.
  This builds the `./bin/itr5_aficp` executable on your system (see the second sketch after this list).
- Execute the perl script `./itr5_exeAFICP_8.2Mpubmed_perf.pl` in `./exe` (see the third sketch after this list).
  The 8.2M-sized PubMed data set is loaded from `./dataset/pubmed.8_2M.db` (3.8 GB) in around two minutes,
  and, given K=10,000, AF-ICP is executed with 50-thread parallel processing (default).
  You can change the default values in the perl scripts. For instance, the number of threads is defined by `$NumThreads` in the script.
  A log file is generated in `./Log`.
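
The first sketch covers the dataset-preparation fallback. It assumes `pubmed.8_2M.db.bz2` has been downloaded into the repository root; adjust the paths to wherever you saved the archive.

```sh
# Fallback preparation: extract the pre-built database from the downloaded
# archive and place it where the scripts expect it (./dataset).
# The download location is an assumption; adjust the paths as needed.
bzip2 -d pubmed.8_2M.db.bz2      # produces pubmed.8_2M.db (about 3.8 GB)
mv pubmed.8_2M.db ./dataset/     # the scripts read ./dataset/pubmed.8_2M.db
```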
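
The second sketch covers the build step, assuming it is run from the repository root and that the Makefile places the binary in `./bin` as described above.

```sh
# Build the AF-ICP executable.
cd src
make -f Makefile_itr5_aficp      # builds ./bin/itr5_aficp (path relative to the repository root)
cd ..
ls -l bin/itr5_aficp             # confirm the executable exists
```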
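
The third sketch covers the run step. Whether the script is invoked directly or via `perl`, and whether `./Log` is relative to `./exe` or to the repository root, are assumptions; only the script name, the default of 50 threads, K=10,000, and the `./Log` output come from the description above.

```sh
# Run AF-ICP on the 8.2M-sized PubMed data set.
cd exe
perl ./itr5_exeAFICP_8.2Mpubmed_perf.pl   # loads ./dataset/pubmed.8_2M.db, K=10,000, 50 threads by default
cd ..
ls Log/                                   # a log file for the run is generated in ./Log
# To change the number of threads, edit $NumThreads inside the perl script before running.
```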
To compare AF-ICP with the other algorithms (ICP, TA-ICP, and CS-ICP), go to `Comparison`.
Please check LICENSE for details.