This repository contains supplemental materials, including an additional document
and a code set for applying our K-means clustering algorithm, AF-ICP,
to large-scale and high-dimensional sparse data sets such as
the 8.2M-sized PubMed data set, and for comparing it with the other algorithms,
ICP, TA-ICP, and CS-ICP, in `./Comparison`.
The code is implemented in C and requires the following environment:
- OS: CentOS 7.6 and later
- g++ (GCC): >= 8.2.0
- perl: >= 5.16
- perf: 3.10
- bzip2 (optional)
- Prepare the 8.2M-sized PubMed data set with the procedure in the `dataset` directory.
  This procedure creates `./dataset/pubmed.8_2M.db`, which is used by the code in this repository.
  If you fail to download the original data (`docword.pubmed.txt`) from the UCI Machine Learning Repository,
  you can download `pubmed.8_2M.db.bz2` instead. Then execute `bzip2 -d pubmed.8_2M.db.bz2`
  to extract `pubmed.8_2M.db` and move it to the `./dataset` directory (see the first sketch after this list).
- Execute `make -f Makefile_itr5_aficp` in `./src`.
  This builds the `./bin/itr5_aficp` executable on your system (see the second sketch after this list).
- Execute the perl script `./itr5_exeAFICP_8.2Mpubmed_perf.pl` in `./exe` (see the third sketch after this list).
  The 8.2M-sized PubMed data set is loaded from `./dataset/pubmed.8_2M.db` (3.8 GB) in around two minutes,
  and, given K=10,000, AF-ICP is executed with 50-thread parallel processing (default).
  You can change the default values in the perl scripts. For instance, the number of threads is defined by `$NumThreads` in the script.
  A log file is generated in `./Log`.
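
The first sketch covers the dataset-preparation fallback. It assumes `pubmed.8_2M.db.bz2` has been downloaded into the repository root; adjust the paths to wherever you saved the archive.

```sh
# Fallback preparation: extract the pre-built database from the downloaded
# archive and place it where the scripts expect it (./dataset).
# The download location is an assumption; adjust the paths as needed.
bzip2 -d pubmed.8_2M.db.bz2      # produces pubmed.8_2M.db (about 3.8 GB)
mv pubmed.8_2M.db ./dataset/     # the scripts read ./dataset/pubmed.8_2M.db
```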
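
The second sketch covers the build step, assuming it is run from the repository root and that the Makefile places the binary in `./bin` as described above.

```sh
# Build the AF-ICP executable.
cd src
make -f Makefile_itr5_aficp      # builds ./bin/itr5_aficp (path relative to the repository root)
cd ..
ls -l bin/itr5_aficp             # confirm the executable exists
```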
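
The third sketch covers the run step. Whether the script is invoked directly or via `perl`, and whether `./Log` is relative to `./exe` or to the repository root, are assumptions; only the script name, the default of 50 threads, K=10,000, and the `./Log` output come from the description above.

```sh
# Run AF-ICP on the 8.2M-sized PubMed data set.
cd exe
perl ./itr5_exeAFICP_8.2Mpubmed_perf.pl   # loads ./dataset/pubmed.8_2M.db, K=10,000, 50 threads by default
cd ..
ls Log/                                   # a log file for the run is generated in ./Log
# To change the number of threads, edit $NumThreads inside the perl script before running.
```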
To compare AF-ICP with the other algorithms (ICP, TA-ICP, and CS-ICP), go to `Comparison`.
Please check LICENSE for details.