Clustering high-dimensional data such as the embeddings from language models poses unique challenges. Unlike K-Means, where you need to specify the number of clusters beforehand, this repository offers an advanced solution using the DBSCAN algorithm—a more adaptable and insightful method, especially useful for large and complex datasets.
This collection of scripts:
- Loads the CSV file containing high-dimensional vectors.
- Utilizes the DBSCAN algorithm to cluster entities based on their similarity without the need for predefining a specific number of clusters.
- Identifies and separates outliers or noise from the main clusters to prevent the merging of dissimilar entities.
- Exports the results with assigned cluster IDs for each entity to an Excel file for convenient review and analysis.
- Automated Cluster Detection: Finds the natural number of clusters in the data without any preset conditions.
- Customization: Adjustable similarity thresholds and minimum sample sizes to fine-tune the clustering process.
- High-Dimensional Data Handling: Developed to work with embeddings from models such as OpenAI's text-embedding-3-large.
- Noise and Outlier Management: Isolates less similar vectors effectively, maintaining cleaner and more meaningful clusters.
- Silhouette Score Assessment: Provides an option to measure the clustering quality using the silhouette score, where feasible.
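One detail worth noting about the adjustable similarity threshold: with scikit-learn's cosine metric, DBSCAN's `eps` parameter is a *distance*, so a similarity threshold maps to it as `eps = 1 - similarity`. A minimal sketch (the helper name is illustrative, not from the scripts):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# With metric="cosine", DBSCAN's eps is a cosine *distance*:
# distance = 1 - similarity, so a similarity threshold of 0.85
# corresponds to eps = 0.15.
def similarity_to_eps(similarity_threshold: float) -> float:
    return 1.0 - similarity_threshold

a = np.array([[1.0, 0.0]])
b = np.array([[1.0, 1.0]])  # 45 degrees apart -> similarity ~0.707
dist = cosine_distances(a, b)[0, 0]
print(round(dist, 3))  # 0.293
```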
- Install the necessary Python packages: `pandas`, `numpy`, `scikit-learn`, and `openpyxl`.
- Set `input_csv_path` to the path of the CSV file where your embeddings are stored.
- Run the script. It will automatically perform the clustering, identify the noise, and save the results.
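The steps above can be sketched end to end. This is a hedged stand-in, not the repository's exact script: it uses a tiny synthetic array in place of your embeddings CSV, and the `eps`/`min_samples` values are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Tiny synthetic stand-in for an embeddings CSV: one row per entity,
# one numeric column per dimension.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([1, 0, 0], 0.01, (5, 3)),  # bundle around one direction
    rng.normal([0, 1, 0], 0.01, (5, 3)),  # bundle around another
    [[1, 1, 1]],                          # lone outlier
])
df = pd.DataFrame(X)

# eps is a cosine *distance*: eps = 1 - similarity threshold.
labels = DBSCAN(eps=0.15, min_samples=3, metric="cosine").fit_predict(X)
df["cluster_id"] = labels  # -1 marks noise/outliers

# Silhouette score needs at least two clusters; exclude noise first.
mask = labels != -1
if len(set(labels[mask])) > 1:
    print("silhouette:", round(silhouette_score(X[mask], labels[mask], metric="cosine"), 3))

df.to_excel("clusters.xlsx", index=False)  # requires openpyxl
```

The outlier row ends up labeled `-1`, keeping it out of both clusters rather than forcing a merge.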
The `sweet_spot_finder.py` script assists in finding the optimal DBSCAN parameters by testing different combinations of similarity thresholds and minimum samples. It runs multiple iterations of the clustering process in parallel and reports the number of clusters formed for each configuration. This helps in identifying the "sweet spot," where the clustering logic best aligns with the natural structure of the data.
- Set the input CSV file path by changing `input_csv_path` in the script.
- Review and adjust the ranges for `similarity_thresholds` and `min_samples_values` to fit your dataset and clustering goals.
- Execute the script. The output will display each configuration and its corresponding number of clusters, aiding you in selecting the best parameters for DBSCAN.
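The grid search can be sketched as follows. This is an assumption-laden stand-in for `sweet_spot_finder.py`: the data is synthetic, the threshold ranges are placeholders, and threads substitute for whatever parallelism the real script uses:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from sklearn.cluster import DBSCAN

def count_clusters(args):
    X, similarity, min_samples = args
    labels = DBSCAN(eps=1.0 - similarity, min_samples=min_samples,
                    metric="cosine").fit_predict(X)
    return similarity, min_samples, len(set(labels) - {-1})  # ignore noise label

# Synthetic embeddings: four tight bundles around orthogonal directions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.02, (10, 4)) for c in np.eye(4)])

similarity_thresholds = [0.80, 0.90, 0.95]  # placeholder ranges; tune per dataset
min_samples_values = [2, 3, 5]
grid = [(X, s, m) for s in similarity_thresholds for m in min_samples_values]

# Run every configuration and report the cluster count for each.
with ThreadPoolExecutor() as pool:
    for s, m, n in pool.map(count_clusters, grid):
        print(f"similarity={s:.2f} min_samples={m} -> {n} clusters")
```

A configuration whose cluster count stays stable across neighboring thresholds is usually the "sweet spot" the script is after.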
The added HDBSCAN & PCA functionality helps in:
- Reducing Computation Time: By trimming down the dimensions, PCA speeds up the clustering process while still retaining the essential characteristics of the data.
- Enhancing Clustering Performance: With fewer dimensions, clustering algorithms like HDBSCAN can perform more efficiently and potentially yield more meaningful clustering results.
- Facilitating Data Visualization: Lower-dimensional data can be plotted and visualized, aiding in a more intuitive understanding and analysis of the clusters formed.
To take advantage of PCA in your clustering workflow, adjust the desired number of components via the `N_COMPONENTS` constant and follow the standard script execution process. This gives you finer control over the clustering process, keeping your exploration of high-dimensional data spaces both manageable and insightful.