A curated list of amazing topic modelling libraries.
- Libraries & Toolkits
- Models
- Techniques
- Research Implementations
- Visualizations
- Resources
- Related awesome lists
- gensim - Python library for topic modelling
- scikit-learn - Python library for machine learning
- tomotopy - Python extension for tomoto, a Gibbs-sampling-based topic modelling library written in C++
- tomoto - Ruby extension for tomoto, a Gibbs-sampling-based topic modelling library written in C++
- OCTIS - Python package to integrate, optimize and evaluate topic models
- tmtoolkit - Python topic modeling toolkit with parallel processing power
- Mallet - Java-based package for topic modeling
- TopicModel4J - Java-based package for topic modeling
- BIDMach - CPU and GPU-accelerated machine learning library
- BigARTM - Fast topic modeling platform
- TopicNet - A high-level Python interface for BigARTM library
- stm - R package for the Structural Topic Model
- RMallet - R package to interface with the Java machine learning tool MALLET
- R-lda - R package for topic modelling (LDA, sLDA, corrLDA, etc.)
- topicmodels - R package with interface to C code for LDA and CTM
- lda++ - C++ library for LDA and (fast) supervised LDA (sLDA/fsLDA) using variational inference
Implementations differ widely in performance and scalability, as well as in their support for advanced features such as hyperparameter tuning and evaluation.
Truncated Singular Value Decomposition (SVD) / Latent Semantic Analysis (LSA) / Latent Semantic Indexing (LSI)
- scikit-learn - Python implementation using fast randomized SVD solver or a “naive” algorithm that uses ARPACK
- gensim - Python implementation using multi-pass randomized SVD solver or a one-pass merge algorithm
- SVDlibc - C implementation of SVD by Doug Rohde
- sparsesvd - Python wrapper for SVDlibc
- BIDMach - Scala implementation of a scalable approximate SVD using subspace iteration
Non-Negative Matrix Factorization (NMF)
- scikit-learn - Python implementation using a coordinate descent or a multiplicative update solver
- gensim - Python implementation of online NMF
- BIDMach - CPU and GPU-accelerated Scala implementation with L2 loss
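The multiplicative update solver mentioned in the entries above can be illustrated with a minimal sketch. This is not the code of any listed library: it is a plain-Python version of the classic multiplicative update rules for the squared (L2) reconstruction loss, and the function names (`nmf_multiplicative`, `frobenius_error`), the toy count matrix and all parameter defaults are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a dense matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H, i.e. the L2 reconstruction error."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5


def nmf_multiplicative(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative document-term matrix V (docs x terms) as W H.

    W is docs x topics, H is topics x terms. Both stay non-negative because
    every update multiplies the current value by a non-negative ratio.
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


# Toy document-term counts: two documents per "theme", three terms per theme.
toy_counts = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 3],
    [0, 0, 0, 2, 1, 3],
]
doc_topics, topic_terms = nmf_multiplicative(toy_counts, n_topics=2)
print("reconstruction error:", frobenius_error(toy_counts, doc_topics, topic_terms))
```

For real corpora, the library implementations listed above are the better choice: they work on sparse matrices and offer additional solvers and regularization.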
Latent Dirichlet Allocation (LDA) 📄
- scikit-learn - Python implementation using online variational Bayes inference 📄
- lda - Python implementation using collapsed Gibbs sampling which follows scikit-learn interface 📄
- lda-gensim - Python implementation using online variational inference 📄
- ldamulticore-gensim - Parallelized Python implementation using online variational inference 📄
- GibbsSamplingLDA-TopicModel4J - Java implementation using collapsed Gibbs sampling 📄
- CVBLDA-TopicModel4J - Java implementation using collapsed variational Bayesian (CVB) inference 📄
- Mallet - Parallelized Java implementation using Gibbs sampling 📄📄
- gensim-wrapper-Mallet - Python wrapper for Mallet's implementation 📄📄
- PartiallyCollapsedLDA - Various fast parallelized samplers for LDA, including Partially Collapsed LDA, LightLDA, Partially Collapsed Light LDA and a very efficient Polya-Urn LDA
- Vowpal Wabbit - C++ implementation using online variational Bayes inference 📄
- tomotopy - Python binding for C++ implementation using Gibbs sampling and different term-weighting options 📄
- topicmodel-lib - Cython library for online/streaming LDA (Online VB, Online CVB0, Online CGS, Online OPE, Online FW, Streaming VB, Streaming OPE, Streaming FW, ML-OPE, ML-CGS, ML-FW)
- jsLDA - JavaScript implementation of LDA topic modeling in the browser
- lda-nodejs - Node.js implementation of LDA topic modeling
- lda-purescript - PureScript, browser-based implementation of LDA topic modeling
- TopicModels.jl - Julia implementation of LDA
- turicreate - C++ LDA and aliasLDA implementation with export to Apple's Core ML for use in iOS, macOS, watchOS, and tvOS apps
- MeTA - C++ implementation of (parallel) collapsed Gibbs sampling, CVB0 and SCVB
- Fugue - Java implementation of collapsed Gibbs sampling with slice sampling for hyper-parameter optimization
- GA-LDA - R scripts using Genetic Algorithms (GA) for hyper-parameter optimization, based on Panichella 📄
- Search-Based-LDA - R scripts using Genetic Algorithms (GA) for hyper-parameter optimization by Panichella 📄
- Dodge - Python tuning tool that ignores redundant tunings 📄
- LDADE - Python tuning tool using differential evolution 📄
- ldatuning - R package to find optimal number of topics for LDA 📄
- Scalable - Scalable Hyperparameter Selection for LDA 📄
- topic_interpretability - Computation of the semantic interpretability of topics produced by topic models 📄
- topic-coherence-sensitivity - Code to compute topic coherence for several topic cardinalities and aggregate scores across them 📄
- topic-model-diversity - A collection of topic diversity measures for topic modeling 📄
- LDA* - Tencent's hybrid sampler that uses different samplers for different types of documents in combination with an asymmetric parameter server 📄
- FastLDA - C++ implementation of LDA 📄
- dmlc - Single- and multi-threaded C++ implementations of lightLDA, F+LDA, AliasLDA, forestLDA and many more
- SparseLDA - Java algorithm and data structure for evaluating Gibbs sampling distributions used in Mallet 📄
- warpLDA - C++ cache efficient LDA implementation which samples each token in O(1) 📄
- lightLDA - C++ implementation using O(1) Metropolis-Hastings sampling 📄
- F+LDA - C++ implementation of F+LDA using an appropriately modified Fenwick tree 📄
- AliasLDA - C++ implementation using Metropolis-Hastings and the alias method 📄
- Yahoo-LDA - Yahoo!'s topic modelling framework 📄
- PLDA+ - Google's C++ implementation using data placement and pipeline processing 📄
- Familia - A toolkit for industrial topic modeling (LDA, SentenceLDA and Topical Word Embedding) ⚠️ 📄
- SaberLDA - GPU-based system that implements a sparsity-aware algorithm to achieve sublinear time complexity
- GS-LDA-BIDMach - CPU and GPU-accelerated Scala implementation using Gibbs sampling
- VB-LDA-BIDMach - CPU and GPU-accelerated Scala implementation using online variational Bayes inference
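Many of the LDA entries above rely on collapsed Gibbs sampling. As a rough illustration of what that means, here is a minimal plain-Python sketch of the standard collapsed Gibbs update for LDA with symmetric priors. It is not taken from any listed implementation. The function name `lda_gibbs`, the toy corpus and the default values for `alpha` and `beta` are assumptions, and the sketch omits everything that makes real samplers fast (sparsity tricks, alias tables, parallelism).

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (doc_topic, topic_word) count matrices from the final sweep.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # n_dk
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # n_kw
    topic_total = [0] * n_topics                               # n_k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Toy corpus: word ids 0-2 and 3-5 tend to co-occur in different documents.
toy_docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
doc_topic_counts, topic_word_counts = lda_gibbs(toy_docs, n_topics=2, vocab_size=6)
print(doc_topic_counts)
print(topic_word_counts)
```

The returned counts can be turned into smoothed estimates of the document-topic and topic-word distributions by adding `alpha` or `beta` to each count and normalizing each row.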
Hierarchical Dirichlet Process (HDP) 📄
- gensim - Python implementation using online variational inference 📄
- tomotopy - Python extension for C++ implementation using Gibbs sampling 📄
- Mallet - Java-based package for topic modeling using Gibbs sampling
- TopicModel4J - Java implementation using Gibbs sampling based on Chinese restaurant franchise metaphor
- hca - C implementation using Gibbs sampling with/without burstiness modelling
- bnp - Cython reimplementation based on online-hdp following scikit-learn's API.
- Scalable HDP - interesting paper
Hierarchical LDA (hLDA) 📄
- tomotopy - Python extension for C++ implementation using Gibbs sampling
- Mallet - Java implementation using Gibbs sampling
- hlda - Python package based on Mallet's Gibbs sampler having a fixed depth on the nCRP tree
- hLDA - C implementation of hierarchical LDA by David Blei
Dynamic Topic Model (DTM) 📄
- tomotopy - Python extension for C++ implementation using Gibbs sampling based on FastDTM
- FastDTM - Scalable C++ implementation using Gibbs sampling with Stochastic Gradient Langevin Dynamics (MCMC-based) 📄
- ldaseqmodel-gensim - Python implementation using online variational inference 📄
- dtm-BigTopicModel - C++ engine for running large-scale topic models
- tca - C implementation using Gibbs sampling with/without burstiness modelling 📄
- DETM - Python implementation of the Dynamic Embedded Topic Model 📄
Author-topic Model (ATM) 📄
- gensim - Python implementation with online training (constant in memory w.r.t. the number of documents)
- TopicModel4J - Java implementation
- Matlab Topic Modeling Toolbox - Matlab and C++ implementation using Gibbs sampling
- Topic-Model - Simple Python implementation using Gibbs sampling
Labeled Latent Dirichlet Allocation (LLDA, Labeled-LDA, L-LDA) 📄
- tomotopy - Python extension for C++ implementation using Gibbs sampling
- TopicModel4J - Java implementation
- Mallet - Java implementation using Gibbs sampling 📄
- gensims_mallet_wrapper - Python wrapper for Mallet using gensim interface
- STMT - Scala implementation by Daniel Ramage
- topbox - Python wrapper for labeled LDA implementation of Stanford TMT
- Labeled-LDA-Python - Python implementation (easy to use, does not scale)
- JGibbLabeledLDA - Java implementation based on the popular JGibbLDA package
Partially Labeled Dirichlet Allocation (PLDA) / Dirichlet Process (PLDP) 📄
- tomotopy - Python extension for C++ implementation using Gibbs sampling
- TopicModel4J - Java implementation using collapsed Gibbs sampling
- STMT - Scala implementation of PLDA & PLDP by Daniel Ramage
Dirichlet Multinomial Regression (DMR) topic model 📄
- tomotopy - Python extension for C++ implementation using Gibbs sampling
- Mallet - Java-based package for topic modeling
Generalized Dirichlet Multinomial Regression (g-DMR) topic model 📄
- tomotopy - Python extension for C++ implementation using Gibbs sampling
- PTM - implemented as benchmark 📄
- TopicModel4J - Java implementation using collapsed Gibbs sampling
- tomotopy - Python extension for C++ implementation using Gibbs sampling 📄
- ctm-c - Original C implementation of the correlated topic model by David Blei 📄
- BigTopicModel - C++ engine for running large-scale DTM 📄
- stm - R package for the Structural Topic Model (CTM in case of no covariates) 📄
- BigTopicModel - C++ engine for running large-scale topic models
- Constrained-RTM - Java implementation of Constrained RTM 📄
- R-lda - R implementation using collapsed Gibbs sampling
Supervised LDA (sLDA) 📄
- tomotopy - Python extension for C++ implementation using Gibbs sampling
- R-lda - R implementation using collapsed Gibbs sampling
- slda - Cython implementation of Gibbs sampling for LDA and various sLDA variants
- supervised LDA (linear regression)
- binary logistic supervised LDA (logistic regression)
- binary logistic hierarchical supervised LDA (trees)
- generalized relational topic models (graphs)
- YWWTools - Java implementation using Gibbs sampling for LDA and various sLDA variants:
- BS-LDA: Binary SLDA
- Lex-WSB-BS-LDA: BS-LDA with Lexical Weights and Weighted Stochastic Block Priors
- Lex-WSB-Med-LDA: Lex-WSB-BS-LDA with Hinge Loss
- sLDA - C++ implementation of supervised topic models with a categorical response
Sentence-LDA / SentenceLDA / Sentence LDA 📄
- TopicModel4J - Java implementation of Sentence-LDA using collapsed Gibbs sampling
- Familia - Apply inference on pre-trained SentenceLDA models ⚠️ 📄
Dirichlet Multinomial Mixture Model (DMM) 📄
- GPyM_TM - Python implementation of DMM and Poisson model
- TopicModel4J - Java implementation using collapsed Gibbs sampling 📄
- jLDADMM - Java implementation using collapsed Gibbs sampling 📄
- TopicModel4J - Java implementation using collapsed Gibbs sampling 📄
Pseudo-document-based Topic Model (PTM) 📄
- tomotopy - Python extension for C++ implementation using Gibbs sampling
- TopicModel4J - Java implementation using collapsed Gibbs sampling
- TopicModel4J - Java implementation using collapsed Gibbs sampling
- BTM - Original C++ implementation using collapsed Gibbs sampling 📄
- BurstyBTM - Original C++ implementation of the Bursty BTM (BBTM) 📄
- OnlineBTM - Original C++ implementation of online BTM (oBTM) and incremental BTM (iBTM) 📄
- R-BTM - R package wrapping the C++ code from BTM
- STTM - Java implementation and evaluation of DMM, WNTM, PTM, ETM, GPU-DMM, GPU-DPMM, LF-DMM 📄
- SATM - Java implementation of Self-Aggregation Topic Model 📄
- shorttext - Python implementation of various algorithms for Short Text Mining
- trLDA - Python implementation of streaming LDA based on trust-regions 📄
- Logistic LDA - TensorFlow implementation of Discriminative Topic Modeling with Logistic LDA 📄
- EnsTop - Python implementation of ENSemble TOPic modelling with pLSA
- Dual-Sparse Topic Model - implemented in TopicModel4J using collapsed variational Bayes inference 📄
- Multi-Grain-LDA - MG-LDA implemented in tomotopy using collapsed Gibbs sampling 📄
- lda++ - C++ library for LDA and (fast) supervised LDA (sLDA/fsLDA) using variational inference 📄 📄
- discLDA - C++ implementation of discLDA based on GibbsLDA++ 📄
- GuidedLDA - Python implementation that can be guided by setting some seed words per topic (using Gibbs sampling) 📄
- seededLDA - R package that implements seeded-LDA for semi-supervised topic modeling
- keyATM - R package for Keyword Assisted Topic Models.
- hca - C implementation of non-parametric topic models (HDP, HPYP-LDA, etc.) with focus on hyperparameter tuning
- BayesPA - Python interface for streaming implementation of MedLDA, maximum entropy discrimination LDA (max-margin supervised topic model) 📄
- sailing-pmls - Parallel LDA and medLDA implementation
- BigTopicModel - C++ engine for running large-scale MedLDA models 📄
- DAPPER - Python implementation of Dynamic Author Persona (DAP) topic model 📄
- ToT - Python implementation of Topics Over Time (A Non-Markov Continuous-Time Model of Topical Trends) 📄
- MLTM - C implementation of multilabel topic model (MLTM) 📄
- sequence-models - Java implementation of block HMM and the mixed membership Markov model (M4)
- Entropy-Based Topic Modeling - Java implementation of Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections
- ST-LDA - Single Topic LDA 📄
- MTM - Java implementation of Multilingual Topic Model 📄
- YWWTools - Java-based package for various topic models by Weiwei Yang
- TEM - Topic Expertise Model 📄
- PTM - Prescription Topic Model for Traditional Chinese Medicine Prescriptions 📄 (interesting benchmark models)
- KGE-LDA - Knowledge Graph Embedding LDA 📄
- LDA-SP - A Latent Dirichlet Allocation Method for Selectional Preferences 📄
- LDA+FFT - LDA and FFTs (Fast and Frugal Trees) for better comprehensibility 📄
- BERTopic - BERTopic supports guided, (semi-) supervised, and dynamic topic modeling and visualization 📄
- CTM - CTMs combine contextualized embeddings (e.g., BERT) with topic models
- ETM - Embedded Topic Model 📄
- D-ETM - Dynamic Embedded Topic Model 📄
- ProdLDA - Original TensorFlow implementation of Autoencoding Variational Inference (AEVI) for Topic Models 📄
- pytorch-ProdLDA - PyTorch implementation of ProdLDA 📄
- CatE - Discriminative Topic Mining via Category-Name Guided Text Embedding 📄
- Top2Vec - Python implementation that learns jointly embedded topic, document and word vectors 📄
- lda2vec - Mixes Dirichlet topic models and word embeddings 📄
- lda2vec-pytorch - PyTorch implementation of lda2vec
- G-LDA - Java implementation of Gaussian LDA using word embeddings 📄
- MG-LDA - Python implementation of (Multi-lingual) Gaussian LDA 📄
- MetaLDA - Java implementation using Gibbs sampling that leverages document metadata and word embeddings 📄
- LFTM - Java implementation of latent feature topic models (improving LDA and DMM with word embeddings) 📄
- CorEx - Recover latent factors with Correlation Explanation (CorEx) 📄
- Anchored CorEx - Hierarchical Topic Modeling with Minimal Domain Knowledge 📄
- Linear CorEx - Latent Factor Models Based on Linear Total CorEx 📄
- Stan - Platform for statistical modeling and high-performance statistical computation, e.g., LDA 📄
- PyMC3 - Python package for Bayesian statistical modeling and probabilistic machine learning, e.g., LDA 📄
- Turing.jl - Julia library for general-purpose probabilistic programming 📄
- TFP - Probabilistic reasoning and statistical analysis in TensorFlow, e.g., LDA 📄
- edward2 - Simple PPL with core utilities in the NumPy and TensorFlow ecosystem 📄
- pyro - PPL built on PyTorch, e.g., prodLDA 📄
- edward - A PPL built on TensorFlow, e.g., LDA 📄
- ZhuSuan - A PPL for Bayesian deep learning and generative models, built on TensorFlow, e.g., LDA 📄
- lda-c - C implementation using variational EM by David Blei
- sLDA - C++ implementation of supervised topic models with a categorical response.
- onlineldavb - Python online variational Bayes implementation by Matthew Hoffman 📄
- HDP - C++ implementation of hierarchical Dirichlet processes by Chong Wang
- online-hdp - Python implementation of online hierarchical Dirichlet processes by Chong Wang
- ctr - C++ implementation of collaborative topic models by Chong Wang
- dtm - C implementation of dynamic topic models by David Blei & Sean Gerrish
- ctm-c - C implementation of the correlated topic model by David Blei
- diln - C implementation of Discrete Infinite Logistic Normal (with HDP option) by John Paisley
- hLDA - C implementation of hierarchical LDA by David Blei
- turbotopics - Python implementation that finds significant multiword phrases in topics by David Blei
- Stanford Topic Modeling Toolbox - Scala implementation of LDA, labeledLDA, PLDA, PLDP by Daniel Ramage and Evan Rosen
- LDAGibbs - Java implementation of LDA using Gibbs sampling by Liu Yang
- Matlab Topic Modeling Toolbox - Matlab implementations of LDA, ATM, HMM-LDA, LDA-COL (Collocation) models by Mark Steyvers and Tom Griffiths
- cvbLDA - Python C extension implementation of collapsed variational Bayesian inference for LDA
- fast - A Fast And Scalable Topic-Modeling Toolbox (Fast-LDA, CVB0) by Arthur Asuncion and colleagues 📄
- GibbsLDA++ - C++ implementation using Gibbs sampling 📄 🍴
- JGibbLDA - Java implementation using Gibbs sampling
- Mr.LDA - Scalable Topic Modeling using Variational Inference in MapReduce 📄
- topic_models - Python implementation of LSA, PLSA and LDA
- Topic-Model - Python implementation of LDA, Labeled LDA, ATM, Temporal Author-Topic Model using Gibbs sampling
- LDAvis - R package for interactive topic model visualization
- pyLDAvis - Python library for interactive topic model visualization
- scalaLDAvis - Scala port of pyLDAvis
- dtmvisual - Python package for visualizing DTM (trained with gensim)
- TMVE online - Online Django variant of topic model visualization engine (TMVE)
- TMVE - Original topic model visualization engine (LDA trained with lda-c) 📄
- topicmodel-lib - Python wrapper for TMVE for visualizing LDA (trained with topicmodel-lib)
- wordcloud - Python package for visualizing topics via word_cloud
- Mallet-GUI - GUI for creating and analyzing topic models produced by MALLET
- TWiC - Topic Words in Context is a highly-interactive, browser-based visualization for MALLET topic models
- dfr-browser - Explore Mallet's topic models of texts in a web browser
- Termite - Explore topic models using term-topic matrix, group-in-a-box visualization or scatter plot.
- Topics - Python library for topic modeling and visualization
- TopicsExplorer - Explore your own text collection with a topic model – without prior knowledge 📄
- topicApp - A Simple Shiny App for Topic Modeling
- stminsights - A Shiny Application for Inspecting Structural Topic Models
- Slice sampling
- Minka
- fastfit
- dirichlet - Python port of fastfit
- lightspeed
- lecture-notes
- Newton-Raphson Method
- fixed-point iteration - Wallach's PhD thesis, chapter 2.3
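As a rough illustration of the fixed-point approach listed above, the following sketch estimates the concentration parameters of a Dirichlet-multinomial (Polya) distribution from count vectors, using a fixed-point update of the kind described by Minka. It is a simplified reading of that technique, not code from fastfit, lightspeed or the thesis. The function names, the toy counts and the fixed iteration count (no convergence check) are assumptions. Because the counts are integers, the digamma differences are computed exactly with the recurrence ψ(a + n) − ψ(a) = Σ_{i=0}^{n−1} 1/(a + i), so no special-function library is needed.

```python
def digamma_diff(a, n):
    """Return psi(a + n) - psi(a) for an integer n >= 0 via the digamma recurrence."""
    return sum(1.0 / (a + i) for i in range(n))


def fit_dirichlet_fixed_point(counts, n_iter=100, init=1.0):
    """Estimate Dirichlet-multinomial parameters alpha from integer count vectors.

    counts: list of count vectors (one per document), all of the same length.
    Assumes every component is observed at least once across the data;
    otherwise its alpha is driven to zero.
    Update per component k:
        alpha_k <- alpha_k * sum_d [psi(n_dk + alpha_k) - psi(alpha_k)]
                           / sum_d [psi(n_d + sum(alpha)) - psi(sum(alpha))]
    """
    n_components = len(counts[0])
    alpha = [init] * n_components
    doc_lengths = [sum(row) for row in counts]
    for _ in range(n_iter):
        alpha_sum = sum(alpha)
        denom = sum(digamma_diff(alpha_sum, n) for n in doc_lengths)
        alpha = [
            a * sum(digamma_diff(a, row[k]) for row in counts) / denom
            for k, a in enumerate(alpha)
        ]
    return alpha


# Toy data: the first component dominates every count vector.
toy_counts = [[8, 1, 1], [9, 0, 1], [7, 2, 1], [10, 0, 0]]
print(fit_dirichlet_fixed_point(toy_counts))
```

In a topic model, the same update can be applied to the per-document topic counts to estimate an asymmetric document-topic prior.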
- David Blei - David Blei's Homepage with introductory materials
Contributions welcome! Read the contribution guidelines first.
To the extent possible under law, Jonathan Schneider has waived all copyright and related or neighboring rights to this work.