*Seer: A person who practiced divination
in ancient Greece; to foresee, foretell, predict, or prophesy.
This is an open-source, extensible, end-to-end, Python-based framework to predict future user communities in a text-streaming social network (e.g., Twitter) based on the users' topics of interest. User community prediction aims at identifying communities in the future based on the users' temporal topics of interest. We model inter-user topical affinities at each time interval via streams of temporal graphs. Because users' topics of interest, and hence their inter-user topical affinities, change over time, our framework uses temporal graph embedding methods to learn temporal vector representations for users. We predict user communities in future time intervals based on the final locations of users' vectors in the latent space. Our framework employs a layered software design that adds modularity, maintainability, ease of extensibility, and stability against customization and ad hoc changes to its components, including topic modeling, user modeling, temporal user embedding, user community prediction, and evaluation. More importantly, our framework offers one-stop-shop access to future communities to improve recommendation systems and advertising campaigns. Our proposed framework has already been benchmarked on a Twitter dataset and showed improvements compared to the state of the art in underlying applications such as news article recommendation and user prediction (see here, also below).
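To make the idea of inter-user topical affinities concrete, here is a minimal sketch, not SEERa's own code. It assumes that each user in one time interval is described by a topic-interest vector and that affinity is measured by cosine similarity with a cut-off. The function name `topical_affinity_graph`, the threshold value, and the toy vectors are all hypothetical.

```python
import numpy as np


def topical_affinity_graph(user_topics, threshold=0.5):
    """Build a weighted adjacency matrix from a (#users x #topics) matrix.

    Assumption: affinity is the cosine similarity of topic vectors, and
    edges weaker than `threshold` are dropped.
    """
    user_topics = np.asarray(user_topics, dtype=float)
    norms = np.linalg.norm(user_topics, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # users with no topic mass keep a zero vector
    unit = user_topics / norms
    sim = unit @ unit.T  # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)  # no self-loops
    sim[sim < threshold] = 0.0  # keep only sufficiently strong affinities
    return sim


# Toy interval with 3 users and 2 topics (made-up numbers).
day = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
graph = topical_affinity_graph(day)
```

Applying the same function to each time interval in turn would yield one graph per interval, which is the kind of input a temporal graph embedding method consumes.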
Tutorials: 1) Overview 2) Quickstart (Colab Notebook) 3) Extension
Workflow | Layers |
---|---|
Our framework has six major layers: Data Access Layer (dal), Topic Modeling Layer (tml), User Modeling Layer (uml), Graph Embedding Layer (gel), Community Prediction Layer (cpl), and Application Layer (apl), which is the last layer, as shown in the above figure.
Each layer processes the input data from the previous layer and produces new processed data for the next layer, as explained below. Sample outputs on toy data can be seen at ./output/toy:
├── {#Topics}topics.csv -> N topics with their top 10 vocabulary set and probabilities
├── {#Topics}topics.model -> The topic model (e.g., LDA, GSDMM, BTM, ...)
├── {#Topics}TopicsDictionary.mm -> Dictionary of tokens/words
├── graphs
│ ├── Day{K}userSimilarities.npz ->
│ ├── graphs.npz[.pkl] ->
├── Day{K}UserIDs.pkl -> User IDs for K-th day [Size: #Users × 1]
├── Day{K}UsersTopicInterests.pkl -> Matrix of users to topics [Size: #Topics × #Users]
├── Users.npy -> User IDs [Size: #Users × 1]
├── Embeddings.pkl -> Embedded user graphs [Size: #Days-lookback × #Users × Embedding dim]
├── cluster2user.csv[.pkl] ->
├── ClusterTopic.csv[.pkl] ->
├── Graph.adjlist -> Final predicted user graph for the future from last embeddings
├── Pred_users_similarity.npz ->
├── PredUserClusters.npy[.csv] -> Cluster ID for each user [Size: #Users × 1]
├── user2cluster.csv[.pkl] ->
├── evl ->
| ├── Pred.Eval.csv ->
| ├── Pred.Eval.Mean.csv ->
| ├── UserMentions.pkl ->
├── NewsIds_ExpandedURLs.npy ->
├── NewsTopics.pkl ->
├── RecommendationTableUser.pkl ->
├── topRecommendationMentionerUser.pkl ->
├── TopRecommendationsUser.pkl ->
├── users_mentions_mentioned_user.pkl ->
------------------*Previous*
├── ClusterNumbers.npy -> Cluster IDs [Size: #Communities]
├── NewsIds.npy -> News IDs [Size: #News × 1]
├── CommunitiesTopicInterests.npy -> Topic vector for each community [Size: #Communities × #Topics]
├── NewsTopics.npy -> Topic vector for each news article [Size: #News × #Topics]
├── RecommendationTable.npy -> Recommendations scores of news articles for each community [Size: #Communities × #News]
├── TopRecommendations.npy -> TopK recommendations scores of news articles for each community [Size: #Communities × TopK]
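The sizes listed above suggest how the recommendation table relates to the two topic matrices. The following is a minimal sketch under stated assumptions, not the framework's implementation: it scores each news article for each community by the dot product of their topic vectors and then keeps the top-K articles. The scoring function, the name `recommend_news`, and the toy matrices are hypothetical.

```python
import numpy as np


def recommend_news(community_topics, news_topics, top_k=2):
    """Score news articles per community and pick the top_k article indices.

    Assumption: the score is the dot product between a community's topic
    vector and a news article's topic vector.
    """
    community_topics = np.asarray(community_topics, dtype=float)  # #Communities x #Topics
    news_topics = np.asarray(news_topics, dtype=float)            # #News x #Topics
    scores = community_topics @ news_topics.T                     # #Communities x #News
    # Sort each row by descending score and keep the first top_k columns.
    top = np.argsort(-scores, axis=1, kind="stable")[:, :top_k]
    return scores, top


# Toy data: 2 communities, 4 news articles, 3 topics (made-up numbers).
communities = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
news = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9], [0.3, 0.3, 0.4]]
table, top = recommend_news(communities, news)
```

Here `table` has one row per community and one column per news article, and `top` holds article indices rather than scores; the scores can be recovered with `np.take_along_axis(table, top, axis=1)`.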
+---data
| +---toy.synthetic
| | +---News.csv
| | +---TweetEntities.csv
| | \---Tweets.csv
| |
| +---toy
| | +---News.csv
| | +---TweetEntities.csv
| | +---Tweets.csv
| | \---readme.md
| |
+---src
| +---cmn (common functions)
| | \---Common.py
| |
| +---dal (data access layer)
| | +---DataPreparation.py
| | \---DataReader.py
| |
| +---tml (topic modeling layer)
| | \---TopicModeling.py
| |
| +---uml (user modeling layer)
| | +---UsersGraph.py
| | \---UserSimilarities.py
| |
| +---gel (graph embedding layer)
| | +---CppWrapper.py
| | +---GraphEmbedding.py
| | \---GraphToText.py
| |
| +---cpl (community prediction layer)
| | +---GraphClustering.py
| | \---GraphReconstruction_main.py
| |
| +---apl (application layer)
| | +---NewsTopicExtraction.py
| | +---NewsRecommendation.py
| | +---News.py
| | +---NewsCrawler.py
| | \---ModelEvaluation.py
| |
| +---params.py
| +---ParamsTemplate.py
| \---main.py
|
+---environment.yml
+---quickstart.ipynb
\---requirements.txt
SEERa has been developed on Python 3.6 and can be installed by conda or pip:
git clone https://github.com/fani-lab/seera.git
cd seera
conda env create -f environment.yml
conda activate seera
git clone https://github.com/fani-lab/seera.git
cd seera
pip install -r requirements.txt
This command installs compatible versions of the following libraries:
- tml:
gensim, tagme, nltk, pandas, requests, bitermplus
- gel:
networkx
- others:
scikit-network, scikit-learn, sklearn, numpy, scipy, matplotlib
Additionally, you need to install the following libraries from their source:
- MAchine Learning for LanguagE Toolkit (mallet) as a requirement in tml.
- Gibbs Sampling algorithm for a Dirichlet Mixture Model (GSDMM) as a requirement in tml:
git clone https://github.com/rwalk/gsdmm.git
cd gsdmm
python setup.py install
cd ..
- DynamicGem as a requirement in gel:
git clone https://github.com/palash1992/DynamicGEM.git
cd DynamicGEM
python setup.py install
pip install tensorflow==1.11.0 --force-reinstall #may be needed
We crawled and stored ~2.9M Twitter posts (tweets) over 2 consecutive months, from 2010-11-01 to 2010-12-31. Tweet IDs are provided at ./data/TweetIds.csv so that the tweets can be fetched from Twitter using tools like hydrator.
For quickstart purposes, a toy sample of tweets between 2010-12-01 and 2010-12-04 has been provided at ./data/toy/Tweets.csv.
This framework contains six different layers. Each layer is affected by multiple parameters, e.g., number of topics, that can be adjusted by the user via ./src/params_template.py
in root folder.
You can run the framework via ./src/main.py with the following command:
cd ../src
python -u main.py -r toy -t lda.gensim lda.mallet gsdmm btm -g AE DynAE DynAERNN
where the input arguments are:
-r: A unique description for the run, for example test1, required.
-t: A list of topic modeling methods among {lda.gensim, lda.mallet, gsdmm, btm}, required, case-insensitive.
-g: A list of graph embedding methods among {AE, DynAE, DynRNN, DynAERNN}, required, case-insensitive.
-p: A flag for the run to be time-profiled, optional.
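The argument list above can be mirrored with Python's standard argparse module. This is a hypothetical reconstruction for illustration only and is not taken from main.py: the parser layout, the help strings, and the choice to lowercase values for case-insensitive matching are assumptions.

```python
import argparse

# Allowed values, lowercased so that matching is case-insensitive.
TOPIC_METHODS = ["lda.gensim", "lda.mallet", "gsdmm", "btm"]
EMBEDDING_METHODS = ["ae", "dynae", "dynrnn", "dynaernn"]


def build_parser():
    """Hypothetical parser mirroring the documented -r, -t, -g, and -p arguments."""
    parser = argparse.ArgumentParser(description="SEERa-style command line (sketch)")
    parser.add_argument("-r", required=True,
                        help="unique description for the run, e.g., test1")
    parser.add_argument("-t", required=True, nargs="+", type=str.lower,
                        choices=TOPIC_METHODS, help="topic modeling methods")
    parser.add_argument("-g", required=True, nargs="+", type=str.lower,
                        choices=EMBEDDING_METHODS, help="graph embedding methods")
    parser.add_argument("-p", action="store_true",
                        help="time-profile the run")
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args("-r toy -t lda.gensim GSDMM -g AE DynAERNN".split())
```

Because `type=str.lower` is applied before the `choices` check, values such as `GSDMM` or `DynAERNN` are accepted and stored in lowercase; a real implementation might instead map them back to their canonical spelling.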
A run will produce an output folder at ./output/{r} and subfolders for each topic modeling and graph embedding pair as baselines, e.g., lda.AE, lda.DynAE, and lda.DynAERNN. The final evaluation results are aggregated in ./output/{r}/pred.eval.mean.csv. See an example run on the toy dataset at ./output/toy.
Method | News Recommendation: mrr | News Recommendation: ndcg5 | News Recommendation: ndcg10 | User Prediction: Precision | User Prediction: Recall | User Prediction: f1-measure |
---|---|---|---|---|---|---|
Community Prediction | | | | | | |
Fani et al. [ECIR'20] | 0.255 | 0.108 | 0.105 | 0.012 | 0.035 | 0.015 |
Appel et al. [PKDD'18] | 0.176 | 0.056 | 0.055 | 0.007 | 0.094 | 0.0105 |
Temporal community detection | | | | | | |
Hu et al. [SIGMOD'15] | 0.173 | 0.056 | 0.049 | 0.007 | 0.136 | 0.013 |
Fani et al. [CIKM'17] | 0.065 | 0.040 | 0.040 | 0.007 | 0.136 | 0.013 |
Non-temporal link-based community detection | | | | | | |
Ye et al. [CIKM'18] | 0.139 | 0.056 | 0.055 | 0.008 | 0.208 | 0.014 |
Louvain [JSTAT'08] | 0.108 | 0.048 | 0.055 | 0.004 | 0.129 | 0.007 |
Collaborative filtering | | | | | | |
rrn [WSDM'17] | 0.173 | 0.073 | 0.08 | 0.004 | 0.740 | 0.008 |
timesvd++ [KDD'08] | 0.141 | 0.058 | 0.064 | 0.003 | 0.657 | 0.005 |
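For readers unfamiliar with the metrics in the table, the sketch below shows their standard definitions for a single query under binary relevance. It is an illustrative plain-Python version and does not reproduce the framework's evaluation code; the function names and toy inputs are hypothetical. mrr is the mean of the reciprocal rank over all queries, and ndcg5 and ndcg10 correspond to `k=5` and `k=10`.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Normalized discounted cumulative gain at cut-off k (binary relevance)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)  # best case: all relevant items ranked first
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def precision_recall_f1(predicted, actual):
    """Set-based precision, recall, and F1 between predicted and actual items."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy inputs (made-up identifiers).
rr = reciprocal_rank(["n3", "n1", "n2"], {"n1"})
ndcg = ndcg_at_k(["n1", "n2"], {"n1"}, k=5)
p, r, f1 = precision_recall_f1(["u1", "u2"], ["u2", "u3", "u4"])
```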
©2021. This work is licensed under a CC BY-NC-SA 4.0 license.
Soroush Ziaeinejad1,2, Hossein Fani1,3
1School of Computer Science, Faculty of Science, University of Windsor, ON, Canada.
2[email protected], [email protected] 3[email protected]
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
In this work, we use bitermplus, dynamicgem, mallet, pytrec_eval, and other libraries. We would like to thank the authors of these libraries.
@inproceedings{DBLP:conf/cikm/ZiaeinejadSF22,
author = {Soroush Ziaeinejad and Saeed Samet and Hossein Fani},
title = {SEERa: {A} Framework for Community Prediction},
booktitle = {Proceedings of the 31st {ACM} International Conference on Information {\&} Knowledge Management, Atlanta, GA, USA, October 17-21, 2022},
pages = {4762--4766},
publisher = {{ACM}},
year = {2022},
url = {https://doi.org/10.1145/3511808.3557529},
doi = {10.1145/3511808.3557529},
biburl = {https://dblp.org/rec/conf/cikm/ZiaeinejadSF22.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/ecir/FaniBD20,
author = {Hossein Fani and Ebrahim Bagheri and Weichang Du},
title = {Temporal Latent Space Modeling for Community Prediction},
booktitle = {Advances in Information Retrieval - 42nd European Conference on {IR} Research, {ECIR} 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings, Part {I}},
series = {Lecture Notes in Computer Science},
volume = {12035},
pages = {745--759},
publisher = {Springer},
year = {2020},
url = {https://doi.org/10.1007/978-3-030-45439-5\_49},
doi = {10.1007/978-3-030-45439-5\_49},
biburl = {https://dblp.org/rec/conf/ecir/FaniBD20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}