`SEERa`^*: An Open-Source Framework for Future Community Prediction

^{*Seer: A person who practiced divination in ancient Greece; to foresee, foretell, predict, or prophesy.}

SEERa is an open-source, extensible, end-to-end Python framework for predicting future user communities in a text-streaming social network (e.g., Twitter) from the users' temporal topics of interest. We model inter-user topical affinities at each time interval as a stream of temporal graphs. Because users' topics of interest, and hence their topical affinities, change over time, the framework uses temporal graph embedding methods to learn temporal vector representations of users. User communities in future time intervals are then predicted from the final positions of the users' vectors in the latent space.

The framework follows a layered software design, which brings modularity, maintainability, ease of extension, and stability against customization and ad hoc changes to its components: topic modeling, user modeling, temporal user embedding, user community prediction, and evaluation. It also offers one-stop access to future communities for applications such as recommender systems and advertising campaigns. The framework has been benchmarked on a Twitter dataset and showed improvements over the state of the art in underlying applications such as news article recommendation and user prediction (see Benchmark Result below).

Demo
Structure
Setup
Quickstart
Benchmark Result
License
Citation

1. 🎥 Demo

Tutorials: 1) Overview 2) Quickstart (Colab Notebook) 3) Extension

Workflow	Layers

2. Structure

Framework Structure

Our framework has six major layers: Data Access Layer (dal), Topic Modeling Layer (tml), User Modeling Layer (uml), Graph Embedding Layer (gel), Community Prediction Layer (cpl), and Application Layer (apl), which is the last layer, as shown in the figure above.

Each layer processes the output of the previous layer and produces new data for the next layer, as explained below. Sample outputs on toy data can be seen at ./output/toy:

`tml`

├── {#Topics}topics.csv                           -> N topics with their top 10 vocabulary set and probabilities
├── {#Topics}topics.model                         -> The topic model (e.g., LDA, GSDMM, BTM, ...)
├── {#Topics}TopicsDictionary.mm                  -> Dictionary of tokens/words

`uml`

├── graphs
│   ├── Day{K}userSimilarities.npz  ->
│   ├── graphs.npz[.pkl]            ->
├── Day{K}UserIDs.pkl               -> User IDs for K-th day [Size: #Users × 1]
├── Day{K}UsersTopicInterests.pkl   -> Matrix of users to topics [Size: #Topics × #Users]
├── Users.npy                       -> User IDs [Size: #Users × 1]
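
As a rough illustration of how per-day user similarities could be derived from the users-to-topics matrix listed above, here is a minimal sketch. It assumes cosine similarity and a cut-off `threshold`; both are assumptions for illustration and not necessarily what `UserSimilarities.py` does. The function name `user_similarities` is hypothetical.

```python
import numpy as np

def user_similarities(users_topic_interests, threshold=0.5):
    """Return a dense #Users x #Users similarity matrix.

    users_topic_interests: array of shape (#Topics, #Users), as in the listing.
    threshold: hypothetical cut-off; weaker similarities are set to zero.
    """
    x = np.asarray(users_topic_interests, dtype=float).T   # (#Users, #Topics)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                                # users with no topic mass
    unit = x / norms
    sim = unit @ unit.T                                    # cosine similarity (assumed measure)
    np.fill_diagonal(sim, 0.0)                             # drop self-loops
    sim[sim < threshold] = 0.0                             # keep only strong affinities
    return sim

# Toy input: 3 topics x 4 users
interests = np.array([[0.9, 0.8, 0.0, 0.1],
                      [0.1, 0.2, 0.0, 0.1],
                      [0.0, 0.0, 1.0, 0.8]])
similarity = user_similarities(interests, threshold=0.5)
print(similarity.round(2))
```

In this toy input, users 0 and 1 share the first topic and users 2 and 3 share the third, so only those two pairs keep a non-zero edge.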

`gel`

├── Embeddings.pkl -> Embedded user graphs [Size: #Days-lookback × #Users × Embedding dim]

`cpl`

├── cluster2user.csv[.pkl]      -> 
├── ClusterTopic.csv[.pkl]      -> 
├── Graph.adjlist               -> Final predicted user graph for the future from last embeddings
├── Pred_users_similarity.npz   -> 
├── PredUserClusters.npy[.csv]  -> Cluster ID for each user [Size: #Users × 1]
├── user2cluster.csv[.pkl]      ->
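
To make the relation between the last embeddings, the predicted user graph, and the cluster assignments more concrete, here is a minimal sketch. It assumes cosine similarity between the users' last embedding vectors and uses connected components of the thresholded graph as a simple stand-in for graph clustering; the actual method in `GraphClustering.py` may differ. The function name and `threshold` are hypothetical.

```python
import numpy as np

def predict_communities(embeddings, threshold=0.9):
    """Cluster users from their last embedding vectors.

    embeddings: array of shape (#Steps, #Users, dim).
    Returns (labels, sim): a cluster ID per user and the user similarity matrix.
    """
    last = np.asarray(embeddings, dtype=float)[-1]         # (#Users, dim)
    norms = np.linalg.norm(last, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = last / norms
    sim = unit @ unit.T                                    # predicted user similarities (assumed cosine)
    adjacency = sim >= threshold                           # predicted user graph

    # Connected components via depth-first search (stand-in for graph clustering)
    labels = -np.ones(len(last), dtype=int)
    next_label = 0
    for start in range(len(last)):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(adjacency[u]):
                if labels[v] == -1:
                    labels[v] = next_label
                    stack.append(v)
        next_label += 1
    return labels, sim

# Toy input: 2 time steps, 4 users, 2-dimensional embeddings
embeddings = np.array([
    [[0.5, 0.5], [0.5, 0.4], [0.4, 0.5], [0.5, 0.5]],
    [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]],
])
labels, sim = predict_communities(embeddings)
user2cluster = {user: int(c) for user, c in enumerate(labels)}
cluster2user = {}
for user, c in user2cluster.items():
    cluster2user.setdefault(c, []).append(user)
print(user2cluster)   # {0: 0, 1: 0, 2: 1, 3: 1}
```

The two dictionaries mirror the idea behind `user2cluster` and `cluster2user` in the listing, without claiming the same file layout.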

`apl`

├── evl                                     ->
|   ├── Pred.Eval.csv                       ->
|   ├── Pred.Eval.Mean.csv                  ->
|   ├── UserMentions.pkl                    ->
├── NewsIds_ExpandedURLs.npy                ->
├── NewsTopics.pkl                          ->
├── RecommendationTableUser.pkl             ->
├── topRecommendationMentionerUser.pkl      ->
├── TopRecommendationsUser.pkl              ->
├── users_mentions_mentioned_user.pkl       ->

------------------*Previous*
├── ClusterNumbers.npy              -> Cluster IDs [Size: #Communities]
├── NewsIds.npy                     -> News IDs [Size: #News × 1]
├── CommunitiesTopicInterests.npy   -> Topic vector for each community [Size: #Communities × #Topics]
├── NewsTopics.npy                  -> Topic vector for each news article [Size: #News × #Topics]
├── RecommendationTable.npy         -> Recommendations scores of news articles for each community [Size: #Communities × #News]
├── TopRecommendations.npy          -> TopK recommendations scores of news articles for each community [Size: #Communities × TopK]
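
The shapes listed above suggest that recommendation scores come from comparing each community's topic vector with each news article's topic vector. The sketch below illustrates that idea with cosine similarity, which is an assumption for illustration; the scoring actually used in `NewsRecommendation.py` may differ, and the function name `recommend` is hypothetical.

```python
import numpy as np

def recommend(communities_topic_interests, news_topics, top_k=2):
    """Score every news article for every community and pick the top-k.

    communities_topic_interests: (#Communities, #Topics)
    news_topics:                 (#News, #Topics)
    Returns (table, top): scores of shape (#Communities, #News) and the
    indices of the top_k news articles per community, best first.
    """
    c = np.asarray(communities_topic_interests, dtype=float)
    n = np.asarray(news_topics, dtype=float)
    c_unit = c / np.maximum(np.linalg.norm(c, axis=1, keepdims=True), 1e-12)
    n_unit = n / np.maximum(np.linalg.norm(n, axis=1, keepdims=True), 1e-12)
    table = c_unit @ n_unit.T                              # cosine scores (assumed measure)
    top = np.argsort(-table, axis=1, kind="stable")[:, :top_k]
    return table, top

# Toy input: 2 communities, 3 news articles, 3 topics
communities = np.array([[1.0, 0.0, 0.0],
                        [0.0, 0.0, 1.0]])
news = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.2, 0.8],
                 [0.5, 0.4, 0.1]])
table, top = recommend(communities, news, top_k=2)
top_scores = np.take_along_axis(table, top, axis=1)       # (#Communities, TopK)
print(top.tolist())   # [[0, 2], [1, 2]]
```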

Code Structure

+---data
|   +---toy.synthetic
|   |   +---News.csv
|   |   +---TweetEntities.csv
|   |   \---Tweets.csv
|   |
|   +---toy
|   |   +---News.csv
|   |   +---TweetEntities.csv
|   |   +---Tweets.csv
|   |   \---readme.md
|   |
+---src
|   +---cmn (common functions)
|   |   \---Common.py
|   |
|   +---dal  (data access layer)
|   |   +---DataPreparation.py
|   |   \---DataReader.py
|   |
|   +---tml  (topic modeling layer)
|   |   \---TopicModeling.py
|   |
|   +---uml (user modeling layer)
|   |   +---UsersGraph.py
|   |   \---UserSimilarities.py
|   |
|   +---gel (graph embedding layer)
|   |   +---CppWrapper.py
|   |   +---GraphEmbedding.py
|   |   \---GraphToText.py
|   |
|   +---cpl (community prediction layer)
|   |   +---GraphClustering.py
|   |   \---GraphReconstruction_main.py
|   |
|   +---apl (application layer)
|   |   +---NewsTopicExtraction.py
|   |   +---NewsRecommendation.py
|   |   +---News.py
|   |   +---NewsCrawler.py
|   |   \---ModelEvaluation.py
|   |
|   +---params.py
|   +---ParamsTemplate.py
|   \---main.py
|
+---environment.yml
+---quickstart.ipynb
\---requirements.txt

3. Setup

SEERa has been developed on Python 3.6 and can be installed by conda or pip:

git clone https://github.com/fani-lab/seera.git
cd seera
conda env create -f environment.yml
conda activate seera

git clone https://github.com/fani-lab/seera.git
cd seera
pip install -r requirements.txt

Either command installs compatible versions of the following libraries:

tml: gensim, tagme, nltk, pandas, requests, bitermplus

gel: networkx

others: scikit-network, scikit-learn, sklearn, numpy, scipy, matplotlib

Additionally, you need to install the following libraries from their source:

MAchine Learning for LanguagE Toolkit (mallet) as a requirement in tml.
Gibbs Sampling algorithm for a Dirichlet Mixture Model (GSDMM) as a requirement in tml:

git clone https://github.com/rwalk/gsdmm.git
cd gsdmm
python setup.py install
cd ..

DynamicGem as a requirement in gel:

git clone https://github.com/palash1992/DynamicGEM.git
cd DynamicGEM
python setup.py install
pip install tensorflow==1.11.0 --force-reinstall #may be needed

4. Quickstart

Data

We crawled and stored ~2.9M Twitter posts (tweets) over 2 consecutive months, between 2010-11-01 and 2010-12-31. Tweet IDs are provided at ./data/TweetIds.csv for retrieving the tweets from Twitter using tools like hydrator.

For quickstart purposes, a toy sample of tweets between 2010-12-01 and 2010-12-04 has been provided at ./data/toy/Tweets.csv.

Run

This framework contains six layers. Each layer is controlled by several parameters, e.g., the number of topics, which the user can adjust via ./src/params_template.py (path relative to the root folder).

You can run the framework via ./src/main.py with the following command:

cd ../src
python -u main.py -r toy -t lda.gensim lda.mallet gsdmm btm -g AE DynAE DynAERNN

where the input arguments are:

-r: A unique description for the run, for example test1, required.

-t: A list of topic modeling methods among {lda.gensim, lda.mallet, gsdmm, btm}, required, case-insensitive.

-g: A list of graph embedding methods among {AE, DynAE, DynRNN, DynAERNN}, required, case-insensitive.

-p: A flag for the run to be time-profiled, optional.

A run will produce an output folder at ./output/{r}, with one subfolder per pair of topic modeling and graph embedding methods as baselines, e.g., lda.AE, lda.DynAE, and lda.DynAERNN. The final evaluation results are aggregated in ./output/{r}/pred.eval.mean.csv. See an example run on the toy dataset at ./output/toy.
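
For readers who want to see how arguments like these can be declared, here is a minimal `argparse` sketch. It is not the parser in `main.py`; it only mirrors the options described above, and it lowercases the method names as one possible way to make them case-insensitive. The help strings and the `pairs` variable are illustrative assumptions.

```python
import argparse

TOPIC_METHODS = ["lda.gensim", "lda.mallet", "gsdmm", "btm"]
EMBEDDING_METHODS = ["ae", "dynae", "dynrnn", "dynaernn"]

def build_parser():
    """Hypothetical parser mirroring the -r, -t, -g and -p options."""
    parser = argparse.ArgumentParser(description="Sketch of a SEERa-like command line")
    parser.add_argument("-r", required=True,
                        help="unique description for the run, e.g. test1")
    parser.add_argument("-t", required=True, nargs="+", type=str.lower,
                        choices=TOPIC_METHODS, help="topic modeling methods")
    parser.add_argument("-g", required=True, nargs="+", type=str.lower,
                        choices=EMBEDDING_METHODS, help="graph embedding methods")
    parser.add_argument("-p", action="store_true",
                        help="time-profile the run")
    return parser

# Parse an example command line instead of sys.argv
args = build_parser().parse_args("-r toy -t lda.gensim GSDMM -g AE DynAERNN".split())

# One baseline per (topic modeling, graph embedding) pair
pairs = [(t, g) for t in args.t for g in args.g]
print(args.r, pairs, args.p)
```

Here `type=str.lower` normalizes each value before it is checked against `choices`, so `GSDMM` and `gsdmm` are treated alike.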

5. Benchmark Result

| Method | News Recommendation: mrr | News Recommendation: ndcg5 | News Recommendation: ndcg10 | User Prediction: Precision | User Prediction: Recall | User Prediction: f1-measure |
|---|---|---|---|---|---|---|
| *Community Prediction* | | | | | | |
| Fani et al. [ECIR'20] | 0.255 | 0.108 | 0.105 | 0.012 | 0.035 | 0.015 |
| Appel et al. [PKDD'18] | 0.176 | 0.056 | 0.055 | 0.007 | 0.094 | 0.0105 |
| *Temporal community detection* | | | | | | |
| Hu et al. [SIGMOD'15] | 0.173 | 0.056 | 0.049 | 0.007 | 0.136 | 0.013 |
| Fani et al. [CIKM'17] | 0.065 | 0.040 | 0.040 | 0.007 | 0.136 | 0.013 |
| *Non-temporal link-based community detection* | | | | | | |
| Ye et al. [CIKM'18] | 0.139 | 0.056 | 0.055 | 0.008 | 0.208 | 0.014 |
| Louvain [JSTAT'08] | 0.108 | 0.048 | 0.055 | 0.004 | 0.129 | 0.007 |
| *Collaborative filtering* | | | | | | |
| rrn [WSDM'17] | 0.173 | 0.073 | 0.08 | 0.004 | 0.740 | 0.008 |
| timesvd++ [KDD'08] | 0.141 | 0.058 | 0.064 | 0.003 | 0.657 | 0.005 |
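
For reference, the ranking metrics in the news recommendation columns can be sketched as follows. This is a minimal illustration of reciprocal rank and nDCG@k with binary relevance, not the project's evaluation code; the news IDs and function names are made up for the example.

```python
import math

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is ranked."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """nDCG@k with binary relevance (gain 1 for relevant, 0 otherwise)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal > 0 else 0.0

# Hypothetical recommendations for two users and the news each actually mentioned
runs = {
    "u1": (["n3", "n1", "n7"], {"n1"}),
    "u2": (["n2", "n5", "n9"], {"n2", "n9"}),
}
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs.values()) / len(runs)
ndcg5 = sum(ndcg_at_k(r, rel, 5) for r, rel in runs.values()) / len(runs)
print(round(mrr, 3), round(ndcg5, 3))
```

Averaging the per-user values gives the mean reciprocal rank (mrr) and the mean nDCG at the chosen cut-off.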

6. License

Authors

Soroush Ziaeinejad^1,2, Hossein Fani^1,3

^{¹School of Computer Science, Faculty of Science, University of Windsor, ON, Canada.}

^{²[email protected], [email protected]} ^{³[email protected]}

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Acknowledgments

In this work, we use bitermplus, dynamicgem, mallet, pytrec_eval and other libraries. We would like to thank the authors of these libraries.

7. Citation

@inproceedings{DBLP:conf/cikm/ZiaeinejadSF22,
  author    = {Soroush Ziaeinejad and Saeed Samet and Hossein Fani},
  title     = {SEERa: {A} Framework for Community Prediction},
  booktitle = {Proceedings of the 31st {ACM} International Conference on Information {\&} Knowledge Management, Atlanta, GA, USA, October 17-21, 2022},
  pages     = {4762--4766},
  publisher = {{ACM}},
  year      = {2022},
  url       = {https://doi.org/10.1145/3511808.3557529},
  doi       = {10.1145/3511808.3557529},
  biburl    = {https://dblp.org/rec/conf/cikm/ZiaeinejadSF22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@inproceedings{DBLP:conf/ecir/FaniBD20,
  author    = {Hossein Fani and Ebrahim Bagheri and Weichang Du},
  title     = {Temporal Latent Space Modeling for Community Prediction},
  booktitle = {Advances in Information Retrieval - 42nd European Conference on {IR} Research, {ECIR} 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings, Part {I}},
  series    = {Lecture Notes in Computer Science},
  volume    = {12035},
  pages     = {745--759},
  publisher = {Springer},
  year      = {2020},
  url       = {https://doi.org/10.1007/978-3-030-45439-5\_49},
  doi       = {10.1007/978-3-030-45439-5\_49},
  biburl    = {https://dblp.org/rec/conf/ecir/FaniBD20.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
