This is an open-source, extensible, end-to-end, Python-based framework to predict future user communities in a text-streaming social network (e.g., Twitter) based on the users' topics of interest. User community prediction aims at identifying the communities that will form in the future from the users' temporal topics of interest. We model inter-user topical affinities at each time interval via streams of temporal graphs. Because users' topics of interest, and hence their inter-user topical affinities, change over time, our framework uses temporal graph embedding methods to learn temporal vector representations for users. We predict user communities in future time intervals based on the final locations of the users' vectors in the latent space. Our framework employs a layered software design that adds modularity, maintainability, ease of extensibility, and stability against customization and ad hoc changes to its components, including topic modeling, user modeling, temporal user embedding, user community prediction, and evaluation. More importantly, our framework offers one-stop-shop access to future communities to improve recommendation systems and advertising campaigns. It has already been benchmarked on a Twitter dataset and showed improvements over the state of the art in underlying applications such as news article recommendation and user prediction (see here, also below).
Tutorials: 1) Overview 2) Quickstart (Colab Notebook) 3) Extension
Workflow | Layers |
---|---|
Our framework has six major layers: the Data Access Layer (dal), Topic Modeling Layer (tml), User Modeling Layer (uml), Graph Embedding Layer (gel), Community Prediction Layer (cpl), and, as the last layer, the Application Layer (apl), as shown in the above figure.
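
To make the hand-off from topic modeling to user modeling more concrete, here is a minimal sketch of how a stream of temporal graphs could be derived from per-interval topic vectors. It is not the framework's own code (that lives in the uml modules listed below). It assumes one topic vector per user per time interval, uses cosine similarity as the affinity measure, and borrows the 0.2 cutoff from the `UserSimilarityThreshold` default. The function name `temporal_similarity_graphs` and the `(T, U, K)` array layout are hypothetical.

```python
import numpy as np


def temporal_similarity_graphs(topic_vectors, threshold=0.2):
    """Build one weighted adjacency matrix per time interval.

    topic_vectors: array of shape (T, U, K) holding, for each of T time
    intervals, the K-dimensional topic vector of each of U users.
    Returns an array of shape (T, U, U) of thresholded cosine similarities.
    """
    x = np.asarray(topic_vectors, dtype=float)
    norms = np.linalg.norm(x, axis=2, keepdims=True)
    # Users with an all-zero vector (no activity) keep a zero row.
    unit = np.divide(x, norms, out=np.zeros_like(x), where=norms > 0)
    sims = unit @ unit.transpose(0, 2, 1)   # (T, U, U) cosine similarities
    sims[sims < threshold] = 0.0            # drop weak affinities (assumed rule)
    idx = np.arange(x.shape[1])
    sims[:, idx, idx] = 0.0                 # no self-loops
    return sims


# Toy input: 2 time intervals, 3 users, 2 topics.
topics = np.array([
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 0.0], [0.4, 0.6]],
])
graphs = temporal_similarity_graphs(topics, threshold=0.2)
print(graphs.shape)
```

Each slice `graphs[t]` can then be treated as the weighted adjacency matrix of the user graph at interval `t`, which is the kind of input a temporal graph embedding method consumes.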
+---output
+---src
| +---cmn (common functions)
| | \---Common.py
| |
| +---dal (data access layer)
| | +---DataPreparation.py
| | \---DataReader.py
| |
| +---tml (topic modeling layer)
| | \---TopicModeling.py
| |
| +---uml (user modeling layer)
| | +---UsersGraph.py
| | \---UserSimilarities.py
| |
| +---gel (graph embedding layer)
| | +---GraphEmbedding.py
| | \---GraphReconstruction.py
| |
| +---cpl (community prediction layer)
| | \---GraphClustering.py
| |
| +---apl (application layer)
| | +---NewsTopicExtraction.py
| | +---NewsRecommendation.py
| | \---ModelEvaluation.py
| |
| +---main.py
| \---params.py
\---requirements.txt
It is strongly recommended to use Linux OS for installing the packages and executing the framework. To install packages and dependencies, simply use the following commands in your shell:
git clone https://github.com/fani-lab/seera.git
cd seera
pip install -r requirements.txt
These commands install compatible versions of the following libraries:
- dal:
mysql-connector-python
- tml:
gensim, tagme, nltk, pandas, requests
- gel:
networkx, dynamicgem
- others:
scikit-network, scikit-learn, sklearn, numpy, scipy, matplotlib
Also, you need to install the MAchine Learning for LanguagE Toolkit (mallet) from its git or website, as a requirement in tml.
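
After installation, a quick check that the listed packages can be located may save a failed run later. The sketch below is only a convenience and not part of the framework. The import names are assumptions derived from the package list above (for example, `sknetwork` for scikit-network and `mysql` for mysql-connector-python), and `missing_modules` is a hypothetical helper.

```python
import importlib.util


# Top-level import names assumed for the packages listed above;
# a package's import name can differ from its pip name.
REQUIRED_MODULES = [
    "mysql",        # mysql-connector-python
    "gensim", "tagme", "nltk", "pandas", "requests",
    "networkx", "dynamicgem",
    "sknetwork",    # scikit-network
    "sklearn",      # scikit-learn
    "numpy", "scipy", "matplotlib",
]


def missing_modules(names):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

The check only looks modules up without importing them, so it does not verify versions or the separate mallet installation.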
We crawled and stored Twitter posts (tweets) for 2 consecutive months. The data is available as sql scripts at ds_twitter, including Tweets, TweetEntities, TweetUsers, TagmeAnnotations, NewsTables, and GoldenStandard (for news article recommendation).
This framework contains six different layers. Each layer is affected by multiple parameters.
Some of those parameters are fixed in the code via trial and error. However, major parameters such as the number of topics can be adjusted by the user.
They can be modified via the 'params.py' file in the src folder.
After modifying 'params.py', you can run the framework via 'main.py' with the following commands:
cd src
python main.py
import random
import numpy as np

random.seed(0)
np.random.seed(0)
RunID = 1

# SQL settings. Should be set for each MySQL instance
user = ''
password = ''
host = ''
database = ''

general = {
    'Comment': '',  # Any comment to express more information about the configuration.
}
dal = {
    'start': '2010-12-17',  # First date of system activity
    'end': '2011-02-17',  # Last day of system activity
    'timeInterval': 1,  # Time interval (days) for grouping documents
    'lastRowsNumber': 100000,  # Number of picked rows of the dataset for the whole process as a sample
    # The following parameters are used to generate the corpus from our dataset:
    'userModeling': True,  # Aggregate all tweets of a user as a document
    'timeModeling': True,  # Aggregate all tweets of a specific day as a document
    'preProcessing': False,  # Apply some traditional pre-processing methods on the corpus
    'TagME': False,  # Apply Tagme on the raw dataset. Set it to False if tagme-dataset is used
    'tagme_GCUBE_TOKEN': "--------------"  # Tagme GCUBE TOKEN. For more information, visit: [TagmeHelp](https://sobigdata.d4science.org/web/tagme/tagme-help)
}
tml = {
    'num_topics': 25,  # Number of topics that should be extracted from our corpus
    'library': 'gensim',  # Library used to extract topics from the corpus. Could be 'gensim' or 'mallet'
    'mallet_home': '--------------',  # mallet_home path
    'filterExtremes': True,  # Filter very common and very rare terms in all documents
    'JO': False,  # (JO:=JustOne) If True, just one topic is chosen for each document
    'Bin': True,  # (Bin:=Binary) If True, all scores above/below a threshold are set to 1/0 for each topic
    'Threshold': 0.2,  # A threshold for topic scores quantization
    'path2saveTML': f'../output/{RunID}/tml'
}
uml = {
    'RunId': RunID,  # A unique number to identify the configuration per run
    'UserSimilarityThreshold': 0.2,  # A threshold for filtering low user similarity scores
    'path2saveUML': f'../output/{RunID}/uml'
}
gel = {
    'GraphEmbedding': 'Node2Vec',  # Graph embedding method. Available options are ['Node2Vec', 'AE', 'DynAE', 'DynRNN', 'DynAERNN']
    'EmbeddingDim': 40,  # Embedding dimension
    'path2saveGEL': f'../output/{RunID}/gel'
}
cpl = {
    'ClusteringApproach': 'Indirect',  # Available options are ['Direct', 'Indirect']. 'Direct': Apply a non-graph clustering method directly on predicted communities in latent space; 'Indirect': Apply a graph clustering method on the graph generated from the output of predicted communities
    'ClusteringMethod': 'Louvain',  # Specification of the clustering method based on 'ClusteringApproach'. The only available option is 'Louvain' ('ClusteringApproach': 'Indirect'), which is a graph clustering method
}
evl = {
    'RunId': RunID,
    'EvaluationType': 'Extrinsic',  # ['Intrinsic', 'Extrinsic']
    # If 'EvaluationType' is set to 'Intrinsic', the two parameters below should be set as well
    'EvaluationMetrics': ['adjusted_rand', 'completeness', 'homogeneity', 'rand', 'v_measure',
                          'normalized_mutual_info', 'adjusted_mutual_info', 'mutual_info', 'fowlkes_mallows'],
    'GoldenStandardPath': '/path2GS',  # Path to the golden standard
    # ----------------------------------------------------------------------------------
}
application = {
    'Threshold': 0.2,  # A threshold for filtering low news article recommendation scores
    'TopK': 20  # Number of selected top news article recommendation candidates
}
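
The tml options 'JO', 'Bin', and 'Threshold' describe a post-processing of per-document topic scores. The sketch below shows one plausible reading of those comments and is not taken from `TopicModeling.py`. The function name `quantize_topic_scores`, the use of `>=` at the threshold, and the order in which the two options are applied are all assumptions.

```python
import numpy as np


def quantize_topic_scores(scores, just_one=False, binary=True, threshold=0.2):
    """Post-process a (documents x topics) score matrix.

    just_one: keep only the highest-scoring topic of each document ('JO').
    binary:   map scores >= threshold to 1 and all others to 0 ('Bin').
    """
    out = np.array(scores, dtype=float)  # work on a copy
    if just_one:
        top = out.argmax(axis=1)
        mask = np.zeros(out.shape, dtype=bool)
        mask[np.arange(out.shape[0]), top] = True
        out = np.where(mask, out, 0.0)
    if binary:
        out = (out >= threshold).astype(float)
    return out


# Toy topic scores for two documents over three topics.
doc_topics = [[0.70, 0.25, 0.05],
              [0.10, 0.15, 0.75]]
print(quantize_topic_scores(doc_topics))                 # defaults: Bin only
print(quantize_topic_scores(doc_topics, just_one=True))  # JO followed by Bin
```

With the default settings shown in 'params.py' ('JO': False, 'Bin': True, 'Threshold': 0.2), a document can be assigned to several topics at once, while setting 'JO' to True restricts each document to a single topic.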
Method | News Recommendation: mrr | News Recommendation: ndcg5 | News Recommendation: ndcg10 | User Prediction: Precision | User Prediction: Recall | User Prediction: f1-measure |
---|---|---|---|---|---|---|
Community Prediction | | | | | | |
Our approach | 0.255 | 0.108 | 0.105 | 0.012 | 0.035 | 0.015 |
Appel et al. [PKDD'18] | 0.176 | 0.056 | 0.055 | 0.007 | 0.094 | 0.0105 |
Temporal community detection | | | | | | |
Hu et al. [SIGMOD'15] | 0.173 | 0.056 | 0.049 | 0.007 | 0.136 | 0.013 |
Fani et al. [CIKM'17] | 0.065 | 0.040 | 0.040 | 0.007 | 0.136 | 0.013 |
Non-temporal link-based community detection | | | | | | |
Ye et al. [CIKM'18] | 0.139 | 0.056 | 0.055 | 0.008 | 0.208 | 0.014 |
Louvain [JSTAT'08] | 0.108 | 0.048 | 0.055 | 0.004 | 0.129 | 0.007 |
Collaborative filtering | | | | | | |
rrn [WSDM'17] | 0.173 | 0.073 | 0.08 | 0.004 | 0.740 | 0.008 |
timesvd++ [KDD'08] | 0.141 | 0.058 | 0.064 | 0.003 | 0.657 | 0.005 |
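
For readers unfamiliar with the metrics in the table above, the following plain-Python sketch shows their textbook definitions under binary relevance. It is an illustration only: the framework's own evaluation code (which the acknowledgments suggest relies on pytrec_eval) may differ in details such as graded relevance or cutoff handling, and all function names here are hypothetical.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one per user."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG over the top-k items."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def precision_recall_f1(predicted, relevant):
    """Set-based precision, recall, and F1."""
    predicted, relevant = set(predicted), set(relevant)
    hits = len(predicted & relevant)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Toy example: made-up news ids for one user, relevant article at rank 2.
ranking = ["n3", "n1", "n7"]
gold = {"n1"}
print(reciprocal_rank(ranking, gold), ndcg_at_k(ranking, gold, 5))
```

In this reading, mrr rewards placing the first relevant news article early, while ndcg5 and ndcg10 also account for the positions of all relevant articles within the top 5 or 10 recommendations.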
©2021. This work is licensed under a CC BY-NC-SA 4.0 license.
Email: [email protected], [email protected]
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
In this work, we use dynamicgem, mallet, pytrec_eval, and other libraries. We would like to thank the authors of these libraries.
@inproceedings{DBLP:conf/ecir/FaniBD20,
author = {Hossein Fani and Ebrahim Bagheri and Weichang Du},
title = {Temporal Latent Space Modeling for Community Prediction},
booktitle = {Advances in Information Retrieval - 42nd European Conference on {IR} Research, {ECIR} 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings, Part {I}},
series = {Lecture Notes in Computer Science},
volume = {12035},
pages = {745--759},
publisher = {Springer},
year = {2020},
url = {https://doi.org/10.1007/978-3-030-45439-5\_49},
doi = {10.1007/978-3-030-45439-5\_49},
timestamp = {Thu, 14 May 2020 10:17:16 +0200},
biburl = {https://dblp.org/rec/conf/ecir/FaniBD20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}