SEERa: An Open-Source Framework for Future Community Prediction

SEERa is an open-source, extensible, end-to-end Python framework for predicting future user communities in a text-streaming social network (e.g., Twitter) based on the users' topics of interest. User community prediction aims to identify the communities that will exist in the future from the users' temporal topics of interest. We model inter-user topical affinities at each time interval as a stream of temporal graphs. Because users' topics of interest, and hence their inter-user topical affinities, change over time, the framework uses temporal graph embedding methods to learn temporal vector representations of users. We predict user communities in future time intervals from the final positions of the users' vectors in the latent space. The framework uses a layered software design that adds modularity, maintainability, ease of extension, and stability against customization and ad hoc changes to its components: topic modeling, user modeling, temporal user embedding, user community prediction, and evaluation. It also offers one-stop access to future communities, which can be used to improve recommendation systems and advertising campaigns. The framework has been benchmarked on a Twitter dataset and showed improvements over the state of the art in underlying applications such as news article recommendation and user prediction (see the Result section below).

  1. Demo
  2. Structure
  3. Setup
  4. Quickstart
  5. Result
  6. License
  7. Citation

1. 🎥 Demo

Tutorials: 1) Overview 2) Quickstart (Colab Notebook) 3) Extension

Workflow Layers

2. Structure

Framework Structure

Our framework has six major layers: Data Access Layer (dal), Topic Modeling Layer (tml), User Modeling Layer (uml), Graph Embedding Layer (gel), Community Prediction Layer (cpl), and Application Layer (apl), which is the last layer, as shown in the figure above.
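
To make the role of the User Modeling Layer more concrete, here is a minimal sketch of how a user graph for a single time interval could be built from per-user topic vectors. It is not the code in UsersGraph.py or UserSimilarities.py: the function names, the use of cosine similarity, and the toy data are assumptions made for illustration, and the threshold of 0.2 simply mirrors the 'UserSimilarityThreshold' default shown in params.py.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_user_graph(user_topics, threshold=0.2):
    """Return weighted edges {(user_a, user_b): similarity} for one interval.

    `user_topics` maps a user id to that user's topic-score vector.
    Pairs whose similarity falls below `threshold` are dropped (assumption).
    """
    edges = {}
    for a, b in combinations(sorted(user_topics), 2):
        sim = cosine(user_topics[a], user_topics[b])
        if sim >= threshold:
            edges[(a, b)] = sim
    return edges


# Toy data (made up): three users, three topics, one time interval.
interval = {
    "u1": [1.0, 0.0, 0.0],
    "u2": [1.0, 0.0, 0.0],
    "u3": [0.0, 1.0, 0.0],
}
graph = build_user_graph(interval)
print(graph)  # {('u1', 'u2'): 1.0}
```

Repeating this for every time interval would give one graph per interval, which is the kind of temporal graph stream the Graph Embedding Layer consumes.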

Code Structure

+---output
+---src
|   +---cmn (common functions)
|   |   \---Common.py
|   |
|   +---dal  (data access layer)
|   |   +---DataPreparation.py
|   |   \---DataReader.py
|   |
|   +---tml  (topic modeling layer)
|   |   \---TopicModeling.py
|   |
|   +---uml (user modeling layer)
|   |   +---UsersGraph.py
|   |   \---UserSimilarities.py
|   |
|   +---gel (graph embedding layer)
|   |   +---GraphEmbedding.py
|   |   \---GraphReconstruction.py
|   |
|   +---cpl (community prediction layer)
|   |   \---GraphClustering.py
|   |
|   +---apl (application layer)
|   |   +---NewsTopicExtraction.py
|   |   +---NewsRecommendation.py
|   |   \---ModelEvaluation.py
|   |
|   +---main.py
|   \---params.py
\---requirements.txt

3. Setup

We strongly recommend Linux for installing the packages and running the framework. To install the packages and dependencies, run the following commands in your shell:

git clone https://github.com/fani-lab/seera.git
cd seera
pip install -r requirements.txt

These commands install compatible versions of the following libraries:

  • dal: mysql-connector-python
  • tml: gensim, tagme, nltk, pandas, requests
  • gel: networkx, dynamicgem
  • others: scikit-network, scikit-learn, sklearn, numpy, scipy, matplotlib

You also need to install the MAchine Learning for LanguagE Toolkit (mallet) from its git repository or website, as it is a requirement of tml.
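
After installation, a quick check that the listed libraries can be found may save a failed run later. The snippet below is a small sketch, not part of the framework: the helper names are made up, and the import names are assumptions where they differ from the pip names (mysql-connector-python as mysql, scikit-network as sknetwork, scikit-learn as sklearn). It does not check mallet, which is installed separately.

```python
from importlib.util import find_spec

# Top-level import names per layer (assumed mapping from the pip names above).
REQUIRED_MODULES = {
    "dal": ["mysql"],
    "tml": ["gensim", "tagme", "nltk", "pandas", "requests"],
    "gel": ["networkx", "dynamicgem"],
    "others": ["sknetwork", "sklearn", "numpy", "scipy", "matplotlib"],
}


def missing_modules(names):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if find_spec(name) is None]


def report(required=REQUIRED_MODULES):
    """Map each layer to the list of its modules that are not installed."""
    return {layer: missing_modules(mods) for layer, mods in required.items()}


for layer, missing in report().items():
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{layer}: {status}")
```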

4. Quickstart

Data

We crawled and stored Twitter posts (tweets) for 2 consecutive months. The data is available as SQL scripts at ds_twitter, including Tweets, TweetEntities, TweetUsers, TagmeAnnotations, NewsTables, and GoldenStandard (for news article recommendation).

Run

The framework consists of six layers, each controlled by several parameters. Some parameters were fixed in the code by trial and error, but the major ones, such as the number of topics, can be adjusted by the user in the 'params.py' file in the 'src' folder.
After modifying 'params.py', you can run the framework via 'main.py' with the following commands:

cd src
python main.py
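
Before a long run, it can help to see how many time intervals the 'start', 'end', and 'timeInterval' settings in 'params.py' imply. The sketch below is an illustration only: the helper names are hypothetical, and it assumes that 'end' is inclusive and that intervals are counted in whole days from 'start'; the framework's own grouping may differ.

```python
from datetime import date


def interval_index(day, start, interval_days=1):
    """0-based index of the interval that contains `day`."""
    offset = (day - start).days
    if offset < 0:
        raise ValueError("day precedes start")
    return offset // interval_days


def interval_count(start, end, interval_days=1):
    """Number of intervals needed to cover start..end, end inclusive (assumption)."""
    total_days = (end - start).days + 1
    return -(-total_days // interval_days)  # ceiling division


start = date.fromisoformat("2010-12-17")
end = date.fromisoformat("2011-02-17")

print(interval_count(start, end, 1))                 # 63 daily intervals
print(interval_count(start, end, 7))                 # 9 weekly intervals
print(interval_index(date(2010, 12, 25), start, 7))  # 1
```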

Examples

params.py

import random
import numpy as np

random.seed(0)
np.random.seed(0)
RunID = 1                         

# SQL setting. Should be set for each mysql instance
user = ''
password = ''
host = ''
database = ''


general = {
    'Comment': '', # Any comment to express more information about the configuration.
}

dal = {
    'start': '2010-12-17', # First date of system activity
    'end': '2011-02-17', # Last day of system activity
    'timeInterval': 1, # Time interval (days) for grouping documents
    'lastRowsNumber': 100000, # Number of picked rows of the dataset for the whole process as a sample
    
    # The following parameters are used to generate the corpus from our dataset:
    'userModeling': True, # Aggregate all tweets of a user as a document
    'timeModeling': True, # Aggregate all tweets of a specific day as a document
    'preProcessing': False, # Apply some traditional pre-processing methods to the corpus
    'TagME': False, # Apply Tagme on the raw dataset. Set it to False if tagme-dataset is used
    'tagme_GCUBE_TOKEN': "--------------" # Tagme GCUBE TOKEN. For more information, visit: [TagmeHelp](https://sobigdata.d4science.org/web/tagme/tagme-help)
}

tml = {
    'num_topics': 25, # Number of topics that should be extracted from our corpus
    'library': 'gensim', # Used library to extract topics from the corpus. Could be 'gensim' or 'mallet'
    'mallet_home': '--------------', # mallet_home path
    'filterExtremes': True, # Filter very common and very rare terms in all documents
    'JO': False, # (JO:=JustOne) If True, just one topic is chosen for each document
    'Bin': True, # (Bin:=Binary) If True, all scores above/below a threshold are set to 1/0 for each topic
    'Threshold': 0.2, # A threshold for topic scores quantization
    'path2saveTML': f'../output/{RunID}/tml'
}

uml = {
    'RunId': RunID, # A unique number to identify the configuration per run
    'UserSimilarityThreshold': 0.2, # A threshold for filtering low user similarity scores
    'path2saveUML': f'../output/{RunID}/uml'
}

gel = {
    'GraphEmbedding': 'Node2Vec', # Graph embedding method. Available options are ['Node2Vec', 'AE', 'DynAE', 'DynRNN', 'DynAERNN']
    'EmbeddingDim': 40, # Embedding dimension
    'path2saveGEL': f'../output/{RunID}/gel'
}

cpl = {
    'ClusteringApproach': 'Indirect', # Available options are ['Direct', 'Indirect']. 'Direct': Applying a non-graph clustering method directly on predicted communities in latent space; 'Indirect': Apply a graph clustering method on generated graph based on the output of predicted communities
    'ClusteringMethod': 'Louvain', # Specification of the clustering method based on 'ClusteringApproach'. The only available option is 'Louvain' ('ClusteringApproach': 'Indirect') which is a graph clustering method
}

evl = {
    'RunId': RunID,
    'EvaluationType': 'Extrinsic', # ['Intrinsic', 'Extrinsic']
    
    # If 'EvaluationType' is set to 'Intrinsic', the two parameters below should be set as well
    'EvaluationMetrics': ['adjusted_rand', 'completeness', 'homogeneity', 'rand', 'v_measure',
                          'normalized_mutual_info', 'adjusted_mutual_info', 'mutual_info', 'fowlkes_mallows'],
    'GoldenStandardPath': '/path2GS', # Path to the golden standard
    # ----------------------------------------------------------------------------------
}

application = {
    'Threshold': 0.2, # A threshold for filtering low news article recommendation scores
    'TopK': 20 # Number of selected top news article recommendation candidates
}
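
The 'JO', 'Bin', and 'Threshold' entries in the tml section describe how a document's topic scores are quantized. The following sketch shows one plausible reading of those comments; it is not the code in TopicModeling.py. The function name, the order in which the two options are applied, and the treatment of a score exactly equal to the threshold are all assumptions.

```python
def quantize(scores, just_one=False, binary=True, threshold=0.2):
    """Quantize one document's topic scores.

    just_one: keep only the highest-scoring topic and zero out the rest.
    binary:   map scores at or above `threshold` to 1 and the rest to 0.
    """
    result = list(scores)
    if just_one and result:
        top = max(range(len(result)), key=result.__getitem__)
        result = [s if i == top else 0.0 for i, s in enumerate(result)]
    if binary:
        result = [1 if s >= threshold else 0 for s in result]
    return result


doc = [0.05, 0.55, 0.25, 0.15]  # made-up topic scores for one document

print(quantize(doc))                 # [0, 1, 1, 0]
print(quantize(doc, just_one=True))  # [0, 1, 0, 0]
print(quantize(doc, binary=False))   # [0.05, 0.55, 0.25, 0.15]
```

With the defaults shown in 'params.py' ('JO': False, 'Bin': True, 'Threshold': 0.2), this reading corresponds to the first call.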

5. Result

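News Recommendation metrics: mrr, ndcg5, ndcg10. User Prediction metrics: Precision, Recall, f1-measure.

| Category | Method | mrr | ndcg5 | ndcg10 | Precision | Recall | f1-measure |
|---|---|---|---|---|---|---|---|
| Community prediction | Our approach | 0.255 | 0.108 | 0.105 | 0.012 | 0.035 | 0.015 |
| Community prediction | Appel et al. [PKDD'18] | 0.176 | 0.056 | 0.055 | 0.007 | 0.094 | 0.0105 |
| Temporal community detection | Hu et al. [SIGMOD'15] | 0.173 | 0.056 | 0.049 | 0.007 | 0.136 | 0.013 |
| Temporal community detection | Fani et al. [CIKM'17] | 0.065 | 0.040 | 0.040 | 0.007 | 0.136 | 0.013 |
| Non-temporal link-based community detection | Ye et al. [CIKM'18] | 0.139 | 0.056 | 0.055 | 0.008 | 0.208 | 0.014 |
| Non-temporal link-based community detection | Louvain [JSTAT'08] | 0.108 | 0.048 | 0.055 | 0.004 | 0.129 | 0.007 |
| Collaborative filtering | rrn [WSDM'17] | 0.173 | 0.073 | 0.08 | 0.004 | 0.740 | 0.008 |
| Collaborative filtering | timesvd++ [KDD'08] | 0.141 | 0.058 | 0.064 | 0.003 | 0.657 | 0.005 |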
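
For readers unfamiliar with the news recommendation metrics, the sketch below shows the textbook definitions of mrr and ndcg at a cutoff k with binary relevance. It is a reference illustration with made-up data and hypothetical function names; the numbers in the table were produced by the framework's own evaluation, which may differ in details such as graded relevance or tie handling.

```python
import math


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mrr(rankings, relevants):
    """Mean reciprocal rank over paired (ranking, relevant-set) queries."""
    pairs = list(zip(rankings, relevants))
    return sum(reciprocal_rank(r, rel) for r, rel in pairs) / len(pairs)


def ndcg_at_k(ranking, relevant, k):
    """Binary-relevance nDCG@k: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranking[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Two made-up queries: recommended article ids and the truly relevant ones.
rankings = [["a", "b"], ["x", "y"]]
relevants = [{"b"}, {"x"}]

print(mrr(rankings, relevants))            # 0.75
print(ndcg_at_k(["a", "b", "c"], {"b"}, 5))
```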

6. License

©2021. This work is licensed under a CC BY-NC-SA 4.0 license.

Contact

Email: [email protected], [email protected]

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Acknowledgments

In this work, we use dynamicgem, mallet, pytrec_eval and other libraries. We would like to thank the authors of these libraries.

7. Citation

@inproceedings{DBLP:conf/ecir/FaniBD20,
  author    = {Hossein Fani and Ebrahim Bagheri and Weichang Du},
  title     = {Temporal Latent Space Modeling for Community Prediction},
  booktitle = {Advances in Information Retrieval - 42nd European Conference on {IR} Research, {ECIR} 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings, Part {I}},
  series    = {Lecture Notes in Computer Science},
  volume    = {12035},
  pages     = {745--759},
  publisher = {Springer},
  year      = {2020},
  url       = {https://doi.org/10.1007/978-3-030-45439-5\_49},
  doi       = {10.1007/978-3-030-45439-5\_49},
  timestamp = {Thu, 14 May 2020 10:17:16 +0200},
  biburl    = {https://dblp.org/rec/conf/ecir/FaniBD20.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
