A probabilistic model for hypergraphs that incorporates information about node covariates.
This repository contains the implementation of the HyCoSBM model presented in:
[1] Hypergraphs with node attributes: structure and inference.
Anna Badalyan, Nicolò Ruggeri, and Caterina De Bacco
[ArXiv]
HyCoSBM is a stochastic block model for higher-order interactions that can
incorporate node covariates for improved inference.
This code is made publicly available. If you make use of it, please cite our work
in the form of the reference above.
The implementation is based on the Hy-MMSBM model.
The code was developed using Python 3.9 and can be downloaded and used locally as-is.
To install the necessary packages, run the following command:
pip install -r requirements.txt
The inference of the affinity matrix w and community assignments u is
performed by running the code in main_inference.py.
The most basic run only needs a hypergraph, the number of communities K, and a path to store the results.
For example, to perform inference on the High School dataset with K=2
communities, one can run the following command:
python main_inference.py --K 2 --out_dir ./out_inference --pickle_file data/examples/high_school_dataset/hypergraph.pkl
The basic run, however, does not use the attributes. To add them, specify the path to a CSV file containing the attributes with the --attribute_file parameter, and the names of the columns to use as attributes with --attribute_names. By default, gamma = 0.0; this can be changed with, for example, --gamma 0.8. The following command runs inference on the High School dataset using the attributes class and sex, with K = 2 and gamma = 0.8:
python main_inference.py \
--K 2 \
--gamma 0.8 \
--out_dir ./out_inference \
--pickle_file data/examples/high_school_dataset/hypergraph.pkl \
--attribute_file data/examples/high_school_dataset/attributes.csv \
--attribute_names class sex
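When several runs with different settings are needed, it can be convenient to assemble the command programmatically. The following is a minimal sketch that uses only the flags described above; the helper name `build_inference_command` is hypothetical and is not part of the repository.

```python
import shlex


def build_inference_command(k, out_dir, pickle_file, gamma=None,
                            attribute_file=None, attribute_names=()):
    """Return the argument list for one main_inference.py run (hypothetical helper)."""
    cmd = [
        "python", "main_inference.py",
        "--K", str(k),
        "--out_dir", out_dir,
        "--pickle_file", pickle_file,
    ]
    # Only pass --gamma when explicitly requested, so the default stays in effect.
    if gamma is not None:
        cmd += ["--gamma", str(gamma)]
    # Attributes are optional: the file and the column names go together.
    if attribute_file is not None:
        cmd += ["--attribute_file", attribute_file,
                "--attribute_names", *attribute_names]
    return cmd


cmd = build_inference_command(
    2,
    "./out_inference",
    "data/examples/high_school_dataset/hypergraph.pkl",
    gamma=0.8,
    attribute_file="data/examples/high_school_dataset/attributes.csv",
    attribute_names=["class", "sex"],
)
# Print the command as a single shell-quoted line for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the run.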
It is possible to provide the input dataset in two formats.
1. Text format
A hypergraph can be provided as input via two .txt files,
containing the list of hyperedges and their respective weights.
This allows the user to provide arbitrary datasets as inputs.
To perform inference on a dataset specified in text format, provide the paths to the two
files as follows:
python main_inference.py \
--K 2 \
--out_dir ./out_inference \
--hyperedge_file data/examples/high_school_dataset/hyperedges.txt \
--weight_file data/examples/high_school_dataset/weights.txt
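To prepare a custom dataset in text format, the two files can be written with a few lines of Python. The sketch below assumes a layout that is not confirmed by this README: one hyperedge per line as space-separated node indices, and one weight per line in the same order. Check this against the example files in data/examples/high_school_dataset before relying on it. The function name `write_text_hypergraph` is hypothetical.

```python
import tempfile
from pathlib import Path


def write_text_hypergraph(hyperedges, weights, hye_path, weight_path):
    """Write hyperedges and weights to two aligned text files (assumed layout)."""
    if len(hyperedges) != len(weights):
        raise ValueError("need exactly one weight per hyperedge")
    # Assumption: line i of each file describes hyperedge i.
    hye_lines = [" ".join(str(node) for node in hye) for hye in hyperedges]
    weight_lines = [str(w) for w in weights]
    Path(hye_path).write_text("\n".join(hye_lines) + "\n")
    Path(weight_path).write_text("\n".join(weight_lines) + "\n")


# Toy data: three hyperedges of sizes 2, 3 and 4 with integer weights.
hyperedges = [(0, 1), (0, 2, 3), (1, 2, 3, 4)]
weights = [3, 1, 2]

out_dir = Path(tempfile.mkdtemp())
hye_file = out_dir / "hyperedges.txt"
weight_file = out_dir / "weights.txt"
write_text_hypergraph(hyperedges, weights, hye_file, weight_file)
```

The two paths can then be passed to --hyperedge_file and --weight_file.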
2. Pickle format
Alternatively, one can provide a Hypergraph instance, which is the main representation
used internally in the code (see src.data.representation), serialized via the
pickle Python library.
An example equivalent to the above is:
python main_inference.py \
--K 2 \
--out_dir ./out_inference \
--pickle_file data/examples/high_school_dataset/hypergraph.pkl
As with the text format, this allows the user to provide arbitrary hypergraphs as input.
Additional options can be specified. The full documentation is shown by running:
python main_inference.py --help
Among the important ones we list:
--assortative: whether to run inference with a diagonal affinity matrix w.
--max_hye_size: keep only hyperedges up to a given size for inference. If None, all hyperedges are used.
--w_prior and --u_prior: the rates of the exponential priors on the parameters. A value of zero is equivalent to no prior; any positive value is used for MAP inference. For non-uniform priors, the path to a file containing a NumPy array can be specified, which will be loaded via numpy.load.
--em_rounds: the number of EM steps during optimization. Increasing it is sometimes useful when the model does not converge rapidly.
--training_rounds: the number of models to train with different random initializations. The one with the highest log-likelihood is returned and saved.
--seed: integer random seed.
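The idea behind --training_rounds, training several randomly initialized models and keeping the one with the highest log-likelihood, can be illustrated with a small standalone sketch. Nothing below is taken from the repository: `fit_once` is a hypothetical stand-in that returns a fake score, and deriving one sub-seed per round from a master seed is an assumption about how reproducibility could be handled, not a description of what main_inference.py does.

```python
import random


def fit_once(seed):
    """Hypothetical stand-in for one training run; returns (model, log_likelihood)."""
    rng = random.Random(seed)
    model = {"seed": seed}
    # Fake score for illustration only; a real run would compute the log-likelihood.
    log_likelihood = -rng.uniform(100.0, 200.0)
    return model, log_likelihood


def best_of_n(training_rounds, seed):
    """Train several times and keep the run with the highest log-likelihood."""
    rng = random.Random(seed)
    # Derive one sub-seed per round so the whole procedure is reproducible.
    round_seeds = [rng.randrange(2**32) for _ in range(training_rounds)]
    runs = [fit_once(s) for s in round_seeds]
    return max(runs, key=lambda run: run[1])


best_model, best_ll = best_of_n(training_rounds=5, seed=123)
print(best_model, best_ll)
```

Because EM converges to local optima that depend on the initialization, comparing several restarts by log-likelihood is a common way to reduce sensitivity to a single unlucky start.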
All synthetically generated attributes and hypergraphs used in the experiments are available in the data/generated
folder.
All real datasets used in the experiments are publicly available.