# spiderexpress

A multipurpose network sampling tool, traversing the deserts of the internet.
In order to use spiderexpress, you need Python 3.8 or higher and Poetry installed on your system.

- Clone the repository: `git clone https://github.com/Leibniz-HBI/spiderexpress.git`.
- Install the dependencies: `cd spiderexpress` and `poetry install`.
- Activate the virtual environment: `poetry shell`.
- Run the CLI: `spiderexpress --help`.

In the future we will provide a PyPI package, which will make the installation process much easier.
```
$ spiderexpress --help

Usage: spiderexpress [OPTIONS] COMMAND [ARGS]...

  Traverse the deserts of the internet.

Options:
  --help  Show this message and exit.

Commands:
  create  create a new configuration
  start   start a job
```
This command creates a spiderexpress project in the current directory. By default, the project creation process is interactive, but this can be disabled by passing the `--non-interactive` flag.

```
Usage: spiderexpress create [OPTIONS] CONFIG

  create a new configuration

Options:
  --interactive / --non-interactive
```
This command starts a spiderexpress job with the given configuration file.

```
Usage: spiderexpress start [OPTIONS] CONFIG

  start a job

Options:
  --help  Show this message and exit.
```
A spiderexpress project could, for example, look like this:

```
├── my_project
│   ├── my_project.pe.yml
│   ├── my_project.db
│   └── seed_file.txt
```
Here, `my_project.db` is the resulting database, `my_project.pe.yml` is the project's configuration, in which a data source, a sampling strategy and other parameters may be specified (see Configuration for further details), and `seed_file.txt` is a text file which contains one node name per line.
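For illustration, a seed file could look like this (the node names below are made up):

```
some_channel
another_channel
yet_another_channel
```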
For example projects, please refer to the `examples` directory or the unit tests in `/tests/`.
spiderexpress utilizes YAML de-/serialization for its configuration file. As such, initializing a project is as easy as running `$ spiderexpress create`, and a pleasant and comforting dialogue prompt will guide you through the process. The resulting file could look something like this example:
```yaml
project_name: spider
batch_size: 150
db_url: test2.sqlite
max_iteration: 10000
edge_table_name: edge_list
node_table_name: node_list
seeds:
  - ...
connector: telegram
strategy:
  spikyball:
    layer_max_size: 150
    sampler:
      source_node_probability:
        coefficient: 1
        weights:
          subscriber_count: 4
          videos_count: 1
      target_node_probability:
        coefficient: 1
        weights:
      edge_probability:
        coefficient: 1
        weights:
          views: 1
```
How the tables are structured is determined by the configuration file. The following sections describe the minimal configuration for the tables as well as the configuration syntax.
Both tables are created if they do not exist; the names of the tables are determined by the configuration file. For example, consider the following configuration snippet:
```yaml
edge_raw_table:
  name: tg_edges_raw
  columns:
    post_id: Text
    datetime: Text
    views: Integer
    text: Text
    forwarded_message_url: Text
```
This will create a table named `tg_edges_raw`; under `columns`, the column names and their types are specified. Allowed types are `Text` and `Integer`.
Note: column names are used to get data from the connector; thus, they must be present in the connector's output, otherwise the columns will contain `None` values.
```mermaid
erDiagram
    NODE ||--|{ EDGE_RAW : targets
    NODE ||--|{ EDGE_RAW : originates
    NODE ||--|| SEEDS : tracks
    EDGE_AGG ||--|{ EDGE_RAW : aggregates
    APP_STATE ||--|{ SEEDS : numbers

    EDGE_RAW {
        string source "source node identifier"
        string target "target node identifier"
        any more-columns "many more columns if you like"
    }
    EDGE_AGG {
        string source PK "source node identifier"
        string target PK "target node identifier"
        integer weight "number of edges between both nodes in `edges_raw`"
    }
    NODE {
        string node_id "node identifier"
        integer degree
    }
    APP_STATE {
        integer iteration
        integer max_iteration
    }
    SEEDS {
        string node_id "seed's node identifier"
        integer iteration "seed's iteration identifier"
        datetime visited_at "when the node was visited"
        string status "has the node been processed?"
    }
```
The edges of the network are kept in two tables, `edges_raw` and `edges_agg`: the aggregated table persists only the sampled edges, whereas the raw table includes all edges spiderexpress collected in the process.

The raw edges table contains all edges that were collected by the connector; which columns are persisted is determined by the configuration as well as by the data coming from the connector. The following table lists the minimally necessary columns spiderexpress will create, although more metadata can be stored in the table.
| Column Name | Description |
|---|---|
| source | source node name |
| target | target node name |
| ... | optionally, additional data coming from the connector |
Columns are specified as key-value pairs; available data types are `Text` and `Integer`.
```yaml
edge_raw_table:
  name: tg_edges_raw
  columns:
    post_id: Text
    datetime: Text
    views: Integer
    text: Text
    forwarded_message_url: Text
```
As the name suggests, this table keeps the aggregated, simple edges which are persisted after sampling. The following table lists the minimal columns spiderexpress will create.
| Column Name | Description |
|---|---|
| source | source node name |
| target | target node name |
| weight | number of multi-edges between the two nodes |
| ... | optionally, additional data |
The aggregation can be specified in the configuration file, again with key-value pairs specifying the column name and the aggregation function. Available aggregation functions are `sum`, `min`, `max`, `avg` and `count`. Note: only numeric columns may be aggregated and used in the sampler configuration.
```yaml
edge_agg_table:
  name: tg_edges_agg
  columns:
    views: sum
```
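Conceptually, this aggregation corresponds to a group-by over the raw edge table. The following sketch is an illustration rather than spiderexpress's actual implementation; it shows what the `views: sum` specification above computes, with the `weight` column counting the collapsed multi-edges:

```python
import pandas as pd

# A tiny raw edge table with two parallel edges between a and b.
raw_edges = pd.DataFrame(
    {
        "source": ["a", "a", "a"],
        "target": ["b", "b", "c"],
        "views": [10, 20, 5],
    }
)

# Collapse multi-edges into simple edges: `weight` counts the collapsed
# edges, `views` is summed as specified in the configuration above.
agg_edges = (
    raw_edges.groupby(["source", "target"])
    .agg(weight=("views", "size"), views=("views", "sum"))
    .reset_index()
)
print(agg_edges)
#   source target  weight  views
# 0      a      b       2     30
# 1      a      c       1      5
```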
The nodes of the network are kept in a `node` table. The following table lists the minimally necessary columns spiderexpress will create, although more metadata can be stored in the table, similarly to the specification under Raw Edges.
| Column Name | Description |
|---|---|
| name | node identifier |
| ... | optionally, additional data coming from the connector |
```mermaid
stateDiagram-v2
    gathering : gathering node data for the first node in queue
    sampling : sampling aggregated edges in iteration
    retrying : retrying finding seeds

    [*] --> idle
    idle --> starting : load_config()
    starting --> gathering : initialize_seeds()
    gathering --> gathering
    gathering --> sampling
    sampling --> gathering : increment_iteration()
    sampling --> retrying
    retrying --> gathering : increment_iteration()
    sampling --> stopping
    stopping --> [*]
```
This connector reads a network from CSV files: one for the edges and, optionally, one for node information. The files must follow the column names stated above for both kinds of tables. The required configuration consists of the following key-value pairs:
| Key | Description |
|---|---|
| edge_list_location | relative or absolute path to the edge CSV file |
| node_list_location | relative or absolute path to the node CSV file |
| mode | either "in", "out" or "both"; determines which edges to emit |
This information must be given in the spiderexpress project configuration file; for example, consider the following configuration snippet:
```yaml
connector:
  csv:
    edge_list_location: path/to/file.csv
    node_list_location: path/to/file.csv
    mode: out
```
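For reference, a minimal edge list CSV for this connector could look like the following; the node names are hypothetical, and the file may carry additional columns:

```
source,target
alice,bob
bob,carol
carol,alice
```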
Note: in the current implementation the tables are reread on each call of the connector; thus, loading large networks will lead to long loading times.
- Telegram: a connector that scrapes the latest messages from a Telegram channel's public website. Find it on GitHub.
- Twitter: a connector to retrieve followers, friends and other things from Twitter. Find it on GitHub.
This strategy implements a weighted random sampling of the network's edges; the variables to use for the weighting can be specified in the configuration file.
```yaml
spikyball:
  layer_max_size: 150
  sampler:
    source_node_probability:
      coefficient: 1
      weights:
        subscriber_count: 4
        videos_count: 1
    target_node_probability:
      coefficient: 1
      weights:
    edge_probability:
      coefficient: 1
      weights:
        views: 1
```
The following table informs about the configuration keys and their meaning:
| Key | Description |
|---|---|
| layer_max_size | maximum number of nodes to sample in one iteration |
| source_node_probability | probability of sampling a source node |
| target_node_probability.weights | column names in the node table and the weights they should be assigned |
| edge_probability | probability of sampling an edge |
| *.weights | column names in the aggregated edge table and the weights they should be assigned |
| *.coefficient | coefficient to multiply the probability with |
Specified columns must be present in the node and edge tables, and thus, must be specified in the configuration file as well. See the configuration section for more information on where to specify which data columns should be included in both the node table and the aggregation specification for the aggregated edge table.
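To make the weighting concrete, the following is a minimal, hypothetical sketch of how such a score could be computed from a node table. It assumes a row's sampling probability is proportional to `coefficient` times the weighted sum of the named columns; the actual spikyball implementation may normalize or combine these scores differently:

```python
from typing import Dict

import pandas as pd


def weighted_scores(
    table: pd.DataFrame, coefficient: float, weights: Dict[str, float]
) -> pd.Series:
    """Score each row as coefficient * sum(weight * column value)."""
    score = pd.Series(0.0, index=table.index)
    for column, weight in weights.items():
        score += weight * table[column].astype(float)
    return coefficient * score


# Example: score source nodes by subscriber and video counts, mirroring the
# source_node_probability section of the configuration above.
nodes = pd.DataFrame({"subscriber_count": [100, 50], "videos_count": [10, 5]})
scores = weighted_scores(
    nodes, coefficient=1, weights={"subscriber_count": 4, "videos_count": 1}
)
probabilities = scores / scores.sum()  # normalize into sampling probabilities
```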
This strategy implements a random sampling of the network's edges (its implementation is shown below). The configuration is simple and only requires the following key-value pair:
| Key | Description |
|---|---|
| n | maximum number of nodes to sample in one iteration |
spiderexpress is extensible via plug-ins and sports two setuptools entry points to register plug-ins with (an example registration is sketched below):

- `spiderexpress.connectors`, under which a connector may be registered, i.e. a program that retrieves and returns new data from a data source.
- `spiderexpress.strategies`, under which sampling strategies may be registered.
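For illustration, registering both kinds of plug-ins from a Poetry-managed package could look like the following `pyproject.toml` snippet; the package, module and function names are hypothetical:

```toml
[tool.poetry.plugins."spiderexpress.connectors"]
my_connector = "my_package.connectors:my_connector"

[tool.poetry.plugins."spiderexpress.strategies"]
my_strategy = "my_package.strategies:my_strategy"
```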
The idea of a Connector is to deliver new information about the network to be explored. The function takes a `List[str]`, which is a list of node names for which we need information, and it returns two dataframes: the edges and the node information.

All Connectors must implement the following function interface:

```python
Connector = Callable[[list[str]], tuple[pd.DataFrame, pd.DataFrame]]
# Connector(node_names: List[str]) -> (DataFrame, DataFrame)
```
Where the returns are the following (a minimal example connector is sketched after the list):

- the first `DataFrame` is the table of new edges to be added to the network,
- the second `DataFrame` is the table of new nodes to be added to the network.
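As an illustration of this interface, a hypothetical do-nothing connector could look like this; it returns empty edge and node tables for any input:

```python
from typing import List, Tuple

import pandas as pd


def null_connector(node_names: List[str]) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """A do-nothing connector: returns empty edge and node tables."""
    edges = pd.DataFrame(columns=["source", "target"])
    nodes = pd.DataFrame(columns=["name"])
    return edges, nodes
```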
```python
Strategy = Callable[[pd.DataFrame, pd.DataFrame, list[str]], Tuple[list[str], pd.DataFrame, pd.DataFrame]]
# Strategy(edges: DataFrame, nodes: DataFrame, known_nodes: List[str]) -> (List[str], DataFrame, DataFrame)
```
The parameters are the following:

- the first `DataFrame` is the table of edges to be sampled,
- the second `DataFrame` is the table of nodes to be sampled,
- `List[str]` is the list of already visited nodes,
- `Dict` is a dictionary holding additional configuration information for the strategy.
Where the returns are the following:

- `List[str]` is a list of the new seed nodes for the next iteration,
- `DataFrame` is the table of new edges to be added to the network,
- `DataFrame` is the table of new nodes to be added to the network.
The registered plug-ins must follow the function interfaces stated above, although any additional parameters given in the configuration file will be passed into the function as well. E.g., a configuration file that states:

```yaml
strategy:
  random:
    n: 15
```

will result in the following function call at the sampling stage: `random(edges, nodes, known_nodes, {"n": 15})`.
To further illustrate the process, consider an implementation of random sampling; here our strategy is to select up to `n` random nodes for each layer:
```python
from typing import Any, Dict, List

import pandas as pd


def random_strategy(
    edges: pd.DataFrame,
    nodes: pd.DataFrame,
    known_nodes: List[str],
    configuration: Dict[str, Any],
):
    """Random sampling strategy."""
    # Split the edges table into edges _inside_ and _outside_ of the known network.
    mask = edges.target.isin(known_nodes)
    edges_inward = edges.loc[mask, :]
    edges_outward = edges.loc[~mask, :]
    # Select up to `n` outward edges to follow.
    if len(edges_outward) < configuration["n"]:
        edges_sampled = edges_outward
    else:
        edges_sampled = edges_outward.sample(n=configuration["n"], replace=False)
    # Select the target node names as seeds for the next layer.
    new_seeds = edges_sampled.target
    # Add the edges inside the known network as well as the sampled edges
    # to the known network.
    edges_to_add = pd.concat([edges_inward, edges_sampled])
    new_nodes = nodes.loc[nodes.name.isin(new_seeds), :]
    return new_seeds, edges_to_add, new_nodes
```
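To see the strategy in action, the following snippet builds a tiny edge and node table and runs one sampling step; the node names are made up, and the column names `source`, `target` and `name` follow the table layout described above:

```python
edges = pd.DataFrame({"source": ["a", "a", "b"], "target": ["b", "c", "d"]})
nodes = pd.DataFrame({"name": ["a", "b", "c", "d"]})

seeds, new_edges, new_nodes = random_strategy(
    edges, nodes, known_nodes=["a", "b"], configuration={"n": 2}
)
print(seeds.tolist())  # e.g. ['c', 'd'], the sampled targets become new seeds
```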
Next, consider an implementation of a connector that reads data from a CSV file. The connector is designed to read a list of edges and, optionally, a list of nodes from CSV files. The `CSVConnectorConfiguration` class defines the configuration options for the connector, including the location of the edge list CSV file, the mode of operation (`in`, `out`, or `both`), and the optional location of the node list CSV file.

The `csv_connector()` function takes two arguments: a list of node IDs and the configuration options. The function reads the edge list CSV file and (optionally) the node list CSV file using the pandas library. Based on the mode of operation specified in the configuration options, the function filters the edge list to return only those edges that are connected to the specified nodes. If a node list CSV file is provided, the function also filters the node list to return only those nodes that are connected to the specified edges. The function then returns two pandas dataframes, one containing the filtered edge list and the other containing the filtered node list (if provided).
- The function takes two arguments: `node_ids` (a list of node IDs) and `configuration` (either a dictionary or an instance of the `CSVConnectorConfiguration` class).

```python
def csv_connector(
    node_ids: List[str], configuration: Dict
) -> (pd.DataFrame, pd.DataFrame):
```
- The function reads the edge list CSV file using the pandas `read_csv()` function and stores the resulting dataframe in a variable called `edges`. The `dtype=str` argument specifies that all columns should be read in as strings.

```python
edges = pd.read_csv(configuration["edge_list_location"], dtype=str)
```
- If a node list CSV file location is provided in the configuration options, the function reads the node list CSV file likewise and stores the resulting dataframe in a variable called `nodes`; otherwise, `nodes` is set to `None`.

```python
nodes = (
    pd.read_csv(configuration["node_list_location"], dtype=str)
    if configuration["node_list_location"]
    else None
)
```
- Based on the `mode` specified in the configuration options, the function creates a boolean mask that filters the `edges` dataframe to only include edges that are connected to the specified nodes. If `mode` is set to `"in"`, the function filters for edges with a target node that is in `node_ids`. If `mode` is set to `"out"`, it filters for edges with a source node that is in `node_ids`. If `mode` is set to `"both"`, it filters for edges that are connected to nodes in `node_ids` in either direction (i.e., the edge's source or target node is in `node_ids`).

```python
if configuration["mode"] == "in":
    mask = edges["target"].isin(node_ids)
elif configuration["mode"] == "out":
    mask = edges["source"].isin(node_ids)
elif configuration["mode"] == "both":
    mask = edges["target"].isin(node_ids) | edges["source"].isin(node_ids)
else:
    raise ValueError(f"{configuration['mode']} is not one of 'in', 'out' or 'both'.")
```
- The function creates a new dataframe called `edge_return` that contains only the rows of the `edges` dataframe selected by the boolean mask.

```python
edge_return: pd.DataFrame = edges.loc[mask]
```
- Finally, the function returns a tuple containing two dataframes: `edge_return` (the filtered edge list) and `nodes.loc[nodes.name.isin(node_ids), :]` (the filtered node list, if a node list CSV file location was provided in the configuration options). If no node list was provided, the function returns an empty dataframe for the nodes instead.

```python
return (
    edge_return,
    nodes.loc[nodes.name.isin(node_ids), :]
    if nodes is not None
    else pd.DataFrame(),
)
```
And now, all together! A complete implementation of the CSV connector is shown below; a more involved implementation is provided in the `spiderexpress.connectors.csv` module in the connectors' directory.
```python
from typing import Dict, List

import pandas as pd


def csv_connector(
    node_ids: List[str], configuration: Dict
) -> (pd.DataFrame, pd.DataFrame):
    """The CSV connector!"""
    # Read the edge table.
    edges = pd.read_csv(configuration["edge_list_location"], dtype=str)
    # Read the node table if provided.
    nodes = (
        pd.read_csv(configuration["node_list_location"], dtype=str)
        if configuration["node_list_location"]
        else None
    )
    # Filter edges based on the mode of operation.
    if configuration["mode"] == "in":
        mask = edges["target"].isin(node_ids)
    elif configuration["mode"] == "out":
        mask = edges["source"].isin(node_ids)
    elif configuration["mode"] == "both":
        mask = edges["target"].isin(node_ids) | edges["source"].isin(node_ids)
    else:
        raise ValueError(f"{configuration['mode']} is not one of 'in', 'out' or 'both'.")
    # Keep only the edges that contain our input nodes.
    edge_return: pd.DataFrame = edges.loc[mask]
    return (
        edge_return,  # the filtered edges
        nodes.loc[nodes.name.isin(node_ids), :]  # the filtered nodes
        if nodes is not None  # if a node table was provided
        else pd.DataFrame(),
    )
```
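A hypothetical call to the connector could then look like this, with `edges.csv` being a file that follows the column layout described above:

```python
configuration = {
    "edge_list_location": "edges.csv",
    "node_list_location": None,  # no node table in this example
    "mode": "out",
}
edges, nodes = csv_connector(["some_node"], configuration)
# `edges` now holds all rows whose source is "some_node"; `nodes` is empty.
```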
This software uses Poetry as a package management solution. In order to install all dependencies:

- Have `poetry` installed.
- Clone this repository.
- `cd` into the newly created repository's folder.
- Install all dependencies with `poetry install`.
- To enter the virtual environment, run `poetry shell`.
- To run the test suite, run `pytest` inside the virtual environment's shell.
- To run this program, run `spiderexpress`.
2024, Philipp Kessling under the MIT license.