TensorFlow GNN (TF-GNN) provides the `tfgnn_sampler` tool to facilitate local neighborhood learning and convenient batching for graph datasets. It offers a scalable, distributed way to sample even the largest publicly available graph datasets.

The Graph Sampler takes a sampling configuration, graph data, and optionally a list of seed nodes as its inputs and produces sampled subgraphs as its output. The graph data comes as `tf.Example`s in sharded files for graph edges and node features. The generated subgraphs are emitted as serialized `tf.Example`s that can be parsed as `tfgnn.GraphTensor` using `tfgnn.parse_example()`.
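For instance, once sampling has finished, the output can be read back and parsed along these lines (a minimal sketch; the schema path and batch size are assumptions, not part of any example in this guide):

```python
import tensorflow as tf
import tensorflow_gnn as tfgnn

# Build a GraphTensorSpec from the same schema that was used for sampling.
schema = tfgnn.read_schema("graph_schema.pbtxt")  # assumed path
graph_spec = tfgnn.create_graph_spec_from_schema_pb(schema)

# Each record is one serialized subgraph; parse them in batches.
dataset = tf.data.TFRecordDataset(["outputs/examples.tfrecords"])
dataset = dataset.batch(32)  # arbitrary batch size
dataset = dataset.map(
    lambda serialized: tfgnn.parse_example(graph_spec, serialized))
```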
The Graph Sampler is written in Apache Beam, an open-source SDK for expressing data processing pipelines in the Dataflow Model, with support for multiple infrastructure backends. A client writes an Apache Beam pipeline and, at runtime, specifies a runner that defines the compute environment in which the pipeline will execute.
The two Apache Beam abstractions of concern here are:
- Pipelines: computational steps expressed as a DAG (Directed Acyclic Graph)
- Runners: Environments for running Beam Pipelines
The simplest Beam runner is the `DirectRunner`, which lets you test Beam pipelines on local hardware. It requires all data to fit in memory on a single machine and runs user code with extra debug checks enabled. It is slow and should be used only for small-scale testing or prototyping.
The `DataflowRunner` enables clients to connect to Google Cloud Platform (GCP) and execute a Beam pipeline on GCP hardware through the Dataflow service. It scales horizontally, which makes it possible to sample graphs with billions of edges.
NOTE: Only use the `DirectRunner` for small-scale testing.
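To make the pipeline/runner split concrete, here is a minimal, self-contained Beam sketch (not part of the sampler; the project, region, and bucket are placeholders) showing that the same pipeline code runs under either runner:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner: runs locally; all data must fit on one machine.
local_options = PipelineOptions(runner="DirectRunner")

# DataflowRunner: the same pipeline, executed on GCP Dataflow workers.
cloud_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",            # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
)

# The pipeline (the DAG of computational steps) is independent of the runner.
with beam.Pipeline(options=local_options) as pipeline:
    _ = (
        pipeline
        | beam.Create([1, 2, 3])
        | beam.Map(lambda x: x * x)
        | beam.Map(print)
    )
```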
To successfully use the Graph Sampler, we need a few items set up. In particular, we need a schema for the graph, a specification for the sampling operations, and available data. As a motivating example, we can use the dataset of fake credit card data in the `examples/sampler/creditcard` directory.

Specifically, let's use the following graph schema, with a graph of customers, credit cards, and ownership edges linking them. Any node set with features should also have a feature called `#id`, and edge sets should contain `#source` and `#target` features that correspond to node `#id`s. These special features should not be explicitly specified in the graph schema. For more information, see the data prep guide. Here, both node sets have features besides simple `#id`s, so we need to specify the files that map the `#id`s to features.
```
node_sets {
  key: "customer"
  value {
    features {
      key: "name"
      value: {
        description: "Name"
        dtype: DT_STRING
      }
    }
    features {
      key: "address"
      value: {
        description: "address"
        dtype: DT_STRING
      }
    }
    features {
      key: "zipcode"
      value: {
        description: "Zipcode"
        dtype: DT_INT64
      }
    }
    features {
      key: "score"
      value: {
        description: "Credit score"
        dtype: DT_FLOAT
      }
    }
    metadata {
      filename: "customer.csv"
    }
  }
}
node_sets {
  key: "creditcard"
  value {
    metadata {
      filename: "creditcard.csv"
    }
    features {
      key: "number"
      value: {
        description: "Credit card number"
        dtype: DT_INT64
      }
    }
    features {
      key: "issuer"
      value: {
        description: "Credit card issuer institution"
        dtype: DT_STRING
      }
    }
  }
}
edge_sets {
  key: "owns_card"
  value {
    description: "Owns and uses the credit card."
    source: "customer"
    target: "creditcard"
    metadata {
      filename: "owns_card.csv"
    }
  }
}
```
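To make the `#id`/`#source`/`#target` convention concrete, hypothetical rows of the CSV files named in the metadata could look like this (the values are made up for illustration):

```
# customer.csv: one row per node; the #id column keys the node set.
#id,name,address,zipcode,score
c_001,Alice Example,1 Main St,94043,0.71
c_002,Bob Example,2 Side Ave,10001,0.55

# owns_card.csv: one row per edge, referencing node #ids.
#source,#target
c_001,cc_12345
c_002,cc_67890
```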
We also need a sampling spec that describes how to build the subgraphs. Here, we'll treat the customer as the seed node and sample up to 3 associated credit cards at random.
```
seed_op <
  op_name: "seed"
  node_set_name: "customer"
>
sampling_ops <
  op_name: "seed->creditcard"
  input_op_names: "seed"
  edge_set_name: "owns_card"
  sample_size: 3
  strategy: RANDOM_UNIFORM
>
```
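Deeper samples chain ops by name: each `sampling_ops` entry expands the frontier produced by the ops listed in its `input_op_names`. As a purely hypothetical illustration (the schema above has no such edge set), a second hop might look like:

```
sampling_ops <
  op_name: "creditcard->transaction"
  input_op_names: "seed->creditcard"   # expands the credit cards sampled above
  edge_set_name: "paid_with"           # hypothetical edge set, not in this schema
  sample_size: 5
  strategy: RANDOM_UNIFORM
>
```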
We can run the sampler on this data via the following command, using the Apache Beam direct runner.
```
cd <path-to>/gnn/examples/sampler/creditcard
tfgnn_sampler \
  --data_path="." \
  --graph_schema graph_schema.pbtxt \
  --sampling_spec sampling_spec.pbtxt \
  --output_samples outputs/examples.tfrecords \
  --runner DirectRunner
```
The `examples/sampler/mag` directory contains some of the components needed to run the sampler for OGBN-MAG, described in detail in the data prep guide. The directory should contain the following:

- `graph_schema.pbtxt`: a graph schema with filenames in the metadata for all edge sets and for all node sets with features.
- `sampling_spec.pbtxt`: a sampling specification.
- `run_mag.sh`: a script with the command to start the Dataflow pipeline run. It configures the location and machine types to use, sets the desired parallelism as a minimum/maximum number of workers, and sets the number of threads per worker.
- `setup.py`: used by the pipeline workers to install their dependencies, e.g., the `apache-beam[gcp]`, `tensorflow`, and `tensorflow_gnn` libraries (see the sketch after this list).
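For orientation, such a `setup.py` can be as minimal as the following sketch; the package name and version are placeholders, and the actual file in `examples/sampler/mag` is authoritative:

```python
import setuptools

setuptools.setup(
    name="tfgnn-mag-sampling",  # placeholder package name
    version="0.0.1",            # placeholder version
    # Dependencies installed on each Dataflow worker before the pipeline runs.
    install_requires=[
        "apache-beam[gcp]",
        "tensorflow",
        "tensorflow_gnn",
    ],
)
```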
This example further assumes that the OGB data has been converted to the unigraph format, e.g. using `tfgnn_convert_ogb_dataset`, and stored in cloud storage as sharded files for edges and node features:

```
nodes-paper.tfrecords-?????-of-?????
edges-affiliated-with.tfrecords-?????-of-?????
edges-cites.tfrecords-?????-of-?????
edges-has_topic.tfrecords-?????-of-?????
edges-writes.tfrecords-?????-of-?????
```
The sampler currently supports CSV files and TFRecord files for each graph piece. For TFRecords, the filename should be a glob pattern that matches the relevant shards. The sampler also supports a shorthand for the common sharding pattern, where `<filename>.tfrecords@<shard-count>` is read as `<filename>.tfrecords-?????-of-<shard-count padded to 5 digits>`; for example, `nodes-paper.tfrecords@100` matches `nodes-paper.tfrecords-?????-of-00100`.
Before running `run_mag.sh`, users must edit the `GOOGLE_CLOUD_PROJECT` and `DATA_PATH` variables in the script.
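Concretely, the edit amounts to something like the following (only the two variable names come from the script; the values shown are placeholders):

```bash
# Edit these before running run_mag.sh:
GOOGLE_CLOUD_PROJECT="my-gcp-project"  # placeholder GCP project id
DATA_PATH="gs://my-bucket/mag"         # placeholder path to the sharded unigraph files
```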
The sampler has achieved the following performance and costs with GCP Dataflow on several datasets. The costs here reflect GCP pricing in June 2023 and may change in the future.
| Dataset | Generated Subgraphs | Input Data Size | Subgraph Data Size | Machine Type | Min/Max Workers | Threads per Worker | Execution Time | Estimated Cost |
|---|---|---|---|---|---|---|---|---|
| OGBN-Arxiv | 169k | 1.4GB | 1.2GB | n1-highmem-2 | 5/15 | 2 | 18min | $0.18 |
| OGBN-MAG | 736k | 2.4GB | 108GB | n1-highmem-2 | 30/100 | 2 | 45min | $11 |
| MAG-240M (MAG240M LSC contest requirements) | 1.4M | 502GB | 945GB | n1-highmem-2 | 100/300 | 2 | 47min | $47 |
To demonstrate the scalability of this distributed sampler, we have also generated subgraphs for all 121M papers in the OGB LSC dataset.
| Dataset | Generated Subgraphs | Input Data Size | Subgraph Data Size | Machine Type | Min/Max Workers | Threads per Worker | Execution Time | Estimated Cost |
|---|---|---|---|---|---|---|---|---|
| MAG-240M (sampling subgraphs for all 121M papers) | 121M | 502GB | 67TB | n1-highmem-4 | 300/1000 | 1 | 4h | $4100 |