Webinar: Scalable Data Pipeline

This repo contains example Terraform code for deploying a base pipeline in AWS that loads data into Snowflake. It also details the core settings that you need to tune to get the pipeline to scale up.

Usage

Running this module requires you to set up a few inputs first. Take your time and walk through these carefully to ensure everything is set up properly. These instructions are closely modelled on our existing quickstart-examples and you can find more detail and FAQs here.

Steps:

Configure a Snowflake destination by following the instructions noted here, which guide you through how to configure your Snowflake instance
Make a copy of terraform.example.tfvars named terraform.tfvars and update the snowflake_* variables with your own Snowflake settings
Update all other top-level settings with your own vpc_id, public_subnet_ids, prefix, ssh_ip_allowlist and ssh_public_key:

For the VPC settings, you can use the default network made available in your AWS account
The prefix should be unique to you so that resource names do not run into global conflicts
Generate a new SSH key with something like ssh-keygen -t rsa -b 4096, then set ssh_public_key to the contents of the generated .pub file
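
As a rough illustration of the steps above, a filled-in terraform.tfvars might look something like the fragment below. The variable names come from the Inputs table in this README; every value is a made-up placeholder (region, IDs, CIDR range, key material, Snowflake names) and must be replaced with your own. The format of the Snowflake account URL is an assumption, and the terraform.example.tfvars shipped in this repo may lay things out differently, so treat that file as the source of truth.

```hcl
# Illustrative terraform.tfvars fragment. All values are placeholders.

# Top-level settings
aws_region        = "eu-west-1"                              # placeholder region
prefix            = "jane-webinar"                           # pick something unique to you
vpc_id            = "vpc-00000000"                           # e.g. your default VPC
public_subnet_ids = ["subnet-00000000", "subnet-11111111"]   # public subnets in that VPC

# SSH access
ssh_ip_allowlist = ["203.0.113.10/32"]                       # placeholder CIDR, use your own IP
ssh_public_key   = "ssh-rsa AAAA...placeholder... you@host"  # contents of the generated .pub file

# Snowflake destination (URL format is an assumption)
snowflake_account_url        = "https://<your-account>.snowflakecomputing.com"
snowflake_database           = "<your-database>"
snowflake_schema             = "<your-schema>"
snowflake_loader_user        = "<your-loader-user>"
snowflake_loader_private_key = "<your-loader-private-key>"
```

The remaining required inputs (instance types, min/max sizes, Kinesis mode and DynamoDB capacities) are omitted here because they belong to the scaling settings covered in the next section.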

Setting up for scale

There are several exposed settings here that you will need to tune to get ready for scale - notably you will need to:

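Ensure EC2 auto-scaling is set up (ec2_enable_auto_scaling) and that the "max" instance counts (ec2_collector_max_size, ec2_enrich_max_size, ec2_sf_loader_max_size) are increased to allow for headroom
Ensure Kinesis is auto-scaling (kinesis_stream_mode_details) to allow it to be more reactive to event volume changes
Ensure the DynamoDB KCL tables can scale high enough (dyndb_kcl_read_max_capacity, dyndb_kcl_write_max_capacity) to support the aggressive checkpointing needed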

These settings and guidance are provided in the webinar (watch the recording!) but ultimately you need to tune the above settings until the pipeline absorbs all your traffic peaks without latency building up.
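
The three points above map onto inputs listed in the Inputs table, so a scale-oriented terraform.tfvars fragment could look roughly like this sketch. The numbers are arbitrary starting points rather than recommendations from the webinar, and "ON_DEMAND" is an assumption: it is the AWS name for the Kinesis stream mode that scales automatically, and it is assumed here that kinesis_stream_mode_details accepts that name directly. Check variables.tf before relying on it.

```hcl
# Illustrative scaling settings. All numbers are placeholders to tune.

# EC2: enable auto-scaling and leave headroom in the "max" counts
ec2_enable_auto_scaling = true

ec2_collector_min_size = 1
ec2_collector_max_size = 4

ec2_enrich_min_size = 1
ec2_enrich_max_size = 4

ec2_sf_loader_min_size = 1
ec2_sf_loader_max_size = 4

# Kinesis: assumed value, see the note above
kinesis_stream_mode_details = "ON_DEMAND"

# DynamoDB KCL tables: raise the ceilings so checkpointing is not throttled
dyndb_kcl_read_min_capacity  = 1
dyndb_kcl_read_max_capacity  = 50
dyndb_kcl_write_min_capacity = 1
dyndb_kcl_write_max_capacity = 50
```

A reasonable way to iterate is to raise one ceiling at a time, re-run the load test, and keep going until latency no longer builds up during peaks.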

Testing

If you want to load test with Locust as we did, you can find our test plan under the locust directory.

Requirements

Name	Version
terraform	1.5.7
aws	~> 3.75.0
random	~> 3.1.0

Providers

Name	Version
aws	3.75.2

Modules

Name	Source	Version
bad_1_stream	snowplow-devops/kinesis-stream/aws	0.3.0
collector_kinesis	snowplow-devops/collector-kinesis-ec2/aws	0.9.1
collector_lb	snowplow-devops/alb/aws	0.2.0
enrich_kinesis	snowplow-devops/enrich-kinesis-ec2/aws	0.6.1
enriched_stream	snowplow-devops/kinesis-stream/aws	0.3.0
raw_stream	snowplow-devops/kinesis-stream/aws	0.3.0
sf_loader	snowplow-devops/snowflake-streaming-loader-ec2/aws	0.1.2

Resources

Name	Type
aws_cloudwatch_dashboard.pipeline	resource
aws_key_pair.pipeline	resource
aws_caller_identity.current	data source

Inputs

Name	Description	Type	Default	Required
aws_region	The region in which the pipeline gets deployed	`string`	n/a	yes
dyndb_kcl_read_max_capacity	Max read units for KCL Table	`number`	n/a	yes
dyndb_kcl_read_min_capacity	Min read units for KCL Table	`number`	n/a	yes
dyndb_kcl_write_max_capacity	Max write units for KCL Table	`number`	n/a	yes
dyndb_kcl_write_min_capacity	Min write units for KCL Table	`number`	n/a	yes
ec2_collector_instance_type	Instance type for Collector	`string`	n/a	yes
ec2_collector_max_size	Max number of nodes for Collector	`number`	n/a	yes
ec2_collector_min_size	Min number of nodes for Collector	`number`	n/a	yes
ec2_enable_auto_scaling	Whether to enable EC2 auto-scaling for Collector & Enrich	`bool`	n/a	yes
ec2_enrich_instance_type	Instance type for Enrich	`string`	n/a	yes
ec2_enrich_max_size	Max number of nodes for Enrich	`number`	n/a	yes
ec2_enrich_min_size	Min number of nodes for Enrich	`number`	n/a	yes
ec2_sf_loader_instance_type	Instance type for Snowflake Loader	`string`	n/a	yes
ec2_sf_loader_max_size	Max number of nodes for Snowflake Loader	`number`	n/a	yes
ec2_sf_loader_min_size	Min number of nodes for Snowflake Loader	`number`	n/a	yes
kinesis_stream_mode_details	The mode in which Kinesis Streams are setup	`string`	n/a	yes
prefix	Will be prefixed to all resource names. Use to easily identify the resources created	`string`	n/a	yes
public_subnet_ids	The list of public subnets to deploy the components across	`list(string)`	n/a	yes
snowflake_account_url	Snowflake account URL to use	`string`	n/a	yes
snowflake_database	Snowflake database name	`string`	n/a	yes
snowflake_loader_private_key	The private key to use for the loader user	`string`	n/a	yes
snowflake_loader_user	The Snowflake user used by Snowflake Streaming Loader	`string`	n/a	yes
snowflake_schema	Snowflake schema name	`string`	n/a	yes
ssh_ip_allowlist	The list of CIDR ranges to allow SSH traffic from	`list(any)`	n/a	yes
ssh_public_key	The SSH public key to use for the deployment	`string`	n/a	yes
vpc_id	The VPC to deploy the components within	`string`	n/a	yes
iam_permissions_boundary	The permissions boundary ARN to set on IAM roles created	`string`	`""`	no
telemetry_enabled	Whether or not to send telemetry information back to Snowplow Analytics Ltd	`bool`	`true`	no
user_provided_id	An optional unique identifier to identify the telemetry events emitted by this stack	`string`	`""`	no
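
The last three inputs in the table are optional. As a small sketch, overriding them in terraform.tfvars might look like the following. The ARN and the identifier are hypothetical placeholders, not values taken from this repo.

```hcl
# Optional inputs. All values below are placeholders.
iam_permissions_boundary = "arn:aws:iam::123456789012:policy/example-boundary" # hypothetical ARN
telemetry_enabled        = false                                               # defaults to true
user_provided_id         = "example-team-id"                                   # hypothetical identifier
```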

Outputs

Name	Description
collector_dns_name	The ALB dns name for the Pipeline Collector
