
Webinar: Scalable Data Pipeline

[Architecture diagram]

This repo contains example Terraform code for deploying a base pipeline on AWS that loads data into Snowflake, along with the core settings you need to tweak to get the pipeline to scale up.

Usage

Running this module requires you to set up a few inputs first - take your time and walk through these carefully to ensure everything gets set up properly. These instructions are quite closely modelled on our existing quickstart-examples and you can find more detail / FAQs here.

Steps:

  1. You will need to configure a Snowflake destination - the instructions noted here will guide you through configuring your Snowflake instance
  2. Copy terraform.example.tfvars to terraform.tfvars and update the snowflake_* variables with your own Snowflake settings - a sketch of a filled-in terraform.tfvars follows this list
  3. Update all other top level settings with your own vpc_id / subnet_ids, prefix, ssh_ip_allowlist and ssh_public_key:
  • For the VPC settings you can use the default network made available in your AWS account
  • Prefix should be unique to "you" so as not to run into global conflicts
  • Generate a new SSH key with something like ssh-keygen -t rsa -b 4096 and set ssh_public_key to the contents of the generated .pub file
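
As a rough illustration of step 2, here is what a filled-in terraform.tfvars might look like - every value below is a placeholder, so substitute your own. The scale-related inputs (ec2_*, kinesis_stream_mode_details and dyndb_kcl_*) are covered in the next section.

```hcl
# terraform.tfvars - placeholder values only; substitute your own
prefix     = "yourname-pipeline" # must be unique to "you"
aws_region = "eu-west-1"

# Networking: the default VPC in your AWS account works fine
vpc_id            = "vpc-0123456789abcdef0"
public_subnet_ids = ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"]

# SSH access
ssh_ip_allowlist = ["203.0.113.0/24"] # lock this down to your own IP range
ssh_public_key   = "ssh-rsa AAAAB3NzaC1yc2E... you@yourhost"

# Snowflake destination (configured in step 1)
snowflake_account_url        = "https://<account>.snowflakecomputing.com"
snowflake_database           = "SNOWPLOW"
snowflake_schema             = "ATOMIC"
snowflake_loader_user        = "SNOWPLOW_LOADER_USER"
snowflake_loader_private_key = "MIIEvQ..." # key material for the loader user
```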

Setting up for scale

There are several exposed settings here that you will need to tune to get ready for scale - notably you will need to:

  1. Ensure EC2 auto-scaling is set up and that "max" instance counts are increased to allow for headroom
  2. Ensure Kinesis auto-scaling is enabled so that the streams react quickly to changes in event volume
  3. Ensure DynamoDB KCL tables can scale high enough to support the aggressive checkpointing needed

These settings and guidance are provided in the webinar (watch the recording!) but ultimately you need to tune the above settings until the pipeline absorbs all your traffic peaks without latency building up.
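
As a non-authoritative sketch, the scale-related inputs from the table below might be tuned along these lines - the numbers are illustrative starting points rather than recommendations, and the "ON_DEMAND" stream mode value is an assumption:

```hcl
# Illustrative starting points only - tune against your own traffic peaks

# 1. EC2 auto-scaling enabled, with generous "max" counts for headroom
ec2_enable_auto_scaling = true
ec2_collector_min_size  = 2
ec2_collector_max_size  = 8
ec2_enrich_min_size     = 2
ec2_enrich_max_size     = 8
ec2_sf_loader_min_size  = 1
ec2_sf_loader_max_size  = 4

# 2. Kinesis stream mode ("ON_DEMAND" assumed here) so capacity reacts
#    to event volume changes without manual re-sharding
kinesis_stream_mode_details = "ON_DEMAND"

# 3. DynamoDB KCL table capacity high enough for aggressive checkpointing
dyndb_kcl_read_min_capacity  = 10
dyndb_kcl_read_max_capacity  = 500
dyndb_kcl_write_min_capacity = 10
dyndb_kcl_write_max_capacity = 500
```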

Testing

If you want to use Locust as we have, you can find our plan under the locust directory.

Requirements

| Name | Version |
|------|---------|
| terraform | 1.5.7 |
| aws | ~> 3.75.0 |
| random | ~> 3.1.0 |

Providers

| Name | Version |
|------|---------|
| aws | 3.75.2 |

Modules

| Name | Source | Version |
|------|--------|---------|
| bad_1_stream | snowplow-devops/kinesis-stream/aws | 0.3.0 |
| collector_kinesis | snowplow-devops/collector-kinesis-ec2/aws | 0.9.1 |
| collector_lb | snowplow-devops/alb/aws | 0.2.0 |
| enrich_kinesis | snowplow-devops/enrich-kinesis-ec2/aws | 0.6.1 |
| enriched_stream | snowplow-devops/kinesis-stream/aws | 0.3.0 |
| raw_stream | snowplow-devops/kinesis-stream/aws | 0.3.0 |
| sf_loader | snowplow-devops/snowflake-streaming-loader-ec2/aws | 0.1.2 |

Resources

| Name | Type |
|------|------|
| aws_cloudwatch_dashboard.pipeline | resource |
| aws_key_pair.pipeline | resource |
| aws_caller_identity.current | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| aws_region | The region in which the pipeline gets deployed | string | n/a | yes |
| dyndb_kcl_read_max_capacity | Max read units for KCL Table | number | n/a | yes |
| dyndb_kcl_read_min_capacity | Min read units for KCL Table | number | n/a | yes |
| dyndb_kcl_write_max_capacity | Max write units for KCL Table | number | n/a | yes |
| dyndb_kcl_write_min_capacity | Min write units for KCL Table | number | n/a | yes |
| ec2_collector_instance_type | Instance type for Collector | string | n/a | yes |
| ec2_collector_max_size | Max number of nodes for Collector | number | n/a | yes |
| ec2_collector_min_size | Min number of nodes for Collector | number | n/a | yes |
| ec2_enable_auto_scaling | Whether to enable EC2 auto-scaling for Collector & Enrich | bool | n/a | yes |
| ec2_enrich_instance_type | Instance type for Enrich | string | n/a | yes |
| ec2_enrich_max_size | Max number of nodes for Enrich | number | n/a | yes |
| ec2_enrich_min_size | Min number of nodes for Enrich | number | n/a | yes |
| ec2_sf_loader_instance_type | Instance type for Snowflake Loader | string | n/a | yes |
| ec2_sf_loader_max_size | Max number of nodes for Snowflake Loader | number | n/a | yes |
| ec2_sf_loader_min_size | Min number of nodes for Snowflake Loader | number | n/a | yes |
| kinesis_stream_mode_details | The mode in which Kinesis Streams are set up | string | n/a | yes |
| prefix | Will be prefixed to all resource names. Use to easily identify the resources created | string | n/a | yes |
| public_subnet_ids | The list of public subnets to deploy the components across | list(string) | n/a | yes |
| snowflake_account_url | Snowflake account URL to use | string | n/a | yes |
| snowflake_database | Snowflake database name | string | n/a | yes |
| snowflake_loader_private_key | The private key to use for the loader user | string | n/a | yes |
| snowflake_loader_user | The Snowflake user used by Snowflake Streaming Loader | string | n/a | yes |
| snowflake_schema | Snowflake schema name | string | n/a | yes |
| ssh_ip_allowlist | The list of CIDR ranges to allow SSH traffic from | list(any) | n/a | yes |
| ssh_public_key | The SSH public key to use for the deployment | string | n/a | yes |
| vpc_id | The VPC to deploy the components within | string | n/a | yes |
| iam_permissions_boundary | The permissions boundary ARN to set on IAM roles created | string | "" | no |
| telemetry_enabled | Whether or not to send telemetry information back to Snowplow Analytics Ltd | bool | true | no |
| user_provided_id | An optional unique identifier to identify the telemetry events emitted by this stack | string | "" | no |

Outputs

| Name | Description |
|------|-------------|
| collector_dns_name | The ALB DNS name for the Pipeline Collector |
