This repo contains example Terraform code for deploying a base pipeline in AWS that loads data into Snowflake. It also details the core settings that you need to tweak to get the pipeline to scale up.
Running this module requires you to set up a few inputs first. Take your time and walk through these carefully to ensure everything is set up properly. These instructions are closely modelled on our existing quickstart-examples,
and you can find more detail / FAQs here.
Steps:
- You will need to configure a Snowflake destination - you can follow the instructions noted here, which will guide you through how to configure your Snowflake instance
- Make a copy of `terraform.example.tfvars` as `terraform.tfvars` and update the `snowflake_*` settings with your personal Snowflake settings
- Update all other top level settings with your own `vpc_id` / `public_subnet_ids`, `prefix`, `ssh_ip_allowlist` and `ssh_public_key`:
  - For the VPC settings you can use the default network made available in your AWS account
  - Prefix should be unique to "you" so as not to run into global conflicts
  - You should generate a new SSH key with something like `ssh-keygen -t rsa -b 4096` - you will need to update `ssh_public_key` with the `.pub` part of the generated key
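As a rough sketch of the steps above, a filled-in `terraform.tfvars` might look something like the following. The variable names are taken from the inputs table in this README; every value is a hypothetical placeholder (the IDs, region, CIDR range, Snowflake names and URL shape are assumptions for illustration), so treat `terraform.example.tfvars` as the authoritative template.

```hcl
# terraform.tfvars - illustrative placeholder values only, not working settings

# Should be unique to "you" to avoid global naming conflicts
prefix     = "jdoe-sp-demo"
aws_region = "eu-west-1" # assumption: use whichever region you deploy to

# Network settings - the default VPC and its public subnets are sufficient
vpc_id            = "vpc-00000000000000000"
public_subnet_ids = ["subnet-00000000000000000", "subnet-11111111111111111"]

# SSH access - restrict to your own IP range
ssh_ip_allowlist = ["203.0.113.10/32"]
# Contents of the .pub file produced by ssh-keygen
ssh_public_key = "<contents of your generated .pub file>"

# Snowflake destination settings (all placeholders)
snowflake_account_url        = "https://<your-account>.snowflakecomputing.com"
snowflake_loader_user        = "<loader user>"
snowflake_loader_private_key = "<private key for the loader user>"
snowflake_database           = "<database name>"
snowflake_schema             = "<schema name>"
```

The remaining required inputs (instance types, min/max sizes, Kinesis mode and DynamoDB capacities) are omitted here for brevity and still need to be set.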
There are several exposed settings here that you will need to tune to get ready for scale. Notably you will need to:
- Ensure EC2 auto-scaling is set up and that "max" instance counts are increased to allow for head-room
- Ensure Kinesis is auto-scaling to allow it to be more reactive to event volume changes
- Ensure DynamoDB KCL tables can scale high enough to support the aggressive checkpointing needed
These settings and guidance are provided in the webinar (watch the recording!) but ultimately you need to tune the above settings until the pipeline absorbs all your traffic peaks without latency building up.
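To make the three tuning areas above more concrete, here is a minimal sketch of how they map onto this module's inputs in `terraform.tfvars`. The variable names come from the inputs table in this README, but the numbers are arbitrary starting points rather than recommendations. The `"ON_DEMAND"` value is an assumption based on the AWS Kinesis stream mode names, so check the example tfvars for the values this module actually accepts.

```hcl
# Scaling-related settings - illustrative values only, tune against your own load tests

# EC2: enable auto-scaling and leave head-room in the "max" counts
ec2_enable_auto_scaling = true

ec2_collector_min_size = 1
ec2_collector_max_size = 10

ec2_enrich_min_size = 1
ec2_enrich_max_size = 10

ec2_sf_loader_min_size = 1
ec2_sf_loader_max_size = 10

# Kinesis: assumed to take an AWS stream mode name; on-demand lets the
# stream react to volume changes without manual shard management
kinesis_stream_mode_details = "ON_DEMAND"

# DynamoDB KCL tables: allow capacity to scale high enough for
# aggressive checkpointing (writes typically need the most head-room)
dyndb_kcl_read_min_capacity  = 5
dyndb_kcl_read_max_capacity  = 50
dyndb_kcl_write_min_capacity = 5
dyndb_kcl_write_max_capacity = 200
```

After changing these, re-run your load test and watch for latency building up before raising the limits further.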
If you want to use Locust as we have, you can find our plan under the `locust` directory.
Name | Version |
---|---|
terraform | 1.5.7 |
aws | ~> 3.75.0 |
random | ~> 3.1.0 |
Name | Version |
---|---|
aws | 3.75.2 |
Name | Source | Version |
---|---|---|
bad_1_stream | snowplow-devops/kinesis-stream/aws | 0.3.0 |
collector_kinesis | snowplow-devops/collector-kinesis-ec2/aws | 0.9.1 |
collector_lb | snowplow-devops/alb/aws | 0.2.0 |
enrich_kinesis | snowplow-devops/enrich-kinesis-ec2/aws | 0.6.1 |
enriched_stream | snowplow-devops/kinesis-stream/aws | 0.3.0 |
raw_stream | snowplow-devops/kinesis-stream/aws | 0.3.0 |
sf_loader | snowplow-devops/snowflake-streaming-loader-ec2/aws | 0.1.2 |
Name | Type |
---|---|
aws_cloudwatch_dashboard.pipeline | resource |
aws_key_pair.pipeline | resource |
aws_caller_identity.current | data source |
Name | Description | Type | Default | Required |
---|---|---|---|---|
aws_region | The region in which the pipeline gets deployed | string | n/a | yes |
dyndb_kcl_read_max_capacity | Max read units for KCL Table | number | n/a | yes |
dyndb_kcl_read_min_capacity | Min read units for KCL Table | number | n/a | yes |
dyndb_kcl_write_max_capacity | Max write units for KCL Table | number | n/a | yes |
dyndb_kcl_write_min_capacity | Min write units for KCL Table | number | n/a | yes |
ec2_collector_instance_type | Instance type for Collector | string | n/a | yes |
ec2_collector_max_size | Max number of nodes for Collector | number | n/a | yes |
ec2_collector_min_size | Min number of nodes for Collector | number | n/a | yes |
ec2_enable_auto_scaling | Whether to enable EC2 auto-scaling for Collector & Enrich | bool | n/a | yes |
ec2_enrich_instance_type | Instance type for Enrich | string | n/a | yes |
ec2_enrich_max_size | Max number of nodes for Enrich | number | n/a | yes |
ec2_enrich_min_size | Min number of nodes for Enrich | number | n/a | yes |
ec2_sf_loader_instance_type | Instance type for Snowflake Loader | string | n/a | yes |
ec2_sf_loader_max_size | Max number of nodes for Snowflake Loader | number | n/a | yes |
ec2_sf_loader_min_size | Min number of nodes for Snowflake Loader | number | n/a | yes |
kinesis_stream_mode_details | The mode in which Kinesis Streams are set up | string | n/a | yes |
prefix | Will be prefixed to all resource names. Use to easily identify the resources created | string | n/a | yes |
public_subnet_ids | The list of public subnets to deploy the components across | list(string) | n/a | yes |
snowflake_account_url | Snowflake account URL to use | string | n/a | yes |
snowflake_database | Snowflake database name | string | n/a | yes |
snowflake_loader_private_key | The private key to use for the loader user | string | n/a | yes |
snowflake_loader_user | The Snowflake user used by Snowflake Streaming Loader | string | n/a | yes |
snowflake_schema | Snowflake schema name | string | n/a | yes |
ssh_ip_allowlist | The list of CIDR ranges to allow SSH traffic from | list(any) | n/a | yes |
ssh_public_key | The SSH public key to use for the deployment | string | n/a | yes |
vpc_id | The VPC to deploy the components within | string | n/a | yes |
iam_permissions_boundary | The permissions boundary ARN to set on IAM roles created | string | "" | no |
telemetry_enabled | Whether or not to send telemetry information back to Snowplow Analytics Ltd | bool | true | no |
user_provided_id | An optional unique identifier to identify the telemetry events emitted by this stack | string | "" | no |
Name | Description |
---|---|
collector_dns_name | The ALB dns name for the Pipeline Collector |