A Terraform module which deploys the Snowplow Databricks Loader on an EC2 node.
By default, this module collects and forwards telemetry to Snowplow so that we can understand how our applications are being used. No identifying information about your sub-account or account fingerprints is ever forwarded to us; the telemetry only covers which modules and applications are deployed and active.
If you wish to subscribe to our mailing list for updates to these modules or security advisories, please set the `user_provided_id` variable to include a valid email address at which we can reach you.
To disable telemetry, simply set the variable `telemetry_enabled = false`.
For details on what information is collected please see this module: https://github.com/snowplow-devops/terraform-snowplow-telemetry
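The two telemetry options described above can be sketched as the following fragment. This is an illustration only: the email address is a placeholder, and the required inputs of the module are omitted, so the block will not plan as written.

```hcl
module "db_loader" {
  source = "snowplow-devops/databricks-loader-ec2/aws"

  # ... required inputs omitted for brevity (see the full example further down) ...

  # Option 1: keep telemetry enabled and subscribe to updates.
  # The address below is a placeholder, not a real mailbox.
  user_provided_id = "ops@example.com"

  # Option 2: opt out of telemetry entirely.
  # telemetry_enabled = false
}
```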
The Databricks Loader loads transformed events from an S3 bucket into Databricks.
For more information on how it works, see this overview.
To configure Databricks, please refer to the quick start guide.
Duration settings such as `folder_monitoring_period` or `retry_period` should be given in the documented duration format.
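As a rough illustration of what such duration values look like, the fragment below sets the duration inputs to the same strings that this module lists as defaults. The `s3://` URLs and their paths are assumptions made up for the sketch, and the required inputs are omitted.

```hcl
module "db_loader" {
  source = "snowplow-devops/databricks-loader-ec2/aws"

  # ... required inputs omitted for brevity ...

  # Duration strings, here identical to the module defaults.
  folder_monitoring_enabled = true
  folder_monitoring_period  = "8 hours"
  folder_monitoring_since   = "14 days"
  folder_monitoring_until   = "6 hours"
  retry_period              = "10 min"

  # Both URLs must be set when folder monitoring is enabled and must point
  # inside the bucket passed as databricks_aws_s3_bucket_name.
  # The URL scheme and the paths below are assumed placeholders.
  databricks_aws_s3_folder_monitoring_stage_url                    = "s3://your-bucket-name/monitoring/"
  databricks_aws_s3_folder_monitoring_transformer_output_stage_url = "s3://your-bucket-name/transformed/good/widerow/parquet/"
}
```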
Normally, this module would be used as part of our quick start guide. However, you can also use it standalone for a custom setup.
See example below:
```hcl
# Note: This should be the same bucket that is used by the transformer to produce data to load
module "s3_pipeline_bucket" {
  source      = "snowplow-devops/s3-bucket/aws"
  bucket_name = "your-bucket-name"
}

# Note: This should be the same queue that is passed to the transformer to produce data to load
resource "aws_sqs_queue" "db_message_queue" {
  content_based_deduplication = true
  kms_master_key_id           = "alias/aws/sqs"
  name                        = "db-loader.fifo"
  fifo_queue                  = true
}

module "transformer_wrp" {
  source = "snowplow-devops/transformer-kinesis-ec2/aws"

  accept_limited_use_license = true

  name       = "transformer-server-wrp"
  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  stream_name             = module.enriched_stream.name
  s3_bucket_name          = module.s3_pipeline_bucket.id
  s3_bucket_object_prefix = "transformed/good/widerow/parquet"
  window_period_min       = 1
  sqs_queue_name          = aws_sqs_queue.db_message_queue.name

  transformation_type = "widerow"
  widerow_file_format = "parquet"

  ssh_key_name     = "your-key-name"
  ssh_ip_allowlist = ["0.0.0.0/0"]

  # Linking in the custom Iglu Server here
  custom_iglu_resolvers = [
    {
      name            = "Iglu Server"
      priority        = 0
      uri             = "http://your-iglu-server-endpoint/api"
      api_key         = var.iglu_super_api_key
      vendor_prefixes = []
    }
  ]
}

module "db_loader" {
  source = "snowplow-devops/databricks-loader-ec2/aws"

  accept_limited_use_license = true

  name       = "db-loader-server"
  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  sqs_queue_name = aws_sqs_queue.db_message_queue.name

  deltalake_catalog             = "<CATALOG>"
  deltalake_schema              = "<SCHEMA>"
  deltalake_host                = "<HOST>"
  deltalake_port                = "<PORT>"
  deltalake_http_path           = "<HTTP_PATH>"
  deltalake_auth_token          = "<AUTH_TOKEN>"
  databricks_aws_s3_bucket_name = module.s3_pipeline_bucket.id

  ssh_key_name     = "your-key-name"
  ssh_ip_allowlist = ["0.0.0.0/0"]

  # Linking in the custom Iglu Server here
  custom_iglu_resolvers = [
    {
      name            = "Iglu Server"
      priority        = 0
      uri             = "http://your-iglu-server-endpoint/api"
      api_key         = var.iglu_super_api_key
      vendor_prefixes = []
    }
  ]
}
```
Name | Version |
---|---|
terraform | >= 1.0.0 |
aws | >= 3.72.0 |
Name | Version |
---|---|
aws | >= 3.72.0 |
Name | Source | Version |
---|---|---|
instance_type_metrics | snowplow-devops/ec2-instance-type-metrics/aws | 0.1.2 |
service | snowplow-devops/service-ec2/aws | 0.2.1 |
telemetry | snowplow-devops/telemetry/snowplow | 0.5.0 |
Name | Type |
---|---|
aws_cloudwatch_log_group.log_group | resource |
aws_iam_instance_profile.instance_profile | resource |
aws_iam_policy.iam_policy | resource |
aws_iam_policy.sts_credentials_policy | resource |
aws_iam_role.iam_role | resource |
aws_iam_role.sts_credentials_role | resource |
aws_iam_role_policy_attachment.policy_attachment | resource |
aws_iam_role_policy_attachment.sts_credentials_policy_attachment | resource |
aws_security_group.sg | resource |
aws_security_group_rule.egress_tcp_443 | resource |
aws_security_group_rule.egress_tcp_80 | resource |
aws_security_group_rule.egress_tcp_databricks | resource |
aws_security_group_rule.egress_udp_123 | resource |
aws_security_group_rule.egress_udp_statsd | resource |
aws_security_group_rule.ingress_tcp_22 | resource |
aws_caller_identity.current | data source |
aws_iam_policy_document.sts_credentials_role | data source |
aws_region.current | data source |
Name | Description | Type | Default | Required |
---|---|---|---|---|
databricks_aws_s3_bucket_name | AWS bucket name where data to load is stored | string | n/a | yes |
deltalake_auth_token | Databricks deltalake auth token | string | n/a | yes |
deltalake_host | Databricks deltalake host | string | n/a | yes |
deltalake_http_path | Databricks deltalake http path | string | n/a | yes |
deltalake_schema | Databricks deltalake schema | string | n/a | yes |
name | A name which will be prepended to the resources created | string | n/a | yes |
sqs_queue_name | SQS queue name | string | n/a | yes |
ssh_key_name | The name of the SSH key-pair to attach to all EC2 nodes deployed | string | n/a | yes |
subnet_ids | The list of subnets to deploy Loader across | list(string) | n/a | yes |
vpc_id | The VPC to deploy Loader within | string | n/a | yes |
accept_limited_use_license | Acceptance of the SLULA terms (https://docs.snowplow.io/limited-use-license-1.0/) | bool | false | no |
amazon_linux_2_ami_id | The AMI ID to use, which must be based on Amazon Linux 2; by default the latest community version is used | string | "" | no |
app_version | Version of rdb loader databricks | string | "5.6.0" | no |
associate_public_ip_address | Whether to assign a public ip address to this instance | bool | true | no |
cloudwatch_logs_enabled | Whether application logs should be reported to CloudWatch | bool | true | no |
cloudwatch_logs_retention_days | The length of time in days to retain logs for | number | 7 | no |
config_override_b64 | App config uploaded as a base64 encoded blob. This variable facilitates dev flow; if the config is incorrect this can break the deployment. | string | "" | no |
custom_iglu_resolvers | The custom Iglu Resolvers that will be used by Stream Shredder | list(object({ | [] | no |
databricks_aws_s3_folder_monitoring_stage_url | AWS bucket URL of folder monitoring stage - must be within 'databricks_aws_s3_bucket_name' (NOTE: must be set if 'folder_monitoring_enabled' is true) | string | "" | no |
databricks_aws_s3_folder_monitoring_transformer_output_stage_url | AWS bucket URL of transformer output stage - must be within 'databricks_aws_s3_bucket_name' (NOTE: must be set if 'folder_monitoring_enabled' is true) | string | "" | no |
default_iglu_resolvers | The default Iglu Resolvers that will be used by Stream Shredder | list(object({ | [ | no |
deltalake_catalog | Databricks deltalake catalog | string | "hive_metastore" | no |
deltalake_port | Databricks deltalake port | number | 443 | no |
folder_monitoring_enabled | Whether folder monitoring should be activated or not | bool | false | no |
folder_monitoring_period | How often the folder should be checked by folder monitoring | string | "8 hours" | no |
folder_monitoring_since | Specifies since when folder monitoring will check | string | "14 days" | no |
folder_monitoring_until | Specifies until when folder monitoring will check | string | "6 hours" | no |
health_check_enabled | Whether health check should be enabled or not | bool | false | no |
health_check_freq | Frequency of health check | string | "1 hour" | no |
health_check_timeout | How long to wait for a response for health check query | string | "1 min" | no |
iam_permissions_boundary | The permissions boundary ARN to set on IAM roles created | string | "" | no |
instance_type | The instance type to use | string | "t3a.micro" | no |
java_opts | Custom JAVA Options | string | "-XX:InitialRAMPercentage=75 -XX:MaxRAMPercentage=75" | no |
private_ecr_registry | The URL of an ECR registry that the sub-account has access to (e.g. '000000000000.dkr.ecr.cn-north-1.amazonaws.com.cn/') | string | "" | no |
retry_period | How often a batch of failed folders should be pulled into a discovery queue | string | "10 min" | no |
retry_queue_enabled | Whether retry queue should be enabled or not | bool | false | no |
retry_queue_interval | Artificial pause after each failed folder being added to the queue | string | "10 min" | no |
retry_queue_max_attempt | How many attempts to make for each folder | number | -1 | no |
retry_queue_size | How many failures should be kept in memory | number | -1 | no |
sentry_dsn | DSN for Sentry instance | string | "" | no |
sentry_enabled | Whether Sentry should be enabled or not | bool | false | no |
sp_tracking_app_id | App id for Snowplow tracking | string | "" | no |
sp_tracking_collector_url | Collector URL for Snowplow tracking | string | "" | no |
sp_tracking_enabled | Whether Snowplow tracking should be activated or not | bool | false | no |
ssh_ip_allowlist | The list of CIDR ranges to allow SSH traffic from | list(any) | [ | no |
statsd_enabled | Whether Statsd should be enabled or not | bool | false | no |
statsd_host | Hostname of StatsD server | string | "" | no |
statsd_port | Port of StatsD server | number | 8125 | no |
stdout_metrics_enabled | Whether logging metrics to stdout should be activated or not | bool | false | no |
tags | The tags to append to this resource | map(string) | {} | no |
telemetry_enabled | Whether or not to send telemetry information back to Snowplow Analytics Ltd | bool | true | no |
user_provided_id | An optional unique identifier to identify the telemetry events emitted by this stack | string | "" | no |
webhook_collector | URL of webhook collector | string | "" | no |
webhook_enabled | Whether webhook should be enabled or not | bool | false | no |
Name | Description |
---|---|
asg_id | ID of the ASG |
asg_name | Name of the ASG |
sg_id | ID of the security group attached to the Databricks Loader servers |
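
One possible way to consume these outputs is sketched below: the loader's security group is allowed to reach a StatsD server over UDP, and the ASG name is re-exported from the root module. The variable `statsd_security_group_id` and the resource and output names are hypothetical and exist only for this illustration; `module.db_loader` refers to the module instance from the example earlier in this document.

```hcl
# Hypothetical input: the security group that protects your StatsD server.
variable "statsd_security_group_id" {
  type = string
}

# Allow UDP traffic on the default StatsD port from the loader instances.
resource "aws_security_group_rule" "statsd_from_db_loader" {
  type                     = "ingress"
  from_port                = 8125
  to_port                  = 8125
  protocol                 = "udp"
  security_group_id        = var.statsd_security_group_id
  source_security_group_id = module.db_loader.sg_id
}

# Re-export the ASG name, for example to find the instances in the console.
output "db_loader_asg_name" {
  value = module.db_loader.asg_name
}
```
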
Copyright 2023-current Snowplow Analytics Ltd.
Licensed under the Snowplow Limited Use License Agreement. (If you are uncertain how it applies to your use case, check our answers to frequently asked questions.)