terraform-aws-databricks-loader-ec2

A Terraform module which deploys the Snowplow Databricks Loader on an EC2 node.

Telemetry

By default, this module collects and forwards telemetry information to Snowplow so that we can understand how our applications are being used. No identifying information about your sub-account or account fingerprints is ever forwarded to us; the telemetry is limited to simple information about which modules and applications are deployed and active.

If you wish to subscribe to our mailing list for updates to these modules or for security advisories, please set the user_provided_id variable to include a valid email address at which we can reach you.
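
As a minimal sketch, the variable can be set directly on the module block. The email address is a placeholder, and the other required inputs are left out for brevity, so this fragment will not validate on its own.

```hcl
module "db_loader" {
  source = "snowplow-devops/databricks-loader-ec2/aws"

  # Required inputs omitted for brevity; see the Example section.

  # Placeholder address (assumption): replace it with one you actually monitor.
  user_provided_id = "ops@example.com"
}
```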

How do I disable it?

To disable telemetry simply set variable telemetry_enabled = false.
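
A minimal sketch of what that looks like on the module block, with the other required inputs left out for brevity (so this fragment will not validate on its own):

```hcl
module "db_loader" {
  source = "snowplow-devops/databricks-loader-ec2/aws"

  # Required inputs omitted for brevity; see the Example section.

  # Defaults to true; setting it to false turns telemetry off for this module.
  telemetry_enabled = false
}
```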

What are you collecting?

For details on what information is collected please see this module: https://github.com/snowplow-devops/terraform-snowplow-telemetry

Usage

The Databricks Loader loads transformed events from an S3 bucket into Databricks.

For more information on how it works, see this overview.

To configure Databricks, please refer to the quick start guide.

Duration settings such as folder_monitoring_period or retry_period should be given in the documented duration format.
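
For illustration, the fragment below sets a few duration inputs as strings. The values simply repeat the module defaults listed under Inputs, so they show the style the module itself uses, not an exhaustive list of accepted units. Other required inputs are omitted, so the fragment will not validate on its own.

```hcl
module "db_loader" {
  source = "snowplow-devops/databricks-loader-ec2/aws"

  # Required inputs omitted for brevity; see the Example section.

  # Durations are passed as strings; these values mirror the module defaults.
  folder_monitoring_period = "8 hours"
  retry_period             = "10 min"
  health_check_timeout     = "1 min"
}
```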

Example

Normally, this module would be used as part of our quick start guide. However, you can also use it standalone for a custom setup.

See example below:

# Note: This should be the same bucket that is used by the transformer to produce data to load
module "s3_pipeline_bucket" {
  source = "snowplow-devops/s3-bucket/aws"

  bucket_name = "your-bucket-name"
}

# Note: This should be the same queue that is passed to the transformer to produce data to load
resource "aws_sqs_queue" "db_message_queue" {
  content_based_deduplication = true
  kms_master_key_id           = "alias/aws/sqs"
  name                        = "db-loader.fifo"
  fifo_queue                  = true
}

module "transformer_wrp" {
  source  = "snowplow-devops/transformer-kinesis-ec2/aws"

  accept_limited_use_license = true

  name       = "transformer-server-wrp"
  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  stream_name             = module.enriched_stream.name
  s3_bucket_name          = module.s3_pipeline_bucket.id
  s3_bucket_object_prefix = "transformed/good/widerow/parquet"
  window_period_min       = 1
  sqs_queue_name          = aws_sqs_queue.db_message_queue.name

  transformation_type = "widerow"
  widerow_file_format = "parquet"

  ssh_key_name     = "your-key-name"
  ssh_ip_allowlist = ["0.0.0.0/0"]

  # Linking in the custom Iglu Server here
  custom_iglu_resolvers = [
    {
      name            = "Iglu Server"
      priority        = 0
      uri             = "http://your-iglu-server-endpoint/api"
      api_key         = var.iglu_super_api_key
      vendor_prefixes = []
    }
  ]
}

module "db_loader" {
  source = "snowplow-devops/databricks-loader-ec2/aws"

  accept_limited_use_license = true

  name       = "db-loader-server"
  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  sqs_queue_name = aws_sqs_queue.db_message_queue.name

  deltalake_catalog             = "<CATALOG>"
  deltalake_schema              = "<SCHEMA>"
  deltalake_host                = "<HOST>"
  deltalake_port                = "<PORT>"
  deltalake_http_path           = "<HTTP_PATH>"
  deltalake_auth_token          = "<AUTH_TOKEN>"
  databricks_aws_s3_bucket_name = module.s3_pipeline_bucket.id

  ssh_key_name     = "your-key-name"
  ssh_ip_allowlist = ["0.0.0.0/0"]

  # Linking in the custom Iglu Server here
  custom_iglu_resolvers = [
    {
      name            = "Iglu Server"
      priority        = 0
      uri             = "http://your-iglu-server-endpoint/api"
      api_key         = var.iglu_super_api_key
      vendor_prefixes = []
    }
  ]
}

Requirements

Name	Version
terraform	>= 1.0.0
aws	>= 3.72.0
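
If you call this module from your own root configuration, one way to pin compatible versions is a terraform block such as the sketch below. The constraints are copied from the table above; adjust them to whatever your stack already uses.

```hcl
terraform {
  # Matches the minimum Terraform version this module requires.
  required_version = ">= 1.0.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 3.72.0"
    }
  }
}
```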

Providers

Name	Version
aws	>= 3.72.0

Modules

Name	Source	Version
instance_type_metrics	snowplow-devops/ec2-instance-type-metrics/aws	0.1.2
service	snowplow-devops/service-ec2/aws	0.2.1
telemetry	snowplow-devops/telemetry/snowplow	0.5.0

Resources

Name	Type
aws_cloudwatch_log_group.log_group	resource
aws_iam_instance_profile.instance_profile	resource
aws_iam_policy.iam_policy	resource
aws_iam_policy.sts_credentials_policy	resource
aws_iam_role.iam_role	resource
aws_iam_role.sts_credentials_role	resource
aws_iam_role_policy_attachment.policy_attachment	resource
aws_iam_role_policy_attachment.sts_credentials_policy_attachment	resource
aws_security_group.sg	resource
aws_security_group_rule.egress_tcp_443	resource
aws_security_group_rule.egress_tcp_80	resource
aws_security_group_rule.egress_tcp_databricks	resource
aws_security_group_rule.egress_udp_123	resource
aws_security_group_rule.egress_udp_statsd	resource
aws_security_group_rule.ingress_tcp_22	resource
aws_caller_identity.current	data source
aws_iam_policy_document.sts_credentials_role	data source
aws_region.current	data source

Inputs

Name	Description	Type	Default	Required
databricks_aws_s3_bucket_name	AWS bucket name where data to load is stored	`string`	n/a	yes
deltalake_auth_token	Databricks deltalake auth token	`string`	n/a	yes
deltalake_host	Databricks deltalake host	`string`	n/a	yes
deltalake_http_path	Databricks deltalake http path	`string`	n/a	yes
deltalake_schema	Databricks deltalake schema	`string`	n/a	yes
name	A name which will be prepended to the resources created	`string`	n/a	yes
sqs_queue_name	SQS queue name	`string`	n/a	yes
ssh_key_name	The name of the SSH key-pair to attach to all EC2 nodes deployed	`string`	n/a	yes
subnet_ids	The list of subnets to deploy Loader across	`list(string)`	n/a	yes
vpc_id	The VPC to deploy Loader within	`string`	n/a	yes
accept_limited_use_license	Acceptance of the SLULA terms (https://docs.snowplow.io/limited-use-license-1.0/)	`bool`	`false`	no
amazon_linux_2_ami_id	The AMI ID to use, which must be based on Amazon Linux 2; by default the latest community version is used	`string`	`""`	no
app_version	Version of rdb loader databricks	`string`	`"5.6.0"`	no
associate_public_ip_address	Whether to assign a public ip address to this instance	`bool`	`true`	no
cloudwatch_logs_enabled	Whether application logs should be reported to CloudWatch	`bool`	`true`	no
cloudwatch_logs_retention_days	The length of time in days to retain logs for	`number`	`7`	no
config_override_b64	App config uploaded as a base64 encoded blob. This variable facilitates dev flow, if config is incorrect this can break the deployment.	`string`	`""`	no
custom_iglu_resolvers	The custom Iglu Resolvers that will be used by Stream Shredder	`list(object({ name = string, priority = number, uri = string, api_key = string, vendor_prefixes = list(string) }))`	`[]`	no
databricks_aws_s3_folder_monitoring_stage_url	AWS bucket URL of folder monitoring stage - must be within 'databricks_aws_s3_bucket_name' (NOTE: must be set if 'folder_monitoring_enabled' is true)	`string`	`""`	no
databricks_aws_s3_folder_monitoring_transformer_output_stage_url	AWS bucket URL of transformer output stage - must be within 'databricks_aws_s3_bucket_name' (NOTE: must be set if 'folder_monitoring_enabled' is true)	`string`	`""`	no
default_iglu_resolvers	The default Iglu Resolvers that will be used by Stream Shredder	`list(object({ name = string, priority = number, uri = string, api_key = string, vendor_prefixes = list(string) }))`	`[ { "api_key": "", "name": "Iglu Central", "priority": 10, "uri": "http://iglucentral.com", "vendor_prefixes": [] }, { "api_key": "", "name": "Iglu Central - Mirror 01", "priority": 20, "uri": "http://mirror01.iglucentral.com", "vendor_prefixes": [] } ]`	no
deltalake_catalog	Databricks deltalake catalog	`string`	`"hive_metastore"`	no
deltalake_port	Databricks deltalake port	`number`	`443`	no
folder_monitoring_enabled	Whether folder monitoring should be activated or not	`bool`	`false`	no
folder_monitoring_period	How often the folder should be checked by folder monitoring	`string`	`"8 hours"`	no
folder_monitoring_since	Specifies since when folder monitoring will check	`string`	`"14 days"`	no
folder_monitoring_until	Specifies until when folder monitoring will check	`string`	`"6 hours"`	no
health_check_enabled	Whether health check should be enabled or not	`bool`	`false`	no
health_check_freq	Frequency of health check	`string`	`"1 hour"`	no
health_check_timeout	How long to wait for a response for health check query	`string`	`"1 min"`	no
iam_permissions_boundary	The permissions boundary ARN to set on IAM roles created	`string`	`""`	no
instance_type	The instance type to use	`string`	`"t3a.micro"`	no
java_opts	Custom JAVA Options	`string`	`"-XX:InitialRAMPercentage=75 -XX:MaxRAMPercentage=75"`	no
private_ecr_registry	The URL of an ECR registry that the sub-account has access to (e.g. '000000000000.dkr.ecr.cn-north-1.amazonaws.com.cn/')	`string`	`""`	no
retry_period	How often batch of failed folders should be pulled into a discovery queue	`string`	`"10 min"`	no
retry_queue_enabled	Whether retry queue should be enabled or not	`bool`	`false`	no
retry_queue_interval	Artificial pause after each failed folder being added to the queue	`string`	`"10 min"`	no
retry_queue_max_attempt	How many attempts to make for each folder	`number`	`-1`	no
retry_queue_size	How many failures should be kept in memory	`number`	`-1`	no
sentry_dsn	DSN for Sentry instance	`string`	`""`	no
sentry_enabled	Whether Sentry should be enabled or not	`bool`	`false`	no
sp_tracking_app_id	App id for Snowplow tracking	`string`	`""`	no
sp_tracking_collector_url	Collector URL for Snowplow tracking	`string`	`""`	no
sp_tracking_enabled	Whether Snowplow tracking should be activated or not	`bool`	`false`	no
ssh_ip_allowlist	The list of CIDR ranges to allow SSH traffic from	`list(any)`	[ "0.0.0.0/0" ]	no
statsd_enabled	Whether Statsd should be enabled or not	`bool`	`false`	no
statsd_host	Hostname of StatsD server	`string`	`""`	no
statsd_port	Port of StatsD server	`number`	`8125`	no
stdout_metrics_enabled	Whether logging metrics to stdout should be activated or not	`bool`	`false`	no
tags	The tags to append to this resource	`map(string)`	`{}`	no
telemetry_enabled	Whether or not to send telemetry information back to Snowplow Analytics Ltd	`bool`	`true`	no
user_provided_id	An optional unique identifier to identify the telemetry events emitted by this stack	`string`	`""`	no
webhook_collector	URL of webhook collector	`string`	`""`	no
webhook_enabled	Whether webhook should be enabled or not	`bool`	`false`	no
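
As the table notes, both folder monitoring stage URLs must be set when folder_monitoring_enabled is true, and both must sit within databricks_aws_s3_bucket_name. The fragment below is a sketch of that combination only. The s3:// URL form and the "monitoring/" prefix are assumptions made for illustration, and the transformer output prefix is borrowed from the Example section. Other required inputs are omitted, so the fragment will not validate on its own.

```hcl
module "db_loader" {
  source = "snowplow-devops/databricks-loader-ec2/aws"

  # Other required inputs omitted for brevity; see the Example section.
  databricks_aws_s3_bucket_name = module.s3_pipeline_bucket.id

  folder_monitoring_enabled = true

  # Assumption: URLs use the s3:// scheme and live inside the bucket above.
  # The "monitoring/" prefix is hypothetical.
  databricks_aws_s3_folder_monitoring_stage_url = "s3://${module.s3_pipeline_bucket.id}/monitoring/"

  # Assumption: this matches the transformer's s3_bucket_object_prefix.
  databricks_aws_s3_folder_monitoring_transformer_output_stage_url = "s3://${module.s3_pipeline_bucket.id}/transformed/good/widerow/parquet/"
}
```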

Outputs

Name	Description
asg_id	ID of the ASG
asg_name	Name of the ASG
sg_id	ID of the security group attached to the Databricks Loader servers
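
These outputs can be re-exported from a root configuration or passed to other resources. A short sketch, assuming the module is instantiated as module.db_loader as in the Example section (the output names on the left are hypothetical):

```hcl
# Re-export the loader's security group ID, for example to reference it in other rules.
output "db_loader_sg_id" {
  description = "ID of the security group attached to the Databricks Loader servers"
  value       = module.db_loader.sg_id
}

# Re-export the Auto Scaling group name, for example for dashboards or alarms.
output "db_loader_asg_name" {
  description = "Name of the ASG running the Databricks Loader"
  value       = module.db_loader.asg_name
}
```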

Copyright and license

Licensed under the Snowplow Limited Use License Agreement. (If you are uncertain how it applies to your use case, check our answers to frequently asked questions.)
