From 8658903844449290082f797978dcdb7d232875b1 Mon Sep 17 00:00:00 2001 From: Date: Fri, 12 Jul 2024 13:39:24 +1000 Subject: [PATCH] Deployed 31a8f53 with MkDocs version: 1.6.0 --- 404.html | 21 + .../eks-best-practices/index.html | 21 + .../docs/cost-optimization/index.html | 35 +- cost-optimization/docs/index.html | 21 + .../docs/node-decommission/index.html | 21 + index.html | 21 + .../docs/aws-glue/index.html | 21 + .../docs/hive-metastore/index.html | 21 + metastore-integrations/docs/index.html | 21 + .../docs/eks-node-placement/index.html | 21 + .../docs/fargate-node-placement/index.html | 21 + node-placement/docs/index.html | 21 + .../emr-containers-on-outposts/index.html | 21 + outposts/index.html | 21 + performance/docs/dra/index.html | 21 + performance/docs/index.html | 21 + search/search_index.json | 2 +- security/docs/index.html | 21 + .../docs/spark/data-encryption/index.html | 21 + security/docs/spark/encryption/index.html | 21 + .../docs/spark/network-security/index.html | 21 + security/docs/spark/secrets/index.html | 21 + sitemap.xml.gz | Bin 127 -> 127 bytes storage/docs/index.html | 21 + storage/docs/spark/ebs/index.html | 231 +- storage/docs/spark/fsx-lustre/index.html | 21 + storage/docs/spark/instance-store/index.html | 21 + submit-applications/docs/spark/index.html | 21 + .../docs/spark/java-and-scala/index.html | 21 + .../docs/spark/multi-arch-image/index.html | 23 +- .../docs/spark/pyspark/index.html | 21 + .../docs/spark/sparkr/index.html | 21 + .../docs/spark/sparksql/index.html | 21 + .../docs/change-log-level/index.html | 21 + .../docs/connect-spark-ui/index.html | 23 +- .../docs/eks-cluster-auto-scaler/index.html | 21 + troubleshooting/docs/index.html | 21 + troubleshooting/docs/karpenter/index.html | 21 + .../docs/rbac-permissions-errors/index.html | 21 + .../docs/reverse-proxy-sparkui/index.html | 1906 +++++++++++++++++ .../docs/self-hosted-shs/index.html | 23 +- .../where-to-look-for-spark-logs/index.html | 21 + 42 files changed, 2874 insertions(+), 83 deletions(-) create mode 100644 troubleshooting/docs/reverse-proxy-sparkui/index.html diff --git a/404.html b/404.html index a4dc039..b1f36c1 100644 --- a/404.html +++ b/404.html @@ -975,6 +975,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/best-practices-and-recommendations/eks-best-practices/index.html b/best-practices-and-recommendations/eks-best-practices/index.html index 2f4d189..ede7ecb 100644 --- a/best-practices-and-recommendations/eks-best-practices/index.html +++ b/best-practices-and-recommendations/eks-best-practices/index.html @@ -986,6 +986,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/cost-optimization/docs/cost-optimization/index.html b/cost-optimization/docs/cost-optimization/index.html index 5836e1b..b965ef0 100644 --- a/cost-optimization/docs/cost-optimization/index.html +++ b/cost-optimization/docs/cost-optimization/index.html @@ -986,6 +986,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • @@ -1649,19 +1670,23 @@

    Spot Interruption and Sparkhere

    PVC Reuse:

    -

    A PersistentVolume is a Kubernetes feature to provide persistent storage to container Pods running stateful workloads, and PersistentVolumeClaim (PVC) is to request the above storage in the container Pod for storage by a user. Apache Spark 3.1.0 introduced the ability to dynamically generate, mount, and remove Persistent Volume Claims, SPARK-25299 for Kubernetes workloads, which are basically volumes mounted into your Spark pods. This means Apache Spark does not have to pre-create the claims/volumes for the executors and delete it during the executor decommissioning.

    -

    If a Spark executor is killed due to EC2 Spot interruption or any other failure then the PVC is not deleted but persisted and reattached to another executor. If there are shuffle files in that volume then they are reused. Previously if an external shuffle service process or node became unavailable, the executors were killed and all the shuffle blocks were lost, which needed to be recomputed.

    +

A PersistentVolume is a Kubernetes feature that provides persistent storage to container Pods running stateful workloads, and a PersistentVolumeClaim (PVC) is a user's request for that storage in a Pod. Apache Spark 3.1.0 introduced the ability to dynamically create, mount, and remove Persistent Volume Claims (SPARK-29873) for Kubernetes workloads, which are essentially volumes mounted into your Spark pods. This means Apache Spark does not have to pre-create claims/volumes for the executors or delete them during executor decommissioning.

    +

PVC reuse was introduced in Spark 3.2. If a Spark executor is killed due to an EC2 Spot interruption or any other failure, its PVC is not deleted but is persisted throughout the entire job lifetime and reattached to a new executor for faster recovery. If there are shuffle files on that volume, they are reused. Without this feature, the executor pods own the dynamic PVCs, which means that if a pod or node becomes unavailable, the PVC is terminated, all the shuffle data on it is lost, and a recompute is triggered.

    -

    This feature is available on Amazon EMR version 6.8 and above. To set up this feature, you can add these lines to the executor configuration:

    +

This feature is available starting from Amazon EMR version 6.6. To set it up, add these configurations to your Spark jobs:

    "spark.kubernetes.driver.ownPersistentVolumeClaim": "true"
     "spark.kubernetes.driver.reusePersistentVolumeClaim": "true
     
    -

    One key benefit is that if any Executor running on EC2 Spot becomes unavailable, the new executor replacement can reuse the shuffle files from the PVC, avoiding recompute of the shuffle block. Dynamic PVC or persistence volume claim enables ‘true’ decoupling of data and processing when we are running Spark jobs on Kubernetes, as it can be used as a local storage to spill in-process files too. We recommend to enable PVC reuse feature because the time taken to resume the task when there is a Spot interruption is optimized as the files are used in-situ and there is no time required to move the files around.

    -

    If one or more of the nodes which are running executors is interrupted the underlying pods gets deleted and the driver gets the update. Note the driver is the owner of the PVC of the executors and they are not deleted.

    +

Since Spark 3.4 (EMR 6.12), the Spark driver can perform PVC-oriented executor allocation: Spark counts the total number of PVCs the job can have and holds off creating a new executor if the driver already owns the maximum number of PVCs. This eases the handover of an existing PVC from one executor to another. Add this extra configuration to improve PVC reuse performance:

    +
    "spark.kubernetes.driver.waitToReusePersistentVolumeClaim": "true"
    +
    + +
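Taken together, the three properties can be supplied through the spark-defaults classification of a start-job-run request, in the same shape as the other request examples in this guide. This is a minimal sketch; the waitToReusePersistentVolumeClaim line assumes EMR 6.12+ (Spark 3.4) as noted above.

    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.kubernetes.driver.ownPersistentVolumeClaim": "true",
          "spark.kubernetes.driver.reusePersistentVolumeClaim": "true",
          "spark.kubernetes.driver.waitToReusePersistentVolumeClaim": "true"
        }
      }
    ]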

One key benefit of PVC reuse is that if any executor running on EC2 Spot becomes unavailable, its replacement executor can reuse the shuffle data from the existing PVC, avoiding recomputation of the shuffle blocks. Dynamic PVCs enable 'true' decoupling of storage and compute when running Spark jobs on Kubernetes, since the volumes can also be used as local storage for spilling in-process files. We recommend enabling the PVC reuse feature because it optimizes the time taken to resume tasks after a Spot interruption: the files are used in place and do not need to be moved around.

    +

If one or more of the nodes running executors is interrupted, the underlying pods get deleted and the driver receives the update. Note that the driver is the owner of the PVCs attached to the executor pods, and the PVCs are not deleted throughout the job lifetime.

    22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action DELETED
     22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action MODIFIED
     22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action DELETED
    diff --git a/cost-optimization/docs/index.html b/cost-optimization/docs/index.html
    index 411a555..4e2c84d 100644
    --- a/cost-optimization/docs/index.html
    +++ b/cost-optimization/docs/index.html
    @@ -975,6 +975,27 @@
       
       
       
    +    
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/cost-optimization/docs/node-decommission/index.html b/cost-optimization/docs/node-decommission/index.html index c4a5aaa..e603af6 100644 --- a/cost-optimization/docs/node-decommission/index.html +++ b/cost-optimization/docs/node-decommission/index.html @@ -984,6 +984,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/index.html b/index.html index cf1c904..ef88947 100644 --- a/index.html +++ b/index.html @@ -1043,6 +1043,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/metastore-integrations/docs/aws-glue/index.html b/metastore-integrations/docs/aws-glue/index.html index 4d050b4..01c1e27 100644 --- a/metastore-integrations/docs/aws-glue/index.html +++ b/metastore-integrations/docs/aws-glue/index.html @@ -1058,6 +1058,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/metastore-integrations/docs/hive-metastore/index.html b/metastore-integrations/docs/hive-metastore/index.html index 0701777..203eeac 100644 --- a/metastore-integrations/docs/hive-metastore/index.html +++ b/metastore-integrations/docs/hive-metastore/index.html @@ -1076,6 +1076,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/metastore-integrations/docs/index.html b/metastore-integrations/docs/index.html index 121e658..29adb22 100644 --- a/metastore-integrations/docs/index.html +++ b/metastore-integrations/docs/index.html @@ -975,6 +975,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/node-placement/docs/eks-node-placement/index.html b/node-placement/docs/eks-node-placement/index.html index f3cc104..4bc4e77 100644 --- a/node-placement/docs/eks-node-placement/index.html +++ b/node-placement/docs/eks-node-placement/index.html @@ -986,6 +986,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/node-placement/docs/fargate-node-placement/index.html b/node-placement/docs/fargate-node-placement/index.html index b24f6f4..855fa96 100644 --- a/node-placement/docs/fargate-node-placement/index.html +++ b/node-placement/docs/fargate-node-placement/index.html @@ -986,6 +986,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/node-placement/docs/index.html b/node-placement/docs/index.html index 9112958..8f1dd8f 100644 --- a/node-placement/docs/index.html +++ b/node-placement/docs/index.html @@ -975,6 +975,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/outposts/emr-containers-on-outposts/index.html b/outposts/emr-containers-on-outposts/index.html index fda42e9..04d1ef5 100644 --- a/outposts/emr-containers-on-outposts/index.html +++ b/outposts/emr-containers-on-outposts/index.html @@ -1091,6 +1091,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/outposts/index.html b/outposts/index.html index 39b0f0c..6384082 100644 --- a/outposts/index.html +++ b/outposts/index.html @@ -975,6 +975,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/performance/docs/dra/index.html b/performance/docs/dra/index.html index 7f949be..62c08a8 100644 --- a/performance/docs/dra/index.html +++ b/performance/docs/dra/index.html @@ -986,6 +986,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/performance/docs/index.html b/performance/docs/index.html index 85d793f..5039af4 100644 --- a/performance/docs/index.html +++ b/performance/docs/index.html @@ -975,6 +975,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/search/search_index.json b/search/search_index.json index d6ed94e..478de60 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Introduction","text":"

Welcome to the EMR Containers Best Practices Guide. The primary goal of this project is to offer a set of best practices and templates to get started with Amazon EMR on EKS. We publish this guide on GitHub so we can iterate the content quickly, provide timely and effective recommendations for a variety of concerns, and easily incorporate suggestions from the broader community.

    "},{"location":"#amazon-emr-on-eks-workshop","title":"Amazon EMR on EKS Workshop","text":"

    If you are interested in step-by-step tutorials that leverage the best practices contained in this guide, please visit the Amazon EMR on EKS Workshop.

    "},{"location":"#contributing","title":"Contributing","text":"

    We encourage you to contribute to these guides. If you have implemented a practice that has proven to be effective, please share it with us by opening an issue or a pull request. Similarly, if you discover an error or flaw in the guide, please submit a pull request to correct it.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/","title":"EKS Best Practices and Recommendations","text":"

The Amazon EMR on EKS team has run scale tests on EKS clusters and compiled a list of recommendations. The purpose of this document is to share our recommendations for running large-scale EKS clusters supporting EMR on EKS.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#amazon-vpc-cni-best-practices","title":"Amazon VPC CNI Best practices","text":""},{"location":"best-practices-and-recommendations/eks-best-practices/#recommendation-1-improve-ip-address-utilization","title":"Recommendation 1: Improve IP Address Utilization","text":"

EKS clusters can run out of IP addresses for pods when they reach between 400 and 500 nodes. With the default CNI settings, each node can request more IP addresses than are required. To ensure that you don't run out of IP addresses, there are two solutions:

1. Set MINIMUM_IP_TARGET and WARM_IP_TARGET instead of the default setting of WARM_ENI_TARGET=1. The values of these settings will depend on your instance type, expected pod density, and workload; a sample command for setting them is shown after this list. More info about these CNI settings can be found here. The maximum number of IP addresses per node (and thus the maximum number of pods per node) depends on the instance type and can be looked up here.

2. If, even with the right CNI settings as described above, the subnets created by eksctl still do not provide enough addresses (by default eksctl creates a "/19" subnet for each nodegroup, which contains ~8.1k addresses), you can configure the CNI to take addresses from (larger) subnets that you create. For example, you could create a few "/16" subnets, which contain ~65k IP addresses per subnet. You should implement this option after you have configured the CNI settings as described in #1. To configure your pods to use IP addresses from larger manually-created subnets, use CNI custom networking (see below for more information):
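As a concrete example of item 1, the warm-pool variables are set on the aws-node daemonset in the same way as the other CNI settings in this guide; the values below are placeholders to tune for your instance types and pod density.

    # Example values - tune MINIMUM_IP_TARGET and WARM_IP_TARGET for your workload
    kubectl set env daemonset aws-node \
      -n kube-system MINIMUM_IP_TARGET=30 WARM_IP_TARGET=5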

    CNI custom networking

    By default, the CNI assigns the Pod\u2019s IP address from the worker node's primary elastic network interface's (ENI) security groups and subnet. If you don\u2019t have enough IP addresses in the worker node subnet, or prefer that the worker nodes and Pods reside in separate subnets to avoid IP address allocation conflicts between Pods and other resources in the VPC, you can use CNI custom networking.

    Enabling a custom network removes an available elastic network interface (and all of its available IP addresses for pods) from each worker node that uses it. The worker node's primary network interface is not used for pod placement when a custom network is enabled.

    If you want the CNI to assign IP addresses for Pods from a different subnet, you can set AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG environment variable to true.

    kubectl set env daemonset aws-node \\\n-n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true\n

    When AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true, the CNI will assign Pod IP address from a subnet defined in ENIConfig. The ENIConfig custom resource is used to define the subnet in which Pods will be scheduled.

    apiVersion: crd.k8s.amazonaws.com/v1alpha1\nkind: ENIConfig\nmetadata: \n  name: us-west-2a\nspec: \n  securityGroups: \n    - sg-0dff111a1d11c1c11\n  subnet: subnet-011b111c1f11fdf11\n

You will need to create an ENIConfig custom resource for each subnet you want to use for Pod networking.

    • The securityGroups field should have the ID of the security group attached to the worker nodes.

    • The name field should be the name of the Availability Zone in your VPC. If you name your ENIConfig custom resources after each Availability Zone in your VPC, you can enable Kubernetes to automatically apply the corresponding ENIConfig for the worker node Availability Zone with the following command.

    kubectl set env daemonset aws-node \\\n-n kube-system ENI_CONFIG_LABEL_DEF=failure-domain.beta.kubernetes.io/zone\n

    Note

Upon creating the ENIConfig custom resources, you will need to create new worker nodes. The existing worker nodes and Pods will remain unaffected.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#recommendation-2-prevent-ec2-vpc-api-throttling-from-assignprivateipaddresses-attachnetworkinterface","title":"Recommendation 2: Prevent EC2 VPC API throttling from AssignPrivateIpAddresses & AttachNetworkInterface","text":"

    Often EKS cluster scale-out time can increase because the CNI is being throttled by the EC2 VPC APIs. The following steps can be taken to prevent these issues:

1. Use CNI version 1.8.0 or later, as it makes fewer calls to the EC2 VPC APIs than earlier versions.

2. Configure the MINIMUM_IP_TARGET and WARM_IP_TARGET parameters instead of the default parameter of WARM_ENI_TARGET=1. Only those IP addresses that are necessary are requested from EC2. The values of these settings will depend on your instance type and expected pod density. More info about these settings can be found here.

    3. Request an API limit increase on the EC2 VPC APIs that are getting throttled. This option should be considered only after steps 1 & 2 have been done.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#other-recommendations-for-amazon-vpc-cni","title":"Other Recommendations for Amazon VPC CNI","text":""},{"location":"best-practices-and-recommendations/eks-best-practices/#plan-for-growth","title":"Plan for growth","text":"

    Size the subnets you will use for Pod networking for growth. If you have insufficient IP addresses available in the subnet that the CNI uses, your pods will not get an IP address. The pods will remain in the pending state until an IP address becomes available. This may impact application autoscaling and compromise its availability.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#monitor-ip-address-inventory","title":"Monitor IP address inventory","text":"

You can monitor the IP address inventory of subnets using the CNI Metrics Helper, and set CloudWatch alarms to get notified if a subnet is running out of IP addresses.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#snat-setting","title":"SNAT setting","text":"

Source Network Address Translation (source-nat or SNAT) allows traffic from a private network to go out to the internet. Virtual machines launched on a private network can get to the internet by going through a gateway capable of performing SNAT. If your Pods with private IP addresses need to communicate with other private IP address spaces (for example, Direct Connect, VPC Peering, or Transit VPC), then you should enable external SNAT in the CNI:

    kubectl set env daemonset \\\n-n kube-system aws-node AWS_VPC_K8S_CNI_EXTERNALSNAT=true\n
    "},{"location":"best-practices-and-recommendations/eks-best-practices/#coredns-best-practices","title":"CoreDNS Best practices","text":""},{"location":"best-practices-and-recommendations/eks-best-practices/#prevent-coredns-from-being-overwhelmed-unknownhostexception-in-spark-jobs-and-other-pods","title":"Prevent CoreDNS from being overwhelmed (UnknownHostException in spark jobs and other pods)","text":"

    CoreDNS is a deployment, which means it runs a fixed number of replicas and thus does not scale out with the cluster. This can be a problem for workloads that do a lot of DNS lookups. One simple solution is to install dns-autoscaler, which adjusts the number of replicas of the CoreDNS deployment as the cluster grows and shrinks.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#monitor-coredns-metrics","title":"Monitor CoreDNS metrics","text":"

CoreDNS is a deployment, which means it runs a fixed number of replicas and thus does not scale out with the cluster. This can cause workloads to time out with UnknownHostException, as Spark executors perform many DNS lookups while registering themselves with the Spark driver. One simple solution is to install dns-autoscaler, which adjusts the number of replicas of the CoreDNS deployment as the cluster grows and shrinks; a sketch of such a deployment follows below.
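A sketch of what the dns-autoscaler can look like, following the upstream cluster-proportional-autoscaler pattern. The image tag, the scaling parameters, and the assumption that a dns-autoscaler ServiceAccount with the required RBAC already exists are all placeholders to adapt to your cluster.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: dns-autoscaler
      namespace: kube-system
    spec:
      replicas: 1
      selector:
        matchLabels:
          k8s-app: dns-autoscaler
      template:
        metadata:
          labels:
            k8s-app: dns-autoscaler
        spec:
          serviceAccountName: dns-autoscaler   # assumed to exist with RBAC to scale deployments
          containers:
          - name: autoscaler
            image: registry.k8s.io/cpa/cluster-proportional-autoscaler:<version>   # placeholder tag
            command:
            - /cluster-proportional-autoscaler
            - --namespace=kube-system
            - --configmap=dns-autoscaler
            - --target=deployment/coredns
            # example scaling curve - tune coresPerReplica/nodesPerReplica for your cluster
            - --default-params={"linear":{"coresPerReplica":256,"nodesPerReplica":16,"min":2,"preventSinglePointFailure":true}}
            - --logtostderr=true
            - --v=2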

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#cluster-autoscaler-best-practices","title":"Cluster Autoscaler Best practices","text":""},{"location":"best-practices-and-recommendations/eks-best-practices/#increase-cluster-autoscaler-memory-to-avoid-unnecessary-exceptions","title":"Increase cluster-autoscaler memory to avoid unnecessary exceptions","text":"

Cluster-autoscaler can require a lot of memory to run because it stores a lot of information about the state of the cluster, such as data about every pod and every node. If the cluster-autoscaler has insufficient memory, it can crash. Ensure that you give the cluster-autoscaler deployment more memory, e.g., 1Gi instead of the default 300Mi. Useful information about configuring the cluster-autoscaler for improved scalability and performance can be found here.
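One way to raise the memory request and limit is a minimal sketch like the following, assuming the deployment is named cluster-autoscaler and runs in the kube-system namespace (adjust both for your installation):

    # Assumes the deployment is named cluster-autoscaler in kube-system; adjust for your install
    kubectl set resources deployment cluster-autoscaler \
      -n kube-system --requests=memory=1Gi --limits=memory=1Gi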

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#avoid-job-failures-when-cluster-autoscaler-attempts-scale-in","title":"Avoid job failures when Cluster Autoscaler attempts scale-in","text":"

Cluster Autoscaler will attempt a scale-in action for any underutilized instance within your EKS cluster. When a scale-in action is performed, all pods from that instance are relocated to another node. This could cause disruption for critical workloads; for example, if the driver pod is restarted, the entire job needs to restart. For this reason, we recommend using Kubernetes annotations on all critical pods (especially driver pods) and on the cluster-autoscaler deployment. Please see here for more info.

    cluster-autoscaler.kubernetes.io/safe-to-evict=false\n
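For example, the annotation can be carried on the Spark driver through a pod template, following the same pod template pattern used elsewhere in this guide (a sketch; adjust labels and container details to your job):

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        spark-role: driver
      annotations:
        # Tell Cluster Autoscaler not to evict this pod during scale-in
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
      - name: spark-kubernetes-driver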
    "},{"location":"best-practices-and-recommendations/eks-best-practices/#configure-overprovisioning-with-cluster-autoscaler-for-higher-priority-jobs","title":"Configure overprovisioning with Cluster Autoscaler for higher priority jobs","text":"

If the required resources are not available in the cluster, pods go into a pending state. Cluster Autoscaler uses this signal to scale out the cluster, and this activity can be time-consuming (several minutes) for higher-priority jobs. To minimize the time required for scaling, we recommend overprovisioning resources. You can launch pause pods (dummy workloads that sleep until they receive SIGINT or SIGTERM) with negative priority to reserve EC2 capacity. Once the higher-priority jobs are scheduled, these pause pods are preempted to make room for them, which in turn scales out additional capacity as a buffer. Be aware that this is a trade-off: it adds slightly higher cost while minimizing scheduling latency. You can read more about the overprovisioning best practice here.
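A minimal sketch of such an overprovisioning buffer is shown below; the PriorityClass name, replica count, pause image tag, and resource requests are placeholders to size for your own headroom.

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: overprovisioning
    value: -1                  # negative priority so real workloads preempt the buffer pods
    globalDefault: false
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: overprovisioning-buffer
    spec:
      replicas: 2
      selector:
        matchLabels:
          run: overprovisioning-buffer
      template:
        metadata:
          labels:
            run: overprovisioning-buffer
        spec:
          priorityClassName: overprovisioning
          containers:
          - name: pause
            image: registry.k8s.io/pause:3.9   # placeholder dummy workload
            resources:
              requests:
                cpu: "1"                       # capacity to reserve per buffer pod
                memory: 4Gi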

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#eks-control-plane-best-practices","title":"EKS Control Plane Best practices","text":""},{"location":"best-practices-and-recommendations/eks-best-practices/#api-server-overwhelmed","title":"API server overwhelmed","text":"

    System pods, workload pods, and external systems can make many calls to the Kubernetes API server. This can decrease performance and also increase EMR on EKS job failures. There are multiple ways to avoid API server availability issues including but not limited to:

    • By default, the EKS API servers are automatically scaled to meet your workload demand. If you see increased latencies, please contact AWS via a support ticket and work with engineering team to resolve the issue.

• Consider increasing the scan interval of the cluster-autoscaler from the default value of 10 seconds. Each time the cluster-autoscaler runs, it makes many calls to the API server, so scanning less often reduces that load. However, this will result in the cluster scaling out less frequently and in larger steps (and the same applies to scaling back in when load is reduced). More information about the cluster-autoscaler can be found here. This is not recommended if you need jobs to start as soon as possible.

• If you are running your own deployment of fluentd, an increased load on the API server can be observed. Consider using fluent-bit instead, which makes fewer calls to the API server. More info can be found here.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#monitor-control-plane-metrics","title":"Monitor Control Plane Metrics","text":"

    Monitoring Kubernetes API metrics can give you insights into control plane performance and identify issues. An unhealthy control plane can compromise the availability of the workloads running inside the cluster. For example, poorly written controllers can overload the API servers, affecting your application's availability.

    Kubernetes exposes control plane metrics at the /metrics endpoint.

    You can view the metrics exposed using kubectl:

    kubectl get --raw /metrics\n

    These metrics are represented in a Prometheus text format.

You can use Prometheus to collect and store these metrics. In May 2020, CloudWatch added support for monitoring Prometheus metrics in CloudWatch Container Insights, so you can also use Amazon CloudWatch to monitor the EKS control plane. You can follow the Tutorial for Adding a New Prometheus Scrape Target: Prometheus API Server Metrics to collect metrics and create a CloudWatch dashboard to monitor your cluster's control plane.

    You can also find Kubernetes API server metrics here. For example, apiserver_request_duration_seconds can indicate how long API requests are taking to run.

    Consider monitoring these control plane metrics:

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#api-server","title":"API Server","text":"Metric Description apiserver_request_total Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, client, and HTTP response contentType and code. apiserver_request_duration_seconds* Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope, and component. rest_client_request_duration_seconds Request latency in seconds. Broken down by verb and URL. apiserver_admission_controller_admission_duration_seconds Admission controller latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit). rest_client_request_duration_seconds Request latency in seconds. Broken down by verb and URL. rest_client_requests_total Number of HTTP requests, partitioned by status code, method, and host."},{"location":"best-practices-and-recommendations/eks-best-practices/#etcd","title":"etcd","text":"Metric Description etcd_request_duration_seconds Etcd request latency in seconds for each operation and object type.

You can visualize and monitor these Kubernetes API server request, latency, and etcd metrics in Grafana via Grafana dashboard 12006.

    "},{"location":"cost-optimization/docs/cost-optimization/","title":"Cost Optimization using EC2 Spot Instances","text":""},{"location":"cost-optimization/docs/cost-optimization/#ec2-spot-best-practices","title":"EC2 Spot Best Practices","text":"

Amazon EMR on Amazon EKS enables you to submit Apache Spark jobs on demand on Amazon Elastic Kubernetes Service (EKS) without provisioning dedicated EMR clusters. With EMR on EKS, you can consolidate analytical workloads with your other Kubernetes-based applications on the same Amazon EKS cluster to improve resource utilization and simplify infrastructure management. Cost optimization of the underlying infrastructure is often a key requirement for our customers, and this can be achieved by using Amazon EC2 Spot Instances. Spot Instances are spare EC2 capacity and are available at up to a 90% discount compared to On-Demand Instance prices. If EC2 needs the capacity back for On-Demand Instance usage, Spot Instances can be interrupted. Handling interruptions to build resilient workloads is simple, and there are best practices to manage interruptions through automation or AWS services like EKS.

This document describes how to architect with EC2 Spot best practices and apply them to EMR on EKS jobs. We will also cover Spark features related to EC2 Spot when you run EMR on EKS jobs.

    "},{"location":"cost-optimization/docs/cost-optimization/#ec2-spot-capacity-provisioning","title":"EC2 Spot Capacity Provisioning","text":"

EMR on EKS runs open-source big data frameworks like Spark on Amazon EKS, so when you run on Spot Instances you are provisioning capacity for the underlying EKS cluster. The key point to remember when using Spot Instances is instance diversification. There are three ways that EC2 Spot capacity can be provisioned in an EKS cluster.

    EKS Managed Nodegroup:

We highly recommend using Managed Nodegroups for provisioning Spot Instances. This requires significantly less operational effort compared to self-managed nodegroups. Spot Instance interruptions are handled proactively using the Instance Rebalance Recommendation, and the Spot best practice of using the Capacity Optimized Allocation strategy is adopted by default, along with other useful features. If you are planning to scale your cluster, Cluster Autoscaler can be used, but keep in mind one caveat with this approach: you must maintain the same vCPU-to-memory ratio for nodes defined in a nodegroup.

    Karpenter:

An open-source node provisioning tool for Kubernetes that works seamlessly with EMR on EKS. Karpenter can help improve the efficiency and cost of running workloads by provisioning nodes based on pod resource requirements. Its key advantage is flexibility: not only in terms of EC2 pricing (Spot/On-Demand), it also aligns with the Spot best practice of instance diversification and uses the capacity-optimized-prioritized allocation strategy; more details can be found in this workshop. Karpenter is also useful for scaling the infrastructure, which is discussed further under the scaling section below.

    Self-Managed Nodegroup:

EMR on EKS clusters can also run on self-managed nodegroups on EKS. You need to manage the Spot Instance lifecycle during interruptions by installing an open-source tool named AWS Node Termination Handler. AWS Node Termination Handler ensures that the Kubernetes control plane responds appropriately to events that can cause your EC2 instance to become unavailable, such as EC2 maintenance events, EC2 Spot interruptions, ASG Scale-In, ASG AZ Rebalance, and EC2 Instance Termination via the API or Console. Remember that you need to manage all software updates manually if you plan to use this. When you use dynamic allocation, the nodegroups need to autoscale, and if you use Cluster Autoscaler you need to maintain the same vCPU-to-memory ratio for nodes defined in a nodegroup.

    "},{"location":"cost-optimization/docs/cost-optimization/#spot-interruption-and-spark","title":"Spot Interruption and Spark","text":"

EC2 Spot Instances are suitable for flexible and fault-tolerant workloads. Spark is semi-resilient by design: if an executor fails, new executors are spun up by the driver to continue the job; however, if the driver fails, the entire job fails. For added resiliency, EMR on EKS retries the driver pod up to 5 times so that Kubernetes can find a suitable host and the job starts successfully. If Kubernetes fails to find a host, the job is cancelled after a 15-minute timeout. If the driver pod fails for other reasons, the job is cancelled with an error message for troubleshooting. Hence, we recommend running the Spark driver on On-Demand Instances and executors on Spot Instances to cost-optimize the workloads. You can use PodTemplates to configure this scheduling constraint. NodeSelector can be used as the node selection constraint to run executors on Spot Instances, as in the example below. This is simple to use and works well with Karpenter too. The pod template for this would look like

    apiVersion: v1\nkind: Pod\nspec:\n  nodeSelector:\n    eks.amazonaws.com/capacityType: SPOT\n  containers:\n  - name: spark-kubernetes-executor\n

Node affinity can also be used here; it allows more flexibility in the constraints defined. We recommend using 'hard affinity', as highlighted in the code below, for this purpose. For jobs that have strict SLAs and are not suitable to run on Spot, we suggest using the NoSchedule taint effect to ensure no pods are scheduled. The key thing to note is that the bulk of the compute required in a Spark job runs on the executors, and if they can run on EC2 Spot Instances you can benefit from the steep discount available with Spot Instances.

    apiVersion: v1\nkind: Pod\nmetadata:\n  labels:\n    spark-role: driver\n  namespace: emr-eks-workshop-namespace\nspec:\n  affinity: \n      nodeAffinity: \n          requiredDuringSchedulingIgnoredDuringExecution: \n            nodeSelectorTerms: \n            - matchExpressions: \n              - key: 'eks.amazonaws.com/capacityType' \n                operator: In \n                values: \n                - ON_DEMAND\n
    apiVersion: v1\nkind: Pod\nmetadata:\n  labels:\n    spark-role: executor\n  namespace: emr-eks-workshop-namespace\nspec:\n  affinity: \n      nodeAffinity: \n          requiredDuringSchedulingIgnoredDuringExecution: \n            nodeSelectorTerms: \n            - matchExpressions: \n              - key: 'eks.amazonaws.com/capacityType' \n                operator: In \n                values: \n                - SPOT\n

When Spot Instances are interrupted, the executors running on them may lose their shuffle data and cached RDDs (if any), which would require recomputation. This requires more compute cycles to be spent, which impacts the overall SLA of the EMR on EKS jobs. EMR on EKS has incorporated two Spark features that can help address these issues, discussed in the following sections.

    Node Decommissioning:

    Node decommissioning is a Spark feature that enables the removal of an executor gracefully, by preserving its state before removing it and not scheduling any new jobs on it. This feature is particularly useful when the Spark executors are running on Spot instances, and the Spark executor node is interrupted via a \u2018rebalance recommendation\u2019 or \u2018instance termination\u2019 notice to reclaim the instance.

    Node decommission begins when a Spark executor node receives a Spot Interruption Notice or Spot Rebalance Recommendation signal. The executor node immediately starts the process of decommissioning by sending a message to the Spark driver. The driver will identify the RDD/Shuffle files that it needs to migrate off the executor node in question, and will try to identify another Executor node which can take over the execution. If an executor is identified, the RDD/Shuffle files are copied to the new executor and the job execution continues on the new executor. If all the executors are busy, the RDD/Shuffle files are copied to an external storage.

The key advantage of this process is that it enables the block and shuffle data of a Spark executor that receives an EC2 Spot interruption signal to be migrated, reducing the recomputation of Spark tasks. This reduction in recomputation for the interrupted Spark tasks improves the resiliency of the system and reduces overall execution time. We recommend enabling the node decommissioning feature because it helps reduce the overall compute cycles when there is a Spot interruption.

    This feature is available on Amazon EMR version 6.3 and above. To set up this feature, add this configuration to the Spark job under the executor section:

    \"spark.decommission.enabled\": \"true\"\n\"spark.storage.decommission.rddBlocks.enabled\": \"true\"\n\"spark.storage.decommission.shuffleBlocks.enabled\" : \"true\"\n\"spark.storage.decommission.enabled\": \"true\"\n\"spark.storage.decommission.fallbackStorage.path\": \"s3://<<bucket>>\"\n

    The Spark executor logs sample shown below shows the process of decommission and sending message to the driver:

    21/05/05 17:41:41 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 7 decommissioned message\n21/05/05 17:41:41 DEBUG TaskSetManager: Valid locality levels for TaskSet 2.0: NO_PREF, ANY\n21/05/05 17:41:41 INFO KubernetesClusterSchedulerBackend: Decommission executors: 7\n21/05/05 17:41:41 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_2.0, runningTasks: 10\n21/05/05 17:41:41 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(7, 192.168.82.107, 39007, None)) as being decommissioning.\n
    21/05/05 20:22:17 INFO CoarseGrainedExecutorBackend: Decommission executor 1.\n21/05/05 20:22:17 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning\n21/05/05 20:22:17 INFO BlockManager: Starting block manager decommissioning process...\n21/05/05 20:22:17 DEBUG FileSystem: Looking for FS supporting s3a\n

    The Spark driver logs sample below shows the process of looking for an executor to migrate the shuffle data:

    22/06/07 20:41:38 INFO ShuffleStatus: Updating map output for 46 to BlockManagerId(4, 192.168.13.235, 34737, None)\n22/06/07 20:41:38 DEBUG BlockManagerMasterEndpoint: Received shuffle data block update for 0 46, ignore.\n22/06/07 20:41:38 DEBUG BlockManagerMasterEndpoint: Received shuffle index block update for 0 46, updating.\n

    The Spark executor logs sample below shows the process of reusing the shuffle files:

    22/06/07 20:42:50 INFO BasicExecutorFeatureStep: Adding decommission script to lifecycle\n22/06/07 20:42:50 DEBUG ExecutorPodsAllocator: Requested executor with id 19 from Kubernetes.\n22/06/07 20:42:50 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-bfd0a5813fd1b80f-exec-19, action ADDED\n22/06/07 20:42:50 DEBUG BlockManagerMasterEndpoint: Received shuffle index block update for 0 52, updating.\n22/06/07 20:42:50 INFO ShuffleStatus: Recover 52 BlockManagerId(fallback, remote, 7337, None)\n

    More details on this can be found here

    PVC Reuse:

A PersistentVolume is a Kubernetes feature that provides persistent storage to container Pods running stateful workloads, and a PersistentVolumeClaim (PVC) is a user's request for that storage in a Pod. Apache Spark 3.1.0 introduced the ability to dynamically create, mount, and remove Persistent Volume Claims (SPARK-25299) for Kubernetes workloads, which are essentially volumes mounted into your Spark pods. This means Apache Spark does not have to pre-create the claims/volumes for the executors or delete them during executor decommissioning.

    If a Spark executor is killed due to EC2 Spot interruption or any other failure then the PVC is not deleted but persisted and reattached to another executor. If there are shuffle files in that volume then they are reused. Previously if an external shuffle service process or node became unavailable, the executors were killed and all the shuffle blocks were lost, which needed to be recomputed.

    This feature is available on Amazon EMR version 6.8 and above. To set up this feature, you can add these lines to the executor configuration:

    \"spark.kubernetes.driver.ownPersistentVolumeClaim\": \"true\"\n\"spark.kubernetes.driver.reusePersistentVolumeClaim\": \"true\n

One key benefit is that if any executor running on EC2 Spot becomes unavailable, the replacement executor can reuse the shuffle files from the PVC, avoiding recomputation of the shuffle blocks. Dynamic PVCs enable 'true' decoupling of data and processing when running Spark jobs on Kubernetes, as they can also be used as local storage to spill in-process files. We recommend enabling the PVC reuse feature because the time taken to resume a task after a Spot interruption is optimized: the files are used in-situ and there is no time required to move them around.

If one or more of the nodes running executors is interrupted, the underlying pods get deleted and the driver receives the update. Note that the driver is the owner of the executors' PVCs and they are not deleted.

    22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action DELETED\n22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action MODIFIED\n22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action DELETED\n22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action MODIFIED\n

The ExecutorPodsAllocator tries to allocate new executor pods to replace the ones killed due to the interruption. During the allocation it tries to figure out how many of the existing PVCs have files on them and can be reused.

    "},{"location":"cost-optimization/docs/cost-optimization/#scaling-emr-on-eks-and-ec2-spot","title":"Scaling EMR on EKS and EC2 Spot","text":"

One of the key advantages of using Spot Instances is that they help increase the throughput of big data workloads at a fraction of the cost of On-Demand Instances. There are Spark workloads where there is a need to scale the number of executors and the infrastructure dynamically. Scaling in a Spark job is done by spawning pod replicas, and when they cannot be scheduled in the existing cluster, the cluster needs to be scaled up by adding more nodes. When you scale up using Spot Instances you get the cost benefit of the lowest price for EC2 compute and thus increase the throughput of the job at a lower cost, as you can provision more compute capacity (for the same cost as On-Demand Instances) to reduce the time taken to process large datasets.

    Dynamic Resource Allocation (DRA) enables the Spark driver to spawn the initial number of executors (pod replicas) and then scale up the number until the specified maximum number of executors is met to process the pending tasks. When the executors have no tasks running on them, they are terminated. This enables the nodes deployed in the Amazon EKS cluster to be better utilized while running multiple Spark jobs. DRA has mechanisms to dynamically adjust the resources your application occupies based on the workload. Idle executors are terminated when there are no pending tasks. This feature is available on Amazon EMR version 6.x. More details can be found here.
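A minimal sketch of the Spark properties involved in DRA on Kubernetes is shown below; the executor counts and timeout are placeholders, and shuffle tracking is used because there is no external shuffle service on EKS.

    "spark.dynamicAllocation.enabled": "true"
    "spark.dynamicAllocation.shuffleTracking.enabled": "true"
    "spark.dynamicAllocation.minExecutors": "1"
    "spark.dynamicAllocation.maxExecutors": "20"
    "spark.dynamicAllocation.executorIdleTimeout": "60s"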

    Scaling of the infrastructure by adding more nodes can be achieved by using Cluster Autoscaler or Karpenter.

    Cluster Autoscaler:

Cluster Autoscaler (CAS) is an open-source Kubernetes tool that automatically scales out the cluster when there are pending pods due to insufficient capacity, or scales in when there are underutilized nodes for an extended period of time. The configuration below shows multiple nodegroups with different vCPU and RAM configurations, which adheres to the Spot best practice of diversification. Note that each nodegroup has the same vCPU-to-memory ratio, as discussed above. CAS works with EKS Managed and Self-Managed Nodegroups.

    Karpenter

Karpenter is an open-source, flexible, high-performance autoscaler built for Kubernetes. It automatically launches just the right compute resources to handle your cluster's applications: it observes the aggregate resource requests of unschedulable pods and computes and launches best-fit new capacity.

The Provisioner CRD's configuration flexibility is very useful in adopting the Spot best practice of diversification. It can include as many Spot Instance types as possible, as we do not restrict specific instance types in the configuration. This approach is also future-proof as AWS launches new instance types. Karpenter also handles Spot Instance lifecycle management through Spot interruptions. We recommend using Karpenter with Spot Instances as it has faster node scheduling, with early pod binding and bin-packing to optimize resource utilization. An example of a Karpenter provisioner with Spot Instances is shown below.

    apiVersion: karpenter.sh/v1alpha5\nkind: Provisioner\nmetadata:\n  name: default\nspec:\n  labels:\n    intent: apps\n  requirements:\n    - key: karpenter.sh/capacity-type\n      operator: In\n      values: [\"spot\"]\n    - key: karpenter.k8s.aws/instance-size\n      operator: NotIn\n      values: [nano, micro, small, medium, large]\n  limits:\n    resources:\n      cpu: 1000\n      memory: 1000Gi\n  ttlSecondsAfterEmpty: 30\n  ttlSecondsUntilExpired: 2592000\n  providerRef:\n    name: default\n
    "},{"location":"cost-optimization/docs/cost-optimization/#emr-on-eks-and-ec2-spot-instances-best-practices","title":"EMR on EKS and EC2 Spot Instances: Best Practices","text":"

    In summary, our recommendations are:

    • Use EC2 Spot instances for Spark executors and On-Demand instances for drivers.
  • Diversify the instance types (instance family and size) used in a cluster.
    • Use a single AZ to launch a cluster to save Inter-AZ data transfer cost and improve job performance.
    • Use Karpenter for capacity provisioning and scaling when running EMR on EKS jobs.
• If you use Cluster Autoscaler instead of Karpenter, use EKS Managed Nodegroups.
• If using EKS self-managed nodegroups, ensure the Capacity Optimized Allocation strategy and AWS Node Termination Handler are in place.
• Utilizing Node decommissioning and PVC Reuse techniques can help reduce the time taken to complete an EMR on EKS job when EC2 Spot interruptions occur. However, they do not guarantee 100% avoidance of data loss during shuffling interruptions.
    • Implementing a Remote Shuffle Service (RSS) solution can enhance job stability and availability if Node decommissioning and PVC Reuse features do not fully meet your requirements.
    • Spark's Dynamic Resource Allocation (DRA) feature is particularly useful for reducing job costs, as it releases idle resources if not needed. The cost of EMR on EKS is determined by resource consumption at various stages of a job and is not calculated by the EMR unit price * job run time.
    • DRA implementation on EKS is different from Spark on YARN. Check out the details here.
• Decouple Compute and Storage. For example, use S3 to store input/output data or use RSS to store shuffle data. This allows independent scaling of processing and storage, and there is also a low chance of losing data in case of a Spot interruption.
• Reduce Spark's Shuffle Size and Blast Radius. This allows you to select more Spot Instance types for diversification and also reduces the time taken to recompute/move the shuffle files in case of an interruption.
    • Automate Spot Interruption handling via existing tools and services.
    "},{"location":"cost-optimization/docs/cost-optimization/#conclusion","title":"Conclusion","text":"

    In this document, we covered best practices to cost effectively run EMR on EKS workloads using EC2 Spot Instances. We have outlined three key areas: Provisioning, Interruption Handling, and Scaling, along with the corresponding best practices for each. We aim for this document to offer prescriptive guidance on running EMR on EKS workloads with substantial cost savings through the utilization of Spot instances.

    "},{"location":"cost-optimization/docs/node-decommission/","title":"Node Decommission","text":"

This section shows how to use an Apache Spark feature that allows you to migrate the shuffle data and cached RDD blocks present on terminating executors to peer executors before a Spot node gets decommissioned. Consequently, your job does not need to recalculate the shuffle and RDD blocks of the terminating executor that would otherwise be lost, allowing the job to complete with minimal delay.

    This feature is supported for releases EMR 6.3.0+.

    "},{"location":"cost-optimization/docs/node-decommission/#how-does-it-work","title":"How does it work?","text":"

    When spark.decommission.enabled is true, Spark will try its best to shut down the executor gracefully. spark.storage.decommission.enabled will enable migrating data stored on the executor. Spark will try to migrate all the cached RDD blocks (controlled by spark.storage.decommission.rddBlocks.enabled) and shuffle blocks (controlled by spark.storage.decommission.shuffleBlocks.enabled) from the decommissioning executor to all remote executors when spark decommission is enabled. Relevant Spark configurations for using node decommissioning in the jobs are

    Configuration Description Default Value spark.decommission.enabled Whether to enable decommissioning false spark.storage.decommission.enabled Whether to decommission the block manager when decommissioning executor false spark.storage.decommission.rddBlocks.enabled Whether to transfer RDD blocks during block manager decommissioning. false spark.storage.decommission.shuffleBlocks.enabled Whether to transfer shuffle blocks during block manager decommissioning. Requires a migratable shuffle resolver (like sort based shuffle) false spark.storage.decommission.maxReplicationFailuresPerBlock Maximum number of failures which can be handled for migrating shuffle blocks when block manager is decommissioning and trying to move its existing blocks. 3 spark.storage.decommission.shuffleBlocks.maxThreads Maximum number of threads to use in migrating shuffle files. 8

    This feature can currently be enabled through a temporary workaround on EMR 6.3.0+ releases. To enable it, Spark\u2019s decom.sh file permission must be modified using a custom image. Once the code is fixed, the page will be updated.

    Dockerfile for custom image:

    FROM <release account id>.dkr.ecr.<aws region>.amazonaws.com/spark/<release>\nUSER root\nWORKDIR /home/hadoop\nRUN chown hadoop:hadoop /usr/bin/decom.sh\n

    Setting decommission timeout:

    Each executor has to be decommissioned within a certain time limit controlled by the pod\u2019s terminationGracePeriodSeconds configuration. The default value is 30 secs but can be modified using a custom pod template. The pod template for this modification would look like

    apiVersion: v1\nkind: Pod\nspec:\n  terminationGracePeriodSeconds: <seconds>\n

Note: the terminationGracePeriodSeconds timeout should be less than the Spot Instance timeout, with around a 5-second buffer kept aside for triggering the node termination.

    Request:

    cat >spark-python-with-node-decommissioning.json << EOF\n{\n   \"name\": \"my-job-run-with-node-decommissioning\",\n   \"virtualClusterId\": \"<virtual-cluster-id>\",\n   \"executionRoleArn\": \"<execution-role-arn>\",\n   \"releaseLabel\": \"emr-6.3.0-latest\", \n   \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n   }, \n   \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n       \"classification\": \"spark-defaults\",\n       \"properties\": {\n       \"spark.kubernetes.container.image\": \"<account_id>.dkr.ecr.<region>.amazonaws.com/<custom_image_repo>\",\n       \"spark.executor.instances\": \"5\",\n        \"spark.decommission.enabled\": \"true\",\n        \"spark.storage.decommission.rddBlocks.enabled\": \"true\",\n        \"spark.storage.decommission.shuffleBlocks.enabled\" : \"true\",\n        \"spark.storage.decommission.enabled\": \"true\"\n       }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"<log group>\", \n        \"logStreamNamePrefix\": \"<log-group-prefix>\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"<S3 URI>\"\n      }\n    }\n   } \n}\nEOF\n

    Observed Behavior:

When an executor begins decommissioning, its shuffle data gets migrated to peer executors instead of the shuffle blocks being recalculated. If sending shuffle blocks to an executor fails, spark.storage.decommission.maxReplicationFailuresPerBlock gives the number of retries for the migration. The driver's stderr log will show lines like Updating map output for <shuffle_id> to BlockManagerId(<executor_id>, <ip_address>, <port>, <topology_info>) denoting details about the shuffle block's migration. This feature does not emit any other metrics for validation yet."},{"location":"metastore-integrations/docs/aws-glue/","title":"EMR Containers integration with AWS Glue","text":""},{"location":"metastore-integrations/docs/aws-glue/#aws-glue-catalog-in-same-account-as-eks","title":"AWS Glue catalog in same account as EKS","text":"

    In the below example a Spark application will be configured to use AWS Glue data catalog as the hive metastore.

    gluequery.py

    cat > gluequery.py <<EOF\nfrom os.path import expanduser, join, abspath\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql import Row\n# warehouse_location points to the default location for managed databases and tables\nwarehouse_location = abspath('spark-warehouse')\nspark = SparkSession \\\n    .builder \\\n    .appName(\"Python Spark SQL Hive integration example\") \\\n    .config(\"spark.sql.warehouse.dir\", warehouse_location) \\\n    .enableHiveSupport() \\\n    .getOrCreate()\nspark.sql(\"CREATE EXTERNAL TABLE `sparkemrnyc`( `dispatching_base_num` string, `pickup_datetime` string, `dropoff_datetime` string, `pulocationid` bigint, `dolocationid` bigint, `sr_flag` bigint) STORED AS PARQUET LOCATION 's3://<s3 prefix>/trip-data.parquet/'\")\nspark.sql(\"SELECT count(*) FROM sparkemrnyc\").show()\nspark.stop()\nEOF\n
    LOCATION 's3://<s3 prefix>/trip-data.parquet/'\n

    Configure the above property to point to the S3 location containing the data.

    Request

    cat > Spark-Python-in-s3-awsglue-log.json << EOF\n{\n  \"name\": \"spark-python-in-s3-awsglue-log\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/gluequery.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=3 --conf spark.executor.memory=8G --conf spark.driver.memory=6G --conf spark.executor.cores=3\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.hadoop.hive.metastore.client.factory.class\":\"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory\",\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3-awsglue-log.json\n

    Output from driver logs - Displays the number of rows.

    +----------+\n|  count(1)|\n+----------+\n|2716504499|\n+----------+\n
    "},{"location":"metastore-integrations/docs/aws-glue/#aws-glue-catalog-in-different-account","title":"AWS Glue catalog in different account","text":"

    The Spark application is submitted to an EMR virtual cluster in Account A and is configured to connect to the AWS Glue catalog in Account B. The following IAM policy is attached to the job execution role (\"executionRoleArn\": \"<execution-role-arn>\") in Account A.

    {\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Effect\": \"Allow\",\n            \"Action\": [\n                \"glue:*\"\n            ],\n            \"Resource\": [\n                \"arn:aws:glue:<region>:<account>:catalog\",\n                \"arn:aws:glue:<region>:<account>:database/default\",\n                \"arn:aws:glue:<region>:<account>:table/default/sparkemrnyc\"\n            ]\n        }\n    ]\n}\n

    IAM policy attached to the AWS Glue catalog in Account B

    {\n  \"Version\" : \"2012-10-17\",\n  \"Statement\" : [ {\n    \"Effect\" : \"Allow\",\n    \"Principal\" : {\n      \"AWS\" : \"<execution-role-arn>\"\n    },\n    \"Action\" : \"glue:*\",\n    \"Resource\" : [ \"arn:aws:glue:<region>:<account>:catalog\", \"arn:aws:glue:<region>:<account>:database/default\", \"arn:aws:glue:<region>:<account>:table/default/sparkemrnyc\" ]\n  } ]\n}\n

    Request

    cat > Spark-Python-in-s3-awsglue-crossaccount.json << EOF\n{\n  \"name\": \"spark-python-in-s3-awsglue-crossaccount\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/gluequery.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6 \"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.hadoop.hive.metastore.client.factory.class\":\"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory\",\n          \"spark.hadoop.hive.metastore.glue.catalogid\":\"<account B>\"\n          }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3-awsglue-crossaccount.json\n

    Configuration of interest: to specify the account ID where the AWS Glue catalog is defined, reference the following:

    Spark-Glue integration

    \"spark.hadoop.hive.metastore.glue.catalogid\":\"<account B>\",\n

    Output from driver logs - displays the number of rows.

    +----------+\n|  count(1)|\n+----------+\n|2716504499|\n+----------+\n
    "},{"location":"metastore-integrations/docs/aws-glue/#sync-hudi-table-with-aws-glue-catalog","title":"Sync Hudi table with AWS Glue catalog","text":"

    In this example, a Spark application is configured to use the AWS Glue Data Catalog as its Hive metastore.

    Starting from Hudi 0.9.0, we can synchronize a Hudi table's latest schema to the Glue catalog via the Hive Metastore Service (HMS) in hive sync mode. This example runs a Hudi ETL job with EMR on EKS and interacts with the AWS Glue metastore to create a Hudi table. It gives you native, serverless capabilities to manage your technical metadata. You can also query Hudi tables in Athena straight away after the ETL job, which gives your end users easy data access and shortens the time to insight.

    HudiEMRonEKS.py

    cat > HudiEMRonEKS.py <<EOF\nimport sys\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession \\\n    .builder \\\n    .config(\"spark.sql.warehouse.dir\", sys.argv[1]+\"/warehouse/\" ) \\\n    .enableHiveSupport() \\\n    .getOrCreate()\n\n# Create a DataFrame\ninputDF = spark.createDataFrame(\n    [\n        (\"100\", \"2015-01-01\", \"2015-01-01T13:51:39.340396Z\"),\n        (\"101\", \"2015-01-01\", \"2015-01-01T12:14:58.597216Z\"),\n        (\"102\", \"2015-01-01\", \"2015-01-01T13:51:40.417052Z\"),\n        (\"103\", \"2015-01-01\", \"2015-01-01T13:51:40.519832Z\"),\n        (\"104\", \"2015-01-02\", \"2015-01-01T12:15:00.512679Z\"),\n        (\"105\", \"2015-01-02\", \"2015-01-01T13:51:42.248818Z\"),\n    ],\n    [\"id\", \"creation_date\", \"last_update_time\"]\n)\n\n# Specify common DataSourceWriteOptions in the single hudiOptions variable\ntest_tableName = \"hudi_tbl\"\nhudiOptions = {\n'hoodie.table.name': test_tableName,\n'hoodie.datasource.write.recordkey.field': 'id',\n'hoodie.datasource.write.partitionpath.field': 'creation_date',\n'hoodie.datasource.write.precombine.field': 'last_update_time',\n'hoodie.datasource.hive_sync.enable': 'true',\n'hoodie.datasource.hive_sync.table': test_tableName,\n'hoodie.datasource.hive_sync.database': 'default',\n'hoodie.datasource.write.hive_style_partitioning': 'true',\n'hoodie.datasource.hive_sync.partition_fields': 'creation_date',\n'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',\n'hoodie.datasource.hive_sync.mode': 'hms'\n}\n\n\n# Write a DataFrame as a Hudi dataset\ninputDF.write \\\n.format('org.apache.hudi') \\\n.option('hoodie.datasource.write.operation', 'bulk_insert') \\\n.options(**hudiOptions) \\\n.mode('overwrite') \\\n.save(sys.argv[1]+\"/hudi_hive_insert\")\nEOF\n

    NOTE: configure the warehouse dir property to point to an S3 location to use as your Hive warehouse storage. The S3 location can be dynamic, based on an argument passed in or an environment variable.

    .config(\"spark.sql.warehouse.dir\", sys.argv[1]+\"/warehouse/\" )\n

    Request

    export S3BUCKET=YOUR_S3_BUCKET_NAME\n\naws emr-containers start-job-run \\\n--virtual-cluster-id $VIRTUAL_CLUSTER_ID \\\n--name hudi-test1 \\\n--execution-role-arn $EMR_ROLE_ARN \\\n--release-label emr-6.3.0-latest \\\n--job-driver '{\n  \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://'$S3BUCKET'/app_code/job/HudiEMRonEKS.py\",\n      \"entryPointArguments\":[\"s3://'$S3BUCKET'\"],\n      \"sparkSubmitParameters\": \"--jars https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3-bundle_2.12/0.9.0/hudi-spark3-bundle_2.12-0.9.0.jar --conf spark.executor.cores=1 --conf spark.executor.instances=2\"}}' \\\n--configuration-overrides '{\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.serializer\": \"org.apache.spark.serializer.KryoSerializer\",\n          \"spark.sql.hive.convertMetastoreParquet\": \"false\",\n          \"spark.hadoop.hive.metastore.client.factory.class\": \"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory\"\n        }}\n    ], \n    \"monitoringConfiguration\": {\n      \"s3MonitoringConfiguration\": {\"logUri\": \"s3://'$S3BUCKET'/elasticmapreduce/emr-containers\"}}}'\n

    NOTE: To get the correct version of the Hudi library, we download the jar directly from the Maven repository with the syntax \"sparkSubmitParameters\": \"--jars https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3-bundle_2.12/0.9.0/hudi-spark3-bundle_2.12-0.9.0.jar\". Starting from EMR 6.5, the hudi-spark3-bundle library is included in the EMR Docker images.

    "},{"location":"metastore-integrations/docs/hive-metastore/","title":"EMR Containers integration with Hive Metastore","text":"

    For more details, check out the github repository, which includes CDK/CFN templates that help you to get started quickly.

    "},{"location":"metastore-integrations/docs/hive-metastore/#1-hive-metastore-database-through-jdbc","title":"1-Hive metastore Database through JDBC","text":"

    In this example, a Spark application is configured to connect to a Hive metastore database provisioned with Amazon RDS Aurora MySQL via a JDBC connection. The Amazon RDS instance and the EKS cluster should be in the same VPC; otherwise the Spark job will not be able to connect to RDS.

    You pass the JDBC credentials directly at the job/application level, which is a simple and quick way to connect to the HMS. However, it is not recommended in a production environment. From a security perspective, password management becomes a risk because the JDBC credentials appear in all of your job logs, and engineers may end up holding the password when they do not need to.
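    Where credentials must stay out of job parameters and logs, one alternative (sketched below, not part of the original example) is to resolve them at runtime inside the job itself, for example from AWS Secrets Manager, before the SparkSession is created. This is a minimal sketch that assumes a hypothetical secret named hms/jdbc-credentials storing the username and password as JSON, that boto3 is available in the job image, and that the job execution role can call secretsmanager:GetSecretValue; the MariaDB connector jar still has to be supplied via --jars.

    import json
    import boto3
    from pyspark.sql import SparkSession

    # Resolve the JDBC credentials at runtime instead of passing them in sparkSubmitParameters
    # (secret name and JSON keys below are hypothetical)
    secret = boto3.client("secretsmanager").get_secret_value(SecretId="hms/jdbc-credentials")
    creds = json.loads(secret["SecretString"])

    spark = SparkSession \
        .builder \
        .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.mariadb.jdbc.Driver") \
        .config("spark.hadoop.javax.jdo.option.ConnectionURL", "<JDBC-Connection-string>") \
        .config("spark.hadoop.javax.jdo.option.ConnectionUserName", creds["username"]) \
        .config("spark.hadoop.javax.jdo.option.ConnectionPassword", creds["password"]) \
        .enableHiveSupport() \
        .getOrCreate()

    spark.sql("SHOW DATABASES").show()
    spark.stop()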

    Request:

    cat > Spark-Python-in-s3-hms-jdbc.json << EOF\n{\n  \"name\": \"spark-python-in-s3-hms-jdbc\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/hivejdbc.py\", \n       \"sparkSubmitParameters\": \"--jars s3://<s3 prefix>/mariadb-connector-java.jar --conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.mariadb.jdbc.Driver --conf spark.hadoop.javax.jdo.option.ConnectionUserName=<connection-user-name> --conf spark.hadoop.javax.jdo.option.ConnectionPassword=<connection-password> --conf spark.hadoop.javax.jdo.option.ConnectionURL=<JDBC-Connection-string> --conf spark.driver.cores=5 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n          }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3-hms-jdbc.json\n

    In this example we are connecting to a MySQL database, so mariadb-connector-java.jar needs to be passed with the --jars option. If you are using PostgreSQL, Oracle, or any other database, include the appropriate connector jar instead.

    Configuration of interest:

    --jars s3://<s3 prefix>/mariadb-connector-java.jar\n--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.mariadb.jdbc.Driver \n--conf spark.hadoop.javax.jdo.option.ConnectionUserName=<connection-user-name>  \n--conf spark.hadoop.javax.jdo.option.ConnectionPassword=<connection-password>\n--conf spark.hadoop.javax.jdo.option.ConnectionURL=<JDBC-Connection-string>\n

    hivejdbc.py

    from os.path import expanduser, join, abspath\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql import Row\n# warehouse_location points to the default location for managed databases and tables\nwarehouse_location = abspath('spark-warehouse')\nspark = SparkSession \\\n    .builder \\\n    .config(\"spark.sql.warehouse.dir\", warehouse_location) \\\n    .enableHiveSupport() \\\n    .getOrCreate()\nspark.sql(\"SHOW DATABASES\").show()\nspark.sql(\"CREATE EXTERNAL TABLE `ehmsdb`.`sparkemrnyc5`( `dispatching_base_num` string, `pickup_datetime` string, `dropoff_datetime` string, `pulocationid` bigint, `dolocationid` bigint, `sr_flag` bigint) STORED AS PARQUET LOCATION 's3://<s3 prefix>/nyctaxi_parquet/'\")\nspark.sql(\"SELECT count(*) FROM ehmsdb.sparkemrnyc5 \").show()\nspark.stop()\n

    The above job lists databases from a remote RDS Hive Metastore, creates a new table and then queries it.

    "},{"location":"metastore-integrations/docs/hive-metastore/#2-hive-metastore-thrift-service-through-thrift-protocol","title":"2-Hive metastore thrift service through thrift:// protocol","text":"

    In this example, the Spark application is configured to connect to an external Hive metastore thrift server. The thrift server runs on the master node of an EMR on EC2 cluster, and Amazon RDS Aurora is used as the database for the Hive metastore.

    Running an EMR on EC2 cluster as the thrift server simplifies the application configuration and setup, so you can start quickly with reduced engineering effort. However, your maintenance overhead may increase, since you will be monitoring two types of clusters, i.e. EMR on EC2 and EMR on EKS.

    thriftscript.py: the hive.metastore.uris config needs to be set to read from the external Hive metastore. The URI format looks like this: thrift://EMR_ON_EC2_MASTER_NODE_DNS_NAME:9083

    from os.path import expanduser, join, abspath\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql import Row\n# warehouse_location points to the default location for managed databases and tables\nwarehouse_location = abspath('spark-warehouse')\nspark = SparkSession \\\n    .builder \\\n    .config(\"spark.sql.warehouse.dir\", warehouse_location) \\\n    .config(\"hive.metastore.uris\",\"<hive metastore thrift uri>\") \\\n    .enableHiveSupport() \\\n    .getOrCreate()\nspark.sql(\"SHOW DATABASES\").show()\nspark.sql(\"CREATE EXTERNAL TABLE ehmsdb.`sparkemrnyc2`( `dispatching_base_num` string, `pickup_datetime` string, `dropoff_datetime` string, `pulocationid` bigint, `dolocationid` bigint, `sr_flag` bigint) STORED AS PARQUET LOCATION 's3://<s3 prefix>/nyctaxi_parquet/'\")\nspark.sql(\"SELECT * FROM ehmsdb.sparkemrnyc2\").show()\nspark.stop()\n

    Request:

    The job below lists databases from the remote Hive metastore, creates a new table, and then queries it.

    cat > Spark-Python-in-s3-hms-thrift.json << EOF\n{\n  \"name\": \"spark-python-in-s3-hms-thrift\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/thriftscript.py\", \n       \"sparkSubmitParameters\": \"--jars s3://<s3 prefix>/mariadb-connector-java.jar --conf spark.driver.cores=5 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n          }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3-hms-thrift.json\n
    "},{"location":"metastore-integrations/docs/hive-metastore/#3-connect-hive-metastore-via-thrift-service-hosted-on-eks","title":"3-Connect Hive metastore via thrift service hosted on EKS","text":"

    In this example, our Spark application connects to a standalone Hive metastore service (HMS) running in EKS.

    Running the standalone HMS in EKS unifies your analytics applications with other business-critical apps on a single platform and simplifies your solution architecture and infrastructure design. The Helm chart solution includes an autoscaling feature, so the service can automatically expand or shrink when the HMS request volume changes. It also follows security best practice by managing JDBC credentials via AWS Secrets Manager. However, you will need a combination of analytics and Kubernetes skills to maintain this solution.

    To install the HMS Helm chart, replace the environment variables in values.yaml, then run helm install via the command below. Alternatively, deploy the HMS via a CDK/CFN template that follows security best practices. Check out the CDK project for more details.

    cd hive-emr-on-eks/hive-metastore-chart\n\nsed -i '' -e 's/{RDS_JDBC_URL}/\"jdbc:mysql:\\/\\/'$YOUR_HOST_NAME':3306\\/'$YOUR_DB_NAME'?createDatabaseIfNotExist=true\"/g' values.yaml \nsed -i '' -e 's/{RDS_USERNAME}/'$YOUR_USER_NAME'/g' values.yaml \nsed -i '' -e 's/{RDS_PASSWORD}/'$YOUR_PASSWORD'/g' values.yaml\nsed -i '' -e 's/{S3BUCKET}/s3:\\/\\/'$YOUR_S3BUCKET'/g' values.yaml\n\nhelm repo add hive-metastore https://aws-samples.github.io/hive-metastore-chart \nhelm install hive hive-metastore/hive-metastore -f values.yaml --namespace=emr --debug\n

    hivethrift_eks.py

    from os import environ\nimport sys\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession \\\n    .builder \\\n    .config(\"spark.sql.warehouse.dir\",environ['warehouse_location']) \\\n    .config(\"hive.metastore.uris\",\"thrift://\"+environ['HIVE_METASTORE_SERVICE_HOST']+\":9083\") \\\n    .enableHiveSupport() \\\n    .getOrCreate()\n\nspark.sql(\"SHOW DATABASES\").show()\nspark.sql(\"CREATE DATABASE IF NOT EXISTS `demo`\")\nspark.sql(\"DROP TABLE IF EXISTS demo.amazonreview3\")\nspark.sql(\"CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`amazonreview3`( `marketplace` string,`customer_id`string,`review_id` string,`product_id` string,`product_parent` string,`product_title` string,`star_rating` integer,`helpful_votes` integer,`total_votes` integer,`vine` string,`verified_purchase` string,`review_headline` string,`review_body` string,`review_date` date,`year` integer) STORED AS PARQUET LOCATION '\"+sys.argv[1]+\"/app_code/data/toy/'\")\nspark.sql(\"SELECT count(*) FROM demo.amazonreview3\").show()\nspark.stop()\n

    The environment variable HIVE_METASTORE_SERVICE_HOST appears in your Spark application pods automatically once the standalone HMS is up and running in EKS. You can directly set hive.metastore.uris to \"thrift://\"+environ['HIVE_METASTORE_SERVICE_HOST']+\":9083\".

    You can set the spark.sql.warehouse.dir property to an S3 location to use as your Hive warehouse storage. The S3 location can be dynamic, based on an argument passed in or an environment variable, as in the sketch below.
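    A minimal sketch of making the warehouse location dynamic, assuming the S3 prefix is passed either as the first entryPointArgument or through a warehouse_location environment variable defined on the driver pod (both mechanisms appear in the examples above):

    import sys
    from os import environ
    from pyspark.sql import SparkSession

    # Prefer the job argument if present, otherwise fall back to the environment variable
    warehouse_location = sys.argv[1] + "/warehouse/" if len(sys.argv) > 1 else environ["warehouse_location"]

    spark = SparkSession \
        .builder \
        .config("spark.sql.warehouse.dir", warehouse_location) \
        .config("hive.metastore.uris", "thrift://" + environ["HIVE_METASTORE_SERVICE_HOST"] + ":9083") \
        .enableHiveSupport() \
        .getOrCreate()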

    Request:

    #!/bin/bash\naws emr-containers start-job-run \\\n--virtual-cluster-id $VIRTUAL_CLUSTER_ID \\\n--name spark-hive-via-thrift \\\n--execution-role-arn $EMR_ROLE_ARN \\\n--release-label emr-6.2.0-latest \\\n--job-driver '{\n  \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://'$S3BUCKET'/app_code/job/hivethrift_eks.py\",\n      \"entryPointArguments\":[\"s3://'$S3BUCKET'\"],\n      \"sparkSubmitParameters\": \"--conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2\"}}' \\\n--configuration-overrides '{\n    \"monitoringConfiguration\": {\n      \"s3MonitoringConfiguration\": {\"logUri\": \"s3://'$S3BUCKET'/elasticmapreduce/emr-containers\"}}}'\n
    "},{"location":"metastore-integrations/docs/hive-metastore/#4-run-thrift-service-as-a-sidecar-in-spark-drivers-pod","title":"4-Run thrift service as a sidecar in Spark Driver's pod","text":"

    This advanced solution runs the standalone HMS thrift service inside the Spark driver pod as a sidecar, which means each Spark job has its own dedicated thrift server. The benefit of the design is that the HMS is no longer a single point of failure, since each Spark application has its own HMS. It is also no longer a long-running service: it spins up when your Spark job starts, then terminates when your job is done. The sidecar follows security best practice by leveraging Secrets Manager to extract the JDBC credentials. However, the maintenance overhead increases because you now need to manage the HMS sidecar, custom ConfigMaps, and sidecar pod templates. This solution also requires a combination of analytics and Kubernetes skills.

    The CDK/CFN template is available to simplify the installation against a new EKS cluster. If you have an existing EKS cluster, the prerequisite details can be found in the github repository.

    sidecar_hivethrift_eks.py:

    from os import environ\nimport sys\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession \\\n    .builder \\\n    .config(\"spark.sql.warehouse.dir\",environ['warehouse_location']) \\\n    .enableHiveSupport() \\\n    .getOrCreate()\n\nspark.sql(\"SHOW DATABASES\").show()\nspark.sql(\"CREATE DATABASE IF NOT EXISTS `demo`\")\nspark.sql(\"DROP TABLE IF EXISTS demo.amazonreview4\")\nspark.sql(\"CREATE EXTERNAL TABLE `demo`.`amazonreview4`( `marketplace` string,`customer_id`string,`review_id` string,`product_id` string,`product_parent` string,`product_title` string,`star_rating` integer,`helpful_votes` integer,`total_votes` integer,`vine` string,`verified_purchase` string,`review_headline` string,`review_body` string,`review_date` date,`year` integer) STORED AS PARQUET LOCATION '\"+sys.argv[1]+\"/app_code/data/toy/'\")\nspark.sql(\"SELECT count(*) FROM demo.amazonreview4\").show()\nspark.stop()\n

    Request:

    Now that the HMS is running inside your Spark driver pod, it shares common attributes such as the network configuration, so spark.hive.metastore.uris can be set to \"thrift://localhost:9083\". Don't forget to assign the sidecar pod template to the Spark driver, like this: \"spark.kubernetes.driver.podTemplateFile\": \"s3://'$S3BUCKET'/app_code/job/sidecar_hms_pod_template.yaml\".

    For more details, check out the github repo

    #!/bin/bash\n# test HMS sidecar on EKS\naws emr-containers start-job-run \\\n--virtual-cluster-id $VIRTUAL_CLUSTER_ID \\\n--name sidecar-hms \\\n--execution-role-arn $EMR_ROLE_ARN \\\n--release-label emr-6.3.0-latest \\\n--job-driver '{\n  \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://'$S3BUCKET'/app_code/job/sidecar_hivethrift_eks.py\",\n      \"entryPointArguments\":[\"s3://'$S3BUCKET'\"],\n      \"sparkSubmitParameters\": \"--conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2\"}}' \\\n--configuration-overrides '{\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.kubernetes.driver.podTemplateFile\": \"s3://'$S3BUCKET'/app_code/job/sidecar_hms_pod_template.yaml\",\n          \"spark.hive.metastore.uris\": \"thrift://localhost:9083\"\n        }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"s3MonitoringConfiguration\": {\"logUri\": \"s3://'$S3BUCKET'/elasticmapreduce/emr-containers\"}}}'\n
    "},{"location":"metastore-integrations/docs/hive-metastore/#5-hudi-remote-hive-metastore-integration","title":"5-Hudi + Remote Hive metastore integration","text":"

    Starting from Hudi 0.9.0, we can synchronize a Hudi table's latest schema to the Hive metastore in HMS sync mode, with the setting 'hoodie.datasource.hive_sync.mode': 'hms'.

    This example runs a Hudi job with EMR on EKS and interacts with a remote RDS Hive metastore to create a Hudi table. As a serverless option, it can interact with the AWS Glue catalog instead; check out the AWS Glue section for more details.

    HudiEMRonEKS.py

    from os import environ\nimport sys\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession \\\n    .builder \\\n    .config(\"spark.sql.warehouse.dir\", sys.argv[1]+\"/warehouse/\" ) \\\n    .enableHiveSupport() \\\n    .getOrCreate()\n\n# Create a DataFrame\ninputDF = spark.createDataFrame(\n    [\n        (\"100\", \"2015-01-01\", \"2015-01-01T13:51:39.340396Z\"),\n        (\"101\", \"2015-01-01\", \"2015-01-01T12:14:58.597216Z\"),\n        (\"102\", \"2015-01-01\", \"2015-01-01T13:51:40.417052Z\"),\n        (\"103\", \"2015-01-01\", \"2015-01-01T13:51:40.519832Z\"),\n        (\"104\", \"2015-01-02\", \"2015-01-01T12:15:00.512679Z\"),\n        (\"105\", \"2015-01-02\", \"2015-01-01T13:51:42.248818Z\"),\n    ],\n    [\"id\", \"creation_date\", \"last_update_time\"]\n)\n\n# Specify common DataSourceWriteOptions in the single hudiOptions variable\ntest_tableName = \"hudi_tbl\"\nhudiOptions = {\n'hoodie.table.name': test_tableName,\n'hoodie.datasource.write.recordkey.field': 'id',\n'hoodie.datasource.write.partitionpath.field': 'creation_date',\n'hoodie.datasource.write.precombine.field': 'last_update_time',\n'hoodie.datasource.hive_sync.enable': 'true',\n'hoodie.datasource.hive_sync.table': test_tableName,\n'hoodie.datasource.hive_sync.database': 'default',\n'hoodie.datasource.write.hive_style_partitioning': 'true',\n'hoodie.datasource.hive_sync.partition_fields': 'creation_date',\n'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',\n'hoodie.datasource.hive_sync.mode': 'hms'\n}\n\n\n# Write a DataFrame as a Hudi dataset\ninputDF.write \\\n.format('org.apache.hudi') \\\n.option('hoodie.datasource.write.operation', 'bulk_insert') \\\n.options(**hudiOptions) \\\n.mode('overwrite') \\\n.save(sys.argv[1]+\"/hudi_hive_insert\")\n\nprint(\"After {}\".format(spark.catalog.listTables()))\n

    Request:

    The latest hudi-spark3-bundle library is needed to support the new HMS hive sync functionality. In the following sample script, it is downloaded from the Maven repository when submitting a job with EMR 6.3. Starting from EMR 6.5, you don't need the --jars setting anymore, because EMR 6.5+ includes the hudi-spark3-bundle library.

    aws emr-containers start-job-run \\\n--virtual-cluster-id $VIRTUAL_CLUSTER_ID \\\n--name hudi-test1 \\\n--execution-role-arn $EMR_ROLE_ARN \\\n--release-label emr-6.3.0-latest \\\n--job-driver '{\n  \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://'$S3BUCKET'/app_code/job/HudiEMRonEKS.py\",\n      \"entryPointArguments\":[\"s3://'$S3BUCKET'\"],\n      \"sparkSubmitParameters\": \"--jars https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3-bundle_2.12/0.9.0/hudi-spark3-bundle_2.12-0.9.0.jar --conf spark.executor.cores=1 --conf spark.executor.instances=2\"}}' \\\n--configuration-overrides '{\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.serializer\": \"org.apache.spark.serializer.KryoSerializer\",\n          \"spark.sql.hive.convertMetastoreParquet\": \"false\",\n          \"spark.hive.metastore.uris\": \"thrift://localhost:9083\",\n          \"spark.kubernetes.driver.podTemplateFile\": \"s3://'$S3BUCKET'/app_code/job/sidecar_hms_pod_template.yaml\"\n        }}\n    ], \n    \"monitoringConfiguration\": {\n      \"s3MonitoringConfiguration\": {\"logUri\": \"s3://'$S3BUCKET'/elasticmapreduce/emr-containers\"}}}'\n
    "},{"location":"node-placement/docs/eks-node-placement/","title":"EKS Node Placement","text":""},{"location":"node-placement/docs/eks-node-placement/#single-az-placement","title":"Single AZ placement","text":"

    AWS EKS clusters can span multiple AZs in a VPC. A Spark application whose driver and executor pods are distributed across multiple AZs can incur inter-AZ data transfer costs. To minimize or eliminate inter-AZ data transfer costs, you can configure the application to only run on the nodes within a single AZ. In this example, we use the Kubernetes node selector to specify which AZ the job should run on.

    Request:

    cat >spark-python-in-s3-nodeselector.json << EOF\n{\n  \"name\": \"spark-python-in-s3-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.kubernetes.node.selector.topology.kubernetes.io/zone='<availability zone>' --conf spark.driver.cores=5  --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-nodeselector.json\n

    Observed Behavior: When the job starts, the driver pod and executor pods are scheduled only on those EKS worker nodes with the label topology.kubernetes.io/zone: <availability zone>. This ensures the Spark job runs within a single AZ. If there are not enough resources within the specified AZ, the pods stay in the pending state until the autoscaler (if configured) kicks in or more resources become available.

    See also: the Spark on Kubernetes node selector configuration, and the Kubernetes node selector reference.

    Configuration of interest -

    --conf spark.kubernetes.node.selector.topology.kubernetes.io/zone='<availability zone>'\n

    topology.kubernetes.io/zone is a built-in label that EKS assigns to every EKS worker node. The above config ensures that the driver and executor pods are scheduled on EKS worker nodes labeled topology.kubernetes.io/zone: <availability zone>. However, user-defined labels can also be assigned to EKS worker nodes and used as node selectors.

    Other common use cases include using node labels to force the job to run on On-Demand or Spot capacity, on a specific instance type, and so on, as sketched below.
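    For instance, the eks.amazonaws.com/capacityType label used later in this guide for the job submitter pod can also be applied to the driver and executors. A hedged sketch of the corresponding spark-defaults properties, expressed as the Python dictionary you would place under "properties" in the StartJobRun request (the capacityType label assumes EKS managed node groups, which set it automatically):

    # Illustrative fragment of the spark-defaults "properties" map for StartJobRun
    spark_defaults_properties = {
        # Pin driver and executor pods to On-Demand capacity
        "spark.kubernetes.node.selector.eks.amazonaws.com/capacityType": "ON_DEMAND",
        # Pin driver and executor pods to a specific instance type
        "spark.kubernetes.node.selector.node.kubernetes.io/instance-type": "m5.4xlarge",
    }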

    "},{"location":"node-placement/docs/eks-node-placement/#single-az-and-ec2-instance-type-placement","title":"Single AZ and ec2 instance type placement","text":"

    Multiple key-value pairs for spark.kubernetes.node.selector.[labelKey] can be passed to add filter conditions for selecting the EKS worker nodes. For example, to schedule on EKS worker nodes in <availability zone> with the instance type m5.4xlarge, configure the job as below.

    Request:

    {\n  \"name\": \"spark-python-in-s3-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.kubernetes.pyspark.pythonVersion=3 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6 --conf spark.sql.shuffle.partitions=1000\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\",\n          \"spark.kubernetes.node.selector.topology.kubernetes.io/zone\":\"<availability zone>\",\n          \"spark.kubernetes.node.selector.node.kubernetes.io/instance-type\":\"m5.4xlarge\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n      }\n      }\n    }\n  }\n}\n

    Configuration of interest

    spark.kubernetes.node.selector.[labelKey] - Adds to the node selector of the driver pod and executor pods, with key labelKey and the value as the configuration's value. For example, setting spark.kubernetes.node.selector.identifier to myIdentifier will result in the driver pod and executors having a node selector with key identifier and value myIdentifier. Multiple node selector keys can be added by setting multiple configurations with this prefix.

    "},{"location":"node-placement/docs/eks-node-placement/#job-submitter-pod-placement","title":"Job submitter pod placement","text":"

    Similar to the driver and executor pods, you can configure the job submitter pod's node selectors as well, using the emr-job-submitter classification. It is recommended to place job submitter pods on ON_DEMAND nodes rather than SPOT nodes, because the job will fail if the job submitter pod is hit by a Spot Instance interruption. You can also place the job submitter pod in a single AZ or use any Kubernetes labels that are applied to the nodes.

    Note: The job submitter pod is also referred to as the job-runner pod.

    StartJobRun request with ON_DEMAND node placement for job submitter pod

    cat >spark-python-in-s3-nodeselector-job-submitter.json << EOF\n{\n  \"name\": \"spark-python-in-s3-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      },\n      {\n        \"classification\": \"emr-job-submitter\",\n        \"properties\": {\n            \"jobsubmitter.node.selector.eks.amazonaws.com/capacityType\": \"ON_DEMAND\"\n        }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-nodeselector-job-submitter.json\n

    StartJobRun request with Single AZ node placement for job submitter pod:

    cat >spark-python-in-s3-nodeselector-job-submitter-az.json << EOF\n{\n  \"name\": \"spark-python-in-s3-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      },\n      {\n        \"classification\": \"emr-job-submitter\",\n        \"properties\": {\n            \"jobsubmitter.node.selector.topology.kubernetes.io/zone\": \"<availability zone>\"\n        }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-nodeselector-job-submitter-az.json\n

    StartJobRun request with single AZ and ec2 instance type placement for job submitter pod:

    {\n  \"name\": \"spark-python-in-s3-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.kubernetes.pyspark.pythonVersion=3 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6 --conf spark.sql.shuffle.partitions=1000\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\",\n         }\n      },\n      {\n        \"classification\": \"emr-job-submitter\",\n        \"properties\": {\n            \"jobsubmitter.node.selector.topology.kubernetes.io/zone\": \"<availability zone>\",\n            \"jobsubmitter.node.selector.node.kubernetes.io/instance-type\":\"m5.4xlarge\"\n        }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n      }\n      }\n    }\n  }\n}\n

    Configurations of interest:

    jobsubmitter.node.selector.[labelKey]: Adds to the node selector of the job submitter pod, with key labelKey and the value as the configuration's value. For example, setting jobsubmitter.node.selector.identifier to myIdentifier will result in the job-runner pod having a node selector with key identifier and value myIdentifier. Multiple node selector keys can be added by setting multiple configurations with this prefix.
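    For completeness, a minimal boto3 sketch of a StartJobRun call that pins the job submitter pod to On-Demand capacity; the cluster ID, role ARN, and S3 paths are placeholders, and this mirrors the JSON request shown earlier rather than adding new behavior:

    import boto3

    emr = boto3.client("emr-containers")

    response = emr.start_job_run(
        name="spark-python-in-s3-nodeselector",
        virtualClusterId="<virtual-cluster-id>",
        executionRoleArn="<execution-role-arn>",
        releaseLabel="emr-6.2.0-latest",
        jobDriver={
            "sparkSubmitJobDriver": {
                "entryPoint": "s3://<s3 prefix>/trip-count.py",
                "sparkSubmitParameters": "--conf spark.driver.cores=5 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6",
            }
        },
        configurationOverrides={
            "applicationConfiguration": [
                {
                    "classification": "emr-job-submitter",
                    "properties": {
                        # Keep the job submitter (job-runner) pod off Spot capacity
                        "jobsubmitter.node.selector.eks.amazonaws.com/capacityType": "ON_DEMAND"
                    },
                }
            ]
        },
    )
    print(response["id"])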

    "},{"location":"node-placement/docs/fargate-node-placement/","title":"EKS Fargate Node Placement","text":""},{"location":"node-placement/docs/fargate-node-placement/#fargate-node-placement","title":"Fargate Node Placement","text":"

    AWS Fargate is a technology that provides on-demand, right-sized compute capacity for containers. With AWS Fargate, you don't have to provision, configure, or scale groups of EC2 instances on your own to run containers. You also don't need to choose server types, decide when to scale your node groups, or optimize cluster packing. Instead you can control which pods start on Fargate and how they run with Fargate profiles.

    "},{"location":"node-placement/docs/fargate-node-placement/#aws-fargate-profile","title":"AWS Fargate profile","text":"

    Before you can schedule pods on Fargate in your cluster, you must define at least one Fargate profile that specifies which pods use Fargate when launched. You must define a namespace for every selector. The Fargate profile allows an administrator to declare which pods run on Fargate. This declaration is done through the profile\u2019s selectors. If a namespace selector is defined without any labels, Amazon EKS attempts to schedule all pods that run in that namespace onto Fargate using the profile.

    Create a Fargate profile: create your Fargate profile with the following eksctl command, replacing the <variable text> (including <>) with your own values. You're required to specify a namespace. The --labels option is not required to create your Fargate profile, but it is required if you want to run only the Spark executors on Fargate.

    eksctl create fargateprofile \\\n    --cluster <cluster_name> \\\n    --name <fargate_profile_name> \\\n    --namespace <virtual_cluster_mapped_namespace> \\\n    --labels spark-node-placement=fargate\n
    "},{"location":"node-placement/docs/fargate-node-placement/#1-place-entire-job-including-driver-pod-on-fargate","title":"1- Place entire job including driver pod on Fargate","text":"

    When both the driver and the executors use the same labels as the Fargate selector, the entire job, including the driver pod, runs on Fargate.

    Request:

    cat >spark-python-in-s3-nodeselector.json << EOF\n{\n  \"name\": \"spark-python-in-s3-fargate-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.3.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=4  --conf spark.executor.memory=20G --conf spark.driver.memory=20G --conf spark.executor.cores=4\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n            \"spark.kubernetes.driver.label.spark-node-placement\": \"fargate\",\n            \"spark.kubernetes.executor.label.spark-node-placement\": \"fargate\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-nodeselector.json\n

    Observed Behavior: When the job starts, the driver pod and executor pods are scheduled only on Fargate, since both are labeled with spark-node-placement: fargate. This is useful when we want to run the entire job on Fargate nodes. Note that the maximum vCPU available to the driver pod on Fargate is 4 vCPUs.

    "},{"location":"node-placement/docs/fargate-node-placement/#2-place-driver-pod-on-ec2-and-executor-pod-on-fargate","title":"2- Place driver pod on EC2 and executor pod on Fargate","text":"

    Remove the label from the driver pod to schedule it on EC2 instances. This is especially helpful when the driver pod needs more resources (i.e. more than 4 vCPUs).

    Request:

    cat >spark-python-in-s3-nodeselector.json << EOF\n{\n  \"name\": \"spark-python-in-s3-fargate-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.3.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=6 --conf spark.executor.memory=20G --conf spark.driver.memory=30G --conf spark.executor.cores=4\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n            \"spark.kubernetes.executor.label.spark-node-placement\": \"fargate\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-nodeselector.json\n

    Observed Behavior: When the job starts, the driver pod is scheduled on an EC2 instance. EKS picks an instance from the first node group that has resources matching the driver pod's requirements.

    "},{"location":"node-placement/docs/fargate-node-placement/#3-define-a-nodeselector-in-pod-templates","title":"3- Define a NodeSelector in Pod Templates","text":"

    Beginning with Amazon EMR versions 5.33.0 and 6.3.0, Amazon EMR on EKS supports Spark's pod template feature. Pod templates are specifications that determine how to run each pod. You can use pod template files to define driver or executor pod configurations that Spark configurations do not support. For example, Spark configurations do not support defining individual node selectors for the driver pod and the executor pods. Define a node selector only for the driver pod when you want to choose which pool of EC2 instances it should be scheduled on, and let the Fargate profile schedule the executor pods.

    Driver Pod Template

    apiVersion: v1\nkind: Pod\nspec:\n  volumes:\n    - name: source-data-volume\n      emptyDir: {}\n    - name: metrics-files-volume\n      emptyDir: {}\n  nodeSelector:\n    <ec2-instance-node-label-key>: <ec2-instance-node-label-value>\n  containers:\n  - name: spark-kubernetes-driver # This will be interpreted as Spark driver container\n

    Store the pod template file in an S3 location:

    aws s3 cp /driver-pod-template.yaml s3://<your-bucket-name>/driver-pod-template.yaml

    Request

    cat >spark-python-in-s3-nodeselector.json << EOF\n{\n  \"name\": \"spark-python-in-s3-fargate-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.3.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.executor.memory=20G --conf spark.driver.memory=30G --conf spark.executor.cores=4\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n            \"spark.kubernetes.executor.label.spark-node-placement\": \"fargate\",\n            \"spark.kubernetes.driver.podTemplateFile\": \"s3://<your-bucket-name>/driver-pod-template.yaml\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-nodeselector.json\n

    Observed Behavior: The driver pod is scheduled on an EC2 instance that has enough capacity and a label key/value matching the node selector.

    "},{"location":"outposts/emr-containers-on-outposts/","title":"Running EMR Containers on AWS Outposts","text":""},{"location":"outposts/emr-containers-on-outposts/#background","title":"Background","text":"

    You can now run Amazon EMR container jobs on EKS clusters that are running on AWS Outposts. AWS Outposts enables native AWS services, infrastructure, and operating models in on-premises facilities. In AWS Outposts environments, you can use the same AWS APIs, tools, and infrastructure that you use in the AWS Cloud. Running Amazon EKS nodes on AWS Outposts is ideal for low-latency workloads that need to run in close proximity to on-premises data and applications. For more information, see the Amazon EKS on Outposts documentation page.

    This document provides the steps to set up EMR containers on AWS Outposts.

    "},{"location":"outposts/emr-containers-on-outposts/#key-considerations-and-recommendations","title":"Key Considerations and Recommendations","text":"
    • The EKS cluster on an Outpost must be created with self-managed node groups.
    • Use the AWS Management Console and AWS CloudFormation to create a self-managed node group in Outposts.
    • For EMR workloads, we recommend creating EKS clusters where all the worker nodes reside in the self-managed node group of Outposts.
    • The Kubernetes client in the Spark driver pod creates and monitors executor pods by communicating with the EKS-managed Kubernetes API server residing in the parent AWS Region. For reliable monitoring of executor pods during a job run, we recommend having a reliable, low-latency link between the Outpost and the parent Region.
    • AWS Fargate is not available on Outposts.
    • For more information about the supported Regions, prerequisites and considerations for Amazon EKS on AWS Outposts, see the EKS on Outposts documentation page.
    "},{"location":"outposts/emr-containers-on-outposts/#infrastructure-setup","title":"Infrastructure Setup","text":""},{"location":"outposts/emr-containers-on-outposts/#setup-eks-on-outposts","title":"Setup EKS on Outposts","text":"

    Network Setup

    • Setup a VPC
    aws ec2 create-vpc \\\n--region <us-west-2> \\\n--cidr-block '<10.0.0.0/16>'\n

    In the output, take note of the VPC ID.

    {\n    \"Vpc\": {\n        \"VpcId\": \"vpc-123vpc\", \n        ...\n    }\n}\n
    • Create two subnets in the parent Region.
    aws ec2 create-subnet \\\n    --region '<us-west-2>' \\\n    --availability-zone-id '<usw2-az1>' \\\n    --vpc-id '<vpc-123vpc>' \\\n    --cidr-block '<10.0.1.0/24>'\n\naws ec2 create-subnet \\\n    --region '<us-west-2>' \\\n    --availability-zone-id '<usw2-az2>' \\\n    --vpc-id '<vpc-123vpc>' \\\n    --cidr-block '<10.0.2.0/24>'\n

    In the output, take note of the Subnet ID.

    {\n    \"Subnet\": {\n        \"SubnetId\": \"subnet-111\",\n        ...\n    }\n}\n{\n    \"Subnet\": {\n        \"SubnetId\": \"subnet-222\",\n        ...\n    }\n}\n
    • Create a subnet in the Outpost Availability Zone. (This step is different for Outposts)
    aws ec2 create-subnet \\\n    --region '<us-west-2>' \\\n    --availability-zone-id '<usw2-az1>' \\\n    --outpost-arn 'arn:aws:outposts:<us-west-2>:<123456789>:outpost/<op-123op>' \\\n    --vpc-id '<vpc-123vpc>' \\\n    --cidr-block '<10.0.3.0/24>'\n

    In the output, take note of the Subnet ID.

    {\n    \"Subnet\": {\n        \"SubnetId\": \"subnet-333outpost\",\n        \"OutpostArn\": \"...\"\n        ...\n    }\n}\n

    EKS Cluster Creation

    • Create an EKS cluster using the three subnet Ids created earlier.
    aws eks create-cluster \\\n    --region '<us-west-2>' \\\n    --name '<outposts-eks-cluster>' \\\n    --role-arn 'arn:aws:iam::<123456789>:role/<cluster-service-role>' \\\n    --resources-vpc-config  subnetIds='<subnet-111>,<subnet-222>,<subnet-333outpost>'\n
    • Check until the cluster status becomes active.
    aws eks describe-cluster \\\n    --region '<us-west-2>' \\\n    --name '<outposts-eks-cluster>'\n

    Note the values of resourcesVpcConfig.clusterSecurityGroupId and identity.oidc.issuer.

    {\n    \"cluster\": {\n        \"name\": \"outposts-eks-cluster\",\n        ...\n        \"resourcesVpcConfig\": {\n            \"clusterSecurityGroupId\": \"sg-123clustersg\",\n        },\n        \"identity\": {\n            \"oidc\": {\n                \"issuer\": \"https://oidc.eks.us-west-2.amazonaws.com/id/oidcid\"\n            }\n        },\n        \"status\": \"ACTIVE\",\n    }\n}\n
    • Add the Outposts nodes to the EKS Cluster.

    At this point, eksctl cannot be used to launch self-managed node groups in Outposts. Please follow the steps listed in the self-managed nodes documentation page. In order to use the CloudFormation script listed in the AWS Management Console tab, make note of the following values created in the earlier steps:
    • ClusterName: <outposts-eks-cluster>
    • ClusterControlPlaneSecurityGroup: <sg-123clustersg>
    • Subnets: <subnet-333outpost>

    Apply the aws-auth-cm config map listed on the documentation page to allow the nodes to join the cluster.

    "},{"location":"outposts/emr-containers-on-outposts/#register-cluster-with-emr-containers","title":"Register cluster with EMR Containers","text":"

    Once the EKS cluster has been created and the nodes have been registered with the EKS control plane, take the following steps:

    • Enable cluster access for Amazon EMR on EKS.
    • Enable IAM Roles for Service Accounts (IRSA) on the EKS cluster.
    • Create a job execution role.
    • Update the trust policy of the job execution role.
    • Grant users access to Amazon EMR on EKS.
    • Register the Amazon EKS cluster with Amazon EMR.
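    As an illustration of the final registration step only, a hedged boto3 sketch of creating the virtual cluster; the names and namespace are placeholders, and the preceding access and IAM steps must already be completed:

    import boto3

    emr = boto3.client("emr-containers")

    # Register the EKS cluster namespace as an EMR virtual cluster
    response = emr.create_virtual_cluster(
        name="outposts-virtual-cluster",
        containerProvider={
            "type": "EKS",
            "id": "outposts-eks-cluster",
            "info": {"eksInfo": {"namespace": "emr"}},
        },
    )
    print(response["id"])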
    "},{"location":"outposts/emr-containers-on-outposts/#conclusion","title":"Conclusion","text":"

    EMR-EKS on Outposts allows users to run their big data jobs in close proximity to on-premises data and applications.

    "},{"location":"performance/docs/dra/","title":"Dynamic Resource Allocation","text":"

    DRA is available in Spark 3 (EMR 6.x) without the need for an external shuffle service. Spark on Kubernetes doesn't support an external shuffle service as of Spark 3.1, but DRA can be achieved by enabling shuffle tracking.

    Spark DRA without an external shuffle service: With DRA, the Spark driver spawns the initial number of executors and then scales that number up, to the specified maximum number of executors, to process the pending tasks. An idle executor is terminated when there are no pending tasks, its idle time exceeds the idle timeout (spark.dynamicAllocation.executorIdleTimeout), and it doesn't hold any cached or shuffle data.

    If the executor idle threshold is reached and the executor holds cached data, it also has to exceed the cached-data idle timeout (spark.dynamicAllocation.cachedExecutorIdleTimeout); if it holds no shuffle data, the idle executor is then terminated.

    If the executor idle threshold is reached and the executor holds shuffle data, then without an external shuffle service the executor is never terminated while the job is running; such executors are terminated only when the job completes. This behavior is enforced by \"spark.dynamicAllocation.shuffleTracking.enabled\":\"true\" and \"spark.dynamicAllocation.enabled\":\"true\".

    If \"spark.dynamicAllocation.shuffleTracking.enabled\":\"false\"and \"spark.dynamicAllocation.enabled\":\"true\" then the spark application will error out since external shuffle service is not available.

    Request:

    cat >spark-python-in-s3-dra.json << EOF\n{\n  \"name\": \"spark-python-in-s3-dra\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"true\",\n          \"spark.dynamicAllocation.shuffleTracking.enabled\":\"true\",\n          \"spark.dynamicAllocation.minExecutors\":\"5\",\n          \"spark.dynamicAllocation.maxExecutors\":\"100\",\n          \"spark.dynamicAllocation.initialExecutors\":\"10\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n
    aws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-dra.json\n

    Observed Behavior: When the job starts, the driver pod is created and 10 executors are initially created (\"spark.dynamicAllocation.initialExecutors\":\"10\"). The number of executors can then scale up to a maximum of 100 (\"spark.dynamicAllocation.maxExecutors\":\"100\"). Configurations to note:

    spark.dynamicAllocation.shuffleTracking.enabled - **Experimental**. Enables shuffle file tracking for executors, which allows dynamic allocation without the need for an external shuffle service. This option will try to keep alive executors that are storing shuffle data for active jobs.

    spark.dynamicAllocation.shuffleTracking.timeout - When shuffle tracking is enabled, controls the timeout for executors that are holding shuffle data. The default value means that Spark will rely on the shuffles being garbage collected to be able to release executors. If for some reason garbage collection is not cleaning up shuffles quickly enough, this option can be used to control when to time out executors even when they are storing shuffle data.
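    A hedged sketch of the same DRA settings applied through a SparkSession builder, useful for local experimentation; the values mirror the spark-defaults properties in the request above, and the shuffle-tracking timeout value is purely illustrative:

    from pyspark.sql import SparkSession

    # Dynamic allocation with shuffle tracking, no external shuffle service required
    spark = SparkSession \
        .builder \
        .appName("dra-shuffle-tracking-example") \
        .config("spark.dynamicAllocation.enabled", "true") \
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true") \
        .config("spark.dynamicAllocation.shuffleTracking.timeout", "300s") \
        .config("spark.dynamicAllocation.minExecutors", "5") \
        .config("spark.dynamicAllocation.initialExecutors", "10") \
        .config("spark.dynamicAllocation.maxExecutors", "100") \
        .getOrCreate()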

    "},{"location":"security/docs/spark/data-encryption/","title":"EMR Containers Spark - In transit and At Rest data encryption","text":""},{"location":"security/docs/spark/data-encryption/#encryption-at-rest","title":"Encryption at Rest","text":""},{"location":"security/docs/spark/data-encryption/#amazon-s3-client-side-encryption","title":"Amazon S3 Client-Side Encryption","text":"

    To utilize S3 client-side encryption, you will need to create a KMS key to encrypt and decrypt data. If you do not have a KMS key, please follow this guide - AWS KMS create keys. Also note that the job execution role needs access to this key; please see Add to Key policy for instructions on how to add these permissions.
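    If you only need a key for testing, a minimal boto3 sketch of creating a symmetric KMS key and reading the key ID that fs.s3.cse.kms.keyId expects; the description is arbitrary, and granting the job execution role access via the key policy is a separate step described in the guides linked above:

    import boto3

    kms = boto3.client("kms")

    # Create a KMS key; keys are symmetric by default, so the same key encrypts and decrypts
    key = kms.create_key(Description="EMR on EKS S3 client-side encryption demo key")
    print(key["KeyMetadata"]["KeyId"])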

    trip-count-encrypt-write.py:

    cat > trip-count-encrypt-write.py<<EOF\nimport sys\n\nfrom pyspark.sql import SparkSession\n\n\nif __name__ == \"__main__\":\n\n    spark = SparkSession\\\n        .builder\\\n        .appName(\"trip-count-join-fsx\")\\\n        .getOrCreate()\n\n    df = spark.read.parquet('s3://<s3 prefix>/trip-data.parquet')\n    print(\"Total trips: \" + str(df.count()))\n\n    df.write.parquet('s3://<s3 prefix>/write-encrypt-trip-data.parquet')\n    print(\"Encrypt - KMS-CSE write to S3 completed\")\n    spark.stop()\nEOF\n

    Request:

    cat > spark-python-in-s3-encrypt-cse-kms-write.json <<EOF\n{\n  \"name\": \"spark-python-in-s3-encrypt-cse-kms-write\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>trip-count-encrypt-write.py\", \n       \"sparkSubmitParameters\": \"--conf spark.executor.instances=10 --conf spark.driver.cores=2  --conf spark.executor.memory=20G --conf spark.driver.memory=20G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n       },\n       {\n         \"classification\": \"emrfs-site\", \n         \"properties\": {\n          \"fs.s3.cse.enabled\":\"true\",\n          \"fs.s3.cse.encryptionMaterialsProvider\":\"com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider\",\n          \"fs.s3.cse.kms.keyId\":\"<KMS Key Id>\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"persistentAppUI\": \"ENABLED\", \n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-encrypt-cse-kms-write.json\n

    In the above request, EMRFS encrypts the parquet file with the specified KMS key and the encrypted object is persisted to the specified s3 location.

    To verify the encryption, read the data back using the same KMS key to decrypt it - the KMS key used is a symmetric key, so the same key can be used to both encrypt and decrypt.

    trip-count-encrypt-read.py

    cat > trip-count-encrypt-read.py <<EOF\nimport sys\n\nfrom pyspark.sql import SparkSession\n\n\nif __name__ == \"__main__\":\n\n    spark = SparkSession\\\n        .builder\\\n        .appName(\"trip-count-encrypt-read\")\\\n        .getOrCreate()\n\n    df = spark.read.parquet('s3://<s3 prefix>/trip-data.parquet')\n    print(\"Total trips: \" + str(df.count()))\n\n    df_encrypt = spark.read.parquet('s3://<s3 prefix>/write-encrypt-trip-data.parquet')\n    print(\"Encrypt data - Total trips: \" + str(df_encrypt.count()))\n    spark.stop()\nEOF\n

    Request

    cat > spark-python-in-s3-encrypt-cse-kms-read.json<<EOF\n{\n  \"name\": \"spark-python-in-s3-encrypt-cse-kms-read\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count-encrypt-read.py\", \n       \"sparkSubmitParameters\": \"--conf spark.executor.instances=10 --conf spark.driver.cores=2  --conf spark.executor.memory=20G --conf spark.driver.memory=20G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n       },\n       {\n         \"classification\": \"emrfs-site\", \n         \"properties\": {\n          \"fs.s3.cse.enabled\":\"true\",\n          \"fs.s3.cse.encryptionMaterialsProvider\":\"com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider\",\n          \"fs.s3.cse.kms.keyId\":\"<KMS Key Id>\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"persistentAppUI\": \"ENABLED\", \n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-encrypt-cse-kms-read.json\n

    Validate encryption: Try to read the encrypted data without specifying \"fs.s3.cse.enabled\":\"true\" - you will get an error message in the driver and executor logs because the content is encrypted and cannot be read without decryption.
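
    For example, submitting the same read job with an emrfs-site classification similar to the fragment below (or with the classification omitted entirely) should fail to read the encrypted objects; this is only a sketch of the relevant fragment, not a full job request:

    {\n  \"classification\": \"emrfs-site\", \n  \"properties\": {\n    \"fs.s3.cse.enabled\":\"false\"\n  }\n}\n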

    "},{"location":"security/docs/spark/encryption/","title":"EMR on EKS - Encryption Best Practices","text":"

    This document describes how to think about security and apply best practices when using the EMR on EKS service. We will cover topics related to encryption at rest and in transit when you run EMR on EKS jobs on an EKS cluster.

    It's important to understand the shared responsibility model when using managed services such as EMR on EKS in order to improve the overall security posture of your environment. Generally speaking, AWS is responsible for security \"of\" the cloud, whereas you, the customer, are responsible for security \"in\" the cloud. The diagram below depicts this high-level definition.

    "},{"location":"security/docs/spark/encryption/#shared-responsibility-model","title":"Shared responsibility model","text":"

    EMR on EKS provides a simple way to run Spark jobs on top of EKS clusters. The architecture itself is loosely coupled and abstracted from customers so that they can run a secure environment for their Spark applications. Because EMR on EKS uses a combination of at least two services (EMR and EKS), we will cover how EKS provides the infrastructure components consumed by the EMR Spark workload and how to handle encryption for each service.

    AWS assumes different levels of responsibility depending on the features being consumed by EMR on EKS customers. At the time of writing, the compute options from EKS are managed node groups, self-managed workers, and Fargate. We won't go in-depth on these architectures as they are detailed in the EKS best practices guide (https://aws.github.io/aws-eks-best-practices/security/docs/). The diagrams below depict how this responsibility shifts between the customer and AWS based on the features consumed.

    "},{"location":"security/docs/spark/encryption/#encryption-for-data-in-transit","title":"Encryption for data in-transit","text":"

    In this section, we will cover encryption for data in-transit. We will highlight AWS platform capabilities from the physical layer and then review how AWS handles encryption in the EMR on EKS architecture layer. Lastly, we will cover how customers can enable encryption between spark drivers and executors.

    "},{"location":"security/docs/spark/encryption/#aws-infrastructure-physical-layer","title":"AWS Infrastructure - Physical layer","text":"

    AWS provides secure and private connectivity between EC2 instances of all types. All data flowing across AWS Regions over the AWS global network is automatically encrypted at the physical layer before it leaves AWS secured facilities. All traffic between AZs is encrypted. All cross-Region traffic that uses Amazon VPC and Transit Gateway peering is automatically bulk-encrypted when it exits a Region. In addition, if you use supported instances from the Nitro family, traffic between those instances is encrypted in transit using AEAD algorithms with 256-bit encryption. We highly recommend reviewing the EC2 documentation for more information.

    "},{"location":"security/docs/spark/encryption/#amazon-emr-on-eks","title":"Amazon EMR on EKS","text":"

    The diagram below depicts the high-level architecture of EMR on EKS. In this section, we will cover encryption in transit for communication between the managed services, EMR and EKS. All traffic to the AWS APIs that support EMR and EKS is encrypted by default. EKS exposes the Kubernetes API server through an HTTPS endpoint. Both the kubelet that runs on EKS worker nodes and Kubernetes clients such as kubectl interact with the EKS cluster API using TLS. Amazon EMR on EKS uses the same secure channel to interact with the EKS cluster API to run Spark jobs on worker nodes. In addition, EMR on EKS provides an encrypted endpoint for accessing the Spark history server.

    Spark offers AES-based encryption for RPC connections. EMR on EKS customers may choose to encrypt the traffic between spark drivers and executors using this encryption mechanism. In order to enable encryption, RPC authentication must also be enabled in your spark configuration.

    --conf spark.authenticate=true \\\n--conf spark.network.crypto.enabled=true \\\n

    The encryption key is generated by the driver and distributed to executors via environment variables. Because these environment variables can be read by anyone with access to the Kubernetes API (kubectl), we recommend restricting access to your environment to authorized users only. You should also configure proper Kubernetes RBAC permissions so that only authorized users and service accounts can read the pods that expose these variables.
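
    As an illustration, the sketch below grants read access to pods in the job namespace only to a specific group; the namespace, group, and resource names are placeholders and should be adapted to your own RBAC model rather than used as-is:

    cat > spark-pod-reader-rbac.yaml << EOF\n# Allow only a trusted group to read pod specs (pod specs expose the executor environment variables)\napiVersion: rbac.authorization.k8s.io/v1\nkind: Role\nmetadata:\n  name: spark-pod-reader\n  namespace: <emr-on-eks-namespace>\nrules:\n  - apiGroups: [\"\"]\n    resources: [\"pods\"]\n    verbs: [\"get\", \"list\", \"watch\"]\n---\napiVersion: rbac.authorization.k8s.io/v1\nkind: RoleBinding\nmetadata:\n  name: spark-pod-reader-binding\n  namespace: <emr-on-eks-namespace>\nsubjects:\n  - kind: Group\n    name: <trusted-operators-group>\n    apiGroup: rbac.authorization.k8s.io\nroleRef:\n  kind: Role\n  name: spark-pod-reader\n  apiGroup: rbac.authorization.k8s.io\nEOF\n\nkubectl apply -f spark-pod-reader-rbac.yaml\n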

    "},{"location":"security/docs/spark/encryption/#encryption-for-data-at-rest","title":"Encryption for data at-rest","text":"

    In this section, we will cover encryption for data at rest. We will review how to enable storage-level encryption so that the Spark application can use this data securely and transparently. We will also see how to enable encryption from the Spark application while using AWS native storage options.

    "},{"location":"security/docs/spark/encryption/#amazon-s3","title":"Amazon S3","text":"

    Amazon S3 offers server-side encryption for encrypting all data that is stored in an S3 bucket. You can enable default encryption using either S3 managed keys (SSE-S3) or KMS managed keys (SSE-KMS). Amazon S3 will encrypt all data before storing it on disks based on the keys specified. We recommend using server-side encryption at a minimum so that your data at-rest is encrypted. Please review Amazon S3 documentation and use the mechanisms that apply to your encryption standards and acceptable performance.
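
    For reference, a minimal sketch of enabling default SSE-KMS encryption on a bucket with the AWS CLI; the bucket name and key ARN are placeholders:

    # Enable default SSE-KMS encryption for all new objects in the bucket\naws s3api put-bucket-encryption \\\n  --bucket <bucket-name> \\\n  --server-side-encryption-configuration '{\"Rules\":[{\"ApplyServerSideEncryptionByDefault\":{\"SSEAlgorithm\":\"aws:kms\",\"KMSMasterKeyID\":\"<kms-key-arn>\"}}]}'\n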

    Amazon S3 supports client-side encryption as well. Using this approach, you can have the Spark application encrypt all data with the desired KMS keys before it is uploaded to S3 buckets. The example below shows a Spark application reading and writing parquet data in S3. During job submission, we use the EMRFS encryption mechanism to encrypt all data with a KMS key before writing it to the desired S3 location.

    import sys\n\nfrom pyspark.sql import SparkSession\n\n\nif __name__ == \"__main__\":\n\n    spark = SparkSession\\\n        .builder\\\n        .appName(\"trip-count-encrypt-write\")\\\n        .getOrCreate()\n\n    df = spark.read.parquet('s3://<s3 prefix>/trip-data.parquet')\n    print(\"Total trips: \" + str(df.count()))\n\n    df.write.parquet('s3://<s3 prefix>/write-encrypt-trip-data.parquet')\n    print(\"Encrypt - KMS - CSE write to S3 completed\")\n    spark.stop()\n

    Below is the job submission request that shows the KMS specification needed for EMRFS to perform this encryption. For a complete end-to-end example, please see the EMR on EKS best practices documentation.

    cat > spark-python-in-s3-encrypt-cse-kms-write.json <<EOF\n{\n  \"name\": \"spark-python-in-s3-encrypt-cse-kms-write\",\n  \"virtualClusterId\": \"<virtual-cluster-id>\",\n  \"executionRoleArn\": \"<execution-role-arn>\",\n  \"releaseLabel\": \"emr-6.2.0-latest\",\n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>trip-count-encrypt-write.py\",\n       \"sparkSubmitParameters\": \"--conf spark.executor.instances=10 --conf spark.driver.cores=2  --conf spark.executor.memory=20G --conf spark.driver.memory=20G --conf spark.executor.cores=2\"\n    }\n  },\n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\",\n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n       },\n       {\n         \"classification\": \"emrfs-site\",\n         \"properties\": {\n          \"fs.s3.cse.enabled\":\"true\",\n          \"fs.s3.cse.encryptionMaterialsProvider\":\"com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider\",\n          \"fs.s3.cse.kms.keyId\":\"<KMS Key Id>\"\n         }\n      }\n    ],\n    \"monitoringConfiguration\": {\n      \"persistentAppUI\": \"ENABLED\",\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\",\n        \"logStreamNamePrefix\": \"demo\"\n      },\n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-encrypt-cse-kms-write.json\n

    Amazon EKS supports three different storage offerings (EBS, EFS, and FSx for Lustre) that can be directly consumed by pods. Each storage offering provides an encryption mechanism that can be enabled at the storage level.

    "},{"location":"security/docs/spark/encryption/#amazon-ebs","title":"Amazon EBS","text":"

    Amazon EBS supports default encryption that can be turned on on a per-Region basis. Once it's turned on, newly created EBS volumes and snapshots are encrypted using AWS managed KMS keys. Please review the EBS documentation to learn more about how to enable this feature.
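
    A minimal AWS CLI sketch of this setting is shown below; the custom KMS key step is optional, and the Region and key ARN are placeholders:

    # Turn on EBS encryption by default for the Region\naws ec2 enable-ebs-encryption-by-default --region <region>\n\n# (Optional) use your own KMS key instead of the AWS managed key\naws ec2 modify-ebs-default-kms-key-id --kms-key-id <kms-key-arn> --region <region>\n\n# Verify the setting\naws ec2 get-ebs-encryption-by-default --region <region>\n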

    You can use the Kubernetes (k8s) in-tree storage driver or the EBS CSI driver to consume EBS volumes within your pods. Both choices offer options to enable encryption. In the example below, we use the k8s in-tree storage driver to create a storage class and a persistent volume claim. You can create similar resources using the EBS CSI driver as well.

    apiVersion: storage.k8s.io/v1\nkind: StorageClass\nmetadata:\n  name: encrypted-sc\nprovisioner: kubernetes.io/aws-ebs\nvolumeBindingMode: WaitForFirstConsumer\nparameters:\n  type: gp2\n  fsType: ext4\n  encrypted: \"true\"\n---\napiVersion: v1\nkind: PersistentVolumeClaim\nmetadata:\n  name: spark-driver-pvc\nspec:\n  storageClassName: encrypted-sc\n  accessModes:\n    - ReadWriteOnce\n  resources:\n    requests:\n      storage: 10Gi\n

    Once these resources are created, you can reference them in your driver and executor configuration. You can see an example of this specification below. Keep in mind that an EBS volume can only be attached to a single EC2 instance, and therefore to a single Kubernetes pod. If you have multiple executor pods, you need to create multiple PVCs to fulfill this requirement.

    --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=spark-driver-pvc\n--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.readOnly=false\n--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data\n...\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=spark-executor-pvc\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data\n

    Another approach is to let k8s create EBS volumes dynamically based on your Spark workload. You can do so by specifying just the storageClass and sizeLimit options and setting the persistent volume claim (PVC) name to OnDemand. This is useful in the case of Dynamic Resource Allocation. Please be sure to use EMR release 6.3.0 or above to use this feature, because dynamic PVC support was added in Spark 3.1. Below is an example of dynamically creating volumes for executors within your job.

    --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=spark-driver-pvc\n--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.readOnly=false\n--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=encrypted-sc\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=10Gi\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.options.claimName=OnDemand\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.options.storageClass=encrypted-sc\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.options.sizeLimit=10Gi\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.mount.path=/var/data/spill\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.mount.readOnly=false\n

    For a complete list of available options, please refer to the Spark Documentation

    "},{"location":"security/docs/spark/encryption/#amazon-efs","title":"Amazon EFS","text":"

    Similar to EBS, you can consume EFS volumes via the EFS CSI driver and FSx for Lustre volumes via the FSx CSI driver. There are two provisioning methods before these storage volumes are consumed by workloads, namely static provisioning and dynamic provisioning. For static provisioning, you have to pre-create volumes using the AWS APIs, CLI or AWS console. For dynamic provisioning, the volume is created dynamically by the CSI drivers as workloads are deployed onto the Kubernetes cluster. Currently, the EFS CSI driver doesn't support dynamic volume provisioning. However, you can create the volume using the EFS API or AWS console before creating a persistent volume (PV) that can be used within your Spark application. If you plan to encrypt the data stored in EFS, you need to specify encryption during volume creation. For further information about EFS file encryption, please refer to Encrypting Data at Rest. One of the advantages of using EFS is that it provides encryption in transit support using TLS, and it's enabled by default by the CSI driver. You can see the example below if you need to enforce TLS encryption during PV creation.

    apiVersion: v1\nkind: PersistentVolume\nmetadata:\n  name: efs-pv\nspec:\n  capacity:\n    storage: 5Gi\n  volumeMode: Filesystem\n  accessModes:\n    - ReadWriteOnce\n  persistentVolumeReclaimPolicy: Retain\n  storageClassName: efs-sc\n  csi:\n    driver: efs.csi.aws.com\n    volumeHandle: fs-4af69aab\n    volumeAttributes:\n      encryptInTransit: \"true\"\n
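
    For the static provisioning flow above, the EFS file system itself must be created with encryption at rest enabled before the PV is defined. A hedged AWS CLI sketch is shown below; the KMS key, subnet, security group and tag values are placeholders:

    # Create an encrypted EFS file system (encryption at rest must be chosen at creation time)\naws efs create-file-system \\\n  --encrypted \\\n  --kms-key-id <kms-key-arn> \\\n  --performance-mode generalPurpose \\\n  --throughput-mode bursting \\\n  --tags Key=Name,Value=emr-on-eks-demo\n\n# Create a mount target in each subnet used by your EKS worker nodes\naws efs create-mount-target \\\n  --file-system-id <filesystem id> \\\n  --subnet-id <subnet-id> \\\n  --security-groups <securitygroup-id>\n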
    "},{"location":"security/docs/spark/encryption/#amazon-fsx-for-lustre","title":"Amazon FSx for Lustre","text":"

    The Amazon FSx CSI driver supports both static and dynamic provisioning. Encryption for data in transit is automatically enabled from Amazon EC2 instances that support encryption in transit. To learn which EC2 instances support encryption in transit, see Encryption in Transit in the Amazon EC2 User Guide for Linux Instances. Encryption for data at rest is automatically enabled when you create the FSx file system. Amazon FSx for Lustre supports two types of file systems, namely persistent and scratch. You can use the default encryption method where encryption keys are managed by Amazon FSx. However, if you prefer to manage your own KMS keys, you can do so for persistent file systems. The example below shows how to create a storage class for an FSx for Lustre persistent file system using your own KMS managed keys.

    kind: StorageClass\napiVersion: storage.k8s.io/v1\nmetadata:\n  name: fsx-sc\nprovisioner: fsx.csi.aws.com\nparameters:\n  subnetId: subnet-056da83524edbe641\n  securityGroupIds: sg-086f61ea73388fb6b\n  deploymentType: PERSISTENT_1\n  kmsKeyId: <kms_arn>\n

    You can then create a persistent volume claim (see an example in the FSx CSI driver repo) and use it within your Spark application as shown below.

    --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=fsx-claim\n--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.readOnly=false\n--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data\n
    "},{"location":"security/docs/spark/encryption/#using-spark-to-encrypt-data","title":"Using Spark to encrypt data","text":"

    Apache Spark supports encrypting temporary data that is stored on storage volumes. These volumes can be instance storage such as NVMe SSD volumes, or EBS, EFS or FSx volumes. Temporary data can be shuffle files, shuffle spills and data blocks stored on disk (for both caching and broadcast variables). It's important to note that data on NVMe instance storage is encrypted using an XTS-AES-256 block cipher implemented in a hardware module on the instance. Even though instance storage is available, you still need to format and mount the volumes while bootstrapping the EC2 instances. Below is an example that shows how to use instance storage with eksctl.

    managedNodeGroups:\n- name: nvme\n  minSize: 2\n  desiredCapacity: 2\n  maxSize: 10\n  instanceType: r5d.4xlarge\n  ssh:\n    enableSsm: true\n  preBootstrapCommands:\n    - IDX=1\n    - for DEV in /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_*-ns-1; do  mkfs.xfs ${DEV};mkdir -p /local${IDX};echo ${DEV} /local${IDX} xfs defaults,noatime 1 2 >> /etc/fstab; IDX=$((${IDX} + 1)); done\n    - mount -a\n

    If you use non-NVMe SSD volumes, you can follow the best practice of encrypting shuffle data before writing it to disk, as shown in the example below. For more information about the type of instance store volume supported by each instance type, see Instance store volumes.

    --conf spark.io.encryption.enabled=true\n
    "},{"location":"security/docs/spark/encryption/#conclusion","title":"Conclusion","text":"

    In this document, we covered the shared responsibility model for running EMR on EKS workloads. We then reviewed platform capabilities available through AWS infrastructure and how to enable encryption both at the storage level and from the Spark application. To quote Werner Vogels, AWS CTO: \u201cSecurity is everyone\u2019s job now, not just the security team\u2019s\u201d. We hope this document provides prescriptive guidance on how to enable encryption for running secure EMR on EKS workloads.

    "},{"location":"security/docs/spark/network-security/","title":"** Managing VPC for EMR on EKS**","text":"

    This section addresses network security at the VPC level. If you want to read more about network security for Spark in EMR on EKS, please refer to this section.

    "},{"location":"security/docs/spark/network-security/#security-group","title":"Security Group","text":"

    The applications running on your EMR on EKS cluster often need access to services that are running outside the cluster, for example Amazon Redshift, Amazon Relational Database Service, or a service self-hosted on an EC2 instance. To access these resources you need to allow network traffic at the security group level. The default mechanism in EKS is to use security groups at the node level, which means all the pods running on the node inherit the rules of that security group. For security-conscious customers this is not a desired behavior, and you would want to use security groups at the pod level.

    This section addresses how you can use security groups for pods with EMR on EKS.

    "},{"location":"security/docs/spark/network-security/#configure-eks-cluster-to-use-security-groups-for-pods","title":"Configure EKS Cluster to use Security Groups for Pods","text":"

    In order to use security groups at the pod level, you need to configure the VPC CNI for EKS. The following link guides you through the prerequisites as well as configuring the EKS cluster.
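
    As a quick reference, the commands below sketch the key step of turning on pod ENIs in the VPC CNI once the prerequisites in the linked guide (supported Nitro instance types and the AmazonEKSVPCResourceController policy on the cluster IAM role) are in place; refer to the guide for the full procedure:

    # Enable pod ENI support in the aws-node (VPC CNI) daemonset\nkubectl set env daemonset aws-node -n kube-system ENABLE_POD_ENI=true\n\n# Nodes that support trunk ENIs should then carry the vpc.amazonaws.com/has-trunk-attached=true label\nkubectl get nodes -l vpc.amazonaws.com/has-trunk-attached=true\n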

    "},{"location":"security/docs/spark/network-security/#define-securitygrouppolicy","title":"Define SecurityGroupPolicy","text":"

    Once you have configured the VPC CNI, you need to create a SecurityGroupPolicy object. This object defines which security groups (up to 5) to use, a podSelector that defines which pods the security groups apply to, and the namespace in which the SecurityGroupPolicy should be evaluated. Below you will find an example of a SecurityGroupPolicy.

    apiVersion: vpcresources.k8s.aws/v1beta1\nkind: SecurityGroupPolicy\nmetadata:\n  name: <>\n  namespace: <NAMESPACE FOR VC>\nspec:\n  podSelector: \n    matchLabels:\n      role: spark\n  securityGroups:\n    groupIds:\n      - sg-xxxxx\n
    "},{"location":"security/docs/spark/network-security/#define-pod-template-to-use-security-group-for-pod","title":"Define pod template to use Security Group for pod","text":"

    In order for the security group to be applied to the Spark driver and executors, you need to provide a pod template which adds label(s) to the pods. The labels should match the ones defined above in the podSelector; in our example it is role: spark. The snippet below defines the pod template that you can upload to S3 and then reference when launching your job.

    apiVersion: v1\nkind: Pod\nmetadata:\n  labels:\n    role: spark\n
    "},{"location":"security/docs/spark/network-security/#launch-a-job","title":"Launch a job","text":"

    The command below can be used to run a job.

        aws emr-containers start-job-run --virtual-cluster-id <EMR-VIRTUAL-CLUSTER-ID> --name spark-jdbc --execution-role-arn <EXECUTION-ROLE-ARN> --release-label emr-6.7.0-latest --job-driver '{\n    \"sparkSubmitJobDriver\": {\n    \"entryPoint\": \"<S3-URI-FOR-PYSPARK-JOB-DEFINED-ABOVE>\",\n    \"sparkSubmitParameters\": \"--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1\"\n    }\n    }' --configuration-overrides '{\n    \"applicationConfiguration\": [\n    {\n    \"classification\": \"spark-defaults\", \n    \"properties\": {\n    \"spark.hadoop.hive.metastore.client.factory.class\": \"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory\",\n    \"spark.sql.catalogImplementation\": \"hive\",\n    \"spark.dynamicAllocation.enabled\":\"true\",\n    \"spark.dynamicAllocation.minExecutors\": \"8\",\n    \"spark.dynamicAllocation.maxExecutors\": \"40\",\n    \"spark.kubernetes.allocation.batch.size\": \"8\",\n    \"spark.dynamicAllocation.executorAllocationRatio\": \"1\",\n    \"spark.dynamicAllocation.shuffleTracking.enabled\": \"true\",\n    \"spark.dynamicAllocation.shuffleTracking.timeout\": \"300s\",\n    \"spark.kubernetes.driver.podTemplateFile\":<S3-URI-TO-DRIVER-POD-TEMPLATE>,\n    \"spark.kubernetes.executor.podTemplateFile\":<S3-URI-TO-EXECUTOR-POD-TEMPLATE>\n    }\n    }\n    ],\n    \"monitoringConfiguration\": {\n        \"persistentAppUI\": \"ENABLED\",\n        \"cloudWatchMonitoringConfiguration\": {\n            \"logGroupName\": \"/aws/emr-containers/\",\n            \"logStreamNamePrefix\": \"default\"\n        }\n    }\n    }'\n
    "},{"location":"security/docs/spark/network-security/#verify-a-security-group-attached-to-the-pod-eni","title":"Verify a security group attached to the Pod ENI","text":"

    To verify that the Spark driver and executor pods have the security group attached, run the first command to get the pod name, then the second one to see the pod annotation listing the ENI associated with the pod; this ENI carries the security group defined in the SecurityGroupPolicy.

    export POD_NAME=$(kubectl -n <NAMESPACE> get pods -l role=spark -o jsonpath='{.items[].metadata.name}')\n\nkubectl -n <NAMESPACE>  describe pod $POD_NAME | head -11\n
    Annotations:  kubernetes.io/psp: eks.privileged\n              vpc.amazonaws.com/pod-eni:\n                [{\"eniId\":\"eni-xxxxxxx\",\"ifAddress\":\"xx:xx:xx:xx:xx:xx\",\"privateIp\":\"x.x.x.x\",\"vlanId\":1,\"subnetCidr\":\"x.x.x.x/x\"}]\n
    "},{"location":"security/docs/spark/secrets/","title":"** Using Secrets in EMR on EKS**","text":"

    Secrets can be credentials to APIs, databases or other resources. There are various ways these secrets can be passed to your containers; some of them are pod environment variables or Kubernetes Secrets. These methods are not secure: with environment variables, secrets are stored in clear text, and any user who has access to the Kubernetes cluster with sufficient privileges can read them. Storing secrets using Kubernetes Secrets is also not secure, because they are not encrypted and are only base64 encoded.
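
    For example, the snippet below (with a throwaway secret name and value) illustrates how easily a Kubernetes Secret can be decoded by anyone who is allowed to read it:

    # Create a demo secret, then decode it - base64 is an encoding, not encryption\nkubectl create secret generic demo-db-creds -n <NAMESPACE> --from-literal=password='not-really-secret'\nkubectl get secret demo-db-creds -n <NAMESPACE> -o jsonpath='{.data.password}' | base64 --decode\n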

    There is a secure method to expose these secrets in EKS through the Secrets Store CSI Driver.

    The Secrets Store CSI Driver integrates with a secret store like AWS Secrets Manager and mounts the secrets as a volume that can be accessed from your application code. This document describes how to set up and use AWS Secrets Manager with EMR on EKS through the Secrets Store CSI Driver.

    "},{"location":"security/docs/spark/secrets/#deploy-secrets-store-csi-drivers-and-aws-secrets-and-configuration-provider","title":"Deploy Secrets Store CSI Drivers and AWS Secrets and Configuration Provider","text":""},{"location":"security/docs/spark/secrets/#secrets-store-csi-drivers","title":"Secrets Store CSI Drivers","text":"

    Configure EKS Cluster with Secrets Store CSI Driver.

    To learn more about AWS Secrets Manager CSI Driver you can refer to this link

    helm repo add secrets-store-csi-driver \\\n  https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts\n\nhelm install -n kube-system csi-secrets-store \\\n  --set syncSecret.enabled=true \\\n  --set enableSecretRotation=true \\\n  secrets-store-csi-driver/secrets-store-csi-driver\n

    Deploy the AWS Secrets and Configuration Provider to use AWS Secrets Manager

    "},{"location":"security/docs/spark/secrets/#aws-secrets-and-configuration-provider","title":"AWS Secrets and Configuration Provider","text":"
    kubectl apply -f https://raw.githubusercontent.com/aws/secrets-store-csi-driver-provider-aws/main/deployment/aws-provider-installer.yaml\n
    "},{"location":"security/docs/spark/secrets/#define-the-secretproviderclass","title":"Define the SecretProviderClass","text":"

    The SecretProviderClass is how you present your secret in Kubernetes; below you will find a definition of a SecretProviderClass. There are a few parameters that are important:

    • The provider must be set to aws.
    • The objectName must be the name of the secret you want to use as defined in AWS. Here the secret is called db-creds.
    • The objectType must be set to secretsmanager.
    cat > db-cred.yaml << EOF\n\napiVersion: secrets-store.csi.x-k8s.io/v1\nkind: SecretProviderClass\nmetadata:\n  name: mysql-spark-secret\nspec:\n  provider: aws\n  parameters:\n    objects: |\n        - objectName: \"db-creds\"\n          objectType: \"secretsmanager\"\nEOF\n
    kubectl apply -f db-cred.yaml -n <NAMESPACE>\n

    In the terminal, apply the above manifest to create the SecretProviderClass. The kubectl command must include the namespace where your job will be executed.

    "},{"location":"security/docs/spark/secrets/#pod-template","title":"Pod Template","text":"

    In the executor pod template, you should define the volume as follows to mount the secret. The example below shows how you can define it. There are a few points that are important for mounting the secret:

    • secretProviderClass: this should have the same name as the one defined above. In this case it is mysql-spark-secret.
    • mountPath: this is where the secret is going to be available to the pod. In this example it will be in /var/secrets. When defining the mountPath, make sure you do not specify one of the paths reserved by EMR on EKS as defined here.
    apiVersion: v1\nkind: Pod\n\nspec:\n  containers:\n    - name: spark-kubernetes-executors\n      volumeMounts:\n        - mountPath: \"/var/secrets\"\n          name: mysql-cred\n          readOnly: true\n  volumes:\n      - name: mysql-cred\n        csi:\n          driver: secrets-store.csi.k8s.io\n          readOnly: true\n          volumeAttributes:\n            secretProviderClass: mysql-spark-secret\n

    This pod template must be uploaded to S3 and referenced in the job submit command as shown below.
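
    For example, assuming the pod template above was saved locally as executor-pod-template.yaml, it could be uploaded with the AWS CLI (the bucket and prefix are placeholders):

    aws s3 cp executor-pod-template.yaml s3://<bucket>/<prefix>/executor-pod-template.yaml\n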

    Note: You must make sure that the RDS instance or your database allows traffic from the instances where your driver and executor pods are running.

    "},{"location":"security/docs/spark/secrets/#pyspark-code","title":"PySpark code","text":"

    The example below shows PySpark code for connecting to a MySQL database. The example assumes the secret is stored in AWS Secrets Manager as defined above. The username is the key to retrieve the database user as stored in AWS Secrets Manager, and password is the key to retrieve the database password.

    It shows how you can retrieve the credentials from the mount point /var/secrets/. The secret is stored in a file with the same name as it is defined in AWS; in this case it is db-creds. This has been set in the pod template above.

    from pyspark.sql import SparkSession\nimport json\n\nsecret_path = \"/var/secrets/db-creds\"\n\nf = open(secret_path, \"r\")\nmySecretDict = json.loads(f.read())\n\nspark = SparkSession.builder.getOrCreate()\n\nstr_jdbc_url=\"jdbc:<DB endpoint>\"\nstr_Query= <QUERY>\nstr_username=mySecretDict['username']\nstr_password=mySecretDict['password']\ndriver = \"com.mysql.jdbc.Driver\"\n\njdbcDF = spark.read \\\n    .format(\"jdbc\") \\\n    .option(\"url\", str_jdbc_url) \\\n    .option(\"driver\", driver)\\\n    .option(\"query\", str_Query) \\\n    .option(\"user\", str_username) \\\n    .option(\"password\", str_password) \\\n    .load()\n\njdbcDF.show()\n
    "},{"location":"security/docs/spark/secrets/#execute-the-job","title":"Execute the job","text":"

    The command below can be used to run a job.

    Note: The supplied execution role MUST have an IAM policy attached that allows it to access the secret defined in the SecretProviderClass above. The IAM policy below shows the IAM actions that are needed.

    {\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [ {\n        \"Effect\": \"Allow\",\n        \"Action\": [\"secretsmanager:GetSecretValue\", \"secretsmanager:DescribeSecret\"],\n        \"Resource\": [<SECRET-ARN>]\n    }]\n}\n
        aws emr-containers start-job-run --virtual-cluster-id <EMR-VIRTUAL-CLUSTER-ID> --name spark-jdbc --execution-role-arn <EXECUTION-ROLE-ARN> --release-label emr-6.7.0-latest --job-driver '{\n    \"sparkSubmitJobDriver\": {\n    \"entryPoint\": \"<S3-URI-FOR-PYSPARK-JOB-DEFINED-ABOVE>\",\n    \"sparkSubmitParameters\": \"--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1 --conf spark.jars=<S3-URI-TO-MYSQL-JDBC-JAR>\"\n    }\n    }' --configuration-overrides '{\n    \"applicationConfiguration\": [\n    {\n    \"classification\": \"spark-defaults\", \n    \"properties\": {\n    \"spark.hadoop.hive.metastore.client.factory.class\": \"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory\",\n    \"spark.sql.catalogImplementation\": \"hive\",\n    \"spark.dynamicAllocation.enabled\":\"true\",\n    \"spark.dynamicAllocation.minExecutors\": \"8\",\n    \"spark.dynamicAllocation.maxExecutors\": \"40\",\n    \"spark.kubernetes.allocation.batch.size\": \"8\",\n    \"spark.dynamicAllocation.executorAllocationRatio\": \"1\",\n    \"spark.dynamicAllocation.shuffleTracking.enabled\": \"true\",\n    \"spark.dynamicAllocation.shuffleTracking.timeout\": \"300s\",\n    \"spark.kubernetes.driver.podTemplateFile\":<S3-URI-TO-DRIVER-POD-TEMPLATE>,\n    \"spark.kubernetes.executor.podTemplateFile\":<S3-URI-TO-EXECUTOR-POD-TEMPLATE>\n    }\n    }\n    ],\n    \"monitoringConfiguration\": {\n        \"persistentAppUI\": \"ENABLED\",\n        \"cloudWatchMonitoringConfiguration\": {\n            \"logGroupName\": \"/aws/emr-containers/\",\n            \"logStreamNamePrefix\": \"default\"\n        }\n    }\n    }'\n
    "},{"location":"storage/docs/spark/ebs/","title":"Mount EBS Volume to spark driver and executor pods","text":"

    Amazon EBS volumes can be mounted on Spark driver and executor pods through static and dynamic provisioning.

    EKS support for EBS CSI driver

    Documentation for EBS CSI driver

    "},{"location":"storage/docs/spark/ebs/#static-provisioning","title":"Static Provisioning","text":""},{"location":"storage/docs/spark/ebs/#eks-admin-tasks","title":"EKS Admin Tasks","text":"

    First, create your EBS volumes:

    aws ec2 --region <region> create-volume --availability-zone <availability zone> --size 50\n{\n    \"AvailabilityZone\": \"<availability zone>\", \n    \"MultiAttachEnabled\": false, \n    \"Tags\": [], \n    \"Encrypted\": false, \n    \"VolumeType\": \"gp2\", \n    \"VolumeId\": \"<vol -id>\", \n    \"State\": \"creating\", \n    \"Iops\": 150, \n    \"SnapshotId\": \"\", \n    \"CreateTime\": \"2020-11-03T18:36:21.000Z\", \n    \"Size\": 50\n}\n

    Create a Persistent Volume (PV) that hardcodes the EBS volume created above:

    cat > ebs-static-pv.yaml << EOF\napiVersion: v1\nkind: PersistentVolume\nmetadata:\n  name: ebs-static-pv\nspec:\n  capacity:\n    storage: 5Gi\n  accessModes:\n    - ReadWriteOnce\n  storageClassName: gp2\n  awsElasticBlockStore:\n    fsType: ext4\n    volumeID: <vol -id>\nEOF\n\nkubectl apply -f ebs-static-pv.yaml -n <namespace>\n

    Create Persistent Volume Claim(PVC) for the Persistent Volume created above:

    cat > ebs-static-pvc.yaml << EOF\nkind: PersistentVolumeClaim\napiVersion: v1\nmetadata:\n  name: ebs-static-pvc\nspec:\n  accessModes:\n    - ReadWriteOnce\n  resources:\n    requests:\n      storage: 5Gi\n  volumeName: ebs-static-pv\nEOF\n\nkubectl apply -f ebs-static-pvc.yaml -n <namespace>\n

    The PVC ebs-static-pvc can be used by the Spark developer to mount the volume into the Spark pods.

    NOTE: Pods running on EKS worker nodes can only attach to an EBS volume provisioned in the same AZ as the EKS worker node. Use node selectors to schedule pods on EKS worker nodes in the specified AZ.
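
    For example, a hedged sketch of pinning the driver and executors to the Availability Zone of the pre-provisioned volume using Spark's node selector configuration; the zone value is a placeholder:

    --conf spark.kubernetes.node.selector.topology.kubernetes.io/zone=<availability zone>\n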

    "},{"location":"storage/docs/spark/ebs/#spark-developer-tasks","title":"Spark Developer Tasks","text":"

    Request

    cat >spark-python-in-s3-ebs-static-localdir.json << EOF\n{\n  \"name\": \"spark-python-in-s3-ebs-static-localdir\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count-fsx.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5 --conf spark.executor.instances=10 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6 \"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-sparkspill.options.claimName\":\"ebs-static-pvc\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-sparkspill.mount.path\":\"/var/spark/spill/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-sparkspill.mount.readOnly\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-ebs-static-localdir.json\n

    Observed Behavior: When the job starts, the pre-provisioned EBS volume is mounted to the driver pod. You can exec into the driver container to verify that the EBS volume is mounted. You can also verify the mount from the driver pod's spec.

    kubectl get pod <driver pod name> -n <namespace> -o yaml --export\n
    "},{"location":"storage/docs/spark/ebs/#dynamic-provisioning","title":"Dynamic Provisioning","text":"

    Dynamic provisioning of volumes is supported for both driver and executors for EMR versions >= 6.3.0.

    "},{"location":"storage/docs/spark/ebs/#eks-admin-tasks_1","title":"EKS Admin Tasks","text":"

    Create EBS Storage Class

    cat >demo-gp2-sc.yaml << EOF\napiVersion: storage.k8s.io/v1\nkind: StorageClass\nmetadata:\n  name: demo-gp2-sc\nprovisioner: kubernetes.io/aws-ebs\nparameters:\n  type: gp2\nreclaimPolicy: Retain\nallowVolumeExpansion: true\nmountOptions:\n  - debug\nvolumeBindingMode: Immediate\nEOF\n\nkubectl apply -f demo-gp2-sc.yaml\n

    Create a Persistent Volume Claim (PVC) for the EBS storage class demo-gp2-sc:

    cat >ebs-demo-gp2-claim.yaml <<EOF\napiVersion: v1\nkind: PersistentVolumeClaim\nmetadata:\n  name: ebs-demo-gp2-claim\n  labels:\n    app: chicago\nspec:\n  storageClassName: demo-gp2-sc\n  accessModes:\n    - ReadWriteOnce\n  resources:\n    requests:\n      storage: 100Gi\nEOF\n\nkubectl apply -f ebs-demo-gp2-claim.yaml -n <namespace>\n
    "},{"location":"storage/docs/spark/ebs/#spark-developer-tasks_1","title":"Spark Developer Tasks","text":"

    Request

    cat >spark-python-in-s3-ebs-dynamic-localdir.json << EOF\n{\n  \"name\": \"spark-python-in-s3-ebs-dynamic-localdir\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count-fsx.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5 --conf spark.executor.instances=10 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-sparkspill.options.claimName\":\"ebs-demo-gp2-claim\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-sparkspill.mount.path\":\"/var/spark/spill/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-sparkspill.mount.readOnly\":\"false\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-sparkspill.options.claimName\":\"ebs-demo-gp2-claim\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-sparkspill.mount.path\":\"/var/spark/spillexec/\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-sparkspill.mount.readOnly\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-ebs-dynamic-localdir.json\n

    Observed Behavior: When the job gets started an EBS volume is provisioned dynamically by the EBS CSI driver and mounted to the driver and executor pods. You can exec into the driver / executor container to verify that the EBS volume is mounted. Also, you can verify the mount from driver / executor pod spec.

    kubectl get pod <driver pod name> -n <namespace> -o yaml --export\n
    "},{"location":"storage/docs/spark/fsx-lustre/","title":"EMR Containers integration with FSx for Lustre","text":"

    Amazon EKS clusters provide the compute and ephemeral storage for Spark workloads. Ephemeral storage provided by EKS is allocated from the EKS worker node's disk storage, and the lifecycle of the storage is bound to the lifecycle of the driver and executor pods.

    Need for durable storage: When multiple Spark applications are executed as part of a data pipeline, there are scenarios where data from one Spark application is passed to subsequent Spark applications - in this case data can be persisted in S3. Alternatively, this data can also be persisted in FSx for Lustre. FSx for Lustre provides a fully managed, scalable, POSIX-compliant native filesystem interface for data in S3. With FSx, your storage is decoupled from your compute and has its own lifecycle.

    FSx for Lustre Volumes can be mounted on spark driver and executor pods through static and dynamic provisioning.

    Data used in the below example is from AWS Open data Registry

    "},{"location":"storage/docs/spark/fsx-lustre/#fsx-for-lustre-posix-permissions","title":"FSx for Lustre POSIX permissions","text":"

    When a Lustre filesystem is mounted to driver and executor pods, and the S3 objects do not have the required metadata, the mounted file system defaults ownership to root. EMR on EKS runs the driver and executor pods with UID 999, GID 1000 and groups 1000 and 65534. In this scenario, the Spark application has read-only access to the mounted Lustre file system. Below are a few approaches that can be considered:

    "},{"location":"storage/docs/spark/fsx-lustre/#tag-metadata-to-s3-object","title":"Tag Metadata to S3 object","text":"

    Applications writing to S3 can tag the S3 objects with the metadata that FSx for Lustre requires.

    Walkthrough: Attaching POSIX permissions when uploading objects into an S3 bucket provides a guided tutorial. FSx for Lustre will convert this tagged metadata to corresponding POSIX permissions when mounting Lustre file system to the driver and executor pods.

    EMR on EKS spawns the driver and executor pods as a non-root user (UID 999, GID 1000, groups 1000 and 65534). To enable the Spark application to write to the mounted file system, UID 999 can be made the file owner and the supplemental group 65534 the file group.

    For S3 objects that already exist with no metadata tagging, there can be a process that recursively tags all the S3 objects with the required metadata. Below is an example:
    1. Create an FSx for Lustre file system linked to the S3 prefix.
    2. Create a Persistent Volume and Persistent Volume Claim for the created FSx for Lustre file system.
    3. Run a pod as the root user with FSx for Lustre mounted via the PVC created in Step 2.

    ```\napiVersion: v1\nkind: Pod\nmetadata:\n  name: chmod-fsx-pod\n  namespace: test-demo\nspec:\n  containers:\n  - name: ownership-change\n    image: amazonlinux:2\n    command: [\"sh\", \"-c\", \"chown -hR +999:+65534 /data\"]\n    volumeMounts:\n    - name: persistent-storage\n      mountPath: /data\n  volumes:\n  - name: persistent-storage\n    persistentVolumeClaim:\n      claimName: fsx-static-root-claim\n```\n

    Run a data repository task with the import path and export path pointing to the same S3 prefix. This will export the POSIX permissions from the FSx for Lustre file system as metadata that is tagged on the S3 objects.
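
    A hedged AWS CLI sketch of such an export task is shown below; the file system id and path are placeholders, and the task can be polled until it completes:

    # Export POSIX metadata from the Lustre file system back to the linked S3 prefix\naws fsx create-data-repository-task \\\n  --file-system-id <filesystem id> \\\n  --type EXPORT_TO_REPOSITORY \\\n  --paths <path within the file system> \\\n  --report Enabled=false\n\n# Check progress of the task\naws fsx describe-data-repository-tasks\n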

    Now that the S3 objects are tagged with metadata, the spark application with FSx for Lustre filesystem mounted will have write access.

    "},{"location":"storage/docs/spark/fsx-lustre/#static-provisioning","title":"Static Provisioning","text":""},{"location":"storage/docs/spark/fsx-lustre/#provision-a-fsx-for-lustre-cluster","title":"Provision a FSx for Lustre cluster","text":"

    FSx for Lustre can also be provisioned through the AWS CLI.

    How to decide what type of FSx for Lustre file system you need? Create a security group to attach to the FSx for Lustre file system as described below. Points to note: the security group attached to the EKS worker nodes must allow inbound access on ports 988 and 1021-1023, and the security group specified when creating the FSx for Lustre filesystem must allow inbound access on ports 988 and 1021-1023.
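
    A minimal AWS CLI sketch of creating such a security group is shown below; the group name, VPC id and the EKS worker node security group id are placeholders:

    # Security group for the FSx for Lustre file system\naws ec2 create-security-group --group-name emr-eks-fsx-lustre-sg --description \"FSx for Lustre access from EKS workers\" --vpc-id <vpc-id>\n\n# Allow Lustre traffic (ports 988 and 1021-1023) from the EKS worker node security group\naws ec2 authorize-security-group-ingress --group-id <fsx-securitygroup-id> --protocol tcp --port 988 --source-group <eks-worker-securitygroup-id>\naws ec2 authorize-security-group-ingress --group-id <fsx-securitygroup-id> --protocol tcp --port 1021-1023 --source-group <eks-worker-securitygroup-id>\n\n# Allow the Lustre file servers to talk to each other on the same ports\naws ec2 authorize-security-group-ingress --group-id <fsx-securitygroup-id> --protocol tcp --port 988 --source-group <fsx-securitygroup-id>\naws ec2 authorize-security-group-ingress --group-id <fsx-securitygroup-id> --protocol tcp --port 1021-1023 --source-group <fsx-securitygroup-id>\n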

    FSx for Lustre provisioning through the AWS CLI:

    cat > fsxLustreConfig.json << EOF\n{\n    \"ClientRequestToken\": \"EMRContainers-fsxLustre-demo\", \n    \"FileSystemType\": \"LUSTRE\",\n    \"StorageCapacity\": 1200, \n    \"StorageType\": \"SSD\", \n    \"SubnetIds\": [\n        \"<subnet-id>\"\n    ], \n    \"SecurityGroupIds\": [\n        \"<securitygroup-id>\"\n    ], \n    \"LustreConfiguration\": {\n        \"ImportPath\": \"s3://<s3 prefix>/\", \n        \"ExportPath\": \"s3://<s3 prefix>/\", \n        \"DeploymentType\": \"PERSISTENT_1\", \n        \"AutoImportPolicy\": \"NEW_CHANGED\",\n        \"PerUnitStorageThroughput\": 200\n    }\n}\nEOF\n

    Run the aws-cli command to create the FSx for Lustre filesystem as below.

    aws fsx create-file-system --cli-input-json file:///fsxLustreConfig.json\n

    Response is as below

    {\n    \"FileSystem\": {\n        \"VpcId\": \"<vpc id>\", \n        \"Tags\": [], \n        \"StorageType\": \"SSD\", \n        \"SubnetIds\": [\n            \"<subnet-id>\"\n        ], \n        \"FileSystemType\": \"LUSTRE\", \n        \"CreationTime\": 1603752401.183, \n        \"ResourceARN\": \"<fsx resource arn>\", \n        \"StorageCapacity\": 1200, \n        \"LustreConfiguration\": {\n            \"CopyTagsToBackups\": false, \n            \"WeeklyMaintenanceStartTime\": \"7:11:30\", \n            \"DataRepositoryConfiguration\": {\n                \"ImportPath\": \"s3://<s3 prefix>\", \n                \"AutoImportPolicy\": \"NEW_CHANGED\", \n                \"ImportedFileChunkSize\": 1024, \n                \"Lifecycle\": \"CREATING\", \n                \"ExportPath\": \"s3://<s3 prefix>/\"\n            }, \n            \"DeploymentType\": \"PERSISTENT_1\", \n            \"PerUnitStorageThroughput\": 200, \n            \"MountName\": \"mvmxtbmv\"\n        }, \n        \"FileSystemId\": \"<filesystem id>\", \n        \"DNSName\": \"<filesystem id>.fsx.<region>.amazonaws.com\", \n        \"KmsKeyId\": \"arn:aws:kms:<region>:<account>:key/<key id>\", \n        \"OwnerId\": \"<account>\", \n        \"Lifecycle\": \"CREATING\"\n    }\n}\n
    "},{"location":"storage/docs/spark/fsx-lustre/#eks-admin-tasks","title":"EKS admin tasks","text":"
    1. Attach IAM policy to EKS worker node IAM role to enable access to FSx for Lustre - Mount FSx for Lustre on EKS and Create a Security Group for FSx for Lustre
    2. Install the FSx CSI Driver in EKS
    3. Configure Storage Class for FSx for Lustre
    4. Configure Persistent Volume and Persistent Volume Claim for FSx for Lustre

    The FSx for Lustre file system is created as described above in Provision a FSx for Lustre cluster. Once provisioned, a persistent volume, as specified below, is created with a direct (hard-coded) reference to the created Lustre file system. A Persistent Volume Claim for this persistent volume will always use the same file system.

    cat >fsxLustre-static-pv.yaml <<EOF\napiVersion: v1\nkind: PersistentVolume\nmetadata:\n  name: fsx-pv\nspec:\n  capacity:\n    storage: 1200Gi\n  volumeMode: Filesystem\n  accessModes:\n    - ReadWriteMany\n  mountOptions:\n    - flock\n  persistentVolumeReclaimPolicy: Recycle\n  csi:\n    driver: fsx.csi.aws.com\n    volumeHandle: <filesystem id>\n    volumeAttributes:\n      dnsname: <filesystem id>.fsx.<region>.amazonaws.com\n      mountname: mvmxtbmv\nEOF\n
    kubectl apply -f fsxLustre-static-pv.yaml\n

    Now, a Persistent Volume Claim (PVC) needs to be created that references the PV created above.

    cat >fsxLustre-static-pvc.yaml <<EOF\napiVersion: v1\nkind: PersistentVolumeClaim\nmetadata:\n  name: fsx-claim\n  namespace: ns1\nspec:\n  accessModes:\n    - ReadWriteMany\n  storageClassName: \"\"\n  resources:\n    requests:\n      storage: 1200Gi\n  volumeName: fsx-pv\nEOF\n
    kubectl apply -f fsxLustre-static-pvc.yaml -n <namespace registered with EMR on EKS Virtual Cluster>\n
    "},{"location":"storage/docs/spark/fsx-lustre/#spark-developer-tasks","title":"Spark Developer Tasks","text":"

    Now spark applications can use fsx-claim in their spark application config to mount the FSx for Lustre filesystem to driver and executor container volumes.

    cat >spark-python-in-s3-fsx.json <<EOF\n{\n  \"name\": \"spark-python-in-s3-fsx\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count-repartition-fsx.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n
    aws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3-fsx.json\n

    Expected Behavior: All spark jobs that are run with persistent volume claims as fsx-claim will mount to the statically created FSx for Lustre file system.

    Use case:

    1. A data pipeline consisting of 10 spark applications can all be mounted to the statically created FSx for Lustre file system and can write the intermediate output to a particular folder. The next spark job in the data pipeline that is dependent on this data can read from FSx for Lustre. Data that needs to be persisted beyond the scope of the data pipeline can be exported to S3 by creating data repository tasks
    2. Data that is used often by multiple spark applications can also be stored in FSx for Lustre for improved performance.
    "},{"location":"storage/docs/spark/fsx-lustre/#dynamic-provisioning","title":"Dynamic Provisioning","text":"

    An FSx for Lustre file system can be provisioned on demand. A StorageClass resource is created that provisions the FSx for Lustre file system dynamically. A PVC is created that refers to this storage class. Whenever a pod refers to the PVC, the storage class invokes the FSx for Lustre Container Storage Interface (CSI) driver to provision a Lustre file system dynamically on the fly. In this model, an FSx for Lustre file system of the scratch type is provisioned.

    "},{"location":"storage/docs/spark/fsx-lustre/#eks-admin-tasks_1","title":"EKS Admin Tasks","text":"
    1. Attach IAM policy to EKS worker node IAM role to enable access to FSx for Lustre - Mount FSx for Lustre on EKS and Create a Security Group for FSx for Lustre
    2. Install the FSx CSI Driver in EKS
    3. Configure Storage Class for FSx for Lustre (a sketch is shown after this list)
    4. Configure Persistent Volume Claim(fsx-dynamic-claim) for FSx for Lustre.
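
    For step 3, a hedged sketch of the fsx-sc StorageClass referenced by the claim below; the subnet and security group ids are placeholders, and the driver's default scratch deployment type is used:

    cat > fsx-storage-class.yaml << EOF\nkind: StorageClass\napiVersion: storage.k8s.io/v1\nmetadata:\n  name: fsx-sc\nprovisioner: fsx.csi.aws.com\nparameters:\n  subnetId: <subnet-id>\n  securityGroupIds: <securitygroup-id>\nEOF\n\nkubectl apply -f fsx-storage-class.yaml\n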

    Create PVC for dynamic provisioning with fsx-sc storage class.

    cat >fsx-dynamic-claim.yaml <<EOF\napiVersion: v1\nkind: PersistentVolumeClaim\nmetadata:\n  name: fsx-dynamic-claim\nspec:\n  accessModes:\n    - ReadWriteMany\n  storageClassName: fsx-sc\n  resources:\n    requests:\n      storage: 3600Gi\nEOF \n
    kubectl apply -f fsx-dynamic-claim.yaml -n <namespace registered with EMR on EKS Virtual Cluster>\n
    "},{"location":"storage/docs/spark/fsx-lustre/#spark-developer-tasks_1","title":"Spark Developer Tasks","text":"
    cat >spark-python-in-s3-fsx-dynamic.json << EOF\n{\n  \"name\": \"spark-python-in-s3-fsx-dynamic\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count-repartition-fsx.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.kubernetes.pyspark.pythonVersion=3 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6 --conf spark.sql.shuffle.partitions=1000\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.local.dir\":\"/var/spark/spill/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.options.claimName\":\"fsx-dynamic-claim\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.mount.path\":\"/var/spark/spill/\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.mount.readOnly\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n
    aws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3-fsx-dynamic.json\n

    Expected Result: The statically provisioned FSx for Lustre file system is mounted to /var/data/ on the driver pod as before. For the executors, a SCRATCH 1 deployment type FSx for Lustre file system is provisioned dynamically on the fly by the StorageClass that was created. There will be some latency before the first executor can start running, because the Lustre file system has to be created first. Once it is created, the same Lustre file system is mounted to all the executors. Also note that \"spark.local.dir\":\"/var/spark/spill/\" is used to force the executors to use this Lustre-backed folder for all spill and shuffle data. Once the Spark job is completed, the Lustre file system is deleted or retained based on the PVC configuration. This dynamically created Lustre file system is mapped to an S3 path just like the statically created file system. FSx-csi user guide

    "},{"location":"storage/docs/spark/instance-store/","title":"Instance Store Volumes","text":"

    When working with Spark workloads, it can be useful to run on instances powered by SSD instance store volumes to improve the performance of your jobs. This storage is located on disks that are physically attached to the host computer and can provide better performance compared to traditional EBS volumes. In the context of Spark, this is beneficial for wide transformations (e.g. JOIN, GROUP BY) that generate a significant amount of shuffle data, which Spark persists on the local filesystem of the instances where the executors are running.

    In this document, we highlight two approaches to leverage NVMe disks in your workloads when using EMR on EKS. For a list of instances supporting NVMe disks, see Instance store volumes in the Amazon EC2 documentation.

    "},{"location":"storage/docs/spark/instance-store/#mount-kubelet-pod-directory-on-nvme-disks","title":"Mount kubelet pod directory on NVMe disks","text":"

    The kubelet service manages the lifecycle of pod containers that are created using Kubernetes. When a pod is launched on an instance, an ephemeral volume is automatically created for the pod, and this volume is mapped to a subdirectory within the path /var/lib/kubelet of the host node. This volume folder exists for the lifetime of the K8s pod, and it will be automatically deleted once the pod ceases to exist.

    In order to leverage the NVMe disks attached to an EC2 node in our Spark applications, we should perform the following actions during node bootstrap:

    • Prepare the NVMe disks attached to the instance (format disks and create a partition)
    • Mount the /var/lib/kubelet/pods path on the NVMe

    By doing this, all local files generated by your Spark job (block manager data, shuffle data, etc.) will be automatically written to the NVMe disks. This way, you don't have to configure a Spark volume path when launching the pod (driver or executor). This approach is easier to adopt because it doesn't require any additional configuration in your job. In addition, once the job is completed, all the data stored in ephemeral volumes is automatically deleted when the EC2 instance is deleted.

    However, if you have multiple NVMe disks attached to the instance, you need to create a RAID 0 configuration across all the disks before mounting the /var/lib/kubelet/pods directory on the RAID partition. Without a RAID setup, it will not be possible to leverage the full disk capacity available on the node.

    The following example shows how to create a node group in your cluster using this approach. In order to prepare our NVMe disks, we can use the eksctl preBootstrapCommands definition while creating the node group. The script will perform the following actions:

    • For instances with a single NVMe disk, format the disk and create a Linux filesystem (e.g. ext4, xfs)
    • For instances with multiple NVMe disks, create a RAID 0 configuration across all available volumes

    Once the disks are formatted and ready to use, we mount the folder /var/lib/kubelet/pods on the prepared filesystem and set up the correct permissions. Below, you can find an example of an eksctl configuration to create a managed node group using this approach.

    Example

    apiVersion: eksctl.io/v1alpha5\nkind: ClusterConfig\n\nmetadata:\n  name: YOUR_CLUSTER_NAME\n  region: YOUR_REGION\n\nmanagedNodeGroups:\n  - name: ng-c5d-9xlarge\n    instanceType: c5d.9xlarge\n    desiredCapacity: 1\n    privateNetworking: true\n    subnets:\n      - YOUR_NG_SUBNET\n    preBootstrapCommands: # commands executed as root\n      - yum install -y mdadm nvme-cli\n      - nvme_disks=($(nvme list | grep \"Amazon EC2 NVMe Instance Storage\" | awk -F'[[:space:]][[:space:]]+' '{print $1}')) && [[ ${#nvme_disks[@]} -eq 1 ]] && mkfs.ext4 -F ${nvme_disks[*]} && systemctl stop docker && mkdir -p /var/lib/kubelet/pods && mount ${nvme_disks[*]} /var/lib/kubelet/pods && chmod 750 /var/lib/docker && systemctl start docker\n      - nvme_disks=($(nvme list | grep \"Amazon EC2 NVMe Instance Storage\" | awk -F'[[:space:]][[:space:]]+' '{print $1}')) && [[ ${#nvme_disks[@]} -ge 2 ]] && mdadm --create --verbose /dev/md0 --level=0 --raid-devices=${#nvme_disks[@]} ${nvme_disks[*]} && mkfs.ext4 -F /dev/md0 && systemctl stop docker && mkdir -p /var/lib/kubelet/pods && mount /dev/md0 /var/lib/kubelet/pods && chmod 750 /var/lib/docker && systemctl start docker\n
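
    After a node from this group joins the cluster, a quick sanity check on the host (assuming SSH or SSM access to the node) confirms the mount and, on multi-disk instances, the RAID array:

    df -h /var/lib/kubelet/pods\ncat /proc/mdstat   # an md0 device is listed only when multiple NVMe disks were combined with RAID 0\n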

    Benefits

    • No need to mount the disk using Spark configurations or pod templates
    • Data generated by the application is deleted immediately when the pod terminates; data is also purged in case of pod failures.
    • One time configuration for the node group

    Cons

    • If multiple jobs are allocated on the same EC2 instance, contention of disk resources will occur because it is not possible to allocate instance store volume resources across jobs
    "},{"location":"storage/docs/spark/instance-store/#mount-nvme-disks-as-data-volumes","title":"Mount NVMe disks as data volumes","text":"

    In this section, we're going to explicitly mount the instance store volumes and reference them as mount paths in the Spark configuration for drivers and executors.

    As in the previous example, this script automatically formats the instance store volumes and creates an xfs filesystem. The disks are then mounted at local folders named /spark_data_IDX, where IDX is an integer corresponding to the mounted disk.

    Example

    apiVersion: eksctl.io/v1alpha5\nkind: ClusterConfig\n\nmetadata:\n  name: YOUR_CLUSTER_NAME\n  region: YOUR_REGION\n\nmanagedNodeGroups:\n  - name: ng-m5d-4xlarge\n    instanceType: m5d.4xlarge\n    desiredCapacity: 1\n    privateNetworking: true\n    subnets:\n      - YOUR_NG_SUBNET\n    preBootstrapCommands: # commands executed as root\n      - \"IDX=1;for DEV in /dev/nvme[1-9]n1;do mkfs.xfs ${DEV}; mkdir -p /spark_data_${IDX}; echo ${DEV} /spark_data_${IDX} xfs defaults,noatime 1 2 >> /etc/fstab; IDX=$((${IDX} + 1)); done\"\n      - \"mount -a\"\n      - \"chown 999:1000 /spark_data_*\"\n
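
    Once a node is up, the mounted volumes can be verified on the host with a quick check (the exact device names vary by instance type; an m5d.4xlarge exposes two instance store disks):

    df -h | grep spark_data\n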

    In order to successfully use these ephemeral volumes within Spark, you need to specify additional configurations, and the mounted volume name must start with spark-local-dir-.

    Below is an example configuration, provided during EMR on EKS job submission, that shows how to configure Spark to use two volumes as local storage for the job.

    Spark Configurations

    {\n  \"name\": ....,\n  \"virtualClusterId\": ....,\n  \"executionRoleArn\": ....,\n  \"releaseLabel\": ....,\n  \"jobDriver\": ....,\n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\",\n        \"properties\": {\n          \"spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path\": \"/spark_data_1\",\n          \"spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly\": \"false\",\n          \"spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path\": \"/spark_data_1\",\n          \"spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path\": \"/spark_data_2\",\n          \"spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.readOnly\": \"false\",\n          \"spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path\": \"/spark_data_2\"\n        }\n      }\n    ]\n  }\n}\n

    Please note that this approach requires the following configurations for each volume you want to use (IDX is a label identifying the mounted volume):

    # Mount path on the host node\nspark.kubernetes.executor.volumes.hostPath.spark-local-dir-IDX.options.path\n\n# Mount path on the k8s pod\nspark.kubernetes.executor.volumes.hostPath.spark-local-dir-IDX.mount.path\n\n# (boolean) Should be defined as false to allow Spark to write in the path\nspark.kubernetes.executor.volumes.hostPath.spark-local-dir-IDX.mount.readOnly\n

    Benefits

    • You can allocate dedicated instance store volume resources across your Spark jobs (for example, let's take a scenario where an EC2 instance has two instance store volumes; if you run two Spark jobs on this node, you can dedicate one volume per Spark job)

    Cons

    • Additional configurations are required for Spark jobs to use instance store volumes. This approach can be error-prone if you don't control the instance types being used (for example, multiple node groups with different instance types). You can mitigate this issue by using K8s node selectors to pin jobs to a specific instance type in your Spark configuration: spark.kubernetes.node.selector.node.kubernetes.io/instance-type (see the sketch after this list)
    • Data created on the volumes is automatically deleted once the job is completed and the instance is terminated. However, you need to take extra measures to delete the data on instance store volumes if the EC2 instance is re-used or not terminated.
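
    For reference, the node-selector mitigation mentioned above can be expressed as a single extra property in the spark-defaults classification (the instance type value is just an example):

    \"spark.kubernetes.node.selector.node.kubernetes.io/instance-type\": \"m5d.4xlarge\"\n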
    "},{"location":"submit-applications/docs/spark/multi-arch-image/","title":"Build a Multi-architecture Docker Image Supporting arm64 & amd64","text":""},{"location":"submit-applications/docs/spark/multi-arch-image/#pre-requisites","title":"Pre-requisites","text":"

    We can complete all the steps either from a local desktop or using AWS Cloud9. If you're using AWS Cloud9, follow the instructions in the \"Setup AWS Cloud9\" section to create and configure the environment first; otherwise skip to the next section.

    "},{"location":"submit-applications/docs/spark/multi-arch-image/#setup-aws-cloud9","title":"Setup AWS Cloud9","text":"

    AWS Cloud9 is a cloud-based IDE that lets you write, run, and debug your code via just a browser. AWS Cloud9 comes preconfigured with some of the AWS dependencies we require to build our application, such as the AWS CLI.

    "},{"location":"submit-applications/docs/spark/multi-arch-image/#1-create-a-cloud9-instance","title":"1. Create a Cloud9 instance","text":"

    Instance type - Create an AWS Cloud9 environment from the AWS Management Console with an instance type of t3.small or larger. In our example, we used m5.xlarge for adequate memory and CPU to compile and build a large docker image.

    VPC - Follow the launch wizard and provide the required name. To interact with an existing EKS cluster in the same region later on, we recommend using the same VPC as your EKS cluster for the Cloud9 environment. Leave the remaining default values as they are.

    Storage size - You must increase the Cloud9 EBS volume size (pre-attached to your AWS Cloud9 instance) to 30+ GB, because the default disk space (10 GB with ~72% used) is not enough for building a container image. Refer to the Resize an Amazon EBS volume used by an environment document and download the script resize.sh to your Cloud9 environment.

    touch resize.sh\n# Double-click the file name in Cloud9\n# Copy and paste the content from the official document into your file, then save and close it\n

    Validate that the disk size is currently 10 GB:

    admin:~/environment $ df -h\nFilesystem        Size  Used Avail Use% Mounted on\ndevtmpfs          4.0M     0  4.0M   0% /dev\ntmpfs             951M     0  951M   0% /dev/shm\ntmpfs             381M  5.3M  376M   2% /run\n/dev/nvme0n1p1     10G  7.2G  2.9G  72% /\ntmpfs             951M   12K  951M   1% /tmp\n/dev/nvme0n1p128   10M  1.3M  8.7M  13% /boot/efi\ntmpfs             191M     0  191M   0% /run/user/1000\n

    Increase the disk size:

    bash resize.sh 30\n
    admin:~/environment $ df -h\nFilesystem        Size  Used Avail Use% Mounted on\ndevtmpfs          4.0M     0  4.0M   0% /dev\ntmpfs             951M     0  951M   0% /dev/shm\ntmpfs             381M  5.3M  376M   2% /run\n/dev/nvme0n1p1     30G  7.3G   23G  25% /\ntmpfs             951M   12K  951M   1% /tmp\n/dev/nvme0n1p128   10M  1.3M  8.7M  13% /boot/efi\ntmpfs             191M     0  191M   0% /run/user/1000\n
    "},{"location":"submit-applications/docs/spark/multi-arch-image/#2-install-docker-and-buildx-if-required","title":"2. Install Docker and Buildx if required","text":"
    • Installing Docker - a Cloud9 EC2 instance comes with a Docker daemon pre-installed. Outside of Cloud9, your environment may or may not have Docker installed. If needed, follow the instructions on the Docker Desktop page to install it.

    • Installing Buildx (pre-installed in Cloud9) - To build a single multi-arch Docker image (x86_64 and arm64), we may or may not need to install an extra Buildx plugin that extends the Docker CLI to support the multi-architecture feature. Docker Buildx is installed by default with Docker Engine version 23.0 and later. For an earlier version, you need to grab a binary from the GitHub repository and install it manually, or get it from a separate package. See the docker/buildx README for more information.

    Once the buildx CLI is available, we can create a builder instance which gives access to the new multi-architecture features. You only have to perform this task once.

    # create a builder\ndocker buildx create --name mybuilder --use\n# boot up the builder and inspect\ndocker buildx inspect --bootstrap\n\n\n# list builder instances\n# the asterisk (*) next to a builder name indicates the selected builder.\ndocker buildx ls\n

    If your builder doesn't support QEMU, only a limited set of platform types is supported, as shown below. For example, the builder instance just created in Cloud9 doesn't support QEMU, so we can't build the Docker image for the arm64 CPU type yet.

    NAME/NODE       DRIVER/ENDPOINT      STATUS   BUILDKIT PLATFORMS\ndefault        docker\ndefault       default              running  v0.11.6  linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386\nmybuilder *    docker-container\nmy_builder0   default              running  v0.11.6  linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386\n
    • Installing QEMU for Cloud9 - Building multi-platform images under emulation with QEMU is the easiest way to get started if your builder already supports it. However, AWS Cloud9 isn't preconfigured with binfmt_misc support, so we must install the compiled QEMU binaries. The installation can easily be done via the docker run CLI:
     docker run --privileged --rm tonistiigi/binfmt --install all\n

    List the builder instances again. Now we see that the full list of platforms is supported, including Arm-based CPUs:

    docker buildx ls\n\nNAME/NODE     DRIVER/ENDPOINT             STATUS   BUILDKIT PLATFORMS\nmybuilder *   docker-container                              \n  mybuilder20 unix:///var/run/docker.sock running  v0.13.2  linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6     \ndefault       docker                                        \n  default     default                     running  v0.12.5  linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/386, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6\n
    "},{"location":"submit-applications/docs/spark/multi-arch-image/#build-a-docker-image-supporting-multi-arch","title":"Build a docker image supporting multi-arch","text":"

    In this example, we will create a spark-benchmark-utility container image. We are going to reuse the source code from the EMR on EKS benchmark Github repo.

    "},{"location":"submit-applications/docs/spark/multi-arch-image/#1-download-the-source-code-from-the-github","title":"1. Download the source code from the Github:","text":"
    git clone https://github.com/aws-samples/emr-on-eks-benchmark.git\ncd emr-on-eks-benchmark\n
    "},{"location":"submit-applications/docs/spark/multi-arch-image/#2-setup-required-environment-variables","title":"2. Setup required environment variables","text":"

    We will build an image to test EMR 6.15's performance. The equivalent versions are Spark 3.4.1 and Hadoop 3.3.6, matching the exports below. Change them accordingly if needed.

    export SPARK_VERSION=3.4.1\nexport HADOOP_VERSION=3.3.6\n

    Log in to your own Amazon ECR registry:

    export AWS_REGION=us-east-1\nexport ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)\nexport ECR_URL=$ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com\n\naws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $ECR_URL\n
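
    If the target repositories do not exist yet in your registry, create them before pushing (the repository names below simply match the image tags used in the following steps):

    aws ecr create-repository --repository-name spark --region $AWS_REGION\naws ecr create-repository --repository-name eks-spark-benchmark --region $AWS_REGION\n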
    "},{"location":"submit-applications/docs/spark/multi-arch-image/#3-build-oss-spark-base-image-if-required","title":"3. Build OSS Spark base image if required","text":"

    If you want to test open-source Apache Spark's performance, build a base Spark image first. Otherwise skip this step.

    docker buildx build --platform linux/amd64,linux/arm64 \\\n-t $ECR_URL/spark:${SPARK_VERSION}_hadoop_${HADOOP_VERSION} \\\n-f docker/hadoop-aws-3.3.1/Dockerfile \\\n--build-arg HADOOP_VERSION=$HADOOP_VERSION --build-arg SPARK_VERSION=$SPARK_VERSION --push .\n
    "},{"location":"submit-applications/docs/spark/multi-arch-image/#4-get-emr-spark-base-image-from-aws","title":"4. Get EMR Spark base image from AWS","text":"
    export SRC_ECR_URL=755674844232.dkr.ecr.us-east-1.amazonaws.com\naws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin $SRC_ECR_URL\n\ndocker pull $SRC_ECR_URL/spark/emr-6.15.0:latest\n
    "},{"location":"submit-applications/docs/spark/multi-arch-image/#5-build-the-benchmark-utility-image","title":"5. Build the Benchmark Utility image","text":"

    Build and push the Docker image based on the OSS Spark engine built before (Step #3):

    docker buildx build --platform linux/amd64,linux/arm64 \\\n-t $ECR_URL/spark:${SPARK_VERSION}_hadoop_${HADOOP_VERSION} \\\n-f docker/benchmark-util/Dockerfile \\\n--build-arg SPARK_BASE_IMAGE=$ECR_URL/spark:${SPARK_VERSION}_hadoop_${HADOOP_VERSION} \\\n--push .\n

    Build and push the benchmark Docker image based on EMR's Spark runtime (Step #4):

    docker buildx build --platform linux/amd64,linux/arm64 \\\n-t $ECR_URL/eks-spark-benchmark:emr6.15 \\\n-f docker/benchmark-util/Dockerfile \\\n--build-arg SPARK_BASE_IMAGE=$SRC_ECR_URL/spark/emr-6.15.0:latest \\\n--push .\n
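
    To confirm that both architectures were actually published, you can inspect the pushed manifest (an optional check; the tag matches the one used above):

    docker buildx imagetools inspect $ECR_URL/eks-spark-benchmark:emr6.15\n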
    "},{"location":"submit-applications/docs/spark/multi-arch-image/#benchmark-application-based-on-the-docker-images-built","title":"Benchmark application based on the docker images built","text":"

    Based on the multi-arch Docker images built previously, you can now run benchmark applications on both Intel- and Arm-based CPU nodes.

    In Cloud9, the following extra steps are required to configure the environment before you can submit the applications.

    1. Install the kubectl/helm/eksctl CLI tools. Refer to this sample script.

    2. Modify the IAM role attached to the Cloud9 EC2 instance so that it has enough privileges to assume an EKS cluster's admin role, or has permission to submit jobs against the EKS cluster.

    3. Upgrade AWS CLI and turn off the AWS managed temporary credentials in Cloud9:

    curl \"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip\" -o \"awscliv2.zip\"\nunzip awscliv2.zip\nsudo ./aws/install --update\n/usr/local/bin/aws cloud9 update-environment  --environment-id $C9_PID --managed-credentials-action DISABLE\nrm -vf ${HOME}/.aws/credentials\n
    4. Connect to the EKS cluster
    # a sample connection string\naws eks update-kubeconfig --name YOUR_EKS_CLUSTER_NAME --region us-east-1 --role-arn arn:aws:iam::ACCOUNTID:role/SparkOnEKS-iamrolesclusterAdmin-xxxxxx\n\n# validate the connection\nkubectl get svc\n
    "},{"location":"submit-applications/docs/spark/pyspark/","title":"Pyspark Job submission","text":"

    A Python interpreter is bundled in the EMR on EKS Spark image that is used to run the Spark job. Python code and dependencies can be provided with the options below.

    "},{"location":"submit-applications/docs/spark/pyspark/#python-code-self-contained-in-a-single-py-file","title":"Python code self contained in a single .py file","text":"

    To start with the simplest scenario, the example below shows how to submit a pi.py file that is self-contained and doesn't need any other dependencies.

    "},{"location":"submit-applications/docs/spark/pyspark/#python-file-from-s3","title":"Python file from S3","text":"

    Request: pi.py used in the below request payload is from the Spark examples.

    cat > spark-python-in-s3.json << EOF\n{\n  \"name\": \"spark-python-in-image\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/pi.py\", \n       \"sparkSubmitParameters\": \"--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=4\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3.json\n
    "},{"location":"submit-applications/docs/spark/pyspark/#python-file-from-mounted-volume","title":"Python file from mounted volume","text":"

    In the example below, pi.py is placed in a mounted volume. An FSx for Lustre filesystem is mounted as a Persistent Volume on the driver pod under /var/data/ and is referenced with the local:// file prefix. For more information on how to mount FSx for Lustre, see EMR-Containers-integration-with-FSx-for-Lustre.

    This approach can be used to provide Spark application code and dependencies for execution. A Persistent Volume mounted to the driver and executor pods lets you access the application code and dependencies with the local:// prefix.

    cat > spark-python-in-FSx.json <<EOF\n{\n  \"name\": \"spark-python-in-FSx\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"local:///var/data/FSxLustre-pi.py\", \n       \"sparkSubmitParameters\": \"--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-Fsx.json\n
    "},{"location":"submit-applications/docs/spark/pyspark/#python-code-with-python-dependencies","title":"Python code with python dependencies","text":"

    Info

    boto3 will only work with 'Bundled as a .pex file' or with 'Custom docker image'

    "},{"location":"submit-applications/docs/spark/pyspark/#list-of-py-files","title":"List of .py files","text":"

    This is not a scalable approach, as the number of dependent files can grow large and you also need to manually specify all the transitive dependencies.

    cat > py-files-pi.py <<EOF\nfrom __future__ import print_function\n\nimport sys\nfrom random import random\nfrom operator import add\n\nfrom pyspark.sql import SparkSession\nfrom pyspark import SparkContext\n\nimport dependentFunc\n\nif __name__ == \"__main__\":\n    \"\"\"\n        Usage: pi [partitions]\n    \"\"\"\n    spark = SparkSession.builder.getOrCreate()\n    sc = spark.sparkContext\n    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2\n    n = 100000 * partitions\n\n    def f(_):\n        x = random() * 2 - 1\n        y = random() * 2 - 1\n        return 1 if x ** 2 + y ** 2 <= 1 else 0\n\n    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)\n    dependentFunc.message()\n    print(\"Pi is roughly %f\" % (4.0 * count / n))\n\n    spark.stop()\nEOF\n
    cat > dependentFunc.py <<EOF\ndef message():\n  print(\"Printing from inside the dependent python file\")\n\nEOF\n

    Upload dependentFunc.py and py-files-pi.py to S3.
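
    For example, the two files can be uploaded with the AWS CLI (the bucket and prefix are placeholders):

    aws s3 cp dependentFunc.py s3://<s3 prefix>/\naws s3 cp py-files-pi.py s3://<s3 prefix>/\n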

    Request:

    cat > spark-python-in-s3-dependency-files << EOF\n{\n  \"name\": \"spark-python-in-s3-dependency-files\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/py-files-pi.py\", \n       \"sparkSubmitParameters\": \"--py-files s3://<s3 prefix>/dependentFunc.py --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-dependency-files.json\n
    "},{"location":"submit-applications/docs/spark/pyspark/#bundled-as-a-zip-file","title":"Bundled as a zip file","text":"

    In this approach all the dependent Python files are bundled as a zip file. Each folder should have an __init__.py file, as documented in zip python dependencies. Zip at the top folder level using the -r option.

    zip -r pyspark-packaged-dependency-src.zip . \n  adding: dependent/ (stored 0%)\n  adding: dependent/__init__.py (stored 0%)\n  adding: dependent/dependentFunc.py (deflated 7%)\n

    dependentFunc.py from the earlier example has been bundled as pyspark-packaged-dependency-src.zip. Upload this file to an S3 location.

    cat > py-files-zip-pi.py <<EOF\nfrom __future__ import print_function\n\nimport sys\nfrom random import random\nfrom operator import add\n\nfrom pyspark.sql import SparkSession\nfrom pyspark import SparkContext\n\nfrom dependent import dependentFunc\n\nif __name__ == \"__main__\":\n    \"\"\"\n        Usage: pi [partitions]\n    \"\"\"\n    spark = SparkSession.builder.getOrCreate()\n    sc = spark.sparkContext\n    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2\n    n = 100000 * partitions\n\n    def f(_):\n        x = random() * 2 - 1\n        y = random() * 2 - 1\n        return 1 if x ** 2 + y ** 2 <= 1 else 0\n\n    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)\n    dependentFunc.message()\n    print(\"Pi is roughly %f\" % (4.0 * count / n))\n\n    spark.stop()\nEOF\n

    Request:

    cat > spark-python-in-s3-dependency-zip.json <<EOF\n{\n  \"name\": \"spark-python-in-s3-dependency-zip\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/py-files-zip-pi.py\", \n       \"sparkSubmitParameters\": \"--py-files s3://<s3 prefix>/pyspark-packaged-dependency-src.zip --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n          }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-dependency-zip.json\n
    "},{"location":"submit-applications/docs/spark/pyspark/#bundled-as-a-egg-file","title":"Bundled as a .egg file","text":"

    Create a folder structure as in the below screenshot with the code from the previous example - py-files-zip-pi.py, dependentFunc.py

    Steps to create .egg file
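
    The setup.py referenced in the commands below is not shown here; a minimal hypothetical version that packages the dependent module could look like this:

    cat > setup.py <<EOF\nfrom setuptools import setup, find_packages\n\nsetup(\n    name=\"pyspark_packaged_example\",\n    version=\"0.0.3\",\n    packages=find_packages()\n)\nEOF\n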

    cd /pyspark-packaged-example\npip install setuptools\npython setup.py bdist_egg\n

    Upload dist/pyspark_packaged_example-0.0.3-py3.8.egg to an S3 location.

    Request:

    cat > spark-python-in-s3-dependency-egg.json <<EOF\n{\n  \"name\": \"spark-python-in-s3-dependency-egg\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/py-files-zip-pi.py\", \n       \"sparkSubmitParameters\": \"--py-files s3://<s3 prefix>/pyspark_packaged_example-0.0.3-py3.8.egg --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-dependency-egg.json\n
    "},{"location":"submit-applications/docs/spark/pyspark/#bundled-as-a-whl-file","title":"Bundled as a .whl file","text":"

    Create a folder structure as in the below screenshot with the code from the previous example - py-files-zip-pi.py, dependentFunc.py

    Steps to create .whl file

    cd /pyspark-packaged-example\npip install wheel\npython setup.py bdist_wheel\n

    Upload dist/pyspark_packaged_example-0.0.3-py3-none-any.whl to an S3 location.

    Request:

    cat > spark-python-in-s3-dependency-wheel.json <<EOF\n{\n  \"name\": \"spark-python-in-s3-dependency-wheel\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/py-files-zip-pi.py\", \n       \"sparkSubmitParameters\": \"--py-files s3://<s3 prefix>/pyspark_packaged_example-0.0.3-py3-none-any.whl --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-dependency-wheel.json\n
    "},{"location":"submit-applications/docs/spark/pyspark/#bundled-as-a-pex-file","title":"Bundled as a .pex file","text":"

    pex is a library for generating .pex (Python EXecutable) files, which are executable Python environments. PEX files can be created as below:

    docker run -it -v $(pwd):/workdir python:3.7.9-buster /bin/bash #python 3.7.9 is installed in EMR 6.1.0\npip3 install pex\npex --python=python3 --inherit-path=prefer -v numpy -o numpy_dep.pex\n
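
    Because a .pex file behaves like a Python interpreter with its dependencies baked in, you can optionally sanity-check it before uploading:

    ./numpy_dep.pex -c \"import numpy; print(numpy.__version__)\"\n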

    To read more about PEX, see: PEX, PEX documentation, Tips on PEX, and pex packaging for pyspark.

    Approach 1: Using Persistent Volume - FSx for Lustre cluster

    Upload numpy_dep.pex to an S3 location that is mapped to an FSx for Lustre file system. numpy_dep.pex can be placed on any Kubernetes persistent volume and mounted to the driver and executor pods. Request: kmeans.py used in the request below is from the Spark examples.

    cat > spark-python-in-s3-pex-fsx.json << EOF\n{\n  \"name\": \"spark-python-in-s3-pex-fsx\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/kmeans.py\",\n      \"entryPointArguments\": [\n        \"s3://<s3 prefix>/kmeans_data.txt\",\n        \"2\",\n        \"3\"\n       ], \n       \"sparkSubmitParameters\": \"--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.kubernetes.pyspark.pythonVersion\":\"3\",\n          \"spark.kubernetes.driverEnv.PEX_ROOT\":\"./tmp\",\n          \"spark.executorEnv.PEX_ROOT\":\"./tmp\",\n          \"spark.kubernetes.driverEnv.PEX_INHERIT_PATH\":\"prefer\",\n          \"spark.executorEnv.PEX_INHERIT_PATH\":\"prefer\",\n          \"spark.kubernetes.driverEnv.PEX_VERBOSE\":\"10\",\n          \"spark.kubernetes.driverEnv.PEX_PYTHON\":\"python3\",\n          \"spark.executorEnv.PEX_PYTHON\":\"python3\",\n          \"spark.pyspark.driver.python\":\"/var/data/numpy_dep.pex\",\n          \"spark.pyspark.python\":\"/var/data/numpy_dep.pex\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": { \n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\n\naws emr-containers start-job-run --cli-input-json file:////Spark-Python-in-s3-pex-fsx.json\n

    Approach 2: Using Custom Pod Templates

    Upload numpy_dep.pex to an S3 location. Create custom pod templates for the driver and executor pods. Custom pod templates allow running a command through initContainers before the main application container is created. In this case, the command downloads the numpy_dep.pex file to the /tmp/numpy_dep.pex path of the driver and executor pods.

    Note: This approach is only supported for release image 5.33.0 and later or 6.3.0 and later.

    Sample driver pod template YAML file:

    cat > driver_pod_template.yaml <<EOF\napiVersion: v1\nkind: Pod\nspec:\n containers:\n   - name: spark-kubernetes-driver\n initContainers: \n   - name: my-init-container\n     image: 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-5.33.0-20210323:2.4.7-amzn-1-vanilla\n     volumeMounts:\n       - name: temp-data-dir\n         mountPath: /tmp\n     command:\n       - sh\n       - -c\n       - aws s3api get-object --bucket <s3-bucket> --key <s3-key-prefix>/numpy_dep.pex /tmp/numpy_dep.pex && chmod u+x /tmp/numpy_dep.pex\nEOF\n

    Sample executor pod template YAML file:

    cat > executor_pod_template.yaml <<EOF\napiVersion: v1\nkind: Pod\nspec:\n  containers:\n    - name: spark-kubernetes-executor\n  initContainers: \n    - name: my-init-container\n      image: 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-5.33.0-20210323:2.4.7-amzn-1-vanilla\n      volumeMounts:\n        - name: temp-data-dir\n          mountPath: /tmp\n      command:\n        - sh\n        - -c\n        - aws s3api get-object --bucket <s3-bucket> --key <s3-key-prefix>/numpy_dep.pex /tmp/numpy_dep.pex && chmod u+x /tmp/numpy_dep.pex\nEOF\n

    Replace initContainer's image with the respective release label's container image. In this case we are using the image of release emr-5.33.0-latest. Upload the driver and executor custom pod templates to S3

    Request: kmeans.py used in the below request is from spark examples

    cat > spark-python-in-s3-pex-pod-templates.json << EOF\n{\n  \"name\": \"spark-python-in-s3-pex-pod-templates\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-5.33.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/kmeans.py\",\n      \"entryPointArguments\": [\n        \"s3://<s3 prefix>/kmeans_data.txt\",\n        \"2\",\n        \"3\"\n       ], \n       \"sparkSubmitParameters\": \"--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.kubernetes.pyspark.pythonVersion\":\"3\",\n          \"spark.kubernetes.driverEnv.PEX_ROOT\":\"./tmp\",\n          \"spark.executorEnv.PEX_ROOT\":\"./tmp\",\n          \"spark.kubernetes.driverEnv.PEX_INHERIT_PATH\":\"prefer\",\n          \"spark.executorEnv.PEX_INHERIT_PATH\":\"prefer\",\n          \"spark.kubernetes.driverEnv.PEX_VERBOSE\":\"10\",\n          \"spark.kubernetes.driverEnv.PEX_PYTHON\":\"python3\",\n          \"spark.executorEnv.PEX_PYTHON\":\"python3\",\n          \"spark.pyspark.driver.python\":\"/tmp/numpy_dep.pex\",\n          \"spark.pyspark.python\":\"/tmp/numpy_dep.pex\",\n          \"spark.kubernetes.driver.podTemplateFile\": \"s3://<s3-prefix>/driver_pod_template.yaml\",\n          \"spark.kubernetes.executor.podTemplateFile\": \"s3://<s3-prefix>/executor_pod_template.yaml\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": { \n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\n\naws emr-containers start-job-run --cli-input-json file:////Spark-Python-in-s3-pex-pod-templates.json\n

    Point to note: PEX files don't have a Python interpreter bundled with them. Using the PEX env variables, we point to the Python interpreter installed in the Spark driver and executor Docker image.

    pex vs conda-pack: A pex file contains only the dependent Python packages but not a Python interpreter, while a conda-pack environment includes a Python interpreter as well, so with the same Python packages a conda-pack environment is much larger than a pex file. A conda-pack environment is a tar.gz file and needs to be decompressed before being used, while a pex file can be used directly. If a suitable Python interpreter already exists, pex is a better option than conda-pack. However, conda-pack is the ONLY CHOICE if you need a specific version of the Python interpreter that does not exist and you do not have permission to install one (e.g., when you need to use a specific version of the Python interpreter with an enterprise PySpark cluster). If the pex file or conda-pack environment needs to be distributed to machines on demand, there is some overhead before running your application; with the same Python packages, a conda-pack environment has a larger overhead/latency than the pex file, as it is usually much larger and needs to be decompressed before being used.

    For more information - Tips on PEX

    "},{"location":"submit-applications/docs/spark/pyspark/#bundled-as-a-targz-file-with-conda-pack","title":"Bundled as a tar.gz file with conda-pack","text":"

    conda-pack for Spark: Install conda through Miniconda, then open a new terminal and execute the commands below:

    conda create -y -n example python=3.5 numpy\nconda activate example\npip install conda-pack\nconda pack -f -o numpy_environment.tar.gz\n

    Upload numpy_environment.tar.gz to an S3 location that is mapped to an FSx for Lustre file system. numpy_environment.tar.gz can be placed on any Kubernetes persistent volume and mounted to the driver and executor pods. Alternatively, the S3 path for numpy_environment.tar.gz can also be passed using --py-files.

    Request:

    {\n  \"name\": \"spark-python-in-s3-conda-fsx\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/kmeans.py\",\n      \"entryPointArguments\": [\n        \"s3://<s3 prefix>/kmeans_data.txt\",\n        \"2\",\n        \"3\"\n       ], \n       \"sparkSubmitParameters\": \"--verbose --archives /var/data/numpy_environment.tar.gz#environment --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=4\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.executor.instances\": \"3\",\n          \"spark.dynamicAllocation.enabled\":\"false\",\n          \"spark.files\":\"/var/data/numpy_environment.tar.gz#environment\",\n          \"spark.kubernetes.pyspark.pythonVersion\":\"3\",\n          \"spark.pyspark.driver.python\":\"./environment/bin/python\",\n          \"spark.pyspark.python\":\"./environment/bin/python\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\n

    The above request doesn't work with Spark on Kubernetes.

    "},{"location":"submit-applications/docs/spark/pyspark/#bundled-as-virtual-env","title":"Bundled as virtual env","text":"

    Warning

    This will not work with spark on kubernetes

    This feature only works with YARN cluster mode. In this implementation for YARN, the dependencies are installed from the repository for every driver and executor. This might not be a scalable model, as per SPARK-25433. The recommended solution is to pass in the dependencies as a PEX file.

    "},{"location":"submit-applications/docs/spark/pyspark/#custom-docker-image","title":"Custom docker image","text":"

    See the details in the official documentation.

    Dockerfile

    FROM 107292555468.dkr.ecr.eu-central-1.amazonaws.com/spark/emr-6.3.0\nUSER root\nRUN pip3 install boto3\nUSER hadoop:hadoop\n
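
    A sketch of how such a custom image is typically built, pushed, and then referenced in a job; the repository name and tag are placeholders, and spark.kubernetes.container.image is the property EMR on EKS uses to pick up a custom image:

    docker build -t <account-id>.dkr.ecr.<region>.amazonaws.com/emr6.3-custom:boto3 .\ndocker push <account-id>.dkr.ecr.<region>.amazonaws.com/emr6.3-custom:boto3\n\n# then reference the image in the job's spark-defaults classification, e.g.\n# \"spark.kubernetes.container.image\": \"<account-id>.dkr.ecr.<region>.amazonaws.com/emr6.3-custom:boto3\"\n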
    "},{"location":"submit-applications/docs/spark/pyspark/#python-code-with-java-dependencies","title":"Python code with java dependencies","text":""},{"location":"submit-applications/docs/spark/pyspark/#list-of-packages","title":"List of packages","text":"

    Warning

    This will not work with spark on kubernetes

    This feature only works with YARN cluster mode.

    kafka integration example

    ./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2\n
    "},{"location":"submit-applications/docs/spark/pyspark/#list-of-jar-files","title":"List of .jar files","text":"

    This is not a scalable approach, as the number of dependent files can grow large and you also need to manually specify all the transitive dependencies.

    How to find all the .jar files which belong to a given package:

    1. Go to Maven Repository
    2. Search for the package name
    3. Select the matching Spark and Scala version
    4. Copy the URL of the jar file
    5. Copy the URL of the jar file of all compile dependencies

    Request:

    cat > Spark-Python-with-jars.json << EOF\n{\n  \"name\": \"spark-python-with-jars\",\n  \"virtualClusterId\": \"<virtual-cluster-id>\",\n  \"executionRoleArn\": \"<execution-role-arn>\",\n  \"releaseLabel\": \"emr-6.2.0-latest\",\n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/pi.py\",\n      \"sparkSubmitParameters\": \"--jars https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.1.1/spark-sql-kafka-0-10_2.12-3.1.1.jar,https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.6.2/commons-pool2-2.6.2.jar,https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/2.6.0/kafka-clients-2.6.0.jar,https://repo1.maven.org/maven2/org/apache/spark/spark-token-provider-kafka-0-10_2.12/3.1.1/spark-token-provider-kafka-0-10_2.12-3.1.1.jar,https://repo1.maven.org/maven2/org/apache/spark/spark-tags_2.12/3.1.1/spark-tags_2.12-3.1.1.jar --conf spark.driver.cores=3 --conf spark.executor.memory=8G --conf spark.driver.memory=6G --conf spark.executor.cores=3\"\n    }\n  },\n  \"configurationOverrides\": {\n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\",\n        \"logStreamNamePrefix\": \"demo\"\n      },\n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-with-jars.json\n
    "},{"location":"submit-applications/docs/spark/pyspark/#custom-docker-image_1","title":"Custom docker image","text":"

    See the basics in the official documentation.

    Approach 1: List of .jar files

    This is not a scalable approach, as the number of dependent files can grow large and you also need to manually specify all the transitive dependencies.

    How to find all the .jar files which belong to a given package:

    1. Go to Maven Repository
    2. Search for the package name
    3. Select the matching Spark and Scala version
    4. Copy the URL of the jar file
    5. Copy the URL of the jar file of all compile dependencies

    Dockerfile

    FROM 107292555468.dkr.ecr.eu-central-1.amazonaws.com/spark/emr-6.3.0\n\nUSER root\n\nARG JAR_HOME=/usr/lib/spark/jars/\n\n# Kafka\nADD https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.1.1/spark-sql-kafka-0-10_2.12-3.1.1.jar $JAR_HOME\nADD https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.6.2/commons-pool2-2.6.2.jar $JAR_HOME\nADD https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/2.6.0/kafka-clients-2.6.0.jar $JAR_HOME\nADD https://repo1.maven.org/maven2/org/apache/spark/spark-token-provider-kafka-0-10_2.12/3.1.1/spark-token-provider-kafka-0-10_2.12-3.1.1.jar $JAR_HOME\nADD https://repo1.maven.org/maven2/org/apache/spark/spark-tags_2.12/3.1.1/spark-tags_2.12-3.1.1.jar $JAR_HOME\n\nRUN chmod -R +r  /usr/lib/spark/jars\n\nUSER hadoop:hadoop\n

    Observed behavior: Spark automatically loads all the .jar files from the /usr/lib/spark/jars/ directory. In the Dockerfile we are adding these files as the root user, so they get -rw------- permissions while the original files have -rw-r--r-- permissions. EMR on EKS uses hadoop:hadoop to run Spark jobs, and files with -rw------- permissions are not readable by this user and cannot be imported. To make these files readable for all users, run chmod -R +r /usr/lib/spark/jars so the files get -rw-r--r-- permissions.

    Approach 2: List of packages

    This approach is a resource-intensive (min 1 vCPU, 2 GB RAM) solution, because it runs a dummy Spark job. Scale your local or CI/CD resources accordingly.

    Dockerfile

    FROM 107292555468.dkr.ecr.eu-central-1.amazonaws.com/spark/emr-6.3.0\n\nUSER root\n\nARG KAFKA_PKG=\"org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2\"\n\nRUN spark-submit run-example --packages $KAFKA_PKG --deploy-mode=client --master=local[1] SparkPi\nRUN mv /root/.ivy2/jars/* /usr/lib/spark/jars/\n\nUSER hadoop:hadoop\n

    Observed behavior: Spark runs Ivy to resolve all of its dependencies (packages) when --packages is defined in the submit command. We can run a \"dummy\" Spark job to make Spark download its packages. These .jars are saved in /root/.ivy2/jars/, which we can move to /usr/lib/spark/jars/ for further use. These jars have -rw-r--r-- permissions and do not require further modification. The advantage of this method is that Ivy downloads the transitive dependencies of the package as well, so we only needed to specify org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 instead of the five .jar files above.

    "},{"location":"submit-applications/docs/spark/pyspark/#import-of-dynamic-modules-pyd-so","title":"Import of Dynamic Modules (.pyd, .so)","text":"

    Import of dynamic modules (.pyd, .so) is disallowed when they are bundled as a zip file.

    Steps to create a .so file: example.c

    /* File : example.c */\n\n #include \"example.h\"\n unsigned int add(unsigned int a, unsigned int b)\n {\n    printf(\"\\n Inside add function in C library \\n\");\n    return (a+b);\n }\n

    example.h

    /* File : example.h */\n#include<stdio.h>\n extern unsigned int add(unsigned int a, unsigned int b);\n
    gcc  -fPIC -Wall -g -c example.c\ngcc -shared -fPIC -o libexample.so example.o\n

    Upload libexample.so to a S3 location.

    pyspark code to be executed - py_c_call.py

    import sys\nimport os\n\nfrom ctypes import CDLL\nfrom pyspark.sql import SparkSession\n\n\nif __name__ == \"__main__\":\n\n    spark = SparkSession\\\n        .builder\\\n        .appName(\"py-c-so-example\")\\\n        .getOrCreate()\n\n    basedir = os.path.abspath(os.path.dirname(__file__))\n    libpath = os.path.join(basedir, 'libexample.so')\n    sum_list = CDLL(libpath)\n    data = [(1,2),(2,3),(5,6)]\n    columns=[\"a\",\"b\"]\n    df = spark.sparkContext.parallelize(data).toDF(columns)\n    df.withColumn('total', sum_list.add(df.a,df.b)).collect()\n    spark.stop()\n

    Request:

    cat > spark-python-in-s3-Clib.json <<EOF\n{\n  \"name\": \"spark-python-in-s3-Clib\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/py_c_call.py\", \n       \"sparkSubmitParameters\": \"--files s3://<s3 prefix>/libexample.so --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-Clib.json\n

    Configuration of interest: --files s3://<s3 prefix>/libexample.so distributes libexample.so to the working directory of all executors. Dynamic modules (.pyd, .so) can also be imported by bundling them within .egg (SPARK-6764), .whl and .pex files.

    "},{"location":"troubleshooting/docs/change-log-level/","title":"Change Log level for Spark application on EMR on EKS","text":"

    To obtain more detail about their application or job submission, Spark application developers can change the log level of their job depending on their requirements. Spark uses Apache Log4j for logging.

    "},{"location":"troubleshooting/docs/change-log-level/#change-log-level-to-debug","title":"Change log level to DEBUG","text":""},{"location":"troubleshooting/docs/change-log-level/#using-emr-classification","title":"Using EMR classification","text":"

    The log level of Spark applications can be changed using the EMR spark-log4j configuration classification.

    Request: The pi.py application script is from the Spark examples. EMR on EKS includes the example, located at /usr/lib/spark/examples/src/main, for you to try.

    The spark-log4j classification can be used to configure values in log4j.properties for EMR releases 6.7.0 or lower, and in log4j2.properties for EMR releases 6.8.0 and later.

    cat > Spark-Python-in-s3-debug-log.json << EOF\n{\n  \"name\": \"spark-python-in-s3-debug-log-classification\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"local:///usr/lib/spark/examples/src/main/python/pi.py\",\n      \"entryPointArguments\": [ \"200\" ],\n       \"sparkSubmitParameters\": \"--conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.memory=2G --conf spark.executor.instances=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n          }\n      },\n      {\n        \"classification\": \"spark-log4j\", \n        \"properties\": {\n          \"log4j.rootCategory\":\"DEBUG, console\"\n          }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3-debug-log.json\n

    The above request will print DEBUG logs in the Spark driver and executor containers. The generated logs will be pushed to S3 and Amazon CloudWatch Logs as configured in the request.

    Starting from version 3.3.0, Spark migrated from Log4j 1 to Log4j 2. EMR on EKS still lets you write the log properties to the same \"classification\": \"spark-log4j\"; however, they now need to follow the log4j2.properties syntax, such as:

          {\n        \"classification\": \"spark-log4j\",\n        \"properties\": {\n          \"rootLogger.level\" : \"DEBUG\"\n          }\n      }\n
    "},{"location":"troubleshooting/docs/change-log-level/#custom-log4j-properties","title":"Custom log4j properties","text":"

    Download log4j.properties from here. Edit log4j.properties with the log level as required. Save the edited log4j.properties in a mounted volume. In this example, log4j.properties is placed in an S3 bucket that is mapped to an FSx for Lustre filesystem.

    Request The pi.py used in the request payload below is from the Spark examples.

    cat > Spark-Python-in-s3-debug-log.json << EOF\n{\n  \"name\": \"spark-python-in-s3-debug-log\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/pi.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.driver.extraJavaOptions\":\"-Dlog4j.configuration=file:///var/data/log4j-debug.properties\",\n          \"spark.executor.extraJavaOptions\":\"-Dlog4j.configuration=file:///var/data/log4j-debug.properties\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3-debug-log.json\n

    Configurations of interest: The configuration below enables the Spark driver and executor to pick up the log4j configuration file from the /var/data/ folder mounted to the driver and executor containers. For a guide to mounting FSx for Lustre to the driver and executor containers, refer to EMR Containers integration with FSx for Lustre.

    \"spark.driver.extraJavaOptions\":\"-Dlog4j.configuration=file:///var/data/log4j-debug.properties\",\n\"spark.executor.extraJavaOptions\":\"-Dlog4j.configuration=file:///var/data/log4j-debug.properties\",\n
    "},{"location":"troubleshooting/docs/connect-spark-ui/","title":"Connect to Spark UI running on the Driver Pod","text":"

    To obtain more detail about their application or monitor their job execution, Spark application developers can connect to Spark-UI running on the Driver Pod.

    The Spark UI (Spark History Server) is packaged with EMR on EKS out of the box. Alternatively, if you want to see the Spark UI immediately after the driver is spun up, you can use the instructions on this page to connect.

    This page shows how to use kubectl port-forward to connect to the Job's Driver Pod running in a Kubernetes cluster. This type of connection is useful for debugging purposes.

    Pre-Requisites

    • The AWS CLI should be installed
    • \"kubectl\" should be installed
    • If this is the first time you are connecting to your EKS cluster from your machine, you should run aws eks update-kubeconfig --name <eks-cluster-name> --region <region> to download the kubeconfig file and use the correct context to talk to the API server.
    "},{"location":"troubleshooting/docs/connect-spark-ui/#submitting-the-job-to-a-virtual-cluster","title":"Submitting the job to a virtual cluster","text":"

    Request

    cat >spark-python.json << EOF\n{\n  \"name\": \"spark-python-in-s3\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.3.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=4  --conf spark.executor.memory=20G --conf spark.driver.memory=20G --conf spark.executor.cores=4\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python.json\n

    Once the job is submitted successfully, run the kubectl get pods -n <virtual-cluster-k8s-namespace> -w command to watch all the pods until you observe the driver pod in the \"Running\" state. The driver pod's name is usually in the spark-<job-id>-driver format.
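
    For orientation only, the watch output might resemble the following; the pod names and timings are hypothetical:

    NAME                        READY   STATUS    RESTARTS   AGE\nspark-<job-id>-driver       1/1     Running   0          45s\nspark-<job-id>-exec-1       1/1     Running   0          20s\n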

    "},{"location":"troubleshooting/docs/connect-spark-ui/#connecting-to-the-driver-pod","title":"Connecting to the Driver Pod","text":"

    The Spark driver pod hosts the Spark UI on port 4040. However, the pod runs within the internal Kubernetes network. To get access to internal Kubernetes resources, kubectl provides a \"port forwarding\" feature that allows access from your localhost. To get access to the driver pod in your cluster:

    1- Run kubectl port-forward <driver-pod-name> 4040:4040

    The result should be the following:

    Forwarding from 127.0.0.1:4040 -> 4040\nForwarding from [::1]:4040 -> 4040\n

    2- Open a browser and type http://localhost:4040 in the Address bar.

    You should be able to connect to the Spark UI:

    "},{"location":"troubleshooting/docs/connect-spark-ui/#consideration","title":"Consideration","text":"

    Long-running Spark jobs, such as Spark Streaming applications or large Spark SQL queries, can generate large event logs. Large event logs can quickly use up storage space on running pods, and you may see a blank UI or even OutOfMemory errors when you load Persistent UIs. To avoid these issues, we recommend that you either turn on the Spark event log rolling and compaction feature (the default emr-container-event-log-dir is /var/log/spark/apps) or write event logs to an S3 location and parse them with a self-hosted Spark History Server.
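
    These are standard Spark 3.x settings rather than EMR-specific ones, and the max file size shown is illustrative. A minimal sketch of enabling event log rolling under the spark-defaults classification:

    \"spark.eventLog.rolling.enabled\": \"true\"\n\"spark.eventLog.rolling.maxFileSize\": \"64m\"\n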

    "},{"location":"troubleshooting/docs/eks-cluster-auto-scaler/","title":"EKS Cluster Auto-Scaler","text":"

    Kubernetes provisions nodes using the Cluster Autoscaler (CAS). AWS EKS has its own implementation of the Kubernetes CAS, and EKS uses Managed Nodegroups to spin up nodes.

    "},{"location":"troubleshooting/docs/eks-cluster-auto-scaler/#logs-of-eks-cluster-auto-scaler","title":"Logs of EKS Cluster Auto-scaler.","text":"

    On AWS, Cluster Autoscaler utilizes Amazon EC2 Auto Scaling groups to provision nodes. This section will help you identify the error message when the autoscaler fails to provision nodes.

    An example scenario is a NodeGroup that fails because the requested instance type is not supported in certain Availability Zones (AZs):

    Could not launch On-Demand Instances. Unsupported - Your requested instance type (g4dn.xlarge) is not supported in your requested Availability Zone (ca-central-1d). Please retry your request by not specifying an Availability Zone or choosing ca-central-1a, ca-central-1b. Launching EC2 instance failed.\n

    The steps to find the logs for the Auto Scaling groups are:

    Step 1: Login to AWS Console, and select Elastic Kubernetes Service

    Step 2: Select the Compute tab, and select the NodeGroup that failed.

    Step 3: Select the Auto Scaling group name from the NodeGroup's section, which will direct you to the EC2 --> Auto Scaling Groups page.

    Step 4: Click the Activity tab of the Auto Scaling group, and the Activity history will provide the details of the error, including:

    - Status\n- Description\n- Cause\n- Start Time\n- End Time\n

    Alternatively, the activities/logs can be found via the CLI as well:

    aws autoscaling describe-scaling-activities \\\n  --region <region> \\\n  --auto-scaling-group-name <NodeGroup-AutoScaling-Group>\n

    In the above error scenario, the ca-central-1d Availability Zone doesn't support g4dn.xlarge. The solution is:

    Step 1: Identify the subnets of the Availability Zones that support the GPU instance type. The NodeGroup section lists all the subnets, and you can click each subnet to see which AZ it is deployed in.

    Step 2: Create a NodeGroup only in the subnets identified in the above step:

    aws eks create-nodegroup \\\n    --region <region> \\ \n    --cluster-name <cluster-name> \\\n    --nodegroup-name <nodegroup-name> \\\n    --scaling-config minSize=10,maxSize=10,desiredSize=10 \\\n    --ami-type AL2_x86_64_GPU \\\n    --node-role <NodeGroupRole> \\\n    --subnets <subnet-1-that-supports-gpu> <subnet-2-that-supports-gpu> \\\n    --instance-types g4dn.xlarge \\\n    --disk-size <disk size>\n
    "},{"location":"troubleshooting/docs/karpenter/","title":"Karpenter","text":"

    Karpenter is an open-source cluster autoscaler for Kubernetes (EKS) that automatically provisions new nodes in response to unschedulable pods. Before Karpenter was introduced, EKS users would typically rely on the Cluster Autoscaler (CAS), which scales Managed NodeGroups to provision nodes.

    The challenge with Managed NodeGroups is that a nodegroup can only create nodes with a single instance type. In order to provision nodes with different instance types for different workloads, multiple nodegroups have to be created. Karpenter, on the other hand, can provision nodes of different types by working with the EC2 Fleet API. The best practices for configuring Provisioners are documented at https://aws.github.io/aws-eks-best-practices/karpenter/

    This guide helps the user troubleshoot common problems with Karpenter.

    "},{"location":"troubleshooting/docs/karpenter/#logs-of-karpenter-controller","title":"Logs of Karpenter Controller","text":"

    Karpenter is a custom Kubernetes controller, and the following steps will help you find the Karpenter logs.

    Step 1: Identify the namespace where Karpenter is running. In most cases, Helm is used to deploy the Karpenter packages. The helm ls command lists the namespace where Karpenter is installed.

    # Example\n\n% helm ls --all-namespaces\nNAME        NAMESPACE   REVISION    UPDATED                                 STATUS      CHART               APP VERSION\nkarpenter   karpenter   1           2023-05-15 14:16:03.726908 -0500 CDT    deployed    karpenter-v0.27.3   0.27.3\n

    Step 2: Set up kubectl

    brew install kubectl\n\naws --region <region> eks update-kubeconfig --name <eks-cluster-name>\n

    Step 3: Check the status of the Karpenter pods

    # kubectl get pods -n <namespace>\n\n% kubectl get pods -n karpenter\nNAME                         READY   STATUS    RESTARTS   AGE\nkarpenter-7b455dccb8-prrzx   1/1     Running   0          7m18s\nkarpenter-7b455dccb8-x8zv8   1/1     Running   0          7m18s\n

    Step 4: The kubectl logs command can be used to read the Karpenter logs. In the example below, the Karpenter pod logs show that a t3a.large instance was launched.

    # kubectl logs <karpenter pod name> -n <namespace>\n\n% kubectl logs karpenter-7b455dccb8-prrzx -n karpenter\n..\n..\n\n2023-05-15T19:16:20.546Z    DEBUG   controller  discovered region   {\"commit\": \"***-dirty\", \"region\": \"us-west-2\"}\n2023-05-15T19:16:20.666Z    DEBUG   controller  discovered cluster endpoint {\"commit\": \"**-dirty\", \"cluster-endpoint\": \"https://******.**.us-west-2.eks.amazonaws.com\"}\n..\n..\n2023-05-15T19:16:20.786Z    INFO    controller.provisioner  starting controller {\"commit\": \"**-dirty\"}\n2023-05-15T19:16:20.787Z    INFO    controller.deprovisioning   starting controller {\"commit\": \"**-dirty\"}\n..\n2023-05-15T19:16:20.788Z    INFO    controller  Starting EventSource    {\"commit\": \"**-dirty\", \"controller\": \"node\", \"controllerGroup\": \"\", \"controllerKind\": \"Node\", \"source\": \"kind source: *v1.Pod\"}\n..\n2023-05-15T20:34:56.718Z    INFO    controller.provisioner.cloudprovider    launched instance   {\"commit\": \"d7e22b1-dirty\", \"provisioner\": \"default\", \"id\": \"i-03146cd4d4152a935\", \"hostname\": \"ip-*-*-*-*.us-west-2.compute.internal\", \"instance-type\": \"t3a.large\", \"zone\": \"us-west-2d\", \"capacity-type\": \"on-demand\", \"capacity\": {\"cpu\":\"2\",\"ephemeral-storage\":\"20Gi\",\"memory\":\"7577Mi\",\"pods\":\"35\"}}\n
    "},{"location":"troubleshooting/docs/karpenter/#error-while-decoding-json-json-unknown-field-iamidentitymappings","title":"Error while decoding JSON: json: unknown field \"iamIdentityMappings\"","text":"

    Problem The create-cluster command from https://karpenter.sh/v0.27.3/getting-started/getting-started-with-karpenter/#3-create-a-cluster throws an error:

    Error: loading config file \"karpenter.yaml\": error unmarshaling JSON: while decoding JSON: json: unknown field \"iamIdentityMappings\"\n

    Solution The eksctl CLI was not able to understand the kind iamIdentityMappings. This is because the eksctl version is old and its schema doesn't support this kind.

    The solution is to upgrade the eksctl CLI and re-run the cluster creation commands:

    brew upgrade eksctl\n
    "},{"location":"troubleshooting/docs/rbac-permissions-errors/","title":"RBAC Permission Errors","text":"

    The following sections provide solutions to common RBAC authorization errors.

    "},{"location":"troubleshooting/docs/rbac-permissions-errors/#persistentvolumeclaims-is-forbidden","title":"PersistentVolumeClaims is forbidden","text":"

    Error: Spark jobs that require creation, listing, or deletion of Persistent Volume Claims (PVCs) were not supported before EMR 6.8. Jobs that require these permissions will fail with the exception \"persistentvolumeclaims is forbidden\". Looking into the driver logs, you may see an error like this:

    persistentvolumeclaims is forbidden. User \"system:serviceaccount:emr:emr-containers-sa-spark-client-93ztm12rnjz163mt3rgdb3bjqxqfz1cgvqh1e9be6yr81\" cannot create resource \"persistentvolumeclaims\" in API group \"\" in namespace \"emr\".\n

    You may encounter this error because the default Kubernetes role emr-containers is missing the required RBAC permissions. As a result, the emr-containers primary role can\u2019t dynamically create the necessary permissions for additional roles such as the Spark driver, Spark executor, or Spark client when you submit a job.

    Solution: Add the required permissions to emr-containers.

    Here are the complete RBAC permissions for EMR on EKS:

    • emr-containers.yaml

    You can check whether you have the complete RBAC permissions using the steps below:

    export NAMESPACE=YOUR_VALUE\nkubectl describe role emr-containers -n ${NAMESPACE}\n

    If the permissions don't match, proceed to apply the latest permissions:

    export NAMESPACE=YOUR_VALUE\nkubectl apply -f https://raw.githubusercontent.com/aws/aws-emr-containers-best-practices/main/tools/k8s-rbac-policies/emr-containers.yaml -n ${NAMESPACE}\n

    You can delete the Spark driver and client roles because they will be dynamically re-created the next time a job is run.
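
    If you want to clean them up explicitly, a sketch could be the following; the role names are placeholders, since the actual names are generated per namespace:

    export NAMESPACE=YOUR_VALUE\nkubectl delete role <spark-driver-role-name> <spark-client-role-name> -n ${NAMESPACE}\n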

    "},{"location":"troubleshooting/docs/self-hosted-shs/","title":"Self Hosted Spark History Server","text":"

    In this section, you will learn how to self host Spark History Server instead of using the Persistent App UI on the AWS Console.

    1. In your StartJobRun call for EMR on EKS, set the following configurations to point to an S3 bucket where you would like your event logs to go: spark.eventLog.dir and spark.eventLog.enabled, as such:

      \"configurationOverrides\": {\n  \"applicationConfiguration\": [{\n    \"classification\": \"spark-defaults\",\n    \"properties\": {\n      \"spark.eventLog.enabled\": \"true\",\n      \"spark.eventLog.dir\": \"s3://your-bucket-here/some-directory\"\n...\n
    2. Take note of the S3 bucket specified in #1 and use it in the instructions in step #3 wherever you are asked for path_to_eventlog, making sure it is prefixed with s3a://, not s3://. An example is -Dspark.history.fs.logDirectory=s3a://path_to_eventlog.

    3. Follow the instructions here to launch the Spark History Server using a Docker image.

    4. After following the above steps, event logs should flow to the specified S3 bucket, and the Docker container should spin up the Spark History Server (which will be available at 127.0.0.1:18080). This instance of the Spark History Server will pick up and parse event logs from the specified S3 bucket.
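
    For orientation only, launching the container locally might look like the sketch below; the image name is a placeholder and the credentials provider shown is an assumption, so treat the linked instructions as the authoritative reference:

    docker run -d -p 18080:18080 \\\n  -e SPARK_HISTORY_OPTS=\"-Dspark.history.fs.logDirectory=s3a://your-bucket-here/some-directory -Dspark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain\" \\\n  <spark-history-server-image>\n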

    "},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/","title":"Spark Driver and Executor Logs","text":"

    The status of Spark jobs can be monitored via the EMR on EKS describe-job-run API.

    To monitor job progress and troubleshoot failures, you must configure your jobs to send log information to Amazon S3, Amazon CloudWatch Logs, or both.

    "},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#send-spark-logs-to-s3","title":"Send Spark Logs to S3","text":""},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#update-the-iam-role-with-s3-write-access","title":"Update the IAM role with S3 write access","text":"

    Configure the IAM Role passed in StartJobRun input executionRoleArn with access to S3 buckets.

    {\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Effect\": \"Allow\",\n            \"Action\": [\n                \"s3:PutObject\",\n                \"s3:GetObject\",\n                \"s3:ListBucket\"\n            ],\n            \"Resource\": [\n                \"arn:aws:s3:::my_s3_log_location\",\n                \"arn:aws:s3:::my_s3_log_location/*\",\n            ]\n        }\n    ]\n}\n
    "},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#configure-the-startjobrun-api-with-s3-buckets","title":"Configure the StartJobRun API with S3 buckets","text":"

    Configure the monitoringConfiguration with s3MonitoringConfiguration, and configure the S3 location where the logs would be synced.

    {\n  \"name\": \"<job_name>\", \n  \"virtualClusterId\": \"<vc_id>\",  \n  \"executionRoleArn\": \"<iam_role_name_for_job_execution>\", \n  \"releaseLabel\": \"<emr_release_label>\", \n  \"jobDriver\": {\n\n  }, \n  \"configurationOverrides\": {\n    \"monitoringConfiguration\": {\n      \"persistentAppUI\": \"ENABLED\",\n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://my_s3_log_location\"\n      }\n    }\n  }\n}\n
    "},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#log-location-of-jobrunner-driver-executor-in-s3","title":"Log location of JobRunner, Driver, Executor in S3","text":"

    The JobRunner (pod that does spark-submit), Spark Driver, and Spark Executor logs would be found in the following S3 locations.

    JobRunner/Spark-Submit/Controller Logs - s3://my_s3_log_location/${virtual-cluster-id}/jobs/${job-id}/containers/${job-runner-pod-id}/(stderr.gz/stdout.gz)\n\nDriver Logs - s3://my_s3_log_location/${virtual-cluster-id}/jobs/${job-id}/containers/${spark-application-id}/${spark-job-id-driver-pod-name}/(stderr.gz/stdout.gz)\n\nExecutor Logs - s3://my_s3_log_location/${virtual-cluster-id}/jobs/${job-id}/containers/${spark-application-id}/${spark-job-id-driver-executor-id}/(stderr.gz/stdout.gz)\n
    "},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#send-spark-logs-to-cloudwatch","title":"Send Spark Logs to CloudWatch","text":""},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#update-the-iam-role-with-cloudwatch-access","title":"Update the IAM role with CloudWatch access","text":"

    Configure the IAM role passed in the StartJobRun input executionRoleArn with access to Amazon CloudWatch Logs.

    {\n  \"Version\": \"2012-10-17\",\n  \"Statement\": [\n    {\n      \"Effect\": \"Allow\",\n      \"Action\": [\n        \"logs:CreateLogStream\",\n        \"logs:DescribeLogGroups\",\n        \"logs:DescribeLogStreams\"\n      ],\n      \"Resource\": [\n        \"arn:aws:logs:*:*:*\"\n      ]\n    },\n    {\n      \"Effect\": \"Allow\",\n      \"Action\": [\n        \"logs:PutLogEvents\"\n      ],\n      \"Resource\": [\n        \"arn:aws:logs:*:*:log-group:my_log_group_name:log-stream:my_log_stream_prefix/*\"\n      ]\n    }\n  ]\n}\n
    "},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#configure-startjobrun-api-with-cloudwatch","title":"Configure StartJobRun API with CloudWatch","text":"

    Configure the monitoringConfiguration with cloudWatchMonitoringConfiguration, and configure the CloudWatch logGroupName and logStreamNamePrefix where the logs should be pushed.

    {\n  \"name\": \"<job_name>\", \n  \"virtualClusterId\": \"<vc_id>\",  \n  \"executionRoleArn\": \"<iam_role_name_for_job_execution>\", \n  \"releaseLabel\": \"<emr_release_label>\", \n  \"jobDriver\": {\n\n  }, \n  \"configurationOverrides\": {\n    \"monitoringConfiguration\": {\n      \"persistentAppUI\": \"ENABLED\",\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"my_log_group_name\",\n        \"logStreamNamePrefix\": \"my_log_stream_prefix\"\n      }\n    }\n  }\n}\n
    "},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#log-location-of-jobrunner-driver-executor","title":"Log location of JobRunner, Driver, Executor","text":"

    The JobRunner (pod that does spark-submit), Spark Driver, and Spark Executor logs would be found in the following AWS CloudWatch locations.

    JobRunner/Spark-Submit/Controller Logs - ${my_log_group_name}/${my_log_stream_prefix}/${virtual-cluster-id}/jobs/${job-id}/containers/${job-runner-pod-id}/(stderr.gz/stdout.gz)\n\nDriver Logs - ${my_log_group_name}/${my_log_stream_prefix}/${virtual-cluster-id}/jobs/${job-id}/containers/${spark-application-id}/${spark-job-id-driver-pod-name}/(stderr.gz/stdout.gz)\n\nExecutor Logs - ${my_log_group_name}/${my_log_stream_prefix}/${virtual-cluster-id}/jobs/${job-id}/containers/${spark-application-id}/${spark-job-id-driver-executor-id}/(stderr.gz/stdout.gz)\n
    "}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Introduction","text":"

    Welcome to the EMR Containers Best Practices Guide. The primary goal of this project is to offer a set of best practices and templates to get started with Amazon EMR on EKS. We publish this guide on GitHub so we can iterate on the content quickly, provide timely and effective recommendations for a variety of concerns, and easily incorporate suggestions from the broader community.

    "},{"location":"#amazon-emr-on-eks-workshop","title":"Amazon EMR on EKS Workshop","text":"

    If you are interested in step-by-step tutorials that leverage the best practices contained in this guide, please visit the Amazon EMR on EKS Workshop.

    "},{"location":"#contributing","title":"Contributing","text":"

    We encourage you to contribute to these guides. If you have implemented a practice that has proven to be effective, please share it with us by opening an issue or a pull request. Similarly, if you discover an error or flaw in the guide, please submit a pull request to correct it.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/","title":"EKS Best Practices and Recommendations","text":"

    The Amazon EMR on EKS team has run scale tests on EKS clusters and has compiled a list of recommendations. The purpose of this document is to share our recommendations for running large-scale EKS clusters supporting EMR on EKS.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#amazon-vpc-cni-best-practices","title":"Amazon VPC CNI Best practices","text":""},{"location":"best-practices-and-recommendations/eks-best-practices/#recommendation-1-improve-ip-address-utilization","title":"Recommendation 1: Improve IP Address Utilization","text":"

    EKS clusters can run out of IP addresses for pods when they reach between 400 and 500 nodes. With the default CNI settings, each node can request more IP addresses than are required. To ensure that you don\u2019t run out of IP addresses, there are two solutions:

    1. Set MINIMUM_IP_TARGET and WARM_IP_TARGET instead of the default setting of WARM_ENI_TARGET=1. The values of these settings will depend on your instance type, expected pod density, and workload. More info about these CNI settings can be found here. The maximum number of IP addresses per node (and thus maximum number of pods per node) depends on instance type and can be looked up here.

    2. Even after you have found the right CNI settings as described above, the subnets created by eksctl may still not provide enough addresses (by default eksctl creates a \u201c/19\u201d subnet for each nodegroup, which contains ~8.1k addresses). You can configure the CNI to take addresses from (larger) subnets that you create. For example, you could create a few \u201c/16\u201d subnets, which contain ~65k IP addresses per subnet. You should implement this option after you have configured the CNI settings as described in #1. To configure your pods to use IP addresses from larger manually-created subnets, use CNI custom networking (see below for more information):

    CNI custom networking

    By default, the CNI assigns the Pod\u2019s IP address from the worker node's primary elastic network interface's (ENI) security groups and subnet. If you don\u2019t have enough IP addresses in the worker node subnet, or prefer that the worker nodes and Pods reside in separate subnets to avoid IP address allocation conflicts between Pods and other resources in the VPC, you can use CNI custom networking.

    Enabling a custom network removes an available elastic network interface (and all of its available IP addresses for pods) from each worker node that uses it. The worker node's primary network interface is not used for pod placement when a custom network is enabled.

    If you want the CNI to assign IP addresses for Pods from a different subnet, you can set AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG environment variable to true.

    kubectl set env daemonset aws-node \\\n-n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true\n

    When AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true, the CNI will assign Pod IP address from a subnet defined in ENIConfig. The ENIConfig custom resource is used to define the subnet in which Pods will be scheduled.

    apiVersion: crd.k8s.amazonaws.com/v1alpha1\nkind: ENIConfig\nmetadata: \n  name: us-west-2a\nspec: \n  securityGroups: \n    - sg-0dff111a1d11c1c11\n  subnet: subnet-011b111c1f11fdf11\n

    You will need to create an ENIConfig custom resource for each subnet you want to use for Pod networking.

    • The securityGroups field should have the ID of the security group attached to the worker nodes.

    • The name field should be the name of the Availability Zone in your VPC. If you name your ENIConfig custom resources after each Availability Zone in your VPC, you can enable Kubernetes to automatically apply the corresponding ENIConfig for the worker node Availability Zone with the following command.

    kubectl set env daemonset aws-node \\\n-n kube-system ENI_CONFIG_LABEL_DEF=failure-domain.beta.kubernetes.io/zone\n

    Note

    Upon creating the ENIConfig custom resources, you will need to create new worker nodes. The existing worker nodes and Pods will remain unaffected.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#recommendation-2-prevent-ec2-vpc-api-throttling-from-assignprivateipaddresses-attachnetworkinterface","title":"Recommendation 2: Prevent EC2 VPC API throttling from AssignPrivateIpAddresses & AttachNetworkInterface","text":"

    Often EKS cluster scale-out time can increase because the CNI is being throttled by the EC2 VPC APIs. The following steps can be taken to prevent these issues:

    1. Use CNI version 1.8.0 or later, as it makes fewer calls to the EC2 VPC APIs than earlier versions.

    2. Configure the MINIMUM_IP_TARGET and WARM_IP_TARGET parameters instead of the default parameter of WARM_ENI_TARGET=1. This way, only those IP addresses that are necessary are requested from EC2. The values of these settings will depend on your instance type and expected pod density. More info about these settings here.

    3. Request an API limit increase on the EC2 VPC APIs that are getting throttled. This option should be considered only after steps 1 & 2 have been done.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#other-recommendations-for-amazon-vpc-cni","title":"Other Recommendations for Amazon VPC CNI","text":""},{"location":"best-practices-and-recommendations/eks-best-practices/#plan-for-growth","title":"Plan for growth","text":"

    Size the subnets you will use for Pod networking for growth. If you have insufficient IP addresses available in the subnet that the CNI uses, your pods will not get an IP address. The pods will remain in the pending state until an IP address becomes available. This may impact application autoscaling and compromise its availability.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#monitor-ip-address-inventory","title":"Monitor IP address inventory","text":"

    You can monitor the IP addresses inventory of subnets using the CNI Metrics Helper, and set CloudWatch alarms to get notified if a subnet is running out of IP addresses.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#snat-setting","title":"SNAT setting","text":"

    Source Network Address Translation (source-nat or SNAT) allows traffic from a private network to go out to the internet. Virtual machines launched on a private network can get to the internet by going through a gateway capable of performing SNAT. If your Pods with private IP address need to communicate with other private IP address spaces (for example, Direct Connect, VPC Peering or Transit VPC), then you should enable external SNAT in the CNI:

    kubectl set env daemonset \\\n-n kube-system aws-node AWS_VPC_K8S_CNI_EXTERNALSNAT=true\n
    "},{"location":"best-practices-and-recommendations/eks-best-practices/#coredns-best-practices","title":"CoreDNS Best practices","text":""},{"location":"best-practices-and-recommendations/eks-best-practices/#prevent-coredns-from-being-overwhelmed-unknownhostexception-in-spark-jobs-and-other-pods","title":"Prevent CoreDNS from being overwhelmed (UnknownHostException in spark jobs and other pods)","text":"

    CoreDNS is a deployment, which means it runs a fixed number of replicas and thus does not scale out with the cluster. This can be a problem for workloads that do a lot of DNS lookups. One simple solution is to install dns-autoscaler, which adjusts the number of replicas of the CoreDNS deployment as the cluster grows and shrinks.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#monitor-coredns-metrics","title":"Monitor CoreDNS metrics","text":"

    CoreDNS is a deployment, which means it runs a fixed number of replicas and thus does not scale out with the cluster. This can cause workloads to time out with UnknownHostException, as Spark executors perform many DNS lookups while registering themselves with the Spark driver. One simple solution is to install dns-autoscaler, which adjusts the number of replicas of the CoreDNS deployment as the cluster grows and shrinks.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#cluster-autoscaler-best-practices","title":"Cluster Autoscaler Best practices","text":""},{"location":"best-practices-and-recommendations/eks-best-practices/#increase-cluster-autoscaler-memory-to-avoid-unnecessary-exceptions","title":"Increase cluster-autoscaler memory to avoid unnecessary exceptions","text":"

    Cluster-autoscaler can require a lot of memory to run because it stores a lot of information about the state of the cluster, such as data about every pod and every node. If the cluster-autoscaler has insufficient memory, it can crash. Ensure that you give the cluster-autoscaler deployment more memory, e.g., 1Gi instead of the default 300Mi. Useful information about configuring the cluster-autoscaler for improved scalability and performance can be found here.
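
    As a minimal sketch (the values are illustrative, not prescriptive), the container resources in the cluster-autoscaler deployment could be raised like this:

    resources:\n  requests:\n    cpu: 100m\n    memory: 1Gi\n  limits:\n    cpu: 100m\n    memory: 1Gi\n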

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#avoid-job-failures-when-cluster-autoscaler-attempts-scale-in","title":"Avoid job failures when Cluster Autoscaler attempts scale-in","text":"

    Cluster Autoscaler will attempt a scale-in action for any under-utilized instance within your EKS cluster. When a scale-in action is performed, all pods from that instance are relocated to another node. This could cause disruption for critical workloads; for example, if the driver pod is restarted, the entire job needs to restart. For this reason, we recommend using Kubernetes annotations on all critical pods (especially driver pods) and for the cluster autoscaler deployment. Please see here for more info.

    cluster-autoscaler.kubernetes.io/safe-to-evict=false\n
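
    For illustration, assuming a driver pod template is already in use, the annotation could be added like this (a sketch, not an official template):

    apiVersion: v1\nkind: Pod\nmetadata:\n  annotations:\n    cluster-autoscaler.kubernetes.io/safe-to-evict: \"false\"\nspec:\n  containers:\n  - name: spark-kubernetes-driver\n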
    "},{"location":"best-practices-and-recommendations/eks-best-practices/#configure-overprovisioning-with-cluster-autoscaler-for-higher-priority-jobs","title":"Configure overprovisioning with Cluster Autoscaler for higher priority jobs","text":"

    If the required resources are not available in the cluster, pods go into a pending state. Cluster Autoscaler uses this signal to scale out the cluster, and this activity can be time-consuming (several minutes) for higher-priority jobs. In order to minimize the time required for scaling, we recommend overprovisioning resources. You can launch pause pods (dummy workloads that sleep until they receive SIGINT or SIGTERM) with negative priority to reserve EC2 capacity. Once the higher-priority jobs are scheduled, these pause pods are preempted to make room for the high-priority pods, which in turn scales out additional capacity as a buffer. Be aware that this is a trade-off, as it adds slightly higher cost while minimizing scheduling latency. You can read more about the overprovisioning best practice here.
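
    A minimal sketch of this pattern is shown below; the priority value, replica count, and resource requests are illustrative assumptions:

    apiVersion: scheduling.k8s.io/v1\nkind: PriorityClass\nmetadata:\n  name: overprovisioning\nvalue: -1\nglobalDefault: false\ndescription: \"Negative-priority class for pause pods that reserve spare capacity\"\n---\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: overprovisioning-pause\nspec:\n  replicas: 2\n  selector:\n    matchLabels:\n      run: overprovisioning-pause\n  template:\n    metadata:\n      labels:\n        run: overprovisioning-pause\n    spec:\n      priorityClassName: overprovisioning\n      containers:\n      - name: pause\n        image: registry.k8s.io/pause\n        resources:\n          requests:\n            cpu: \"1\"\n            memory: 2Gi\n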

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#eks-control-plane-best-practices","title":"EKS Control Plane Best practices","text":""},{"location":"best-practices-and-recommendations/eks-best-practices/#api-server-overwhelmed","title":"API server overwhelmed","text":"

    System pods, workload pods, and external systems can make many calls to the Kubernetes API server. This can decrease performance and also increase EMR on EKS job failures. There are multiple ways to avoid API server availability issues including but not limited to:

    • By default, the EKS API servers are automatically scaled to meet your workload demand. If you see increased latencies, please contact AWS via a support ticket and work with engineering team to resolve the issue.

    • Consider increasing the scan interval of the cluster-autoscaler from the default value of 10 seconds, i.e., scanning less frequently. Each time the cluster-autoscaler runs, it makes many calls to the API server. However, this will result in the cluster scaling out less frequently and in larger steps (and the same with scaling back in when load is reduced). More information about the cluster-autoscaler can be found here. This is not recommended if you need jobs to start ASAP.

    • If you are running your own deployment of Fluentd, an increased load on the API server can be observed. Consider using Fluent Bit instead, which makes fewer calls to the API server. More info can be found here.

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#monitor-control-plane-metrics","title":"Monitor Control Plane Metrics","text":"

    Monitoring Kubernetes API metrics can give you insights into control plane performance and identify issues. An unhealthy control plane can compromise the availability of the workloads running inside the cluster. For example, poorly written controllers can overload the API servers, affecting your application's availability.

    Kubernetes exposes control plane metrics at the /metrics endpoint.

    You can view the metrics exposed using kubectl:

    kubectl get --raw /metrics\n

    These metrics are represented in a Prometheus text format.

    You can use Prometheus to collect and store these metrics. In May 2020, CloudWatch added support for monitoring Prometheus metrics in CloudWatch Container Insights, so you can also use Amazon CloudWatch to monitor the EKS control plane. You can follow the Tutorial for Adding a New Prometheus Scrape Target: Prometheus API Server Metrics to collect metrics and create a CloudWatch dashboard to monitor your cluster\u2019s control plane.

    You can also find Kubernetes API server metrics here. For example, apiserver_request_duration_seconds can indicate how long API requests are taking to run.

    Consider monitoring these control plane metrics:

    "},{"location":"best-practices-and-recommendations/eks-best-practices/#api-server","title":"API Server","text":"Metric Description apiserver_request_total Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, client, and HTTP response contentType and code. apiserver_request_duration_seconds* Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope, and component. rest_client_request_duration_seconds Request latency in seconds. Broken down by verb and URL. apiserver_admission_controller_admission_duration_seconds Admission controller latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit). rest_client_request_duration_seconds Request latency in seconds. Broken down by verb and URL. rest_client_requests_total Number of HTTP requests, partitioned by status code, method, and host."},{"location":"best-practices-and-recommendations/eks-best-practices/#etcd","title":"etcd","text":"Metric Description etcd_request_duration_seconds Etcd request latency in seconds for each operation and object type.

    You can visualize and monitor these Kubernetes API server request, latency, and etcd metrics in Grafana via Grafana dashboard 12006.

    "},{"location":"cost-optimization/docs/cost-optimization/","title":"Cost Optimization using EC2 Spot Instances","text":""},{"location":"cost-optimization/docs/cost-optimization/#ec2-spot-best-practices","title":"EC2 Spot Best Practices","text":"

    Amazon EMR on Amazon EKS enables you to submit Apache Spark jobs on demand on Amazon Elastic Kubernetes Service (EKS) without provisioning dedicated EMR clusters. With EMR on EKS, you can consolidate analytical workloads with your other Kubernetes-based applications on the same Amazon EKS cluster to improve resource utilization and simplify infrastructure management. Cost optimization of the underlying infrastructure is often a key requirement for our customers, and this can be achieved by using Amazon EC2 Spot Instances. Spot Instances are spare EC2 capacity and are available at up to a 90% discount compared to On-Demand Instance prices. If EC2 needs the capacity back for On-Demand Instance usage, Spot Instances can be interrupted. Handling interruptions to build resilient workloads is simple, and there are best practices to manage interruptions through automation or AWS services like EKS.

    This document describes how to architect with EC2 Spot best practices and apply them to EMR on EKS jobs. We will also cover Spark features related to EC2 Spot when you run EMR on EKS jobs.

    "},{"location":"cost-optimization/docs/cost-optimization/#ec2-spot-capacity-provisioning","title":"EC2 Spot Capacity Provisioning","text":"

    EMR on EKS runs open-source big data frameworks like Spark on Amazon EKS, so when you run on Spot instances you are essentially provisioning Spot capacity for the underlying EKS cluster. The key point to remember when you are using Spot instances is instance diversification. There are three ways that EC2 Spot capacity can be provisioned in an EKS cluster.

    EKS Managed Nodegroup:

    We highly recommend using Managed Nodegroups for provisioning Spot instances. This requires significantly less operational effort when compared to self-managed nodegroups. Spot instance interruptions are handled proactively using the Instance Rebalance Recommendation, and the Spot best practice of the capacity-optimized allocation strategy is adopted by default, along with other useful features. If you are planning to scale your cluster, Cluster Autoscaler can be used, but keep in mind one caveat with this approach: maintain the same vCPU-to-memory ratio for the nodes defined in a nodegroup.
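
    As a hedged sketch, a diversified Spot Managed Nodegroup could be created with eksctl along these lines; the cluster name, nodegroup name, instance types, and sizes are illustrative:

    eksctl create nodegroup \\\n  --cluster <eks-cluster-name> \\\n  --name spot-4vcpu-16gb \\\n  --spot \\\n  --instance-types m5.xlarge,m5a.xlarge,m5d.xlarge,m4.xlarge \\\n  --nodes-min 1 \\\n  --nodes-max 20\n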

    Karpenter:

    Karpenter is an open-source node provisioning tool for Kubernetes that works seamlessly with EMR on EKS. Karpenter can help improve the efficiency and cost of running workloads. It provisions nodes based on pod resource requirements. The key advantage of Karpenter is flexibility: not only in terms of EC2 pricing (Spot/On-Demand), but it also aligns with the Spot best practice of instance diversification and uses the capacity-optimized-prioritized allocation strategy; more details can be found in this workshop. Karpenter is also useful for scaling the infrastructure, which is discussed further under the scaling section below.

    Self-Managed Nodegroup:

    EMR on EKS clusters can also run on self-managed nodegroups on EKS. In that case you need to manage the Spot instance interruption lifecycle yourself by installing an open-source tool named AWS Node Termination Handler. AWS Node Termination Handler ensures that the Kubernetes control plane responds appropriately to events that can cause your EC2 instance to become unavailable, such as EC2 maintenance events, EC2 Spot interruptions, ASG scale-in, ASG AZ rebalance, and EC2 instance termination via the API or Console. Please remember that you need to manage all software updates manually if you plan to use this. When you are using dynamic allocation, the nodegroups need to autoscale, and if you are using Cluster Autoscaler then you need to maintain the same vCPU-to-memory ratio for the nodes defined in a nodegroup.
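
    For reference, AWS Node Termination Handler is commonly installed from the eks-charts Helm repository; the sketch below assumes Helm is configured for your cluster and the chart defaults are acceptable:

    helm repo add eks https://aws.github.io/eks-charts\n\nhelm upgrade --install aws-node-termination-handler \\\n  eks/aws-node-termination-handler \\\n  --namespace kube-system\n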

    "},{"location":"cost-optimization/docs/cost-optimization/#spot-interruption-and-spark","title":"Spot Interruption and Spark","text":"

    EC2 Spot instances are suitable for flexible and fault-tolerant workloads. Spark is semi-resilient by design because if an executor fails, new executors are spun up by the driver to continue the job. However, if the driver fails, the entire job fails. For added resiliency, EMR on EKS retries driver pod placement up to 5 times so that Kubernetes can find a suitable host and the job starts successfully. If Kubernetes fails to find a host, the job is cancelled after a 15-minute timeout. If the driver pod fails for other reasons, the job is cancelled with an error message for troubleshooting. Hence, we recommend running the Spark driver on On-Demand instances and executors on Spot instances to cost optimize the workloads. You can use pod templates to configure this scheduling constraint. A nodeSelector can be used as the node selection constraint to run executors on Spot instances, as in the example below. This is simple to use and works well with Karpenter too. The pod template for this would look like:

    apiVersion: v1\nkind: Pod\nspec:\n  nodeSelector:\n    eks.amazonaws.com/capacityType: SPOT\n  containers:\n  - name: spark-kubernetes-executor\n

    Node affinity can also be used here; it allows for more flexibility in the constraints defined. We recommend using \u2018hard affinity\u2019, as highlighted in the code below, for this purpose. For jobs that have a strict SLA and are not suitable to run on Spot, we suggest using the NoSchedule taint effect to ensure no such Pods are scheduled there. The key thing to note here is that the bulk of the compute required in a Spark job runs on executors, and if they can run on EC2 Spot instances, you can benefit from the steep discount available with Spot instances.

    apiVersion: v1\nkind: Pod\nmetadata:\n  labels:\n    spark-role: driver\n  namespace: emr-eks-workshop-namespace\nspec:\n  affinity: \n      nodeAffinity: \n          requiredDuringSchedulingIgnoredDuringExecution: \n            nodeSelectorTerms: \n            - matchExpressions: \n              - key: 'eks.amazonaws.com/capacityType' \n                operator: In \n                values: \n                - ON_DEMAND\n
    apiVersion: v1\nkind: Pod\nmetadata:\n  labels:\n    spark-role: executor\n  namespace: emr-eks-workshop-namespace\nspec:\n  affinity: \n      nodeAffinity: \n          requiredDuringSchedulingIgnoredDuringExecution: \n            nodeSelectorTerms: \n            - matchExpressions: \n              - key: 'eks.amazonaws.com/capacityType' \n                operator: In \n                values: \n                - SPOT\n

    When Spot instances are interrupted, the executors running on them may lose their shuffle data and cached RDDs (if any), which would require re-computation. This requires more compute cycles to be spent, which will impact the overall SLA of the EMR on EKS jobs. EMR on EKS has incorporated two new Spark features that can help address these issues. We discuss them in the following sections.

    Node Decommissioning:

    Node decommissioning is a Spark feature that enables the removal of an executor gracefully, by preserving its state before removing it and not scheduling any new jobs on it. This feature is particularly useful when the Spark executors are running on Spot instances, and the Spark executor node is interrupted via a \u2018rebalance recommendation\u2019 or \u2018instance termination\u2019 notice to reclaim the instance.

    Node decommission begins when a Spark executor node receives a Spot Interruption Notice or Spot Rebalance Recommendation signal. The executor node immediately starts the process of decommissioning by sending a message to the Spark driver. The driver will identify the RDD/Shuffle files that it needs to migrate off the executor node in question, and will try to identify another Executor node which can take over the execution. If an executor is identified, the RDD/Shuffle files are copied to the new executor and the job execution continues on the new executor. If all the executors are busy, the RDD/Shuffle files are copied to an external storage.

    The key advantage of this process is that it enables the block and shuffle data of a Spark executor that receives an EC2 Spot interruption signal to be migrated, reducing the re-computation of Spark tasks. The reduction in re-computation for the interrupted Spark tasks improves the resiliency of the system and reduces overall execution time. We recommend enabling the node decommissioning feature because it helps reduce the overall compute cycles when there is a Spot interruption.

    This feature is available on Amazon EMR version 6.3 and above. To set up this feature, add this configuration to the Spark job under the executor section:

    \"spark.decommission.enabled\": \"true\"\n\"spark.storage.decommission.rddBlocks.enabled\": \"true\"\n\"spark.storage.decommission.shuffleBlocks.enabled\" : \"true\"\n\"spark.storage.decommission.enabled\": \"true\"\n\"spark.storage.decommission.fallbackStorage.path\": \"s3://<<bucket>>\"\n

    The sample Spark executor logs shown below show the decommissioning process and the message being sent to the driver:

    21/05/05 17:41:41 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 7 decommissioned message\n21/05/05 17:41:41 DEBUG TaskSetManager: Valid locality levels for TaskSet 2.0: NO_PREF, ANY\n21/05/05 17:41:41 INFO KubernetesClusterSchedulerBackend: Decommission executors: 7\n21/05/05 17:41:41 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_2.0, runningTasks: 10\n21/05/05 17:41:41 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(7, 192.168.82.107, 39007, None)) as being decommissioning.\n
    21/05/05 20:22:17 INFO CoarseGrainedExecutorBackend: Decommission executor 1.\n21/05/05 20:22:17 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning\n21/05/05 20:22:17 INFO BlockManager: Starting block manager decommissioning process...\n21/05/05 20:22:17 DEBUG FileSystem: Looking for FS supporting s3a\n

    The sample Spark driver logs below show the process of looking for an executor to migrate the shuffle data to:

    22/06/07 20:41:38 INFO ShuffleStatus: Updating map output for 46 to BlockManagerId(4, 192.168.13.235, 34737, None)\n22/06/07 20:41:38 DEBUG BlockManagerMasterEndpoint: Received shuffle data block update for 0 46, ignore.\n22/06/07 20:41:38 DEBUG BlockManagerMasterEndpoint: Received shuffle index block update for 0 46, updating.\n

    The sample Spark executor logs below show the process of reusing the shuffle files:

    22/06/07 20:42:50 INFO BasicExecutorFeatureStep: Adding decommission script to lifecycle\n22/06/07 20:42:50 DEBUG ExecutorPodsAllocator: Requested executor with id 19 from Kubernetes.\n22/06/07 20:42:50 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-bfd0a5813fd1b80f-exec-19, action ADDED\n22/06/07 20:42:50 DEBUG BlockManagerMasterEndpoint: Received shuffle index block update for 0 52, updating.\n22/06/07 20:42:50 INFO ShuffleStatus: Recover 52 BlockManagerId(fallback, remote, 7337, None)\n

    More details on this can be found here

    PVC Reuse:

    A PersistentVolume is a Kubernetes feature that provides persistent storage to container Pods running stateful workloads, and a PersistentVolumeClaim (PVC) is a user's request for that storage in the container Pod. Apache Spark 3.1.0 introduced the ability to dynamically generate, mount, and remove Persistent Volume Claims (SPARK-29873) for Kubernetes workloads, which are basically volumes mounted into your Spark pods. This means Apache Spark does not have to pre-create any claims/volumes for executors and delete them during executor decommissioning.

    PVC reuse was introduced in Spark 3.2. If a Spark executor is killed due to an EC2 Spot interruption or any other failure, its PVC is not deleted but persists throughout the entire job lifetime. It is reattached to a new executor for faster recovery. If there are shuffle files on that volume, they are reused. Without this feature enabled, the owner of dynamic PVCs is the executor pod, which means that if a pod or a node becomes unavailable, the PVC is terminated, all of its shuffle data is lost, and recompute is triggered.

    This feature is available starting from Amazon EMR version 6.6. To set it up, you can add these configurations to your Spark jobs:

    \"spark.kubernetes.driver.ownPersistentVolumeClaim\": \"true\"\n\"spark.kubernetes.driver.reusePersistentVolumeClaim\": \"true\n

    Since Spark 3.4 (EMR 6.12), the Spark driver is able to do PVC-oriented executor allocation, which means Spark counts the total number of created PVCs the job can have and holds off on new executor creation if the driver already owns the maximum number of PVCs. This helps the transition of an existing PVC from one executor to another. Add this extra config to improve your PVC reuse performance:

    \"spark.kubernetes.driver.waitToReusePersistentVolumeClaim\": \"true\"\n

    One key benefit of PVC reuse is that if any executor running on EC2 Spot becomes unavailable, the replacement executor can reuse the shuffle data from the existing PVC, avoiding recompute of the shuffle blocks. A dynamic PVC, or persistent volume claim, enables \u2018true\u2019 decoupling of storage and compute when we run Spark jobs on Kubernetes, as it can also be used as local storage to spill in-process files. We recommend enabling the PVC reuse feature because the time taken to resume tasks after a Spot interruption is optimized: the files are used in-situ and there is no time required to move them around.
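
    PVC reuse assumes the executors request dynamically created PVCs in the first place. A minimal sketch of those settings is below; the volume name spark-local-dir-1, storage class, and size are illustrative assumptions:

    \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName\": \"OnDemand\"\n\"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass\": \"gp2\"\n\"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit\": \"50Gi\"\n\"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path\": \"/data\"\n\"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly\": \"false\"\n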

    If one or more of the nodes running executors is interrupted, the underlying pods get deleted and the driver gets the update. Note that the driver is the owner of the PVCs attached to executor pods, so they are not deleted throughout the job lifetime.

    22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action DELETED\n22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action MODIFIED\n22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action DELETED\n22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action MODIFIED\n

    The ExecutorPodsAllocator tries to allocate new executor pods to replace the ones killed due to the interruption. During the allocation, it tries to figure out how many of the existing PVCs have shuffle files and can be reused.

    "},{"location":"cost-optimization/docs/cost-optimization/#scaling-emr-on-eks-and-ec2-spot","title":"Scaling EMR on EKS and EC2 Spot","text":"

    One of the key advantages of using Spot instances is that it helps increase the throughput of big data workloads at a fraction of the cost of On-Demand instances. There are Spark workloads where there is a need to scale the \u2018number of executors\u2019 and the infrastructure dynamically. Scaling in a Spark job is done by spawning pod replicas, and when they cannot be scheduled in the existing cluster, the cluster needs to be scaled up by adding more nodes. When you scale up using Spot instances, you get the cost benefit of the lowest price for EC2 compute and thus increase the throughput of the job at a lower cost, as you can provision more compute capacity (at the same cost as On-Demand instances) to reduce the time taken to process large data sets.

    Dynamic Resource Allocation (DRA) enables the Spark driver to spawn the initial number of executors (pod replicas) and then scale up the number until the specified maximum number of executors is met to process the pending tasks. When the executors have no tasks running on them, they are terminated. This enables the nodes deployed in the Amazon EKS cluster to be better utilized while running multiple Spark jobs. DRA has mechanisms to dynamically adjust the resources your application occupies based on the workload. Idle executors are terminated when there are no pending tasks. This feature is available on Amazon EMR version 6.x. More details can be found here.
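
    As a sketch (the executor counts and timeout are illustrative), DRA on Kubernetes is typically enabled together with shuffle tracking, since there is no external shuffle service:

    \"spark.dynamicAllocation.enabled\": \"true\"\n\"spark.dynamicAllocation.shuffleTracking.enabled\": \"true\"\n\"spark.dynamicAllocation.minExecutors\": \"2\"\n\"spark.dynamicAllocation.maxExecutors\": \"100\"\n\"spark.dynamicAllocation.executorIdleTimeout\": \"60s\"\n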

    Scaling of the infrastructure by adding more nodes can be achieved by using Cluster Autoscaler or Karpenter.

    Cluster Autoscaler:

    Cluster Autoscaler (CAS) is an open-source Kubernetes tool that automatically scales out the size of the Kubernetes cluster when there are pending pods due to insufficient capacity on the existing cluster, or scales in when there are underutilized nodes in the cluster for an extended period of time. Multiple nodegroups with different vCPU and RAM configurations can be used to adhere to the Spot best practice of diversification. Note that each nodegroup should keep the same vCPU-to-memory ratio, as discussed above. CAS works with EKS Managed and Self-Managed Nodegroups.

    Karpenter

    Karpenter is an open-source, flexible, high-performance auto-scaler built for Kubernetes. Karpenter automatically launches just the right compute resources to handle your cluster's applications. Karpenter observes aggregate resource requests of un-schedulable pods, computes and launches best-fit new capacity.

    The Provisioner CRD\u2019s configuration flexibility is very useful for adopting the Spot best practice of diversification. It can include as many Spot Instance types as possible because we do not restrict specific instance types in the configuration. This approach is also future-proof when AWS launches new instance types. Karpenter also manages the Spot instance lifecycle and handles Spot interruptions. We recommend using Karpenter with Spot Instances as it offers faster node scheduling with early pod binding and bin-packing to optimize resource utilization. An example of a Karpenter provisioner with Spot instances is shown below.

    apiVersion: karpenter.sh/v1alpha5\nkind: Provisioner\nmetadata:\n  name: default\nspec:\n  labels:\n    intent: apps\n  requirements:\n    - key: karpenter.sh/capacity-type\n      operator: In\n      values: [\"spot\"]\n    - key: karpenter.k8s.aws/instance-size\n      operator: NotIn\n      values: [nano, micro, small, medium, large]\n  limits:\n    resources:\n      cpu: 1000\n      memory: 1000Gi\n  ttlSecondsAfterEmpty: 30\n  ttlSecondsUntilExpired: 2592000\n  providerRef:\n    name: default\n
    "},{"location":"cost-optimization/docs/cost-optimization/#emr-on-eks-and-ec2-spot-instances-best-practices","title":"EMR on EKS and EC2 Spot Instances: Best Practices","text":"

    In summary, our recommendations are:

    • Use EC2 Spot instances for Spark executors and On-Demand instances for drivers.
    • Diversify the instance types (instance family and size) used in a cluster.
    • Use a single AZ to launch a cluster to save Inter-AZ data transfer cost and improve job performance.
    • Use Karpenter for capacity provisioning and scaling when running EMR on EKS jobs.
    • If you use Cluster Autoscaler rather than Karpenter, use EKS managed node groups.
    • If you use EKS self-managed node groups, ensure the capacity-optimized allocation strategy and the AWS Node Termination Handler are in place.
    • Utilizing node decommissioning and PVC reuse can help reduce the time taken to complete an EMR on EKS job when EC2 Spot interruptions occur. However, they do not guarantee that shuffle data is never lost during interruptions.
    • Implementing a Remote Shuffle Service (RSS) solution can enhance job stability and availability if Node decommissioning and PVC Reuse features do not fully meet your requirements.
    • Spark's Dynamic Resource Allocation (DRA) feature is particularly useful for reducing job costs, as it releases idle resources if not needed. The cost of EMR on EKS is determined by resource consumption at various stages of a job and is not calculated by the EMR unit price * job run time.
    • DRA implementation on EKS is different from Spark on YARN. Check out the details here.
    • Decouple compute and storage. For example, use S3 to store input/output data or use RSS to store shuffle data. This allows independent scaling of processing and storage, and also lowers the chance of losing data in case of a Spot interruption.
    • Reduce Spark\u2019s shuffle size and blast radius. This allows you to select more Spot instance types for diversification and also reduces the time taken to recompute or move the shuffle files in case of an interruption.
    • Automate Spot Interruption handling via existing tools and services.
    "},{"location":"cost-optimization/docs/cost-optimization/#conclusion","title":"Conclusion","text":"

    In this document, we covered best practices to cost-effectively run EMR on EKS workloads using EC2 Spot Instances. We outlined three key areas: provisioning, interruption handling, and scaling, along with the corresponding best practices for each. We aim for this document to offer prescriptive guidance on running EMR on EKS workloads with substantial cost savings through the use of Spot Instances.

    "},{"location":"cost-optimization/docs/node-decommission/","title":"Node Decommission","text":"

    This section shows how to use an Apache Spark feature that migrates the shuffle data and cached RDD blocks present on terminating executors to peer executors before a Spot node gets decommissioned. Consequently, your job does not need to recalculate the shuffle and RDD blocks of the terminating executor that would otherwise be lost, allowing the job to complete with minimal delay.

    This feature is supported on EMR releases 6.3.0 and later.

    "},{"location":"cost-optimization/docs/node-decommission/#how-does-it-work","title":"How does it work?","text":"

    When spark.decommission.enabled is true, Spark will try its best to shut down the executor gracefully. spark.storage.decommission.enabled enables migrating data stored on the executor. Spark will try to migrate all the cached RDD blocks (controlled by spark.storage.decommission.rddBlocks.enabled) and shuffle blocks (controlled by spark.storage.decommission.shuffleBlocks.enabled) from the decommissioning executor to all remote executors when decommissioning is enabled. The relevant Spark configurations for using node decommissioning in jobs are:

    • spark.decommission.enabled: Whether to enable decommissioning. Default: false
    • spark.storage.decommission.enabled: Whether to decommission the block manager when decommissioning the executor. Default: false
    • spark.storage.decommission.rddBlocks.enabled: Whether to transfer RDD blocks during block manager decommissioning. Default: false
    • spark.storage.decommission.shuffleBlocks.enabled: Whether to transfer shuffle blocks during block manager decommissioning. Requires a migratable shuffle resolver (like sort based shuffle). Default: false
    • spark.storage.decommission.maxReplicationFailuresPerBlock: Maximum number of failures which can be handled for migrating shuffle blocks when the block manager is decommissioning and trying to move its existing blocks. Default: 3
    • spark.storage.decommission.shuffleBlocks.maxThreads: Maximum number of threads to use in migrating shuffle files. Default: 8

    This feature can currently be enabled through a temporary workaround on EMR 6.3.0+ releases. To enable it, Spark\u2019s decom.sh file permission must be modified using a custom image. Once the code is fixed, the page will be updated.

    Dockerfile for custom image:

    FROM <release account id>.dkr.ecr.<aws region>.amazonaws.com/spark/<release>\nUSER root\nWORKDIR /home/hadoop\nRUN chown hadoop:hadoop /usr/bin/decom.sh\n

    Setting decommission timeout:

    Each executor has to be decommissioned within a certain time limit controlled by the pod\u2019s terminationGracePeriodSeconds configuration. The default value is 30 seconds, but it can be modified using a custom pod template. The pod template for this modification would look like:

    apiVersion: v1\nkind: Pod\nspec:\n  terminationGracePeriodSeconds: <seconds>\n

    Note: The terminationGracePeriodSeconds timeout should be less than the Spot instance interruption timeout, with a buffer of around 5 seconds kept aside for triggering the node termination.

    Request:

    cat >spark-python-with-node-decommissioning.json << EOF\n{\n   \"name\": \"my-job-run-with-node-decommissioning\",\n   \"virtualClusterId\": \"<virtual-cluster-id>\",\n   \"executionRoleArn\": \"<execution-role-arn>\",\n   \"releaseLabel\": \"emr-6.3.0-latest\", \n   \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n   }, \n   \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n       \"classification\": \"spark-defaults\",\n       \"properties\": {\n       \"spark.kubernetes.container.image\": \"<account_id>.dkr.ecr.<region>.amazonaws.com/<custom_image_repo>\",\n       \"spark.executor.instances\": \"5\",\n        \"spark.decommission.enabled\": \"true\",\n        \"spark.storage.decommission.rddBlocks.enabled\": \"true\",\n        \"spark.storage.decommission.shuffleBlocks.enabled\" : \"true\",\n        \"spark.storage.decommission.enabled\": \"true\"\n       }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"<log group>\", \n        \"logStreamNamePrefix\": \"<log-group-prefix>\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"<S3 URI>\"\n      }\n    }\n   } \n}\nEOF\n

    Observed Behavior:

    When executors begin decommissioning, their shuffle data gets migrated to peer executors instead of the shuffle blocks being recalculated. If sending shuffle blocks to an executor fails, spark.storage.decommission.maxReplicationFailuresPerBlock gives the number of retries for the migration. The driver\u2019s stderr log will contain lines like Updating map output for <shuffle_id> to BlockManagerId(<executor_id>, <ip_address>, <port>, <topology_info>) denoting details about a shuffle block\u2019s migration. This feature does not emit any other metrics for validation yet."},{"location":"metastore-integrations/docs/aws-glue/","title":"EMR Containers integration with AWS Glue","text":""},{"location":"metastore-integrations/docs/aws-glue/#aws-glue-catalog-in-same-account-as-eks","title":"AWS Glue catalog in same account as EKS","text":"

    In the example below, a Spark application is configured to use the AWS Glue Data Catalog as the Hive metastore.

    gluequery.py

    cat > gluequery.py <<EOF\nfrom os.path import expanduser, join, abspath\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql import Row\n# warehouse_location points to the default location for managed databases and tables\nwarehouse_location = abspath('spark-warehouse')\nspark = SparkSession \\\n    .builder \\\n    .appName(\"Python Spark SQL Hive integration example\") \\\n    .config(\"spark.sql.warehouse.dir\", warehouse_location) \\\n    .enableHiveSupport() \\\n    .getOrCreate()\nspark.sql(\"CREATE EXTERNAL TABLE `sparkemrnyc`( `dispatching_base_num` string, `pickup_datetime` string, `dropoff_datetime` string, `pulocationid` bigint, `dolocationid` bigint, `sr_flag` bigint) STORED AS PARQUET LOCATION 's3://<s3 prefix>/trip-data.parquet/'\")\nspark.sql(\"SELECT count(*) FROM sparkemrnyc\").show()\nspark.stop()\nEOF\n
    LOCATION 's3://<s3 prefix>/trip-data.parquet/'\n

    Configure the above property to point to the S3 location containing the data.

    Request

    cat > Spark-Python-in-s3-awsglue-log.json << EOF\n{\n  \"name\": \"spark-python-in-s3-awsglue-log\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/gluequery.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=3 --conf spark.executor.memory=8G --conf spark.driver.memory=6G --conf spark.executor.cores=3\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.hadoop.hive.metastore.client.factory.class\":\"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory\",\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3-awsglue-log.json\n

    Output from driver logs - Displays the number of rows.

    +----------+\n|  count(1)|\n+----------+\n|2716504499|\n+----------+\n
    "},{"location":"metastore-integrations/docs/aws-glue/#aws-glue-catalog-in-different-account","title":"AWS Glue catalog in different account","text":"

    The Spark application is submitted to an EMR virtual cluster in Account A and is configured to connect to the AWS Glue catalog in Account B. The IAM policy below is attached to the job execution role ("executionRoleArn": "<execution-role-arn>") in Account A.

    {\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Effect\": \"Allow\",\n            \"Action\": [\n                \"glue:*\"\n            ],\n            \"Resource\": [\n                \"arn:aws:glue:<region>:<account>:catalog\",\n                \"arn:aws:glue:<region>:<account>:database/default\",\n                \"arn:aws:glue:<region>:<account>:table/default/sparkemrnyc\"\n            ]\n        }\n    ]\n}\n

    IAM policy attached to the AWS Glue catalog in Account B

    {\n  \"Version\" : \"2012-10-17\",\n  \"Statement\" : [ {\n    \"Effect\" : \"Allow\",\n    \"Principal\" : {\n      \"AWS\" : \"<execution-role-arn>\"\n    },\n    \"Action\" : \"glue:*\",\n    \"Resource\" : [ \"arn:aws:glue:<region>:<account>:catalog\", \"arn:aws:glue:<region>:<account>:database/default\", \"arn:aws:glue:<region>:<account>:table/default/sparkemrnyc\" ]\n  } ]\n}\n

    Request

    cat > Spark-Python-in-s3-awsglue-crossaccount.json << EOF\n{\n  \"name\": \"spark-python-in-s3-awsglue-crossaccount\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/gluequery.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6 \"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.hadoop.hive.metastore.client.factory.class\":\"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory\",\n          \"spark.hadoop.hive.metastore.glue.catalogid\":\"<account B>\",\n          }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3-awsglue-crossaccount.json\n

    Configuration of interest

    To specify the account ID where the AWS Glue catalog is defined, reference the following:

    Spark-Glue integration

    \"spark.hadoop.hive.metastore.glue.catalogid\":\"<account B>\",\n

    Output from driver logs - displays the number of rows.

    +----------+\n|  count(1)|\n+----------+\n|2716504499|\n+----------+\n
    "},{"location":"metastore-integrations/docs/aws-glue/#sync-hudi-table-with-aws-glue-catalog","title":"Sync Hudi table with AWS Glue catalog","text":"

    In this example, a Spark application is configured to use the AWS Glue Data Catalog as the Hive metastore.

    Starting from Hudi 0.9.0, we can synchronize a Hudi table's latest schema to the Glue catalog via the Hive Metastore Service (HMS) in hive sync mode. This example runs a Hudi ETL job with EMR on EKS and interacts with the AWS Glue metastore to create a Hudi table. It gives you native and serverless capabilities to manage your technical metadata. You can also query Hudi tables in Athena straight away after the ETL job, which gives your end users easy data access and shortens the time to insight.

    HudiEMRonEKS.py

    cat > HudiEMRonEKS.py <<EOF\nimport sys\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession \\\n    .builder \\\n    .config(\"spark.sql.warehouse.dir\", sys.argv[1]+\"/warehouse/\" ) \\\n    .enableHiveSupport() \\\n    .getOrCreate()\n\n# Create a DataFrame\ninputDF = spark.createDataFrame(\n    [\n        (\"100\", \"2015-01-01\", \"2015-01-01T13:51:39.340396Z\"),\n        (\"101\", \"2015-01-01\", \"2015-01-01T12:14:58.597216Z\"),\n        (\"102\", \"2015-01-01\", \"2015-01-01T13:51:40.417052Z\"),\n        (\"103\", \"2015-01-01\", \"2015-01-01T13:51:40.519832Z\"),\n        (\"104\", \"2015-01-02\", \"2015-01-01T12:15:00.512679Z\"),\n        (\"105\", \"2015-01-02\", \"2015-01-01T13:51:42.248818Z\"),\n    ],\n    [\"id\", \"creation_date\", \"last_update_time\"]\n)\n\n# Specify common DataSourceWriteOptions in the single hudiOptions variable\ntest_tableName = \"hudi_tbl\"\nhudiOptions = {\n'hoodie.table.name': test_tableName,\n'hoodie.datasource.write.recordkey.field': 'id',\n'hoodie.datasource.write.partitionpath.field': 'creation_date',\n'hoodie.datasource.write.precombine.field': 'last_update_time',\n'hoodie.datasource.hive_sync.enable': 'true',\n'hoodie.datasource.hive_sync.table': test_tableName,\n'hoodie.datasource.hive_sync.database': 'default',\n'hoodie.datasource.write.hive_style_partitioning': 'true',\n'hoodie.datasource.hive_sync.partition_fields': 'creation_date',\n'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',\n'hoodie.datasource.hive_sync.mode': 'hms'\n}\n\n\n# Write a DataFrame as a Hudi dataset\ninputDF.write \\\n.format('org.apache.hudi') \\\n.option('hoodie.datasource.write.operation', 'bulk_insert') \\\n.options(**hudiOptions) \\\n.mode('overwrite') \\\n.save(sys.argv[1]+\"/hudi_hive_insert\")\nEOF\n

    NOTE: Configure the warehouse dir property to point to an S3 location as your Hive warehouse storage. The S3 location can be dynamic, based on an argument passed in or an environment variable.

    .config(\"spark.sql.warehouse.dir\", sys.argv[1]+\"/warehouse/\" )\n

    Request

    export S3BUCKET=YOUR_S3_BUCKET_NAME\n\naws emr-containers start-job-run \\\n--virtual-cluster-id $VIRTUAL_CLUSTER_ID \\\n--name hudi-test1 \\\n--execution-role-arn $EMR_ROLE_ARN \\\n--release-label emr-6.3.0-latest \\\n--job-driver '{\n  \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://'$S3BUCKET'/app_code/job/HudiEMRonEKS.py\",\n      \"entryPointArguments\":[\"s3://'$S3BUCKET'\"],\n      \"sparkSubmitParameters\": \"--jars https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3-bundle_2.12/0.9.0/hudi-spark3-bundle_2.12-0.9.0.jar --conf spark.executor.cores=1 --conf spark.executor.instances=2\"}}' \\\n--configuration-overrides '{\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.serializer\": \"org.apache.spark.serializer.KryoSerializer\",\n          \"spark.sql.hive.convertMetastoreParquet\": \"false\",\n          \"spark.hadoop.hive.metastore.client.factory.class\": \"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory\"\n        }}\n    ], \n    \"monitoringConfiguration\": {\n      \"s3MonitoringConfiguration\": {\"logUri\": \"s3://'$S3BUCKET'/elasticmapreduce/emr-containers\"}}}'\n

    NOTE: To get the correct version of the Hudi library, we directly download the jar from the Maven repository with the syntax \"sparkSubmitParameters\": \"--jars https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3-bundle_2.12/0.9.0/hudi-spark3-bundle_2.12-0.9.0.jar. Starting from EMR 6.5, the Hudi-spark3-bundle library is included in EMR docker images.

    "},{"location":"metastore-integrations/docs/hive-metastore/","title":"EMR Containers integration with Hive Metastore","text":"

    For more details, check out the github repository, which includes CDK/CFN templates that help you to get started quickly.

    "},{"location":"metastore-integrations/docs/hive-metastore/#1-hive-metastore-database-through-jdbc","title":"1-Hive metastore Database through JDBC","text":"

    In this example, a Spark application is configured to connect to a Hive Metastore database provisioned with Amazon RDS Aurora MySQL via a JDBC connection. The Amazon RDS instance and the EKS cluster should be in the same VPC, or else the Spark job will not be able to connect to RDS.

    You directly pass the JDBC credentials in at the job/application level, which is a simple and quick way to connect to the HMS. However, it is not recommended in a production environment. From a security perspective, password management is a risk, since the JDBC credentials appear in all of your job logs. Engineers may also end up holding the password when it is not necessary.

    Request:

    cat > Spark-Python-in-s3-hms-jdbc.json << EOF\n{\n  \"name\": \"spark-python-in-s3-hms-jdbc\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/hivejdbc.py\", \n       \"sparkSubmitParameters\": \"--jars s3://<s3 prefix>/mariadb-connector-java.jar --conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.mariadb.jdbc.Driver --conf spark.hadoop.javax.jdo.option.ConnectionUserName=<connection-user-name> --conf spark.hadoop.javax.jdo.option.ConnectionPassword=<connection-password> --conf spark.hadoop.javax.jdo.option.ConnectionURL=<JDBC-Connection-string> --conf spark.driver.cores=5 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n          }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3-hms-jdbc.json\n

    In this example we are connecting to a MySQL database, so mariadb-connector-java.jar needs to be passed with the --jars option. If you are using PostgreSQL, Oracle or any other database, the appropriate connector jar needs to be included.

    Configuration of interest:

    --jars s3://<s3 prefix>/mariadb-connector-java.jar\n--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.mariadb.jdbc.Driver \n--conf spark.hadoop.javax.jdo.option.ConnectionUserName=<connection-user-name>  \n--conf spark.hadoop.javax.jdo.option.ConnectionPassword=<connection-password>\n--conf spark.hadoop.javax.jdo.option.ConnectionURL=<JDBC-Connection-string>\n

    hivejdbc.py

    from os.path import expanduser, join, abspath\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql import Row\n# warehouse_location points to the default location for managed databases and tables\nwarehouse_location = abspath('spark-warehouse')\nspark = SparkSession \\\n    .builder \\\n    .config(\"spark.sql.warehouse.dir\", warehouse_location) \\\n    .enableHiveSupport() \\\n    .getOrCreate()\nspark.sql(\"SHOW DATABASES\").show()\nspark.sql(\"CREATE EXTERNAL TABLE `ehmsdb`.`sparkemrnyc5`( `dispatching_base_num` string, `pickup_datetime` string, `dropoff_datetime` string, `pulocationid` bigint, `dolocationid` bigint, `sr_flag` bigint) STORED AS PARQUET LOCATION 's3://<s3 prefix>/nyctaxi_parquet/'\")\nspark.sql(\"SELECT count(*) FROM ehmsdb.sparkemrnyc5 \").show()\nspark.stop()\n

    The above job lists databases from a remote RDS Hive Metastore, creates a new table and then queries it.

    "},{"location":"metastore-integrations/docs/hive-metastore/#2-hive-metastore-thrift-service-through-thrift-protocol","title":"2-Hive metastore thrift service through thrift:// protocol","text":"

    In this example, the Spark application is configured to connect to an external Hive metastore thrift server. The thrift server runs on the EMR on EC2 cluster's master node, and AWS RDS Aurora is used as the database for the Hive metastore.

    Running an EMR on EC2 cluster as a thrift server simplifies the application configuration and setup. You can start quickly with reduced engineering effort. However, your maintenance overhead may increase, since you will be monitoring two types of clusters, i.e. EMR on EC2 and EMR on EKS.

    thriftscript.py: hive.metastore.uris config needs to be set to read from external Hive metastore. The URI format looks like this: thrift://EMR_ON_EC2_MASTER_NODE_DNS_NAME:9083

    from os.path import expanduser, join, abspath\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql import Row\n# warehouse_location points to the default location for managed databases and tables\nwarehouse_location = abspath('spark-warehouse')\nspark = SparkSession \\\n    .builder \\\n    .config(\"spark.sql.warehouse.dir\", warehouse_location) \\\n    .config(\"hive.metastore.uris\",\"<hive metastore thrift uri>\") \\\n    .enableHiveSupport() \\\n    .getOrCreate()\nspark.sql(\"SHOW DATABASES\").show()\nspark.sql(\"CREATE EXTERNAL TABLE ehmsdb.`sparkemrnyc2`( `dispatching_base_num` string, `pickup_datetime` string, `dropoff_datetime` string, `pulocationid` bigint, `dolocationid` bigint, `sr_flag` bigint) STORED AS PARQUET LOCATION 's3://<s3 prefix>/nyctaxi_parquet/'\")\nspark.sql(\"SELECT * FROM ehmsdb.sparkemrnyc2\").show()\nspark.stop()\n

    Request:

    The below job lists databases from remote Hive Metastore, creates a new table and then queries it.

    cat > Spark-Python-in-s3-hms-thrift.json << EOF\n{\n  \"name\": \"spark-python-in-s3-hms-thrift\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/thriftscript.py\", \n       \"sparkSubmitParameters\": \"--jars s3://<s3 prefix>/mariadb-connector-java.jar --conf spark.driver.cores=5 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n          }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3-hms-thrift.json\n
    "},{"location":"metastore-integrations/docs/hive-metastore/#3-connect-hive-metastore-via-thrift-service-hosted-on-eks","title":"3-Connect Hive metastore via thrift service hosted on EKS","text":"

    In this example, our Spark application connects to a standalone Hive metastore service (HMS) running in EKS.

    Running the standalone HMS in EKS unifies your analytics applications with other business-critical apps on a single platform. It simplifies your solution architecture and infrastructure design. The helm chart solution includes an autoscaling feature, so your EKS cluster can automatically expand or shrink as the HMS request volume changes. It also follows the security best practice of managing JDBC credentials via AWS Secrets Manager. However, you will need a combination of analytics and k8s skills to maintain this solution.

    To install the HMS helm chart, simply replace the environment variables in values.yaml, then run helm install manually via the command below. Alternatively, deploy the HMS via a CDK/CFN template that follows security best practices. Check out the CDK project for more details.

    cd hive-emr-on-eks/hive-metastore-chart\n\nsed -i '' -e 's/{RDS_JDBC_URL}/\"jdbc:mysql:\\/\\/'$YOUR_HOST_NAME':3306\\/'$YOUR_DB_NAME'?createDatabaseIfNotExist=true\"/g' values.yaml \nsed -i '' -e 's/{RDS_USERNAME}/'$YOUR_USER_NAME'/g' values.yaml \nsed -i '' -e 's/{RDS_PASSWORD}/'$YOUR_PASSWORD'/g' values.yaml\nsed -i '' -e 's/{S3BUCKET}/s3:\\/\\/'$YOUR_S3BUCKET'/g' values.yaml\n\nhelm repo add hive-metastore https://aws-samples.github.io/hive-metastore-chart \nhelm install hive hive-metastore/hive-metastore -f values.yaml --namespace=emr --debug\n

    hivethrift_eks.py

    from os import environ\nimport sys\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession \\\n    .builder \\\n    .config(\"spark.sql.warehouse.dir\",environ['warehouse_location']) \\\n    .config(\"hive.metastore.uris\",\"thrift://\"+environ['HIVE_METASTORE_SERVICE_HOST']+\":9083\") \\\n    .enableHiveSupport() \\\n    .getOrCreate()\n\nspark.sql(\"SHOW DATABASES\").show()\nspark.sql(\"CREATE DATABASE IF NOT EXISTS `demo`\")\nspark.sql(\"DROP TABLE IF EXISTS demo.amazonreview3\")\nspark.sql(\"CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`amazonreview3`( `marketplace` string,`customer_id`string,`review_id` string,`product_id` string,`product_parent` string,`product_title` string,`star_rating` integer,`helpful_votes` integer,`total_votes` integer,`vine` string,`verified_purchase` string,`review_headline` string,`review_body` string,`review_date` date,`year` integer) STORED AS PARQUET LOCATION '\"+sys.argv[1]+\"/app_code/data/toy/'\")\nspark.sql(\"SELECT count(*) FROM demo.amazonreview3\").show()\nspark.stop()\n

    An environment variable HIVE_METASTORE_SERVICE_HOST appears in your Spark application pods automatically, once the standalone HMS is up and running in EKS. You can directly set the hive.metastore.uris to thrift://\"+environ['HIVE_METASTORE_SERVICE_HOST']+\":9083\".

    You can set the spark.sql.warehouse.dir property to an S3 location as your Hive warehouse storage. The S3 location can be dynamic, based on an argument passed in or an environment variable.

    Request:

    #!/bin/bash\naws emr-containers start-job-run \\\n--virtual-cluster-id $VIRTUAL_CLUSTER_ID \\\n--name spark-hive-via-thrift \\\n--execution-role-arn $EMR_ROLE_ARN \\\n--release-label emr-6.2.0-latest \\\n--job-driver '{\n  \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://'$S3BUCKET'/app_code/job/hivethrift_eks.py\",\n      \"entryPointArguments\":[\"s3://'$S3BUCKET'\"],\n      \"sparkSubmitParameters\": \"--conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2\"}}' \\\n--configuration-overrides '{\n    \"monitoringConfiguration\": {\n      \"s3MonitoringConfiguration\": {\"logUri\": \"s3://'$S3BUCKET'/elasticmapreduce/emr-containers\"}}}'\n
    "},{"location":"metastore-integrations/docs/hive-metastore/#4-run-thrift-service-as-a-sidecar-in-spark-drivers-pod","title":"4-Run thrift service as a sidecar in Spark Driver's pod","text":"

    This advanced solution runs the standalone HMS thrift service inside the Spark driver pod as a sidecar, which means each Spark job has its own dedicated thrift server. The benefit of this design is that the HMS is no longer a single point of failure, since each Spark application has its own HMS. It is also no longer a long-running service: it spins up when your Spark job starts and terminates when your job is done. The sidecar follows the security best practice of leveraging Secrets Manager to retrieve JDBC credentials. However, the maintenance effort increases, because you now need to manage the HMS sidecar, custom configmaps and sidecar pod templates. This solution also requires a combination of analytics and k8s skills.

    The CDK/CFN template is available to simplify the installation against a new EKS cluster. If you have an existing EKS cluster, the prerequisite details can be found in the github repository.

    sidecar_hivethrift_eks.py:

    from os import environ\nimport sys\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession \\\n    .builder \\\n    .config(\"spark.sql.warehouse.dir\",environ['warehouse_location']) \\\n    .enableHiveSupport() \\\n    .getOrCreate()\n\nspark.sql(\"SHOW DATABASES\").show()\nspark.sql(\"CREATE DATABASE IF NOT EXISTS `demo`\")\nspark.sql(\"DROP TABLE IF EXISTS demo.amazonreview4\")\nspark.sql(\"CREATE EXTERNAL TABLE `demo`.`amazonreview4`( `marketplace` string,`customer_id`string,`review_id` string,`product_id` string,`product_parent` string,`product_title` string,`star_rating` integer,`helpful_votes` integer,`total_votes` integer,`vine` string,`verified_purchase` string,`review_headline` string,`review_body` string,`review_date` date,`year` integer) STORED AS PARQUET LOCATION '\"+sys.argv[1]+\"/app_code/data/toy/'\")\nspark.sql(\"SELECT count(*) FROM demo.amazonreview4\").show()\nspark.stop()\n

    Request:

    Now that the HMS is running inside your Spark driver pod and shares common attributes such as the network configuration, spark.hive.metastore.uris can be set to \"thrift://localhost:9083\". Don't forget to assign the sidecar pod template to the Spark driver like this: \"spark.kubernetes.driver.podTemplateFile\": \"s3://'$S3BUCKET'/app_code/job/sidecar_hms_pod_template.yaml\"
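
    The sidecar pod template itself is maintained in that repository. As a rough sketch of its shape (the HMS container name, image and tag below are placeholders, not the repository's actual values), it pairs the Spark driver container with an HMS container listening on port 9083:

    cat > sidecar_hms_pod_template.yaml << EOF
    apiVersion: v1
    kind: Pod
    spec:
      containers:
      - name: spark-kubernetes-driver   # interpreted as the Spark driver container
      - name: hive-metastore            # sidecar running the standalone HMS thrift service
        image: <account_id>.dkr.ecr.<region>.amazonaws.com/<hms-image>:<tag>
        ports:
        - containerPort: 9083
    EOF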

    For more details, check out the github repo.

    #!/bin/bash\n# test HMS sidecar on EKS\naws emr-containers start-job-run \\\n--virtual-cluster-id $VIRTUAL_CLUSTER_ID \\\n--name sidecar-hms \\\n--execution-role-arn $EMR_ROLE_ARN \\\n--release-label emr-6.3.0-latest \\\n--job-driver '{\n  \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://'$S3BUCKET'/app_code/job/sidecar_hivethrift_eks.py\",\n      \"entryPointArguments\":[\"s3://'$S3BUCKET'\"],\n      \"sparkSubmitParameters\": \"--conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2\"}}' \\\n--configuration-overrides '{\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.kubernetes.driver.podTemplateFile\": \"s3://'$S3BUCKET'/app_code/job/sidecar_hms_pod_template.yaml\",\n          \"spark.hive.metastore.uris\": \"thrift://localhost:9083\"\n        }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"s3MonitoringConfiguration\": {\"logUri\": \"s3://'$S3BUCKET'/elasticmapreduce/emr-containers\"}}}'\n
    "},{"location":"metastore-integrations/docs/hive-metastore/#5-hudi-remote-hive-metastore-integration","title":"5-Hudi + Remote Hive metastore integration","text":"

    Starting from Hudi 0.9.0, we can synchronize a Hudi table's latest schema to the Hive metastore in HMS sync mode, with the setting 'hoodie.datasource.hive_sync.mode': 'hms'.

    This example runs a Hudi job with EMR on EKS and interacts with a remote RDS Hive metastore to create a Hudi table. As a serverless option, it can also interact with the AWS Glue catalog; check out the AWS Glue section for more details.

    HudiEMRonEKS.py

    from os import environ\nimport sys\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession \\\n    .builder \\\n    .config(\"spark.sql.warehouse.dir\", sys.argv[1]+\"/warehouse/\" ) \\\n    .enableHiveSupport() \\\n    .getOrCreate()\n\n# Create a DataFrame\ninputDF = spark.createDataFrame(\n    [\n        (\"100\", \"2015-01-01\", \"2015-01-01T13:51:39.340396Z\"),\n        (\"101\", \"2015-01-01\", \"2015-01-01T12:14:58.597216Z\"),\n        (\"102\", \"2015-01-01\", \"2015-01-01T13:51:40.417052Z\"),\n        (\"103\", \"2015-01-01\", \"2015-01-01T13:51:40.519832Z\"),\n        (\"104\", \"2015-01-02\", \"2015-01-01T12:15:00.512679Z\"),\n        (\"105\", \"2015-01-02\", \"2015-01-01T13:51:42.248818Z\"),\n    ],\n    [\"id\", \"creation_date\", \"last_update_time\"]\n)\n\n# Specify common DataSourceWriteOptions in the single hudiOptions variable\ntest_tableName = \"hudi_tbl\"\nhudiOptions = {\n'hoodie.table.name': test_tableName,\n'hoodie.datasource.write.recordkey.field': 'id',\n'hoodie.datasource.write.partitionpath.field': 'creation_date',\n'hoodie.datasource.write.precombine.field': 'last_update_time',\n'hoodie.datasource.hive_sync.enable': 'true',\n'hoodie.datasource.hive_sync.table': test_tableName,\n'hoodie.datasource.hive_sync.database': 'default',\n'hoodie.datasource.write.hive_style_partitioning': 'true',\n'hoodie.datasource.hive_sync.partition_fields': 'creation_date',\n'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',\n'hoodie.datasource.hive_sync.mode': 'hms'\n}\n\n\n# Write a DataFrame as a Hudi dataset\ninputDF.write \\\n.format('org.apache.hudi') \\\n.option('hoodie.datasource.write.operation', 'bulk_insert') \\\n.options(**hudiOptions) \\\n.mode('overwrite') \\\n.save(sys.argv[1]+\"/hudi_hive_insert\")\n\nprint(\"After {}\".format(spark.catalog.listTables()))\n

    Request:

    The latest Hudi-spark3-bundle library is needed to support the new HMS hive sync functionality. In the following sample script, it is downloaded from the Maven repository when submitting a job with EMR 6.3. Starting from EMR 6.5, you don't need the --jars setting anymore, because EMR 6.5+ includes the Hudi-spark3-bundle library.

    aws emr-containers start-job-run \\\n--virtual-cluster-id $VIRTUAL_CLUSTER_ID \\\n--name hudi-test1 \\\n--execution-role-arn $EMR_ROLE_ARN \\\n--release-label emr-6.3.0-latest \\\n--job-driver '{\n  \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://'$S3BUCKET'/app_code/job/HudiEMRonEKS.py\",\n      \"entryPointArguments\":[\"s3://'$S3BUCKET'\"],\n      \"sparkSubmitParameters\": \"--jars https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3-bundle_2.12/0.9.0/hudi-spark3-bundle_2.12-0.9.0.jar --conf spark.executor.cores=1 --conf spark.executor.instances=2\"}}' \\\n--configuration-overrides '{\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.serializer\": \"org.apache.spark.serializer.KryoSerializer\",\n          \"spark.sql.hive.convertMetastoreParquet\": \"false\",\n          \"spark.hive.metastore.uris\": \"thrift://localhost:9083\",\n          \"spark.kubernetes.driver.podTemplateFile\": \"s3://'$S3BUCKET'/app_code/job/sidecar_hms_pod_template.yaml\"\n        }}\n    ], \n    \"monitoringConfiguration\": {\n      \"s3MonitoringConfiguration\": {\"logUri\": \"s3://'$S3BUCKET'/elasticmapreduce/emr-containers\"}}}'\n
    "},{"location":"node-placement/docs/eks-node-placement/","title":"EKS Node Placement","text":""},{"location":"node-placement/docs/eks-node-placement/#single-az-placement","title":"Single AZ placement","text":"

    AWS EKS clusters can span multiple AZs in a VPC. A Spark application whose driver and executor pods are distributed across multiple AZs can incur inter-AZ data transfer costs. To minimize or eliminate inter-AZ data transfer costs, you can configure the application to run only on nodes within a single AZ. In this example, we use the Kubernetes node selector to specify which AZ the job should run in.

    Request:

    cat >spark-python-in-s3-nodeselector.json << EOF\n{\n  \"name\": \"spark-python-in-s3-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.kubernetes.node.selector.topology.kubernetes.io/zone='<availability zone>' --conf spark.driver.cores=5  --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-nodeselector.json\n

    Observed Behavior: When the job starts, the driver pod and executor pods are scheduled only on those EKS worker nodes with the label topology.kubernetes.io/zone: <availability zone>. This ensures the Spark job runs within a single AZ. If there are not enough resources within the specified AZ, the pods will remain in the pending state until the autoscaler (if configured) kicks in or more resources become available.

    See the Spark on Kubernetes node selector configuration and the Kubernetes node selector reference for more details.

    Configuration of interest -

    --conf spark.kubernetes.node.selector.topology.kubernetes.io/zone='<availability zone>'\n

    topology.kubernetes.io/zone is a built-in label that EKS assigns to every EKS worker node. The above config will ensure that the driver and executor pods are scheduled on EKS worker nodes labeled topology.kubernetes.io/zone: <availability zone>. However, user-defined labels can also be assigned to EKS worker nodes and used as node selectors.

    Other common use cases are using node labels to force the job to run on On-Demand or Spot capacity, on a specific machine type, and so on; an example is shown below.
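
    For example, EKS managed node group nodes carry the eks.amazonaws.com/capacityType label (this assumption does not hold for self-managed nodes unless you attach an equivalent label), so pinning the driver and executor pods to On-Demand capacity could look like:

    --conf spark.kubernetes.node.selector.eks.amazonaws.com/capacityType=ON_DEMAND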

    "},{"location":"node-placement/docs/eks-node-placement/#single-az-and-ec2-instance-type-placement","title":"Single AZ and ec2 instance type placement","text":"

    Multiple key/value pairs for spark.kubernetes.node.selector.[labelKey] can be passed to add filter conditions for selecting the EKS worker nodes. For example, if you want to schedule on EKS worker nodes in <availability zone> with instance type m5.4xlarge, it is done as below.

    Request:

    {\n  \"name\": \"spark-python-in-s3-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.kubernetes.pyspark.pythonVersion=3 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6 --conf spark.sql.shuffle.partitions=1000\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\",\n          \"spark.kubernetes.node.selector.topology.kubernetes.io/zone\":\"<availability zone>\",\n          \"spark.kubernetes.node.selector.node.kubernetes.io/instance-type\":\"m5.4xlarge\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n      }\n      }\n    }\n  }\n}\n

    Configuration of interest

    spark.kubernetes.node.selector.[labelKey] - Adds to the node selector of the driver pod and executor pods, with key labelKey and the value as the configuration's value. For example, setting spark.kubernetes.node.selector.identifier to myIdentifier will result in the driver pod and executors having a node selector with key identifier and value myIdentifier. Multiple node selector keys can be added by setting multiple configurations with this prefix.
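
    Expressed as a sparkSubmitParameters entry, the example from the paragraph above (with the illustrative label key identifier and value myIdentifier) would look like:

    --conf spark.kubernetes.node.selector.identifier=myIdentifier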

    "},{"location":"node-placement/docs/eks-node-placement/#job-submitter-pod-placement","title":"Job submitter pod placement","text":"

    Similar to the driver and executor pods, you can configure the job submitter pod's node selectors as well, using the emr-job-submitter classification. It is recommended to place job submitter pods on ON_DEMAND nodes rather than SPOT nodes, as the job will fail if the job submitter pod is interrupted by a Spot instance interruption. You can also place the job submitter pod in a single AZ or use any Kubernetes labels that are applied to the nodes.

    Note: The job submitter pod is also referred to as the job-runner pod.

    StartJobRun request with ON_DEMAND node placement for job submitter pod

    cat >spark-python-in-s3-nodeselector-job-submitter.json << EOF\n{\n  \"name\": \"spark-python-in-s3-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      },\n      {\n        \"classification\": \"emr-job-submitter\",\n        \"properties\": {\n            \"jobsubmitter.node.selector.eks.amazonaws.com/capacityType\": \"ON_DEMAND\"\n        }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-nodeselector-job-submitter.json\n

    StartJobRun request with Single AZ node placement for job submitter pod:

    cat >spark-python-in-s3-nodeselector-job-submitter-az.json << EOF\n{\n  \"name\": \"spark-python-in-s3-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      },\n      {\n        \"classification\": \"emr-job-submitter\",\n        \"properties\": {\n            \"jobsubmitter.node.selector.topology.kubernetes.io/zone\": \"<availability zone>\"\n        }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-nodeselector-job-submitter-az.json\n

    StartJobRun request with single AZ and ec2 instance type placement for job submitter pod:

    {\n  \"name\": \"spark-python-in-s3-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.kubernetes.pyspark.pythonVersion=3 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6 --conf spark.sql.shuffle.partitions=1000\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\",\n         }\n      },\n      {\n        \"classification\": \"emr-job-submitter\",\n        \"properties\": {\n            \"jobsubmitter.node.selector.topology.kubernetes.io/zone\": \"<availability zone>\",\n            \"jobsubmitter.node.selector.node.kubernetes.io/instance-type\":\"m5.4xlarge\"\n        }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n      }\n      }\n    }\n  }\n}\n

    Configurations of interest:

    jobsubmitter.node.selector.[labelKey]: Adds to the node selector of the job submitter pod, with key labelKey and the value as the configuration's value. For example, setting jobsubmitter.node.selector.identifier to myIdentifier will result in the job-runner pod having a node selector with key identifier and value myIdentifier. Multiple node selector keys can be added by setting multiple configurations with this prefix.
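
    In a StartJobRun request, the same example (illustrative label key identifier and value myIdentifier) corresponds to an emr-job-submitter classification entry such as:

    {
      "classification": "emr-job-submitter",
      "properties": {
        "jobsubmitter.node.selector.identifier": "myIdentifier"
      }
    }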

    "},{"location":"node-placement/docs/fargate-node-placement/","title":"EKS Fargate Node Placement","text":""},{"location":"node-placement/docs/fargate-node-placement/#fargate-node-placement","title":"Fargate Node Placement","text":"

    AWS Fargate is a technology that provides on-demand, right-sized compute capacity for containers. With AWS Fargate, you don't have to provision, configure, or scale groups of EC2 instances on your own to run containers. You also don't need to choose server types, decide when to scale your node groups, or optimize cluster packing. Instead you can control which pods start on Fargate and how they run with Fargate profiles.

    "},{"location":"node-placement/docs/fargate-node-placement/#aws-fargate-profile","title":"AWS Fargate profile","text":"

    Before you can schedule pods on Fargate in your cluster, you must define at least one Fargate profile that specifies which pods use Fargate when launched. You must define a namespace for every selector. The Fargate profile allows an administrator to declare which pods run on Fargate. This declaration is done through the profile\u2019s selectors. If a namespace selector is defined without any labels, Amazon EKS attempts to schedule all pods that run in that namespace onto Fargate using the profile.

    Create Fargate Profile

    Create your Fargate profile with the following eksctl command, replacing the <variable text> (including <>) with your own values. You're required to specify a namespace. The --labels option is not required to create your Fargate profile, but will be required if you want to run only the Spark executors on Fargate.

    eksctl create fargateprofile \\\n    --cluster <cluster_name> \\\n    --name <fargate_profile_name> \\\n    --namespace <virtual_cluster_mapped_namespace> \\\n    --labels spark-node-placement=fargate\n
    "},{"location":"node-placement/docs/fargate-node-placement/#1-place-entire-job-including-driver-pod-on-fargate","title":"1- Place entire job including driver pod on Fargate","text":"

    When both Driver and Executors use the same labels as the Fargate Selector, the entire job including the driver pod will run on Fargate.

    Request:

    cat >spark-python-in-s3-nodeselector.json << EOF\n{\n  \"name\": \"spark-python-in-s3-fargate-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.3.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=4  --conf spark.executor.memory=20G --conf spark.driver.memory=20G --conf spark.executor.cores=4\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n            \"spark.kubernetes.driver.label.spark-node-placement\": \"fargate\",\n            \"spark.kubernetes.executor.label.spark-node-placement\": \"fargate\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-nodeselector.json\n

    Observed Behavior: When the job starts, the driver pod and executor pods are scheduled only on Fargate, since both are labeled with spark-node-placement: fargate. This is useful when we want to run the entire job on Fargate nodes. The maximum vCPU available for the driver pod is 4 vCPU.

    "},{"location":"node-placement/docs/fargate-node-placement/#2-place-driver-pod-on-ec2-and-executor-pod-on-fargate","title":"2- Place driver pod on EC2 and executor pod on Fargate","text":"

    Remove the label from the driver pod to schedule the driver pod on EC2 instances. This is especially helpful when the driver pod needs more resources (i.e. more than 4 vCPU).

    Request:

    cat >spark-python-in-s3-nodeselector.json << EOF\n{\n  \"name\": \"spark-python-in-s3-fargate-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.3.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=6 --conf spark.executor.memory=20G --conf spark.driver.memory=30G --conf spark.executor.cores=4\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n            \"spark.kubernetes.executor.label.spark-node-placement\": \"fargate\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-nodeselector.json\n

    Observed Behavior: When the job starts, the driver pod schedules on an EC2 instance. EKS picks an instance from the first Node Group that has the matching resources available to the driver pod.

    "},{"location":"node-placement/docs/fargate-node-placement/#3-define-a-nodeselector-in-pod-templates","title":"3- Define a NodeSelector in Pod Templates","text":"

    Beginning with Amazon EMR versions 5.33.0 and 6.3.0, Amazon EMR on EKS supports Spark\u2019s pod template feature. Pod templates are specifications that determine how to run each pod. You can use pod template files to define driver or executor pod configurations that Spark configurations do not support. For example, Spark configurations do not support defining individual node selectors for the driver pod and the executor pods. Define a node selector only for the driver pod when you want to choose which pool of EC2 instances it should be scheduled on, and let the Fargate profile schedule the executor pods.

    Driver Pod Template

    apiVersion: v1\nkind: Pod\nspec:\n  volumes:\n    - name: source-data-volume\n      emptyDir: {}\n    - name: metrics-files-volume\n      emptyDir: {}\n  nodeSelector:\n    <ec2-instance-node-label-key>: <ec2-instance-node-label-value>\n  containers:\n  - name: spark-kubernetes-driver # This will be interpreted as Spark driver container\n

    Store the pod template file onto a S3 location:

    aws s3 cp /driver-pod-template.yaml s3://<your-bucket-name>/driver-pod-template.yaml

    Request

    cat >spark-python-in-s3-nodeselector.json << EOF\n{\n  \"name\": \"spark-python-in-s3-fargate-nodeselector\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.3.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.executor.memory=20G --conf spark.driver.memory=30G --conf spark.executor.cores=4\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n            \"spark.kubernetes.executor.label.spark-node-placement\": \"fargate\",\n            \"spark.kubernetes.driver.podTemplateFile\": \"s3://<your-bucket-name>/driver-pod-template.yaml\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-nodeselector.json\n

    Observed Behavior: The driver pod is scheduled on an EC2 instance that has enough capacity and a node label key/value matching the node selector.

    "},{"location":"outposts/emr-containers-on-outposts/","title":"Running EMR Containers on AWS Outposts","text":""},{"location":"outposts/emr-containers-on-outposts/#background","title":"Background","text":"

    You can now run Amazon EMR container jobs on EKS clusters that are running on AWS Outposts. AWS Outposts enables native AWS services, infrastructure, and operating models in on-premises facilities. In AWS Outposts environments, you can use the same AWS APIs, tools, and infrastructure that you use in the AWS Cloud. Amazon EKS nodes on AWS Outposts are ideal for low-latency workloads that need to be run in close proximity to on-premises data and applications. For more information, see the Amazon EKS on Outposts documentation page.

    This document provides the steps to set up EMR containers on AWS Outposts.

    "},{"location":"outposts/emr-containers-on-outposts/#key-considerations-and-recommendations","title":"Key Considerations and Recommendations","text":"
    • The EKS cluster on an Outpost must be created with self-managed node groups.
    • Use the AWS Management Console and AWS CloudFormation to create a self-managed node group in Outposts.
    • For EMR workloads, we recommend creating EKS clusters where all the worker nodes reside in the self-managed node group of Outposts.
    • The Kubernetes client in the Spark driver pod creates and monitors executor pods by communicating with the EKS managed Kubernetes API server residing in the parent AWS Region. For reliable monitoring of executor pods during a job run, we also recommend a reliable, low-latency link between the Outpost and the parent Region.
    • AWS Fargate is not available on Outposts.
    • For more information about the supported Regions, prerequisites and considerations for Amazon EKS on AWS Outposts, see the EKS on Outposts documentation page.
    "},{"location":"outposts/emr-containers-on-outposts/#infrastructure-setup","title":"Infrastructure Setup","text":""},{"location":"outposts/emr-containers-on-outposts/#setup-eks-on-outposts","title":"Setup EKS on Outposts","text":"

    Network Setup

    • Set up a VPC
    aws ec2 create-vpc \\\n--region <us-west-2> \\\n--cidr-block '<10.0.0.0/16>'\n

    In the output, take note of the VPC ID.

    {\n    \"Vpc\": {\n        \"VpcId\": \"vpc-123vpc\", \n        ...\n    }\n}\n
    • Create two subnets in the parent Region.
    aws ec2 create-subnet \\\n    --region '<us-west-2>' \\\n    --availability-zone-id '<usw2-az1>' \\\n    --vpc-id '<vpc-123vpc>' \\\n    --cidr-block '<10.0.1.0/24>'\n\naws ec2 create-subnet \\\n    --region '<us-west-2>' \\\n    --availability-zone-id '<usw2-az2>' \\\n    --vpc-id '<vpc-123vpc>' \\\n    --cidr-block '<10.0.2.0/24>'\n

    In the output, take note of the Subnet ID.

    {\n    \"Subnet\": {\n        \"SubnetId\": \"subnet-111\",\n        ...\n    }\n}\n{\n    \"Subnet\": {\n        \"SubnetId\": \"subnet-222\",\n        ...\n    }\n}\n
    • Create a subnet in the Outpost Availability Zone. (This step is different for Outposts)
    aws ec2 create-subnet \\\n    --region '<us-west-2>' \\\n    --availability-zone-id '<usw2-az1>' \\\n    --outpost-arn 'arn:aws:outposts:<us-west-2>:<123456789>:outpost/<op-123op>' \\\n    --vpc-id '<vpc-123vpc>' \\\n    --cidr-block '<10.0.3.0/24>'\n

    In the output, take note of the Subnet ID.

    {\n    \"Subnet\": {\n        \"SubnetId\": \"subnet-333outpost\",\n        \"OutpostArn\": \"...\"\n        ...\n    }\n}\n

    EKS Cluster Creation

    • Create an EKS cluster using the three subnet Ids created earlier.
    aws eks create-cluster \\\n    --region '<us-west-2>' \\\n    --name '<outposts-eks-cluster>' \\\n    --role-arn 'arn:aws:iam::<123456789>:role/<cluster-service-role>' \\\n    --resources-vpc-config  subnetIds='<subnet-111>,<subnet-222>,<subnet-333outpost>'\n
    • Wait until the cluster status becomes ACTIVE.
    aws eks describe-cluster \\\n    --region '<us-west-2>' \\\n    --name '<outposts-eks-cluster>'\n

    Note the values of resourcesVpcConfig.clusterSecurityGroupId and identity.oidc.issuer.

    {\n    \"cluster\": {\n        \"name\": \"outposts-eks-cluster\",\n        ...\n        \"resourcesVpcConfig\": {\n            \"clusterSecurityGroupId\": \"sg-123clustersg\",\n        },\n        \"identity\": {\n            \"oidc\": {\n                \"issuer\": \"https://oidc.eks.us-west-2.amazonaws.com/id/oidcid\"\n            }\n        },\n        \"status\": \"ACTIVE\",\n    }\n}\n
    • Add the Outposts nodes to the EKS Cluster.

    At this point, eksctl cannot be used to launch self-managed node groups in Outposts. Please follow the steps listed in the self-managed nodes documentation page. In order to use the CloudFormation template listed in the AWS Management Console tab, make note of the following values created in the earlier steps: * ClusterName: <outposts-eks-cluster> * ClusterControlPlaneSecurityGroup: <sg-123clustersg> * Subnets: <subnet-333outpost>

    Apply the aws-auth-cm config map listed on the documentation page to allow the nodes to join the cluster.
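
    For illustration, a minimal sketch of this step is shown below, assuming you have already downloaded aws-auth-cm.yaml from the self-managed nodes documentation page and the CloudFormation stack output includes the NodeInstanceRole ARN; the comments are hypothetical placeholders, not exact values.

    # edit aws-auth-cm.yaml and set rolearn to the NodeInstanceRole ARN from the CloudFormation stack output\nkubectl apply -f aws-auth-cm.yaml\n\n# watch the Outposts nodes until they reach the Ready state\nkubectl get nodes --watch\n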

    "},{"location":"outposts/emr-containers-on-outposts/#register-cluster-with-emr-containers","title":"Register cluster with EMR Containers","text":"

    Once the EKS cluster has been created and the nodes have been registered with the EKS control plane, take the following steps:

    • Enable cluster access for Amazon EMR on EKS.
    • Enable IAM Roles for Service Accounts (IRSA) on the EKS cluster.
    • Create a job execution role.
    • Update the trust policy of the job execution role.
    • Grant users access to Amazon EMR on EKS.
    • Register the Amazon EKS cluster with Amazon EMR (a sketch of this step is shown below this list).
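
    As a minimal sketch of the registration step, assuming the virtual cluster name, EKS cluster name, and namespace shown here are placeholders, the virtual cluster can be created with the emr-containers API:

    aws emr-containers create-virtual-cluster \\\n    --name '<outposts-emr-virtual-cluster>' \\\n    --container-provider '{\n        \"id\": \"<outposts-eks-cluster>\",\n        \"type\": \"EKS\",\n        \"info\": { \"eksInfo\": { \"namespace\": \"<namespace>\" } }\n    }'\n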
    "},{"location":"outposts/emr-containers-on-outposts/#conclusion","title":"Conclusion","text":"

    EMR on EKS on Outposts allows users to run big data jobs in close proximity to on-premises data and applications.

    "},{"location":"performance/docs/dra/","title":"Dynamic Resource Allocation","text":"

    Dynamic Resource Allocation (DRA) is available in Spark 3 (EMR 6.x) without the need for an external shuffle service. Spark on Kubernetes doesn't support an external shuffle service as of Spark 3.1, but DRA can be achieved by enabling shuffle tracking.

    Spark DRA without an external shuffle service: With DRA, the Spark driver spawns the initial number of executors and then scales the number up to the specified maximum number of executors to process the pending tasks. Idle executors are terminated when there are no pending tasks, the executor idle time exceeds the idle timeout (spark.dynamicAllocation.executorIdleTimeout), and the executor doesn't hold any cached or shuffle data.

    If the executor idle threshold is reached and the executor holds cached data, it also has to exceed the cached data idle timeout (spark.dynamicAllocation.cachedExecutorIdleTimeout); once it does, and provided the executor doesn't hold shuffle data, the idle executor is terminated.

    If the executor idle threshold is reached and the executor holds shuffle data, then without an external shuffle service the executor is never terminated while the job is running; such executors are terminated only when the job completes. This behavior is enforced by \"spark.dynamicAllocation.shuffleTracking.enabled\":\"true\" and \"spark.dynamicAllocation.enabled\":\"true\".

    If \"spark.dynamicAllocation.shuffleTracking.enabled\":\"false\"and \"spark.dynamicAllocation.enabled\":\"true\" then the spark application will error out since external shuffle service is not available.

    Request:

    cat >spark-python-in-s3-dra.json << EOF\n{\n  \"name\": \"spark-python-in-s3-dra\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"true\",\n          \"spark.dynamicAllocation.shuffleTracking.enabled\":\"true\",\n          \"spark.dynamicAllocation.minExecutors\":\"5\",\n          \"spark.dynamicAllocation.maxExecutors\":\"100\",\n          \"spark.dynamicAllocation.initialExecutors\":\"10\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n
    aws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-dra.json\n

    Observed Behavior: When the job starts, the driver pod is created and 10 executors are initially launched (\"spark.dynamicAllocation.initialExecutors\":\"10\"). The number of executors can then scale up to a maximum of 100 (\"spark.dynamicAllocation.maxExecutors\":\"100\"). Configurations to note:

    spark.dynamicAllocation.shuffleTracking.enabled - **Experimental**. Enables shuffle file tracking for executors, which allows dynamic allocation without the need for an external shuffle service. This option will try to keep alive executors that are storing shuffle data for active jobs.

    spark.dynamicAllocation.shuffleTracking.timeout - When shuffle tracking is enabled, controls the timeout for executors that are holding shuffle data. The default value means that Spark will rely on the shuffles being garbage collected to be able to release executors. If for some reason garbage collection is not cleaning up shuffles quickly enough, this option can be used to control when to time out executors even when they are storing shuffle data.

    "},{"location":"security/docs/spark/data-encryption/","title":"EMR Containers Spark - In transit and At Rest data encryption","text":""},{"location":"security/docs/spark/data-encryption/#encryption-at-rest","title":"Encryption at Rest","text":""},{"location":"security/docs/spark/data-encryption/#amazon-s3-client-side-encryption","title":"Amazon S3 Client-Side Encryption","text":"

    To use S3 client-side encryption, you will need to create a KMS key to encrypt and decrypt data. If you do not have a KMS key, please follow this guide - AWS KMS create keys. Also note that the job execution role needs access to this key; please see Add to Key policy for instructions on how to add these permissions.
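
    If you need to create a key, a minimal sketch is shown below; the alias name is a hypothetical example, and the key policy still needs to grant the job execution role access as described above.

    aws kms create-key --description \"Key for EMRFS S3 client-side encryption\"\n\n# optionally create a friendly alias for the new key, using the KeyId returned above\naws kms create-alias --alias-name alias/emr-eks-cse --target-key-id <KMS Key Id>\n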

    trip-count-encrypt-write.py:

    cat> trip-count-encrypt-write.py<<EOF\nimport sys\n\nfrom pyspark.sql import SparkSession\n\n\nif __name__ == \"__main__\":\n\n    spark = SparkSession\\\n        .builder\\\n        .appName(\"trip-count-join-fsx\")\\\n        .getOrCreate()\n\n    df = spark.read.parquet('s3://<s3 prefix>/trip-data.parquet')\n    print(\"Total trips: \" + str(df.count()))\n\n    df.write.parquet('s3://<s3 prefix>/write-encrypt-trip-data.parquet')\n    print(\"Encrypt - KMS - CSE write to S3 completed\")\n    spark.stop()\nEOF\n

    Request:

    cat > spark-python-in-s3-encrypt-cse-kms-write.json <<EOF\n{\n  \"name\": \"spark-python-in-s3-encrypt-cse-kms-write\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>trip-count-encrypt-write.py\", \n       \"sparkSubmitParameters\": \"--conf spark.executor.instances=10 --conf spark.driver.cores=2  --conf spark.executor.memory=20G --conf spark.driver.memory=20G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n       },\n       {\n         \"classification\": \"emrfs-site\", \n         \"properties\": {\n          \"fs.s3.cse.enabled\":\"true\",\n          \"fs.s3.cse.encryptionMaterialsProvider\":\"com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider\",\n          \"fs.s3.cse.kms.keyId\":\"<KMS Key Id>\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"persistentAppUI\": \"ENABLED\", \n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-encrypt-cse-kms-write.json\n

    In the above request, EMRFS encrypts the parquet file with the specified KMS key and the encrypted object is persisted to the specified S3 location.

    To verify the encryption, read the data back using the same KMS key. The KMS key used is a symmetric key, so the same key can be used to both encrypt and decrypt.

    trip-count-encrypt-read.py

    cat > trip-count-encrypt-read.py<<EOF\nimport sys\n\nfrom pyspark.sql import SparkSession\n\n\nif __name__ == \"__main__\":\n\n    spark = SparkSession\\\n        .builder\\\n        .appName(\"trip-count-join-fsx\")\\\n        .getOrCreate()\n\n    df = spark.read.parquet('s3://<s3 prefix>/trip-data.parquet')\n    print(\"Total trips: \" + str(df.count()))\n\n    df_encrypt = spark.read.parquet('s3://<s3 prefix>/write-encrypt-trip-data.parquet')\n    print(\"Encrypt data - Total trips: \" + str(df_encrypt.count()))\n    spark.stop()\nEOF\n

    Request

    cat > spark-python-in-s3-encrypt-cse-kms-read.json<<EOF\n{\n  \"name\": \"spark-python-in-s3-encrypt-cse-kms-read\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>trip-count-encrypt-read.py\", \n       \"sparkSubmitParameters\": \"--conf spark.executor.instances=10 --conf spark.driver.cores=2  --conf spark.executor.memory=20G --conf spark.driver.memory=20G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n       },\n       {\n         \"classification\": \"emrfs-site\", \n         \"properties\": {\n          \"fs.s3.cse.enabled\":\"true\",\n          \"fs.s3.cse.encryptionMaterialsProvider\":\"com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider\",\n          \"fs.s3.cse.kms.keyId\":\"<KMS Key Id>\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"persistentAppUI\": \"ENABLED\", \n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-encrypt-cse-kms-read.json\n

    Validate encryption: Try to read the encrypted data without specifying \"fs.s3.cse.enabled\":\"true\" - you will get an error message in the driver and executor logs because the content is encrypted and cannot be read without decryption.

    "},{"location":"security/docs/spark/encryption/","title":"EMR on EKS - Encryption Best Practices","text":"

    This document describes how to think about security and its best practices as they apply to the EMR on EKS service. We will cover topics related to encryption at rest and in transit when you run EMR on EKS jobs on an EKS cluster.

    It's important to understand the shared responsibility model when using managed services such as EMR on EKS in order to improve the overall security posture of your environment. Generally speaking, AWS is responsible for security \"of\" the cloud, whereas you, the customer, are responsible for security \"in\" the cloud. The diagram below depicts this high-level definition.

    "},{"location":"security/docs/spark/encryption/#shared-responsibility-model","title":"Shared responsibility model","text":"

    EMR on EKS provides a simple way to run Spark jobs on top of EKS clusters. The architecture itself is loosely coupled and abstracted from customers so that they can run a secure environment for Spark applications. Because EMR on EKS uses a combination of two services (EMR and EKS) at a minimum, we will cover how EKS provides the infrastructure components consumed by the EMR Spark workload and how to handle encryption for each service.

    AWS assumes different levels of responsibility depending on the features consumed by EMR on EKS customers. At the time of writing, the features from EKS are managed node groups, self-managed workers, and Fargate. We won\u2019t go in depth on these architectures as they are detailed in the EKS best practices guide (https://aws.github.io/aws-eks-best-practices/security/docs/). The diagrams below depict how this responsibility shifts between the customer and AWS based on the consumed features.

    "},{"location":"security/docs/spark/encryption/#encryption-for-data-in-transit","title":"Encryption for data in-transit","text":"

    In this section, we will cover encryption for data in-transit. We will highlight AWS platform capabilities from the physical layer and then review how AWS handles encryption in the EMR on EKS architecture layer. Lastly, we will cover how customers can enable encryption between spark drivers and executors.

    "},{"location":"security/docs/spark/encryption/#aws-infrastructure-physical-layer","title":"AWS Infrastructure - Physical layer","text":"

    AWS provides secure and private connectivity between EC2 instances of all types. All data flowing across AWS Regions over the AWS global network is automatically encrypted at the physical layer before it leaves AWS secured facilities. All traffic between AZs is encrypted. All cross-Region traffic that uses Amazon VPC and Transit Gateway peering is automatically bulk-encrypted when it exits a Region. In addition, if you use Nitro family of instances, all traffic between instances is encrypted in-transit using AEAD algorithms with 256-bit encryption. We highly recommend reviewing EC2 documentation for more information.

    "},{"location":"security/docs/spark/encryption/#amazon-emr-on-eks","title":"Amazon EMR on EKS","text":"

    The diagram below depicts the high-level architecture of EMR on EKS. In this section, we cover encryption in transit for communication between the managed services, EMR and EKS. All traffic to the AWS APIs that back EMR and EKS is encrypted by default. EKS exposes the Kubernetes API server through an HTTPS endpoint. Both the kubelet that runs on EKS worker nodes and Kubernetes clients such as kubectl interact with the EKS cluster API using TLS. Amazon EMR on EKS uses the same secure channel to interact with the EKS cluster API to run Spark jobs on worker nodes. In addition, EMR on EKS provides an encrypted endpoint for accessing the Spark history server.

    Spark offers AES-based encryption for RPC connections. EMR on EKS customers may choose to encrypt the traffic between Spark drivers and executors using this encryption mechanism. In order to enable encryption, RPC authentication must also be enabled in your Spark configuration.

    --conf spark.authenticate=true \\\n--conf spark.network.crypto.enabled=true \\\n

    The encryption key is generated by the driver and distributed to executors via environment variables. Because these environment variables can be accessed by users who have access to the Kubernetes API (kubectl), we recommend securing access so that only authorized users have access to your environment. You should also configure proper Kubernetes RBAC permissions so that only authorized service accounts can use these variables.
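
    As a quick, hedged check of who can read the driver and executor pod specs (and therefore these environment variables), you can use kubectl's built-in authorization review; the namespace, service account, and user names below are placeholders.

    kubectl auth can-i get pods -n <namespace> --as=system:serviceaccount:<namespace>:<service-account>\nkubectl auth can-i list pods -n <namespace> --as=<iam-mapped-user>\n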

    "},{"location":"security/docs/spark/encryption/#encryption-for-data-at-rest","title":"Encryption for data at-rest","text":"

    In this section, we will cover encryption for data at-rest. We will review how to enable storage-level encryption so that it is transparent for spark application to use this data securely. We will also see how to enable encryption from spark application while using AWS native storage options.

    "},{"location":"security/docs/spark/encryption/#amazon-s3","title":"Amazon S3","text":"

    Amazon S3 offers server-side encryption for encrypting all data that is stored in an S3 bucket. You can enable default encryption using either S3 managed keys (SSE-S3) or KMS managed keys (SSE-KMS). Amazon S3 will encrypt all data before storing it on disks based on the keys specified. We recommend using server-side encryption at a minimum so that your data at-rest is encrypted. Please review Amazon S3 documentation and use the mechanisms that apply to your encryption standards and acceptable performance.
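
    As a minimal sketch, default bucket encryption with SSE-KMS can be enabled with a call like the one below, assuming the bucket name and KMS key ARN placeholders are replaced with your own values.

    aws s3api put-bucket-encryption \\\n    --bucket <bucket-name> \\\n    --server-side-encryption-configuration '{\"Rules\": [{\"ApplyServerSideEncryptionByDefault\": {\"SSEAlgorithm\": \"aws:kms\", \"KMSMasterKeyID\": \"<kms-key-arn>\"}}]}'\n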

    Amazon S3 supports client-side encryption as well. Using this approach, you can let the Spark application encrypt all data with the desired KMS keys and upload this data to S3 buckets. The example below shows a Spark application reading and writing parquet data in S3. During job submission, we use the EMRFS encryption mechanism to encrypt all data with a KMS key before writing it to the desired S3 location.

    import sys\n\nfrom pyspark.sql import SparkSession\n\n\nif __name__ == \"__main__\":\n\n    spark = SparkSession\\\n        .builder\\\n        .appName(\"trip-count-join-fsx\")\\\n        .getOrCreate()\n\n    df = spark.read.parquet('s3://<s3 prefix>/trip-data.parquet')\n    print(\"Total trips: \" + str(df.count()))\n\n    df.write.parquet('s3://<s3 prefix>/write-encrypt-trip-data.parquet')\n    print(\"Encrypt - KMS - CSE write to S3 completed\")\n    spark.stop()\n

    Below is the job submission request that shows the KMS specification needed for EMRFS to perform this encryption. For a complete end-to-end example, please see the EMR on EKS best practices documentation.

    cat > spark-python-in-s3-encrypt-cse-kms-write.json <<EOF\n{\n  \"name\": \"spark-python-in-s3-encrypt-cse-kms-write\",\n  \"virtualClusterId\": \"<virtual-cluster-id>\",\n  \"executionRoleArn\": \"<execution-role-arn>\",\n  \"releaseLabel\": \"emr-6.2.0-latest\",\n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>trip-count-encrypt-write.py\",\n       \"sparkSubmitParameters\": \"--conf spark.executor.instances=10 --conf spark.driver.cores=2  --conf spark.executor.memory=20G --conf spark.driver.memory=20G --conf spark.executor.cores=2\"\n    }\n  },\n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\",\n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n       },\n       {\n         \"classification\": \"emrfs-site\",\n         \"properties\": {\n          \"fs.s3.cse.enabled\":\"true\",\n          \"fs.s3.cse.encryptionMaterialsProvider\":\"com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider\",\n          \"fs.s3.cse.kms.keyId\":\"<KMS Key Id>\"\n         }\n      }\n    ],\n    \"monitoringConfiguration\": {\n      \"persistentAppUI\": \"ENABLED\",\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\",\n        \"logStreamNamePrefix\": \"demo\"\n      },\n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-encrypt-cse-kms-write.json\n

    Amazon EKS offers three different storage options (EBS, EFS, FSx) that can be directly consumed by pods. Each storage option provides an encryption mechanism that can be enabled at the storage level.

    "},{"location":"security/docs/spark/encryption/#amazon-ebs","title":"Amazon EBS","text":"

    Amazon EBS supports default encryption that can be turned on on a per-Region basis. Once it's turned on, newly created EBS volumes and snapshots are encrypted using AWS managed KMS keys. Please review the EBS documentation to learn more about how to enable this feature.
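
    A minimal sketch of enabling this per-Region setting from the CLI is shown below; the optional second call sets your own KMS key as the default instead of the AWS managed key.

    aws ec2 enable-ebs-encryption-by-default --region <region>\n\n# optionally use your own KMS key for default EBS encryption\naws ec2 modify-ebs-default-kms-key-id --region <region> --kms-key-id <kms-key-arn>\n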

    You can use the Kubernetes (k8s) in-tree storage driver or the EBS CSI driver to consume EBS volumes within your pods. Both choices offer options to enable encryption. In the example below, we use the k8s in-tree storage driver to create a storage class and a persistent volume claim. You can create similar resources using the EBS CSI driver as well.

    apiVersion: storage.k8s.io/v1\nkind: StorageClass\nmetadata:\n  name: encrypted-sc\nprovisioner: kubernetes.io/aws-ebs\nvolumeBindingMode: WaitForFirstConsumer\nparameters:\n  type: gp2\n  fsType: ext4\n  encrypted: \"true\"\n---\napiVersion: v1\nkind: PersistentVolumeClaim\nmetadata:\n  name: spark-driver-pvc\nspec:\n  storageClassName: encrypted-sc\n  accessModes:\n    - ReadWriteOnce\n  resources:\n    requests:\n      storage: 10Gi\n

    Once these resources are created, you can specify them in your drivers and executors. You can see an example of this specification below. Keep in mind that an EBS volume can only be attached to a single EC2 instance or Kubernetes pod. Therefore, if you have multiple executor pods, you need to create multiple PVCs to fulfill this request.

    --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=spark-driver-pvc\n--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.readOnly=false\n--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data\n...\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=spark-executor-pvc\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data\n

    Another approach is to let k8s create EBS volumes dynamically based on your Spark workload. You can do so by specifying just the storageClass and sizeLimit options and setting the persistent volume claim (PVC) name to OnDemand. This is useful in the case of Dynamic Resource Allocation. Please be sure to use EMR release 6.3.0 or above for this feature, because dynamic PVC support was added in Spark 3.1. Below is an example of dynamically creating volumes for executors within your job.

    --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=spark-driver-pvc\n--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.readOnly=false\n--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=encrypted-sc\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=10Gi\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.options.claimName=OnDemand\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.options.storageClass=encrypted-sc\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.options.sizeLimit=10Gi\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.mount.path=/var/data/spill\n--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.mount.readOnly=false\n

    For a complete list of available options, please refer to the Spark documentation.

    "},{"location":"security/docs/spark/encryption/#amazon-efs","title":"Amazon EFS","text":"

    Similar to EBS, you can consume EFS volumes via the EFS CSI driver and FSx for Lustre volumes via the FSx CSI driver. There are two provisioning methods for these storage volumes before they are consumed by workloads, namely static provisioning and dynamic provisioning. For static provisioning, you have to pre-create volumes using AWS APIs, the CLI, or the AWS console. For dynamic provisioning, the volume is created dynamically by the CSI drivers as workloads are deployed onto the Kubernetes cluster. Currently, the EFS CSI driver doesn\u2019t support dynamic volume provisioning. However, you can create the volume using the EFS API or the AWS console before creating a persistent volume (PV) that can be used within your Spark application. If you plan to encrypt the data stored in EFS, you need to specify encryption during volume creation. For further information about EFS file encryption, please refer to Encrypting Data at Rest. One of the advantages of using EFS is that it provides encryption-in-transit support using TLS, and it's enabled by default by the CSI driver. You can see the example below if you need to enforce TLS encryption during PV creation.

    apiVersion: v1\nkind: PersistentVolume\nmetadata:\n  name: efs-pv\nspec:\n  capacity:\n    storage: 5Gi\n  volumeMode: Filesystem\n  accessModes:\n    - ReadWriteOnce\n  persistentVolumeReclaimPolicy: Retain\n  storageClassName: efs-sc\n  csi:\n    driver: efs.csi.aws.com\n    volumeHandle: fs-4af69aab\n    volumeAttributes:\n      encryptInTransit: \"true\"\n
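
    For completeness, a hedged sketch of pre-creating the encrypted EFS file system (the volumeHandle referenced by the PV above) from the CLI could look like the following; the creation token, tag, and other placeholders are illustrative only.

    aws efs create-file-system \\\n    --region <region> \\\n    --creation-token emr-eks-efs \\\n    --encrypted \\\n    --kms-key-id <kms-key-arn> \\\n    --performance-mode generalPurpose \\\n    --tags Key=Name,Value=emr-eks-efs\n\n# create a mount target in each subnet used by your EKS worker nodes\naws efs create-mount-target --file-system-id <fs-id> --subnet-id <subnet-id> --security-groups <sg-id>\n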
    "},{"location":"security/docs/spark/encryption/#amazon-fsx-for-lustre","title":"Amazon FSx for Lustre","text":"

    The Amazon FSx CSI driver supports both static and dynamic provisioning. Encryption for data in transit is automatically enabled from Amazon EC2 instances that support encryption in transit. To learn which EC2 instances support encryption in transit, see Encryption in Transit in the Amazon EC2 User Guide for Linux Instances. Encryption for data at rest is automatically enabled when you create the FSx filesystem. Amazon FSx for Lustre supports two types of filesystems, namely persistent and scratch. You can use the default encryption method where encryption keys are managed by Amazon FSx. However, if you prefer to manage your own KMS keys, you can do so for persistent filesystems. The example below shows how to create a storage class for an FSx for Lustre persistent filesystem using your own KMS managed keys.

    kind: StorageClass\napiVersion: storage.k8s.io/v1\nmetadata:\n  name: fsx-sc\nprovisioner: fsx.csi.aws.com\nparameters:\n  subnetId: subnet-056da83524edbe641\n  securityGroupIds: sg-086f61ea73388fb6b\n  deploymentType: PERSISTENT_1\n  kmsKeyId: <kms_arn>\n

    You can then create a persistent volume claim (see an example in the FSx repo) and use it within your Spark application as below:

    --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=fsx-claim\n--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.readOnly=false\n--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data\n
    "},{"location":"security/docs/spark/encryption/#using-spark-to-encrypt-data","title":"Using Spark to encrypt data","text":"

    Apache Spark supports encrypting temporary data that is stored on storage volumes. These volumes can be instance storage such as NVMe SSD volumes, EBS, EFS, or FSx volumes. Temporary data can be shuffle files, shuffle spills, and data blocks stored on disk (for both caching and broadcast variables). It's important to note that data on NVMe instance storage is encrypted using an XTS-AES-256 block cipher implemented in a hardware module on the instance. Even though instance storage is available, you still need to format and mount the volumes while you bootstrap the EC2 instances. Below is an example that shows how to use instance storage with eksctl.

    managedNodeGroups:\n- name: nvme\n  minSize: 2\n  desiredCapacity: 2\n  maxSize: 10\n  instanceType: r5d.4xlarge\n  ssh:\n    enableSsm: true\n  preBootstrapCommands:\n    - IDX=1\n    - for DEV in /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_*-ns-1; do  mkfs.xfs ${DEV};mkdir -p /local${IDX};echo ${DEV} /local${IDX} xfs defaults,noatime 1 2 >> /etc/fstab; IDX=$((${IDX} + 1)); done\n    - mount -a\n

    If you use non-NVMe SSD volumes, you can follow the best practice of encrypting shuffle data before writing it to disk, as shown in the example below. For more information about the type of instance store volume supported by each instance type, see Instance store volumes.

    --conf spark.io.encryption.enabled=true\n
    "},{"location":"security/docs/spark/encryption/#conclusion","title":"Conclusion","text":"

    In this document, we covered the shared responsibility model for running EMR on EKS workloads. We then reviewed the platform capabilities available through AWS infrastructure and how to enable encryption both at the storage level and from the Spark application. To quote Werner Vogels, AWS CTO: \u201cSecurity is everyone\u2019s job now, not just the security team\u2019s\u201d. We hope this document provides prescriptive guidance on how to enable encryption for running secure EMR on EKS workloads.

    "},{"location":"security/docs/spark/network-security/","title":"** Managing VPC for EMR on EKS**","text":"

    This section addresses network security at the VPC level. If you want to read more about network security for Spark in EMR on EKS, please refer to this section.

    "},{"location":"security/docs/spark/network-security/#security-group","title":"Security Group","text":"

    The applications running on your EMR on EKS cluster often need access to services that are running outside the cluster; for example, these can be Amazon Redshift, Amazon Relational Database Service, or a service self-hosted on an EC2 instance. To access these resources you need to allow network traffic at the security group level. The default mechanism in EKS is to use security groups at the node level, which means all the pods running on the node inherit the rules of the security group. For security-conscious customers this is not a desired behavior, and you would want to use security groups at the pod level.

    This section addresses how you can use Security Groups with EMR on EKS.

    "},{"location":"security/docs/spark/network-security/#configure-eks-cluster-to-use-security-groups-for-pods","title":"Configure EKS Cluster to use Security Groups for Pods","text":"

    In order to use Security Groups at the pod level, you need to configure the VPC CNI for EKS. The following link guides you through the prerequisites as well as configuring the EKS cluster.
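
    As a brief sketch, and assuming the prerequisites from the linked guide are in place (a supported VPC CNI version and the AmazonEKSVPCResourceController managed policy attached to the cluster role), pod ENIs can be enabled and verified as follows:

    # enable pod ENIs on the VPC CNI\nkubectl set env daemonset aws-node -n kube-system ENABLE_POD_ENI=true\n\n# verify that nodes have a trunk ENI attached\nkubectl get nodes -o wide -l vpc.amazonaws.com/has-trunk-attached=true\n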

    "},{"location":"security/docs/spark/network-security/#define-securitygrouppolicy","title":"Define SecurityGroupPolicy","text":"

    Once you have configured the VPC CNI, you need to create a SecurityGroupPolicy object. This object defines which security groups (up to 5) to use, a podSelector that defines which pods the security groups apply to, and the namespace in which the SecurityGroupPolicy should be evaluated. Below you find an example of a SecurityGroupPolicy.

    apiVersion: vpcresources.k8s.aws/v1beta1\nkind: SecurityGroupPolicy\nmetadata:\n  name: <>\n  namespace: <NAMESPACE FOR VC>\nspec:\n  podSelector: \n    matchLabels:\n      role: spark\n  securityGroups:\n    groupIds:\n      - sg-xxxxx\n
    "},{"location":"security/docs/spark/network-security/#define-pod-template-to-use-security-group-for-pod","title":"Define pod template to use Security Group for pod","text":"

    In order for the security group to be applied to the Spark driver and executors, you need to provide a pod template which adds label(s) to the pods. The labels should match the ones defined above in the podSelector; in our example it is role: spark. The snippet below defines the pod template that you can upload to S3 and then reference when launching your job.

    apiVersion: v1\nkind: Pod\nmetadata:\n  labels:\n    role: spark\n
    "},{"location":"security/docs/spark/network-security/#launch-a-job","title":"Launch a job","text":"

    The command below can be used to run a job.

        aws emr-containers start-job-run --virtual-cluster-id <EMR-VIRTUAL-CLUSTER-ID> --name spark-jdbc --execution-role-arn <EXECUTION-ROLE-ARN> --release-label emr-6.7.0-latest --job-driver '{\n    \"sparkSubmitJobDriver\": {\n    \"entryPoint\": \"<S3-URI-FOR-PYSPARK-JOB-DEFINED-ABOVE>\",\n    \"sparkSubmitParameters\": \"--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1\"\n    }\n    }' --configuration-overrides '{\n    \"applicationConfiguration\": [\n    {\n    \"classification\": \"spark-defaults\", \n    \"properties\": {\n    \"spark.hadoop.hive.metastore.client.factory.class\": \"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory\",\n    \"spark.sql.catalogImplementation\": \"hive\",\n    \"spark.dynamicAllocation.enabled\":\"true\",\n    \"spark.dynamicAllocation.minExecutors\": \"8\",\n    \"spark.dynamicAllocation.maxExecutors\": \"40\",\n    \"spark.kubernetes.allocation.batch.size\": \"8\",\n    \"spark.dynamicAllocation.executorAllocationRatio\": \"1\",\n    \"spark.dynamicAllocation.shuffleTracking.enabled\": \"true\",\n    \"spark.dynamicAllocation.shuffleTracking.timeout\": \"300s\",\n    \"spark.kubernetes.driver.podTemplateFile\":<S3-URI-TO-DRIVER-POD-TEMPLATE>,\n    \"spark.kubernetes.executor.podTemplateFile\":<S3-URI-TO-EXECUTOR-POD-TEMPLATE>\n    }\n    }\n    ],\n    \"monitoringConfiguration\": {\n        \"persistentAppUI\": \"ENABLED\",\n        \"cloudWatchMonitoringConfiguration\": {\n            \"logGroupName\": \"/aws/emr-containers/\",\n            \"logStreamNamePrefix\": \"default\"\n        }\n    }\n    }'\n
    "},{"location":"security/docs/spark/network-security/#verify-a-security-group-attached-to-the-pod-eni","title":"Verify a security group attached to the Pod ENI","text":"

    To verify that the Spark driver and executor pods have the security group attached, run the first command to get the pod name, then the second one to see the pod annotation showing the ENI associated with the pod, which has the security group defined in the SecurityGroupPolicy.

    export POD_NAME=$(kubectl -n <NAMESPACE> get pods -l role=spark -o jsonpath='{.items[].metadata.name}')\n\nkubectl -n <NAMESPACE>  describe pod $POD_NAME | head -11\n
    Annotations:  kubernetes.io/psp: eks.privileged\n              vpc.amazonaws.com/pod-eni:\n                [{\"eniId\":\"eni-xxxxxxx\",\"ifAddress\":\"xx:xx:xx:xx:xx:xx\",\"privateIp\":\"x.x.x.x\",\"vlanId\":1,\"subnetCidr\":\"x.x.x.x/x\"}]\n
    "},{"location":"security/docs/spark/secrets/","title":"** Using Secrets in EMR on EKS**","text":"

    Secrets can be credentials to APIs, databases, or other resources. There are various ways these secrets can be passed to your containers; some of them are pod environment variables or Kubernetes Secrets. These methods are not secure: with environment variables, secrets are stored in clear text and any user who has access to the Kubernetes cluster with admin privileges can read those secrets. Storing secrets using Kubernetes Secrets is also not secure because they are not encrypted, only base64 encoded.

    There is a secure method to expose these secrets in EKS through the Secrets Store CSI Driver.

    The Secrets Store CSI Driver integrates with a secret store like AWS Secrets Manager and mounts the secrets as a volume that can be accessed by your application code. This document describes how to set up and use AWS Secrets Manager with EMR on EKS through the Secrets Store CSI Driver.
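
    For reference, a minimal sketch of creating such a secret in AWS Secrets Manager is shown below; the secret name db-creds and the username/password keys match the examples later in this section, while the values are placeholders.

    aws secretsmanager create-secret \\\n    --name db-creds \\\n    --secret-string '{\"username\":\"<db-user>\",\"password\":\"<db-password>\"}'\n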

    "},{"location":"security/docs/spark/secrets/#deploy-secrets-store-csi-drivers-and-aws-secrets-and-configuration-provider","title":"Deploy Secrets Store CSI Drivers and AWS Secrets and Configuration Provider","text":""},{"location":"security/docs/spark/secrets/#secrets-store-csi-drivers","title":"Secrets Store CSI Drivers","text":"

    Configure EKS Cluster with Secrets Store CSI Driver.

    To learn more about the AWS Secrets Manager CSI Driver, you can refer to this link.

    helm repo add secrets-store-csi-driver \\\n  https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts\n\nhelm install -n kube-system csi-secrets-store \\\n  --set syncSecret.enabled=true \\\n  --set enableSecretRotation=true \\\n  secrets-store-csi-driver/secrets-store-csi-driver\n

    Deploy the AWS Secrets and Configuration Provider to use AWS Secrets Manager

    "},{"location":"security/docs/spark/secrets/#aws-secrets-and-configuration-provider","title":"AWS Secrets and Configuration Provider","text":"
    kubectl apply -f https://raw.githubusercontent.com/aws/secrets-store-csi-driver-provider-aws/main/deployment/aws-provider-installer.yaml\n
    "},{"location":"security/docs/spark/secrets/#define-the-secretproviderclass","title":"Define the SecretProviderClass","text":"

    The SecretProviderClass is how you present your secret in Kubernetes. Below you find a definition of a SecretProviderClass. There are a few parameters that are important:

    • The provider must be set to aws.
    • The objectName must be the name of the secret you want to use as defined in AWS. Here the secret is called db-creds.
    • The objectType must be set to secretsmanager.
    cat > db-cred.yaml << EOF\n\napiVersion: secrets-store.csi.x-k8s.io/v1\nkind: SecretProviderClass\nmetadata:\n  name: mysql-spark-secret\nspec:\n  provider: aws\n  parameters:\n    objects: |\n        - objectName: \"db-creds\"\n          objectType: \"secretsmanager\"\nEOF\n
    kubectl apply -f db-cred.yaml -n <NAMESPACE>\n

    In the terminal, apply the above command to create the SecretProviderClass. The kubectl command must include the namespace where your job will be executed.

    "},{"location":"security/docs/spark/secrets/#pod-template","title":"Pod Template","text":"

    In the executor pod template you should define the volume as follows to mount the secret. The example below shows how you can define it. There are a few points that are important for mounting the secret:

    • secretProviderClass: this should have the same name as the one defined above. In this case it is mysql-spark-secret.
    • mountPath: This is where the secret is going to be available to the pod. In this example it will be in /var/secrets. When defining the mountPath, make sure you do not specify the paths reserved by EMR on EKS as defined here.
    apiVersion: v1\nkind: Pod\n\nspec:\n  containers:\n    - name: spark-kubernetes-executors\n      volumeMounts:\n        - mountPath: \"/var/secrets\"\n          name: mysql-cred\n          readOnly: true\n  volumes:\n      - name: mysql-cred\n        csi:\n          driver: secrets-store.csi.k8s.io\n          readOnly: true\n          volumeAttributes:\n            secretProviderClass: mysql-spark-secret\n

    This pod template must be uploaded to S3 and referenced in the job submission command as shown below.

    Note: You must make sure that the RDS instance or your database allows traffic from the instances where your driver and executor pods are running.

    "},{"location":"security/docs/spark/secrets/#pyspark-code","title":"PySpark code","text":"

    The example below shows PySpark code for connecting to a MySQL DB. The example assumes the secret is stored in AWS Secrets Manager as defined above. The username is the key to retrieve the database user as stored in AWS Secrets Manager, and password is the key to retrieve the database password.

    It shows how you can retrieve the credentials from the mount point /var/secrets/. The secret is stored in a file with the same name as defined in AWS; in this case it is db-creds. This has been set in the pod template above.

    from pyspark.sql import SparkSession\nimport json\n\nsecret_path = \"/var/secrets/db-creds\"\n\nf = open(secret_path, \"r\")\nmySecretDict = json.loads(f.read())\n\nspark = SparkSession.builder.getOrCreate()\n\nstr_jdbc_url=\"jdbc:<DB endpoint>\"\nstr_Query= <QUERY>\nstr_username=mySecretDict['username']\nstr_password=mySecretDict['password']\ndriver = \"com.mysql.jdbc.Driver\"\n\njdbcDF = spark.read \\\n    .format(\"jdbc\") \\\n    .option(\"url\", str_jdbc_url) \\\n    .option(\"driver\", driver)\\\n    .option(\"query\", str_Query) \\\n    .option(\"user\", str_username) \\\n    .option(\"password\", str_password) \\\n    .load()\n\njdbcDF.show()\n
    "},{"location":"security/docs/spark/secrets/#execute-the-job","title":"Execute the job","text":"

    The command below can be used to run a job.

    Note: The supplied execution role MUST have an IAM policy attached that allows it to access the secret defined in the SecretProviderClass above. The IAM policy below shows the IAM actions that are needed.

    {\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [ {\n        \"Effect\": \"Allow\",\n        \"Action\": [\"secretsmanager:GetSecretValue\", \"secretsmanager:DescribeSecret\"],\n        \"Resource\": [<SECRET-ARN>]\n    }]\n}\n
        aws emr-containers start-job-run --virtual-cluster-id <EMR-VIRTUAL-CLUSTER-ID> --name spark-jdbc --execution-role-arn <EXECUTION-ROLE-ARN> --release-label emr-6.7.0-latest --job-driver '{\n    \"sparkSubmitJobDriver\": {\n    \"entryPoint\": \"<S3-URI-FOR-PYSPARK-JOB-DEFINED-ABOVE>\",\n    \"sparkSubmitParameters\": \"--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1 --conf spark.jars=<S3-URI-TO-MYSQL-JDBC-JAR>\"\n    }\n    }' --configuration-overrides '{\n    \"applicationConfiguration\": [\n    {\n    \"classification\": \"spark-defaults\", \n    \"properties\": {\n    \"spark.hadoop.hive.metastore.client.factory.class\": \"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory\",\n    \"spark.sql.catalogImplementation\": \"hive\",\n    \"spark.dynamicAllocation.enabled\":\"true\",\n    \"spark.dynamicAllocation.minExecutors\": \"8\",\n    \"spark.dynamicAllocation.maxExecutors\": \"40\",\n    \"spark.kubernetes.allocation.batch.size\": \"8\",\n    \"spark.dynamicAllocation.executorAllocationRatio\": \"1\",\n    \"spark.dynamicAllocation.shuffleTracking.enabled\": \"true\",\n    \"spark.dynamicAllocation.shuffleTracking.timeout\": \"300s\",\n    \"spark.kubernetes.driver.podTemplateFile\":<S3-URI-TO-DRIVER-POD-TEMPLATE>,\n    \"spark.kubernetes.executor.podTemplateFile\":<S3-URI-TO-EXECUTOR-POD-TEMPLATE>\n    }\n    }\n    ],\n    \"monitoringConfiguration\": {\n        \"persistentAppUI\": \"ENABLED\",\n        \"cloudWatchMonitoringConfiguration\": {\n            \"logGroupName\": \"/aws/emr-containers/\",\n            \"logStreamNamePrefix\": \"default\"\n        }\n    }\n    }'\n
    "},{"location":"storage/docs/spark/ebs/","title":"Mount EBS Volume to spark driver and executor pods","text":"

    Amazon EBS volumes can be mounted on Spark driver and executor pods through static and dynamic provisioning.

    Using dynamically created PVCs to mount EBS volumes per pod in a Spark application offers significant benefits in terms of performance, scalability, and ease of management. However, it also introduces complexity and potential cost if EBS create/attach/detach/delete operations are throttled when over 5000 EBS volumes are generated by Spark pods, so it needs to be carefully managed. It's important to weigh the pros and cons against your specific use case requirements and constraints to determine if this technique is suitable for your scale of Spark workloads.

    "},{"location":"storage/docs/spark/ebs/#pros","title":"Pros","text":"
    • Scalability: As Spark scales up and down automatically during a job execution, dynamic PVCs allow storage to scale seamlessly with the number of executor pods. This ensures that each new executor gets the necessary storage without manual intervention.
    • Optimized Storage Allocation: Dynamically provisioning PVCs allows you to allocate exactly the amount of storage needed for each Spark pod. This prevents over-provisioning and ensures efficient use of resources, potentially reducing storage costs.
    • Cost efficient: Only pay for the storage you actually use, which can be more cost-effective than pre-allocating large, static volumes.
    • High IO Performance: By giving each executor its own EBS volume, you avoid I/O contention among executors. This leads to more predictable and higher performance, especially for I/O-intensive tasks.
    • Data Locality: With each executor having its own volume, data is stored local to the executor pod, which can reduce data transfer latency.
    • Resilience to Spot Interruption: With the \"PVC Reuse\" feature offered by Spark, EBS volumes can persist shuffle data throughout a job's lifetime, even if a pod is terminated by a Spot interruption. Instead of creating new volumes, you re-attach the existing ones to new pods, which provides faster recovery from a node failure or interruption event. This improves your application's resilience while running Spot instances to reduce your compute cost.
    "},{"location":"storage/docs/spark/ebs/#cons","title":"Cons","text":"
    • Storage Costs: EBS volumes can be expensive, especially if more EBS volumes are provisioned than necessary, due to this bug or the EBS CSI controller's scalability issues.
    • Resource Utilization: Inefficient use of storage resources can occur if each pod is allocated a large EBS volume but only uses a fraction of it.
    • Attachment Latency & Limits: Frequently attaching and detaching EBS volumes can introduce latency and can hit the EBS attachment limit; for most instance types, only 26 extra volumes can be attached to a single Amazon EC2 instance.
    "},{"location":"storage/docs/spark/ebs/#prerequisite","title":"Prerequisite","text":"

    The Amazon EBS CSI driver must be installed on the EKS cluster. Use this command to check if the driver exists:

    kubectl get csidriver\n
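
    If the driver is not listed, one hedged way to install it is through the EKS managed add-on, assuming an IAM role for the driver's service account (IRSA) has already been created; the names below are placeholders.

    aws eks create-addon \\\n    --cluster-name <cluster-name> \\\n    --addon-name aws-ebs-csi-driver \\\n    --service-account-role-arn <ebs-csi-irsa-role-arn>\n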
    "},{"location":"storage/docs/spark/ebs/#static-provisioning","title":"Static Provisioning","text":""},{"location":"storage/docs/spark/ebs/#eks-admin-tasks","title":"EKS Admin Tasks","text":"

    First, create your EBS volumes:

    aws ec2 --region <region> create-volume --availability-zone <availability zone> --size 50\n{\n    \"AvailabilityZone\": \"<availability zone>\", \n    \"MultiAttachEnabled\": false, \n    \"Tags\": [], \n    \"Encrypted\": false, \n    \"VolumeType\": \"gp2\", \n    \"VolumeId\": \"<vol -id>\", \n    \"State\": \"creating\", \n    \"Iops\": 150, \n    \"SnapshotId\": \"\", \n    \"CreateTime\": \"2020-11-03T18:36:21.000Z\", \n    \"Size\": 50\n}\n

    Create a Persistent Volume (PV) that hardcodes the EBS volume created above:

    cat > ebs-static-pv.yaml << EOF\napiVersion: v1\nkind: PersistentVolume\nmetadata:\n  name: ebs-static-pv\nspec:\n  capacity:\n    storage: 5Gi\n  accessModes:\n    - ReadWriteOnce\n  storageClassName: gp2\n  awsElasticBlockStore:\n    fsType: ext4\n    volumeID: <vol -id>\nEOF\n\nkubectl apply -f ebs-static-pv.yaml -n <namespace>\n

    Create a Persistent Volume Claim (PVC) for the Persistent Volume created above:

    cat > ebs-static-pvc.yaml << EOF\nkind: PersistentVolumeClaim\napiVersion: v1\nmetadata:\n  name: ebs-static-pvc\nspec:\n  accessModes:\n    - ReadWriteOnce\n  resources:\n    requests:\n      storage: 5Gi\n  volumeName: ebs-static-pv\nEOF\n\nkubectl apply -f ebs-static-pvc.yaml -n <namespace>\n

    The PVC ebs-static-pvc can be used by the Spark developer to mount the volume to the Spark pods.

    NOTE: Pods running on EKS worker nodes can only attach EBS volumes provisioned in the same AZ as the worker node. Use node selectors to schedule pods on EKS worker nodes in the specified AZ, as sketched below.
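
    A minimal sketch of such a driver pod template is shown below, using the well-known topology.kubernetes.io/zone node label; the file and bucket names are placeholders.

    cat > driver-pod-template-az.yaml << EOF\napiVersion: v1\nkind: Pod\nspec:\n  nodeSelector:\n    topology.kubernetes.io/zone: <availability zone>  # same AZ as the EBS volume created above\n  containers:\n  - name: spark-kubernetes-driver\nEOF\n\naws s3 cp driver-pod-template-az.yaml s3://<your-bucket-name>/driver-pod-template-az.yaml\n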

    "},{"location":"storage/docs/spark/ebs/#spark-developer-tasks","title":"Spark Developer Tasks","text":"

    Request

    cat >spark-python-in-s3-ebs-static-localdir.json << EOF\n{\n  \"name\": \"spark-python-in-s3-ebs-static-localdir\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.15.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count-fsx.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5 --conf spark.executor.instances=10 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6 \"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-sparkspill.options.claimName\": \"ebs-static-pvc\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-sparkspill.mount.path\": \"/var/spark/spill/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-sparkspill.mount.readOnly\": \"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-ebs-static-localdir.json\n

    Observed Behavior: When the job starts, the pre-provisioned EBS volume is mounted to the driver pod. You can exec into the driver container to verify that the EBS volume is mounted. You can also verify the mount from the driver pod's spec.

    kubectl get pod <driver pod name> -n <namespace> -o yaml\n
    "},{"location":"storage/docs/spark/ebs/#dynamic-provisioning","title":"Dynamic Provisioning","text":"

    Dynamic provisioning of PVCs/volumes is supported for both the Spark driver and executors for EMR versions >= 6.3.0.

    "},{"location":"storage/docs/spark/ebs/#eks-admin-tasks_1","title":"EKS Admin Tasks","text":"

    Create a new \"gp3\" EBS Storage Class or use an existing one:

    cat >demo-gp3-sc.yaml << EOF\napiVersion: storage.k8s.io/v1\nkind: StorageClass\nmetadata:\n  name: demo-gp3-sc\nprovisioner: kubernetes.io/aws-ebs\nparameters:\n  type: gp3\nreclaimPolicy: Retain\nallowVolumeExpansion: true\nmountOptions:\n  - debug\nvolumeBindingMode: Immediate\nEOF\n\nkubectl apply -f demo-gp3-sc.yaml\n
    "},{"location":"storage/docs/spark/ebs/#spark-developer-tasks_1","title":"Spark Developer Tasks","text":"

    Request

    cat >spark-python-in-s3-ebs-dynamic-localdir.json << EOF\n{\n  \"name\": \"spark-python-in-s3-ebs-dynamic-localdir\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.15.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count-fsx.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5 --conf spark.executor.instances=10 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName\": \"OnDemand\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass\": \"demo-gp3-sc\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path\":\"/data\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly\": \"false\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit\": \"10Gi\",\n\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName\":\"OnDemand\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass\": \"demo-gp3-sc\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path\": \"/data\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly\": \"false\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit\": \"50Gi\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-ebs-dynamic-localdir.json\n

    Observed Behavior: When the job gets started, an EBS volume is provisioned dynamically by the EBS CSI driver and mounted to Spark's driver and executor pods. You can exec into the driver / executor container to verify that the EBS volume is mounted. Also, you can verify the mount from driver / executor pod spec.

    # verify the EBS volume is mounted or not\nkubectl exec <driver pod name> -n <namespace> -c spark-kubernetes-driver -- df -h\n\n# export the driver pod spec to a yaml file\nkubectl get pod <driver pod name> -n <namespace> -o yaml\n
    "},{"location":"storage/docs/spark/fsx-lustre/","title":"EMR Containers integration with FSx for Lustre","text":"

    Amazon EKS clusters provide the compute and ephemeral storage for Spark workloads. Ephemeral storage provided by EKS is allocated from the EKS worker node's disk storage and the lifecycle of the storage is bound by the lifecycle of the driver and executor pod.

    Need for durable storage: When multiple Spark applications are executed as part of a data pipeline, there are scenarios where data from one Spark application is passed to subsequent Spark applications - in this case data can be persisted in S3. Alternatively, this data can also be persisted in FSx for Lustre. FSx for Lustre provides a fully managed, scalable, POSIX-compliant native filesystem interface for data in S3. With FSx, your storage is decoupled from your compute and has its own lifecycle.

    FSx for Lustre Volumes can be mounted on spark driver and executor pods through static and dynamic provisioning.

    Data used in the below example is from AWS Open data Registry

    "},{"location":"storage/docs/spark/fsx-lustre/#fsx-for-lustre-posix-permissions","title":"FSx for Lustre POSIX permissions","text":"

    When a Lustre filesystem is mounted to the driver and executor pods, and the S3 objects do not have the required metadata, the mounted volume defaults ownership of the file system to root. EMR on EKS runs the driver and executor pods with UID 999, GID 1000, and groups 1000 and 65534. In this scenario, the Spark application has read-only access to the mounted Lustre file system. Below are a few approaches that can be considered:

    "},{"location":"storage/docs/spark/fsx-lustre/#tag-metadata-to-s3-object","title":"Tag Metadata to S3 object","text":"

    Applications writing to S3 can tag the S3 objects with the metadata that FSx for Lustre requires.

    Walkthrough: Attaching POSIX permissions when uploading objects into an S3 bucket provides a guided tutorial. FSx for Lustre will convert this tagged metadata to corresponding POSIX permissions when mounting Lustre file system to the driver and executor pods.

    EMR on EKS spawns the driver and executor pods as a non-root user (UID 999, GID 1000, groups 1000 and 65534). To enable the Spark application to write to the mounted file system, UID 999 can be made the file owner and the supplemental group 65534 the file group.
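
    As a sketch, assuming the metadata key names used in the walkthrough above (file-owner, file-group, file-permissions), an object could be uploaded with ownership matching those IDs; the local file, S3 prefix, and permission mode below are placeholders:

    aws s3 cp <local file> s3://<s3 prefix>/<object key> --metadata file-owner=999,file-group=65534,file-permissions=0100664\n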

    For S3 objects that already exist with no metadata tagging, there can be a process that recursively tags all the S3 objects with the required metadata. Below is an example:
    1. Create an FSx for Lustre file system for the S3 prefix.
    2. Create a Persistent Volume and Persistent Volume Claim for the created FSx for Lustre file system.
    3. Run a pod as the root user with FSx for Lustre mounted through the PVC created in Step 2.

    ```\napiVersion: v1\nkind: Pod\nmetadata:\n  name: chmod-fsx-pod\n  namespace: test-demo\nspec:\n  containers:\n  - name: ownership-change\n    image: amazonlinux:2\n    command: [\"sh\", \"-c\", \"chown -hR +999:+65534 /data\"]\n    volumeMounts:\n    - name: persistent-storage\n      mountPath: /data\n  volumes:\n  - name: persistent-storage\n    persistentVolumeClaim:\n      claimName: fsx-static-root-claim\n```\n

    Run a data repository task with the import path and export path pointing to the same S3 prefix. This exports the POSIX permissions from the FSx for Lustre file system as metadata tagged on the S3 objects, as sketched below.
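
    A hedged sketch of such an export task with the AWS CLI (the file system ID and path are placeholders; create-data-repository-task also requires a --report argument):

    aws fsx create-data-repository-task --file-system-id <filesystem id> --type EXPORT_TO_REPOSITORY --paths <path within the file system> --report Enabled=false\n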

    Now that the S3 objects are tagged with metadata, the spark application with FSx for Lustre filesystem mounted will have write access.

    "},{"location":"storage/docs/spark/fsx-lustre/#static-provisioning","title":"Static Provisioning","text":""},{"location":"storage/docs/spark/fsx-lustre/#provision-a-fsx-for-lustre-cluster","title":"Provision a FSx for Lustre cluster","text":"

    FSx for Lustre can also be provisioned through the AWS CLI.

    How to decide what type of FSx for Lustre file system you need? Create a security group to attach to the FSx for Lustre file system as below (a sketch follows). Points to note: the security group attached to the EKS worker nodes must allow inbound access on ports 988 and 1021-1023, and the security group specified when creating the FSx for Lustre filesystem must also allow inbound access on ports 988 and 1021-1023.
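
    A minimal sketch of creating that security group with the AWS CLI (the VPC ID, security group IDs, and group name are placeholders; the same ports must also be opened on the EKS worker node security group, as noted above):

    # security group for the FSx for Lustre file system\naws ec2 create-security-group --group-name emr-eks-fsx-lustre-sg --description \"FSx for Lustre SG\" --vpc-id <vpc id>\n\n# allow Lustre traffic (TCP 988 and 1021-1023) from the EKS worker node security group and from itself\naws ec2 authorize-security-group-ingress --group-id <securitygroup-id> --protocol tcp --port 988 --source-group <eks worker node securitygroup-id>\naws ec2 authorize-security-group-ingress --group-id <securitygroup-id> --protocol tcp --port 1021-1023 --source-group <eks worker node securitygroup-id>\naws ec2 authorize-security-group-ingress --group-id <securitygroup-id> --protocol tcp --port 988 --source-group <securitygroup-id>\naws ec2 authorize-security-group-ingress --group-id <securitygroup-id> --protocol tcp --port 1021-1023 --source-group <securitygroup-id>\n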

    FSx for Lustre provisioning through the AWS CLI:

    cat > fsxLustreConfig.json << EOF\n{\n    \"ClientRequestToken\": \"EMRContainers-fsxLustre-demo\", \n    \"FileSystemType\": \"LUSTRE\",\n    \"StorageCapacity\": 1200, \n    \"StorageType\": \"SSD\", \n    \"SubnetIds\": [\n        \"<subnet-id>\"\n    ], \n    \"SecurityGroupIds\": [\n        \"<securitygroup-id>\"\n    ], \n    \"LustreConfiguration\": {\n        \"ImportPath\": \"s3://<s3 prefix>/\", \n        \"ExportPath\": \"s3://<s3 prefix>/\", \n        \"DeploymentType\": \"PERSISTENT_1\", \n        \"AutoImportPolicy\": \"NEW_CHANGED\",\n        \"PerUnitStorageThroughput\": 200\n    }\n}\nEOF\n

    Run the aws-cli command to create the FSx for Lustre filesystem as below.

    aws fsx create-file-system --cli-input-json file:///fsxLustreConfig.json\n

    Response is as below

    {\n    \"FileSystem\": {\n        \"VpcId\": \"<vpc id>\", \n        \"Tags\": [], \n        \"StorageType\": \"SSD\", \n        \"SubnetIds\": [\n            \"<subnet-id>\"\n        ], \n        \"FileSystemType\": \"LUSTRE\", \n        \"CreationTime\": 1603752401.183, \n        \"ResourceARN\": \"<fsx resource arn>\", \n        \"StorageCapacity\": 1200, \n        \"LustreConfiguration\": {\n            \"CopyTagsToBackups\": false, \n            \"WeeklyMaintenanceStartTime\": \"7:11:30\", \n            \"DataRepositoryConfiguration\": {\n                \"ImportPath\": \"s3://<s3 prefix>\", \n                \"AutoImportPolicy\": \"NEW_CHANGED\", \n                \"ImportedFileChunkSize\": 1024, \n                \"Lifecycle\": \"CREATING\", \n                \"ExportPath\": \"s3://<s3 prefix>/\"\n            }, \n            \"DeploymentType\": \"PERSISTENT_1\", \n            \"PerUnitStorageThroughput\": 200, \n            \"MountName\": \"mvmxtbmv\"\n        }, \n        \"FileSystemId\": \"<filesystem id>\", \n        \"DNSName\": \"<filesystem id>.fsx.<region>.amazonaws.com\", \n        \"KmsKeyId\": \"arn:aws:kms:<region>:<account>:key/<key id>\", \n        \"OwnerId\": \"<account>\", \n        \"Lifecycle\": \"CREATING\"\n    }\n}\n
    "},{"location":"storage/docs/spark/fsx-lustre/#eks-admin-tasks","title":"EKS admin tasks","text":"
    1. Attach IAM policy to EKS worker node IAM role to enable access to FSx for Lustre - Mount FSx for Lustre on EKS and Create a Security Group for FSx for Lustre
    2. Install the FSx CSI Driver in EKS
    3. Configure Storage Class for FSx for Lustre
    4. Configure Persistent Volume and Persistent Volume Claim for FSx for Lustre

    The FSx for Lustre file system is created as described above in Provision a FSx for Lustre cluster. Once provisioned, a persistent volume, as specified below, is created with a direct (hard-coded) reference to the created Lustre file system. A Persistent Volume Claim for this persistent volume will always use the same file system.

    cat >fsxLustre-static-pv.yaml <<EOF\napiVersion: v1\nkind: PersistentVolume\nmetadata:\n  name: fsx-pv\nspec:\n  capacity:\n    storage: 1200Gi\n  volumeMode: Filesystem\n  accessModes:\n    - ReadWriteMany\n  mountOptions:\n    - flock\n  persistentVolumeReclaimPolicy: Recycle\n  csi:\n    driver: fsx.csi.aws.com\n    volumeHandle: <filesystem id>\n    volumeAttributes:\n      dnsname: <filesystem id>.fsx.<region>.amazonaws.com\n      mountname: mvmxtbmv\nEOF\n
    kubectl apply -f fsxLustre-static-pv.yaml\n

    Now, a Persistent Volume Claim (PVC) needs to be created that references the PV created above.

    cat >fsxLustre-static-pvc.yaml <<EOF\napiVersion: v1\nkind: PersistentVolumeClaim\nmetadata:\n  name: fsx-claim\n  namespace: ns1\nspec:\n  accessModes:\n    - ReadWriteMany\n  storageClassName: \"\"\n  resources:\n    requests:\n      storage: 1200Gi\n  volumeName: fsx-pv\nEOF\n
    kubectl apply -f fsxLustre-static-pvc.yaml -n <namespace registered with EMR on EKS Virtual Cluster>\n
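
    Optionally, confirm that the volume and claim are bound before submitting jobs:

    kubectl get pv fsx-pv\nkubectl get pvc fsx-claim -n <namespace registered with EMR on EKS Virtual Cluster>\n# both should report a STATUS of Bound\n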
    "},{"location":"storage/docs/spark/fsx-lustre/#spark-developer-tasks","title":"Spark Developer Tasks","text":"

    Now spark applications can use fsx-claim in their spark application config to mount the FSx for Lustre filesystem to driver and executor container volumes.

    cat >spark-python-in-s3-fsx.json <<EOF\n{\n  \"name\": \"spark-python-in-s3-fsx\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count-repartition-fsx.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n
    aws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-fsx.json\n

    Expected Behavior: All Spark jobs that run with fsx-claim as the persistent volume claim will mount the statically created FSx for Lustre file system.

    Use case:

    1. A data pipeline consisting of 10 spark applications can all be mounted to the statically created FSx for Lustre file system and can write the intermediate output to a particular folder. The next spark job in the data pipeline that is dependent on this data can read from FSx for Lustre. Data that needs to be persisted beyond the scope of the data pipeline can be exported to S3 by creating data repository tasks
    2. Data that is used often by multiple spark applications can also be stored in FSx for Lustre for improved performance.
    "},{"location":"storage/docs/spark/fsx-lustre/#dynamic-provisioning","title":"Dynamic Provisioning","text":"

    An FSx for Lustre file system can be provisioned on demand. A StorageClass resource is created that provisions FSx for Lustre file systems dynamically, and a PVC is created that refers to this storage class. Whenever a pod refers to the PVC, the storage class invokes the FSx for Lustre Container Storage Interface (CSI) driver to provision a Lustre file system on the fly. In this model, an FSx for Lustre file system of the Scratch deployment type is provisioned.

    "},{"location":"storage/docs/spark/fsx-lustre/#eks-admin-tasks_1","title":"EKS Admin Tasks","text":"
    1. Attach IAM policy to EKS worker node IAM role to enable access to FSx for Lustre - Mount FSx for Lustre on EKS and Create a Security Group for FSx for Lustre
    2. Install the FSx CSI Driver in EKS
    3. Configure Storage Class for FSx for Lustre (a sample definition is sketched after this list)
    4. Configure Persistent Volume Claim(fsx-dynamic-claim) for FSx for Lustre.
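
    The fsx-sc storage class from step 3 is not shown in this guide; a minimal sketch, assuming the FSx for Lustre CSI driver is installed and using placeholder subnet and security group IDs, could look like the following:

    cat >fsx-storage-class.yaml <<EOF\nkind: StorageClass\napiVersion: storage.k8s.io/v1\nmetadata:\n  name: fsx-sc\nprovisioner: fsx.csi.aws.com\nparameters:\n  subnetId: <subnet-id>\n  securityGroupIds: <securitygroup-id>\nEOF\n\nkubectl apply -f fsx-storage-class.yaml\n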

    Create PVC for dynamic provisioning with fsx-sc storage class.

    cat >fsx-dynamic-claim.yaml <<EOF\napiVersion: v1\nkind: PersistentVolumeClaim\nmetadata:\n  name: fsx-dynamic-claim\nspec:\n  accessModes:\n    - ReadWriteMany\n  storageClassName: fsx-sc\n  resources:\n    requests:\n      storage: 3600Gi\nEOF \n
    kubectl apply -f fsx-dynamic-claim.yaml -n <namespace registered with EMR on EKS Virtual Cluster>\n
    "},{"location":"storage/docs/spark/fsx-lustre/#spark-developer-tasks_1","title":"Spark Developer Tasks","text":"
    cat >spark-python-in-s3-fsx-dynamic.json << EOF\n{\n  \"name\": \"spark-python-in-s3-fsx-dynamic\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count-repartition-fsx.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=5  --conf spark.kubernetes.pyspark.pythonVersion=3 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6 --conf spark.sql.shuffle.partitions=1000\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.local.dir\":\"/var/spark/spill/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.options.claimName\":\"fsx-dynamic-claim\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.mount.path\":\"/var/spark/spill/\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.mount.readOnly\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n
    aws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-fsx-dynamic.json\n

    Expected Result: The statically provisioned FSx for Lustre file system is mounted to /var/data/ for the driver pod, as before. For all the executors, a SCRATCH_1 deployment type FSx for Lustre file system is provisioned on the fly by the storage class that was created. There will be some latency before the first executor can start running, because the Lustre file system has to be created first. Once it is created, the same Lustre file system is mounted to all the executors. Also note that \"spark.local.dir\":\"/var/spark/spill/\" is used to force the executors to use this Lustre-mounted folder for all spill and shuffle data. Once the Spark job is completed, the Lustre file system is deleted or retained based on the PVC configuration. This dynamically created Lustre file system is mapped to an S3 path like the statically created filesystem. See the FSx-csi user guide.

    "},{"location":"storage/docs/spark/instance-store/","title":"Instance Store Volumes","text":"

    When working with Spark workloads, it might be useful to use instances powered by SSD instance store volumes to improve the performance of your jobs. This storage is located on disks that are physically attached to the host computer and can provide better performance compared to traditional EBS volumes. In the context of Spark, this might be beneficial for wide transformations (e.g. JOIN, GROUP BY) that generate a significant amount of shuffle data that Spark persists on the local filesystem of the instances where the executors are running.

    In this document, we highlight two approaches to leverage NVMe disks in your workloads when using EMR on EKS. For a list of instances supporting NVMe disks, see Instance store volumes in the Amazon EC2 documentation.

    "},{"location":"storage/docs/spark/instance-store/#mount-kubelet-pod-directory-on-nvme-disks","title":"Mount kubelet pod directory on NVMe disks","text":"

    The kubelet service manages the lifecycle of pod containers that are created using Kubernetes. When a pod is launched on an instance, an ephemeral volume is automatically created for the pod, and this volume is mapped to a subdirectory within the path /var/lib/kubelet of the host node. This volume folder exists for the lifetime of the K8s pod and is automatically deleted once the pod ceases to exist.

    In order to leverage NVMe disk attached to an EC2 node in our Spark application, we should perform the following actions during node bootstrap:

    • Prepare the NVMe disks attached to the instance (format disks and create a partition)
    • Mount the /var/lib/kubelet/pods path on the NVMe

    By doing this, all local files generated by your Spark job (block manager data, shuffle data, etc.) will be automatically written to the NVMe disks. This way, you don't have to configure a Spark volume path when launching the pod (driver or executor). This approach is easier to adopt because it doesn't require any additional configuration in your job. Besides, once the job is completed, all the data stored in ephemeral volumes will be automatically deleted when the EC2 instance is deleted.

    However, if you have multiple NVMe disks attached to the instance, you need to create RAID0 configuration of all the disks before mounting the /var/lib/kubelet/pods directory on the RAID partition. Without a RAID setup, it will not be possible to leverage all the disks capacity available on the node.

    The following example shows how to create a node group in your cluster using this approach. In order to prepare our NVMe disks, we can use the eksctl preBootstrapCommands definition while creating the node group. The script will perform the following actions:

    • For instances with a single NVMe disk, format the disk and create a Linux filesystem (e.g. ext4, xfs)
    • For instances with multiple NVMe disks, create a RAID 0 configuration across all available volumes

    Once the disks are formatted and ready to use, we will mount the folder /var/lib/kubelet/pods using the filesystem and setup correct permissions. Below, you can find an example of an eksctl configuration to create a managed node group using this approach.

    Example

    apiVersion: eksctl.io/v1alpha5\nkind: ClusterConfig\n\nmetadata:\n  name: YOUR_CLUSTER_NAME\n  region: YOUR_REGION\n\nmanagedNodeGroups:\n  - name: ng-c5d-9xlarge\n    instanceType: c5d.9xlarge\n    desiredCapacity: 1\n    privateNetworking: true\n    subnets:\n      - YOUR_NG_SUBNET\n    preBootstrapCommands: # commands executed as root\n      - yum install -y mdadm nvme-cli\n      - nvme_disks=($(nvme list | grep \"Amazon EC2 NVMe Instance Storage\" | awk -F'[[:space:]][[:space:]]+' '{print $1}')) && [[ ${#nvme_disks[@]} -eq 1 ]] && mkfs.ext4 -F ${nvme_disks[*]} && systemctl stop docker && mkdir -p /var/lib/kubelet/pods && mount ${nvme_disks[*]} /var/lib/kubelet/pods && chmod 750 /var/lib/docker && systemctl start docker\n      - nvme_disks=($(nvme list | grep \"Amazon EC2 NVMe Instance Storage\" | awk -F'[[:space:]][[:space:]]+' '{print $1}')) && [[ ${#nvme_disks[@]} -ge 2 ]] && mdadm --create --verbose /dev/md0 --level=0 --raid-devices=${#nvme_disks[@]} ${nvme_disks[*]} && mkfs.ext4 -F /dev/md0 && systemctl stop docker && mkdir -p /var/lib/kubelet/pods && mount /dev/md0 /var/lib/kubelet/pods && chmod 750 /var/lib/docker && systemctl start docker\n

    Benefits

    • No need to mount the disk using Spark configurations or pod templates
    • Data generated by the application will be deleted immediately at pod termination. Data will also be purged in case of pod failures.
    • One time configuration for the node group

    Cons

    • If multiple jobs are allocated on the same EC2 instance, contention of disk resources will occur because it is not possible to allocate instance store volume resources across jobs
    "},{"location":"storage/docs/spark/instance-store/#mount-nvme-disks-as-data-volumes","title":"Mount NVMe disks as data volumes","text":"

    In this section, we're going to explicitly mount instance store volumes as the mount path in the Spark configuration for drivers and executors.

    As in the previous example, this script will automatically format the instance store volumes and create an xfs partition. The disks are then mounted in local folders called /spark_data_IDX where IDX is an integer that corresponds to the disk mounted.

    Example

    apiVersion: eksctl.io/v1alpha5\nkind: ClusterConfig\n\nmetadata:\n  name: YOUR_CLUSTER_NAME\n  region: YOUR_REGION\n\nmanagedNodeGroups:\n  - name: ng-m5d-4xlarge\n    instanceType: m5d.4xlarge\n    desiredCapacity: 1\n    privateNetworking: true\n    subnets:\n      - YOUR_NG_SUBNET\n    preBootstrapCommands: # commands executed as root\n      - \"IDX=1;for DEV in /dev/nvme[1-9]n1;do mkfs.xfs ${DEV}; mkdir -p /spark_data_${IDX}; echo ${DEV} /spark_data_${IDX} xfs defaults,noatime 1 2 >> /etc/fstab; IDX=$((${IDX} + 1)); done\"\n      - \"mount -a\"\n      - \"chown 999:1000 /spark_data_*\"\n

    In order to successfully use ephemeral volumes within Spark, you need to specify additional configurations. In particular, the mounted volume name must start with spark-local-dir- for Spark to treat it as local storage.

    Below is an example configuration, provided during EMR on EKS job submission, that shows how to configure Spark to use two volumes as local storage for the job.

    Spark Configurations

    {\n  \"name\": ....,\n  \"virtualClusterId\": ....,\n  \"executionRoleArn\": ....,\n  \"releaseLabel\": ....,\n  \"jobDriver\": ....,\n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\",\n        \"properties\": {\n          \"spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path\": \"/spark_data_1\",\n          \"spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly\": \"false\",\n          \"spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path\": \"/spark_data_1\",\n          \"spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path\": \"/spark_data_2\",\n          \"spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.readOnly\": \"false\",\n          \"spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path\": \"/spark_data_2\"\n        }\n      }\n    ]\n  }\n}\n

    Please note that for this approach it is required to specify the following configurations for each volume that you want to use. (IDX is a label to identify the volume mounted)

    # Mount path on the host node\nspark.kubernetes.executor.volumes.hostPath.spark-local-dir-IDX.options.path\n\n# Mount path on the k8s pod\nspark.kubernetes.executor.volumes.hostPath.spark-local-dir-IDX.mount.path\n\n# (boolean) Should be defined as false to allow Spark to write in the path\nspark.kubernetes.executor.volumes.hostPath.spark-local-dir-IDX.mount.readOnly\n

    Benefits

    • You can allocate dedicated instance store volume resources across your Spark jobs (for example, let's take a scenario where an EC2 instance has two instance store volumes: if you run two Spark jobs on this node, you can dedicate one volume per Spark job)

    Cons

    • Additional configurations are required for Spark jobs to use instance store volumes. This approach can be error-prone if you don't control the instance types being used (for example, multiple node groups with different instance types). You can mitigate this issue by using k8s node selectors and specifying the instance type in your Spark configuration (see the example after this list): spark.kubernetes.node.selector.node.kubernetes.io/instance-type
    • Data created on the volumes is automatically deleted once the job is completed and the instance is terminated. However, you need to take extra measures to delete the data on instance store volumes if the EC2 instance is re-used or is not terminated.
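
    For example, adding the following to sparkSubmitParameters pins the job's pods to a single instance type (a sketch; m5d.4xlarge is a placeholder that should match the node group described above):

    --conf spark.kubernetes.node.selector.node.kubernetes.io/instance-type=m5d.4xlarge\n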
    "},{"location":"submit-applications/docs/spark/multi-arch-image/","title":"Build a Multi-architecture Docker Image Supporting arm64 & amd64","text":""},{"location":"submit-applications/docs/spark/multi-arch-image/#pre-requisites","title":"Pre-requisites","text":"

    We can complete all the steps either from a local desktop or using AWS Cloud9. If you're using AWS Cloud9, follow the instructions in the \"Setup AWS Cloud9\" section to create and configure the environment first; otherwise skip to the next section.

    "},{"location":"submit-applications/docs/spark/multi-arch-image/#setup-aws-cloud9","title":"Setup AWS Cloud9","text":"

    AWS Cloud9 is a cloud-based IDE that lets you write, run, and debug your code via just a browser. AWS Cloud9 comes preconfigured with some of the AWS dependencies we require to build our application, such as the AWS CLI tool.

    "},{"location":"submit-applications/docs/spark/multi-arch-image/#1-create-a-cloud9-instance","title":"1. Create a Cloud9 instance","text":"

    Instance type - Create an AWS Cloud9 environment from the AWS Management Console with an instance type of t3.small or larger. In our example, we used m5.xlarge for adequate memory and CPU to compile and build a large docker image.

    VPC - Follow the launch wizard and provide the required name. To interact with an existing EKS cluster in the same region later on, we recommend using the same VPC as your EKS cluster for the Cloud9 environment. Leave the remaining default values as they are.

    Storage size - You must increase the Cloud9 instance's EBS volume size (pre-attached to your AWS Cloud9 instance) to 30+ GB, because the default disk space (10 GB with ~72% used) is not enough for building a container image. Refer to the Resize an Amazon EBS volume used by an environment document and download the resize.sh script to your Cloud9 environment.

    touch resize.sh\n# Double click the file name in cloud9\n# Copy and paste the content from the official document to your file, save and close it\n

    Validate the disk size is 10GB currently:

    admin:~/environment $ df -h\nFilesystem        Size  Used Avail Use% Mounted on\ndevtmpfs          4.0M     0  4.0M   0% /dev\ntmpfs             951M     0  951M   0% /dev/shm\ntmpfs             381M  5.3M  376M   2% /run\n/dev/nvme0n1p1     10G  7.2G  2.9G  72% /\ntmpfs             951M   12K  951M   1% /tmp\n/dev/nvme0n1p128   10M  1.3M  8.7M  13% /boot/efi\ntmpfs             191M     0  191M   0% /run/user/1000\n

    Increase the disk size:

    bash resize.sh 30\n
    admin:~/environment $ df -h\nFilesystem        Size  Used Avail Use% Mounted on\ndevtmpfs          4.0M     0  4.0M   0% /dev\ntmpfs             951M     0  951M   0% /dev/shm\ntmpfs             381M  5.3M  376M   2% /run\n/dev/nvme0n1p1     30G  7.3G   23G  25% /\ntmpfs             951M   12K  951M   1% /tmp\n/dev/nvme0n1p128   10M  1.3M  8.7M  13% /boot/efi\ntmpfs             191M     0  191M   0% /run/user/1000\n
    "},{"location":"submit-applications/docs/spark/multi-arch-image/#2-install-docker-and-buildx-if-required","title":"2. Install Docker and Buildx if required","text":"
    • Installing Docker - a Cloud9 EC2 instance comes with a Docker daemon pre-installed. Outside of the Cloud9, your environment may or may not need to install Docker. If needed, follow the instructions in the Docker Desktop page to install.

    • Installing Buildx (pre-installed in Cloud9) - To build a single multi-arch Docker image (x86_64 and arm64), we may or may not need to install an extra Buildx plugin that extends the Docker CLI to support the multi-architecture feature. Docker Buildx is installed by default with Docker Engine since version 23.0. For earlier versions, you need to grab a binary from the GitHub repository and install it manually, or get it from a separate package. See the docker/buildx README for more information.

    Once the buildx CLI is available, we can create a builder instance, which gives access to the new multi-architecture features. You only have to perform this task once.

    # create a builder\ndocker buildx create --name mybuilder --use\n# boot up the builder and inspect\ndocker buildx inspect --bootstrap\n\n\n# list builder instances\n# the asterisk (*) next to a builder name indicates the selected builder.\ndocker buildx ls\n

    If your builder doesn't support QEMU, only limited platform types are supported as below. For example, the current builder instance created in Cloud9 doesn't support QEMU, so we can't build the docker image for the arm64 CPU type yet.

    NAME/NODE       DRIVER/ENDPOINT      STATUS   BUILDKIT PLATFORMS\ndefault        docker\ndefault       default              running  v0.11.6  linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386\nmybuilder *    docker-container\nmy_builder0   default              running  v0.11.6  linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386\n
    • Installing QEMU for Cloud9 - Building multi-platform images under emulation with QEMU is the easiest way to get started if your builder already supports it. However, AWS Cloud9 isn't preconfigured with binfmt_misc support, so we must install compiled QEMU binaries. The installation can easily be done via the docker run CLI:
     docker run --privileged --rm tonistiigi/binfmt --install all\n

    List the builder instances again. Now we see that the full list of platforms is supported, including Arm-based CPUs:

    docker buildx ls\n\nNAME/NODE     DRIVER/ENDPOINT             STATUS   BUILDKIT PLATFORMS\nmybuilder *   docker-container                              \n  mybuilder20 unix:///var/run/docker.sock running  v0.13.2  linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6     \ndefault       docker                                        \n  default     default                     running  v0.12.5  linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/386, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6\n
    "},{"location":"submit-applications/docs/spark/multi-arch-image/#build-a-docker-image-supporting-multi-arch","title":"Build a docker image supporting multi-arch","text":"

    In this example, we will create a spark-benchmark-utility container image. We are going to reuse the source code from the EMR on EKS benchmark Github repo.

    "},{"location":"submit-applications/docs/spark/multi-arch-image/#1-download-the-source-code-from-the-github","title":"1. Download the source code from the Github:","text":"
    git clone https://github.com/aws-samples/emr-on-eks-benchmark.git\ncd emr-on-eks-benchmark\n
    "},{"location":"submit-applications/docs/spark/multi-arch-image/#2-setup-required-environment-variables","title":"2. Setup required environment variables","text":"

    We will build an image to test EMR 6.15's performance. The equivalent versions are Spark 3.4.1 and Hadoop 3.3.6. Change them accordingly if needed.

    export SPARK_VERSION=3.4.1\nexport HADOOP_VERSION=3.3.6\n

    Log in to your own Amazon ECR registry:

    export AWS_REGION=us-east-1\nexport ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)\nexport ECR_URL=$ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com\n\naws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $ECR_URL\n
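
    If the target repositories do not exist yet in your account, create the ones you need first (a sketch; the repository names below match the image names used in the following steps):

    aws ecr create-repository --repository-name spark --region $AWS_REGION\naws ecr create-repository --repository-name eks-spark-benchmark --region $AWS_REGION\n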
    "},{"location":"submit-applications/docs/spark/multi-arch-image/#3-build-oss-spark-base-image-if-required","title":"3. Build OSS Spark base image if required","text":"

    If you want to test open-source Apache Spark's performance, build a base Spark image first. Otherwise skip this step.

    docker buildx build --platform linux/amd64,linux/arm64 \\\n-t $ECR_URL/spark:${SPARK_VERSION}_hadoop_${HADOOP_VERSION} \\\n-f docker/hadoop-aws-3.3.1/Dockerfile \\\n--build-arg HADOOP_VERSION=$HADOOP_VERSION --build-arg SPARK_VERSION=$SPARK_VERSION --push .\n
    "},{"location":"submit-applications/docs/spark/multi-arch-image/#4-get-emr-spark-base-image-from-aws","title":"4. Get EMR Spark base image from AWS","text":"
    export SRC_ECR_URL=755674844232.dkr.ecr.us-east-1.amazonaws.com\naws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin $SRC_ECR_URL\n\ndocker pull $SRC_ECR_URL/spark/emr-6.15.0:latest\n
    "},{"location":"submit-applications/docs/spark/multi-arch-image/#5-build-the-benchmark-utility-image","title":"5. Build the Benchmark Utility image","text":"

    Build and push the docker image based on the OSS Spark image built before (Step #3):

    docker buildx build --platform linux/amd64,linux/arm64 \\\n-t $ECR_URL/eks-spark-benchmark:${SPARK_VERSION}_hadoop_${HADOOP_VERSION} \\\n-f docker/benchmark-util/Dockerfile \\\n--build-arg SPARK_BASE_IMAGE=$ECR_URL/spark:${SPARK_VERSION}_hadoop_${HADOOP_VERSION} \\\n--push .\n

    Build and push the benchmark docker image based on EMR's Spark runtime (Step #4):

    docker buildx build --platform linux/amd64,linux/arm64 \\\n-t $ECR_URL/eks-spark-benchmark:emr6.15 \\\n-f docker/benchmark-util/Dockerfile \\\n--build-arg SPARK_BASE_IMAGE=$SRC_ECR_URL/spark/emr-6.15.0:latest \\\n--push .\n
    "},{"location":"submit-applications/docs/spark/multi-arch-image/#benchmark-application-based-on-the-docker-images-built","title":"Benchmark application based on the docker images built","text":"

    Based on the multi-arch docker images built previously, you can now start to run benchmark applications on both Intel- and Arm-based CPU nodes.

    In Cloud9, the following extra steps are required to configure the environment, before you can submit the applications.

    1. Install the kubectl/helm/eksctl CLI tools. Refer to this sample script, or see the sketch below.
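
    A minimal sketch of installing these tools on an x86_64 Cloud9 instance, using the upstream install methods (check the linked script and the official documentation for current versions):

    # kubectl\ncurl -LO \"https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl\"\nsudo install -m 0755 kubectl /usr/local/bin/kubectl\n\n# eksctl\ncurl -sL \"https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz\" | tar xz -C /tmp\nsudo mv /tmp/eksctl /usr/local/bin\n\n# helm\ncurl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash\n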

    2. Modify the IAM role attached to the Cloud9 EC2 instance so that it has enough privilege to assume the EKS cluster's admin role, or has permission to submit jobs against the EKS cluster.

    3. Upgrade AWS CLI and turn off the AWS managed temporary credentials in Cloud9:

    curl \"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip\" -o \"awscliv2.zip\"\nunzip awscliv2.zip\nsudo ./aws/install --update\n/usr/local/bin/aws cloud9 update-environment  --environment-id $C9_PID --managed-credentials-action DISABLE\nrm -vf ${HOME}/.aws/credentials\n
    4. Connect to the EKS cluster
    # a sample connection string\naws eks update-kubeconfig --name YOUR_EKS_CLUSTER_NAME --region us-east-1 --role-arn arn:aws:iam::ACCOUNTID:role/SparkOnEKS-iamrolesclusterAdmin-xxxxxx\n\n# validate the connection\nkubectl get svc\n
    "},{"location":"submit-applications/docs/spark/pyspark/","title":"Pyspark Job submission","text":"

    A Python interpreter is bundled in the EMR on EKS Spark image that is used to run the Spark job. Python code and dependencies can be provided with the options below.

    "},{"location":"submit-applications/docs/spark/pyspark/#python-code-self-contained-in-a-single-py-file","title":"Python code self contained in a single .py file","text":"

    To start with, in the simplest scenario - the example below shows how to submit a pi.py file that is self-contained and doesn't need any other dependencies.

    "},{"location":"submit-applications/docs/spark/pyspark/#python-file-from-s3","title":"Python file from S3","text":"

    Request: pi.py used in the request payload below is from the Spark examples.

    cat > spark-python-in-s3.json << EOF\n{\n  \"name\": \"spark-python-in-image\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/pi.py\", \n       \"sparkSubmitParameters\": \"--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=4\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3.json\n
    "},{"location":"submit-applications/docs/spark/pyspark/#python-file-from-mounted-volume","title":"Python file from mounted volume","text":"

    In the below example - pi.py is placed in a mounted volume. FSx for Lustre filesystem is mounted as a Persistent Volume on the driver pod under /var/data/ and will be referenced by local:// file prefix. For more information on how to mount FSx for lustre - EMR-Containers-integration-with-FSx-for-Lustre

    This approach can be used to provide spark application code and dependencies for execution. Persistent Volume mounted to the driver and executor pods lets you access the application code and dependencies with local:// prefix.

    cat > spark-python-in-FSx.json <<EOF\n{\n  \"name\": \"spark-python-in-FSx\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"local:///var/data/FSxLustre-pi.py\", \n       \"sparkSubmitParameters\": \"--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-Fsx.json\n
    "},{"location":"submit-applications/docs/spark/pyspark/#python-code-with-python-dependencies","title":"Python code with python dependencies","text":"

    Info

    boto3 will only work with 'Bundled as a .pex file' or with 'Custom docker image'

    "},{"location":"submit-applications/docs/spark/pyspark/#list-of-py-files","title":"List of .py files","text":"

    This is not a scalable approach, as the number of dependent files can grow large and you also need to manually specify all the transitive dependencies.

    cat > py-files-pi.py <<EOF\nfrom __future__ import print_function\n\nimport sys\nfrom random import random\nfrom operator import add\n\nfrom pyspark.sql import SparkSession\nfrom pyspark import SparkContext\n\nimport dependentFunc\n\nif __name__ == \"__main__\":\n    \"\"\"\n        Usage: pi [partitions]\n    \"\"\"\n    spark = SparkSession.builder.getOrCreate()\n    sc = spark.sparkContext\n    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2\n    n = 100000 * partitions\n\n    def f(_):\n        x = random() * 2 - 1\n        y = random() * 2 - 1\n        return 1 if x ** 2 + y ** 2 <= 1 else 0\n\n    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)\n    dependentFunc.message()\n    print(\"Pi is roughly %f\" % (4.0 * count / n))\n\n    spark.stop()\nEOF\n
    cat > dependentFunc.py <<EOF\ndef message():\n  print(\"Printing from inside the dependent python file\")\n\nEOF\n

    Upload dependentFunc.py and py-files-pi.py to s3

    Request:

    cat > spark-python-in-s3-dependency-files << EOF\n{\n  \"name\": \"spark-python-in-s3-dependency-files\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/py-files-pi.py\", \n       \"sparkSubmitParameters\": \"--py-files s3://<s3 prefix>/dependentFunc.py --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-dependency-files.json\n
    "},{"location":"submit-applications/docs/spark/pyspark/#bundled-as-a-zip-file","title":"Bundled as a zip file","text":"

    In this approach all the dependent Python files are bundled as a zip file. Each folder should have an __init__.py file, as documented in zip python dependencies. The zip should be created at the top folder level using the -r option.

    zip -r pyspark-packaged-dependency-src.zip . \n  adding: dependent/ (stored 0%)\n  adding: dependent/__init__.py (stored 0%)\n  adding: dependent/dependentFunc.py (deflated 7%)\n

    dependentFunc.py from earlier example has been bundled as pyspark-packaged-dependency-src.zip. Upload this file to a S3 location

    cat > py-files-zip-pi.py <<EOF\nfrom __future__ import print_function\n\nimport sys\nfrom random import random\nfrom operator import add\n\nfrom pyspark.sql import SparkSession\nfrom pyspark import SparkContext\n\nfrom dependent import dependentFunc\n\nif __name__ == \"__main__\":\n    \"\"\"\n        Usage: pi [partitions]\n    \"\"\"\n    spark = SparkSession.builder.getOrCreate()\n    sc = spark.sparkContext\n    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2\n    n = 100000 * partitions\n\n    def f(_):\n        x = random() * 2 - 1\n        y = random() * 2 - 1\n        return 1 if x ** 2 + y ** 2 <= 1 else 0\n\n    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)\n    dependentFunc.message()\n    print(\"Pi is roughly %f\" % (4.0 * count / n))\n\n    spark.stop()\nEOF\n

    Request:

    cat > spark-python-in-s3-dependency-zip.json <<EOF\n{\n  \"name\": \"spark-python-in-s3-dependency-zip\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/py-files-zip-pi.py\", \n       \"sparkSubmitParameters\": \"--py-files s3://<s3 prefix>/pyspark-packaged-dependency-src.zip --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n          }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-dependency-zip.json\n
    "},{"location":"submit-applications/docs/spark/pyspark/#bundled-as-a-egg-file","title":"Bundled as a .egg file","text":"

    Create a folder structure as in the below screenshot with the code from the previous example - py-files-zip-pi.py, dependentFunc.py
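
    The screenshot is not reproduced here; a minimal sketch of the layout and a hypothetical setup.py (the package name and version are chosen to match the artifact names used below) could look like:

    # folder layout (sketch)\n# pyspark-packaged-example/\n#   setup.py\n#   dependent/\n#     __init__.py\n#     dependentFunc.py\n\ncat > setup.py <<EOF\nfrom setuptools import setup, find_packages\n\nsetup(\n    name=\"pyspark_packaged_example\",\n    version=\"0.0.3\",\n    packages=find_packages(),\n)\nEOF\n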

    Steps to create .egg file

    cd /pyspark-packaged-example\npip install setuptools\npython setup.py bdist_egg\n

    Upload dist/pyspark_packaged_example-0.0.3-py3.8.egg to a S3 location

    Request:

    cat > spark-python-in-s3-dependency-egg.json <<EOF\n{\n  \"name\": \"spark-python-in-s3-dependency-egg\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/py-files-zip-pi.py\", \n       \"sparkSubmitParameters\": \"--py-files s3://<s3 prefix>/pyspark_packaged_example-0.0.3-py3.8.egg --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-dependency-egg.json\n
    "},{"location":"submit-applications/docs/spark/pyspark/#bundled-as-a-whl-file","title":"Bundled as a .whl file","text":"

    Create a folder structure as in the below screenshot with the code from the previous example - py-files-zip-pi.py, dependentFunc.py

    Steps to create .whl file

    cd /pyspark-packaged-example\npip install wheel\npython setup.py bdist_wheel\n

    Upload dist/pyspark_packaged_example-0.0.3-py3-none-any.whl to a s3 location

    Request:

    cat > spark-python-in-s3-dependency-wheel.json <<EOF\n{\n  \"name\": \"spark-python-in-s3-dependency-wheel\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/py-files-zip-pi.py\", \n       \"sparkSubmitParameters\": \"--py-files s3://<s3 prefix>/pyspark_packaged_example-0.0.3-py3-none-any.whl --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-dependency-wheel.json\n
    "},{"location":"submit-applications/docs/spark/pyspark/#bundled-as-a-pex-file","title":"Bundled as a .pex file","text":"

    pex is a library for generating .pex (Python EXecutable) files, which are executable Python environments. PEX files can be created as below:

    docker run -it -v $(pwd):/workdir python:3.7.9-buster /bin/bash #python 3.7.9 is installed in EMR 6.1.0\npip3 install pex\npex --python=python3 --inherit-path=prefer -v numpy -o numpy_dep.pex\n

    To read more about PEX: PEX, PEX documentation, Tips on PEX, pex packaging for pyspark.

    Approach 1: Using Persistent Volume - FSx for Lustre cluster

    Upload numpy_dep.pex to an S3 location that is mapped to an FSx for Lustre cluster. numpy_dep.pex can be placed on any Kubernetes persistent volume and mounted to the driver and executor pods. Request: kmeans.py used in the request below is from the Spark examples.

    cat > spark-python-in-s3-pex-fsx.json << EOF\n{\n  \"name\": \"spark-python-in-s3-pex-fsx\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/kmeans.py\",\n      \"entryPointArguments\": [\n        \"s3://<s3 prefix>/kmeans_data.txt\",\n        \"2\",\n        \"3\"\n       ], \n       \"sparkSubmitParameters\": \"--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.kubernetes.pyspark.pythonVersion\":\"3\",\n          \"spark.kubernetes.driverEnv.PEX_ROOT\":\"./tmp\",\n          \"spark.executorEnv.PEX_ROOT\":\"./tmp\",\n          \"spark.kubernetes.driverEnv.PEX_INHERIT_PATH\":\"prefer\",\n          \"spark.executorEnv.PEX_INHERIT_PATH\":\"prefer\",\n          \"spark.kubernetes.driverEnv.PEX_VERBOSE\":\"10\",\n          \"spark.kubernetes.driverEnv.PEX_PYTHON\":\"python3\",\n          \"spark.executorEnv.PEX_PYTHON\":\"python3\",\n          \"spark.pyspark.driver.python\":\"/var/data/numpy_dep.pex\",\n          \"spark.pyspark.python\":\"/var/data/numpy_dep.pex\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": { \n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\n\naws emr-containers start-job-run --cli-input-json file:////Spark-Python-in-s3-pex-fsx.json\n

    Approach 2: Using Custom Pod Templates

    Upload numpy_dep.pex to an S3 location. Create custom pod templates for the driver and executor pods. Custom pod templates allow running a command through initContainers before the main application container is created. In this case, the command will download the numpy_dep.pex file to the /tmp/numpy_dep.pex path of the driver and executor pods.

    Note: This approach is only supported for release image 5.33.0 and later or 6.3.0 and later.

    Sample driver pod template YAML file:

    cat > driver_pod_template.yaml <<EOF\napiVersion: v1\nkind: Pod\nspec:\n containers:\n   - name: spark-kubernetes-driver\n initContainers: \n   - name: my-init-container\n     image: 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-5.33.0-20210323:2.4.7-amzn-1-vanilla\n     volumeMounts:\n       - name: temp-data-dir\n         mountPath: /tmp\n     command:\n       - sh\n       - -c\n       - aws s3api get-object --bucket <s3-bucket> --key <s3-key-prefix>/numpy_dep.pex /tmp/numpy_dep.pex && chmod u+x /tmp/numpy_dep.pex\nEOF\n

    Sample executor pod template YAML file:

    cat > executor_pod_template.yaml <<EOF\napiVersion: v1\nkind: Pod\nspec:\n  containers:\n    - name: spark-kubernetes-executor\n  initContainers: \n    - name: my-init-container\n      image: 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-5.33.0-20210323:2.4.7-amzn-1-vanilla\n      volumeMounts:\n        - name: temp-data-dir\n          mountPath: /tmp\n      command:\n        - sh\n        - -c\n        - aws s3api get-object --bucket <s3-bucket> --key <s3-key-prefix>/numpy_dep.pex /tmp/numpy_dep.pex && chmod u+x /tmp/numpy_dep.pex\nEOF\n

    Replace initContainer's image with the respective release label's container image. In this case we are using the image of release emr-5.33.0-latest. Upload the driver and executor custom pod templates to S3

    Request: kmeans.py used in the below request is from spark examples

    cat > spark-python-in-s3-pex-pod-templates.json << EOF\n{\n  \"name\": \"spark-python-in-s3-pex-pod-templates\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-5.33.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/kmeans.py\",\n      \"entryPointArguments\": [\n        \"s3://<s3 prefix>/kmeans_data.txt\",\n        \"2\",\n        \"3\"\n       ], \n       \"sparkSubmitParameters\": \"--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.kubernetes.pyspark.pythonVersion\":\"3\",\n          \"spark.kubernetes.driverEnv.PEX_ROOT\":\"./tmp\",\n          \"spark.executorEnv.PEX_ROOT\":\"./tmp\",\n          \"spark.kubernetes.driverEnv.PEX_INHERIT_PATH\":\"prefer\",\n          \"spark.executorEnv.PEX_INHERIT_PATH\":\"prefer\",\n          \"spark.kubernetes.driverEnv.PEX_VERBOSE\":\"10\",\n          \"spark.kubernetes.driverEnv.PEX_PYTHON\":\"python3\",\n          \"spark.executorEnv.PEX_PYTHON\":\"python3\",\n          \"spark.pyspark.driver.python\":\"/tmp/numpy_dep.pex\",\n          \"spark.pyspark.python\":\"/tmp/numpy_dep.pex\",\n          \"spark.kubernetes.driver.podTemplateFile\": \"s3://<s3-prefix>/driver_pod_template.yaml\",\n          \"spark.kubernetes.executor.podTemplateFile\": \"s3://<s3-prefix>/executor_pod_template.yaml\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": { \n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\n\naws emr-containers start-job-run --cli-input-json file:////Spark-Python-in-s3-pex-pod-templates.json\n

    Point to note: PEX files don't have the Python interpreter bundled with them. Using the PEX env variables, we pass in the Python interpreter installed in the Spark driver and executor docker image.

    pex vs conda-pack: A pex file contains only dependent Python packages but no Python interpreter, while a conda-pack environment has a Python interpreter as well, so with the same Python packages a conda-pack environment is much larger than a pex file. A conda-pack environment is a tar.gz file and needs to be decompressed before being used, while a pex file can be used directly. If a Python interpreter already exists, pex is a better option than conda-pack. However, conda-pack is the only choice if you need a specific version of the Python interpreter that does not exist on the image and you do not have permission to install one (e.g., when you need to use a specific version of the Python interpreter with an enterprise PySpark cluster). If the pex file or conda-pack environment needs to be distributed to machines on demand, there is some overhead before running your application. With the same Python packages, a conda-pack environment has a larger overhead/latency than the pex file, as the conda-pack environment is usually much larger and needs to be decompressed before being used.

    For more information - Tips on PEX

    "},{"location":"submit-applications/docs/spark/pyspark/#bundled-as-a-targz-file-with-conda-pack","title":"Bundled as a tar.gz file with conda-pack","text":"

conda-pack for Spark: Install conda through Miniconda. Open a new terminal and execute the commands below.

    conda create -y -n example python=3.5 numpy\nconda activate example\npip install conda-pack\nconda pack -f -o numpy_environment.tar.gz\n

Upload numpy_environment.tar.gz to an S3 location that is mapped to an FSx for Lustre cluster. numpy_environment.tar.gz can be placed on any Kubernetes persistent volume and mounted to the driver pod and executor pod. Alternatively, the S3 path for numpy_environment.tar.gz can also be passed using --py-files.
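
As a quick sketch, the packed environment can be uploaded with the AWS CLI; the bucket prefix below is a placeholder:

aws s3 cp numpy_environment.tar.gz s3://<s3 prefix>/numpy_environment.tar.gz\n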

    Request:

    {\n  \"name\": \"spark-python-in-s3-conda-fsx\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/kmeans.py\",\n      \"entryPointArguments\": [\n        \"s3://<s3 prefix>/kmeans_data.txt\",\n        \"2\",\n        \"3\"\n       ], \n       \"sparkSubmitParameters\": \"--verbose --archives /var/data/numpy_environment.tar.gz#environment --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=4\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.executor.instances\": \"3\",\n          \"spark.dynamicAllocation.enabled\":\"false\",\n          \"spark.files\":\"/var/data/numpy_environment.tar.gz#environment\",\n          \"spark.kubernetes.pyspark.pythonVersion\":\"3\",\n          \"spark.pyspark.driver.python\":\"./environment/bin/python\",\n          \"spark.pyspark.python\":\"./environment/bin/python\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\n

    The above request doesn't work with spark on kubernetes

    "},{"location":"submit-applications/docs/spark/pyspark/#bundled-as-virtual-env","title":"Bundled as virtual env","text":"

    Warning

    This will not work with spark on kubernetes

This feature only works with YARN cluster mode. In this implementation for YARN, the dependencies will be installed from the repository for every driver and executor. This is not a scalable model, as noted in SPARK-25433. The recommended solution is to pass in the dependencies as a PEX file.

    "},{"location":"submit-applications/docs/spark/pyspark/#custom-docker-image","title":"Custom docker image","text":"

    See the details in the official documentation.

    Dockerfile

    FROM 107292555468.dkr.ecr.eu-central-1.amazonaws.com/spark/emr-6.3.0\nUSER root\nRUN pip3 install boto3\nUSER hadoop:hadoop\n
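
A minimal sketch for building this image and pushing it to your own ECR repository is below; the account ID, region and repository name are placeholders, and the repository is assumed to already exist:

aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com\ndocker build -t <account-id>.dkr.ecr.<region>.amazonaws.com/<repo-name>:emr-6.3.0-boto3 .\ndocker push <account-id>.dkr.ecr.<region>.amazonaws.com/<repo-name>:emr-6.3.0-boto3\n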
    "},{"location":"submit-applications/docs/spark/pyspark/#python-code-with-java-dependencies","title":"Python code with java dependencies","text":""},{"location":"submit-applications/docs/spark/pyspark/#list-of-packages","title":"List of packages","text":"

    Warning

    This will not work with spark on kubernetes

This feature only works with YARN cluster mode.

    kafka integration example

    ./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2\n
    "},{"location":"submit-applications/docs/spark/pyspark/#list-of-jar-files","title":"List of .jar files","text":"

This is not a scalable approach, as the number of dependent files can grow large, and you also need to manually specify all the transitive dependencies.

How to find all the .jar files that belong to a given package:

1. Go to Maven Repository
2. Search for the package name
3. Select the matching Spark and Scala version
4. Copy the URL of the jar file
5. Copy the URLs of the jar files of all compile dependencies

    Request:

    cat > Spark-Python-with-jars.json << EOF\n{\n  \"name\": \"spark-python-with-jars\",\n  \"virtualClusterId\": \"<virtual-cluster-id>\",\n  \"executionRoleArn\": \"<execution-role-arn>\",\n  \"releaseLabel\": \"emr-6.2.0-latest\",\n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/pi.py\",\n      \"sparkSubmitParameters\": \"--jars https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.1.1/spark-sql-kafka-0-10_2.12-3.1.1.jar,https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.6.2/commons-pool2-2.6.2.jar,https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/2.6.0/kafka-clients-2.6.0.jar,https://repo1.maven.org/maven2/org/apache/spark/spark-token-provider-kafka-0-10_2.12/3.1.1/spark-token-provider-kafka-0-10_2.12-3.1.1.jar,https://repo1.maven.org/maven2/org/apache/spark/spark-tags_2.12/3.1.1/spark-tags_2.12-3.1.1.jar --conf spark.driver.cores=3 --conf spark.executor.memory=8G --conf spark.driver.memory=6G --conf spark.executor.cores=3\"\n    }\n  },\n  \"configurationOverrides\": {\n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\",\n        \"logStreamNamePrefix\": \"demo\"\n      },\n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-with-jars.json\n
    "},{"location":"submit-applications/docs/spark/pyspark/#custom-docker-image_1","title":"Custom docker image","text":"

    See the basics in the official documentation.

    Approach 1: List of .jar files

This is not a scalable approach, as the number of dependent files can grow large, and you also need to manually specify all the transitive dependencies.

How to find all the .jar files that belong to a given package:

1. Go to Maven Repository
2. Search for the package name
3. Select the matching Spark and Scala version
4. Copy the URL of the jar file
5. Copy the URLs of the jar files of all compile dependencies

    Dockerfile

    FROM 107292555468.dkr.ecr.eu-central-1.amazonaws.com/spark/emr-6.3.0\n\nUSER root\n\nARG JAR_HOME=/usr/lib/spark/jars/\n\n# Kafka\nADD https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.1.1/spark-sql-kafka-0-10_2.12-3.1.1.jar $JAR_HOME\nADD https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.6.2/commons-pool2-2.6.2.jar $JAR_HOME\nADD https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/2.6.0/kafka-clients-2.6.0.jar $JAR_HOME\nADD https://repo1.maven.org/maven2/org/apache/spark/spark-token-provider-kafka-0-10_2.12/3.1.1/spark-token-provider-kafka-0-10_2.12-3.1.1.jar $JAR_HOME\nADD https://repo1.maven.org/maven2/org/apache/spark/spark-tags_2.12/3.1.1/spark-tags_2.12-3.1.1.jar $JAR_HOME\n\nRUN chmod -R +r  /usr/lib/spark/jars\n\nUSER hadoop:hadoop\n

Observed Behavior: Spark automatically loads all the .jar files from the /usr/lib/spark/jars/ directory. In the Dockerfile we are adding these files as the root user, so they get -rw------- permission while the original files have -rw-r--r-- permission. EMR on EKS uses hadoop:hadoop to run spark jobs, and files with -rw------- permission are hidden from this user and cannot be imported. To make these files readable for all users, run the command chmod -R +r /usr/lib/spark/jars and the files will have -rw-r--r-- permission.
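
A quick way to verify the fix is to build the image and list the jars as a non-root user; the image tag below is only an example:

docker build -t emr-6.3.0-kafka .\ndocker run --rm --user hadoop:hadoop emr-6.3.0-kafka ls -l /usr/lib/spark/jars/ | grep kafka\n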

    Approach 2: List of packages

This approach is resource intensive (minimum 1 vCPU, 2 GB RAM) because it runs a dummy spark job during the image build. Scale your local or CI/CD resources accordingly.

    Dockerfile

    FROM 107292555468.dkr.ecr.eu-central-1.amazonaws.com/spark/emr-6.3.0\n\nUSER root\n\nARG KAFKA_PKG=\"org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2\"\n\nRUN spark-submit run-example --packages $KAFKA_PKG --deploy-mode=client --master=local[1] SparkPi\nRUN mv /root/.ivy2/jars/* /usr/lib/spark/jars/\n\nUSER hadoop:hadoop\n

Observed Behavior: Spark runs ivy to fetch all of its dependencies (packages) when --packages is defined in the submit command. We can run a \"dummy\" spark job to make spark download its packages. These .jars are saved in /root/.ivy2/jars/, which we can move to /usr/lib/spark/jars/ for further use. These jars have -rw-r--r-- permission and do not require further modifications. The advantage of this method is that ivy downloads the dependencies of the package as well, so we needed to specify only org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 instead of the 5 jar files above.

    "},{"location":"submit-applications/docs/spark/pyspark/#import-of-dynamic-modules-pyd-so","title":"Import of Dynamic Modules (.pyd, .so)","text":"

Import of dynamic modules (.pyd, .so) is disallowed when bundled as a zip file.

Steps to create a .so file: example.c

    /* File : example.c */\n\n #include \"example.h\"\n unsigned int add(unsigned int a, unsigned int b)\n {\n    printf(\"\\n Inside add function in C library \\n\");\n    return (a+b);\n }\n

    example.h

    /* File : example.h */\n#include<stdio.h>\n extern unsigned int add(unsigned int a, unsigned int b);\n
    gcc  -fPIC -Wall -g -c example.c\ngcc -shared -fPIC -o libexample.so example.o\n

Upload libexample.so to an S3 location.

    pyspark code to be executed - py_c_call.py

    import sys\nimport os\n\nfrom ctypes import CDLL\nfrom pyspark.sql import SparkSession\n\n\nif __name__ == \"__main__\":\n\n    spark = SparkSession\\\n        .builder\\\n        .appName(\"py-c-so-example\")\\\n        .getOrCreate()\n\n    basedir = os.path.abspath(os.path.dirname(__file__))\n    libpath = os.path.join(basedir, 'libexample.so')\n    sum_list = CDLL(libpath)\n    data = [(1,2),(2,3),(5,6)]\n    columns=[\"a\",\"b\"]\n    df = spark.sparkContext.parallelize(data).toDF(columns)\n    df.withColumn('total', sum_list.add(df.a,df.b)).collect()\n    spark.stop()\n

    Request:

    cat > spark-python-in-s3-Clib.json <<EOF\n{\n  \"name\": \"spark-python-in-s3-Clib\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/py_c_call.py\", \n       \"sparkSubmitParameters\": \"--files s3://<s3 prefix>/libexample.so --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///spark-python-in-s3-Clib.json\n

Configuration of interest: --files s3://<s3 prefix>/libexample.so distributes libexample.so to the working directory of all executors. Dynamic modules (.pyd, .so) can also be imported by bundling them within .egg (SPARK-6764), .whl and .pex files.

    "},{"location":"troubleshooting/docs/change-log-level/","title":"Change Log level for Spark application on EMR on EKS","text":"

To obtain more detail about their application or job submission, Spark application developers can change the log level of their job depending on their requirements. Spark uses Apache Log4j for logging.

    "},{"location":"troubleshooting/docs/change-log-level/#change-log-level-to-debug","title":"Change log level to DEBUG","text":""},{"location":"troubleshooting/docs/change-log-level/#using-emr-classification","title":"Using EMR classification","text":"

The log level of Spark applications can be changed using the EMR spark-log4j configuration classification.

Request: The pi.py application script is from the Spark examples. EMR on EKS has included the example located at /usr/lib/spark/examples/src/main for you to try.

The spark-log4j classification can be used to configure values in log4j.properties for EMR releases 6.7.0 or lower, and log4j2.properties for EMR releases 6.8.0+.

    cat > Spark-Python-in-s3-debug-log.json << EOF\n{\n  \"name\": \"spark-python-in-s3-debug-log-classification\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"local:///usr/lib/spark/examples/src/main/python/pi.py\",\n      \"entryPointArguments\": [ \"200\" ],\n       \"sparkSubmitParameters\": \"--conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.memory=2G --conf spark.executor.instances=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.dynamicAllocation.enabled\":\"false\"\n          }\n      },\n      {\n        \"classification\": \"spark-log4j\", \n        \"properties\": {\n          \"log4j.rootCategory\":\"DEBUG, console\"\n          }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3-debug-log.json\n

The above request will print DEBUG logs in the Spark driver and executor containers. The generated logs will be pushed to S3 and Amazon CloudWatch Logs as configured in the request.
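
For example, once the job completes you can list the uploaded driver and executor logs under the configured S3 log URI; the bucket name matches the request above and the IDs are placeholders:

aws s3 ls s3://joblogs/<virtual-cluster-id>/jobs/<job-id>/containers/ --recursive\n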

Starting from version 3.3.0, Spark has migrated from Log4j 1 to Log4j 2. EMR on EKS still allows you to write the log4j properties to the same \"classification\": \"spark-log4j\"; however, they now need to follow the log4j2.properties format, such as

          {\n        \"classification\": \"spark-log4j\",\n        \"properties\": {\n          \"rootLogger.level\" : \"DEBUG\"\n          }\n      }\n
    "},{"location":"troubleshooting/docs/change-log-level/#custom-log4j-properties","title":"Custom log4j properties","text":"

Download log4j properties from here. Edit log4j.properties with the log level as required. Save the edited log4j.properties in a mounted volume. In this example, log4j.properties is placed in an S3 bucket that is mapped to an FSx for Lustre filesystem.
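
As a sketch, the downloaded template can be switched to DEBUG and uploaded to the S3 prefix that backs the FSx for Lustre mount; the bucket prefix is a placeholder:

sed -i 's/^log4j.rootCategory=.*/log4j.rootCategory=DEBUG, console/' log4j.properties\naws s3 cp log4j.properties s3://<s3 prefix>/log4j-debug.properties\n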

Request: The pi.py script used in the request payload below is from the Spark examples

    cat > Spark-Python-in-s3-debug-log.json << EOF\n{\n  \"name\": \"spark-python-in-s3-debug-log\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.2.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/pi.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n          \"spark.driver.extraJavaOptions\":\"-Dlog4j.configuration=file:///var/data/log4j-debug.properties\",\n          \"spark.executor.extraJavaOptions\":\"-Dlog4j.configuration=file:///var/data/log4j-debug.properties\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.options.claimName\":\"fsx-claim\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.path\":\"/var/data/\",\n          \"spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.readOnly\":\"false\"\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\n\naws emr-containers start-job-run --cli-input-json file:///Spark-Python-in-s3-debug-log.json\n

Configurations of interest: The configuration below enables the Spark driver and executors to pick up the log4j configuration file from the /var/data/ folder mounted into the driver and executor containers. For a guide to mounting FSx for Lustre into the driver and executor containers, refer to EMR Containers integration with FSx for Lustre

    \"spark.driver.extraJavaOptions\":\"-Dlog4j.configuration=file:///var/data/log4j-debug.properties\",\n\"spark.executor.extraJavaOptions\":\"-Dlog4j.configuration=file:///var/data/log4j-debug.properties\",\n
    "},{"location":"troubleshooting/docs/connect-spark-ui/","title":"Connect to Spark UI running on the Driver Pod","text":"

To obtain more detail about their application or to monitor their job execution, Spark application developers can connect to the Spark UI running on the driver pod.

The Spark UI (Spark History Server) is packaged with EMR on EKS out of the box. Alternatively, if you want to see the Spark UI immediately after the driver is spun up, you can use the instructions on this page to connect.

    This page shows how to use kubectl port-forward to connect to the Job's Driver Pod running in a Kubernetes cluster. This type of connection is useful for debugging purposes.

    Pre-Requisites

    • AWS cli should be installed
    • \"kubectl\" should be installed
• If this is the first time you are connecting to your EKS cluster from your machine, run aws eks update-kubeconfig --name <eks-cluster-name> --region <region> to download the kubeconfig file and use the correct context to talk to the API server.
    "},{"location":"troubleshooting/docs/connect-spark-ui/#submitting-the-job-to-a-virtual-cluster","title":"Submitting the job to a virtual cluster","text":"

    Request

    cat >spark-python.json << EOF\n{\n  \"name\": \"spark-python-in-s3\", \n  \"virtualClusterId\": \"<virtual-cluster-id>\", \n  \"executionRoleArn\": \"<execution-role-arn>\", \n  \"releaseLabel\": \"emr-6.3.0-latest\", \n  \"jobDriver\": {\n    \"sparkSubmitJobDriver\": {\n      \"entryPoint\": \"s3://<s3 prefix>/trip-count.py\", \n       \"sparkSubmitParameters\": \"--conf spark.driver.cores=4  --conf spark.executor.memory=20G --conf spark.driver.memory=20G --conf spark.executor.cores=4\"\n    }\n  }, \n  \"configurationOverrides\": {\n    \"applicationConfiguration\": [\n      {\n        \"classification\": \"spark-defaults\", \n        \"properties\": {\n\n         }\n      }\n    ], \n    \"monitoringConfiguration\": {\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"/emr-containers/jobs\", \n        \"logStreamNamePrefix\": \"demo\"\n      }, \n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://joblogs\"\n      }\n    }\n  }\n}\nEOF\naws emr-containers start-job-run --cli-input-json file:///spark-python.json\n

Once the job is submitted successfully, run the kubectl get pods -n <virtual-cluster-k8s-namespace> -w command to watch all the pods until you observe that the driver pod is in the \"Running\" state. The driver pod's name is usually in the spark-<job-id>-driver format.

    "},{"location":"troubleshooting/docs/connect-spark-ui/#connecting-to-the-driver-pod","title":"Connecting to the Driver Pod","text":"

The Spark driver pod hosts the Spark UI on port 4040. However, the pod runs within the internal Kubernetes network. To access internal Kubernetes resources, kubectl provides port forwarding, which allows access from your localhost. To access the driver pod in your cluster:

    1- Run kubectl port-forward <driver-pod-name> 4040:4040

    The result should be the following:

Forwarding from 127.0.0.1:4040 -> 4040\nForwarding from [::1]:4040 -> 4040\n

    2- Open a browser and type http://localhost:4040 in the Address bar.

    You should be able to connect to the Spark UI:

    "},{"location":"troubleshooting/docs/connect-spark-ui/#consideration","title":"Consideration","text":"

Long-running Spark jobs, such as Spark Streaming jobs or large Spark SQL queries, can generate large event logs. Large event logs can quickly use up storage space on the running pods, and you may experience a blank UI or even OutOfMemory errors when you load persistent UIs. To avoid these issues, we recommend that you either turn on the Spark event log rolling and compaction feature (the default emr-container-event-log-dir is /var/log/spark/apps) or write the event logs to an S3 location and parse them with a self-hosted Spark History Server.
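
If you choose event log rolling, the relevant Spark settings can be passed through sparkSubmitParameters or the spark-defaults classification, for example (the values below are illustrative):

--conf spark.eventLog.rolling.enabled=true --conf spark.eventLog.rolling.maxFileSize=128m\n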

    "},{"location":"troubleshooting/docs/eks-cluster-auto-scaler/","title":"EKS Cluster Auto-Scaler","text":"

Kubernetes provisions nodes using the Cluster Autoscaler (CAS). AWS EKS has its own implementation of the Kubernetes CAS, and EKS uses managed node groups to spin up nodes.

    "},{"location":"troubleshooting/docs/eks-cluster-auto-scaler/#logs-of-eks-cluster-auto-scaler","title":"Logs of EKS Cluster Auto-scaler.","text":"

On AWS, Cluster Autoscaler utilizes Amazon EC2 Auto Scaling groups to provision nodes. This section will help you identify the error message when the autoscaler fails to provision nodes.

An example scenario is where the NodeGroup fails because the requested instance type is not supported in certain Availability Zones:

    Could not launch On-Demand Instances. Unsupported - Your requested instance type (g4dn.xlarge) is not supported in your requested Availability Zone (ca-central-1d). Please retry your request by not specifying an Availability Zone or choosing ca-central-1a, ca-central-1b. Launching EC2 instance failed.\n

The steps to find the logs for the Auto Scaling groups are:

    Step 1: Login to AWS Console, and select Elastic Kubernetes Service

    Step 2: Select Compute tab, and select the NodeGroup that fails.

    Step 3: Select the Autoscaling group name from the NodeGroup's section, which will direct you to EC2 --> AutoScaling Group page.

Step 4: Click the Activity tab of the Auto Scaling group, and the Activity History will provide the details of the error:

    - Status\n- Description\n- Cause\n- Start Time\n- End Time\n

Alternatively, the activities/logs can be found via the CLI as well:

    aws autoscaling describe-scaling-activities \\\n  --region <region> \\\n  --auto-scaling-group-name <NodeGroup-AutoScaling-Group>\n

In the above error scenario, the ca-central-1d Availability Zone doesn't support g4dn.xlarge. The solution is:

Step 1: Identify the subnets of the Availability Zones that support the GPU node type, as sketched below. The NodeGroup section lists all the subnets, and you can click each subnet to see which AZ it is deployed to.
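
A sketch for checking which Availability Zones offer the instance type and which AZ each subnet belongs to; the region and subnet IDs are placeholders:

aws ec2 describe-instance-type-offerings --region <region> --location-type availability-zone --filters Name=instance-type,Values=g4dn.xlarge --query \"InstanceTypeOfferings[].Location\" --output text\n\naws ec2 describe-subnets --region <region> --subnet-ids <subnet-1> <subnet-2> --query \"Subnets[].[SubnetId,AvailabilityZone]\" --output table\n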

    Step 2: Create a NodeGroup only in the Subnets identified in the above step

    aws eks create-nodegroup \\\n    --region <region> \\ \n    --cluster-name <cluster-name> \\\n    --nodegroup-name <nodegroup-name> \\\n    --scaling-config minSize=10,maxSize=10,desiredSize=10 \\\n    --ami-type AL2_x86_64_GPU \\\n    --node-role <NodeGroupRole> \\\n    --subnets <subnet-1-that-supports-gpu> <subnet-2-that-supports-gpu> \\\n    --instance-types g4dn.xlarge \\\n    --disk-size <disk size>\n
    "},{"location":"troubleshooting/docs/karpenter/","title":"Karpenter","text":"

Karpenter is an open-source cluster autoscaler for Kubernetes (EKS) that automatically provisions new nodes in response to unschedulable pods. Before Karpenter was introduced, EKS would typically use the Cluster Autoscaler (CAS), which scales managed node groups to provision nodes.

The challenge with managed node groups is that a node group can only create nodes of a single instance type. In order to provision nodes with different instance types for different workloads, multiple node groups have to be created. Karpenter, on the other hand, can provision nodes of different types by working with the EC2 Fleet API. The best practices for configuring Provisioners are documented at https://aws.github.io/aws-eks-best-practices/karpenter/

    This guide helps the user troubleshoot common problems with Karpenter.

    "},{"location":"troubleshooting/docs/karpenter/#logs-of-karpenter-controller","title":"Logs of Karpenter Controller","text":"

Karpenter is a custom Kubernetes controller, and the following steps help you find the Karpenter logs.

Step 1: Identify the namespace where Karpenter is running. In most cases, Helm is used to deploy the Karpenter packages. The helm ls command lists the namespace where Karpenter is installed.

    # Example\n\n% helm ls --all-namespaces\nNAME        NAMESPACE   REVISION    UPDATED                                 STATUS      CHART               APP VERSION\nkarpenter   karpenter   1           2023-05-15 14:16:03.726908 -0500 CDT    deployed    karpenter-v0.27.3   0.27.3\n

Step 2: Set up kubectl

    brew install kubectl\n\naws --region <region> eks update-kubeconfig --name <eks-cluster-name>\n

    Step 3: Check the status of the pods of Karpenter

    # kubectl get pods -n <namespace>\n\n% kubectl get pods -n karpenter\nNAME                         READY   STATUS    RESTARTS   AGE\nkarpenter-7b455dccb8-prrzx   1/1     Running   0          7m18s\nkarpenter-7b455dccb8-x8zv8   1/1     Running   0          7m18s\n

Step 4: The kubectl logs command helps you read the Karpenter logs. In the example below, the Karpenter pod logs show that a t3a.large instance was launched.

    # kubectl logs <karpenter pod name> -n <namespace>\n\n% kubectl logs karpenter-7b455dccb8-prrzx -n karpenter\n..\n..\n\n2023-05-15T19:16:20.546Z    DEBUG   controller  discovered region   {\"commit\": \"***-dirty\", \"region\": \"us-west-2\"}\n2023-05-15T19:16:20.666Z    DEBUG   controller  discovered cluster endpoint {\"commit\": \"**-dirty\", \"cluster-endpoint\": \"https://******.**.us-west-2.eks.amazonaws.com\"}\n..\n..\n2023-05-15T19:16:20.786Z    INFO    controller.provisioner  starting controller {\"commit\": \"**-dirty\"}\n2023-05-15T19:16:20.787Z    INFO    controller.deprovisioning   starting controller {\"commit\": \"**-dirty\"}\n..\n2023-05-15T19:16:20.788Z    INFO    controller  Starting EventSource    {\"commit\": \"**-dirty\", \"controller\": \"node\", \"controllerGroup\": \"\", \"controllerKind\": \"Node\", \"source\": \"kind source: *v1.Pod\"}\n..\n2023-05-15T20:34:56.718Z    INFO    controller.provisioner.cloudprovider    launched instance   {\"commit\": \"d7e22b1-dirty\", \"provisioner\": \"default\", \"id\": \"i-03146cd4d4152a935\", \"hostname\": \"ip-*-*-*-*.us-west-2.compute.internal\", \"instance-type\": \"t3a.large\", \"zone\": \"us-west-2d\", \"capacity-type\": \"on-demand\", \"capacity\": {\"cpu\":\"2\",\"ephemeral-storage\":\"20Gi\",\"memory\":\"7577Mi\",\"pods\":\"35\"}}\n
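
To narrow down provisioning failures, the controller logs can be filtered for errors; this assumes Karpenter was installed as a deployment named karpenter in the karpenter namespace:

kubectl logs deployment/karpenter -n karpenter --all-containers=true | grep -i error\n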
    "},{"location":"troubleshooting/docs/karpenter/#error-while-decoding-json-json-unknown-field-iamidentitymappings","title":"Error while decoding JSON: json: unknown field \"iamIdentityMappings\"","text":"

Problem: The create-cluster command from https://karpenter.sh/v0.27.3/getting-started/getting-started-with-karpenter/#3-create-a-cluster throws an error:

    Error: loading config file \"karpenter.yaml\": error unmarshaling JSON: while decoding JSON: json: unknown field \"iamIdentityMappings\"\n

Solution: The eksctl CLI was not able to understand the kind iamIdentityMappings. This is because the eksctl version is old and its schema doesn't support this kind.

The solution is to upgrade the eksctl CLI and re-run the cluster creation commands:

    brew upgrade eksctl\n
    "},{"location":"troubleshooting/docs/rbac-permissions-errors/","title":"RBAC Permission Errors","text":"

    The following sections provide solutions to common RBAC authorization errors.

    "},{"location":"troubleshooting/docs/rbac-permissions-errors/#persistentvolumeclaims-is-forbidden","title":"PersistentVolumeClaims is forbidden","text":"

Error: Spark jobs that require creation, listing or deletion of Persistent Volume Claims (PVC) were not supported before EMR 6.8. Jobs that require these permissions will fail with the exception \"persistentvolumeclaims is forbidden\". Looking into the driver logs, you may see an error like this:

persistentvolumeclaims is forbidden. User \"system:serviceaccount:emr:emr-containers-sa-spark-client-93ztm12rnjz163mt3rgdb3bjqxqfz1cgvqh1e9be6yr81\" cannot create resource \"persistentvolumeclaims\" in API group \"\" in namespace \"emr\".\n

    You may encounter this error because the default Kubernetes role emr-containers is missing the required RBAC permissions. As a result, the emr-containers primary role can\u2019t dynamically create necessary permissions for additional roles such as Spark driver, Spark executor or Spark client when you submit a job.

    Solution: Add the required permissions to emr-containers.

    Here are the complete RBAC permissions for EMR on EKS:

    • emr-containers.yaml

You can compare whether you have the complete RBAC permissions using the steps below:

    export NAMESPACE=YOUR_VALUE\nkubectl describe role emr-containers -n ${NAMESPACE}\n

If the permissions don't match, proceed to apply the latest permissions:

export NAMESPACE=YOUR_VALUE\nkubectl apply -f https://raw.githubusercontent.com/aws/aws-emr-containers-best-practices/main/tools/k8s-rbac-policies/emr-containers.yaml -n ${NAMESPACE}\n

    You can delete the spark driver and client roles because they will be dynamically created when the job is run next time.

    "},{"location":"troubleshooting/docs/reverse-proxy-sparkui/","title":"Connect to SparkUI without Port Fowarding","text":"

This is an example of connecting to the Spark UI running on Spark's driver pod via a reverse proxy solution, without access to the kubectl tool or the AWS console. We demonstrate how to set it up via three EMR on EKS deployment methods: Spark Operator, JobRun API, and spark-submit.

    "},{"location":"troubleshooting/docs/reverse-proxy-sparkui/#launch-emr-on-eks-jobs-via-spark-operator","title":"Launch EMR on EKS jobs via Spark Operator","text":"

There are three steps to set up in EKS:

1.Create a SparkUI reverse proxy and an ALB in the default namespace, which is a different namespace from your EMR on EKS virtual cluster environment. It can be configured to use the EMR namespace if necessary.

The sample yaml file is in the Appendix section. Make sure the EMR on EKS namespace at line #25 in deployment.yaml is updated if needed:

    kubectl apply -f deployment.yaml\n
NOTE: The example file is not production ready. The listen port 80 is not recommended. Make sure to strengthen your Application Load Balancer's security posture before deploying it to your production environment.

2.Submit two test jobs using EMR on EKS's Spark Operator. The sample job scripts emr-eks-spark-example-01.yaml and emr-eks-spark-example-02.yaml can be found in the Appendix section. The \"spec.driver.serviceAccount\" attribute should be updated based on your own IAM Roles for Service Accounts (IRSA) setup in EMR on EKS.

Remember to specify the Spark configuration at line #16, spark.ui.proxyBase: /sparkui/YOUR_SPARK_APP_NAME, e.g. spark.ui.proxyBase: /sparkui/test-02.

    kubectl apply -f emr-eks-spark-example-01.yaml\nkubectl apply -f emr-eks-spark-example-02.yaml\n

3.Go to a web browser, then access each job's Spark web UI while the jobs are still running.

    The Web UI address is in the format of http://ALB_ENDPOINT_ADDRESS:PORT/sparkui/YOUR_SPARK_APP_NAME. For example:

    http://k8s-default-sparkui-2d325c0434-124141735.us-west-2.elb.amazonaws.com:80/sparkui/spark-example-01\nhttp://k8s-default-sparkui-2d325c0434-124141735.us-west-2.elb.amazonaws.com:80/sparkui/test-02\n

The EKS admin can provide the ALB endpoint address to users via the command:

    kubectl get ingress\n
    "},{"location":"troubleshooting/docs/reverse-proxy-sparkui/#launch-emr-on-eks-jobs-via-job-run-api","title":"Launch EMR on EKS jobs via Job Run API","text":"

1.Update the environment variables in the sample job submission script:

    export EMR_VIRTUAL_CLUSTER_NAME=YOUR_EMR_VIRTUAL_CLUSTER_NAME\nexport AWS_REGION=YOUR_AWS_REGION\nexport app_name=job-run-api\n\nexport ACCOUNTID=$(aws sts get-caller-identity --query Account --output text)\nexport VIRTUAL_CLUSTER_ID=$(aws emr-containers list-virtual-clusters --query \"virtualClusters[?name == '$EMR_VIRTUAL_CLUSTER_NAME' && state == 'RUNNING'].id\" --output text)\nexport EMR_ROLE_ARN=arn:aws:iam::$ACCOUNTID:role/$EMR_VIRTUAL_CLUSTER_NAME-execution-role\nexport S3BUCKET=$EMR_VIRTUAL_CLUSTER_NAME-$ACCOUNTID-$AWS_REGION\n\naws emr-containers start-job-run \\\n--virtual-cluster-id $VIRTUAL_CLUSTER_ID \\\n--name $app_name \\\n--execution-role-arn $EMR_ROLE_ARN \\\n--release-label emr-7.1.0-latest \\\n--job-driver '{\n\"sparkSubmitJobDriver\": {\n    \"entryPoint\": \"local:///usr/lib/spark/examples/jars/spark-examples.jar\", \n    \"entryPointArguments\": [\"100000\"],\n    \"sparkSubmitParameters\": \"--class org.apache.spark.examples.SparkPi --conf spark.executor.instances=1\" }}' \\\n--configuration-overrides '{\n\"applicationConfiguration\": [\n    {\n    \"classification\": \"spark-defaults\", \n    \"properties\": {\n        \"spark.ui.proxyBase\": \"/sparkui/`$app_name`\",\n        \"spark.ui.proxyRedirectUri\": \"/\"\n    }}]}'\n

2.Once the job's driver pod is running, create a Kubernetes service based on the driver pod name. Ensure its name contains the suffix ui-svc:

    # query the driver pod name\njob_id=$(aws emr-containers list-job-runs --virtual-cluster-id $VIRTUAL_CLUSTER_ID --query \"jobRuns[?name=='$app_name' && state=='RUNNING'].id\" --output text)\ndriver_pod_name=$(kubectl get po -n emr | grep $job_id-driver | awk '{print $1}')\n# create a k8s service\nkubectl expose po -n emr \\\n--port=4040 \\\n--target-port 4040 \\\n--name=$app_name-ui-svc \\\n$driver_pod_name\n

    The SparkUI service looks like this:

    kubectl get svc -n emr\n\nNAME                  TYPE      CLUSTER-IP  EXTERNAL-IP   PORT(S)  AGE\njob-run-api-ui-svc ClusterIP 10.100.233.186  <none>      4040/TCP   9s\n

    3.Finally, access the SparkUI in this format:

    http://<YOUR_INGRESS_ADDRESS>/sparkui/<app_name>\n

The admin can get the ingress address via the CLI:

    kubectl get ingress\n
    "},{"location":"troubleshooting/docs/reverse-proxy-sparkui/#launch-emr-on-eks-jobs-by-spark-submit","title":"Launch EMR on EKS jobs by Spark Submit:","text":"

    1.Create an EMR on EKS pod with a service account that has the IRSA associated

    kubectl run -it emrekspod \\\n--image=public.ecr.aws/emr-on-eks/spark/emr-7.1.0:latest \\\n--overrides='{ \"spec\": {\"serviceAccount\": \"emr-containers-sa-spark\"}}' \\\n--command -n spark-operator /bin/bash\n

2.After logging in to the \"emrekspod\" pod, submit the job:

    export app_name=sparksubmittest\n\nspark-submit \\\n--master k8s://$KUBERNETES_SERVICE_HOST:443 \\\n--deploy-mode cluster \\\n--name $app_name \\\n--class org.apache.spark.examples.SparkPi \\\n--conf spark.ui.proxyBase=/sparkui/$app_name \\\n--conf spark.ui.proxyRedirectUri=\"/\" \\\n--conf spark.kubernetes.container.image=public.ecr.aws/emr-on-eks/spark/emr-7.1.0:latest \\\n--conf spark.kubernetes.authenticate.driver.serviceAccountName=emr-containers-sa-spark \\\n--conf spark.kubernetes.namespace=spark-operator \\\nlocal:///usr/lib/spark/examples/jars/spark-examples.jar 100000\n

3.As soon as the driver pod is running, create a Kubernetes service for the Spark UI and ensure its name has the suffix ui-svc:

    export app_name=sparksubmittest\n# get the running driver pod name\ndriver_pod_name=$(kubectl get po -n spark-operator | grep -E 'sparksubmittest-.*-driver' | grep 'Running' | awk '{print $1}')\n\n# OPTIONAL - remove existing service if needed\nkubectl delete svc $app_name-ui-svc -n spark-operator\n\n# create k8s service\nkubectl expose po -n spark-operator \\\n--port=4040 \\\n--target-port 4040 \\\n--name=$app_name-ui-svc \\\n$driver_pod_name\n
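
As with the other methods, the UI should then be reachable through the reverse proxy ingress; the address below shows the format only, not a real endpoint:

http://<YOUR_INGRESS_ADDRESS>/sparkui/sparksubmittest\n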
    "},{"location":"troubleshooting/docs/reverse-proxy-sparkui/#appendix","title":"Appendix","text":""},{"location":"troubleshooting/docs/reverse-proxy-sparkui/#deploymentyaml","title":"deployment.yaml","text":"
    apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: spark-ui-reverse-proxy\n  labels:\n    app: spark-ui-reverse-proxy\nspec:\n  replicas: 1\n  selector:\n    matchLabels:\n      app: spark-ui-reverse-proxy\n  template:\n    metadata:\n      labels:\n        app: spark-ui-reverse-proxy\n    spec:\n      containers:\n      - name: spark-ui-reverse-proxy\n        image: ghcr.io/datapunchorg/spark-ui-reverse-proxy:main-1652762636\n        imagePullPolicy: IfNotPresent\n        command:\n          - '/usr/bin/spark-ui-reverse-proxy'\n        args:\n          # EMR on EKS's namespace\n          - -namespace=spark-operator\n        resources:\n          requests:\n            cpu: 500m\n            memory: 512Mi\n---\napiVersion: v1\nkind: Service\nmetadata:\n  name: spark-ui-reverse-proxy\n  labels:\n    app: spark-ui-reverse-proxy\nspec:\n  type: ClusterIP\n  ports:\n    - name: http\n      protocol: TCP\n      port: 8080\n      targetPort: 8080\n  selector:\n    app: spark-ui-reverse-proxy\n\n---\napiVersion: networking.k8s.io/v1\nkind: IngressClass\nmetadata:\n  name: alb-ingress-class\nspec:\n  controller: ingress.k8s.aws/alb\n\n--- \napiVersion: networking.k8s.io/v1\nkind: Ingress\nmetadata:\n  name: spark-ui\n  annotations:\n    # kubernetes.io/ingress.class: alb\n    alb.ingress.kubernetes.io/scheme: internet-facing\n    alb.ingress.kubernetes.io/target-type: ip\n    alb.ingress.kubernetes.io/success-codes: 200,301,302\n    alb.ingress.kubernetes.io/listen-ports: '[{\"HTTP\": 80}]'\n    alb.ingress.kubernetes.io/manage-backend-security-group-rules: \"true\"\n    # alb.ingress.kubernetes.io/security-groups: {{INBOUND_SG}}\n  # labels:\n  #   app: spark-ui-reverse-proxy\nspec:\n  ingressClassName: \"alb-ingress-class\"\n  rules:\n  - host: \"\"\n    http:\n      paths:\n      - path: /\n        pathType: Prefix\n        backend:\n          service:\n              name: spark-ui-reverse-proxy\n              port:\n                number: 8080\n
    "},{"location":"troubleshooting/docs/reverse-proxy-sparkui/#emr-eks-spark-example-01yaml","title":"emr-eks-spark-example-01.yaml","text":"
    apiVersion: \"sparkoperator.k8s.io/v1beta2\"\nkind: SparkApplication\nmetadata:\n  name: spark-example-01\n  namespace: spark-operator\nspec:\n  type: Scala\n  image: public.ecr.aws/emr-on-eks/spark/emr-7.1.0:latest\n  mainClass: org.apache.spark.examples.SparkPi\n  mainApplicationFile: \"local:///usr/lib/spark/examples/jars/spark-examples.jar\"\n  arguments: [\"100000\"]\n  sparkVersion: 3.5.0\n  restartPolicy:\n    type: Never\n  sparkConf:\n    spark.ui.proxyBase: /sparkui/spark-example-01\n    spark.ui.proxyRedirectUri: /\n  driver:\n    cores: 1\n    coreLimit: \"1200m\"\n    memory: \"1g\"\n    serviceAccount: emr-containers-sa-spark\n  executor:\n    cores: 2\n    instances: 2\n    memory: \"5120m\"\n
    "},{"location":"troubleshooting/docs/reverse-proxy-sparkui/#emr-eks-spark-example-02yaml","title":"emr-eks-spark-example-02.yaml","text":"
    apiVersion: \"sparkoperator.k8s.io/v1beta2\"\nkind: SparkApplication\nmetadata:\n  name: spark-example-02\n  namespace: spark-operator\nspec:\n  type: Scala\n  image: public.ecr.aws/emr-on-eks/spark/emr-7.1.0:latest\n  mainClass: org.apache.spark.examples.SparkPi\n  mainApplicationFile: \"local:///usr/lib/spark/examples/jars/spark-examples.jar\"\n  arguments: [\"1000000\"]\n  sparkVersion: 3.5.0\n  restartPolicy:\n    type: Never\n  sparkConf:\n    spark.ui.proxyBase: /sparkui/test-02\n    spark.ui.proxyRedirectUri: /\n  driver:\n    cores: 1\n    coreLimit: \"1200m\"\n    memory: \"1g\"\n    serviceAccount: emr-containers-sa-spark\n  executor:\n    cores: 1\n    instances: 1\n    memory: \"2120m\"\n
    "},{"location":"troubleshooting/docs/self-hosted-shs/","title":"Self Hosted Spark History Server","text":"

In this section, you will learn how to self-host the Spark History Server instead of using the Persistent App UI on the AWS Console.

1. In your StartJobRun call for EMR on EKS, set the following configurations to point to an S3 bucket where you would like your event logs to go: spark.eventLog.dir and spark.eventLog.enabled, as such:

      \"configurationOverrides\": {\n  \"applicationConfiguration\": [{\n    \"classification\": \"spark-defaults\",\n    \"properties\": {\n      \"spark.eventLog.enabled\": \"true\",\n      \"spark.eventLog.dir\": \"s3://your-bucket-here/some-directory\"\n...\n
2. Take note of the S3 bucket specified in #1, and use it in the instructions in step #3 wherever you are asked for path_to_eventlog; make sure it is prepended with s3a://, not s3://. An example is -Dspark.history.fs.logDirectory=s3a://path_to_eventlog.

    3. Follow instructions here to launch Spark History Server using a Docker image.
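
A minimal sketch of such a launch is below; the image name is a placeholder, and it assumes the container entrypoint starts the history server and honors SPARK_HISTORY_OPTS plus AWS credentials from the environment:

docker run -d -p 18080:18080 \\\n  -e SPARK_HISTORY_OPTS=\"-Dspark.history.fs.logDirectory=s3a://your-bucket-here/some-directory\" \\\n  -e AWS_ACCESS_KEY_ID=<access-key> -e AWS_SECRET_ACCESS_KEY=<secret-key> \\\n  <spark-history-server-image>\n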

    4. After following the above steps, event logs should flow to the specified S3 bucket and the docker container should spin up Spark History Server (which will be available at 127.0.0.1:18080). This instance of Spark History Server will pick up and parse event logs from the S3 bucket specified.

    "},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/","title":"Spark Driver and Executor Logs","text":"

The status of the Spark jobs can be monitored via the EMR on EKS describe-job-run API.

To be able to monitor the job progress and to troubleshoot failures, you must configure your jobs to send log information to Amazon S3, Amazon CloudWatch Logs, or both.

    "},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#send-spark-logs-to-s3","title":"Send Spark Logs to S3","text":""},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#update-the-iam-role-with-s3-write-access","title":"Update the IAM role with S3 write access","text":"

    Configure the IAM Role passed in StartJobRun input executionRoleArn with access to S3 buckets.

    {\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Effect\": \"Allow\",\n            \"Action\": [\n                \"s3:PutObject\",\n                \"s3:GetObject\",\n                \"s3:ListBucket\"\n            ],\n            \"Resource\": [\n                \"arn:aws:s3:::my_s3_log_location\",\n                \"arn:aws:s3:::my_s3_log_location/*\",\n            ]\n        }\n    ]\n}\n
    "},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#configure-the-startjobrun-api-with-s3-buckets","title":"Configure the StartJobRun API with S3 buckets","text":"

    Configure the monitoringConfiguration with s3MonitoringConfiguration, and configure the S3 location where the logs would be synced.

    {\n  \"name\": \"<job_name>\", \n  \"virtualClusterId\": \"<vc_id>\",  \n  \"executionRoleArn\": \"<iam_role_name_for_job_execution>\", \n  \"releaseLabel\": \"<emr_release_label>\", \n  \"jobDriver\": {\n\n  }, \n  \"configurationOverrides\": {\n    \"monitoringConfiguration\": {\n      \"persistentAppUI\": \"ENABLED\",\n      \"s3MonitoringConfiguration\": {\n        \"logUri\": \"s3://my_s3_log_location\"\n      }\n    }\n  }\n}\n
    "},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#log-location-of-jobrunner-driver-executor-in-s3","title":"Log location of JobRunner, Driver, Executor in S3","text":"

    The JobRunner (pod that does spark-submit), Spark Driver, and Spark Executor logs would be found in the following S3 locations.

    JobRunner/Spark-Submit/Controller Logs - s3://my_s3_log_location/${virtual-cluster-id}/jobs/${job-id}/containers/${job-runner-pod-id}/(stderr.gz/stdout.gz)\n\nDriver Logs - s3://my_s3_log_location/${virtual-cluster-id}/jobs/${job-id}/containers/${spark-application-id}/${spark-job-id-driver-pod-name}/(stderr.gz/stdout.gz)\n\nExecutor Logs - s3://my_s3_log_location/${virtual-cluster-id}/jobs/${job-id}/containers/${spark-application-id}/${spark-job-id-driver-executor-id}/(stderr.gz/stdout.gz)\n
    "},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#send-spark-logs-to-cloudwatch","title":"Send Spark Logs to CloudWatch","text":""},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#update-the-iam-role-with-cloudwatch-access","title":"Update the IAM role with CloudWatch access","text":"

    Configure the IAM Role passed in StartJobRun input executionRoleArn with access to CloudWatch Streams.

    {\n  \"Version\": \"2012-10-17\",\n  \"Statement\": [\n    {\n      \"Effect\": \"Allow\",\n      \"Action\": [\n        \"logs:CreateLogStream\",\n        \"logs:DescribeLogGroups\",\n        \"logs:DescribeLogStreams\"\n      ],\n      \"Resource\": [\n        \"arn:aws:logs:*:*:*\"\n      ]\n    },\n    {\n      \"Effect\": \"Allow\",\n      \"Action\": [\n        \"logs:PutLogEvents\"\n      ],\n      \"Resource\": [\n        \"arn:aws:logs:*:*:log-group:my_log_group_name:log-stream:my_log_stream_prefix/*\"\n      ]\n    }\n  ]\n}\n
    "},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#configure-startjobrun-api-with-cloudwatch","title":"Configure StartJobRun API with CloudWatch","text":"

    Configure the monitoringConfiguration with cloudWatchMonitoringConfiguration, and configure the CloudWatch logGroupName and logStreamNamePrefix where the logs should be pushed.

    {\n  \"name\": \"<job_name>\", \n  \"virtualClusterId\": \"<vc_id>\",  \n  \"executionRoleArn\": \"<iam_role_name_for_job_execution>\", \n  \"releaseLabel\": \"<emr_release_label>\", \n  \"jobDriver\": {\n\n  }, \n  \"configurationOverrides\": {\n    \"monitoringConfiguration\": {\n      \"persistentAppUI\": \"ENABLED\",\n      \"cloudWatchMonitoringConfiguration\": {\n        \"logGroupName\": \"my_log_group_name\",\n        \"logStreamNamePrefix\": \"my_log_stream_prefix\"\n      }\n    }\n  }\n}\n
    "},{"location":"troubleshooting/docs/where-to-look-for-spark-logs/#log-location-of-jobrunner-driver-executor","title":"Log location of JobRunner, Driver, Executor","text":"

    The JobRunner (pod that does spark-submit), Spark Driver, and Spark Executor logs would be found in the following AWS CloudWatch locations.

    JobRunner/Spark-Submit/Controller Logs - ${my_log_group_name}/${my_log_stream_prefix}/${virtual-cluster-id}/jobs/${job-id}/containers/${job-runner-pod-id}/(stderr.gz/stdout.gz)\n\nDriver Logs - ${my_log_group_name}/${my_log_stream_prefix}/${virtual-cluster-id}/jobs/${job-id}/containers/${spark-application-id}/${spark-job-id-driver-pod-name}/(stderr.gz/stdout.gz)\n\nExecutor Logs - ${my_log_group_name}/${my_log_stream_prefix}/${virtual-cluster-id}/jobs/${job-id}/containers/${spark-application-id}/${spark-job-id-driver-executor-id}/(stderr.gz/stdout.gz)\n
    "}]} \ No newline at end of file diff --git a/security/docs/index.html b/security/docs/index.html index 55f60e0..ad567af 100644 --- a/security/docs/index.html +++ b/security/docs/index.html @@ -975,6 +975,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/security/docs/spark/data-encryption/index.html b/security/docs/spark/data-encryption/index.html index fd361e0..e7d45c5 100644 --- a/security/docs/spark/data-encryption/index.html +++ b/security/docs/spark/data-encryption/index.html @@ -1055,6 +1055,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/security/docs/spark/encryption/index.html b/security/docs/spark/encryption/index.html index b8d38a6..64ba6de 100644 --- a/security/docs/spark/encryption/index.html +++ b/security/docs/spark/encryption/index.html @@ -1142,6 +1142,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/security/docs/spark/network-security/index.html b/security/docs/spark/network-security/index.html index 2c75fb4..5257913 100644 --- a/security/docs/spark/network-security/index.html +++ b/security/docs/spark/network-security/index.html @@ -1103,6 +1103,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/security/docs/spark/secrets/index.html b/security/docs/spark/secrets/index.html index c160c07..7b08771 100644 --- a/security/docs/spark/secrets/index.html +++ b/security/docs/spark/secrets/index.html @@ -1100,6 +1100,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/sitemap.xml.gz b/sitemap.xml.gz index ed450728b609f3ec6bed0ff7f8903e028aeb76f3..2b406b54ecdc075e9a1249dfeed8db46ace61477 100644 GIT binary patch delta 13 Ucmb=gXP58h;7EAdKasrx03Lt@qyPW_ delta 13 Ucmb=gXP58h;CS#pZz6jI03j{~4FCWD diff --git a/storage/docs/index.html b/storage/docs/index.html index d0f4506..9f19d6c 100644 --- a/storage/docs/index.html +++ b/storage/docs/index.html @@ -975,6 +975,27 @@ +
  • + + + + + Connect to Spark UI via Reverse Proxy + + + + +
  • + + + + + + + + + +
  • diff --git a/storage/docs/spark/ebs/index.html b/storage/docs/spark/ebs/index.html index 4bc318c..f6d71b6 100644 --- a/storage/docs/spark/ebs/index.html +++ b/storage/docs/spark/ebs/index.html @@ -779,6 +779,33 @@