Amazon DEA-C01: What is the most cost-effective way to optimize an Amazon EMR cluster for a CPU-intensive Spark ETL job?

Learn how to reduce costs for a CPU-intensive, memory-light Apache Spark ETL job running on Amazon EMR by selecting the optimal EC2 instance type for the EMR task nodes.

Question

A company currently uses a provisioned Amazon EMR cluster that includes general purpose Amazon EC2 instances. The EMR cluster uses EMR managed scaling between one and five task nodes for the company’s long-running Apache Spark extract, transform, and load (ETL) job. The company runs the ETL job every day.

When the company runs the ETL job, the EMR cluster quickly scales up to five nodes. The EMR cluster often reaches maximum CPU usage, but the memory usage remains under 30%.

The company wants to modify the EMR cluster configuration to reduce the EMR costs to run the daily ETL job.

Which solution will meet these requirements MOST cost-effectively?

A. Increase the maximum number of task nodes for EMR managed scaling to 10.
B. Change the task node type from general purpose EC2 instances to memory optimized EC2 instances.
C. Switch the task node type from general purpose EC2 instances to compute optimized EC2 instances.
D. Reduce the scaling cooldown period for the provisioned EMR cluster.

Answer

The most cost-effective solution to reduce EMR costs for the company’s daily CPU-intensive but memory-light Spark ETL job is to choose option C:

Switch the task node type from general purpose EC2 instances to compute optimized EC2 instances.

Explanation

The key factors given are:

  • The Spark ETL job is CPU-intensive, often reaching maximum CPU usage on the task nodes
  • Memory usage remains low, under 30%
  • The goal is to reduce EMR costs to run this daily job

Compute optimized EC2 instances (the C family) offer a high ratio of vCPUs to memory and the best price/performance for compute-bound workloads. They are designed for CPU-bound use cases such as high performance computing, batch processing, machine learning inference, ad serving, highly scalable multiplayer gaming, and video encoding.

Since the workload is CPU-intensive but memory-light, switching to compute optimized instances will provide the CPU power needed at a lower cost than general purpose instances. The low memory usage means memory optimized instances are unnecessary and would be more expensive.
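As a sketch of what this change looks like in practice, the following builds the task instance fleet configuration that would be passed to boto3's `emr.run_job_flow` (under `Instances={"InstanceFleets": [...]}`). The specific instance types and capacities are illustrative assumptions, not values from the question:

```python
# Sketch: an EMR task instance fleet requesting compute optimized
# instance types instead of general purpose ones.
# Instance types and target capacity are illustrative assumptions.

def task_fleet_config(instance_types, target_capacity):
    """Build an InstanceFleet config dict for the TASK fleet, suitable
    for boto3: emr.run_job_flow(Instances={"InstanceFleets": [this]})."""
    return {
        "Name": "TaskFleet",
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": target_capacity,
        "InstanceTypeConfigs": [
            {"InstanceType": t, "WeightedCapacity": 1}
            for t in instance_types
        ],
    }

# Before: a general purpose type such as m5.xlarge.
# After: compute optimized candidates, e.g. c5.xlarge / c5a.xlarge.
fleet = task_fleet_config(["c5.xlarge", "c5a.xlarge"], target_capacity=5)
```

Listing more than one compute optimized type in the fleet gives EMR flexibility to provision whichever type has capacity, while keeping the same managed-scaling limits.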

Increasing the maximum number of task nodes (option A) adds more of an instance type that is poorly matched to the workload, which increases cost rather than reducing it. Reducing the scaling cooldown period (option D) only changes how quickly the cluster reacts to load; it does not address the root issue of paying for unused memory on every node.
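A rough back-of-the-envelope comparison shows why right-sizing the instance family beats simply adding nodes. The hourly prices below are illustrative placeholders, not actual AWS rates:

```python
# Illustrative per-hour prices -- assumptions for the sketch,
# not real AWS on-demand rates.
GENERAL_PURPOSE = {"vcpus": 4, "price_per_hour": 0.20}   # e.g. an m-family size
COMPUTE_OPTIMIZED = {"vcpus": 4, "price_per_hour": 0.17}  # e.g. a c-family size

def cost_per_vcpu_hour(instance):
    """Cost of one vCPU for one hour on the given instance type."""
    return instance["price_per_hour"] / instance["vcpus"]

# A CPU-bound job consumes a roughly fixed number of vCPU-hours, so the
# family with the lower cost per vCPU-hour completes the same job cheaper.
# Adding more general purpose nodes multiplies the higher per-vCPU cost;
# shortening the cooldown changes neither number.
savings = cost_per_vcpu_hour(GENERAL_PURPOSE) - cost_per_vcpu_hour(COMPUTE_OPTIMIZED)
```

Under these assumed prices, every vCPU-hour of the job costs less on the compute optimized family, and the unused memory on the general purpose nodes is pure overhead.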

Therefore, switching the task nodes from general purpose to compute optimized EC2 instances is the most cost-effective way to optimize this EMR cluster for the given Spark ETL workload: it delivers the CPU performance the job needs without paying for memory the job does not use.
