Learn the best solution to update Amazon Redshift tables without introducing duplicate records when rerunning an AWS Glue job that writes processed data from Glue tables.
Question
A company uploads .csv files to an Amazon S3 bucket. The company’s data platform team has set up an AWS Glue crawler to perform data discovery and to create the tables and schemas.
An AWS Glue job writes processed data from the tables to an Amazon Redshift database. The AWS Glue job handles column mapping and creates the Amazon Redshift tables in the Redshift database appropriately.
If the company reruns the AWS Glue job for any reason, duplicate records are introduced into the Amazon Redshift tables. The company needs a solution that will update the Redshift tables without duplicates.
Which solution will meet these requirements?
A. Modify the AWS Glue job to copy the rows into a staging Redshift table. Add SQL commands to update the existing rows with new values from the staging Redshift table.
B. Modify the AWS Glue job to load the previously inserted data into a MySQL database. Perform an upsert operation in the MySQL database. Copy the results to the Amazon Redshift tables.
C. Use Apache Spark’s DataFrame dropDuplicates() API to eliminate duplicates. Write the data to the Redshift tables.
D. Use the AWS Glue ResolveChoice built-in transform to select the value of the column from the most recent record.
Answer
A. Modify the AWS Glue job to copy the rows into a staging Redshift table. Add SQL commands to update the existing rows with new values from the staging Redshift table.
Explanation
When an AWS Glue job writes data to Amazon Redshift tables, simply rerunning the job can introduce duplicate records if the job is not designed to handle updates properly.
The best approach is to modify the Glue job to first load the incoming rows into a staging table in Redshift. SQL commands run against the staging table then update the existing rows in the target Redshift table with the new values (and insert any rows that do not yet exist), typically inside a single transaction. Because the target table is updated in place rather than appended to, rerunning the job does not create duplicate records, as the sketch below illustrates.
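The following is a minimal PySpark sketch of this staging-plus-merge pattern in a Glue job, using the Redshift connection's preactions/postactions options to run the SQL around the staging load. The database, table, column, and connection names (sales_db, orders, orders_staging, order_id, redshift-connection, and so on) are placeholders rather than names from the scenario, and the merge SQL would need to match the real schema.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the crawled table from the Glue Data Catalog
# ("sales_db" and "orders" are placeholder names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Merge SQL that Redshift runs after the staging load: update matching
# rows in place, insert only the rows that are genuinely new, then
# drop the staging table. All table and column names are placeholders.
post_actions = """
    BEGIN;
    UPDATE public.orders
    SET amount = s.amount, status = s.status
    FROM public.orders_staging s
    WHERE public.orders.order_id = s.order_id;
    INSERT INTO public.orders
    SELECT s.* FROM public.orders_staging s
    LEFT JOIN public.orders t ON s.order_id = t.order_id
    WHERE t.order_id IS NULL;
    DROP TABLE public.orders_staging;
    END;
"""

# Write the processed rows into the staging table. Glue runs
# "preactions" before the COPY and "postactions" after it, so the
# merge happens within the same job run.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="redshift-connection",  # placeholder connection name
    connection_options={
        "database": "dev",
        "dbtable": "public.orders_staging",
        "preactions": (
            "DROP TABLE IF EXISTS public.orders_staging; "
            "CREATE TABLE public.orders_staging (LIKE public.orders);"
        ),
        "postactions": post_actions,
    },
    redshift_tmp_dir=args["TempDir"],
)

job.commit()
```

An equivalent variant deletes the matching rows from the target and inserts everything from staging, or, on Redshift versions that support it, uses a single MERGE statement; either way, the target table stays free of duplicates when the job is rerun.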
The other options are not ideal:
B. Loading data into a separate MySQL database to perform an upsert, then copying to Redshift adds unnecessary complexity and extra steps. It’s more efficient to handle the updates directly in Redshift.
C. Spark’s dropDuplicates() removes duplicates only within the incoming DataFrame; it knows nothing about rows already loaded into Redshift, so a rerun still appends the same rows again, and existing records are never updated with new values.
D. The ResolveChoice transform resolves ambiguous column types in a DynamicFrame (for example, a column that is sometimes a string and sometimes an int); it does not deduplicate records or update existing rows in place in the Redshift table.
Therefore, option A, staging the data and using SQL updates, is the most effective way to update the Redshift tables via AWS Glue without creating duplicate records.