Amazon DEA-C01: How to Identify Matching Records in Amazon S3 Data Lake Without Common Unique Identifier?

Learn how to identify matching records in an Amazon S3 data lake when the records lack a common unique identifier, using the AWS Lake Formation FindMatches transform in an AWS Glue ETL job.

Question

A company ingests data from multiple data sources and stores the data in an Amazon S3 bucket. An AWS Glue extract, transform, and load (ETL) job transforms the data and writes the transformed data to an Amazon S3 based data lake. The company uses Amazon Athena to query the data that is in the data lake.

The company needs to identify matching records even when the records do not have a common unique identifier.

Which solution will meet this requirement?

A. Use Amazon Macie pattern matching as part of the ETL job.
B. Train and use the AWS Glue PySpark Filter class in the ETL job.
C. Partition tables and use the ETL job to partition the data on a unique identifier.
D. Train and use the AWS Lake Formation FindMatches transform in the ETL job.

Answer

D. Train and use the AWS Lake Formation FindMatches transform in the ETL job.

Explanation

The AWS Lake Formation FindMatches transform is built for this scenario: identifying matching records in the Amazon S3 data lake when the records do not share a common unique identifier.

The FindMatches transform uses machine learning (ML) to identify matching records based on similarities between their field values. You train it by labeling a sample of records as matches or non-matches, and it learns from those examples what constitutes a match.
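To make the idea of matching on field similarity concrete, here is a minimal sketch using Python's standard-library `difflib`. It is not how FindMatches works internally (the transform learns from labeled examples rather than a fixed rule). The sample records, the field names, the helper functions, and the 0.8 threshold are all hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records from different sources; there is no shared ID column.
records = [
    {"name": "Jon Smith", "email": "jon.smith@example.com", "city": "Seattle"},
    {"name": "John Smith", "email": "jon.smith@example.com", "city": "Seattle"},
    {"name": "Maria Garcia", "email": "mgarcia@example.org", "city": "Austin"},
]

FIELDS = ("name", "email", "city")


def field_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict) -> float:
    """Average the per-field similarities (a hand-picked rule, not a learned one)."""
    return sum(field_similarity(r1[f], r2[f]) for f in FIELDS) / len(FIELDS)


def find_matching_pairs(rows: list, threshold: float = 0.8) -> list:
    """Return index pairs whose average field similarity meets the threshold."""
    return [
        (i, j)
        for (i, r1), (j, r2) in combinations(enumerate(rows), 2)
        if record_similarity(r1, r2) >= threshold
    ]


matches = find_matching_pairs(records)
print(matches)  # [(0, 1)]
```

The two "Smith" records are paired even though the names are spelled differently. A fixed threshold like this is brittle on real data, which is the gap a trained transform is meant to fill.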

Once trained, the FindMatches transform can be incorporated into the AWS Glue ETL job. As part of the ETL process, it will identify and label matching records even if they don’t share an exact common identifier. This allows the company to find duplicate or related records in their data lake.
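As a rough sketch of what that step could look like in a Glue PySpark script, the function below reads a catalog table, applies an already trained transform with `FindMatches.apply`, and writes the result back to S3. The database, table, transform ID, and output path are placeholder assumptions, not values from the question, and the imports are deferred into the function because the `awsglue` and `awsglueml` modules are only available inside the AWS Glue runtime.

```python
# Hypothetical placeholders: replace with your own catalog names, transform ID, and path.
DATABASE = "example_db"
TABLE = "example_customers"
TRANSFORM_ID = "tfm-0123456789abcdef"
OUTPUT_PATH = "s3://example-bucket/matched/"


def run_find_matches_job(argv):
    """Apply a trained FindMatches transform inside an AWS Glue ETL job."""
    # These modules exist only in the AWS Glue job environment.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from awsglueml.transforms import FindMatches
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the source table from the AWS Glue Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=DATABASE, table_name=TABLE, transformation_ctx="source"
    )

    # Apply the trained ML transform; the output gains a match_id column,
    # and records judged to match share the same match_id value.
    matched = FindMatches.apply(
        frame=source, transformId=TRANSFORM_ID, transformation_ctx="findmatches"
    )

    # Write the labeled records back to the data lake as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=matched,
        connection_type="s3",
        connection_options={"path": OUTPUT_PATH},
        format="parquet",
        transformation_ctx="sink",
    )
    job.commit()
```

In a Glue job script this would be invoked as `run_find_matches_job(sys.argv)`. Once the output is cataloged, grouping on `match_id` in Athena would surface the sets of matching records.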

The other options are not suitable for the given use case:

A. Amazon Macie pattern matching is a data security feature for discovering sensitive data. It is not designed for record matching.

B. The AWS Glue PySpark Filter class keeps only the records that satisfy a condition you specify. It is not an ML transform, so it cannot be trained, and it has no built-in capability to identify matching records.

C. Partitioning tables on a unique identifier requires the records to already have a common ID to partition on, which is not the case here.

Therefore, using the AWS Lake Formation FindMatches ML transform in the ETL job is the correct solution to identify matching records in the Amazon S3 data lake.
