Amazon CLF-C02: What is the Most Efficient Way to Ingest Oracle Database Tables into Amazon S3 Data Lake for Analytics?

Discover the most efficient solution for ingesting Oracle database tables into an Amazon S3 data lake for analytics purposes. Learn how to minimize effort while ensuring data is in Apache Parquet format for optimal query performance using Amazon Athena.

Question

A company is building a data lake for a new analytics team. The company is using Amazon S3 for storage and Amazon Athena for query analysis. All data that is in Amazon S3 is in Apache Parquet format.

The company is running a new Oracle database as a source system in the company’s data center. The company has 70 tables in the Oracle database. All the tables have primary keys. Data can occasionally change in the source system. The company wants to ingest the tables every day into the data lake.

Which solution will meet these requirements with the LEAST effort?

A. Create an Apache Sqoop job in Amazon EMR to read the data from the Oracle database. Configure the Sqoop job to write the data to Amazon S3 in Parquet format.
B. Create an AWS Glue connection to the Oracle database. Create an AWS Glue bookmark job to ingest the data incrementally and to write the data to Amazon S3 in Parquet format.
C. Create an AWS Database Migration Service (AWS DMS) task for ongoing replication. Set the Oracle database as the source. Set Amazon S3 as the target. Configure the task to write the data in Parquet format.
D. Create an Oracle database in Amazon RDS. Use AWS Database Migration Service (AWS DMS) to migrate the on-premises Oracle database to Amazon RDS. Configure triggers on the tables to invoke AWS Lambda functions to write changed records to Amazon S3 in Parquet format.

Answer

B. Create an AWS Glue connection to the Oracle database. Create an AWS Glue bookmark job to ingest the data incrementally and to write the data to Amazon S3 in Parquet format.

Explanation

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. It provides a seamless way to connect to various data sources, including Oracle databases, and allows you to create jobs to extract, transform, and load data into your desired destination, such as Amazon S3.
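
To make the connection step concrete, the following is a minimal sketch of how such a Glue connection might be created with boto3 (the AWS SDK for Python). The connection name, JDBC URL, credentials, and network identifiers are placeholder assumptions rather than values from the scenario; in a real deployment you would typically reference an AWS Secrets Manager secret instead of embedding a password.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# All identifiers below are hypothetical placeholders. In practice,
# store credentials in AWS Secrets Manager rather than in plain text.
glue.create_connection(
    ConnectionInput={
        "Name": "oracle-source-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:oracle:thin:@//onprem-db.example.com:1521/ORCL",
            "USERNAME": "glue_user",
            "PASSWORD": "example-password",
        },
        # Network settings that let Glue reach the on-premises database
        # (for example, over a VPN or AWS Direct Connect link).
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```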

In this scenario, using AWS Glue is the most efficient solution with the least effort. Here’s why:

  1. AWS Glue Connection: By creating an AWS Glue connection to the Oracle database, you establish a secure, managed JDBC connection between the source system and AWS Glue, with no additional infrastructure to provision or maintain.
  2. Incremental Data Ingestion: AWS Glue job bookmarks enable incremental ingestion, so each daily run processes only the data added since the previous run. This reduces the volume of data transferred and the processing time. The bookmark state is tracked automatically between runs, making incremental updates efficient and low-maintenance (see the job sketch after this list).
  3. Parquet Format: AWS Glue can write data to Amazon S3 in Apache Parquet format out of the box. Parquet is a columnar storage format that provides optimized query performance and efficient compression, so writing it directly eliminates any additional transformation step.
  4. Integration with Amazon Athena: Amazon Athena is a serverless query service that lets you analyze data stored in Amazon S3 using standard SQL. With the data already in Parquet format, Athena can query and analyze it efficiently without further processing or transformation.
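
To illustrate points 2 and 3 above, here is a minimal sketch of what a per-table Glue ETL script might look like in PySpark. The catalog database, table, and S3 bucket names are placeholder assumptions; enabling bookmarks also requires setting the job's --job-bookmark-option parameter to job-bookmark-enable when the job is created.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job setup; job.init() and job.commit() are what
# persist the bookmark state between daily runs.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read one Oracle table through the Glue connection. The catalog
# database and table names are hypothetical; they would come from a
# crawler run against the JDBC connection. transformation_ctx is
# required for the bookmark to track what has been processed.
source = glue_context.create_dynamic_frame.from_catalog(
    database="oracle_source_db",
    table_name="orders",
    transformation_ctx="source_orders",
)

# Write the data to Amazon S3 as Parquet; the bucket is a placeholder.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://example-data-lake/orders/"},
    format="parquet",
    transformation_ctx="sink_orders",
)

job.commit()
```

Note that for JDBC sources such as Oracle, job bookmarks track progress using the table's primary key (or explicitly configured bookmark keys), which is why it matters that all 70 tables in this scenario have primary keys.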

The other options have limitations or require more effort:

  • Option A involves using Apache Sqoop and Amazon EMR, which requires additional setup and management of EMR clusters.
  • Option C uses AWS DMS for ongoing replication, which requires provisioning and managing a replication instance and adds operational overhead compared with a scheduled AWS Glue job.
  • Option D introduces unnecessary complexity by migrating the Oracle database to Amazon RDS and using triggers and Lambda functions for data ingestion.

In summary, using an AWS Glue bookmark job to ingest data incrementally from the Oracle database and write it to Amazon S3 in Parquet format is the most efficient, lowest-effort way to build the data lake for the analytics team.
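
To round out the picture, here is a sketch of how the analytics team might run a query against the ingested Parquet data with boto3 and Athena. The database name, table name, and results bucket are placeholder assumptions, not values from the scenario.

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical database and table created by a Glue crawler over the
# Parquet files; the output location is where Athena stores results.
response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) AS order_count FROM orders",
    QueryExecutionContext={"Database": "data_lake_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```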

This practice question and answer, with a detailed explanation and references, is part of a free Amazon AWS Certified Cloud Practitioner CLF-C02 assessment series, helpful for passing the CLF-C02 exam and earning the AWS Certified Cloud Practitioner certification.