
Amazon AWS Certified Machine Learning – Specialty: What’s the Best Way to Ingest and Transform Real-Time Device Data into S3 for Machine Learning?

Learn the most efficient approach to ingest real-time data from remote devices into S3, transform it into clean CSV for machine learning, and handle transformation failures – as covered in the AWS Certified Machine Learning Specialty Exam.


Question

A company is building a predictive maintenance system using real-time data from devices on remote sites. There is no AWS Direct Connect connection or VPN connection between the sites and the company’s VPC. The data needs to be ingested in real time from the devices into Amazon S3.

Transformation is needed to convert the raw data into clean .csv data to be fed into the machine learning (ML) model. The transformation needs to happen during the ingestion process. When transformation fails, the records need to be stored in a specific location in Amazon S3 for human review. The raw data before transformation also needs to be stored in Amazon S3.

How should an ML specialist architect the solution to meet these requirements with the LEAST effort?

A. Use Amazon Data Firehose with Amazon S3 as the destination. Configure Firehose to invoke an AWS Lambda function for data transformation. Enable source record backup on Firehose.
B. Use Amazon Managed Streaming for Apache Kafka. Set up workers in Amazon Elastic Container Service (Amazon ECS) to move data from Kafka brokers to Amazon S3 while transforming it. Configure workers to store raw and unsuccessfully transformed data in different S3 buckets.
C. Use Amazon Data Firehose with Amazon S3 as the destination. Configure Firehose to invoke an Apache Spark job in AWS Glue for data transformation. Enable source record backup and configure the error prefix.
D. Use Amazon Kinesis Data Streams in front of Amazon Data Firehose. Use Kinesis Data Streams with AWS Lambda to store raw data in Amazon S3. Configure Firehose to invoke a Lambda function for data transformation with Amazon S3 as the destination.

Answer

A. Use Amazon Data Firehose with Amazon S3 as the destination. Configure Firehose to invoke an AWS Lambda function for data transformation. Enable source record backup on Firehose.

Explanation

Amazon Data Firehose (formerly Amazon Kinesis Data Firehose) is the easiest way to ingest real-time streaming data and load it into data lakes, data stores, and analytics services. It can capture, transform, and load streaming data into destinations such as Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and Splunk.

By using Firehose with S3 as the destination, you can stream data directly from the remote devices into S3 without needing a VPN or Direct Connect. Firehose provides a simple, fully managed service for this real-time data ingestion.
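Because Firehose exposes a public API endpoint, a remote device only needs AWS credentials and the SDK to stream records. A minimal sketch of the device-side send, using the real `put_record` API of the Firehose client (the stream name and reading fields are hypothetical):

```python
import json

def send_reading(client, stream_name, reading):
    """Send one device reading to a Firehose delivery stream.

    Firehose concatenates records in the destination object, so a
    trailing newline keeps individual readings separable downstream.
    """
    payload = json.dumps(reading) + "\n"
    return client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload.encode("utf-8")},
    )

# Usage on the device (stream name and fields are examples):
# import boto3
# client = boto3.client("firehose", region_name="us-east-1")
# send_reading(client, "device-telemetry",
#              {"device_id": "d1", "temperature": 21.5})
```

For higher throughput, batching readings with `put_record_batch` (up to 500 records per call) reduces API overhead.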

To handle the required data transformations, you can configure Firehose to invoke a Lambda function. The Lambda function will receive batches of records, perform the necessary transformations to convert the raw data into clean CSV format, and return the transformed records to Firehose. Firehose will then deliver the transformed data to the configured S3 destination.
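The Firehose transformation contract is fixed: the Lambda function receives a batch of base64-encoded records and must return each one with its original `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and the (re-encoded) data. A sketch of a handler for this use case, assuming hypothetical device fields (`device_id`, `timestamp`, `temperature`):

```python
import base64
import csv
import io
import json

def lambda_handler(event, context):
    """Firehose data-transformation handler: raw JSON -> CSV line.

    Records marked ProcessingFailed are written by Firehose to the
    delivery stream's error output prefix in S3 for human review.
    """
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            buf = io.StringIO()
            # Hypothetical field names; adapt to the real device schema.
            csv.writer(buf).writerow(
                [payload["device_id"], payload["timestamp"], payload["temperature"]]
            )
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(buf.getvalue().encode("utf-8")).decode("ascii"),
            })
        except (KeyError, ValueError):
            # Malformed record: return it unchanged, flagged as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Returning `ProcessingFailed` rather than raising an exception is the design point here: it lets Firehose route only the bad records to the error prefix while the rest of the batch is delivered normally.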

By enabling source record backup on Firehose, the raw data will automatically be backed up to S3 before transformation. This covers the requirement to store the pre-transformation raw data in S3.

Finally, when transformation via the Lambda function fails for any records, Firehose stores those failed records in S3 under a configurable error output prefix. This keeps the unsuccessfully transformed records in a separate, known S3 location for later human review, as required.
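All three requirements map onto fields of the real `ExtendedS3DestinationConfiguration` structure accepted by `create_delivery_stream`: the Lambda processor, `ErrorOutputPrefix`, and `S3BackupMode` with its backup configuration. A sketch, with all ARNs and prefixes as placeholders:

```python
def delivery_stream_config(role_arn, bucket_arn, lambda_arn):
    """Build an ExtendedS3DestinationConfiguration covering all three
    requirements. ARNs and prefixes are placeholder values."""
    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        "Prefix": "transformed/",        # clean CSV for the ML model
        "ErrorOutputPrefix": "errors/",  # failed records, for human review
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "Lambda",
                "Parameters": [{"ParameterName": "LambdaArn",
                                "ParameterValue": lambda_arn}],
            }],
        },
        "S3BackupMode": "Enabled",       # raw records, before transformation
        "S3BackupConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": bucket_arn,
            "Prefix": "raw/",
        },
    }

# Usage (names and ARNs are hypothetical):
# import boto3
# boto3.client("firehose").create_delivery_stream(
#     DeliveryStreamName="device-telemetry",
#     ExtendedS3DestinationConfiguration=delivery_stream_config(
#         "arn:aws:iam::123456789012:role/firehose-delivery-role",
#         "arn:aws:s3:::ml-device-data",
#         "arn:aws:lambda:us-east-1:123456789012:function:to-csv"),
# )
```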

The other options are not as suitable because:

B) Using Amazon MSK with ECS workers is much more complex to set up and manage compared to Firehose with Lambda. It doesn’t meet the requirement for “least effort”.

C) Firehose’s built-in data transformation can only invoke a Lambda function; it cannot trigger an Apache Spark job in AWS Glue during delivery. Even with a custom integration, Glue would be more costly and complex for this use case than the natively supported Lambda option.

D) Using Kinesis Data Streams in addition to Firehose is unnecessary, as Firehose can ingest the streaming data directly. It adds complexity without any benefit for these requirements.

Therefore, using Amazon Data Firehose with S3 as the destination, a Lambda function for transformation, and source record backup enabled is the best solution. It meets all the requirements with the least implementation effort.

This practice question and detailed explanation are part of a free series of Amazon AWS Certified Machine Learning – Specialty exam assessment questions and answers, helpful for preparing for and passing the certification exam.