
Amazon AWS Certified Machine Learning – Specialty: What is the most cost-effective Amazon SageMaker solution for real-time ML inference with predictable traffic patterns?

To minimize recommendation latency for readers while optimizing costs, the best solution is serverless inference with provisioned concurrency when traffic is predictably high at certain times and low otherwise.


Question

A media company wants to deploy a machine learning (ML) model that uses Amazon SageMaker to recommend new articles to the company’s readers. The company’s readers are primarily located in a single city.

The company notices that the heaviest reader traffic predictably occurs early in the morning, after lunch, and again after work hours. There is very little traffic at other times of day. The media company needs to minimize the time required to deliver recommendations to its readers. The expected amount of data that the API call will return for inference is less than 4 MB.

Which solution will meet these requirements in the MOST cost-effective way?

A. Real-time inference with auto scaling
B. Serverless inference with provisioned concurrency
C. Asynchronous inference
D. A batch transform task

Answer

B. Serverless inference with provisioned concurrency

Explanation

For this scenario, using serverless inference with provisioned concurrency in Amazon SageMaker would be the most cost-effective way to meet the requirements:

  • The media company has very predictable traffic patterns, with heavy usage in the morning, after lunch, and after work, but little traffic at other times. Provisioned concurrency lets the company specify how much warm, pre-initialized capacity stays ready to respond to inference requests, enabling fast response times during those peak hours. Outside the peak windows, provisioned concurrency can be scaled down (for example, with Application Auto Scaling scheduled actions) so the company is not paying for warm capacity it does not need; see the sketches after this list and at the end of this explanation.
  • Minimizing the time to deliver recommendations is a key requirement. On-demand serverless endpoints can incur cold starts when they scale up from idle; provisioned concurrency keeps the specified capacity initialized and warm, so requests served by that capacity get consistently low-latency responses without cold-start delay.
  • The expected response payload is small at under 4 MB. SageMaker Serverless Inference supports request and response payloads of up to 4 MB, so it can handle payloads of this size.
  • From a cost perspective, on-demand serverless inference bills only for the compute used to process requests, with no charge while the endpoint sits idle. Provisioned concurrency adds a charge for the warm capacity while it is enabled, but scheduling it only around the known peak windows keeps total cost far below running dedicated instances around the clock. This is ideal for the media company’s case of predictably high traffic at certain hours but very low usage otherwise.

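As a concrete illustration, here is a minimal boto3 sketch that creates a serverless endpoint with provisioned concurrency. The model, config, and endpoint names are hypothetical placeholders, and the memory size and concurrency values are assumptions you would tune for your own model:

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical names -- substitute your own model and endpoint names.
MODEL_NAME = "article-recommender"  # an existing SageMaker model
CONFIG_NAME = "article-recommender-serverless"
ENDPOINT_NAME = "article-recommender-endpoint"

# Serverless endpoint config: SageMaker manages the underlying compute,
# and ProvisionedConcurrency keeps that many workers initialized and warm
# so requests during peak windows avoid cold starts.
sm.create_endpoint_config(
    EndpointConfigName=CONFIG_NAME,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": MODEL_NAME,
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,        # 1024-6144, in 1 GB steps
                "MaxConcurrency": 20,          # cap on concurrent invocations
                "ProvisionedConcurrency": 10,  # warm capacity; must not exceed MaxConcurrency
            },
        }
    ],
)

sm.create_endpoint(EndpointName=ENDPOINT_NAME, EndpointConfigName=CONFIG_NAME)
```

Once the endpoint is InService, clients invoke it exactly as they would a real-time endpoint (for example, via the `sagemaker-runtime` client’s `invoke_endpoint` call), and responses up to the 4 MB serverless payload limit are returned synchronously.
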
The other options are not as suitable:

  • Real-time inference with auto scaling (A) is less cost-effective: a real-time endpoint must keep at least one instance running at all times, so the company would pay around the clock even during the long stretches of near-zero traffic.
  • Asynchronous inference (C) is intended for requests that don’t need immediate responses and can tolerate some delays. It doesn’t fit this use case of providing fast recommendations to users.
  • Batch transform (D) is for offline, bulk inference on an entire dataset. It’s not designed for real-time requests and responses as the media company requires.

Therefore, serverless inference with provisioned concurrency is the optimal choice to deliver responsive, scalable performance to readers while keeping costs closely aligned with actual usage patterns.
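
To keep provisioned concurrency aligned with the three daily peaks, one option is Application Auto Scaling scheduled actions. The sketch below is illustrative only: the endpoint and variant names match the hypothetical example above, and the cron times (UTC) are assumptions you would adapt to your readers’ actual peak windows:

```python
import boto3

aas = boto3.client("application-autoscaling")

# Must match the endpoint and variant created earlier (hypothetical names).
RESOURCE_ID = "endpoint/article-recommender-endpoint/variant/AllTraffic"
DIMENSION = "sagemaker:variant:DesiredProvisionedConcurrency"

# Register the variant's provisioned concurrency as a scalable target.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension=DIMENSION,
    MinCapacity=1,
    MaxCapacity=10,
)

# Warm up capacity shortly before the morning peak (times are UTC)...
aas.put_scheduled_action(
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension=DIMENSION,
    ScheduledActionName="morning-peak-up",
    Schedule="cron(30 5 * * ? *)",
    ScalableTargetAction={"MinCapacity": 10, "MaxCapacity": 10},
)

# ...and scale it back down once the peak has passed.
aas.put_scheduled_action(
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension=DIMENSION,
    ScheduledActionName="morning-peak-down",
    Schedule="cron(0 9 * * ? *)",
    ScalableTargetAction={"MinCapacity": 1, "MaxCapacity": 1},
)
```

The same pair of up/down actions would be repeated for the after-lunch and after-work windows, so warm capacity (and its cost) tracks the predictable traffic pattern.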
