Learn how to prepare image data before uploading it to Amazon S3 to train an image classification model with Amazon SageMaker in pipe mode.
Question
A social media company wants to develop a machine learning (ML) model to detect inappropriate or offensive content in images. The company has collected a large dataset of labeled images and plans to use the built-in Amazon SageMaker image classification algorithm to train the model. The company also intends to use SageMaker pipe mode to speed up the training.
The company splits the dataset into training, validation, and testing datasets. The company stores the training and validation images in folders that are named Training and Validation, respectively. The folders contain subfolders that correspond to the names of the dataset classes. The company resizes the images to the same size and generates two input manifest files named training.lst and validation.lst, for the training dataset and the validation dataset, respectively. Finally, the company creates two separate Amazon S3 buckets for uploads of the training dataset and the validation dataset.
Which additional data preparation steps should the company take before uploading the files to Amazon S3?
A. Generate two Apache Parquet files, training.parquet and validation.parquet, by reading the images into a Pandas data frame and storing the data frame as a Parquet file. Upload the Parquet files to the training S3 bucket.
B. Compress the training and validation directories by using the Snappy compression library. Upload the manifest and compressed files to the training S3 bucket.
C. Compress the training and validation directories by using the gzip compression library. Upload the manifest and compressed files to the training S3 bucket.
D. Generate two RecordIO files, training.rec and validation.rec, from the manifest files by using the im2rec Apache MXNet utility tool. Upload the RecordIO files to the training S3 bucket.
Answer
D. Generate two RecordIO files, training.rec and validation.rec, from the manifest files by using the im2rec Apache MXNet utility tool. Upload the RecordIO files to the training S3 bucket.
Explanation
To train with the built-in Amazon SageMaker image classification algorithm in pipe mode, the company should pack the images and their labels into RecordIO files. The im2rec utility from Apache MXNet reads each list file (training.lst and validation.lst), in which every line holds an image index, the class label, and the image path, and writes the referenced images together with their labels into a single .rec file. Because that file can be read sequentially from start to end, it is well suited to being streamed during training.
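As a rough sketch of the conversion step, the snippet below only assembles the im2rec command lines for the two list files; it does not run them. The helper name `build_im2rec_command` and the script location `tools/im2rec.py` are assumptions for illustration (the script ships with the MXNet source tree, so adjust the path to your copy). The folder and list-file names are the ones given in the question, and the image paths inside each .lst file are assumed to be relative to the folder passed as the image root.

```python
import shlex


def build_im2rec_command(prefix, image_root, im2rec_path="tools/im2rec.py"):
    """Return the argument list that packs <prefix>.lst into <prefix>.rec.

    im2rec_path is an assumed location; point it at your copy of the script.
    """
    # im2rec takes the list-file prefix first, then the root folder of the images.
    return ["python", im2rec_path, prefix, image_root]


commands = [
    build_im2rec_command("training", "Training/"),
    build_im2rec_command("validation", "Validation/"),
]

for cmd in commands:
    # Only print the command here; to execute it, pass the list to
    # subprocess.run(cmd, check=True) on a machine with MXNet installed.
    print(shlex.join(cmd))
```

Run from the directory that holds training.lst and validation.lst, these commands should leave training.rec and validation.rec next to the list files, ready for upload.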
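After generating the RecordIO files (training.rec and validation.rec), the company should upload them to the training S3 bucket. With list files and raw images, the built-in image classification algorithm supports only file mode. In pipe mode it reads RecordIO input (content type application/x-recordio), which SageMaker streams from Amazon S3 to the training container instead of downloading the full dataset first. The other way to use pipe mode, an augmented manifest file that references the images, is not among the options.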
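To make the pipe mode setup more concrete, here is a minimal sketch of the input channels a training job could declare for the two RecordIO files. It only builds a plain Python structure shaped like the `InputDataConfig` part of a SageMaker CreateTrainingJob request; it does not call AWS. The helper name `build_pipe_mode_channels` and the bucket name `example-training-bucket` are hypothetical.

```python
import json


def build_pipe_mode_channels(train_uri, validation_uri):
    """Build InputDataConfig-style entries for two RecordIO channels in pipe mode."""

    def channel(name, uri):
        return {
            "ChannelName": name,
            # RecordIO is the content type the algorithm reads in pipe mode.
            "ContentType": "application/x-recordio",
            # Stream the data from S3 instead of downloading it first.
            "InputMode": "Pipe",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": uri,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }

    # The image classification algorithm uses "train" and "validation" channels.
    return [channel("train", train_uri), channel("validation", validation_uri)]


# Hypothetical bucket name, used only for illustration.
channels = build_pipe_mode_channels(
    "s3://example-training-bucket/training.rec",
    "s3://example-training-bucket/validation.rec",
)
print(json.dumps(channels, indent=2))
```

A structure like this would be passed, along with the algorithm image, role, output location, and resource settings, when the training job is created.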
The other options do not produce an input format that the algorithm accepts. Parquet (option A) is not a supported content type for the built-in image classification algorithm, and compressing the image directories with Snappy (option B) or gzip (option C) only creates archives of the raw images, which the algorithm cannot read in pipe mode.
This practice question, answer, and explanation are part of a free question set for the AWS Certified Machine Learning – Specialty (MLS-C01) certification exam.