Amazon AWS Certified Machine Learning – Specialty: What Causes Prediction Power Score of 1 in Amazon SageMaker Data Wrangler?

Learn what it means when you see a feature prediction power score of 1 while using Amazon SageMaker Data Wrangler and how to identify target leakage in your dataset.

Question

A data scientist uses Amazon SageMaker Data Wrangler to obtain a feature summary from a dataset that the data scientist imported from Amazon S3. The data scientist notices that the prediction power for a dataset feature has a score of 1.

What is the cause of the score?

A. Target leakage occurred in the imported dataset.
B. The data scientist did not fine-tune the training and validation split.
C. The SageMaker Data Wrangler algorithm that the data scientist used did not find an optimal model fit for each feature to calculate the prediction power.
D. The data scientist did not process the features enough to accurately calculate prediction power.

Answer

A prediction power score of 1 for a feature in the dataset summary provided by Amazon SageMaker Data Wrangler indicates that target leakage occurred in the imported dataset (Option A).

A. Target leakage occurred in the imported dataset.

Explanation

Target leakage happens when a feature carries information about the target variable that would not be available when the model makes a real prediction, so the model learns from the very thing it is trying to predict. The model then appears highly accurate during training and validation but performs poorly on new, unseen data.

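Data Wrangler measures prediction power by fitting a model to each feature separately and scoring it on held-out validation data, with scores normalized to the range 0 to 1. A score of 1 means the feature, on its own, predicts the target variable perfectly. That is unrealistic in real-world scenarios and a telltale sign of target leakage. Some common causes of target leakage include: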

  • Including data that would not be available at the time of prediction
  • Accidentally including the target variable or a proxy for it in the features
  • Improper splitting of training and validation data allowing the validation set to influence model training
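The idea of scoring each feature on its own can be sketched in a few lines of Python. This is a simplified illustration, not Data Wrangler's actual algorithm: the function `single_feature_score`, the column names (`plan_tier`, `account_closed`, `churned`), and the synthetic data are all assumptions made for the example. It fits a trivial per-feature lookup model on an 80/20 split and rescales validation accuracy so that a majority-class baseline maps to 0 and perfect prediction maps to 1.

```python
import random
from collections import Counter, defaultdict


def single_feature_score(values, target, train_frac=0.8, seed=0):
    """Score how well one categorical feature alone predicts a binary target.

    Illustrative only: fits a majority-class lookup per feature value on a
    training split, then reports validation accuracy rescaled so that the
    majority-class baseline maps to 0 and perfect prediction maps to 1.
    """
    rows = list(zip(values, target))
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    train, valid = rows[:cut], rows[cut:]

    # "Model": the most common target seen for each feature value in training.
    counts = defaultdict(Counter)
    for x, y in train:
        counts[x][y] += 1
    fallback = Counter(y for _, y in train).most_common(1)[0][0]
    lookup = {x: c.most_common(1)[0][0] for x, c in counts.items()}

    hits = sum(lookup.get(x, fallback) == y for x, y in valid)
    accuracy = hits / len(valid)

    # Baseline: always predict the majority class of the training split.
    baseline = sum(y == fallback for _, y in valid) / len(valid)
    if baseline == 1.0:
        return 0.0
    return max(0.0, (accuracy - baseline) / (1.0 - baseline))


# Synthetic data (hypothetical columns, made up for this example).
rng = random.Random(42)
n = 1000
churned = [rng.randint(0, 1) for _ in range(n)]                      # target
plan_tier = [rng.choice(["free", "pro", "team"]) for _ in range(n)]  # unrelated to target
account_closed = list(churned)                                       # leaky proxy of target

scores = {
    "plan_tier": single_feature_score(plan_tier, churned),
    "account_closed": single_feature_score(account_closed, churned),
}
# Flag any feature that predicts the target perfectly on its own.
leaky = [name for name, s in scores.items() if s >= 0.999]
print(scores)
print("possible target leakage:", leaky)
```

In this sketch the `account_closed` column is a direct copy of the target, so it scores 1, while the unrelated `plan_tier` column scores close to 0. A real feature that scores 1 is rarely a literal copy; more often it is a value recorded after the outcome was already known.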

To address target leakage, the data scientist should review the features and the way the dataset was assembled, then remove or rebuild any leaky variables. This may involve changing data preprocessing or feature engineering steps. If the leak comes from the way the data was split, techniques such as k-fold cross-validation can produce cleaner training and validation splits.
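As a minimal sketch of the removal step, assuming the dataset is held as a list of row dictionaries and that a review has already flagged a column as leaky, the flagged columns can be dropped before training. The helper `drop_columns` and the column names are hypothetical and exist only for this example.

```python
def drop_columns(rows, leaky_columns):
    """Return copies of the row dicts without the columns flagged as leaky."""
    blocked = set(leaky_columns)
    return [{k: v for k, v in row.items() if k not in blocked} for row in rows]


# Hypothetical rows: "account_closed" is only known after the customer churns.
rows = [
    {"plan_tier": "pro", "account_closed": 1, "churned": 1},
    {"plan_tier": "free", "account_closed": 0, "churned": 0},
]
cleaned = drop_columns(rows, ["account_closed"])
print(cleaned)
```

After removing a leaky column, it is worth regenerating the feature summary to confirm that no remaining feature still scores at or near 1.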

The score of 1 is not caused by an untuned training and validation split (Option B), by a failure to find an optimal model fit for each feature (Option C), or by insufficient feature processing (Option D). Rather, it directly points to an issue with the underlying data that needs to be corrected before proceeding with model training.
