AI-900: Proper Data Splitting for Machine Learning Model Evaluation

When training a machine learning model, the data must be split correctly into training and evaluation sets. Learn why random row-wise splitting is preferred over splitting by feature or label.

Question

For a machine learning process, how should you split data for training and evaluation?

A. Use features for training and labels for evaluation.
B. Randomly split the data into rows for training and rows for evaluation.
C. Use labels for training and features for evaluation.
D. Randomly split the data into columns for training and columns for evaluation.

Answer

B. Randomly split the data into rows for training and rows for evaluation.

Explanation

In Azure Machine Learning, a percentage split is a common technique for dividing data. A given percentage of randomly selected rows goes to the training set, and the remaining rows go to the test set.

The Split Data module is particularly useful when you need to separate data into training and testing sets. Use the Split Rows option if you want to divide the data into two parts. You can specify the percentage of data to put in each split, but by default, the data is divided 50-50. You can also randomize the selection of rows in each group, and use stratified sampling.
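The row split described above can be sketched in plain Python. This is a minimal illustration, not the Split Data module itself: the function name split_rows and its parameters (fraction, randomize, seed) are assumptions chosen to mirror the options mentioned in the text, and the sample rows are made up.

```python
import random

def split_rows(rows, fraction=0.5, randomize=True, seed=None):
    """Return (first, second), with about `fraction` of the rows in `first`."""
    rows = list(rows)  # copy so the caller's data is left untouched
    if randomize:
        random.Random(seed).shuffle(rows)
    cut = round(len(rows) * fraction)
    return rows[:cut], rows[cut:]

# Each row keeps its features and its label together: (age, height_cm, label).
data = [(20 + i, 150 + i, i % 2) for i in range(10)]

train, test = split_rows(data, fraction=0.7, seed=42)
print(len(train), len(test))  # 7 3
```

Passing a fixed seed makes the split repeatable, which helps when comparing models on the same training and test rows.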

The correct answer is B. You should randomly split the data into rows for training and evaluation for a machine learning project.

A machine learning project involves using data to train a model that can perform a specific task, such as classification, regression, or clustering. To train and evaluate a model, you need to split the data into two subsets: a training set and an evaluation set.

The training set is the data used to fit the model, that is, to adjust the model parameters so that the error between the model's predictions and the actual outcomes is minimized. The evaluation set is the data used to measure how well the trained model performs on new, unseen data.
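To make the two roles concrete, here is a small sketch that fits a straight line on training rows and then measures the error on held-out evaluation rows. The data is synthetic, and the helper names fit_line and mean_squared_error are hypothetical; a least-squares line stands in for whatever model a real project would use.

```python
import random

def fit_line(rows):
    """Least-squares fit of label = slope * feature + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(model, rows):
    """Average squared difference between predictions and actual labels."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in rows) / len(rows)

rng = random.Random(0)
# Synthetic rows (feature, label): label is roughly 3 * feature + 2 plus noise.
rows = [(x, 3 * x + 2 + rng.gauss(0, 1)) for x in range(40)]
rng.shuffle(rows)
train_rows, eval_rows = rows[:30], rows[30:]

model = fit_line(train_rows)                           # training: adjust parameters
train_error = mean_squared_error(model, train_rows)
eval_error = mean_squared_error(model, eval_rows)      # evaluation: unseen rows
print(f"train MSE: {train_error:.3f}  evaluation MSE: {eval_error:.3f}")
```

The evaluation error is the number that indicates how the model may behave on new data, because none of those rows influenced the fitted parameters.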

To split the data into training and evaluation sets, you should randomly split the data into rows, not columns. A row in a data set represents an observation or an example, which consists of a set of features and a label. A feature is an attribute or a characteristic of the observation, such as age, height, or color. A label is the outcome or the target variable that you want to predict, such as income, grade, or category.

By randomly splitting the data into rows, you make it likely that both the training and evaluation sets contain a representative sample of the population, and every row in either set keeps the same features and label. This way, the model is trained and evaluated on data with the same format and a similar distribution.
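The effect of randomization can be seen in a small sketch. It assumes a hypothetical data set that happens to be stored in label order, and compares a split taken in file order with a split taken after shuffling; the helper share_of_positive is made up for this illustration.

```python
import random

def share_of_positive(rows):
    """Fraction of rows whose label (second element) is 1."""
    return sum(label for _, label in rows) / len(rows)

# Hypothetical data stored in label order: 50 rows of class 0, then 50 of class 1.
rows = [(i, 0) for i in range(50)] + [(i, 1) for i in range(50, 100)]

# Split without shuffling: the last 20 rows are all class 1.
ordered_train, ordered_eval = rows[:80], rows[80:]

# Split after shuffling the rows with a fixed seed.
shuffled = rows[:]
random.Random(7).shuffle(shuffled)
random_train, random_eval = shuffled[:80], shuffled[80:]

print("ordered:", share_of_positive(ordered_train), share_of_positive(ordered_eval))
print("random: ", share_of_positive(random_train), share_of_positive(random_eval))
```

In the ordered split the evaluation set contains only class 1, so it says nothing about how the model handles class 0. After shuffling, both sets contain a mix of the two classes, although the proportions are only approximately equal; stratified sampling is the usual way to make them match more closely.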

If you split the data into columns, each part contains every row but only some of the columns, so you are separating features and labels from one another rather than creating training and evaluation sets. This is not a valid way to split the data, because you need both features and labels for both training and evaluation. If you use features for training and labels for evaluation, or vice versa, you are not training or evaluating the model properly, because the model never receives the matching input and output data that it needs.
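The difference in shape between the two kinds of split can be shown with a tiny, made-up table. The column meanings (age, height_cm, income as the label) are assumptions for illustration only.

```python
import random

# Hypothetical table: columns are age, height_cm, income (income is the label).
table = [
    [25, 170, 40000],
    [32, 165, 52000],
    [47, 180, 61000],
    [51, 158, 58000],
    [38, 175, 49000],
]

# Column-wise split: every row is present, but no part holds features and label together.
feature_part = [row[:2] for row in table]
label_part = [row[2:] for row in table]

# Row-wise split: a random subset of complete rows goes to each set.
shuffled = table[:]
random.Random(1).shuffle(shuffled)
train_rows, eval_rows = shuffled[:4], shuffled[4:]

print(len(feature_part[0]), len(label_part[0]))  # 2 1
print(len(train_rows[0]), len(eval_rows[0]))     # 3 3
```

Only the row-wise split yields two sets in which each observation still carries both its features and its label.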

Reference

Microsoft Learn > Previous Versions > Module Categories and Descriptions > Data Transformation > Sample and Split > Split Data

Microsoft Azure AI Fundamentals AI-900 certification exam practice question and answer (Q&A) dump with detail explanation and reference available free, helpful to pass the Microsoft Azure AI Fundamentals AI-900 exam and earn Microsoft Azure AI Fundamentals AI-900 certification.