Question
You work for a company that wants to improve spam filtering for mobile email applications. Your data science team gathers one million messages that have been correctly labeled as spam. You then train an artificial neural network to correctly identify these spam messages. After you train the system, one of the product managers asks why you don’t use those same million messages to test the network for accuracy.
How should you respond?
A. An artificial neural network does not use test data like other machine learning systems.
B. If you use the training data, then you’re not testing how well the system will do in the future to identify spam.
C. It is a good idea to use the same messages, but the machine learning system can test its accuracy.
D. That is an efficient way to train the system without having to find another several million email messages.
Answer
B. If you use the training data, then you’re not testing how well the system will do in the future to identify spam.
Explanation
The correct answer is B.
A machine learning system, such as an artificial neural network, must be evaluated on new or unseen data, not on the data it was trained on. The goal of machine learning is to generalize beyond the training data, so that the system makes accurate predictions or classifications on any data it may encounter in the future.
If you use the training data to test the system, you measure how well it has memorized or fitted that data, not how well it will identify spam in the future. The resulting accuracy figure is overly optimistic and can hide overfitting, which means that the system performs very well on the training data but poorly on test data or any new data. Overfitting can result from a model that is too complex, too little training data, or too many training iterations.
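The gap between training accuracy and accuracy on unseen data can be illustrated with a deliberately extreme toy model. The sketch below is not a neural network: it assumes a hypothetical `Memorizer` class that simply stores every training message in a lookup table, and synthetic messages produced by a hypothetical `make_message` helper. It only shows why a perfect score on the training messages says nothing about future messages.

```python
import random

rng = random.Random(0)

def make_message(i, rng):
    """Build one synthetic (text, is_spam) pair; the id makes each text unique."""
    is_spam = rng.random() < 0.5
    word = "prize" if is_spam else "meeting"
    return (f"msg-{i} {word}", is_spam)

# Hypothetical data: 1,000 messages for training, 1,000 the model never sees.
train = [make_message(i, rng) for i in range(1000)]
unseen = [make_message(i, rng) for i in range(1000, 2000)]

class Memorizer:
    """Extreme case of overfitting: stores the training set verbatim."""

    def fit(self, data):
        self.table = dict(data)

    def predict(self, text):
        # Unknown messages fall back to "not spam".
        return self.table.get(text, False)

def accuracy(model, data):
    """Fraction of messages whose predicted label matches the true label."""
    return sum(model.predict(text) == label for text, label in data) / len(data)

model = Memorizer()
model.fit(train)

train_acc = accuracy(model, train)    # perfect, because every answer was stored
unseen_acc = accuracy(model, unseen)  # roughly chance level on new messages

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on unseen data:   {unseen_acc:.2f}")
```

Testing on `train` alone would report a flawless system, while the `unseen` messages show that it has learned nothing it can reuse.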
Therefore, to detect overfitting and get a realistic estimate of the system’s accuracy, you should test on a separate set of data that was not used for training. This can be done by splitting the original data into a training set and a test set, or by using cross-validation techniques that divide the data into multiple folds and use each fold as the test set once while training on the remaining folds. By using a test set that is independent of the training set, you can measure how well the system generalizes to new or unseen data.
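Both approaches can be sketched in a few lines of plain Python. The helpers `train_test_split` and `k_fold`, the word-count model in `fit` and `predict`, and the synthetic corpus are all assumptions made for illustration; they are not the spam filter described in the question, and a real project would more likely use a library such as scikit-learn.

```python
import random
from collections import Counter

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle, then hold out a fraction of the items as the test set."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

def k_fold(items, k=5, seed=0):
    """Yield (train, test) pairs so that each item is tested exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def fit(train):
    """Toy model: score each word by (spam count - non-spam count)."""
    scores = Counter()
    for text, is_spam in train:
        for word in text.split():
            scores[word] += 1 if is_spam else -1
    return scores

def predict(scores, text):
    """Predict spam when the summed word scores are positive."""
    return sum(scores[word] for word in text.split()) > 0

def accuracy(scores, data):
    return sum(predict(scores, text) == label for text, label in data) / len(data)

# Hypothetical synthetic corpus of 500 labeled messages.
rng = random.Random(1)
SPAM_WORDS = ["prize", "winner", "free", "claim"]
HAM_WORDS = ["meeting", "invoice", "lunch", "report"]

def make_message(rng):
    is_spam = rng.random() < 0.5
    vocab = SPAM_WORDS if is_spam else HAM_WORDS
    return " ".join(rng.choice(vocab) for _ in range(3)), is_spam

messages = [make_message(rng) for _ in range(500)]

# Option 1: a single train/test split.
train, test = train_test_split(messages, test_fraction=0.2)
holdout_acc = accuracy(fit(train), test)

# Option 2: 5-fold cross-validation, averaging the per-fold accuracies.
fold_accs = [accuracy(fit(tr), te) for tr, te in k_fold(messages, k=5)]
cv_acc = sum(fold_accs) / len(fold_accs)

print(f"hold-out accuracy: {holdout_acc:.2f}")
print(f"5-fold CV accuracy: {cv_acc:.2f}")
```

In both cases the model is fitted only on the training portion and scored only on messages it was not fitted on. Cross-validation costs more computation but uses every message for testing once.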
The other options are incorrect for the following reasons.
- A. An artificial neural network does not use test data like other machine learning systems. This option is false. An artificial neural network is a type of machine learning system, and its performance and accuracy are evaluated on test data in the same way as any other.
- C. It is a good idea to use the same messages, but the machine learning system can test its accuracy. This option is incorrect because using the same messages for training and testing is not a good idea, as explained above. The system can compute an accuracy score on the messages it was trained on, but that score does not reflect its ability to generalize to new or unseen data.
- D. That is an efficient way to train the system without having to find another several million email messages. This option is misleading because it confuses training with testing: reusing the training messages as test data does not train the system any further, and it produces an inflated accuracy estimate that says little about generalization. Moreover, finding another several million email messages may not be necessary, as a smaller but representative sample of email messages may suffice for testing purposes.
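To give a rough sense of why a smaller test set can be enough, the sketch below estimates the margin of error of a measured accuracy for several test set sizes. It assumes a hypothetical measured accuracy of 0.95, a randomly drawn representative test sample, and the normal approximation to the binomial distribution; the function name `accuracy_margin` is introduced here for illustration only.

```python
import math

def accuracy_margin(accuracy, n_test, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n_test messages.

    Uses the normal approximation: z * sqrt(p * (1 - p) / n).
    """
    return z * math.sqrt(accuracy * (1 - accuracy) / n_test)

# Hypothetical measured accuracy of 0.95 at different test set sizes.
for n_test in (1_000, 10_000, 1_000_000):
    margin = accuracy_margin(0.95, n_test)
    print(f"{n_test:>9} test messages: 0.95 +/- {margin:.4f}")
```

Under these assumptions, 10,000 held-out messages already pin the accuracy down to within about half a percentage point, so a test set far smaller than the training set can still be informative.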