Confounding Variables

Jay Mahabal

One goal of machine learning algorithms is to use past information to predict the future. The advantage of machine learning over traditional analytics is the ability for the machine-learning algorithm to automatically build a good model, saving time, preventing overfitting, and generally being more robust. To do this the algorithm builds a model, calculates the error rate of the model, adjusts parameters to lower the error rate, and iterates again, 'learning' from its mistakes.

There's a step in between: calculating the error rate requires us to split our dataset into a training and test dataset, in which we train the model on the training dataset, and calculate the error rate on the test dataset. We need to do this because if we calculate the error rate while training on the entire dataset we would get a low error rate since the model is trained on that specific data, and this would be misleading for predicting future, unknown data. So using a separate test data set is better measure of how a model actually performs.

Below, we've split a dataset into the testing and training sets already; if you were to build a rudimentary model on the following data, how would you draw it?