Supervised Learning | Linear Regression | Machine Learning | Scikit-Learn | Part-3


Let's load the Advertising dataset and understand its features and target. We use the pandas library to load the data.
The features are:
  • TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
  • Radio: advertising dollars spent on Radio (in thousands of dollars)
  • Newspaper: advertising dollars spent on Newspaper (in thousands of dollars)
The response is:
  • Sales: sales of a single product in a given market (in thousands of items)
What else do we know?
  • Because the response variable is continuous, this is a regression problem.
  • There are 200 observations (represented by the rows), and each observation is a single market.
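Here is a minimal sketch of the loading step. The real file location is not given in this post, so the example reads a few illustrative rows from an in-memory string; `csv_source` is a hypothetical stand-in that you would replace with the path or URL of your own copy of the Advertising CSV. Some copies of the file carry a leading index column, in which case passing `index_col=0` to `read_csv` may be needed (an assumption about your copy, not something stated here).

```python
import io

import pandas as pd

# Stand-in for the real file: a few illustrative rows in the same column layout.
# Replace `csv_source` with the path or URL of your Advertising CSV.
csv_source = io.StringIO(
    "TV,Radio,Newspaper,Sales\n"
    "230.1,37.8,69.2,22.1\n"
    "44.5,39.3,45.1,10.4\n"
    "17.2,45.9,69.3,9.3\n"
)

data = pd.read_csv(csv_source)

print(data.head())   # first rows: three feature columns plus the response
print(data.shape)    # (rows, columns); the full dataset would have 200 rows
```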

Linear regression
Pros: Fast, no tuning required, highly interpretable, well-understood
Cons: unlikely to produce the best predictive accuracy (presumes a linear relationship between the features and response)
Regression: Predict a continuous response.
Form of linear regression
y = β0 + β1x1 + β2x2 + ... + βnxn
  • y  is the response
  • β0  is the intercept
  • β1  is the coefficient for  x1  (the first feature)
  • βn  is the coefficient for  xn  (the nth feature)
In this case: y = β0 + β1 × TV + β2 × Radio + β3 × Newspaper
The  β  values are called the model coefficients. These values are "learned" during the model fitting step using the "least squares" criterion. Then, the fitted model can be used to make predictions!
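To make the "least squares" idea concrete, here is a small sketch using NumPy directly rather than scikit-learn. The data and the coefficients in `true_beta` are made up for illustration; because the toy response has no noise, the fit should recover those coefficients almost exactly.

```python
import numpy as np

# Toy data generated from known, made-up coefficients (no noise),
# so the result of the fit can be checked.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(20, 3))         # 20 observations, 3 features
true_beta = np.array([3.0, 0.05, 0.2, 0.0])   # [β0, β1, β2, β3]
y = true_beta[0] + X @ true_beta[1:]

# Prepend a column of ones so the intercept β0 is estimated with the rest.
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares: choose the β values that minimise the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)
```

With real, noisy data the recovered values would only approximate the underlying relationship.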
Preparing X and y using pandas
  • Scikit-learn expects X (feature matrix) and y (response vector) to be NumPy arrays.
  • However, pandas is built on top of NumPy.
  • Thus, X can be a pandas DataFrame and y can be a pandas Series!
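A short sketch of this preparation step, using a tiny illustrative DataFrame in place of the full dataset (the name `feature_cols` is just a convenience chosen for this example):

```python
import pandas as pd

# Small illustrative frame with the same columns as the Advertising data.
data = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2],
    "Radio": [37.8, 39.3, 45.9],
    "Newspaper": [69.2, 45.1, 69.3],
    "Sales": [22.1, 10.4, 9.3],
})

feature_cols = ["TV", "Radio", "Newspaper"]
X = data[feature_cols]   # list of columns -> DataFrame (2-D feature matrix)
y = data["Sales"]        # single column   -> Series (1-D response vector)

print(type(X), X.shape)
print(type(y), y.shape)
```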
Splitting X and y into training and testing sets
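The split can be sketched with scikit-learn's `train_test_split`. The data below is a synthetic stand-in with 200 rows (matching the dataset's size, not its values), and `random_state=1` is an arbitrary choice made only so the split is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 200 rows; the values are random, not the real data.
rng = np.random.default_rng(1)
X = pd.DataFrame(
    rng.uniform(0, 100, size=(200, 3)),
    columns=["TV", "Radio", "Newspaper"],
)
y = pd.Series(rng.uniform(0, 30, size=200), name="Sales")

# With no test_size given, scikit-learn holds out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print(X_train.shape, X_test.shape)   # 150 training rows and 50 testing rows
print(y_train.shape, y_test.shape)
```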
Linear regression in scikit-learn
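Fitting the model might look like the sketch below. The training data here is synthetic and noise-free, built from made-up coefficients, so the numbers it prints are not the ones reported in this post; the point is only to show where the intercept and coefficients live after fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free training data built from made-up coefficients.
rng = np.random.default_rng(2)
X_train = rng.uniform(0, 100, size=(150, 3))   # columns stand in for TV, Radio, Newspaper
y_train = 3.0 + X_train @ np.array([0.05, 0.2, 0.0])

linreg = LinearRegression()
linreg.fit(X_train, y_train)      # learns the intercept and the coefficients

print(linreg.intercept_)          # estimate of β0
print(linreg.coef_)               # estimates of β1, β2, β3, in column order

# Pair each coefficient with its feature name for readability.
feature_cols = ["TV", "Radio", "Newspaper"]
print(list(zip(feature_cols, linreg.coef_)))
```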
Thus the fitted linear equation is  y = 2.88 + 0.0466 × TV + 0.179 × Radio + 0.00345 × Newspaper
How do we interpret the TV coefficient (0.0466)?
  • For a given amount of Radio and Newspaper ad spending, a "unit" increase in TV ad spending is associated with a 0.0466 "unit" increase in Sales.
  • Or more clearly: For a given amount of Radio and Newspaper ad spending, an additional $1,000 spent on TV ads is associated with an increase in sales of 46.6 items.
Important notes:
  • This is a statement of association, not causation.
  • If an increase in TV ad spending was associated with a decrease in sales, β1 would be negative.
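The interpretation of the TV coefficient can be checked with a little arithmetic. The sketch below plugs the coefficients from the fitted equation into a helper function (`predict_sales` is a name made up for this example) and compares two predictions that differ only by one unit of TV spending. The spending levels are arbitrary.

```python
# Coefficients copied from the fitted equation in the text.
INTERCEPT = 2.88
COEF = {"TV": 0.0466, "Radio": 0.179, "Newspaper": 0.00345}


def predict_sales(tv, radio, newspaper):
    """Predicted sales (thousands of items) for spending in thousands of dollars."""
    return (INTERCEPT
            + COEF["TV"] * tv
            + COEF["Radio"] * radio
            + COEF["Newspaper"] * newspaper)


# Hold Radio and Newspaper fixed (arbitrary values) and raise TV by one unit.
base = predict_sales(tv=100, radio=20, newspaper=10)
plus_one = predict_sales(tv=101, radio=20, newspaper=10)

delta_thousands = plus_one - base
print(round(delta_thousands, 4))          # change in Sales, in thousands of items
print(round(delta_thousands * 1000, 1))   # the same change, in items
```

Whatever values are chosen for Radio and Newspaper, the difference is the TV coefficient itself, because the other terms cancel.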

Model evaluation metrics for regression
We need an evaluation metric in order to compare our predictions with the actual values! Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.
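Three common evaluation metrics for regression problems are:
  • Mean Absolute Error (MAE): the mean of the absolute values of the errors, (1/n) Σ |yi − ŷi|
  • Mean Squared Error (MSE): the mean of the squared errors, (1/n) Σ (yi − ŷi)²
  • Root Mean Squared Error (RMSE): the square root of the MSE
Comparing these metrics: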

  • MAE is the easiest to understand, because it's the average absolute error.
  • MSE is more popular than MAE, because MSE "punishes" larger errors.
  • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.
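To see how the three metrics behave, here is a hand calculation in plain Python. The actual and predicted values are made up purely to exercise the formulas.

```python
import math

# Made-up actual and predicted values, only to exercise the formulas.
y_true = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]

errors = [t - p for t, p in zip(y_true, y_pred)]
n = len(errors)

mae = sum(abs(e) for e in errors) / n     # mean absolute error
mse = sum(e ** 2 for e in errors) / n     # mean squared error
rmse = math.sqrt(mse)                     # root mean squared error

print(mae)    # 10.0
print(mse)    # 150.0
print(rmse)   # about 12.25
```

The single error of 20 contributes 400 of the 600 total squared error, which is what "punishing" larger errors means in practice. The functions `mean_absolute_error` and `mean_squared_error` in `sklearn.metrics` should give the same MAE and MSE for these inputs.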
Sales predictions model evaluation:
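One way to evaluate the model is to compute the RMSE on the testing set, sketched below. The data is a synthetic stand-in (made-up coefficients plus random noise), so the printed RMSE is not the value for the Advertising data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: made-up coefficients plus random noise.
rng = np.random.default_rng(3)
X = rng.uniform(0, 100, size=(200, 3))
y = 3.0 + X @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

linreg = LinearRegression().fit(X_train, y_train)
y_pred = linreg.predict(X_test)            # predictions for the held-out rows

# RMSE on the testing set, in the same units as y.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)
```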


Comparing the sales prediction vs actual:
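A simple way to make this comparison is to put the actual and predicted values side by side in a DataFrame. The numbers below are made up and stand in for `y_test` and the model's predictions.

```python
import pandas as pd

# Made-up values standing in for y_test and the model's predictions.
y_test = pd.Series([22.1, 10.4, 9.3], name="Sales")
y_pred = [20.5, 12.3, 12.3]

comparison = pd.DataFrame({"Actual": y_test.values, "Predicted": y_pred})
comparison["Error"] = comparison["Actual"] - comparison["Predicted"]
print(comparison)
```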
