ATech Guide

Regression in Machine Learning

Published: 2019-01-09 • Updated: 2019-09-08

Machine Learning is a branch of Artificial Intelligence in which computer systems learn from data and make predictions without being explicitly programmed or requiring human intervention.


I've discussed this topic in depth in this post. So let's begin by answering

What is a Regression Problem in Machine Learning?

Regression is a supervised learning technique used to predict real (continuous) values, such as salary (a dependent variable) predicted from time (an independent variable). There are multiple regression techniques,

  • Simple Linear Regression
  • Multiple Linear Regression
  • Polynomial Regression
  • Support Vector Regression (SVR)
  • Decision Tree Regression
  • Random Forest Regression

Let's look into each of these regression algorithms individually, beginning with

Simple Linear Regression in Machine Learning

Simple Linear Regression is based on the following formula,

y = b_{0} + b_{1}x_{1}, where
y = Dependent Variable (something we are trying to explain)
x_{1} = Independent Variable
b_{1} = Coefficient, which determines how a unit change in x_{1} will cause a change in y
b_{0} = Constant


Suppose we have Salary vs. Experience data and we want to predict Salary based on Experience. Plotting the data, it looks something like,

Simple Linear Regression in Machine Learning

In our scenario the regression equation looks like
salary = b_{0} + b_{1} * Experience, where
b_{0} = salary at zero experience
b_{1} = change in salary per unit increase in experience. The higher b_{1} (the slope), the faster salary grows with experience.


We want to find the line that best fits the observations marked as (+).


How to find that best fit line?


Chart of Salary vs Experience for best fit line

In the above diagram, let L1 be the line representing the simple linear regression model. We have drawn green lines from the actual observation (+) to the model.


a1 = the salary the model says the person should be earning, i.e. the point on the line L1
a2 = the actual salary of the person, i.e. the observation (+)
green line = the difference between what the person is actually earning and what the model says they should earn (the residual).


To find the best fitting line, we do the following

  • Square each green line, i.e. (a1-a2)²
  • Sum up all the squared green lines, i.e. Σ(a1-a2)²
  • The best fit line is the one that minimizes Σ(a1-a2)², as the sketch below illustrates
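
To make this criterion concrete, here is a minimal sketch with made-up observations and two hypothetical candidate lines (illustrative values only); the line with the smaller sum of squared residuals is the better fit:

import numpy as np

# Made-up observations (years of experience, salary) purely for illustration
experience = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([39000.0, 44000.0, 58000.0, 56000.0, 66000.0])

def sum_of_squared_residuals(b0, b1):
    """Sigma (a1 - a2)^2 for the candidate line salary = b0 + b1 * experience."""
    predicted = b0 + b1 * experience          # a1: what the model says
    return np.sum((predicted - salary) ** 2)  # a2: the actual salary

# Two hypothetical candidate lines; the one with the smaller sum fits better
print(sum_of_squared_residuals(30000, 5000))
print(sum_of_squared_residuals(32000, 6500))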

Let's quickly create a model based on the following data, where we will predict the salary based on the years of experience a particular candidate has.


YearsExperience | Salary
1.1 | 39343
1.3 | 46205
1.5 | 37731
2 | 43525
2.2 | 39891
2.9 | 56642
3 | 60150
3.2 | 54445
3.2 | 64445
3.7 | 57189
3.9 | 63218
4 | 55794
4 | 56957
4.1 | 57081
4.5 | 61111
4.9 | 67938
5.1 | 66029
5.3 | 83088
5.9 | 81363
6 | 93940
6.8 | 91738
7.1 | 98273
7.9 | 101302
8.2 | 113812
8.7 | 109431
9 | 105582
9.5 | 116969
9.6 | 112635
10.3 | 122391
10.5 | 121872

Click here to get Full data.
Data Credits: All the data used in this tutorial is taken from the SuperDataScience data set


So let's quickly start with


Step 1: Loading and processing the data

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values # Independent Variable
y = dataset.iloc[:, 1].values # Dependent Variable

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

# We don't need feature scaling here; linear regression works fine on the raw values

Step 2: Fitting Simple Linear Regression to training data

# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
linearRegressor = LinearRegression() # creating linearRegressor object
linearRegressor.fit(X_train, y_train) # Fitting model with training set
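
To connect the fitted model back to the formula y = b_{0} + b_{1}x_{1}, we can inspect the learned intercept and coefficient and make a one-off prediction (a quick optional check; 5.0 years is just an illustrative value):

print(linearRegressor.intercept_)  # b0: predicted salary at zero experience
print(linearRegressor.coef_)       # b1: change in salary per extra year of experience

# Predicting the salary for a single candidate with 5.0 years of experience
print(linearRegressor.predict([[5.0]]))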

The next step is to check how well our Simple Linear Regression model learned the correlations during training by looking at its predictions on the test set observations.


Step 3: Creating a Vector of Predicted Values

# Predicting the Test set results
prediction = linearRegressor.predict(X_test)

Finally, let's plot the predictions of the linear regression model against the real observations.


Step 4: Visualization of Model w.r.t. training set

plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, linearRegressor.predict(X_train), color = 'blue')
# Y coordinate is the prediction of train set
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.show()

It looks something like Visualization of Salary vs Experience for Training Set


In the above graph, the real values are the red dots and the predicted values lie on the blue simple linear regression line.


Step 5: Visualization of Model w.r.t. test set

plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, linearRegressor.predict(X_train), color = 'blue')
# We will obtain same linear regression line by plotting it with either train set or test set
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.show()

It looks something like Visualization of Salary vs Experience for Test Set


In the above graph, the red dots are the observations of the test set and the predicted values lie on the blue simple linear regression line.


Moving on, let's look into another machine learning technique, which is

Multiple Linear Regression in Machine Learning

Multiple Linear Regression is based on the following formula,

y = b_{0} + b_{1}x_{1} + b_{2}x_{2} + ... + b_{n}x_{n}, where
y = Dependent Variable (something we are trying to explain)
x_{1}, x_{2}, ..., x_{n} = Independent Variables
b_{1}, b_{2}, ..., b_{n} = Coefficients, which determine how a unit change in each x will cause a change in y
b_{0} = Constant


Multiple linear regression extends simple linear regression: we predict a dependent variable based on multiple independent variables. For example, the salary of a person will depend on their experience, certifications, courses, and so on.


Now suppose a venture capitalist is looking to invest in startups. He looks into the portfolio of each company, which looks like,


R&D Spend | Administration | Marketing Spend | State | Profit
165349.2 | 136897.8 | 471784.1 | New York | 192261.83
162597.7 | 151377.59 | 443898.53 | Florida | 191792.06
153441.51 | 101145.55 | 407934.54 | Florida | 191050.39
144372.41 | 118671.85 | 383199.62 | New York | 182901.99
142107.34 | 91391.77 | 366168.42 | Florida | 166187.94
131876.9 | 99814.71 | 362861.36 | New York | 156991.12
134615.46 | 147198.87 | 127716.82 | New York | 156122.51
130298.13 | 145530.06 | 323876.68 | Florida | 155752.6

Click here to get Full data.
Data Credits: All the data used in this tutorial is taken from the SuperDataScience data set


To maximize the profit of his investment, he looks into all the independent variables, viz. R&D Spend, Administration, Marketing Spend, and State, and wants to draw a correlation between them and Profit (the dependent variable).


A Multiple Linear Regression equation looks like,

y = b_{0} + b_{1}(R&D Spend) + b_{2}(Administration) + b_{3}(Marketing Spend) + b_{4}(State)


Let's quickly move on to build a model based on the above data.


Below is a step by step approach to accomplish that

Step 1: Loading the data
Let me remind you to set your working directory correctly before importing the dataset.

# Data Preprocessing Template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values # Independent Variable
y = dataset.iloc[:, 4].values # Dependent Variable

Step 2: Encoding the categorical variable (State)
One fine point to note here: State is a categorical variable, so we need to convert it into numerical form before we can use it in the regression equation.


We will follow Step 5 discussed in this post to create dummy variables for each category (New York and Florida in our case).


After creating Dummy Variables, our data looks like,


R&D Spend | Administration | Marketing Spend | New York | Florida | Profit
165349.2 | 136897.8 | 471784.1 | 1 | 0 | 192261.83
162597.7 | 151377.59 | 443898.53 | 0 | 1 | 191792.06
153441.51 | 101145.55 | 407934.54 | 0 | 1 | 191050.39
144372.41 | 118671.85 | 383199.62 | 1 | 0 | 182901.99
142107.34 | 91391.77 | 366168.42 | 0 | 1 | 166187.94
131876.9 | 99814.71 | 362861.36 | 1 | 0 | 156991.12
134615.46 | 147198.87 | 127716.82 | 1 | 0 | 156122.51
130298.13 | 145530.06 | 323876.68 | 0 | 1 | 155752.6

If we keep only the New York column, all the information is preserved: a 1 in the New York column means New York and a 0 means Florida. So let's drop the Florida column (this avoids the dummy variable trap).


After dropping Florida it looks like,


R&D Spend | Administration | Marketing Spend | New York | Profit
165349.2 | 136897.8 | 471784.1 | 1 | 192261.83
162597.7 | 151377.59 | 443898.53 | 0 | 191792.06
153441.51 | 101145.55 | 407934.54 | 0 | 191050.39
144372.41 | 118671.85 | 383199.62 | 1 | 182901.99
142107.34 | 91391.77 | 366168.42 | 0 | 166187.94
131876.9 | 99814.71 | 362861.36 | 1 | 156991.12
134615.46 | 147198.87 | 127716.82 | 1 | 156122.51
130298.13 | 145530.06 | 323876.68 | 0 | 155752.6

Below is the code to achieve that

# Encoding the categorical variable (State)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the State column (index 3) and pass the remaining columns through;
# the encoded dummy columns come first in the transformed matrix
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder = 'passthrough')
X = ct.fit_transform(X).astype(float)

# Avoiding the Dummy Variable Trap
X = X[:, 1:] # Removed the first dummy column from X

Step 3: Splitting the dataset into the Training set and Test set

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Step 4: Fitting Multiple Linear Regression to training data

# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

The next step is to check how well our Multiple Linear Regression model learned the correlations during training by looking at its predictions on the test set observations.


PS: As we have four independent variables and one dependent variable, we would need five dimensions to plot the model, which is hard to represent visually. So we are skipping that.


Step 5: Predicting the Test set results

# Predicting the Test set results
y_pred = regressor.predict(X_test)


Comparison between y_pred and y_test looks like,


Y_pred | Y_test
103015 | 103282
132582 | 144259
132448 | 146122
71976.1 | 77798.8
178537 | 191050
116161 | 105008
67851.7 | 81229.1
98791.7 | 97483.6
113969 | 110352
167921 | 166188

In row 1, the actual profit is 103282 and our model predicted 103015, which is an excellent prediction.
In row 8, the actual profit is 97483.6 and our model predicted 98791.7, which is again a very good prediction.
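
If you want to inspect this comparison yourself, one quick way (a small optional sketch, not part of the original tutorial flow) is to put the two vectors side by side and compute the R² score:

# Optional: side-by-side comparison of predicted vs. actual profit, plus the R squared score
import pandas as pd
from sklearn.metrics import r2_score

comparison = pd.DataFrame({'Y_pred': y_pred, 'Y_test': y_test})
print(comparison)
print(r2_score(y_test, y_pred))  # closer to 1 means better predictions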


Those are decent predictions. However, can we improve the model further?


Out of all the independent variables some are highly statistically significant (i.e., they have a high impact on the dependent variable) and some are not.


To find out the statistical significance of the independent variables, we have the following five techniques

  • All In
  • Backward Elimination
  • Forward Selection
  • Bidirectional Elimination
  • Score Comparison

Let's look into them one by one, starting with


All In
In this method, we choose all the independent variables to build a model.


We choose this path when either we have prior knowledge that all the variables are true predictors OR a framework/company dictates it.


That's what we have done in building the above model.


Moving on let's explore


Backward Elimination
In this method we follow these steps,


Step 1 Select a significance level to stay in the model (e.g. SL = 0.05, i.e. 5%)
Step 2 Fit the model with all the predictors, i.e. use all independent variables
Step 3 Consider the predictor with the highest P value. If P > SL, go to Step 4, else go to FINISH
PS: Briefly, the P value is a statistical measure used to decide whether a hypothesis should be rejected. A P value below a pre-determined threshold (e.g. 0.05) means the hypothesis that this variable has no meaningful effect on the result is rejected.
Step 4 Remove that predictor
Step 5 Fit the model without this variable


After Step 5, go back to Step 3 and repeat until the highest P value is < SL


Moving on. Let's look into


Forward Selection
In this method we follow these steps,


Step 1 Select a significance level to enter the model (e.g. SL = 0.05, i.e. 5%)


Step 2 Build a simple regression model of the dependent variable with each independent variable separately. From all those models, select the one with the lowest P value.
For example, Y is the dependent variable and x1, x2, x3 are independent variables. We build a model of Y with x1, then with x2, then with x3. Let's say the model with x2 has the lowest P value, so we choose it.


Step 3 Build a regression model of the dependent variable with the variable chosen in Step 2 plus one extra predictor.
Continuing with the above example, we now build a model of Y with (x2, x1) and a model of Y with (x2, x3).


Step 4 From all those models, select the one with the lowest P value. If P < SL, go to Step 3, else finish.
Suppose the model of Y with (x2, x1) has the lowest P value and it is less than SL. We choose it and continue with Step 3. We keep doing this until we hit an iteration whose lowest P value is higher than SL, at which point we exit. While exiting, make sure we keep the previous model, not the current one (i.e., the one whose P value is > SL).
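
This post does not implement forward selection, but a rough sketch of how the loop could look with statsmodels (assuming a predictor matrix X whose first column is the column of ones we add for backward elimination below, and a significance level sl) might be:

import statsmodels.api as sm

def forward_selection(X, y, sl=0.05):
    """Rough sketch of forward selection: start with only the constant column
    (index 0) and keep adding the predictor with the lowest P value while it is below sl."""
    selected = [0]                                # column of ones (constant)
    remaining = list(range(1, X.shape[1]))
    while remaining:
        pvalues = {}
        for j in remaining:
            model = sm.OLS(endog = y, exog = X[:, selected + [j]]).fit()
            pvalues[j] = model.pvalues[-1]        # P value of the newly added predictor
        best = min(pvalues, key=pvalues.get)
        if pvalues[best] < sl:
            selected.append(best)                 # this predictor enters the model
            remaining.remove(best)
        else:
            break                                 # no remaining predictor can enter
    return selected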


Next model that we will learn will combine Backward Elimination and Forward Selection and is known as


Bidirectional Elimination
In this method we follow these steps,


Step 1 Select a significance level to enter the model, e.g. SLENTER = 0.05, and a significance level to stay in the model, e.g. SLSTAY = 0.05
Step 2 Perform the next step of forward selection.
Step 3 Perform ALL the steps of backward elimination.
What we mean here is: suppose we went from 5 independent variables to 6 in the previous step of forward selection. Now we perform all the steps of backward elimination, instead of eliminating just one variable.
Step 4 Keep repeating Step 2 and Step 3 until no new variable can enter and no old variable can exit. Proceed to finish, and now our model is ready.


Finally the last approach, which is


All Possible Models (Score Comparison)
In this method we follow these steps,


Step 1 Select a criterion of goodness of fit, e.g. the Akaike information criterion (AIC).
Step 2 Construct all possible regression models, i.e. 2^n - 1 total combinations of n predictors.
Step 3 Select the one with the best criterion. Proceed to finish, and now our model is ready.


This is the most resource-consuming technique. With just 10 columns we would have to fit 1023 models, which requires a huge amount of compute.
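
To see why the model count explodes, here is a rough sketch (assuming the same X with a leading column of ones, and using AIC as the criterion) that enumerates every non-empty subset of predictors:

from itertools import combinations
import statsmodels.api as sm

def best_model_by_aic(X, y):
    """Sketch of the all-possible-models approach: fit every non-empty subset
    of predictors (2^n - 1 models) and keep the one with the lowest AIC."""
    predictors = list(range(1, X.shape[1]))       # column 0 is the constant
    best_aic, best_subset = float('inf'), None
    for k in range(1, len(predictors) + 1):
        for subset in combinations(predictors, k):
            model = sm.OLS(endog = y, exog = X[:, [0] + list(subset)]).fit()
            if model.aic < best_aic:
                best_aic, best_subset = model.aic, subset
    return best_subset, best_aic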


We now have a brief idea of how each technique works. In this post, we will choose the statistically significant variables using the


Backward Elimination Method
Step 1: Adding a column corresponding to x0 in our Multiple Linear Equation
The multiple linear equation is of the form y = b0 + b1*x1 + b2*x2 + ... + bn*xn; associated with b0 there is an x0 = 1.


In sklearn's linear models that x0 is taken care of by the library itself. However, in the statsmodels library (which we will use for backward elimination), we need to add a column of ones for x0 ourselves.


The code below accomplishes this,

# Building optimal model using backward elimination
import statsmodels.api as sm
X = np.append(arr = np.ones((50, 1)).astype(int), values = X, axis=1)

Step 2: Building an array of statistically significant independent variables
Let's call that array X_opt

X_opt = X[:, [0,1,2,3,4,5]] ## Initially X_opt contains all the independent Variables

## Fitting the statsmodels OLS regressor with all independent variables (i.e. all possible predictors)
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() 
regressor_OLS.summary() ## To check the summary

## From the summary we can check the P value of each variable and remove the one
## with the highest P Value

Output of regressor_OLS.summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.951
Model:                            OLS   Adj. R-squared:                  0.945
Method:                 Least Squares   F-statistic:                     169.9
Date:                Tue, 01 Jan 2019   Prob (F-statistic):           1.34e-27
Time:                        19:05:51   Log-Likelihood:                -525.38
No. Observations:                  50   AIC:                             1063.
Df Residuals:                      44   BIC:                             1074.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       5.013e+04   6884.820      7.281      0.000    3.62e+04     6.4e+04
x1           198.7888   3371.007      0.059      0.953   -6595.030    6992.607
x2           -41.8870   3256.039     -0.013      0.990   -6604.003    6520.229
x3             0.8060      0.046     17.369      0.000       0.712       0.900
x4            -0.0270      0.052     -0.517      0.608      -0.132       0.078
x5             0.0270      0.017      1.574      0.123      -0.008       0.062
==============================================================================
Omnibus:                       14.782   Durbin-Watson:                   1.283
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               21.266
Skew:                          -0.948   Prob(JB):                     2.41e-05
Kurtosis:                       5.572   Cond. No.                     1.45e+06
==============================================================================

x2 has the highest P value and it is greater than SL (where SL = 0.05)


So let's remove x2

## Removing X2 and fitting the model without X2 i.e. index = 2
X_opt = X[:, [0,1,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() 
regressor_OLS.summary()

Again, the variable with the highest P value is the one at index = 1 (from the summary), so we remove it next

## Removing index = 1 and fitting the model
X_opt = X[:, [0,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() 
regressor_OLS.summary()

From the summary we see index = 4 has the highest P value, so let's remove it

## Removing index = 4 and fitting the model
X_opt = X[:, [0,3,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() 
regressor_OLS.summary()

Now we see index = 5 with a P value of 0.06, which is slightly above the 0.05 SL, so we remove it as well

## Removing index = 5 and fitting the model
X_opt = X[:, [0,3]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() 
regressor_OLS.summary()

From the summary, we see all P values are < 0.05. So we are left with the strongest predictor, index = 3, i.e. R&D Spend.
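
The manual loop above can also be automated. Below is a small sketch (using the same X, y and SL = 0.05 as above) that repeatedly drops the predictor with the highest P value until every remaining P value is below the threshold:

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    """Sketch of automated backward elimination: repeatedly drop the predictor
    with the highest P value until all remaining P values are below sl."""
    columns = list(range(X.shape[1]))
    while True:
        regressor_OLS = sm.OLS(endog = y, exog = X[:, columns]).fit()
        worst = int(np.argmax(regressor_OLS.pvalues))
        if regressor_OLS.pvalues[worst] > sl:
            del columns[worst]            # remove the least significant predictor
        else:
            return columns, regressor_OLS

significant_columns, final_model = backward_elimination(X, y)
print(significant_columns)   # matches the manual elimination above: the constant and R&D Spend
print(final_model.summary())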


Moving on, let's look into

Polynomial Linear Regression in Machine Learning

Polynomial Regression is based on the following formula,

y = b_{0} + b_{1}x_{1} + b_{2}x_{1}^{2} + ... + b_{n}x_{1}^{n}, where
y = Dependent Variable (something we are trying to explain)
x_{1} = Independent Variable
b_{1}, b_{2}, ..., b_{n} = Coefficients, which determine how a unit change in x_{1} will cause a change in y
b_{0} = Constant


If we have a data set like the one in the figure below, we can see that a straight line does not fit it well. What we can do is use the polynomial regression formula to fit the data set much better with a parabolic curve.

Polynomial Linear Regression in Machine Learning


Even though it is a polynomial equation in x_{1}, x_{1}^{2}, etc., we still call it Polynomial Linear Regression because the model is linear in the coefficients b_{1}, b_{2}, ..., b_{n}.


Let's quickly create a model based on the following data, where we would like to predict the salary of a particular candidate based on their position level.


Position | Level | Salary
Business Analyst | 1 | 45000
Junior Consultant | 2 | 50000
Senior Consultant | 3 | 60000
Manager | 4 | 80000
Country Manager | 5 | 110000
Region Manager | 6 | 150000
Partner | 7 | 200000
Senior Partner | 8 | 300000
C-level | 9 | 500000
CEO | 10 | 1000000

Click here to get Full data.
Data Credits: All the data used in this tutorial is taken from the SuperDataScience data set


Step 1: Loading and processing the data

# Polynomial Regression

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values ## Choosing level
y = dataset.iloc[:, 2].values  ## Choosing Salary

# We don't have enough data so we will not be splitting it into test set and training set
# Also we don't need to apply feature scaling here; linear regression works fine on the raw values

Step 2: Fitting Polynomial Linear Regression to training data

# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
linearRegressor = LinearRegression()
linearRegressor.fit(X,y) # We have also created a linear model to compare its efficiency with polynomial model


# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
polyRegression = PolynomialFeatures(degree=2) ## We are choosing a degree of 2

X_poly = polyRegression.fit_transform(X) ## Creating a new matrix X_poly with the polynomial features

lin_reg_2 = LinearRegression() 
lin_reg_2.fit(X_poly, y) # Fitting a polynomial model with the polynomial matrix X_poly

X_poly looks like

1 | 1 | 1
1 | 2 | 4
1 | 3 | 9
1 | 4 | 16
1 | 5 | 25
1 | 6 | 36
1 | 7 | 49
1 | 8 | 64
1 | 9 | 81
1 | 10 | 100

where the first column represents the constant term (x^0 = 1), the second column is the level (x), and the third column is the polynomial term (x^2).


Step 3: Visualisation of Linear Regression Vs Polynomial Regression

# Visualisation of Linear Regression
plt.scatter(X, y, color = 'red')
plt.plot(X, linearRegressor.predict(X), color = 'blue') # Blue line is the prediction of linear regression model
plt.title('Linear Regression')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

The graph for Linear Regression looks like Linear Regression Prediction


We can see that the predictions are not that great, except where the red points are close to the blue line.


# Visualisation of Polynomial Regression
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg_2.predict(polyRegression.fit_transform(X)), color = 'blue') # Blue line is the prediction of the polynomial regression model
plt.title('Polynomial Regression')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

The graph for Polynomial Regression looks like Polynomial Regression Prediction


We can see that the predictions are much better than those of the linear regression model.


To further improve the prediction, let's increase the degree from 2 to 3; the graph then looks like Polynomial Regression Prediction


Which is an improvement over degree 2


To further improve the prediction, let's increase the degree from 3 to 4; the graph then looks like Polynomial Regression Prediction


Which is an improvement over degree 3, with the model curve now passing through almost all the data points.


Can we apply any other regression model to get a decent result?


The answer is YES, Let's explore another regression technique which is

Support Vector Regression in Machine Learning

Support Vector Regression is a type of Support Vector Machine (the details of which are outside the scope of this tutorial) that supports both linear and non-linear regression. SVR performs linear regression in a higher-dimensional space, where each data point in the training set represents its own dimension.


The main goal of SVR is to make sure the errors do not exceed a set threshold, unlike linear regression, where we try to minimize the error between the predictions and the data.


We will analyze the same dataset we used for polynomial regression with Support Vector Regression (SVR); the code looks like,


# SVR

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values ## Choosing level
y = dataset.iloc[:, 2:3].values  ## Choosing Salary

# We don't have enough data so we will not be splitting it into test set and training set

# The SVR class doesn't apply feature scaling, so we need to do it ourselves
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()

X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)

# Fitting SVR to the dataset
from sklearn.svm import SVR
svr_regression = SVR(kernel='rbf')  ## rbf kernel is for non-linear problems
## The kernel defines whether we need a linear, polynomial or Gaussian (RBF) SVR

svr_regression.fit(X, y.ravel())  ## ravel() flattens y into the 1D shape that fit() expects

# Visualisation of SVR results
plt.scatter(X, y, color = 'red')
plt.plot(X, svr_regression.predict(X), color = 'blue') # Blue line is the prediction of the SVR model
plt.title('SVR Regression')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

Graph of SVR looks like Support Vector Regression Prediction


Which is again a decent prediction, except for the last red dot, which the SVR model treats as an outlier.
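
One point worth noting: because we scaled both X and y, a prediction for a new level (6.5 is just an illustrative value) has to be passed through the scalers and back:

# Predicting the salary for a single position level (6.5 is an illustrative value)
scaled_level = sc_X.transform([[6.5]])                                # scale the input like the training data
scaled_salary = svr_regression.predict(scaled_level)                  # prediction in the scaled space
salary_prediction = sc_y.inverse_transform(scaled_salary.reshape(-1, 1))  # back to the original salary scale
print(salary_prediction)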

Decision Tree Regression in Machine Learning

To understand the Decision Tree technique, suppose we have data that looks like,

Decision Tree Regression Chart 1


where x_{1}, x_{2} are the independent variables and the dependent variable y is represented in another plane, perpendicular to x_{1} and x_{2}, which looks like

Decision Tree Regression Chart 2


The Decision Tree algorithm is a non-linear, non-continuous regression model which creates splits in the dataset (represented as blue dotted lines) using information entropy, which looks like,

Decision Tree Regression in Machine Learning


However, the question is,

What is entropy in machine learning?

Information entropy is a mathematical concept used to split the data points in a way that each new split adds information about the data we have. The splitting stops when

  • The algorithm can't add any more information by creating new splits OR
  • We have less than 5% of the points in the new split

So far so good, but how do we use the information from the splits?


Suppose we have a point with x_{1} = 30 and x_{2} = 50; it falls into split 4. To predict its y value we take the average of all the y's present in split 4. Suppose that average equals 0.7; then for x_{1} = 30 and x_{2} = 50 we predict y = 0.7.


Plotting the y value of each split,

Decision Tree Regression Chart 4


From the above graph we can easily draw the following decision tree. Decision Tree Regression Chart 5

Let's try analyzing the same dataset we used for polynomial regression with Decision Tree Regression; the code looks like,

# Decision Tree Regression

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values ## Choosing level
y = dataset.iloc[:, 2:3].values  ## Choosing Salary

# We don't have enough data so we will not be splitting it into test set and training set

# Fitting Decision Tree regression to the dataset
from sklearn.tree import DecisionTreeRegressor
dt_regression = DecisionTreeRegressor(random_state = 0) ## we will use default criterion

dt_regression.fit(X, y)

# Visualisation of Decision Tree results
## We need to visualise this regression model in high resolution because it's non-linear and not continuous

X_grid = np.arange(min(X), max(X), .001)
X_grid = X_grid.reshape((len(X_grid), 1))

plt.scatter(X, y, color = 'red')
plt.plot(X_grid, dt_regression.predict(X_grid), color = 'blue') # Blue line is the prediction of the decision tree model
plt.title('Decision Tree Regression')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

Graph of Decision Tree regression looks like

Decision Tree Regression Chart


We can see in the above diagram,

  • 0-2 is interval 1
  • 2-4 is interval 2 ... and so on

Predicted salary in each interval = average salary of the points in that interval
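
You can verify this interval-average behaviour by asking the fitted model for a prediction inside an interval (6.5 is just an illustrative level):

# Predicting the salary for a single position level (6.5 is an illustrative value)
print(dt_regression.predict([[6.5]]))  # returns the average salary of the interval containing 6.5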

PS: The Decision Tree is a more compelling model for data with two or more dimensions.


Can we do better?


Can we utilize a team of Decision Trees to improve our model?


Yes, we can, and the technique is known as

Random Forest Regression in Machine Learning

Random Forest is a form of Ensemble Learning.


In Ensemble Learning we use several algorithms, or the same algorithm multiple times, to build something more powerful.


Below are the steps for building a Random Forest
Step 1: Pick K random data points from the training set.
Step 2: Build a Decision Tree associated with these K data points.
Step 3: Repeat Step 1 and Step 2 N times to build N trees.
Step 4: Use all N trees to predict the value of a new data point, then average out all the predicted values.
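
To make these steps concrete, here is a minimal hand-rolled sketch using bootstrap samples and plain decision trees (purely illustrative; the RandomForestRegressor used in the code below does all of this, plus per-tree feature sampling, internally):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def hand_rolled_forest(X, y, n_trees=10, random_state=0):
    """Steps 1-3: build n_trees decision trees, each on a random bootstrap sample of the training set."""
    rng = np.random.RandomState(random_state)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, len(X), size=len(X))    # Step 1: pick random data points (with replacement)
        tree = DecisionTreeRegressor(random_state=random_state)
        trees.append(tree.fit(X[idx], y[idx]))       # Step 2: build a tree on those points
    return trees                                     # Step 3: we now have N trees

def forest_predict(trees, X_new):
    """Step 4: average the predictions of all N trees."""
    return np.mean([tree.predict(X_new) for tree in trees], axis=0)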


This improves the accuracy of our prediction because we are taking the average of many predictions.


Let's try analyzing the same dataset we used for polynomial regression with Random Forest Regression; the code looks like,

# Random Forest Regression

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values ## Choosing level
y = dataset.iloc[:, 2:3].values  ## Choosing Salary

# We don't have enough data so we will not be splitting it into test set and training set

# Fitting Random Forest regression to the dataset
from sklearn.ensemble import RandomForestRegressor
rf_regression = RandomForestRegressor(n_estimators = 10, random_state = 0) ## n_estimators tells us how many trees we want to build in a forest

rf_regression.fit(X, y)

# Visualisation of Random Forest results
## We need to visualise this regression model in high resolution because it's non-linear and not continuous

X_grid = np.arange(min(X), max(X), .001)
X_grid = X_grid.reshape((len(X_grid), 1))

plt.scatter(X, y, color = 'red')
plt.plot(X_grid, rf_regression.predict(X_grid), color = 'blue') # Blue line is the prediction of the random forest model
plt.title('Random Forest Regression')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

Graph of Random Forest Regression looks like Random Forest Regression Chart


The random forest calculates many different averages from its decision trees' predictions in each interval, resulting in multiple steps within each interval.


One fine point: if we keep increasing the number of decision trees, the number of steps will not keep increasing, because the averages of the different predictions made by the trees converge.

That's all folks !!!


Conclusion

I tried explaining all the models with short examples, but there is always some confusion or apprehension when selecting a model. Let me assist you with that as well. If

  • The problem is linear: choose Simple Linear Regression (for one feature) or Multiple Linear Regression (for several features)
  • The problem is non-linear: choose between Polynomial Regression, SVR, Decision Tree, or Random Forest. Analyze the performance of each model on your data set and choose the one that solves your problem most efficiently.

So that's the summary of regression in machine learning !!!
