Machine Learning
Published: 2018-11-14 • Updated: 2019-02-25
Machine Learning, a term coined by Arthur Samuel in 1959, is a branch of Artificial Intelligence in which computer systems are given the ability to learn from data and make predictions without being explicitly programmed and without needing human intervention.
In simple words, Machine Learning is the science of devising models from data, which data scientists, research engineers and others then use to make predictions.
Machine Learning can be applied in a variety of fields, for example:
- Mail classification based on a user's interaction with their mailbox
- Facial Recognition
- Speech to Text
- Voice Recognition
- Recommendation System of Amazon
The purpose of this Machine Learning tutorial is to provide a brief introduction to machine learning and equip readers with basic machine learning tools. We will use Python as the programming language to process data and build models.
But before moving any further let's ponder over
Why we need Machine Learning
We need machine learning in computing tasks where designing and programming algorithms based on static program instructions is infeasible, if not impossible.
A classic example of such a task is email filtering. Each individual has their own way of classifying emails as trash or important (there can be other categories too), so an app with hard-coded filters will not work for everyone.
A good app should learn from how a user interacts with their inbox and then assist them in classifying future mails.
Sounds interesting !!!
Let's quickly look into
How machine learning Solves it
A machine learning algorithm learns from the user's behaviour and dynamically helps them classify future emails into the appropriate categories.
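As a rough illustration of this idea (not the method any real mail client uses), the sketch below trains a tiny text classifier on a handful of made-up emails that a hypothetical user has already labelled, then classifies a new one. The example emails, the labels and the choice of a bag-of-words Naive Bayes model from scikit-learn are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up emails and the labels a hypothetical user gave them
emails = [
    "win a free prize now",
    "free lottery prize claim now",
    "meeting agenda for monday",
    "project report and meeting notes",
]
labels = ["trash", "trash", "important", "important"]

# Word counts as features, followed by a Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# Classify an email the model has not seen before
new_email = ["claim your free prize"]
predicted = model.predict(new_email)[0]
print(predicted)
```

With more labelled emails from the same user, the same model would gradually reflect that user's own idea of what is trash and what is important.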
Moving on !!!
Based on the data being fed to a machine learning model, we have the following
4 Classifications of Machine Learning
Supervised Machine Learning: In this method the computer is given a complete, labeled training set of inputs and desired outputs. The computer then derives a correlation between input and output to make predictions. The full list of supervised learning algorithms is outside the scope of this tutorial.
Semi-supervised Machine Learning: In this method the computer is given an incomplete training set, where the outputs are missing for a few (sometimes most) records.
Unsupervised Machine Learning: In this method the computer is given an unlabeled training set of inputs, with no desired outputs. The computer is left on its own to find structure in that data.
Reinforcement Machine Learning: In this method the computer is given feedback on its predictions, in the form of either reward points or punishment. This feedback helps improve the accuracy of future predictions made by the machine learning model.
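To make the difference between the supervised and unsupervised settings a little more concrete, here is a minimal sketch on six made-up one-dimensional points. The data, and the choice of a nearest-neighbour classifier and k-means clustering from scikit-learn, are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Six made-up observations forming two obvious groups
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]])

# Supervised: we also provide the desired output for every input
y = np.array([0, 0, 0, 1, 1, 1])
classifier = KNeighborsClassifier(n_neighbors=3).fit(X, y)
supervised_pred = classifier.predict([[1.1], [5.1]])

# Unsupervised: no outputs are given, the algorithm has to find structure itself
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = kmeans.labels_

print(supervised_pred)
print(clusters)
```

Note that the cluster numbers found by k-means are arbitrary identifiers: the algorithm can separate the two groups, but it has no notion of what the groups mean.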
I guess that's enough theory; let's quickly look into building a machine learning model. But before that, we need to install some basic tools and software. Starting with,
Anaconda comes with a bundle of useful tools. It installs Python (I'm using Python 3), IDEs for Python such as Spyder, and useful packages like numpy and pandas that we will use for Machine Learning.
You can use this link to install Anaconda
After installing, Type for Anaconda Navigator to start it.
It looks like the following,
After we are done installing the basic tools, let's launch Spyder from Anaconda Navigator,
This is how it looks in the default layout
- You can select the Spyder Default Layout from View > Windows layout
- To customise the Editor, go to Preferences > Editor
- If any of the panes is missing, you can add it from View > Panes
This is what I have for myself,
Let's test everything by running a basic print command,
- Type print("Hello from Spyder") in the text editor and save it.
- Run it by selecting the code and pressing
- You can see Hello from Spyder in the IPython Console
Moving On ...
Let's learn the basics of how to load and process data before fitting it into a machine learning model.
The data looks something like this,
Country  Age  Salary  Purchased
France   44   72000   No
Spain    27   48000   Yes
Germany  30   54000   No
Spain    38   61000   No
Germany  40           Yes
France   35   58000   Yes
Spain         52000   No
France   48   79000   Yes
Germany  50   83000   No
France   37   67000   Yes
Data Credits: All the data used in this tutorial is taken from the Superdatascience data set
In the above data,
Country, Age and Salary are the details of the customer. They are called Independent Variables because predictions will be made by analysing them.
Purchased tells us whether the customer bought the company's product. This is the Dependent Variable, or predicted output.
In Machine Learning models, we use independent variables to predict dependent variables. Here, from Country, Age and Salary we are going to predict whether the customer bought the company's product.
Once we have the data in hand, we will now process it step by step. Beginning with,
Step 1: Importing the Libraries
import numpy as np                 # Contains mathematical tools
import matplotlib.pyplot as plt    # Tools for plotting charts
import pandas as pd                # Helps in importing and managing data sets
To load the libraries, select the code and execute it (Shift + Enter on Mac)
Step 2: Importing the Data Sets
- First set the working directory. To do so, either you can put the code file in same directory of data file and execute it again (F5) Or you can choose the appropriate directory from
dataset = pd.read_csv('Data.csv') # read_csv is a function from pandas which we have used to import data set.
In variable explorer you can see something like this,
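If you don't have the data file at hand, the following sketch recreates the table shown earlier in memory and loads it the same way. The exact layout of Data.csv (comma-separated, with this header row and empty fields for the missing values) is an assumption based on that table.

```python
import io
import pandas as pd

# Assumed contents of Data.csv, typed in from the table above;
# empty fields stand for the missing Salary and Age values
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

# read_csv accepts a file-like object as well as a file path
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)         # number of rows and columns
print(dataset.isna().sum())  # missing values per column
```

Checking the shape and the number of missing values right after loading is a quick way to confirm that the file was parsed as expected.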
Step 3: Segregate Independent Variables and Dependent Variables
X = dataset.iloc[:, :-1].values
# In iloc, the part left of the comma selects the rows; (:) means we take all the rows.
# The part right of the comma selects the columns; (:-1) means all the columns except the last one.
Y = dataset.iloc[:, 3].values  # 3 is the index of the Purchased column
The above code creates two variables, X and Y.
X contains the data of the Independent Variables and Y contains the data of the Dependent Variable.
If we type X in IPython Console and press enter, we will see something like
array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
In Y, we have
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)
Step 4: Fixing the missing data
We have missing data in both the Age and Salary columns, so we have two options:
- Removing the rows with missing data.
  - If a row contains crucial information, it is dangerous to remove the observation.
- Replacing the missing data with the mean of the column.
  - This is the most favourable approach.
# Fixing the missing data
from sklearn.impute import SimpleImputer  # replaces the Imputer class, removed in scikit-learn 0.22
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')  # mean of each column
imputer = imputer.fit(X[:, 1:3])
# : => all the rows
# 1:3 => Age and Salary columns; 3 is the upper bound and is excluded
X[:, 1:3] = imputer.transform(X[:, 1:3])
After running the above code, the X variable looks like
Out: array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
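The two filled-in values can be checked by hand with plain numpy: each is simply the mean of the values that are present in its column. This is a small sketch using the Age and Salary columns typed in from the table, not a replacement for the imputer step.

```python
import numpy as np

# Age and Salary columns from the data set, with np.nan for the missing entries
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37])
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000])

# nanmean ignores the missing entries when averaging
age_mean = np.nanmean(age)        # 349 / 9
salary_mean = np.nanmean(salary)  # 574000 / 9

# Replace only the missing entries with the column mean
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(age_mean, salary_mean)
```

The means come out at roughly 38.78 for Age and 63777.78 for Salary, matching the values that appear in X above.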
Step 5: Encoding the categorical variables
We have two categorical variables in our dataset
- Country => France, Spain, Germany
- Purchased => Yes, No
Machine learning models are based on mathematical equations, so keeping text in the categorical variables will create problems, as we only want numbers in the equations.
So we need to encode the categorical variables into numbers.
Below code does exactly that
# Encoding the categorical variables
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
# 0 is the index of the Country column;
# fit_transform() returns the encoded version of the Country column
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
The above code assigns encodings in alphabetical order,
- France => 0
- Germany => 1
- Spain => 2
Which is deeply problematic.
Assigning 2 to Spain gives it precedence over France and Germany, which is not the case. We need to make sure that the encoding does not impose an order on the categorical variable.
We can achieve this by creating three separate columns for France, Germany and Spain and using 1 or 0 to denote which category a row belongs to.
France  Germany  Spain
1       0        0
0       0        1
0       1        0
0       0        1
0       1        0
1       0        0
0       0        1
1       0        0
0       1        0
1       0        0
Below code does exactly that
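# One-hot encoding the Country column (index 0)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# The categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer selects the column to encode and passes the others through
onehotencoder = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder = 'passthrough', sparse_threshold = 0)
X = onehotencoder.fit_transform(X).astype(float)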
X looks like
France  Germany  Spain  Age      Salary
1       0        0      44       72000
0       0        1      27       48000
0       1        0      30       54000
0       0        1      38       61000
0       1        0      40       63777.8
1       0        0      35       58000
0       0        1      38.7778  52000
1       0        0      48       79000
0       1        0      50       83000
1       0        0      37       67000
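For quick exploration, the same kind of one-column-per-category encoding can also be produced directly in pandas. The sketch below is an alternative to the scikit-learn encoder, not the approach used in the rest of this tutorial; the small DataFrame is made up from the first four rows of the data.

```python
import pandas as pd

# First four rows of the data set, Country and Age only
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44, 27, 30, 38],
})

# get_dummies creates one 0/1 column per category of the listed columns
encoded = pd.get_dummies(df, columns=["Country"], dtype=int)

print(encoded)
```

The new columns are named after the original column and the category (Country_France, Country_Germany, Country_Spain), which makes the result easy to read back.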
To encode Y we can still use LabelEncoder. As it is the dependent variable, the machine learning model will know that it is a category and that there is no order between the two values.
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Y looks like
Purchased
0
1
0
0
1
1
0
1
0
1
Step 6: Splitting data set into Training set and Test set
We want to create a training set and a test set from our data set to check the correctness and performance of our model.
The training set is the data on which we build the machine learning model.
The test set is the data on which we test the performance of the machine learning model.
We build the machine learning model on the training set by establishing the correlation between the independent variables and the dependent variable in the training set.
Once our machine learning model has learned the correlation between the independent variables and the dependent variable, we test whether it can apply the correlations learned from the training set to the test set, i.e. we check the accuracy or correctness of its predictions on the test set.
Below code does exactly that
# Splitting the dataset into Training set and Test set
from sklearn.model_selection import train_test_split
# test_size = 0.2 => we have an 80% train set and a 20% test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
The result from the above code looks like,
X_train
France  Germany  Spain  Age      Salary
0       1        0      40       63777.8
1       0        0      37       67000
0       0        1      27       48000
0       0        1      38.7778  52000
1       0        0      48       79000
0       0        1      38       61000
1       0        0      44       72000
1       0        0      35       58000
X_test
France  Germany  Spain  Age  Salary
0       1        0      30   54000
0       1        0      50   83000
Similarly, we have 8 and 2 observations in Y_train and Y_test respectively.
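The sizes of the two sets, and the effect of random_state, can be checked on any small array. The sketch below uses made-up placeholder features (just the numbers 0 to 19) together with the encoded Purchased values, so only the shapes are meaningful.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 observations with 2 made-up features each
X = np.arange(20).reshape(10, 2)
Y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

# Splitting again with the same random_state selects the same rows
X_train_again, X_test_again, _, _ = train_test_split(
    X, Y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test_again))
```

Fixing random_state makes the split reproducible, which is why the tables above show the same rows every time the code is run.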
Step 7: Variable Scaling
The Age and Salary columns contain numerical values which are not on the same scale. Age goes from 27 to 50 and Salary from 40k to 90k.
A lot of machine learning models are based on Euclidean Distance. In the Euclidean distance between two data points, the difference in Salary will dominate the difference in Age, as the range of Salary is much larger than that of Age.
To mitigate this, we need to bring them both onto the same scale / range, e.g. a range of -1 to +1.
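A quick calculation shows how strongly Salary dominates. The sketch below takes the first two customers from the data set (ages 44 and 27, salaries 72000 and 48000) and compares the full Euclidean distance with the Salary difference alone.

```python
import numpy as np

# (Age, Salary) of the first two customers in the data set
customer_a = np.array([44.0, 72000.0])
customer_b = np.array([27.0, 48000.0])

# Euclidean distance using both unscaled features
distance = np.sqrt(np.sum((customer_a - customer_b) ** 2))

# Differences along each feature on its own
age_gap = abs(customer_a[0] - customer_b[0])
salary_gap = abs(customer_a[1] - customer_b[1])

print(distance, age_gap, salary_gap)
```

The 17-year age gap changes the distance by less than 0.01, so on unscaled data a distance-based model would effectively ignore Age.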
We can achieve this with the following methods
Standardisation: X(stand) = (X - mean(X)) / standard deviation (X)
Normalisation: X(norm) = (X - min(X)) / (max(X) - min(X))
Below code does exactly that
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
# We don't need to fit on the test set, as the scaler is already fitted on the training set
X_test = sc_X.transform(X_test)
The result from the above code looks like,
X_train
France  Germany    Spain      Age         Salary
-1      2.64575    -0.774597  0.263068    0.123815
1       -0.377964  -0.774597  -0.253501   0.461756
-1      -0.377964  1.29099    -1.9754     -1.53093
-1      -0.377964  1.29099    0.0526135   -1.11142
1       -0.377964  -0.774597  1.64059     1.7203
-1      -0.377964  1.29099    -0.0813118  -0.167514
1       -0.377964  -0.774597  0.951826    0.986148
1       -0.377964  -0.774597  -0.597881   -0.482149
X_test
France  Germany  Spain      Age       Salary
-1      2.64575  -0.774597  -1.45883  -0.901663
-1      2.64575  -0.774597  1.98496   2.13981
- Feature scaling on X_test uses the same basis as feature scaling on X_train, because the StandardScaler object was fitted to X_train. So it's important to fit the object to X_train first, so that X_test and X_train are scaled on the same basis.
- The dependent variable vectors Y_train and Y_test are categorical, with a value of either 0 or 1, so we don't need feature scaling in this case. If the dependent variable takes a huge range of values, then we need to apply feature scaling to the dependent variable as well.
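The standardisation and normalisation formulas from this step can be checked by hand against scikit-learn. The sketch below uses only the Age column: the eight training ages and two test ages from the tables in Step 6 (38.7778 is the imputed mean, 349/9). MinMaxScaler is not used elsewhere in this tutorial and appears here only to check the normalisation formula.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Age column of X_train and X_test, as column vectors
age_train = np.array([40, 37, 27, 349 / 9, 48, 38, 44, 35], dtype=float).reshape(-1, 1)
age_test = np.array([30, 50], dtype=float).reshape(-1, 1)

# Standardisation by hand, using the mean and standard deviation of the training set only
train_mean = age_train.mean()
train_std = age_train.std()  # population standard deviation (divides by n)
manual_test = (age_test - train_mean) / train_std

# The same with StandardScaler: fit on the training set, transform the test set
scaler = StandardScaler().fit(age_train)
sklearn_test = scaler.transform(age_test)

# Normalisation by hand and with MinMaxScaler
manual_norm = (age_train - age_train.min()) / (age_train.max() - age_train.min())
sklearn_norm = MinMaxScaler().fit_transform(age_train)

print(manual_test.ravel())
print(sklearn_test.ravel())
```

Both approaches give about -1.45883 and 1.98496 for the two test ages, the same values as in the Age column of the scaled X_test above.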
Full Code for Data Preprocessing
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values

# Fixing the missing data
from sklearn.impute import SimpleImputer  # replaces the Imputer class, removed in scikit-learn 0.22
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding the categorical variables
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
# ColumnTransformer replaces the categorical_features argument removed in scikit-learn 0.22
onehotencoder = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder = 'passthrough', sparse_threshold = 0)
X = onehotencoder.fit_transform(X).astype(float)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

# Splitting the dataset into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
Phew !! That's it !!!
We have looked into all the data processing steps.
One thing to note is that we may not need all of these steps on every dataset. Which processing steps are required before applying a machine learning algorithm depends heavily on the format of the data.
Let's move on in building models beginning with
Regression model in Machine Learning
A regression model is used to predict real (continuous) values, such as salary (dependent variable) as a function of time (independent variable). There are multiple regression techniques,
- Simple Linear Regression
- Multiple Linear Regression
- Polynomial Regression
- Support Vector Regression (SVR)
A Simple Linear Regression technique basically follows this formula,
$$y = b_0 + b_1 x_1$$
$y$ = Dependent Variable (something we are trying to explain)
$x_1$ = Independent Variable
$b_1$ = Coefficient which determines how a unit change in $x_1$ will cause a change in $y$
$b_0$ = Constant, the value of $y$ when $x_1$ is zero
Suppose we have Salary vs Experience data and we want to predict Salary based on Experience. Plotting the data, it looks something like,
In our scenario the regression equation looks like
$$\text{Salary} = b_0 + b_1 \cdot \text{Experience}$$
where
$b_0$ = salary at zero experience
$b_1$ = change in salary with an increase in experience. The higher $b_1$ (the slope), the more salary increases with experience
We want to find the line that best fits the observations marked as (+).
How to find that best fit line?
In the above diagram, let L1 be the line representing the simple linear regression model. We have drawn green lines from the actual observations (+) to the model.
a1 = tells us where the person should be sitting according to the model in terms of salary, i.e. the model observation
a2 = Actual salary of the person
green line = Difference between what the person is actually earning and what they should earn according to the model.
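To find the best fitting line, we do the following
- Squaring all the green lines, i.e. $(a_2 - a_1)^2$
- Summing up the squared green lines, i.e. $\sum (a_2 - a_1)^2$
- The best fit line is the one with the smallest such sum, i.e. $\min \sum (a_2 - a_1)^2$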
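The steps above can be sketched in a few lines of numpy. The example below uses five (experience, salary) pairs picked from the data set shown further down and the standard closed-form least squares solution for a single feature; the helper name sum_of_squared_green_lines is made up for this illustration.

```python
import numpy as np

# Five (YearsExperience, Salary) observations taken from the data set
experience = np.array([1.1, 2.0, 3.0, 4.0, 5.1])
salary = np.array([39343.0, 43525.0, 60150.0, 55794.0, 66029.0])

def sum_of_squared_green_lines(b0, b1):
    """Sum of squared differences between actual and modelled salaries."""
    modelled = b0 + b1 * experience  # a1: what the model says
    return np.sum((salary - modelled) ** 2)  # (a2 - a1)^2, summed

# Closed-form least squares solution for one independent variable
x_mean, y_mean = experience.mean(), salary.mean()
b1 = np.sum((experience - x_mean) * (salary - y_mean)) / np.sum((experience - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

best = sum_of_squared_green_lines(b0, b1)
shifted = sum_of_squared_green_lines(b0 + 1000, b1)  # line moved up
steeper = sum_of_squared_green_lines(b0, b1 + 500)   # line made steeper

print(b0, b1)
print(best, shifted, steeper)
```

Any other line, such as the shifted or steeper ones tried here, gives a larger sum of squares than the least squares line.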
Let's quickly create a model based on data, which looks like
YearsExperience  Salary
1.5              37731.0
1.1              39343.0
2.2              39891.0
2.0              43525.0
1.3              46205.0
3.2              54445.0
4.0              55794.0
2.9              56642.0
4.0              56957.0
4.1              57081.0
3.7              57189.0
3.0              60150.0
4.5              61111.0
3.9              63218.0
3.2              64445.0
5.1              66029.0
4.9              67938.0
5.9              81363.0
5.3              83088.0
6.8              91738.0
6.0              93940.0
7.1              98273.0
7.9              101302.0
9.0              105582.0
8.7              109431.0
9.6              112635.0
8.2              113812.0
9.5              116969.0
10.5             121872.0
10.3             122391.0
Step 1: Loading and processing the data
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values  # Independent Variable
y = dataset.iloc[:, 1].values    # Dependent Variable

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

# We don't need feature scaling here: the predictions of an ordinary
# least squares fit do not depend on the scale of the feature
Step 2: Fitting Simple Linear Regression to training data
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
linearRegressor = LinearRegression()    # creating linearRegressor object
linearRegressor.fit(X_train, y_train)   # fitting model with training set
The next step is to check how well our Simple Linear Regression machine learned the correlation in the training set, by looking at its predictions on the test set observations.
Step 3: Creating a Vector of Predicted Values
# Predicting the Test set results
prediction = linearRegressor.predict(X_test)
Finally, let's plot the predictions of the linear regression model against the real observations.
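It can also help to look at what the fitted model actually learned. The sketch below is self-contained: instead of reading Salary_Data.csv it uses ten rows typed in from the table above, so the numbers will differ somewhat from a model fitted on the full training set. The fitted constant and slope are available as intercept_ and coef_.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Ten (YearsExperience, Salary) rows taken from the data set
X = np.array([[1.1], [2.0], [3.0], [4.0], [5.1],
              [6.0], [7.1], [8.2], [9.0], [10.3]])
y = np.array([39343.0, 43525.0, 60150.0, 55794.0, 66029.0,
              93940.0, 98273.0, 113812.0, 105582.0, 122391.0])

regressor = LinearRegression().fit(X, y)

b0 = regressor.intercept_  # salary at zero experience
b1 = regressor.coef_[0]    # change in salary per extra year of experience

# Prediction for someone with 5 years of experience
predicted_salary = regressor.predict([[5.0]])[0]

print(b0, b1, predicted_salary)
```

The prediction is exactly b0 + b1 * 5, which is the simple linear regression formula from earlier applied to a new value of experience.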
Step 4: Visualization of Model w.r.t. training set
plt.scatter(X_train, y_train, color = 'red')
# Y coordinate is the prediction for the train set
plt.plot(X_train, linearRegressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.show()
It looks something like
In the above graph, the real values are the red dots and the predicted values lie on the blue simple linear regression line.
Step 5: Visualization of Model w.r.t. test set
plt.scatter(X_test, y_test, color = 'red')
# We obtain the same linear regression line whether we plot it with the train set or the test set
plt.plot(X_train, linearRegressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.show()
It looks something like
In the above graph, the red dots are the observations of the test set and the predicted values lie on the blue simple linear regression line.
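Besides looking at the plot, the quality of the predictions can be summarised with a number. The sketch below is one possible way to do it, using the mean absolute error and the R² score from scikit-learn on a small hand-made split of rows from the table above; the choice of these two metrics and of this particular split is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Hand-made split of (YearsExperience, Salary) rows from the data set
X_train = np.array([[1.1], [2.0], [3.0], [4.0], [5.1], [6.0], [7.1], [9.0]])
y_train = np.array([39343.0, 43525.0, 60150.0, 55794.0,
                    66029.0, 93940.0, 98273.0, 105582.0])
X_test = np.array([[1.5], [4.5], [8.2], [10.3]])
y_test = np.array([37731.0, 61111.0, 113812.0, 122391.0])

regressor = LinearRegression().fit(X_train, y_train)
prediction = regressor.predict(X_test)

# Average absolute difference between predicted and real salaries
mae = mean_absolute_error(y_test, prediction)
# R² is at most 1.0; values close to 1.0 mean the line explains the test data well
r2 = r2_score(y_test, prediction)

print(mae, r2)
```

The mean absolute error is expressed in the same unit as the salary, which makes it easy to judge whether the predictions are good enough for the task at hand.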
That's all folks, we have now built our very first machine learning model and made some decent predictions.
I would like to conclude this article by highlighting the fact that Machine Learning is truly opening up new prospects from the petabytes of data that organisations have piled up.
One fine example would be Amazon Recommendation system. This article published on Forbes beautifully highlights the potential of Machine Learning in driving revenue in the enterprise.