How to Create Your First Machine Learning Model in Python

In this article, you’ll learn how to create your first machine learning model in Python. Specifically, you will build regression models using traditional linear regression as well as other machine learning algorithms.

1. Your first machine learning model

So what kind of machine learning model are we building today? In this article, we are going to build a regression model using the random forest algorithm on a solubility dataset.

After building the model, we are going to use it to make predictions, then evaluate the model’s performance and visualize its results.

2. Data set

2.1. Toy Datasets

So what dataset are we going to use? The default answer might be to use a toy dataset as an example, such as the Iris dataset (classification) or the Boston housing dataset (regression).

While both are great examples to start with, most tutorials don’t actually load this data from an external source (such as a CSV file) but instead import it from a Python library, such as the datasets submodule of scikit-learn.

For example, you can use the following block of code to load the Iris dataset:

from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target

The advantage of using toy datasets is that they are very easy to use: you import the data directly from the library in a format that is ready for building models. The downside to this convenience is that novices may not actually see which functions load the data, which perform the preprocessing, which build the model, and so on.
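
For instance, a quick inspection (purely optional) makes explicit what load_iris() returned; feature_names and target_names are standard attributes of the Bunch object that scikit-learn provides:

# What did load_iris() give us? NumPy arrays plus metadata.
print(type(X), X.shape)    # <class 'numpy.ndarray'> (150, 4)
print(iris.feature_names)  # names of the 4 measured flower features
print(iris.target_names)   # the 3 iris species (the classes)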

2.2. Your own data set

In this tutorial we’ll take a hands-on approach and focus on creating real-life models that you can easily reproduce. Since we’re going to be reading the input data directly from the CSV file, you can easily replace the input data with your own and repurpose the workflow described here for it.

The dataset we use today is the solubility dataset. It consists of 1144 rows and 5 columns. Each row represents a unique molecule, described by 4 molecular properties (the first 4 columns), while the last column is the target variable to be predicted. This target variable is the solubility of a molecule, an important parameter of a therapeutic drug because it helps the molecule move within the body to reach its target. We will preview the first few rows of the solubility dataset right after loading it below.

2.2.1. Loading Data

The full solubility dataset is available on the Data Professor GitHub at the following link: Download solubility dataset.

As in any data science project, the contents of a CSV file can be read into the Python environment using the pandas library. I’ll show you how to do this with the example below:

import pandas as pd
df = pd.read_csv('data.csv')

The first line imports the pandas library under the short alias pd (for ease of typing). From pd we use the read_csv() function, written as pd.read_csv(). Putting pd in front makes it clear which library the read_csv() function belongs to.

The input argument of the read_csv() function is the name of the CSV file, which in our example above is 'data.csv'. Here we assign the data content of the CSV file to a variable called df.

In this tutorial we are going to use the solubility dataset, which is available at the URL below, so we load the data using the following code:

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/dataprofessor/data/master/delaney_solubility_with_descriptors.csv')
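
As a quick sanity check (optional), you can confirm the dimensions described earlier and preview the first rows; df.shape and df.head() are standard pandas features:

print(df.shape)   # (number of rows, number of columns) — 5 columns here
print(df.head())  # the first 5 rows of the dataset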

2.2.2. Data processing

Now that we have the data in the df data frame, we need to prepare it in a form suitable for the scikit-learn library, which expects the features and the target to be separated.

How do we do that? We need to split the data into 2 variables: X and y.

The first 4 columns will be assigned to the variable X, and the last column will be assigned to the variable y.

2.2.2.1. Assigning the X variable

To assign the first 4 columns to the variable X, we use the following line of code:

X = df.drop(['logS'], axis=1)

As we can see, we did this by dropping the last column (logS); the axis=1 argument tells drop() to remove a column rather than a row, leaving the first 4 columns in X.

2.2.2.2. Assigning the y variable

To assign the last column to the variable y, we simply select it as follows:

y = df.iloc[:,-1]

As you can see, we did this by explicitly selecting the last column. Two alternative approaches can be used to obtain the same result, the first being the following:

y = df['logS']

And the second approach is as follows:

y = df.logS

Select one of the above options and proceed to the next step.
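
If you want to convince yourself that the three options are equivalent, a quick check (optional) compares them; Series.equals() is a standard pandas method:

# All three selections return the same pandas Series
print(df.iloc[:, -1].equals(df['logS']))  # True
print(df['logS'].equals(df.logS))         # True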

3. Data splitting

Data splitting allows you to objectively evaluate the model’s performance on fresh data that the model has not previously seen. Specifically, if the full dataset is split into training and test sets using an 80/20 split ratio, the model can be built on the 80% subset (which we call the training set) and subsequently evaluated on the 20% subset (which we call the test set). In addition to applying the trained model to the test set, we can also apply it to the training set (i.e., the data that was used to build the model in the first place).

Comparing the model’s performance on the two data splits (i.e., the training and test sets) allows us to assess whether the model is underfitting or overfitting. Underfitting typically occurs when performance is poor on both the training and test sets, whereas with overfitting the test set performs significantly worse than the training set.

To split the data, the scikit-learn library provides the train_test_split() function. An example of using this function to split a dataset into training and test sets is shown below:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

In the above code, the first line imports the train_test_split() function from the sklearn.model_selection submodule. As we can see, the input arguments are the X and y data, the test set size of 0.2 (i.e., 20% of the data goes to the test set and the remaining 80% to the training set), and a random seed, which is set to 42.

From the above code we can see that we created 4 variables at once: the X and y portions of the training set (X_train and y_train) and of the test set (X_test and y_test).
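
Before moving on, you can optionally verify the 80/20 split by inspecting the shapes of the four variables:

# Row counts should reflect the 80/20 split
print(X_train.shape, y_train.shape)  # about 80% of the rows
print(X_test.shape, y_test.shape)    # about 20% of the rows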

Now we are ready to use these 4 variables to build the model.

4. Model building

Here comes the most interesting part! Now we are going to build some regression models.

4.1. Linear regression

4.1.1. Model building

Let’s start with traditional linear regression.

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)

The first line imports the LinearRegression() function from the sklearn.linear_model submodule. A LinearRegression() instance is then assigned to the variable lr, and the .fit() method performs the actual training of the model on the input data X_train and y_train.

Now that the model is built, we are going to use it to make predictions on the training and test sets as follows:

y_lr_train_pred = lr.predict(X_train)
y_lr_test_pred = lr.predict(X_test)

As we can see in the above code, the model (lr) makes predictions via the lr.predict() method on both the training and test sets.

4.1.2. Model performance

Now we are going to calculate the performance metrics to be able to determine the performance of the model.

from sklearn.metrics import mean_squared_error, r2_score

lr_train_mse = mean_squared_error(y_train, y_lr_train_pred)
lr_train_r2 = r2_score(y_train, y_lr_train_pred)
lr_test_mse = mean_squared_error(y_test, y_lr_test_pred)
lr_test_r2 = r2_score(y_test, y_lr_test_pred)

In the above code, we import the mean_squared_error and r2_score functions from the sklearn.metrics submodule to calculate the performance metrics. The input arguments for both functions are the actual y values and the predicted y values (y_lr_train_pred and y_lr_test_pred).

Let’s talk about the naming convention used here: we assign the results to self-explanatory variables that explicitly state what they contain. For example, lr_train_mse and lr_train_r2 explicitly indicate that the variables hold the MSE and R2 metrics of a linear regression model evaluated on the training set. The benefit of this naming convention is that the performance of any future model built with a different machine learning algorithm can be identified at a glance from its variable names. For example, rf_train_mse would denote the training set MSE of a model built using random forest.

Performance metrics can be displayed by simply printing out the variables. For example, to print the MSE for the training set:

print(lr_train_mse)

which gives 1.0139894491573003.

To see the results for the other three metrics, we could also print them out one by one, but that would be a bit repetitive.

Another way is to create a neat display of the four metrics like this:

lr_results = pd.DataFrame(['Linear regression', lr_train_mse, lr_train_r2, lr_test_mse, lr_test_r2]).transpose()
lr_results.columns = ['Method', 'Training MSE', 'Training R2', 'Test MSE', 'Test R2']

which collects the method name and the four metrics into a single-row data frame (display it by printing lr_results).

4.2. Random forest

Random Forest (RF) is an ensemble learning method that combines the predictions of multiple decision trees. A distinctive feature of RF is its built-in feature importance (i.e., the impurity-based importance values it produces for the models it builds); we will take a peek at these after fitting the model below.

4.2.1. Model building

Let’s now build an RF model using the following code:

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(max_depth=2, random_state=42)
rf.fit(X_train, y_train)

In the above code, the first line imports the RandomForestRegressor function (i.e., a regressor) from the sklearn.ensemble submodule. It should be noted that RandomForestRegressor is the regression version (i.e., it is used when the Y variable contains numeric values), and its sister version, RandomForestClassifier, is the classification version (i.e., it is used when the Y variable contains categorical values).

In this example we set the max_depth parameter to 2 and the random seed (via random_state) to 42. Finally, the model is trained by calling rf.fit() with X_train and y_train as input.
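
As promised above, here is a minimal sketch of inspecting the built-in feature importances of the fitted model; feature_importances_ is a standard attribute of fitted scikit-learn tree ensembles, and the Series wrapper is just for readable output:

import pandas as pd

# Impurity-based importance of each of the 4 molecular descriptors
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))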

Now we are going to apply the built model to make predictions on the training and test sets as follows:

y_rf_train_pred = rf.predict(X_train)
y_rf_test_pred = rf.predict(X_test)

As with the lr model, the rf model makes predictions via the rf.predict() method on the training and test sets.

4.2.2. Model performance

Let’s now calculate the performance metrics for the constructed random forest model as follows:

from sklearn.metrics import mean_squared_error, r2_score
rf_train_mse = mean_squared_error(y_train, y_rf_train_pred)
rf_train_r2 = r2_score(y_train, y_rf_train_pred)
rf_test_mse = mean_squared_error(y_test, y_rf_test_pred)
rf_test_r2 = r2_score(y_test, y_rf_test_pred)

To consolidate the results, we will use the following code:

rf_results = pd.DataFrame(['Random forest', rf_train_mse, rf_train_r2, rf_test_mse, rf_test_r2]).transpose()
rf_results.columns = ['Method', 'Training MSE', 'Training R2', 'Test MSE', 'Test R2']

which produces the corresponding single-row data frame for the random forest model.

4.3. Other machine learning algorithms

To build models using other machine learning algorithms (besides the sklearn.ensemble.RandomForestRegressor we used above), we only need to decide which algorithms to use from the available regressors (since the dataset’s Y variable contains numerical values).

4.3.1. List of regressors

Let’s look at some examples of regressors we can choose from:

  • sklearn.linear_model.Ridge

  • sklearn.linear_model.SGDRegressor

  • sklearn.ensemble.ExtraTreesRegressor

  • sklearn.ensemble.GradientBoostingRegressor

  • sklearn.neighbors.KNeighborsRegressor

  • sklearn.neural_network.MLPRegressor

  • sklearn.tree.DecisionTreeRegressor

  • sklearn.tree.ExtraTreeRegressor

  • sklearn.svm.LinearSVR

  • sklearn.svm.SVR

A more extensive list of regressors can be found in the scikit-learn API reference.
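
To illustrate how interchangeable these regressors are, here is a hedged sketch that loops over a few of them with default settings and collects the same four metrics as before; the particular regressors chosen and the results_list structure are illustrative assumptions, not part of the original workflow:

from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

regressors = {
    'Ridge': Ridge(),
    'kNN': KNeighborsRegressor(),
    'Decision tree': DecisionTreeRegressor(random_state=42),
}

results_list = []
for name, model in regressors.items():
    model.fit(X_train, y_train)  # same workflow for every regressor
    results_list.append([
        name,
        mean_squared_error(y_train, model.predict(X_train)),
        r2_score(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
        r2_score(y_test, model.predict(X_test)),
    ])

results = pd.DataFrame(results_list,
                       columns=['Method', 'Training MSE', 'Training R2',
                                'Test MSE', 'Test R2'])
print(results)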

4.3.2. Using a regressor

Let’s say we would like to use sklearn.tree.ExtraTreeRegressor; we would use it like this:

from sklearn.tree import ExtraTreeRegressor

et = ExtraTreeRegressor(random_state=42)
et.fit(X_train, y_train)

Notice how we import the regressor function sklearn.tree.ExtraTreeRegressor in the following way:

from sklearn.tree import ExtraTreeRegressor

The regressor is then assigned to a variable (et in this example) and trained using the .fit() method, as in et.fit().
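
From here the workflow is the same as before. A minimal sketch, assuming the naming convention from section 4.1.2 carries over to this model (the et_* names are illustrative):

from sklearn.metrics import mean_squared_error, r2_score

# Predict on both splits with the fitted ExtraTreeRegressor
y_et_train_pred = et.predict(X_train)
y_et_test_pred = et.predict(X_test)

# The same four metrics as for the earlier models
et_train_mse = mean_squared_error(y_train, y_et_train_pred)
et_train_r2 = r2_score(y_train, y_et_train_pred)
et_test_mse = mean_squared_error(y_test, y_et_test_pred)
et_test_r2 = r2_score(y_test, y_et_test_pred)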

4.4. Combining results

Recall that the model performance metrics we generated above for the linear regression and random forest models are stored in the variables lr_results and rf_results.

Since both variables are data frames, we are going to combine them using the pd.concat() function as shown below:

pd.concat([lr_results, rf_results])

This stacks the two single-row data frames into a single comparison table.

It should be noted that performance metrics from additional learning methods can also be added by expanding the list [lr_results, rf_results].

For example, svm_results could be added to the list, which would then become [lr_results, rf_results, svm_results].
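
For a tidier combined table, here is a small optional sketch; reset_index(drop=True) simply renumbers the stacked rows:

# Stack the per-model result rows and renumber the index
df_models = pd.concat([lr_results, rf_results], axis=0).reset_index(drop=True)
print(df_models)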

5. Visualizing the prediction results

Let us now visualize the relationship between the actual Y values and the predicted Y values, that is, between the experimental and predicted logS values.

We are going to use the Matplotlib library to construct the scatterplot and NumPy to compute a trend line for the data; the code is sketched below. We set the figure size to 5×5 using the figsize parameter of the plt.figure() function.

The plt.scatter() function creates the scatterplot, with y_train and y_lr_train_pred (i.e., the training set predictions made using linear regression) as input. The color is set to green using the HTML color code (hex code) #7CAE00.

A trend line is fitted with the np.polyfit() function and drawn with the plt.plot() function. Finally, the X and Y axis labels are added using the plt.xlabel() and plt.ylabel() functions, respectively.
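
The plotting code itself was originally shown as an image; the following is a minimal reconstruction based on the description above (the trend line color, the alpha transparency, and the exact axis label wording are assumptions, not specified in the text):

import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(5, 5))

# Scatterplot of experimental vs. predicted logS on the training set,
# colored with the hex code mentioned above
plt.scatter(x=y_train, y=y_lr_train_pred, c="#7CAE00", alpha=0.3)

# Fit a first-degree polynomial (a straight trend line) to the points
z = np.polyfit(y_train, y_lr_train_pred, 1)
p = np.poly1d(z)
plt.plot(y_train, p(y_train), color="red")  # trend line color is an assumption

plt.xlabel('Experimental logS')
plt.ylabel('Predicted logS')
plt.show()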

Running this code displays the scatterplot together with its trend line.

What’s next?

Congratulations on creating your first machine learning model!

What next, you ask? The answer is quite simple: build more models! Tune parameters, try new algorithms, experiment with adding new features to your machine learning pipeline, and most importantly, don’t be afraid to make mistakes. In fact, the fastest way to speed up your learning is to fail often, get back up, and try again. Learning is about having fun in the process, and if you persist long enough, you will become more confident on your path to becoming a data professional, whether you are a data scientist, a data analyst, or a data engineer. But most importantly, as I always like to say: the best way to learn data science is to do data science!
