The K-Nearest Neighbors method: implementation with and without libraries

Let’s start the analysis of machine learning algorithms with the classification problem, the most transparent one to understand. To see what this task is and why it is solved at all, recall a very promising development – self-driving cars. Clearly, the car itself does not understand what a roadway is and cannot tell a person from a traffic light – it must be taught to distinguish signs, people, animals, and so on. Here, in addition to rather complex branches of machine learning such as computer vision and decision-making systems, classification is used: the car “learns” to distinguish obstacles that must be avoided, people who must be given way at crossings, and signs whose rules must be followed. In simple terms, the system assigns objects to a particular class in order to behave correctly when it encounters them; that is, classification in machine learning is nothing more than the task of assigning an object to one of several predefined classes based on its features.

Let’s start with kNN (k-nearest neighbors), one of the most common classification methods in ML. Unlike many other algorithms, it is quite simple to implement, so to illustrate how classification works in general we will first write our own implementation, look at the results by applying the method to the standard Iris dataset, and then compare it with the ready-made implementation from the sklearn library. We will not analyze subsequent algorithms as thoroughly, since their implementations are laborious – instead we will cover the general methodology and examine on what basis an algorithm decides in favor of a particular class.

The kNN method belongs to the category of lazy classifiers. In general, this means that no generalized model is built at the training stage: during “training” the algorithm does nothing except store the labeled training data. The classification process itself begins only when new, unlabeled data appears – the algorithm then computes, using some distance metric, the distances in feature space between the already labeled objects and the still unknown ones. Obviously, if we compute the distances between the known data and each new set of unlabeled data, we will get different distances every time, so it makes no sense to “train” the algorithm on fixed distances in advance so that it remembers which position corresponds to which class – there can be infinitely many positions, and storing them all in memory is impossible.

Another important point: how do we judge how accurately the model makes predictions? One of the simplest ways to measure this is to compute the ratio of the number of correct predictions to the total number of predictions made. This metric is called accuracy, the proportion of correct answers. Its values lie in the range from 0 to 1, where 0 corresponds to a completely useless model and 1 to a perfectly accurate one.
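For illustration only, here is a minimal sketch of this metric on made-up labels (the arrays below are purely hypothetical and serve no purpose beyond showing the ratio):

actual = [0, 1, 2, 2, 1]     # hypothetical true class labels
predicted = [0, 1, 2, 1, 1]  # hypothetical model predictions

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)  # 0.8 - four out of five predictions are correct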

The kNN algorithm consists of three consecutive steps:

1) calculate the distance from the target object (the one to be classified) to each object of the training sample (objects already labeled with some class);

2) select the k objects of the training sample with the smallest distances (at the first stage k is chosen arbitrarily; the best value of k is then selected iteratively, based on the accuracy of the predictions obtained for each candidate k);

3) assign the object the class that is most common among its k nearest neighbors (this can be a number or a class name, depending on how the classes were originally designated – in the self-driving car example, it could be “person” or “concrete block”).

It is worth noting here that the model determines the set of classes during classification itself: information about the class is contained in the labeled data objects themselves (in all the neighbors), so simply counting the number of instances of each class is not difficult for the algorithm.
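As a quick illustration (a hedged sketch with made-up neighbor labels, not part of the implementation below), the majority vote among the neighbors could be taken with a plain dictionary or with collections.Counter from the standard library:

from collections import Counter

neighbor_labels = ['setosa', 'setosa', 'versicolor', 'setosa', 'virginica']  # hypothetical k=5 neighbors
print(Counter(neighbor_labels).most_common(1)[0][0])  # 'setosa' - the most common class wins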

Let’s look at an example of how this algorithm works. We will build the algorithm both without libraries and with libraries, using a dataset of iris flowers. We begin by importing the Iris dataset. It is a collection of morphological measurements of several varieties of irises, with 4 characteristics recorded for each plant:

sepal length, sepal width, petal length, petal width.

Let’s load the dataset.

from sklearn.datasets import load_iris 
iris_dataset = load_iris()

Let’s see what the dataset contains.

iris_dataset.keys()

Let’s look at our classes or types of flowers.

iris_dataset['target_names']

Let’s see what characterizes each class, what features we have.

iris_dataset['feature_names']

Let’s look at the type of object and the size of our feature selection.

print(type(iris_dataset['data'])) 
iris_dataset['data'].shape

Let’s form a dataset and build a graph of dependencies with histograms.

import pandas as pd

iris_dataframe = pd.DataFrame(iris_dataset['data'], columns=iris_dataset.feature_names)
scat_mtrx = pd.plotting.scatter_matrix(iris_dataframe, c=iris_dataset['target'], figsize=(10, 10), marker="o",
                                       hist_kwds={'bins': 20}, s=40, alpha=.8)

From the graphs, we can see that the classes seem to be well separated by the measurements of the petals and sepals, so a machine learning model will most likely be able to learn to separate them quite well.

But with four parameters it is quite difficult to imagine how the objects are located relative to each other, since we would have to work in four-dimensional space. The graphs show that the flowers are best separated by the measurements of petal length and petal width, so for clarity we will keep only these features.

Let’s take a closer look at these two most informative features.

iris_dataframe_simple = pd.DataFrame(iris_dataset.data[:, 2:4], columns=iris_dataset.feature_names[2:4])
scat_mtrx = pd.plotting.scatter_matrix(iris_dataframe_simple, c=iris_dataset['target'], figsize=(10, 10), marker="o",
                                       hist_kwds={'bins': 20}, s=40, alpha=.8)

Let’s divide the data into training and test datasets and, for ease of implementation of the algorithm, combine arrays of object features and labels of their classes so that it is clear which class each object belongs to.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(iris_dataset.data[:, 2:4], 
                                                    iris_dataset['target'],
                                                    random_state=0) # random_state - for reproducibility

print(f'X_train shape: {x_train.shape}, y_train shape: {y_train.shape},\n'
      f'X_test shape: {x_test.shape}, y_test shape: {y_test.shape}')
import numpy as np

x_train_concat = np.concatenate((x_train, y_train.reshape(-1, 1)), axis=1)  # append the class labels as the last column
x_test_concat = np.concatenate((x_test, y_test.reshape(-1, 1)), axis=1)
print(f'X_train shape: {x_train_concat.shape},\n'
      f'X_test shape: {x_test_concat.shape}')

As we can see, now we have class labels in the last column.
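For instance (the exact numbers depend on the split), printing the first training row shows the two feature values followed by the class label:

print(x_train_concat[0])  # two feature values (petal length, petal width), then the class label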

Let’s start implementing the algorithm.

First, let’s define a metric by which we will determine the distance between objects. Denote by x = (x1, x2, …, xn) the coordinates of the object x in n-dimensional space, and by y = (y1, y2, …, yn) the coordinates of the object y.

By default, the library implementation uses the Minkowski metric, which in the case of the degree p = 2 reduces to the Euclidean metric known from school geometry – the distance between two points in space:

d(x, y) = sqrt((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)

We will use it.
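As a quick sanity check (a small sketch assuming SciPy is installed; the two points are made up for illustration), the Minkowski distance with p = 2 indeed coincides with the Euclidean distance:

import numpy as np
from scipy.spatial.distance import euclidean, minkowski

a = np.array([1.0, 2.0])  # hypothetical points
b = np.array([4.0, 6.0])

print(minkowski(a, b, p=2))  # 5.0
print(euclidean(a, b))       # 5.0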

import math

def euclidean_distance(data1, data2):
    """Euclidean distance between two objects; the last element (the class label) is ignored."""
    distance = 0
    for i in range(len(data1) - 1):  # skip the last column, which stores the class label
        distance += (data1[i] - data2[i]) ** 2
    return math.sqrt(distance)

We calculate the distances to all points of the training sample and select k neighbors (that is, those with the minimum distances).

def get_neighbors(train, test, k=1):
    # pair each training object's class label with its distance to the test object
    distances = [(train[i][-1], euclidean_distance(train[i], test))
                 for i in range(len(train))]
    distances.sort(key=lambda elem: elem[1])  # sort by distance, ascending

    neighbors = [distances[i][0] for i in range(k)]  # class labels of the k nearest objects
    return neighbors

Now we get a prediction based on neighbor classes. Let’s count how many objects of each class are present among the k closest to the target, and then assign it to the class with the most instances.

def prediction(neighbors):
    # count how many of the neighbors belong to each class
    count = {}
    for instance in neighbors:
        if instance in count:
            count[instance] += 1
        else:
            count[instance] = 1
    # the class with the most votes wins
    target = max(count.items(), key=lambda x: x[1])[0]
    return target

Let’s write the last function, which evaluates the accuracy of the predictions. It was discussed at the very beginning – it is simply the ratio of correct predictions to the total number of predictions.

def accuracy(test, test_prediction):
    correct = 0
    for i in range(len(test)):
        if test[i][-1] == test_prediction[i]:  # the true label is stored in the last column
            correct += 1
    return correct / len(test)

Let’s see how our algorithm works.

predictions = []
for i in range(len(x_test_concat)):
    neighbors = get_neighbors(x_train_concat, x_test_concat[i], k=5)
    result = prediction(neighbors)
    predictions.append(result)
#     print(f'predicted = {result}, actual = {x_test_concat[i][-1]}')  # uncomment to see which specific predictions are incorrect

accuracy_own = accuracy(x_test_concat, predictions)  # a separate name so we do not shadow the accuracy() function
print(f'Accuracy: {accuracy_own}')

Now we import the library version of the algorithm.

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)

The knn object encapsulates the algorithm that will be used to build the model from the training data, as well as to predict new data points. It will also contain whatever the algorithm has extracted from the training data. In the case of KNeighborsClassifier, it will simply store the training set.

To build a model on the training set, the fit method of the knn object is called; it takes as arguments the NumPy array x_train containing the training data and the NumPy array y_train with the corresponding training labels.

knn_model = knn.fit(x_train, y_train)
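Since kNN is lazy, fitting mostly amounts to remembering the training data. As a small sketch (assuming a reasonably recent scikit-learn version), we can peek at a couple of fitted attributes:

print(knn_model.classes_)        # class labels seen during fit
print(knn_model.n_samples_fit_)  # number of training samples stored by the model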

For predictions, the predict method is called, which takes test data as arguments.

knn_predictions = knn.predict(x_test)
knn_predictions

For verification, we import a simple built-in metric accuracy_score, which determines the proportion of correct answers.

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, knn_predictions)
print(f'Accuracy: {accuracy}')

As we can see, with these parameters the manually implemented algorithm is comparable in accuracy to the library model. In practice, however, it is worth using ready-made implementations, since they are usually much better optimized and scale better to large data samples.

In general, despite the fact that the algorithm is quite simple to understand and implement, it has found wide application in areas such as recommender systems, semantic search, and anomaly detection. However, to apply it, we must be able to store the entire training set in memory, and performing classifications can be computationally expensive since the algorithm analyzes all data points for each classification. For these reasons, kNN is best applied to small datasets that do not have a huge feature set.

We have built a kNN model on the Iris dataset using only the two most informative features.

However, the other features could also be important for the model: they provide more detailed information about the objects and, accordingly, may lead to more accurate predictions. Let’s build such models.

Let’s add each of the remaining features, one at a time, to the pair (petal length, petal width) used above, to get: 1) (sepal length, petal length, petal width); 2) (sepal width, petal length, petal width).

Note: Features can be placed in any order.

Let’s repeat all the steps from scratch and build additional models on several features.

from sklearn.datasets import load_iris

Recall the order of the features in the data array

iris_dataset = load_iris()
iris_dataset['feature_names']

To form an array of new features, you can select columns by condition or simply delete one of the columns we don’t need using the library function numpy.delete

Example

import numpy as np
a = np.array([[ 0,  1,  2,  3],
               [ 4,  5,  6,  7],
               [ 8,  9, 10, 11],
               [12, 13, 14, 15]])

a_new = np.delete(a, 0, axis=1)  # delete the first column (axis=1 refers to columns)
a_new

Let’s look again and remember what the dataset contains.

iris_dataset.keys()

Types of irises

iris_dataset['target_names']

Characteristics of each flower

iris_dataset['feature_names']

Let’s create a dataframe

import pandas as pd
iris_dataframe = pd.DataFrame(iris_dataset['data'], columns=iris_dataset.feature_names)  # a separate name, so the original iris_dataset (with 'target') is not overwritten
iris_dataframe.head(2)

As described above, let’s add the remaining features to (petal length, petal width) to get: 1) (sepal length, petal length, petal width); 2) (sepal width, petal length, petal width).

iris_dataset_1 = iris_dataframe[['sepal length (cm)', 'petal length (cm)', 'petal width (cm)']]
iris_dataset_2 = iris_dataframe[['sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]

Let’s check what we got.

iris_dataset_1.head(2)
iris_dataset_2.head(2)

Now let’s use three-dimensional plots to see how well the data is separated by each of the two sets of three parameters.

from mpl_toolkits import mplot3d
import matplotlib.pyplot as plt

An example of building a three-dimensional plot

ax = plt.axes(projection='3d')

zdata = 15 * np.random.random(100)                  # Z-axis points
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)  # X-axis points
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)  # Y-axis points
colors = np.random.randint(3, size=100)

ax.scatter3D(xdata, ydata, zdata, alpha=.8, c=colors)

Note: to set the color in the function, use c=iris_dataset.target.

ax = plt.axes(projection='3d')


zdata = iris_dataset_1['sepal length (cm)']  # Z-axis points
xdata = iris_dataset_1['petal length (cm)']  # X-axis points
ydata = iris_dataset_1['petal width (cm)']   # Y-axis points
colors = iris_dataset.target                 # color the points by their class label, as in the note above

ax.scatter3D(xdata, ydata, zdata, alpha=.8, c=colors)

Let’s do the same for another dataframe

ax = plt.axes(projection='3d')

zdata = iris_dataset_2['sepal width (cm)']   # Z-axis points
xdata = iris_dataset_2['petal length (cm)']  # X-axis points
ydata = iris_dataset_2['petal width (cm)']   # Y-axis points
colors = iris_dataset.target                 # color the points by their class label

ax.scatter3D(xdata, ydata, zdata, alpha=.8, c=colors)

Using the sklearn.model_selection.train_test_split function, we divide the data into training and test datasets and then, using the library implementation sklearn.neighbors.KNeighborsClassifier, build a model for each of the datasets iris_dataset_1 and iris_dataset_2 (using the default n_neighbors=5).

Note: in the train_test_split function, we pass the random_state parameter (random_state=0 in the code below) for reproducible results.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

Let’s divide our datasets into training and test sets.

x_train_1, x_test_1, y_train_1, y_test_1 = train_test_split(iris_dataset_1, iris_dataset['target'], test_size = 0.2, random_state=0)
x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split(iris_dataset_2, iris_dataset['target'], test_size = 0.2, random_state=0)

Let’s check the dimensions.

print(x_train_1.shape)
print(y_train_1.shape)
print(x_test_1.shape)
print(y_test_1.shape)

Let’s create objects of the k-nearest-neighbors classifier class and train two models on the different feature sets. Note that fit returns the estimator itself, so each feature set needs its own classifier instance – otherwise the second call to fit would overwrite the first model.

knn_1 = KNeighborsClassifier(n_neighbors=5)
knn_2 = KNeighborsClassifier(n_neighbors=5)
knn_model_1 = knn_1.fit(x_train_1, y_train_1)
knn_model_2 = knn_2.fit(x_train_2, y_train_2)

Let’s look at the predictions of models on test samples.

knn_predictions_1 = knn_model_1.predict(x_test_1)
print(knn_predictions_1)
print(y_test_1)
knn_predictions_2 = knn_model_2.predict(x_test_2)
print(knn_predictions_2)
print(y_test_2)

Let’s check the accuracy of both models using the built-in function sklearn.metrics.accuracy_score and compare the result with the one obtained earlier on the dataset with two features.

from sklearn.metrics import accuracy_score
accuracy_1 = accuracy_score(y_test_1, knn_predictions_1)
accuracy_2 = accuracy_score(y_test_2, knn_predictions_2)

print(f'Accuracy_1: {accuracy_1}, accuracy_2: {accuracy_2}')

Let’s tune the model on the data x_train_1, y_train_1 with the n_neighbors hyperparameter ranging from 1 to 20 inclusive, and find the n_neighbors values that give the highest result of the accuracy_score() function.

Note: You can use a for loop to avoid manually writing all 20 model variations.

iris_dataset = load_iris()

iris_dataset_for_split = pd.DataFrame(iris_dataset['data'], columns=iris_dataset.feature_names)

iris_dataset_1 = iris_dataset_for_split[['sepal length (cm)','petal length (cm)', 'petal width (cm)']]

iris_dataset_2 = iris_dataset_for_split[['sepal width (cm)','petal length (cm)', 'petal width (cm)']]

x_train_1, x_test_1, y_train_1, y_test_1 = train_test_split(iris_dataset_1, iris_dataset['target'], test_size = 0.2, random_state=0)
x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split(iris_dataset_2, iris_dataset['target'], test_size = 0.2, random_state=0)

for k in range(1, 21):  # n_neighbors from 1 to 20 inclusive
    knn = KNeighborsClassifier(n_neighbors=k)
    knn_model_1 = knn.fit(x_train_1, y_train_1)
    knn_predictions_1 = knn_model_1.predict(x_test_1)
    accuracy_1 = accuracy_score(y_test_1, knn_predictions_1)
    print(f'Accuracy_1: {accuracy_1}, n_neighbors: {k}')

As we can see, the highest accuracy on test set 1 is reached for all values of n_neighbors except 7, 8, 9, 11 and 13.

With this, we have covered the theory and the principle of the library implementation, and built several kNN models on a dataframe with features of Iris flowers.
