Artificial Intelligence 9 min read

k-Nearest Neighbors (kNN) Algorithm: Overview, Pros/Cons, Data Preparation, Implementation, and Handwritten Digit Recognition

This article explains the k‑Nearest Neighbors classification method, discusses its advantages and drawbacks, describes data preparation and normalization, presents Python code for the algorithm and a full handwritten digit recognition project, and reports an error rate of about 1.2%.

Python Programming Learning Circle

1 Overview: The k‑Nearest Neighbors (k‑NN) algorithm classifies a sample by measuring the distance between its feature vector and the feature vectors of labeled training samples.

2 Advantages and disadvantages: high accuracy, insensitive to outliers, and no assumptions about the underlying data distribution; however, it is computationally expensive, has high space complexity (the entire training set must be kept in memory), and produces no compact trained model that can be saved. Suitable for numeric and nominal (categorical) data.

3 Data preparation: a test vector, a training sample matrix (with the labels stored separately as a label vector), and the parameter k (usually k ≤ 20).

Normalization formula: new_value = (old_value - min) / (max - min). Example Python implementation shown below.

<code>import numpy


class Normalization(object):
    def auto_norm(self, matrix):
        """Normalize the data column by column:
        new_value = (old_value - min) / (max - min)
        :param matrix: input matrix
        :return: normalized matrix, per-column ranges, per-column minimums
        """
        # axis 0: operate down each column
        # vector of per-column minimums
        min_value = matrix.min(0)
        # vector of per-column maximums
        max_value = matrix.max(0)
        # per-column ranges
        ranges = max_value - min_value

        m = matrix.shape[0]
        # numerator: subtract each column's minimum from every row
        norm_matrix = matrix - numpy.tile(min_value, (m, 1))
        # element-wise division, not matrix division
        # (matrix division would be numpy.linalg.solve(matA, matB))
        norm_matrix = norm_matrix / numpy.tile(ranges, (m, 1))
        return norm_matrix, ranges, min_value
</code>
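As a quick check of the formula, here is a minimal self-contained sketch (the sample values are illustrative, not from the article); it also shows that NumPy broadcasting makes the `tile` calls optional:

```python
import numpy

def auto_norm(matrix):
    """Rescale each column to [0, 1]: new = (old - min) / (max - min)."""
    min_value = matrix.min(0)           # per-column minimums
    ranges = matrix.max(0) - min_value  # per-column ranges
    # broadcasting subtracts/divides row-wise without an explicit tile
    return (matrix - min_value) / ranges, ranges, min_value

data = numpy.array([[10.0, 400.0],
                    [20.0, 800.0],
                    [30.0, 1200.0]])
norm_matrix, ranges, min_value = auto_norm(data)
print(norm_matrix)  # each column rescaled to [0, 1]
print(ranges)       # [ 20. 800.]
print(min_value)    # [ 10. 400.]
```

Broadcasting a 1-D row vector against a 2-D matrix is equivalent to the `numpy.tile(..., (m, 1))` copies used above, just without materializing the copies.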

4 Principle: After preparing the data, tile the test vector to match the number of training samples, compute the Euclidean distance to each training sample, sort the distances, select the k nearest neighbors, vote by label frequency, and return the label with the most votes.

5 Core k‑NN implementation in Python (shown below) includes functions to create a toy dataset, compute Euclidean distance, and classify a new sample.

<code>import operator

import numpy


class kNN(object):
    def createDataSet(self):
        """Create a toy dataset.
        :return: sample matrix, labels
        """
        group = numpy.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
        labels = ['A', 'A', 'B', 'B']
        return group, labels

    def classify0(self, inX, dataSet, labels, k):
        """k-nearest neighbors, using Euclidean distance between vectors.
        :param inX: input vector to classify
        :param dataSet: training sample matrix
        :param labels: label vector
        :param k: number of nearest neighbors
        :return: predicted label
        """
        dataSetSize = dataSet.shape[0]
        # copy inX across every row, then take element-wise differences
        diffMat = numpy.tile(inX, (dataSetSize, 1)) - dataSet
        sqDiffMat = diffMat ** 2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances ** 0.5
        # indices of training samples sorted by increasing distance
        sortedDistIndices = distances.argsort()
        classCount = {}
        # vote: count the labels of the k nearest samples
        for i in range(k):
            voteIlabel = labels[sortedDistIndices[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
        # sort label counts in descending order and return the winner
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]


if __name__ == '__main__':
    knn = kNN()
    group, labels = knn.createDataSet()
    result = knn.classify0([0, 0], group, labels, 3)
    print(result)  # B
</code>

6 Practical application: Using the same k‑NN framework to build a simple handwritten digit recognizer, loading training and test image files, converting them to 1024‑dimensional vectors, classifying with k=3, and achieving an error rate of about 1.2%.

<code>... (output showing results and error count) ...
</code>
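The full recognizer code is not reproduced above. As a hedged sketch of the image-to-vector step it describes: each digit is typically stored as a 32×32 grid of '0'/'1' characters and flattened into a 1×1024 vector before being passed to `classify0`. The function name `img2vector` and the in-memory fake image below are illustrative assumptions, not the article's exact code:

```python
import numpy

def img2vector(lines):
    """Flatten a 32x32 text image (a list of '0'/'1' strings) into a 1x1024 vector."""
    vect = numpy.zeros((1, 1024))
    for i, line in enumerate(lines[:32]):
        for j in range(32):
            # row i, column j of the grid maps to flat index 32*i + j
            vect[0, 32 * i + j] = int(line[j])
    return vect

# Hypothetical example: a fake image whose first row is all ones
fake_image = ["1" * 32] + ["0" * 32] * 31
v = img2vector(fake_image)
print(v.shape)       # (1, 1024)
print(int(v.sum()))  # 32
```

In the full project, every training file is vectorized this way, the true digit is parsed from the file name, and each test vector is classified with `classify0(..., k=3)`, giving the roughly 1.2% error rate the article reports.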

The article concludes that the k‑NN method, despite its simplicity, can deliver good performance on digit classification tasks.

Tags: machine learning, Python, classification, kNN, Euclidean distance, handwritten digit recognition
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
