Introduction to Machine Learning Concepts: Data, Features, Labels, Training, and Common Algorithms
This article provides a beginner-friendly overview of machine learning fundamentals, covering the definition of data, the distinction between features and labels, types of features, dimensionality, training and test datasets, normalization, supervised and unsupervised learning methods, algorithm selection, development workflow, and recommended Python libraries such as NumPy.
Machine learning often feels lofty because of its terminology; this article explains the core concepts in plain language.
1. Data
In programming we frequently use databases; a row in a database table corresponds to a record, which contains many attributes. These attributes map directly to features in machine learning.
2. Features
Each attribute of a record is called a feature in machine learning. Features are divided into discrete and continuous types.
2.1 Discrete Features
The distinction between continuous and discrete values is familiar from high‑school math. As an example, consider a sample Bird class with the attributes weight, length, fins, color, and type:
Color is an enumerated feature.
Weight and length are numerical features.
Fins are boolean features (also called binary features).
Type is the target variable.
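The Bird record described above can be sketched as a small Python class. This is only an illustration: the `Color` values, the units, and the sample bird are assumptions, not data from the article.

```python
from dataclasses import dataclass
from enum import Enum


class Color(Enum):
    """Enumerated feature: one value out of a fixed set (values are illustrative)."""
    BROWN = "brown"
    GREY = "grey"
    BLACK = "black"


@dataclass
class Bird:
    weight: float   # numerical feature (grams, assumed unit)
    length: float   # numerical feature (centimetres, assumed unit)
    fins: bool      # boolean (binary) feature
    color: Color    # enumerated feature
    type: str       # target variable, not a feature

    def features(self):
        """Return only the feature values, leaving out the target."""
        return [self.weight, self.length, self.fins, self.color]


# A made-up record for illustration.
bird = Bird(weight=1000.1, length=125.0, fins=False, color=Color.BROWN, type="hawk")
print(bird.features())
print(bird.type)
```

Keeping the target out of `features()` mirrors the split between features and target variable used throughout the rest of the article.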
2.2 Continuous Features
Continuous features can take any value within an interval, e.g., a temperature anywhere in the range 19~28°C, or the temperatures at which water is frozen (0°C and below).
3. Dimensionality
The number of features equals the dimensionality. For the Bird class with four features, each data vector has four dimensions.
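As a rough sketch of what this means in practice, a dataset is often stored as a 2D NumPy array with one row per record and one column per feature, so the dimensionality is the number of columns. The numeric values and the integer encoding of color below are made up for illustration.

```python
import numpy as np

# Each row is one bird: [weight, length, fins (0/1), color (integer code)].
# All values are invented for this example.
X = np.array([
    [1000.1, 125.0, 0, 0],
    [3000.7, 200.0, 0, 1],
    [4.1, 11.0, 0, 2],
])

n_samples, n_features = X.shape
print(n_samples, n_features)  # 3 4
```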
4. Target Variable (Label)
The target variable is the value the algorithm predicts, e.g., y in y = x + 1, where x is the feature. In classification it is usually categorical (discrete), while in regression it is continuous. The target must be present in the training set so that the algorithm can learn the relationship between the features and the label.
5. Training
Training (algorithm training) feeds a large set of labeled data to an algorithm in order to produce a model. The data used for training is called the training set.
Example: feeding data to an algorithm yields f(x) = x/2, which becomes the model.
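A minimal sketch of this idea, assuming the algorithm is a least-squares line fit done with `numpy.polyfit` (the article does not say which algorithm it has in mind) and using invented training data that follows y = x / 2:

```python
import numpy as np

# Labeled training data: inputs x and targets y (illustrative values, y = x / 2).
x_train = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# "Training": fit a degree-1 polynomial y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(x_train, y_train, deg=1)


def model(x):
    """The learned model; for this data it is approximately f(x) = x / 2."""
    return slope * x + intercept


print(slope, intercept)
print(model(12.0))
```

The fitted slope comes out close to 0.5 and the intercept close to 0, so the learned `model` behaves like f(x) = x/2 on inputs it has not seen.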
6. Datasets
6.1 Training Dataset
The training dataset provides input data for model training.
6.2 Test Dataset
The test dataset is used after training to evaluate the model’s performance; it must contain the target variable to measure accuracy and must not be used during training.
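One common way to obtain the two datasets is to shuffle the labeled records and hold back a fraction for testing. The helper below is a hypothetical sketch (its name, the 20% test fraction, and the fixed seed are assumptions chosen for illustration).

```python
import numpy as np


def split_train_test(X, y, test_fraction=0.2, seed=0):
    """Shuffle the row indices, then hold back a fraction of the rows for testing."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Ten made-up records with two features each, and one target value per record.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = split_train_test(X, y)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Because every record lands in exactly one of the two sets, the test records are never seen during training.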
7. Normalization
Normalization (numeric scaling) rescales features to the 0~1 range to avoid dominance of large‑scale attributes in distance calculations.
Simple formula: new_value = (old_value - min) / (max - min) where min and max are the feature’s minimum and maximum in the dataset.
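The formula can be applied to every feature column at once with NumPy. This is a sketch under one added assumption: a column whose values are all equal (max == min) is mapped to 0 to avoid dividing by zero, a case the formula itself does not cover.

```python
import numpy as np


def min_max_normalize(X):
    """Rescale each column to the 0~1 range: (old_value - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    span = col_max - col_min
    # Assumption: constant columns are mapped to 0 instead of dividing by zero.
    span[span == 0] = 1.0
    return (X - col_min) / span


# Two features on very different scales (made-up values).
X = np.array([[1000.0, 10.0],
              [3000.0, 20.0],
              [2000.0, 15.0]])
print(min_max_normalize(X))
```

After scaling, both columns hold the values 0, 1, and 0.5, so neither feature dominates a distance calculation just because of its units.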
8. Supervised Learning
Supervised learning uses labeled data. It requires knowledge of both features and target variable.
8.1 Classification
Classification assigns data to predefined categories; it is a primary task of machine learning.
Banana   Orange   Fork      Cola
  |        |        |        |
Fruit    Fruit    Utensil   Drink
8.2 Regression
Regression predicts continuous values, e.g., fitting a curve to data points.
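As one possible illustration of fitting a curve, the sketch below fits a quadratic to a handful of invented points with `numpy.polyfit` and then predicts a continuous value for a new input. The choice of a degree-2 polynomial is an assumption made for this example.

```python
import numpy as np

# Made-up data points that lie on the curve y = x**2 + 1.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([5.0, 2.0, 1.0, 2.0, 5.0])

# Fit a degree-2 polynomial; coefficients are ordered from highest power to lowest.
coefficients = np.polyfit(x, y, deg=2)

# Predict a continuous value for an input that was not in the data.
prediction = np.polyval(coefficients, 3.0)
print(coefficients)
print(prediction)
```

The coefficients come out close to [1, 0, 1], and the prediction for x = 3 is close to 10.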
8.3 Common Supervised Algorithms
(Illustration omitted.)
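As one concrete example of a supervised algorithm, here is a minimal sketch of k-nearest neighbors classification. It was picked for illustration and is not taken from the omitted figure; the function name, the data, and k = 3 are assumptions.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Label x_new with the majority label among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up training set: two features per record, labels "A" and "B".
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # A
print(knn_predict(X_train, y_train, np.array([5.1, 5.0])))  # B
```

Because the prediction depends on distances between records, features on very different scales are usually rescaled before this kind of algorithm is used.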
9. Unsupervised Learning
Unsupervised learning works with only features (no target). It discovers structure such as clusters.
9.1 Clustering
Clustering groups similar data points into clusters.
Apple  Orange    Spoon  Fork    Cola  Sprite
    \  /             \  /          \  /
    Fruit          Utensil         Drink
9.2 Density Estimation
Density estimation estimates the statistical distribution that the data follows.
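Two simple ways to do this can be sketched with NumPy: assume a normal distribution and estimate its mean and standard deviation, or build a normalized histogram. The synthetic samples and the choice of 20 bins are assumptions made for illustration.

```python
import numpy as np

# Synthetic one-dimensional data drawn from a normal distribution (illustrative only).
rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=1000)

# Parametric estimate: assume the data is normal and estimate its two parameters.
mu = samples.mean()
sigma = samples.std(ddof=1)


def estimated_density(x):
    """Normal probability density using the estimated mean and standard deviation."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


# Non-parametric estimate: a histogram scaled so that its total area is 1.
density, bin_edges = np.histogram(samples, bins=20, density=True)
area = np.sum(density * np.diff(bin_edges))

print(mu, sigma)
print(area)
```

No target variable is involved at any point: both estimates are computed from the feature values alone.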
9.3 Common Unsupervised Algorithms
(Illustration omitted.)
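As one concrete example of an unsupervised algorithm, here is a minimal sketch of k-means clustering. It was picked for illustration and is not taken from the omitted figure. Using the first k records as the starting centroids and running a fixed number of iterations are simplifying assumptions.

```python
import numpy as np


def k_means(X, k, n_iterations=10):
    """Group the rows of X into k clusters by alternating assignment and update steps."""
    # Simplification: the first k records are used as the starting centroids.
    centroids = X[:k].copy()
    for _ in range(n_iterations):
        # Assignment step: each record joins the cluster of its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Made-up records forming two groups, one near (1, 1) and one near (5, 5).
X = np.array([[1.0, 1.0], [5.0, 5.0], [1.2, 0.8],
              [5.2, 4.9], [0.9, 1.1], [4.8, 5.1]])

labels, centroids = k_means(X, k=2)
print(labels)     # [0 1 0 1 0 1]
print(centroids)
```

The algorithm only receives features; the cluster numbers it returns are arbitrary identifiers, not predefined categories.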
10. Choosing the Right Algorithm
Select an algorithm based on the task and data type: use supervised learning for predicting a target; if the target is discrete, choose a classification algorithm; if continuous, choose regression. If no target is needed, consider unsupervised methods such as clustering or density estimation.
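The decision rule in this section can be written down as a small function. The function name and its return strings are hypothetical; it only restates the rule and does not select a specific algorithm.

```python
def choose_method(has_target, target_is_discrete=False):
    """Return the family of methods suggested by the task and the type of target."""
    if has_target:
        # Supervised learning: the type of the target decides between the two tasks.
        return "classification" if target_is_discrete else "regression"
    # No target variable: fall back to unsupervised methods.
    return "clustering or density estimation"


print(choose_method(has_target=True, target_is_discrete=True))   # classification
print(choose_method(has_target=True, target_is_discrete=False))  # regression
print(choose_method(has_target=False))  # clustering or density estimation
```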
11. Machine Learning Development Steps
Collect data (web crawling, sensors, public datasets).
Prepare input data (formatting, discretization).
Analyze data (missing values, outliers, visualizations).
Train algorithm (choose an appropriate model; unsupervised methods have no target variable to fit against, so they skip this step).
Test algorithm (evaluate model performance).
Deploy algorithm (engineer the model for production).
12. Recommended Language and Libraries
Python is recommended for its powerful libraries and simplicity.
Common libraries: numpy for vector/matrix operations and matplotlib for 2D/3D plotting.
13. NumPy Common Operations
13.1 Create Random Data
import numpy
# 4x4 array of random values drawn uniformly from [0, 1)
numpy.random.rand(4, 4)
13.2 Convert Array to Matrix
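import numpy
random_array = numpy.random.rand(4, 4)
# numpy.asmatrix wraps the array as a matrix (numpy.mat was an alias for it, removed in NumPy 2.0)
random_matrix = numpy.asmatrix(random_array)
13.3 Matrix Inverse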
import numpy
random_matrix = numpy.asmatrix(numpy.random.rand(4, 4))
# The .I attribute of a matrix returns its inverse
random_matrix.I
13.4 Matrix Multiplication
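import numpy
random_matrix = numpy.asmatrix(numpy.random.rand(4, 4))
# For matrix objects, * is matrix multiplication; the result is the identity up to rounding error
result = random_matrix * random_matrix.I
13.5 Create Identity Matrix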
import numpy
# 4x4 identity matrix: ones on the diagonal, zeros elsewhere
matrix = numpy.eye(4)
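The operations in this section can also be written with plain arrays, using `numpy.linalg.inv` and the `@` operator instead of matrix objects. The sketch below combines them to check that a matrix times its inverse matches the identity matrix. Adding `4 * numpy.eye(4)` to the random array is an assumption made here so that the matrix is guaranteed to be invertible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 4x4 array; the added diagonal makes it strictly diagonally dominant,
# which guarantees that an inverse exists (an assumption made for this example).
a = rng.random((4, 4)) + 4 * np.eye(4)

a_inv = np.linalg.inv(a)   # inverse of a plain ndarray
product = a @ a_inv        # @ is matrix multiplication for ndarrays

# The product equals the identity matrix up to floating-point rounding.
print(np.allclose(product, np.eye(4)))  # True
```

Comparing with `np.allclose` rather than `==` matters here, because the product usually differs from the exact identity by tiny rounding errors.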
Python Programming Learning Circle
