
Essential Python Libraries for Data Analysis and Machine Learning: A Hands‑On Guide

This article introduces Python's core data structures and the most widely used libraries—NumPy, SciPy, Matplotlib, Pandas, Scikit‑Learn, Keras, and Gensim—providing concise explanations and runnable code examples to help readers quickly start data analysis and machine‑learning projects.

MaGe Linux Operations

Aug 19, 2019

Python Language Features for Data Analysis

The built‑in features used most often in data analysis and mining are lists (mutable), tuples (immutable), dictionaries (key‑value mappings), sets (unordered collections of unique items, like mathematical sets), and functional‑programming tools: lambda, map, filter, and reduce. In Python 3, reduce must be imported from functools.
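As a minimal sketch of these features in use, the snippet below exercises each container type and the functional tools. The variable names and sample values are made up for illustration.

```python
from functools import reduce  # in Python 3, reduce lives in functools

scores = [72, 88, 95, 60]        # list: mutable, ordered
scores.append(81)                # lists can grow in place

point = (3, 4)                   # tuple: immutable, ordered
ages = {'ann': 31, 'bob': 27}    # dict: key-value lookup
tags = {'a', 'b'} | {'b', 'c'}   # set union drops the duplicate 'b'

squares = list(map(lambda v: v * v, [1, 2, 3]))    # apply a function to each item
passed = list(filter(lambda v: v >= 80, scores))   # keep items that match a test
total = reduce(lambda acc, v: acc + v, scores, 0)  # fold all items into one value

print(squares)       # [1, 4, 9]
print(passed)        # [88, 95, 81]
print(total)         # 396
print(sorted(tags))  # ['a', 'b', 'c']
```

In everyday code a list comprehension or the built‑in sum() often reads more clearly than map, filter, or reduce; the functional forms are shown here because the text names them.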

Common Python Data‑Analysis Libraries

NumPy

NumPy provides true n‑dimensional arrays with C‑level performance and underpins libraries such as SciPy, Matplotlib, and Pandas.

import numpy as np  # standard alias

a = np.array([2, 0, 1, 5])
print(a)
print(a[:3])
print(a.min())
a.sort()  # sorts in place
print(a)

b = np.array([[1, 2, 3], [4, 5, 6]])
print(b * b)  # element-wise product, not matrix multiplication

Output:

[2 0 1 5]
[2 0 1]
0
[0 1 2 5]
[[ 1  4  9]
 [16 25 36]]
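Because `*` on arrays works element by element, it can help to see it next to a true matrix product and a broadcasting operation. The following is a small sketch reusing the array b from the example; the other variable names are illustrative.

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])

elementwise = b * b             # same shape as b, each entry squared
matrix_product = b @ b.T        # (2, 3) @ (3, 2) gives a (2, 2) matrix
centered = b - b.mean(axis=0)   # broadcasting: subtract each column's mean from every row

print(elementwise.shape)  # (2, 3)
print(matrix_product)     # [[14 32]
                          #  [32 77]]
print(centered)           # [[-1.5 -1.5 -1.5]
                          #  [ 1.5  1.5  1.5]]
```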

SciPy

Extends NumPy with scientific computing capabilities, offering optimization, linear algebra, integration, interpolation, FFT, signal and image processing, and ODE solvers.

# Solve a nonlinear system
from scipy.optimize import fsolve

def f(x):
    x1, x2 = x
    return [2 * x1 - x2 ** 2 - 1, x1 ** 2 - x2 - 2]

result = fsolve(f, [1, 1])
print(result)

# Numerical integration
from scipy import integrate

def g(x):
    return (1 - x ** 2) ** 0.5

pi_2, err = integrate.quad(g, -1, 1)
print(pi_2 * 2, err)

Output:

[ 1.91963957  1.68501606]
3.141592653589797 1.0002356720661965e-09
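Two of the other capabilities listed above, linear algebra and interpolation, can be sketched in the same way. The matrix and the sample points below are made‑up values chosen so the results are easy to check by hand.

```python
import numpy as np
from scipy import linalg
from scipy.interpolate import interp1d

# Linear algebra: solve the system A @ x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
print(x)  # approximately [2. 3.]

# Interpolation: estimate a value between known sample points
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.0, 10.0, 20.0])
line = interp1d(xs, ys)      # linear interpolation by default
print(float(line(1.5)))      # 15.0
```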

Matplotlib

A popular 2‑D (and simple 3‑D) plotting library.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 10000)
y = np.sin(x) + 1
z = np.cos(x ** 2) + 1

plt.figure(figsize=(8, 4))
# Raw strings keep the LaTeX backslashes intact; the labels match the plotted functions
plt.plot(x, y, label=r'$\sin x + 1$', color='red', linewidth=2)
plt.plot(x, z, 'b--', label=r'$\cos x^2 + 1$')
plt.xlim(0, 10)
plt.ylim(0, 2.5)
plt.xlabel('Time(s)')
plt.ylabel('Volt')
plt.title('Matplotlib Sample')
plt.legend()
plt.show()

Pandas

Pandas offers powerful data‑analysis tools built on NumPy, supporting SQL‑like operations, time‑series analysis, and handling of missing data. Its core structures are Series (1‑D) and DataFrame (2‑D).

import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
d = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]], columns=['a', 'b', 'c'])
d2 = pd.DataFrame(s)

print(s)
print(d.head())  # first 5 rows
print(d.describe())

# Read a CSV file (non-ASCII characters in the path can cause problems on some systems)
# df = pd.read_csv('G:/data.csv', encoding='utf-8')
# print(df)
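The SQL‑like operations, time‑series support, and missing‑data handling mentioned above can be illustrated with a small sketch. The sales table here is hypothetical data invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales with one missing value
sales = pd.DataFrame({
    'day': pd.to_datetime(['2019-08-01', '2019-08-02', '2019-08-08', '2019-08-09']),
    'store': ['north', 'south', 'north', 'south'],
    'units': [10.0, np.nan, 14.0, 6.0],
})

# Missing data: replace NaN with 0
sales['units'] = sales['units'].fillna(0)

# SQL-like: SELECT store, SUM(units) FROM sales GROUP BY store
per_store = sales.groupby('store')['units'].sum()

# Time series: total units per calendar week (weeks end on Sunday)
weekly = sales.set_index('day')['units'].resample('W').sum()

print(per_store)
print(weekly)
```

Filling with 0 is only one option; depending on the data, dropping the rows with dropna() or filling with a mean may be more appropriate.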

Scikit‑Learn

Scikit‑Learn is a machine‑learning library built on NumPy and SciPy (Matplotlib is needed only for its plotting helpers). It provides preprocessing, classification, regression, clustering, and model‑evaluation utilities.

Common model interface methods:

model.fit() – train the model (supervised: fit(X, y); unsupervised: fit(X))

model.predict(X_new) – predict new samples

model.predict_proba(X_new) – predict class probabilities (for applicable models)

Example – linear SVM on the Iris dataset:

from sklearn import datasets, svm

iris = datasets.load_iris()
clf = svm.LinearSVC()
clf.fit(iris.data, iris.target)
print(clf.predict([[5, 3, 1, 0.2], [5.0, 3.6, 1.3, 0.25]]))
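LinearSVC does not offer predict_proba, so the sketch below uses LogisticRegression instead to exercise all three interface methods listed earlier, with a held‑out test split for evaluation. The choice of model, the split ratio, and the random seed are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hold out 30% of the samples so the model is scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)               # train

labels = model.predict(X_test)            # predicted class labels
probs = model.predict_proba(X_test)       # one probability per class, per sample
accuracy = model.score(X_test, y_test)    # fraction of correct predictions

print(labels[:5])
print(probs[0].round(3))
print(accuracy)
```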

Keras

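Keras is a high‑level API that simplifies building deep‑learning models such as autoencoders, RNNs, and CNNs. It was originally built on Theano; current releases run on TensorFlow and other backends. The example below uses the current layer and optimizer arguments.

from keras.models import Sequential
from keras.layers import Input, Dense, Dropout, Activation
from keras.optimizers import SGD

model = Sequential()
model.add(Input(shape=(20,)))  # each sample has 20 features
model.add(Dense(64))           # Dense takes the number of output units
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(64))
model.add(Activation('tanh'))
model.add(Dense(1))
model.add(Activation('sigmoid'))

# Recent Keras releases use learning_rate (formerly lr) and no longer accept decay
sgd = SGD(learning_rate=0.1, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)
# model.fit(X_train, y_train, epochs=20, batch_size=16)
# score = model.evaluate(X_test, y_test, batch_size=16)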

Gensim

Gensim focuses on natural‑language processing tasks such as similarity calculation, LDA, and Word2Vec.

import logging
from gensim import models

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = [['first', 'sentence'], ['second', 'sentence']]
model = models.Word2Vec(sentences, min_count=1)
# Word vectors are stored on model.wv; indexing the model directly was removed in Gensim 4
print(model.wv['sentence'])

Conclusion

This note provides a brief overview of the most common tools for data analysis and mining in Python; detailed usage will be covered in future articles.


