Essential Python Libraries for Data Analysis and Machine Learning: A Hands‑On Guide
This article introduces Python's core data structures and the most widely used libraries—NumPy, SciPy, Matplotlib, Pandas, Scikit‑Learn, Keras, and Gensim—providing concise explanations and runnable code examples to help readers quickly start data analysis and machine‑learning projects.
Python Language Features for Data Analysis
Key built-in features frequently used in data analysis and mining include mutable lists, immutable tuples, dictionaries (key-value mappings), sets (unordered collections of unique elements, like mathematical sets), and functional programming tools such as lambda, map, filter, and reduce (available as functools.reduce in Python 3).
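As a quick illustration of these features, here is a minimal sketch using made-up sample values; the variable names are illustrative and do not come from the original article. In Python 3, `map` and `filter` return lazy iterators, so they are wrapped in `list()` here.

```python
from functools import reduce  # reduce is not a built-in in Python 3

# Core containers (sample values are illustrative)
readings = [3, 1, 4, 1, 5]            # list: mutable, ordered
readings.append(9)                    # lists can grow in place
point = (2.5, 7.0)                    # tuple: immutable, ordered
counts = {"a": 2, "b": 1}             # dict: key-value mapping
counts["c"] = counts.get("c", 0) + 1  # add a key with a default
unique = set(readings)                # set: duplicates removed

# Functional tools
squares = list(map(lambda v: v * v, readings))        # apply to each item
evens = list(filter(lambda v: v % 2 == 0, readings))  # keep matching items
total = reduce(lambda acc, v: acc + v, readings, 0)   # fold into one value

print(squares)
print(evens)
print(total)
print(sorted(unique))
```

In everyday code a list comprehension such as `[v * v for v in readings]` is often preferred over `map` with a lambda; both give the same result.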
Common Python Data‑Analysis Libraries
NumPy
Provides true n‑dimensional arrays with C‑level performance; it underpins libraries like SciPy, Matplotlib, and Pandas.
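The sketch below is one way to see what "n-dimensional arrays" means in practice: whole-array arithmetic, broadcasting, and reductions along an axis. The data and variable names are illustrative assumptions, not part of the original article.

```python
import numpy as np

# Illustrative data: a 2x3 array [[0, 1, 2], [3, 4, 5]]
values = np.arange(6, dtype=float).reshape(2, 3)

# Vectorized arithmetic: one expression instead of a Python loop
scaled = values * 2.0 + 1.0

# Broadcasting: a length-3 row is added to every row of the 2x3 array
offsets = np.array([10.0, 20.0, 30.0])
shifted = values + offsets

# Reduction along an axis: mean of each column
col_means = values.mean(axis=0)

print(values.shape, values.dtype)
print(scaled)
print(shifted)
print(col_means)
```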
import numpy as np # standard alias
a = np.array([2, 0, 1, 5])
print(a)
print(a[:3])
print(a.min())
a.sort() # sorts in place
print(a)
b = np.array([[1, 2, 3], [4, 5, 6]])
print(b * b) # element-wise product, not matrix multiplication
Output:
[2 0 1 5]
[2 0 1]
0
[0 1 2 5]
[[ 1  4  9]
 [16 25 36]]
SciPy
Extends NumPy with scientific computing capabilities, offering optimization, linear algebra, integration, interpolation, FFT, signal and image processing, and ODE solvers.
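Two of the areas listed above, linear algebra and interpolation, can be sketched briefly as follows. The small system and sample points are made up for illustration. `interp1d` is only one of several interpolation routines in SciPy and is used here as an assumption about what a reader might reach for first.

```python
import numpy as np
from scipy import interpolate, linalg

# Linear algebra: solve A @ x = b for an illustrative 2x2 system
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
print(x)

# Interpolation: estimate a value between known sample points
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = xs ** 2
f = interpolate.interp1d(xs, ys)  # linear interpolation by default
print(f(1.5))
```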
# Solve a nonlinear system
from scipy.optimize import fsolve
def f(x):
    x1, x2 = x
    return [2 * x1 - x2 ** 2 - 1, x1 ** 2 - x2 - 2]
result = fsolve(f, [1, 1])
print(result)
# Numerical integration
from scipy import integrate
def g(x):
    return (1 - x ** 2) ** 0.5
pi_2, err = integrate.quad(g, -1, 1)
print(pi_2 * 2, err)
Output:
[ 1.91963957 1.68501606]
3.141592653589797 1.0002356720661965e-09
Matplotlib
A popular 2‑D (and simple 3‑D) plotting library.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 10000)
y = np.sin(x) + 1
z = np.cos(x ** 2) + 1
plt.figure(figsize=(8, 4))
plt.plot(x, y, label=r'$\sin x+1$', color='red', linewidth=2)
plt.plot(x, z, 'b--', label=r'$\cos x^2+1$')
plt.xlim(0, 10)
plt.ylim(0, 2.5)
plt.xlabel('Time(s)')
plt.ylabel('Volt')
plt.title('Matplotlib Sample')
plt.legend()
plt.show()
Pandas
Pandas offers powerful data‑analysis tools built on NumPy, supporting SQL‑like operations, time‑series analysis, and handling of missing data. Its core structures are Series (1‑D) and DataFrame (2‑D).
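The three capabilities named above (SQL-like operations, missing-data handling, and time series) can be sketched as follows. The `sales` table, its column names, and the dates are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Illustrative sales records with one missing value
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [100.0, np.nan, 150.0, 80.0],
})

# Missing data: count the gaps, then fill them with 0
n_missing = int(sales["amount"].isna().sum())
filled = sales.fillna({"amount": 0.0})

# SQL-like operations: rough equivalents of WHERE and GROUP BY
large = filled[filled["amount"] > 90]
totals = filled.groupby("region")["amount"].sum()

# Time series: daily values resampled into 2-day sums
days = pd.date_range("2024-01-01", periods=4, freq="D")
ts = pd.Series([1, 2, 3, 4], index=days)
two_day = ts.resample("2D").sum()

print(n_missing)
print(large)
print(totals)
print(two_day)
```

Filling missing values with 0 is only one possible policy; dropping rows with `dropna()` or filling with a column mean may suit other datasets better.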
import pandas as pd
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
d = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]], columns=['a', 'b', 'c'])
d2 = pd.DataFrame(s)
print(s)
print(d.head()) # first 5 rows
print(d.describe())
# Read a CSV file (the original advice: avoid non-ASCII characters, such as Chinese, in the path)
# df = pd.read_csv('G:/data.csv', encoding='utf-8')
# print(df)
Scikit-Learn
A machine-learning library built on NumPy and SciPy (Matplotlib is needed only for plotting), providing preprocessing, classification, regression, clustering, and model-evaluation utilities.
Common model interface methods:
model.fit() – train the model (supervised: fit(X, y); unsupervised: fit(X))
model.predict(X_new) – predict new samples
model.predict_proba(X_new) – predict class probabilities (for applicable models)
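The interface methods listed above can be sketched with two estimators. `LogisticRegression` and `KMeans` are chosen here purely for illustration (they are not from the original article): the first supports `predict_proba`, and the second shows the unsupervised `fit(X)` form.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_new = X[:2]  # reuse two training rows as stand-ins for new samples

# Supervised: fit(X, y), then predict labels and class probabilities
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
labels = clf.predict(X_new)
probs = clf.predict_proba(X_new)  # one row per sample, one column per class

# Unsupervised: fit(X) without labels
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)
clusters = km.predict(X_new)

print(labels)
print(probs.shape)
print(clusters)
```

Not every estimator offers `predict_proba`; `LinearSVC`, for example, exposes `decision_function` instead.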
Example – linear SVM on the Iris dataset:
from sklearn import datasets, svm
iris = datasets.load_iris()
clf = svm.LinearSVC()
clf.fit(iris.data, iris.target)
print(clf.predict([[5, 3, 1, 0.2], [5.0, 3.6, 1.3, 0.25]]))
Keras
Keras, originally built on Theano and now running on backends such as TensorFlow, simplifies building deep-learning models such as autoencoders, RNNs, and CNNs.
from keras.models import Sequential
from keras.layers import Input, Dense, Dropout, Activation
from keras.optimizers import SGD
model = Sequential()
model.add(Input(shape=(20,))) # 20 input features
model.add(Dense(64)) # current API is Dense(units); the old Dense(20, 64) form no longer works
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(64))
model.add(Activation('tanh'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
# learning_rate replaces the old lr argument; recent Keras releases reject decay=1e-6, so it is omitted
sgd = SGD(learning_rate=0.1, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)
# model.fit(x_train, y_train, epochs=20, batch_size=16)
# score = model.evaluate(x_test, y_test, batch_size=16)
Gensim
Gensim focuses on natural‑language processing tasks such as similarity calculation, LDA, and Word2Vec.
import logging
from gensim import models
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = [['first', 'sentence'], ['second', 'sentence']]
model = models.Word2Vec(sentences, min_count=1)
print(model.wv['sentence']) # word vectors are accessed through model.wv in current Gensim
Conclusion
This note provides a brief overview of the most common tools for data analysis and mining in Python; detailed usage will be covered in future articles.
MaGe Linux Operations
