Essential Pandas Functions for Data Analysis in Python
This article introduces Python's pandas library as a powerful open‑source alternative to MATLAB for data modeling competitions, covering basic, intermediate, and advanced functions—including data I/O, inspection, logical filtering, visualization, aggregation, and integration with tqdm for progress tracking—complete with code examples.
MATLAB is widely used for scientific computing in modeling competitions, but it is a commercial product that requires a paid license. The open‑source Python ecosystem, and pandas in particular, offers a free and versatile alternative for matrix operations, data processing, scientific calculations, visualization, and machine learning.
Basic Functions
1. Read data: data = pd.read_csv('newfile.csv'), assuming pandas has been imported with import pandas as pd (also read_excel, read_clipboard, read_sql).
2. Write data: data.to_csv('2_newfile.csv', index=False) writes the frame without the index column (similarly to_excel, to_json, to_pickle).
3. Inspect data: data.shape gives the (rows, columns) tuple; data.describe() provides basic statistics for the numeric columns.
4. View data: data.head(3) shows the first three rows; data.tail() shows the last five rows by default; data.loc[8] accesses the row whose index label is 8 (the ninth row under the default zero‑based index); data.loc[8, 'column_1'] accesses a specific cell; data.loc[range(4,6)] selects the rows labeled 4 and 5.
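The basic calls above can be sketched end to end as follows. This is a minimal illustration, not the article's own dataset: the CSV text and its values are made up, and an in‑memory buffer stands in for 'newfile.csv' and '2_newfile.csv' so the snippet runs without touching the disk.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for 'newfile.csv'.
csv_text = (
    "column_1,column_2\n"
    "apple,1\nbanana,2\ncherry,3\ndate,4\nelderberry,5\nfig,6\ngrape,7\n"
)

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)       # (number of rows, number of columns)
print(data.describe())  # summary statistics for the numeric columns

first_three = data.head(3)       # first three rows
last_rows = data.tail()          # last five rows by default
row = data.loc[4]                # row whose index label is 4
cell = data.loc[4, "column_1"]   # single cell: label 4, column 'column_1'
subset = data.loc[range(4, 6)]   # rows labeled 4 and 5

# Write back to CSV without the index column (a buffer replaces the file here).
buffer = io.StringIO()
data.to_csv(buffer, index=False)
round_trip = buffer.getvalue()
```

Note that loc selects by label, not by position; the two coincide here only because the frame uses the default zero‑based index.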
Intermediate Functions
1. Count occurrences: data['column_1'].value_counts().
2. Apply functions to a column: data['column_1'].map(len) applies len to every value, and calls can be chained, e.g., data['column_1'].map(len).map(lambda x: x/100).plot(); data.apply(sum) applies a function to each column; data.applymap() applies a function to every cell (deprecated since pandas 2.1 in favor of DataFrame.map).
3. Progress monitoring with tqdm :
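    from tqdm.notebook import tqdm
    tqdm.pandas()
(Older tqdm releases used from tqdm import tqdm_notebook and tqdm_notebook().pandas(), which is now deprecated.) Then replace map / apply / applymap with progress_map / progress_apply / progress_applymap, e.g., data['column_1'].progress_map(lambda x: x.count('e')).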
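4. Correlation and scatter matrix: data.corr() gives the correlation matrix (in pandas 2.0 and later, pass numeric_only=True if the frame contains non‑numeric columns); data.corr().applymap(lambda x: int(x*100)/100) truncates the values to two decimals, while data.corr().round(2) rounds them; pd.plotting.scatter_matrix(data, figsize=(12,8)) visualizes pairwise relationships.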
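A short sketch of the counting, mapping, and correlation calls from this section is given below. The frame and its values are hypothetical, column_3 is an extra column added only so the correlation matrix has two numeric columns, and the plotting and tqdm steps are left out so the snippet depends on pandas alone.

```python
import pandas as pd

# Hypothetical frame; the column names follow the article's placeholders.
data = pd.DataFrame({
    "column_1": ["tree", "bee", "tree", "sky"],
    "column_2": [1, 2, 3, 4],
    "column_3": [2, 4, 6, 8],
})

# How often each distinct value occurs.
counts = data["column_1"].value_counts()

# Element-wise functions on one column, chained.
lengths = data["column_1"].map(len)
scaled = lengths.map(lambda x: x / 100)
e_counts = data["column_1"].map(lambda x: x.count("e"))

# apply() calls the function once per column.
numeric = data[["column_2", "column_3"]]
column_totals = numeric.apply(sum)

# Correlation matrix of the numeric columns, rounded to two decimals.
corr = numeric.corr().round(2)
print(corr)
```

Selecting the numeric columns before calling corr() sidesteps the error that recent pandas versions raise when a frame still contains string columns.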
Advanced Operations
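1. SQL‑style merge: data.merge(other_data, on=['column_1','column_2','column_3']) joins the two frames on the listed columns (an inner join by default).
2. Group by: data.groupby('column_1')['column_2'].sum().reset_index() sums column_2 within each group, and reset_index() turns the group keys back into a regular column. The built‑in aggregation .sum() is preferred over .apply(sum).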
3. Row iteration:
    dictionary = {}
    for i, row in data.iterrows():
        dictionary[row['column_1']] = row['column_2']
The iterrows() method yields both the index and the row data for each row.
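The three advanced operations can be tried together on small frames, as in the sketch below. The frames, their values, and the label column are invented for illustration, and the merge uses a single key column rather than three to keep the example short.

```python
import pandas as pd

# Hypothetical frames; only the column_1 / column_2 names come from the article.
data = pd.DataFrame({
    "column_1": ["a", "a", "b", "c"],
    "column_2": [1, 2, 3, 4],
})
other_data = pd.DataFrame({
    "column_1": ["a", "b", "d"],
    "label": ["first", "second", "fourth"],
})

# SQL-style merge: inner join by default, so keys 'c' and 'd' drop out.
merged = data.merge(other_data, on=["column_1"])

# Group by column_1, sum column_2, then flatten the result into a frame.
totals = data.groupby("column_1")["column_2"].sum().reset_index()

# Row iteration: iterrows() yields (index, row) pairs, each row being a Series.
dictionary = {}
for i, row in data.iterrows():
    dictionary[row["column_1"]] = row["column_2"]

print(merged)
print(totals)
print(dictionary)
```

Because 'a' appears twice in column_1, the later row overwrites the earlier one in the dictionary. Row iteration is also slow on large frames, so it is usually kept for cases that vectorized operations cannot express.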