How LLMs + Python Are Redefining Data Analysis: A Practical Guide
This article explains how large language models combined with Python's data‑science ecosystem can automate metadata extraction, data cleaning, and analysis tasks—illustrated with a step‑by‑step Titanic passenger dataset case study, complete prompts, code snippets, and best‑practice recommendations.
Introduction
As data volumes and varieties grow, enterprises need fast ways to explore and extract value. Using Large Language Models (LLMs) together with Python offers a new paradigm for data analysis.
Why Python?
Rich data‑science libraries such as Pandas, NumPy, Scikit‑learn, Matplotlib, and Seaborn.
LLMs are trained on these libraries and can generate or understand code that uses them.
How LLMs Empower Python Data‑Analysis Workflows
Automated metadata extraction and anomaly detection.
Natural‑language interaction, code assistance, and insight generation.
Goal
Provide a beginner‑friendly workflow for data analysis using Python and LLMs.
Case Study: Titanic Passenger Data
We use the publicly available Titanic dataset ( CSV link ) to answer the question: “Among surviving passengers, how many are of each gender and what is the age‑group distribution?”
1. Let LLM Understand the Data
Load the CSV with Pandas, inspect df.info() and df.head(), then send the summary to the LLM with a clear prompt asking it to describe the table structure, column types, and missing values.
#role 数据表信息分析助手
# task
根据用户提供的表头信息,示例数据分析数据表的信息。
表信息在<table>标签中进行表述,列信息在<columns>标签中描述,包含 name、type、description、contains_null 等字段。2. Data Cleaning and Pre‑processing
Based on the LLM’s feedback, generate Python code to standardize data types, fill missing values, and drop completely empty rows or columns.
import pandas as pd
try:
df["PassengerId"] = pd.to_numeric(df["PassengerId"], errors="coerce")
df["Age"].fillna(df["Age"].mean(), inplace=True)
df.dropna(how="all", inplace=True)
df.dropna(axis=1, how="all", inplace=True)
except Exception as e:
print(f"Data cleaning error: {e}")
print(df.info())3. Analysis Code Generation
Prompt the LLM to produce analysis code that selects surviving passengers, counts gender, creates age groups, and outputs the results as JSON with Chinese keys.
# role 你是一个编写数据分析代码的助手
# 代码要求
1. 使用 Python3.11
2. 使用 pandas
3. 输出 JSON,键使用中文
<analysis>
根据用户需求,统计生还乘客的性别人数和年龄段分布。
</analysis>
<thinking>
1. df_survived = df[df["Survived"] == 1]
2. sex_count = df_survived["Sex"].value_counts()
3. df_survived["Age"].fillna(df_survived["Age"].mean(), inplace=True)
4. 定义 age_group 函数并创建 AgeGroup 列
5. age_group_count = df_survived["AgeGroup"].value_counts()
6. 组装为 JSON 并打印
</thinking>
<answer>
```python
import pandas as pd, json
df_survived = df[df["Survived"] == 1]
sex_count = df_survived["Sex"].value_counts()
df_survived["Age"].fillna(df_survived["Age"].mean(), inplace=True)
def age_group(age):
if age <= 18:
return "0-18岁"
elif age <= 30:
return "19-30岁"
elif age <= 50:
return "31-50岁"
else:
return "50岁以上"
df_survived["AgeGroup"] = df_survived["Age"].apply(age_group)
age_group_count = df_survived["AgeGroup"].value_counts()
result = {
"性别人数统计": [{"性别": k, "人数": int(v)} for k, v in sex_count.items()],
"年龄段分布": [{"年龄段": k, "人数": int(v)} for k, v in age_group_count.items()]
}
print(json.dumps(result, ensure_ascii=False))
```4. Execution and Result
Running the generated code yields a JSON object such as:
{
"性别人数统计": [{"性别": "female", "人数": 233}, {"性别": "male", "人数": 109}],
"年龄段分布": [{"年龄段": "19-30岁", "人数": 148}, {"年龄段": "31-50岁", "人数": 102}, {"年龄段": "0-18岁", "人数": 70}, {"年龄段": "50岁以上", "人数": 22}]
}Best Practices, Challenges, and Limitations
Prompt Design
Make prompts explicit and unambiguous.
Provide data dictionaries and business context.
Assign a clear role to the LLM (e.g., “data‑analysis expert”).
Model Limitations
Multi‑step reasoning may be error‑prone.
Generated insights can be fabricated or inaccurate.
Input length limits restrict direct analysis of massive datasets.
Execution Environment Constraints
Pandas on a single machine struggles with very large data; consider databases or big‑data frameworks.
Generated code may depend on specific library versions.
Debugging LLM‑generated code often requires iterative refinement.
Data Quality Issues
“Garbage in, garbage out” – analysis quality is bounded by input data quality.
Hidden business rules may not be discoverable by the model.
Conclusion
LLMs combined with Python provide a powerful, interactive way to accelerate data‑analysis pipelines, but practitioners must craft precise prompts, validate generated code, and be aware of scalability and accuracy constraints.
G7 EasyFlow Tech Circle
Official G7 EasyFlow tech channel! All the hardcore tech, cutting‑edge innovations, and practical sharing you want are right here.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
