How LLMs + Python Are Redefining Data Analysis: A Practical Guide

This article explains how large language models combined with Python's data‑science ecosystem can automate metadata extraction, data cleaning, and analysis tasks—illustrated with a step‑by‑step Titanic passenger dataset case study, complete prompts, code snippets, and best‑practice recommendations.

G7 EasyFlow Tech Circle
G7 EasyFlow Tech Circle
G7 EasyFlow Tech Circle
How LLMs + Python Are Redefining Data Analysis: A Practical Guide

Introduction

As data volumes and varieties grow, enterprises need fast ways to explore and extract value. Using Large Language Models (LLMs) together with Python offers a new paradigm for data analysis.

Why Python?

Rich data‑science libraries such as Pandas, NumPy, Scikit‑learn, Matplotlib, and Seaborn.

LLMs are trained on these libraries and can generate or understand code that uses them.

How LLMs Empower Python Data‑Analysis Workflows

Automated metadata extraction and anomaly detection.

Natural‑language interaction, code assistance, and insight generation.

Goal

Provide a beginner‑friendly workflow for data analysis using Python and LLMs.

Case Study: Titanic Passenger Data

We use the publicly available Titanic dataset ( CSV link ) to answer the question: “Among surviving passengers, how many are of each gender and what is the age‑group distribution?”

1. Let LLM Understand the Data

Load the CSV with Pandas, inspect df.info() and df.head(), then send the summary to the LLM with a clear prompt asking it to describe the table structure, column types, and missing values.

#role 数据表信息分析助手
# task
根据用户提供的表头信息,示例数据分析数据表的信息。
表信息在<table>标签中进行表述,列信息在<columns>标签中描述,包含 name、type、description、contains_null 等字段。

2. Data Cleaning and Pre‑processing

Based on the LLM’s feedback, generate Python code to standardize data types, fill missing values, and drop completely empty rows or columns.

import pandas as pd
try:
    df["PassengerId"] = pd.to_numeric(df["PassengerId"], errors="coerce")
    df["Age"].fillna(df["Age"].mean(), inplace=True)
    df.dropna(how="all", inplace=True)
    df.dropna(axis=1, how="all", inplace=True)
except Exception as e:
    print(f"Data cleaning error: {e}")
print(df.info())

3. Analysis Code Generation

Prompt the LLM to produce analysis code that selects surviving passengers, counts gender, creates age groups, and outputs the results as JSON with Chinese keys.

# role 你是一个编写数据分析代码的助手
# 代码要求
1. 使用 Python3.11
2. 使用 pandas
3. 输出 JSON,键使用中文

<analysis>
根据用户需求,统计生还乘客的性别人数和年龄段分布。
</analysis>
<thinking>
1. df_survived = df[df["Survived"] == 1]
2. sex_count = df_survived["Sex"].value_counts()
3. df_survived["Age"].fillna(df_survived["Age"].mean(), inplace=True)
4. 定义 age_group 函数并创建 AgeGroup 列
5. age_group_count = df_survived["AgeGroup"].value_counts()
6. 组装为 JSON 并打印
</thinking>
<answer>
```python
import pandas as pd, json

df_survived = df[df["Survived"] == 1]
sex_count = df_survived["Sex"].value_counts()
df_survived["Age"].fillna(df_survived["Age"].mean(), inplace=True)

def age_group(age):
    if age <= 18:
        return "0-18岁"
    elif age <= 30:
        return "19-30岁"
    elif age <= 50:
        return "31-50岁"
    else:
        return "50岁以上"

df_survived["AgeGroup"] = df_survived["Age"].apply(age_group)
age_group_count = df_survived["AgeGroup"].value_counts()
result = {
    "性别人数统计": [{"性别": k, "人数": int(v)} for k, v in sex_count.items()],
    "年龄段分布": [{"年龄段": k, "人数": int(v)} for k, v in age_group_count.items()]
}
print(json.dumps(result, ensure_ascii=False))
```

4. Execution and Result

Running the generated code yields a JSON object such as:

{
    "性别人数统计": [{"性别": "female", "人数": 233}, {"性别": "male", "人数": 109}],
    "年龄段分布": [{"年龄段": "19-30岁", "人数": 148}, {"年龄段": "31-50岁", "人数": 102}, {"年龄段": "0-18岁", "人数": 70}, {"年龄段": "50岁以上", "人数": 22}]
}

Best Practices, Challenges, and Limitations

Prompt Design

Make prompts explicit and unambiguous.

Provide data dictionaries and business context.

Assign a clear role to the LLM (e.g., “data‑analysis expert”).

Model Limitations

Multi‑step reasoning may be error‑prone.

Generated insights can be fabricated or inaccurate.

Input length limits restrict direct analysis of massive datasets.

Execution Environment Constraints

Pandas on a single machine struggles with very large data; consider databases or big‑data frameworks.

Generated code may depend on specific library versions.

Debugging LLM‑generated code often requires iterative refinement.

Data Quality Issues

“Garbage in, garbage out” – analysis quality is bounded by input data quality.

Hidden business rules may not be discoverable by the model.

Conclusion

LLMs combined with Python provide a powerful, interactive way to accelerate data‑analysis pipelines, but practitioners must craft precise prompts, validate generated code, and be aware of scalability and accuracy constraints.

PythonLLMprompt engineeringData Analysisdata cleaningpandas
G7 EasyFlow Tech Circle
Written by

G7 EasyFlow Tech Circle

Official G7 EasyFlow tech channel! All the hardcore tech, cutting‑edge innovations, and practical sharing you want are right here.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.