
Master Pandas Text Manipulation: From Basics to Advanced String Operations

This guide walks you through handling textual data with pandas, covering basic and new string dtypes, essential string methods for formatting, alignment, counting, encoding, and advanced operations such as splitting, replacing, concatenating, matching, and extracting patterns, all illustrated with clear code examples.

Python Crawling & Data Mining

In daily work we often need to extract data from textual information for analysis. Pandas provides powerful tools for handling such text data.

1. Text Data Types

Pandas stores text using two dtypes: object and string. Before version 1.0, object was the only text type; from 1.0 onward a dedicated string dtype (StringDtype) is available, which holds only strings and marks missing values with pd.NA.

To use the string dtype you can specify dtype="string" when creating a Series or DataFrame, or convert later with astype("string"). The helper method df.convert_dtypes() automatically selects appropriate dtypes.
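The three routes described above can be sketched as follows. The sample values and the variable names (s1, s2, df, converted) are illustrative assumptions, not taken from the original article.

```python
import pandas as pd

# Option 1: request the string dtype at construction time.
s1 = pd.Series(["apple", "banana", None], dtype="string")

# Option 2: convert an existing object-dtype Series afterwards.
s2 = pd.Series(["apple", "banana", None], dtype=object).astype("string")

# Option 3: let pandas pick nullable dtypes for every column.
df = pd.DataFrame({"name": ["apple", "banana", None], "qty": [1, 2, 3]})
converted = df.convert_dtypes()

print(s1.dtype, s2.dtype)
print(converted.dtypes)
```

In this sketch the name column should come out as a string column and qty as a nullable integer column, since convert_dtypes() prefers dtypes that support pd.NA.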

1.2 Type Differences

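The string and object dtypes differ in how they represent missing values (pd.NA versus NaN) and in the return types of accessor methods. Methods that return numbers, such as str.len() or str.count(), produce the nullable Int64 dtype for string but int64 or float64 (when values are missing) for object. Methods that return booleans produce the nullable boolean dtype for string but bool or object for object.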
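A small comparison makes the difference visible. This is a minimal sketch with made-up sample values; the exact dtype printed for the object-backed results may vary between pandas versions, so no expected output is shown for those lines.

```python
import numpy as np
import pandas as pd

values = ["a", "bb", np.nan]
obj = pd.Series(values, dtype=object)    # classic object-backed text
txt = pd.Series(values, dtype="string")  # dedicated string dtype

# Numeric accessor results: the string dtype yields nullable Int64.
print(obj.str.len().dtype)
print(txt.str.len().dtype)

# Boolean accessor results: the string dtype yields nullable boolean.
print(obj.str.isalpha().dtype)
print(txt.str.isalpha().dtype)

# The missing entry in the string Series is pd.NA.
print(txt[2] is pd.NA)
```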

2. String Methods

Series and Index expose string operations via the str accessor, automatically skipping NA values.

2.1 Text Formatting

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series(["A", "B", "Aaba", "Baca", np.nan, "cat"], dtype="string")
>>> s.str.lower()
0       a
1       b
2    aaba
3    baca
4    <NA>
5     cat
dtype: string
>>> s.str.upper()
0       A
1       B
2    AABA
3    BACA
4    <NA>
5     CAT
dtype: string
>>> s.str.title()
0       A
1       B
2    Aaba
3    Baca
4    <NA>
5     Cat
dtype: string
>>> s.str.capitalize()
0       A
1       B
2    Aaba
3    Baca
4    <NA>
5     Cat
dtype: string
>>> s.str.swapcase()
0       a
1       b
2    aABA
3    bACA
4    <NA>
5     CAT
dtype: string
>>> s.str.casefold()
0       a
1       b
2    aaba
3    baca
4    <NA>
5     cat
dtype: string

2.2 Text Alignment

>>> s.str.center(10, fillchar='-')
0    ----A-----
1    ----B-----
2    ---Aaba---
3    ---Baca---
4          <NA>
5    ---cat----
>>> s.str.ljust(10, fillchar='-')
0    A---------
1    B---------
2    Aaba------
3    Baca------
4          <NA>
5    cat-------
>>> s.str.rjust(10, fillchar='-')
0    ---------A
1    ---------B
2    ------Aaba
3    ------Baca
4          <NA>
5    -------cat
>>> s.str.pad(width=10, side='left', fillchar='-')
0    ---------A
1    ---------B
2    ------Aaba
3    ------Baca
4          <NA>
5    -------cat
>>> s.str.zfill(3)
0    00A
1    00B
2    Aaba
3    Baca
4    <NA>
5    cat

2.3 Counting and Encoding

>>> s.str.count("a")
0    0
1    0
2    2
3    2
4    <NA>
5    1
dtype: Int64
>>> s.str.len()
0    1
1    1
2    4
3    4
4    <NA>
5    3
dtype: Int64
>>> s.str.encode('utf-8')
0    b'A'
1    b'B'
2    b'Aaba'
3    b'Baca'
4    <NA>
5    b'cat'
dtype: object
>>> s.str.encode('utf-8').str.decode('utf-8')
0    A
1    B
2    Aaba
3    Baca
4    <NA>
5    cat
dtype: object

2.4 Format Checks

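>>> # The source omits the seven-element Series used here; "12" and "345" are
>>> # assumed digit strings that reproduce the outputs shown below.
>>> s = pd.Series(["A", "B", "Aaba", "12", "345", np.nan, "cat"], dtype="string")
>>> s.str.isalpha()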
0    True
1    True
2    True
3    False
4    False
5    <NA>
6    True
dtype: boolean
>>> s.str.isnumeric()
0    False
1    False
2    False
3    True
4    True
5    <NA>
6    False
dtype: boolean
>>> s.str.isalnum()
0    True
1    True
2    True
3    True
4    True
5    <NA>
6    True
dtype: boolean
>>> s.str.isdigit()
0    False
1    False
2    False
3    True
4    True
5    <NA>
6    False
dtype: boolean
>>> s.str.isdecimal()
0    False
1    False
2    False
3    True
4    True
5    <NA>
6    False
dtype: boolean
>>> s.str.isspace()
0    False
1    False
2    False
3    False
4    False
5    <NA>
6    False
dtype: boolean
>>> s.str.islower()
0    False
1    False
2    False
3    False
4    False
5    <NA>
6    True
dtype: boolean
>>> s.str.isupper()
0    True
1    True
2    False
3    False
4    False
5    <NA>
6    False
dtype: boolean
>>> s.str.istitle()
0    True
1    True
2    True
3    False
4    False
5    <NA>
6    False
dtype: boolean

These methods mirror native Python string methods.

3. Advanced Text Operations

Advanced techniques include splitting, replacing, concatenating, matching, and extracting text.

3.1 Text Splitting

The str.split() method returns a Series of lists; access individual elements with str.get() or str[] indexing. Set expand=True to spread the pieces across separate columns, and use n to limit the number of splits. The separator can also be a regular expression for complex patterns.
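The splitting options can be sketched as below. The sample data and variable names are assumptions chosen for illustration, and the explicit regex=True argument assumes pandas 1.4 or later.

```python
import pandas as pd

s = pd.Series(["a_b_c", "c_d_e", None, "f_g_h"], dtype="string")

parts = s.str.split("_")              # Series of lists (NA stays NA)
first = s.str.split("_").str.get(0)   # first piece of each list
second = s.str.split("_").str[1]      # same idea with [] indexing

wide = s.str.split("_", expand=True)          # one column per piece
limited = s.str.split("_", n=1, expand=True)  # at most one split, two columns

# Split on a regular expression: any run of digits acts as the separator.
by_regex = pd.Series(["a1b22c"], dtype="string").str.split(r"\d+", regex=True)

print(wide)
print(limited)
```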

3.2 Text Replacement

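Replace substrings with str.replace(). Since pandas 2.0 the pattern is treated as a literal string by default (regex=False); pass regex=True to have it interpreted as a regular expression (in earlier versions regex=True was the default). To replace by position instead, str.slice_replace() overwrites a given slice of each string and keeps the rest.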
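A minimal sketch of literal, regex, and positional replacement follows. The sample strings are invented for illustration, and regex is passed explicitly so the behavior does not depend on the pandas version's default.

```python
import pandas as pd

s = pd.Series(["price: $10", "price: $25", None], dtype="string")

# Literal replacement: "$" is matched as a plain character.
literal = s.str.replace("$", "USD ", regex=False)

# Regex replacement: each run of digits becomes "#".
masked = s.str.replace(r"\d+", "#", regex=True)

# Positional replacement: overwrite the slice [0:6] ("price:"), keep the rest.
relabeled = s.str.slice_replace(start=0, stop=6, repl="cost:")

print(literal)
print(masked)
print(relabeled)
```

Passing regex explicitly is a reasonable habit here, because a pattern such as "$" means something very different as a regular expression than as a literal character.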

3.3 Text Concatenation

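Combine strings with str.cat(). Called without other inputs, it joins all values of the Series into one string and skips missing values; when concatenating with another Series or list, a missing value on either side yields a missing result. In both cases na_rep substitutes a placeholder for missing values. When the inputs have different indexes, the join parameter (left, right, outer, inner) controls alignment.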
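The concatenation behaviors can be illustrated with a short sketch. The Series s, t, and u are hypothetical sample data.

```python
import pandas as pd

s = pd.Series(["a", "b", None, "d"], dtype="string")
t = pd.Series(["A", "B", "C", "D"], dtype="string")

# No other input: collapse the Series into a single string.
joined = s.str.cat(sep=",")                  # missing value is skipped
joined_rep = s.str.cat(sep=",", na_rep="-")  # missing value becomes "-"

# Element-wise concatenation with another Series.
pairs = s.str.cat(t)                  # NA on either side gives NA
pairs_rep = s.str.cat(t, na_rep="-")  # NA replaced before joining

# Index alignment: u only has labels 1, 3 and 4.
u = pd.Series(["x", "y", "z"], index=[1, 3, 4], dtype="string")
left = s.str.cat(u, join="left", na_rep="-")    # keeps the labels of s
outer = s.str.cat(u, join="outer", na_rep="-")  # union of both label sets

print(joined, joined_rep)
print(pairs_rep.tolist())
print(outer.tolist())
```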

3.4 Text Matching

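Use str.find() to get the position of a literal substring (-1 if absent), str.findall() to collect every regular-expression match, and str.contains() to test whether a pattern occurs anywhere in the string. str.startswith() and str.endswith() check prefixes and suffixes, while str.match() tests whether a regular expression matches from the start of the string.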
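The following sketch runs each matching method on the same hypothetical Series so the differences are easy to compare. The sample strings and patterns are assumptions for illustration.

```python
import pandas as pd

s = pd.Series(["apple pie", "banana", None, "grape 42"], dtype="string")

positions = s.str.find("an")           # index of the substring, -1 if absent
all_a = s.str.findall(r"a\w")          # every regex match, as a list per row
has_digit = s.str.contains(r"\d")      # regex search anywhere in the string
starts = s.str.startswith("ba")        # literal prefix check
ends = s.str.endswith("pie")           # literal suffix check
anchored = s.str.match(r"[a-z]+ \d+")  # regex must match from the start

print(positions)
print(all_a)
print(has_digit)
print(anchored)
```

Note that "apple pie" fails the str.match() check even though it contains lowercase letters and a space, because the pattern requires digits after the space.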

3.5 Text Extraction

Extract specific patterns with str.extract(), which returns the first match of each capture group in a regular expression as a new column. Use named groups, written (?P<name>...), to label the extracted columns. With a single group, set expand=False to get a Series instead of a one-column DataFrame.
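A brief sketch of extraction with named groups and with expand=False follows. The sample values and the group names letter and digit are hypothetical.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3", "zz"], dtype="string")

# Two named groups: DataFrame with columns "letter" and "digit".
both = s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)")

# One group, default expand=True: one-column DataFrame.
as_frame = s.str.extract(r"[ab](\d)")

# One group, expand=False: Series.
as_series = s.str.extract(r"[ab](\d)", expand=False)

print(both)
print(type(as_frame).__name__, type(as_series).__name__)
```

Rows that do not match the pattern ("c3" and "zz" in this sketch) come back as missing values in every extracted column.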
