Master Pandas Text Manipulation: From Basics to Advanced String Operations
This guide covers text handling in pandas: the object and string dtypes, the core string methods for formatting, alignment, counting, encoding, and format checks, and more advanced operations such as splitting, replacing, concatenating, matching, and extracting patterns.
In daily work we often need to extract data from textual information for analysis. Pandas provides powerful tools for handling such text data.
1. Text Data Types
Pandas stores text using two dtypes: object and string. Before version 1.0, object was the only text type; from 1.0 onward a dedicated string dtype offers better string handling.
To use the string dtype you can specify dtype="string" when creating a Series or DataFrame, or convert later with astype("string"). The helper method df.convert_dtypes() automatically selects appropriate dtypes.
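The three routes described above can be sketched as follows. This is a minimal illustration, and the sample values and column names are made up for the example.

```python
import pandas as pd

# Option 1: request the string dtype at construction time.
s1 = pd.Series(["a", "b", None], dtype="string")

# Option 2: convert an existing object-dtype Series afterwards.
s2 = pd.Series(["a", "b", None]).astype("string")

# Option 3: let pandas pick nullable dtypes for every column.
df = pd.DataFrame({"name": ["x", "y"], "n": [1, 2]}).convert_dtypes()

print(s1.dtype, s2.dtype)
print(df.dtypes)
```

In all three cases the missing entry is stored as pd.NA rather than NaN or None.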
1.2 Type Differences
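The string and object dtypes differ in how they represent missing values and in the return types of accessor methods. With string, missing values appear as <NA>, methods that return numbers (such as str.count() and str.len()) always produce the nullable Int64 dtype, and methods that return booleans produce the nullable boolean dtype. With object, the same numeric methods return int64 when no values are missing and float64 when some are.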
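A small comparison makes the difference in return types visible. The sketch below runs the same str.count() call on both dtypes; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

values = ["A", "Aaba", np.nan]
obj = pd.Series(values, dtype="object")
st = pd.Series(values, dtype="string")

# object dtype: the missing value forces a float result (NaN).
obj_counts = obj.str.count("a")

# string dtype: nullable Int64, the missing value stays <NA>.
st_counts = st.str.count("a")

print(obj_counts.dtype)  # float64
print(st_counts.dtype)   # Int64
```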
2. String Methods
Series and Index expose string operations through the str accessor. Missing values are passed through as NA rather than raising an error.
2.1 Text Formatting
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series(["A", "B", "Aaba", "Baca", np.nan, "cat"], dtype="string")
>>> s.str.lower()
0 a
1 b
2 aaba
3 baca
4 <NA>
5 cat
dtype: string
>>> s.str.upper()
0 A
1 B
2 AABA
3 BACA
4 <NA>
5 CAT
dtype: string
>>> s.str.title()
0 A
1 B
2 Aaba
3 Baca
4 <NA>
5 Cat
dtype: string
>>> s.str.capitalize()
0 A
1 B
2 Aaba
3 Baca
4 <NA>
5 Cat
dtype: string
>>> s.str.swapcase()
0 a
1 b
2 aABA
3 bACA
4 <NA>
5 CAT
dtype: string
>>> s.str.casefold()
0 a
1 b
2 aaba
3 baca
4 <NA>
5 cat
dtype: string
2.2 Text Alignment
>>> s.str.center(10, fillchar='-')
0 ----A-----
1 ----B-----
2 ---Aaba---
3 ---Baca---
4 <NA>
5 ---cat----
>>> s.str.ljust(10, fillchar='-')
0 A---------
1 B---------
2 Aaba------
3 Baca------
4 <NA>
5 cat-------
>>> s.str.rjust(10, fillchar='-')
0 ---------A
1 ---------B
2 ------Aaba
3 ------Baca
4 <NA>
5 -------cat
>>> s.str.pad(width=10, side='left', fillchar='-')
0 ---------A
1 ---------B
2 ------Aaba
3 ------Baca
4 <NA>
5 -------cat
>>> s.str.zfill(3)
0 00A
1 00B
2 Aaba
3 Baca
4 <NA>
5 cat
2.3 Counting and Encoding
>>> s.str.count("a")
0 0
1 0
2 2
3 2
4 <NA>
5 1
dtype: Int64
>>> s.str.len()
0 1
1 1
2 4
3 4
4 <NA>
5 3
dtype: Int64
>>> s.str.encode('utf-8')
0 b'A'
1 b'B'
2 b'Aaba'
3 b'Baca'
4 <NA>
5 b'cat'
dtype: object
>>> s.str.encode('utf-8').str.decode('utf-8')
0 A
1 B
2 Aaba
3 Baca
4 <NA>
5 cat
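dtype: object
2.4 Format Checks
The outputs in this section come from a seven-element Series that the source does not show; judging by the results, positions 3 and 4 hold digit-only strings and position 5 is missing.
>>> s.str.isalpha()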
0 True
1 True
2 True
3 False
4 False
5 <NA>
6 True
dtype: boolean
>>> s.str.isnumeric()
0 False
1 False
2 False
3 True
4 True
5 <NA>
6 False
dtype: boolean
>>> s.str.isalnum()
0 True
1 True
2 True
3 True
4 True
5 <NA>
6 True
dtype: boolean
>>> s.str.isdigit()
0 False
1 False
2 False
3 True
4 True
5 <NA>
6 False
dtype: boolean
>>> s.str.isdecimal()
0 False
1 False
2 False
3 True
4 True
5 <NA>
6 False
dtype: boolean
>>> s.str.isspace()
0 False
1 False
2 False
3 False
4 False
5 <NA>
6 False
dtype: boolean
>>> s.str.islower()
0 False
1 False
2 False
3 False
4 False
5 <NA>
6 True
dtype: boolean
>>> s.str.isupper()
0 True
1 True
2 False
3 False
4 False
5 <NA>
6 False
dtype: boolean
>>> s.str.istitle()
0 True
1 True
2 True
3 False
4 False
5 <NA>
6 False
dtype: boolean
These methods mirror native Python string methods.
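Because the Series behind the outputs above is not shown, here is a self-contained sketch of a few of the checks on a hypothetical Series. The values are invented for illustration and are not the original data.

```python
import pandas as pd

# Hypothetical sample: letters, digits, a mix, a missing value, whitespace.
checks = pd.Series(["abc", "123", "a1", pd.NA, " "], dtype="string")

is_alpha = checks.str.isalpha()   # only "abc" is purely alphabetic
is_digit = checks.str.isdigit()   # only "123" is purely digits
is_alnum = checks.str.isalnum()   # "abc", "123" and "a1" qualify
is_space = checks.str.isspace()   # only " " is purely whitespace

print(pd.DataFrame({"alpha": is_alpha, "digit": is_digit,
                    "alnum": is_alnum, "space": is_space}))
```

Each result uses the nullable boolean dtype, so the missing entry stays <NA> instead of becoming False.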
3. Advanced Text Operations
Advanced techniques include splitting, replacing, concatenating, matching, and extracting text.
3.1 Text Splitting
str.split() returns a Series of lists; access individual elements with str.get() or str[]. Set expand=True to spread the pieces across separate columns, and use n to limit the number of splits. The separator can also be a regular expression for more complex patterns.
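The splitting options can be sketched as follows. The sample data is illustrative, and the explicit regex=True argument assumes pandas 1.4 or later.

```python
import pandas as pd

s = pd.Series(["a_b_c", "c_d_e", pd.NA, "f_g_h"], dtype="string")

parts = s.str.split("_")                      # Series of lists
second = s.str.split("_").str.get(1)          # same as .str[1]
wide = s.str.split("_", expand=True)          # one column per piece
limited = s.str.split("_", n=1, expand=True)  # split at most once

# A regular expression as the separator: split on runs of digits.
by_regex = pd.Series(["a1b22c"]).str.split(r"\d+", regex=True)

print(wide)
print(limited)
```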
3.2 Text Replacement
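Replace substrings with str.replace(). The regex parameter controls whether the pattern is a literal string (regex=False) or a regular expression (regex=True). Its default changed from True to False in pandas 2.0, so passing it explicitly avoids surprises. str.slice_replace() replaces a positional slice of each string with another value and keeps the rest.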
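A short sketch of literal replacement, regex replacement, and positional replacement, using made-up sample strings:

```python
import pandas as pd

s = pd.Series(["a.b", "c.d", pd.NA], dtype="string")

# Literal: "." means a plain dot, not "any character".
literal = s.str.replace(".", "-", regex=False)

# Regex: replace every character in the class [a-c].
pattern = s.str.replace(r"[a-c]", "X", regex=True)

# Positional: replace the first character, keep everything else.
sliced = s.str.slice_replace(start=0, stop=1, repl="Z")

print(literal.tolist())
print(pattern.tolist())
print(sliced.tolist())
```

Passing regex explicitly keeps the behavior the same across pandas versions.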
3.3 Text Concatenation
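Combine strings with str.cat(). Called without another Series, it joins all values into a single string and skips missing values unless you supply a placeholder with na_rep. When you pass another Series, a missing value on either side yields a missing result unless na_rep is given, and the join parameter (left, right, outer, inner) controls how the indexes are aligned.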
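The following sketch shows both uses of str.cat(): collapsing one Series into a string, and combining two Series with different indexes. The data and index labels are illustrative.

```python
import pandas as pd

s = pd.Series(["a", "b", pd.NA, "d"], dtype="string")

joined = s.str.cat(sep=",")                 # "a,b,d"
joined_na = s.str.cat(sep=",", na_rep="-")  # "a,b,-,d"

# A second Series with a different, unordered index.
other = pd.Series(["1", "2", "3"], index=[1, 0, 4], dtype="string")

# join="left" keeps the index of s; join="outer" keeps the union.
left = s.str.cat(other, join="left", na_rep="-")
outer = s.str.cat(other, join="outer", na_rep="-")

print(left)
print(outer)
```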
3.4 Text Matching
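Use str.findall() to collect every regular-expression match, str.find() to get the position of the first occurrence of a substring (-1 if absent), and str.contains() to test whether a pattern occurs anywhere in the string. str.startswith() and str.endswith() check prefixes and suffixes, while str.match() tests a regular expression anchored at the start of the string.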
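The matching methods can be compared side by side on one hypothetical Series:

```python
import pandas as pd

s = pd.Series(["apple pie", "banana", pd.NA, "pineapple"], dtype="string")

all_p = s.str.findall(r"p+")          # every run of "p" in each row
pos = s.str.find("an")                # lowest index of "an", -1 if absent
has_apple = s.str.contains("apple")   # pattern anywhere in the string
starts = s.str.startswith("pine")     # prefix check
ends = s.str.endswith("pie")          # suffix check
from_start = s.str.match(r"[ab]\w+")  # regex must match at position 0

print(pd.DataFrame({"pos": pos, "has_apple": has_apple,
                    "starts": starts, "ends": ends,
                    "from_start": from_start}))
```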
3.5 Text Extraction
Extract specific patterns with str.extract(), which returns the capture groups of the first regular-expression match, one column per group. Use named groups, (?P<name>...), to label the resulting columns. With a single group, set expand=False to get a Series instead of a DataFrame.
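As a minimal sketch, the example below extracts two named groups into a DataFrame and a single group into a Series. The group names letter and digit are chosen for illustration.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3", "xx"], dtype="string")

# Two named groups: one column per group, named after the group.
groups = s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)")

# One group with expand=False: the result is a Series.
single = s.str.extract(r"[ab](\d)", expand=False)

print(groups)
print(single)
```

Rows that do not match the pattern ("c3" and "xx" here) come back as missing values.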
Python Crawling & Data Mining
