Practical Data Sampling Techniques and Code Examples for Various Business Scenarios
This article presents ten real‑world business scenarios illustrating data sampling methods such as random, stratified, time‑window, sliding‑window, keyword, group, interval, click‑based, and weight‑based sampling, each accompanied by clear Python pandas code examples.
Data sampling is a commonly used technique in data analysis that can improve efficiency while preserving representativeness of the original data.
1. E‑commerce scenario: Randomly sample a subset of user purchase records for analysis.
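Two details of `DataFrame.sample` are easy to miss here: without a fixed `random_state` every run draws a different subset, and requesting more rows than exist (without replacement) raises a `ValueError`. The sketch below illustrates both on a small synthetic table. The columns `user_id` and `amount` and the helper `sample_rows` are hypothetical and not taken from the article's data.

```python
import pandas as pd

# Synthetic stand-in for purchase records (hypothetical columns)
purchase_data = pd.DataFrame({
    "user_id": range(1, 51),
    "amount": [round(10 + 0.5 * i, 2) for i in range(50)],
})

def sample_rows(df: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Draw up to n rows without replacement, reproducibly."""
    n = min(n, len(df))  # avoid a ValueError when n exceeds the row count
    return df.sample(n=n, random_state=seed)

first_draw = sample_rows(purchase_data, 10)
second_draw = sample_rows(purchase_data, 10)     # same seed, same rows
capped_draw = sample_rows(purchase_data, 1000)   # capped at the table size
```

Fixing the seed is mainly useful when an analysis has to be rerun or audited; for a one-off exploration it can be left out.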

    import pandas as pd

    # Read purchase records
    purchase_data = pd.read_csv('purchase_data.csv')
    # Randomly sample 1,000 records (fixed seed for a reproducible draw)
    sample_data = purchase_data.sample(n=1000, random_state=42)

2. Market research scenario: Use stratified sampling based on respondent attributes such as gender.
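A stratified sample can be allocated equally per stratum or in proportion to each stratum's size. As a minimal sketch of the proportional variant, assuming pandas 1.1 or later (where `DataFrameGroupBy.sample` is available) and a synthetic table whose `Score` column is hypothetical:

```python
import pandas as pd

# Synthetic survey with a 60/40 gender split (hypothetical data)
survey_data = pd.DataFrame({
    "Gender": ["Male"] * 60 + ["Female"] * 40,
    "Score": list(range(100)),
})

# Take 20% of each gender so the sample keeps the population's mix
proportional_sample = survey_data.groupby("Gender").sample(frac=0.2, random_state=0)

# Share of each gender in the full data and in the sample
population_share = survey_data["Gender"].value_counts(normalize=True)
sample_share = proportional_sample["Gender"].value_counts(normalize=True)
```

Equal allocation, such as a fixed number of respondents per gender, over-represents the smaller group relative to its share of the population; proportional allocation keeps the original mix.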

    import pandas as pd

    # Read survey data
    survey_data = pd.read_csv('survey_data.csv')
    # Stratified sampling by gender: 500 respondents from each stratum
    male_data = survey_data[survey_data['Gender'] == 'Male'].sample(n=500)
    female_data = survey_data[survey_data['Gender'] == 'Female'].sample(n=500)
    # Combine the two strata into a single sample
    sample_data = pd.concat([male_data, female_data])

3. Healthcare scenario: Apply a time‑window sample to select records within a specific year.
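Date filters are more robust when the column holds real datetimes rather than strings, and a half-open interval avoids dropping records time-stamped later on the final day. A minimal sketch on synthetic rows, where the `PatientID` column is hypothetical:

```python
import pandas as pd

medical_data = pd.DataFrame({
    "PatientID": [1, 2, 3, 4],
    "Date": ["2021-12-31 23:59", "2022-01-01 00:00",
             "2022-12-31 18:30", "2023-01-01 00:00"],
})
medical_data["Date"] = pd.to_datetime(medical_data["Date"])

start = pd.Timestamp("2022-01-01")
end = pd.Timestamp("2023-01-01")  # exclusive upper bound

# Keep records with start <= Date < end
in_2022 = medical_data[(medical_data["Date"] >= start) & (medical_data["Date"] < end)]
```

With an inclusive bound of `'2022-12-31'` on a datetime column, the bound is interpreted as midnight, so the 18:30 record on 31 December would be left out.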

    import pandas as pd

    # Read medical records, parsing the Date column as datetimes
    medical_data = pd.read_csv('medical_data.csv', parse_dates=['Date'])
    # Time-window sampling for 2022 (upper bound exclusive, so all of Dec 31 is kept)
    sample_data = medical_data[(medical_data['Date'] >= '2022-01-01') & (medical_data['Date'] < '2023-01-01')]

4. Finance scenario: Use a sliding‑window sample to take the most recent 30 days of stock trading data.
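Taking the last 30 rows of a table corresponds to "the most recent 30 days" only when there is exactly one row per day. Where that is not guaranteed, the window can be defined on the dates themselves. The sketch below uses synthetic business-day data; the `Date` and `Close` column names are assumptions, since the article does not name the stock file's columns.

```python
import pandas as pd

# Business-day dates, so weekends are absent
dates = pd.bdate_range("2023-01-02", periods=60)
stock_data = pd.DataFrame({"Date": dates, "Close": range(60)})

# Window of 30 calendar days ending at the latest date in the data
latest = stock_data["Date"].max()
cutoff = latest - pd.Timedelta(days=30)
last_30_days = stock_data[stock_data["Date"] > cutoff]

# Positional alternative: 30 trading days, which span more than 30 calendar days
last_30_rows = stock_data.tail(30)
```

Which of the two is appropriate depends on whether the analysis counts calendar days or trading days.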

    import pandas as pd

    # Read stock data
    stock_data = pd.read_csv('stock_data.csv')
    # Sliding-window sample: last 30 rows, i.e. the last 30 trading days
    # if the file has one row per day in ascending date order
    sample_data = stock_data.tail(30)

5. Social media scenario: Keyword sampling selects comments containing a specific keyword.
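By default `str.contains` interprets the pattern as a regular expression and returns a missing value for a missing comment, which cannot then be used as a boolean mask. A minimal sketch with hypothetical rows and a hypothetical keyword shows the two arguments that address this:

```python
import pandas as pd

comment_data = pd.DataFrame({
    "Content": ["Rated 5.0, great", None, "Order 510 arrived late", "rated 5.0 again"],
})

keyword = "5.0"  # hypothetical keyword; "." would match any character in a regex

mask = comment_data["Content"].str.contains(
    keyword,
    regex=False,  # treat the keyword as literal text
    na=False,     # rows with missing content count as non-matches
)
keyword_sample = comment_data[mask]
```

With the default `regex=True`, the same keyword would also match "Order 510", because `.` matches any single character.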

    import pandas as pd

    # Read comment data
    comment_data = pd.read_csv('comment_data.csv')
    # Keyword sampling for "好评" ("positive review"); na=False treats missing comments as non-matches
    sample_data = comment_data[comment_data['Content'].str.contains('好评', na=False)]

6. Human resources scenario: Group sampling picks a proportion of employees from each department.
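One caveat with a fixed fraction per group is that small groups can drop out entirely: 10% of a three-person department rounds to zero rows. A minimal sketch that guarantees at least one employee per department, using synthetic data; the `EmployeeID` column and the helper `sample_department` are hypothetical:

```python
import pandas as pd

performance_data = pd.DataFrame({
    "Department": ["Sales"] * 30 + ["Legal"] * 3,
    "EmployeeID": range(33),
})

def sample_department(group: pd.DataFrame, frac: float = 0.1, seed: int = 0) -> pd.DataFrame:
    """Sample roughly `frac` of a group, but never fewer than one row."""
    n = max(1, round(len(group) * frac))
    return group.sample(n=n, random_state=seed)

# Sample each department separately, then stack the results
parts = [sample_department(group) for _, group in performance_data.groupby("Department")]
dept_sample = pd.concat(parts)
dept_counts = dept_sample["Department"].value_counts()
```

The floor of one row slightly over-represents very small departments, which is usually acceptable when every department must appear in the sample.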

    import pandas as pd

    # Read performance data
    performance_data = pd.read_csv('performance_data.csv')
    # Group sampling: 10% from each department (groupby(...).sample needs pandas 1.1+)
    sample_data = performance_data.groupby('Department').sample(frac=0.1)

7. Education scenario: Stratified random sampling selects a percentage of students from each grade.

    import pandas as pd

    # Read exam scores
    exam_scores = pd.read_csv('exam_scores.csv')
    # Stratified random sampling: 20% per grade (groupby(...).sample needs pandas 1.1+)
    sample_data = exam_scores.groupby('Grade').sample(frac=0.2)

8. Hotel scenario: Interval sampling picks records at regular time intervals (e.g., every week).
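Stepping through rows by position yields a weekly sample only when there is exactly one record per day. When several bookings can share a date, one option is to derive a week label from the date and keep one record per week. A minimal sketch on synthetic data, where the `Date` and `BookingID` columns are assumptions:

```python
import pandas as pd

# Two bookings per day for four weeks, starting on a Monday
days = pd.date_range("2023-01-02", periods=28, freq="D")
booking_data = pd.DataFrame({
    "Date": days.repeat(2),
    "BookingID": range(56),
})

booking_data = booking_data.sort_values("Date", kind="stable")
week = booking_data["Date"].dt.to_period("W")      # Monday-to-Sunday weeks
weekly_sample = booking_data[~week.duplicated()]   # first booking of each week

# Positional step for comparison: every 7th row, regardless of date
every_7th = booking_data.iloc[::7]
```

Keeping the first record of each week is only one choice; a random record per week could be drawn instead by grouping on the week label and sampling one row per group.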

    import pandas as pd

    # Read booking data
    booking_data = pd.read_csv('booking_data.csv')
    # Interval sampling: every 7th record (weekly only if there is one record per day, in date order)
    sample_data = booking_data.iloc[::7]

9. Marketing scenario: Click‑based sampling selects ads with the highest number of clicks.

    import pandas as pd

    # Read ad data
    ad_data = pd.read_csv('ad_data.csv')
    # Click-based sampling: top 100 ads by clicks
    sample_data = ad_data.nlargest(100, 'Clicks')

10. Logistics scenario: Weight‑based sampling chooses shipments with the largest weight, using a quantile threshold.
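A quantile threshold is a deterministic cutoff: every shipment above it is kept and none below it. A different technique, not the one this scenario uses, is weighted random sampling, in which heavier shipments are merely more likely to be drawn. pandas supports this through the `weights` argument of `DataFrame.sample`. A minimal sketch on synthetic data, where `ShipmentID` is a hypothetical column:

```python
import pandas as pd

shipment_data = pd.DataFrame({
    "ShipmentID": range(1, 11),
    "Weight": [1, 2, 3, 4, 5, 10, 20, 40, 80, 0],
})

# Each row's selection probability is proportional to its Weight;
# a row with zero weight is never selected
weighted_sample = shipment_data.sample(n=4, weights="Weight", random_state=1)
```

Unlike the threshold approach, light shipments can still appear in a weighted sample, just less often.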

    import pandas as pd

    # Read shipment data
    shipment_data = pd.read_csv('shipment_data.csv')
    # Determine the 90th-percentile weight threshold
    threshold = shipment_data['Weight'].quantile(0.9)
    # Weight-based sampling: shipments at or above the threshold
    sample_data = shipment_data[shipment_data['Weight'] >= threshold]

These examples show how different sampling strategies can reduce data volume and speed up analysis across diverse business domains. Random, stratified, and group sampling aim to keep the sample representative of the original dataset, whereas keyword, top-N, and threshold selections deliberately focus on a targeted subset.
