
What Happens When Most Language Learners Quit? A Data‑Driven Dive into Shanbay Users

Using Python’s Scrapy, pandas, and seaborn, the author scraped and cleaned public Shanbay user data, stored it in PostgreSQL, and analyzed registration and study habits to reveal that over 68% of users abandon word‑learning on day one, with only a tiny fraction persisting beyond 100 days.

MaGe Linux Operations

0x00 Introduction

The author wonders how many people give up on a vocabulary‑learning app on the very first day, and how many keep studying words over time.

0x01 Problem Definition and Task Breakdown

Key questions include: (1) How many users keep up with word learning (defined as >100 days)? (2) How many dreams are lost due to lack of persistence? (3) Does the amount of learned vocabulary follow a normal distribution?

0x02 Task 1 – Data Crawling

Public user data from Shanbay (e.g., http://www.shanbay.com/bdc/review/progress/2) was scraped with Python 2 and Scrapy. Getting past the site’s anti‑scraping measures required routing requests through proxy servers and disabling cookies.
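Both countermeasures map onto standard Scrapy mechanisms: the `COOKIES_ENABLED` setting and a downloader middleware that sets `request.meta["proxy"]`. The following is a minimal sketch, not the author's actual configuration. The proxy addresses, the module path `myproject.middlewares`, the class name `RandomProxyMiddleware`, and the delay value are all hypothetical.

```python
import random

# Hypothetical proxy pool; the original repository does not publish its proxy settings.
PROXY_POOL = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]


class RandomProxyMiddleware:
    """Downloader middleware that routes each request through a random proxy."""

    def __init__(self, proxies=PROXY_POOL):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Scrapy sends the request through the proxy named in request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # None tells Scrapy to keep processing the request normally


# Entries that would go into the project's settings.py (module path is an assumption).
SCRAPY_SETTINGS = {
    "COOKIES_ENABLED": False,  # do not store or send cookies
    "DOWNLOAD_DELAY": 1.0,     # seconds between requests; illustrative value
    "DOWNLOADER_MIDDLEWARES": {
        "myproject.middlewares.RandomProxyMiddleware": 350,
    },
}
```

Choosing a proxy per request, rather than once per crawl, spreads the load so that a single blocked address does not stall the whole run.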

0x03 Task 2 – Cleaning and Storage

Collected records were stored in a PostgreSQL database. Basic cleaning was done with SQL statements and pandas operations; any further cleaning was optional.
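As an illustration of SQL‑based cleaning, the sketch below removes rows with missing check‑in counts and then duplicate user IDs. It uses Python's built‑in `sqlite3` module as a stand‑in for PostgreSQL so that it runs without a database server. The table name `users`, its columns, and the sample rows are assumptions, not the author's schema or data.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id INTEGER, checkin_days INTEGER, words INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, 0, 0),
        (2, 15, 120),
        (2, 15, 120),     # duplicate row, e.g. from a re-crawled page
        (3, None, None),  # page that failed to parse
        (4, 230, 4100),
    ],
)

# Step 1: drop rows whose check-in count is missing.
conn.execute("DELETE FROM users WHERE checkin_days IS NULL")

# Step 2: keep one row per user_id (the one with the smallest rowid).
conn.execute(
    "DELETE FROM users WHERE rowid NOT IN "
    "(SELECT MIN(rowid) FROM users GROUP BY user_id)"
)
conn.commit()

cleaned = conn.execute(
    "SELECT user_id, checkin_days, words FROM users ORDER BY user_id"
).fetchall()
print(cleaned)
```

Note that `rowid` is specific to SQLite; a PostgreSQL version would need a different way to single out one row per user.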

0x04 Task 3 – Analysis

Analysis was carried out in an IPython notebook (Python 3, Anaconda). Visualizations were created with seaborn.

[Figure: Histogram of total check‑in counts]
[Figure: Histogram of non‑zero check‑in counts]

Further segmented histograms for ranges 0‑20, 20‑100, 100‑500, and 500‑2000 days are also included.
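The segmentation into ranges can be sketched with the standard library alone. This is not the seaborn plotting code from the notebook; it only counts how many users fall into each range, which is what the segmented histograms summarize. Treating each range as half‑open (lower bound included, upper bound excluded) is an assumption, and the sample values are made up.

```python
from bisect import bisect_right

# Range edges taken from the article; half-open [low, high) ranges are an assumption.
EDGES = [0, 20, 100, 500, 2000]
LABELS = ["0-20", "20-100", "100-500", "500-2000"]


def bucket_counts(checkin_days):
    """Count how many users fall into each check-in range."""
    counts = {label: 0 for label in LABELS}
    for days in checkin_days:
        # Index of the range whose lower edge is the largest one <= days.
        idx = bisect_right(EDGES, days) - 1
        if 0 <= idx < len(LABELS):  # skip values outside all ranges
            counts[LABELS[idx]] += 1
    return counts


# Made-up sample values, not data from the article.
sample = [0, 0, 1, 3, 19, 20, 45, 99, 100, 350, 1830]
print(bucket_counts(sample))
```

Splitting the data this way keeps the large day‑0 spike from flattening the rest of the distribution in a single plot.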

0x05 Conclusions

Highest check‑in days: chainyu – 1830 days

Highest growth value: Lerystal – 28,767

Highest word count: chenmaoboss – 38,313

Average metrics per user:

Average check‑in days: 14.18 (11.69% exceed average)

Average growth value: 121.79 (11.42% exceed average)

Average learned words: 78.92 (≈2.19% exceed average)
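The pattern in these figures, where only a small share of users exceeds the average, is typical of a right‑skewed distribution: a few heavy users pull the mean far above what most users reach. A minimal sketch of the calculation, using made‑up numbers rather than the article's data:

```python
from statistics import mean


def share_above_mean(values):
    """Return (mean, fraction of values strictly greater than the mean)."""
    avg = mean(values)
    above = sum(1 for v in values if v > avg)
    return avg, above / len(values)


# Made-up, right-skewed sample: most users barely check in, one persists.
sample = [0, 0, 0, 0, 0, 0, 1, 2, 7, 90]
avg, share = share_above_mean(sample)
print(f"mean={avg:.2f}, share above mean={share:.0%}")
```

In this toy sample a single persistent user accounts for most of the total, so only one user in ten sits above the mean.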

Key findings from the sample (≈600 k users):

68.15% have abandoned word learning by day 0.

76.40% have abandoned by day 1.

79.31% have abandoned by day 2.

83.52% have abandoned by day 5.

86.95% have abandoned by day 10.

90.28% have abandoned by day 20.

94.28% have abandoned by day 50.

96.69% have abandoned by day 100.

98.36% have abandoned before day 200.

98.81% have abandoned before day 263.

Thus, only a very small fraction of users persist beyond 200 days.
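The abandonment percentages in this section are cumulative, and shares of that kind can be derived from per‑user check‑in totals roughly as follows. This is a sketch under assumptions: it treats a user as having abandoned by day N when their total check‑in days are at most N, which the article does not state explicitly, and the sample values are made up.

```python
def cumulative_dropout(checkin_days, thresholds):
    """Map each threshold to the fraction of users with at most that many check-in days."""
    total = len(checkin_days)
    return {
        t: sum(1 for d in checkin_days if d <= t) / total
        for t in thresholds
    }


# Made-up sample: seven users never check in, three check in 1, 4, and 250 days.
sample = [0] * 7 + [1, 4, 250]
shares = cumulative_dropout(sample, thresholds=[0, 1, 5, 100])
for day, share in shares.items():
    print(f"{share:.0%} have abandoned by day {day}")
```

Because each threshold includes everyone counted at the lower thresholds, the shares can only stay level or grow as the day count increases.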

0x06 Reflections and Limitations

The sample is biased toward early‑registered users (IDs 1‑1,111,111). Users who registered later have had less time to accumulate check‑in days, so they are less likely to show long‑term persistence. Anti‑scraping defenses also limited crawling speed; proxy handling and cookie disabling were necessary.

0x07 Code

The scraping and analysis code, along with the IPython notebook, is available on GitHub: https://github.com/twocucao/DataScience/ . The repository does not include the proxy settings.
