
Statistical Monitoring Using Normal Distribution and Boxplot: Theory, Implementation, and API Design

The article explains the origin of the normal distribution, the central limit theorem, and how boxplots identify anomalies, then describes a Java‑based API that partitions data into five median‑centered levels using same‑period and year‑over‑year ratios to automatically detect and classify abnormal trends in daily metrics.

vivo Internet Technology

Background: The article opens with a whimsical account of how the normal distribution originated, using a coin-toss analogy: a single random experiment (heads = +1, tails = −1) yields a simple two-point distribution, while repeated tosses (10, 100, or infinitely many) gradually build the familiar bell-shaped curve.

Central Limit Theorem: It explains that when many independent factors influence a variable (e.g., a student's score), the sum (or average) of those factors tends toward a normal distribution regardless of the individual factors' distributions. This theoretical foundation justifies why scores, measurement errors, and many other real-world quantities appear normal.
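The coin-toss experiment described above can be simulated directly. The following sketch (illustrative, not from the article) sums n tosses of ±1 and repeats the experiment many times; printing the histogram shows the two-point distribution at n = 1 flattening into a bell shape as n grows, just as the central limit theorem predicts.

```java
import java.util.Random;

public class CoinTossDemo {
    /** Sum of n tosses, each +1 (heads) or -1 (tails) with equal probability. */
    static int tossSum(int n, Random rng) {
        int sum = 0;
        for (int i = 0; i < n; i++) sum += rng.nextBoolean() ? 1 : -1;
        return sum;
    }

    /** Histogram over 'trials' repetitions, indexed by the number of heads (0..n). */
    static int[] histogram(int n, int trials, long seed) {
        Random rng = new Random(seed);
        int[] counts = new int[n + 1];
        // A sum s over n tosses corresponds to (s + n) / 2 heads.
        for (int t = 0; t < trials; t++) counts[(tossSum(n, rng) + n) / 2]++;
        return counts;
    }

    public static void main(String[] args) {
        int[] h = histogram(10, 100_000, 42L);
        for (int heads = 0; heads <= 10; heads++)
            System.out.printf("%2d heads: %s%n", heads, "#".repeat(h[heads] / 1000));
    }
}
```

Running `main` prints a rough ASCII bell curve: the counts peak at 5 heads and fall off symmetrically toward 0 and 10.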

Boxplot for Anomaly Detection: The text introduces the boxplot (five-number summary: minimum, Q1, median, Q3, maximum) as a compact way to visualise a data distribution, flag outliers (conventionally, points beyond 1.5 × IQR from the quartiles), and assess symmetry. It notes that for approximately normal data, the extreme tails beyond ±3σ, roughly 0.135 % on each side, can be treated as anomalies.
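A minimal sketch of those boxplot statistics is below. Quartiles are computed with the simple "median of the halves" method; real statistics libraries differ slightly in how they interpolate, so the exact Q1/Q3 values are method-dependent.

```java
import java.util.Arrays;

public class BoxplotStats {
    /** Median of sorted[from, to) (half-open range). */
    static double median(double[] sorted, int from, int to) {
        int len = to - from, mid = from + len / 2;
        return (len % 2 == 1) ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Returns {min, Q1, median, Q3, max} — the five-number summary. */
    static double[] fiveNumberSummary(double[] data) {
        double[] s = data.clone();
        Arrays.sort(s);
        int n = s.length, half = n / 2;
        double q1 = median(s, 0, half);                              // lower half
        double q3 = median(s, (n % 2 == 0) ? half : half + 1, n);    // upper half
        return new double[]{s[0], q1, median(s, 0, n), q3, s[n - 1]};
    }

    /** Standard boxplot rule: outlier if outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. */
    static boolean isOutlier(double x, double[] summary) {
        double iqr = summary[3] - summary[1];
        return x < summary[1] - 1.5 * iqr || x > summary[3] + 1.5 * iqr;
    }

    public static void main(String[] args) {
        double[] data = {7, 15, 36, 39, 40, 41, 42, 43, 47, 49};
        double[] s = fiveNumberSummary(data);
        System.out.println(Arrays.toString(s));          // {7, 36, 40.5, 43, 49}
        System.out.println(isOutlier(7, s));             // 7 falls below Q1 - 1.5*IQR
    }
}
```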

Monitoring Solution: Leveraging the normal-distribution threshold principle, the solution automatically partitions data into five levels based on median-centered intervals. By comparing the day-over-day (环比) and week-over-week (同比, same weekday last week) ratios, four outcomes are defined: abnormal rise, abnormal, abnormal decline, and no anomaly.
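The level-and-classification step can be sketched as follows. The article does not publish its interval boundaries, so the ±0.1 / ±0.3 offsets and the level-pair-to-outcome mapping below are illustrative assumptions, not the production thresholds.

```java
public class AnomalyClassifier {
    /**
     * Map a ratio to a level 1-5 relative to the median of historical ratios.
     * The +/-0.1 and +/-0.3 boundaries are assumed for illustration only.
     */
    static int level(double ratio, double median) {
        double d = ratio - median;
        if (d > 0.3) return 1;     // far above the median
        if (d > 0.1) return 2;     // moderately above
        if (d >= -0.1) return 3;   // near the median: normal band
        if (d >= -0.3) return 4;   // moderately below
        return 5;                  // far below the median
    }

    /** Combine the week-over-week and day-over-day levels into one of four outcomes. */
    static String classify(int level1, int level2) {
        if (level1 <= 2 && level2 <= 2) return "abnormal rise";
        if (level1 >= 4 && level2 >= 4) return "abnormal decline";
        if (level1 == 3 && level2 == 3) return "no anomaly";
        return "abnormal"; // the two ratios disagree
    }

    public static void main(String[] args) {
        // Ratios matching the API example: 同比 -0.3, 环比 -0.6, median assumed 0.
        int l1 = level(-0.3, 0.0), l2 = level(-0.6, 0.0);
        System.out.println(classify(l1, l2)); // prints "abnormal decline"
    }
}
```

With both ratios well below the median (levels 4 and 5), the classification matches the 异常下降 ("abnormal decline") result shown in the API example.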

Implementation – Code:

/*
 * Data-analysis API service.
 */
public class DataAnalysis {
    /*
     * Fluctuation analysis.
     * input: JSON, source data (example)
     * {
     *   "org_data": [
     *       { "date": "2020-02-01", "data": "10123230" },  // date type, long type
     *       { "date": "2020-02-02", "data": "9752755" },
     *       { "date": "2020-02-03", "data": "12123230" },
     *       .......
     *   ]
     * }
     * output: JSON, analysis result
     * {
     *   "type": 1,            // 1 on success, 0 on error
     *   "message": "",        // error reason
     *   "date": "2020-02-14", // date of the last record
     *   "data": 6346231,      // value of the last record
     *   "rate1": -0.3,        // week-over-week ratio (同比)
     *   "rate2": -0.6,        // day-over-day ratio (环比)
     *   "level1": 4,          // week-over-week level (1-5)
     *   "level2": 3,          // day-over-day level (1-5)
     *   "result": "异常下降"   // one of the four outcomes ("abnormal decline")
     * }
     */
    public String fluctuationAnalysis(String org_data) {
        // Step 1: validate the input data
        if (!checkOrgdata(org_data)) return "{\"type\":0, \"message\":\"\"}";
        // Step 2: compute the day-over-day and week-over-week ratios
        computeOrgdata(org_data);
        // Step 3: sort ascending by date, take the last record, and compute its levels
        // ... implementation details omitted ...
        return "..."; // JSON string
    }

    public boolean checkOrgdata(String org_data) {
        // Verify the dates are consecutive and cover at least 14 days
        // ...
        return true;
    }

    public String computeOrgdata(String org_data) {
        // day-over-day (环比) = (today's data / yesterday's data - 1) * 100%
        // week-over-week (同比) = (today's data / same weekday last week - 1) * 100%
        // Returns a JSON string containing all computed ratios
        return "...";
    }
}

API Specification: The service accepts a JSON array named org_data containing at least 14 consecutive daily records (preferably up to 100). Each record includes a date and a data value. The API returns the fields described above, treats null ratios as 0, and caps the sample at the most recent 90 days when more data are provided.
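The normalization rules in that specification can be sketched as small helpers. The method names and the plain-list representation here are assumptions for illustration; the actual service works on the JSON payload directly.

```java
import java.util.ArrayList;
import java.util.List;

public class InputNormalizer {
    /** The spec requires at least 14 consecutive daily records. */
    static boolean hasEnoughData(List<?> records) {
        return records.size() >= 14;
    }

    /** Keep only the most recent 90 entries of a chronologically sorted list. */
    static <T> List<T> capToLast90(List<T> records) {
        int from = Math.max(0, records.size() - 90);
        return new ArrayList<>(records.subList(from, records.size()));
    }

    /** Null ratios (e.g., no prior-day value to compare against) are treated as 0. */
    static double ratioOrZero(Double ratio) {
        return ratio == null ? 0.0 : ratio;
    }
}
```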

Packaging: The implementation is packaged as a JAR for use in the big-data center, where HQL scripts can invoke the API directly within data-warehouse workflows.

Application Scenarios: The solution has been deployed in several key business platforms, e.g., monitoring the daily metrics of an app store. The data flow extracts raw tables (e.g., da_appstore_core_data_di), processes them through the API, and stores the results in da_appstore_core_data_result_di. Anomalous points trigger alerts for rapid risk mitigation.

References:

《创世纪·数理统计·正态分布的前世今生》 ("Genesis · Mathematical Statistics · The Past and Present of the Normal Distribution")

Zhihu contributions by 小尧 and jinzhao on normal-distribution threshold theory
