How a Huawei Maintenance Engineer Turned Painful On‑Call Duty into Efficient Knowledge Management
A Huawei maintenance engineer shares a decade‑long journey of turning 24/7 on‑call pain into systematic knowledge management, building comprehensive fault‑handling documentation, automating tools, and guiding the team’s evolution toward SRE practices that dramatically reduce manual effort and improve reliability.
Knowledge‑Management Workflow
The author first built a personal knowledge base for Huawei’s SUN platform dual‑machine environment. Over roughly one month he:
Collected product manuals, monitoring scripts, and online case studies.
Extracted fault‑mode patterns from historical incidents.
Organised the material into a 100‑page troubleshooting guide that describes mechanisms, principles, and step‑by‑step resolutions, covering >90% of local dual‑machine faults.
This systematic approach reduced reliance on ad‑hoc colleague assistance and enabled independent problem solving.
Team‑Level Fast‑Recovery Documentation (2014)
To spread the benefit across the maintenance group, a department‑mandated “fast‑recovery” project was launched. The author led the effort, which involved:
Gathering several hundred incident cases from multiple teams.
Classifying each case by root cause rather than by surface symptom, which produced a fault‑mode library.
Mapping each fault mode to a concrete recovery procedure.
The result was a 400‑page “SUN Platform Major Issue Fast‑Recovery Document” containing 70 distinct fault scenarios spanning hardware, operating system, database, and application layers. The document addresses >95% of business‑interrupting incidents and provides verified, repeatable diagnosis and remediation steps.
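The classify-then-map idea can be illustrated with a small data structure. The following is a minimal sketch, not the team's actual tooling: the class names, fields, and sample data are all hypothetical, and it only shows how incidents with different symptoms can resolve to one fault mode and one recovery procedure when they share a root cause.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FaultMode:
    """One entry in a (hypothetical) fault-mode library."""
    mode_id: str
    layer: str                       # e.g. "hardware", "os", "database", "application"
    root_cause: str
    recovery_steps: Tuple[str, ...]  # ordered, verified recovery procedure


@dataclass(frozen=True)
class Incident:
    """A historical case; root_cause is filled in during analysis."""
    incident_id: str
    symptom: str
    root_cause: str


class FaultModeLibrary:
    """Indexes fault modes by root cause, not by surface symptom."""

    def __init__(self) -> None:
        self._by_cause: Dict[str, FaultMode] = {}
        self._cases: Dict[str, List[str]] = {}

    def register(self, mode: FaultMode) -> None:
        self._by_cause[mode.root_cause] = mode

    def classify(self, incident: Incident) -> Optional[FaultMode]:
        """Return the matching fault mode, or None if the cause is undocumented."""
        mode = self._by_cause.get(incident.root_cause)
        if mode is not None:
            self._cases.setdefault(mode.mode_id, []).append(incident.incident_id)
        return mode

    def cases_for(self, mode_id: str) -> List[str]:
        return list(self._cases.get(mode_id, []))
```

In this sketch, two incidents reported as "service timeout" and "login failure" would both land on the same entry if analysis attributes them to the same root cause, so the recovery steps are written and verified once.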
Automation Tool Development
With the documentation as a foundation, the team automated the most frequent failure paths. A notable example is a database‑repair utility that:
Detects more than twenty common database corruption patterns (e.g., damaged tables, broken indexes, inconsistent metadata).
Offers a one‑click “repair” operation that, after operator authorization, executes the following sequence automatically:
export‑data → rebuild‑database → import‑data
In a real incident involving a carrier’s disk failure, the manual process would have required hours of command‑line work. The tool completed the export, rebuild, and import steps automatically, eliminating overnight work and reducing human error.
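The control flow of such a one-click repair can be sketched as follows. This is an illustrative outline only: the function names, the `authorize` callback, and the in-memory "database" in the demo are assumptions made for the example and do not describe the real utility's interface.

```python
from typing import Callable, List, Sequence, Tuple


class RepairAborted(Exception):
    """Raised when the operator does not authorize the repair."""


def one_click_repair(
    detected: Sequence[str],
    authorize: Callable[[Sequence[str]], bool],
    steps: Sequence[Tuple[str, Callable[[], None]]],
) -> List[str]:
    """Run the repair steps in order once the operator has approved them.

    detected:  corruption patterns found by an earlier (hypothetical) scan
    authorize: callback that asks the operator for approval
    steps:     ordered (name, action) pairs
    Returns the names of the steps that completed.
    """
    if not detected:
        return []  # nothing to repair
    if not authorize(detected):
        raise RepairAborted("operator declined the repair")
    completed: List[str] = []
    for name, action in steps:
        action()  # an exception here stops the sequence before later steps run
        completed.append(name)
    return completed


def demo():
    """Toy run against an in-memory dict standing in for a database."""
    db = {"rows": [1, 2, 3], "corrupt_index": True}
    backup = {}

    def export_data():
        backup["rows"] = list(db["rows"])

    def rebuild_database():
        db.clear()
        db["rows"] = []
        db["corrupt_index"] = False

    def import_data():
        db["rows"].extend(backup["rows"])

    steps = [
        ("export-data", export_data),
        ("rebuild-database", rebuild_database),
        ("import-data", import_data),
    ]
    done = one_click_repair(["broken index"], lambda found: True, steps)
    return done, db
```

Keeping authorization as an explicit gate before any step runs mirrors the "after operator authorization" condition above, and stopping at the first failed step avoids rebuilding a database whose export did not finish.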
From Documentation to SRE
The maintenance group is shifting from a purely reactive model to Site Reliability Engineering (SRE). Unlike traditional maintenance, SRE emphasizes:
Design for reliability, as part of the DFx (Design for X) disciplines, from the product’s inception.
Self‑healing mechanisms and rapid diagnosis built on a structured knowledge base.
Progressive toolisation of high‑frequency fault scenarios, using the mature fast‑recovery documents as the source of truth.
This transition leverages the accumulated fault‑mode library, documented recovery procedures, and automation tools as the core assets for building more resilient services.
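One way to decide which fault scenarios to turn into tools first is to rank documented fault modes by how often they occur. The sketch below assumes each past incident has already been labelled with a fault-mode identifier; the function name and the sample identifiers are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def automation_candidates(
    incident_fault_modes: Iterable[str],
    documented_modes: Set[str],
    top_n: int = 3,
) -> List[Tuple[str, int]]:
    """Return the most frequent fault modes that already have a documented procedure.

    Only documented modes are counted, because the recovery document is
    treated as the source of truth for what a tool should do.
    """
    counts = Counter(m for m in incident_fault_modes if m in documented_modes)
    return counts.most_common(top_n)


history = ["db-index", "disk", "db-index", "net", "disk", "db-index"]
documented = {"db-index", "disk"}
candidates = automation_candidates(history, documented, top_n=2)
```

Here `candidates` lists the documented fault modes in descending order of frequency, which gives a simple starting order for tool development.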
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
