monitoring | BestHub

Collection size: 1794 articles

360 Tech Engineering

Dec 17, 2019 · Backend Development

Diagnosing Java Memory Leaks: JVM GC Roots, Monitoring, and Code Fixes

This article explains how Java memory leaks can occur despite automatic garbage collection, describes JVM GC‑Root analysis, outlines practical monitoring with Spring Boot Actuator, Prometheus, and Grafana, and provides step‑by‑step debugging commands and code adjustments to locate and fix the leak.

JVM, Java, Spring Boot

10 min read

360 Tech Engineering

Dec 5, 2019 · Databases

Design and Implementation of a High‑Availability InfluxDB Cluster at 360

This article introduces the fundamentals of time‑series databases, explains why InfluxDB was chosen, describes the TSM storage engine and shard concepts, outlines 360's internal InfluxDB‑HA architecture, compares its performance with a single node, and provides integration and future‑development guidelines.

Cluster Architecture, InfluxDB, high availability

8 min read

360 Tech Engineering

Nov 4, 2019 · Backend Development

Unified Interface Automation Testing Tool: Design, Implementation, and Real‑World Practice

This article presents a comprehensive guide to building and applying a unified API automation testing tool, covering its background, framework design, core features, data and configuration management, public functions, test case handling, logging, execution workflow, CI integration, and monitoring in a search service environment.

API testing, Automation, CI

15 min read

360 Tech Engineering

Aug 3, 2018 · Operations

Guidelines for Building Long‑Lived, Stable Systems: Goals, Practices, and Continuous Improvement

This article shares practical methodologies for designing, deploying, and maintaining systems that can reliably operate for ten years, covering goal setting, holistic design considerations, carrier and data‑center choices, active‑active architecture, server and platform selection, monitoring, and continuous personal improvement.

Deployment, Operations, best practices

6 min read

360 Tech Engineering

Jul 18, 2018 · Operations

How to Monitor Elasticsearch Performance: Query, Indexing, and JVM Metrics

The article explains how to proactively monitor Elasticsearch by covering key performance areas such as query and indexing latency, JVM heap and garbage‑collection behavior, and host‑level system metrics, providing practical guidance and visual diagrams for effective operations management.

Elasticsearch, Indexing, JVM

12 min read

TAL Education Technology

Aug 19, 2021 · Operations

Comprehensive SRE Guide for Summer and Winter High‑Load Periods in an Online Education Platform

This document outlines a comprehensive SRE‑driven operational framework for ensuring stable, high‑availability online education services during peak summer and winter periods, detailing pre‑, during‑, and post‑maintenance phases, architectural principles, load testing, monitoring, capacity management, safety hardening, chaos engineering, incident response, and post‑mortem practices.

Chaos Engineering, SRE, capacity planning

17 min read

TAL Education Technology

May 27, 2021 · Big Data

Big Data Monitoring System: Architecture, Basic and Advanced Monitoring, and Alert Convergence & Grading

This article outlines the challenges of operating petabyte‑scale big‑data clusters and presents a comprehensive monitoring framework—including basic and upgraded monitoring layers, metric collection, alerting pipelines, and strategies for alarm convergence and grading—to ensure reliable, proactive SRE operations.

Alerting, Grafana, Operations

12 min read

Qunar Tech Salon

May 20, 2024 · Big Data

Optimizing Kafka Production at Qunar Travel: Reducing CPU Usage by 2000 Cores

This article presents a comprehensive case study of how Qunar Travel identified and resolved Kafka production bottlenecks—through metric monitoring, thread and flush parameter tuning, and Filebeat batch adjustments—resulting in a 2000‑core CPU reduction, higher network idle rates, and lower resource consumption across three clusters.

Big Data, Kafka, Kubernetes

12 min read

Qunar Tech Salon

May 19, 2022 · Operations

Design and Optimization of a Large‑Scale Monitoring System at Qunar.com

This article describes the architecture, challenges, and performance optimizations of Qunar.com's Watcher monitoring platform, covering massive metric collection, master‑worker redesign, Graphite/Whisper storage enhancements, and future migration to Go‑based cloud‑native solutions.

CI/CD, Cloud Native, Metrics

13 min read

Qunar Tech Salon

Dec 20, 2021 · Databases

From Database Development to the New DBA: Strategies for Efficiency, Automation, and Career Growth

The article shares the author’s journey from early database development at DM6/DM7 through MySQL operations at Qunar, offering practical advice on demand‑driven implementation, data‑driven management, intelligent alerting, full‑log analysis, slow‑query risk modeling, high‑availability, and automation to transform traditional DBA work into a proactive, efficient New DBA role.

Automation, Database Administration, DevOps

31 min read

Qunar Tech Salon

Jun 23, 2020 · Operations

A Simple Gray Release Solution for High‑Concurrency Flight Ticket Systems

This article presents a lightweight gray release approach for complex flight ticket services, comparing traditional hardware and soft‑routing isolation methods, describing the authors' traffic‑based gray identification, business‑focused monitoring, implementation details, and automated safeguards to enable safe incremental deployments.

Deployment, Operations, backend

8 min read

Qunar Tech Salon

Jul 31, 2018 · Operations

Best Practices for Container Operations: Logging, Monitoring, Security, and Immutability

This article outlines essential container operation best practices—including native logging, JSON log formatting, sidecar aggregators, stateless and immutable design, avoiding privileged containers, effective monitoring, health checks, non‑root execution, and careful image tagging—to help developers build secure, maintainable, and observable workloads on Kubernetes.

Containers, Kubernetes, best practices

17 min read

Qunar Tech Salon

Dec 28, 2017 · Databases

7 Essential Tips for Optimizing MySQL Performance

This article presents seven practical techniques—including using EXPLAIN, creating proper indexes, adjusting default settings, loading data into memory, leveraging SSD storage, scaling horizontally, and improving observability—to keep MySQL databases fast, stable, and responsive as workloads grow.

Configuration, Indexing, MySQL

14 min read

Qunar Tech Salon

Dec 14, 2017 · Databases

TiDB Architecture, Deployment, and Monitoring Practices at Qunar

This article explains Qunar's transition from MySQL, Redis, and HBase to TiDB, detailing the background of distributed databases, TiDB's architecture, hardware selection, deployment automation, monitoring setup, and real‑world usage scenarios to address scalability and high‑availability challenges.

Big Data, Database Architecture, Deployment

14 min read

Qunar Tech Salon

Nov 8, 2017 · Operations

Evolution of Ele.me's Operations Infrastructure: From 1.0 to 2.0 – Standardization, Automation, and Data‑Driven Management

The article recounts Ele.me's rapid growth and the resulting operational challenges, describing how the company progressed from ad‑hoc 1.0 practices to a standardized, automated 2.0 infrastructure built on ZStack private cloud, fine‑grained operations, and data‑driven management to improve quality, efficiency, and cost.

Automation, Operations, Private Cloud

21 min read

Qunar Tech Salon

Oct 18, 2017 · Cloud Computing

Gome Group’s Cloud Computing and Operations Automation Practices

This article details Gome Group’s transition to cloud computing and operations automation, describing its corporate background, new operational strategies, the establishment of Gome Cloud, IaaS product architecture, monitoring solutions, automation standards, and deployment practices such as gray releases and Docker integration.

DevOps, IaaS, Operations Automation

15 min read

Qunar Tech Salon

May 19, 2017 · Mobile Development

Zero‑Instrumentation Interaction and Performance Monitoring for Large‑Scale Mobile Apps

The article presents a comprehensive approach to solving crash and performance issues in large‑scale mobile applications by reconstructing user interaction traces through a no‑track analytics platform, compile‑time AOP instrumentation, and unified data aggregation, ultimately improving debugging efficiency and reducing operational overhead.

AOP, analytics, monitoring

9 min read

Qunar Tech Salon

Mar 23, 2017 · Cloud Native

Ctrip Container Cloud: Architecture, Scaling, and Operational Practices

The article describes how Ctrip's rapid business growth created the need for elastic scaling, how container technology enabled provisioning within seconds, and how the container cloud platform was designed, including deployment principles, network choices, orchestration evaluations, monitoring solutions, and an overview of CDOS, offering practical insights for large‑scale cloud‑native operations.

Cloud Native, DevOps, Scaling

16 min read

Qunar Tech Salon

Sep 14, 2016 · Cloud Computing

Design and Implementation of Ctrip's Virtual Cloud Desktop System Based on OpenStack

This article presents Ctrip's deployment of a virtual cloud desktop system for its call center, detailing the OpenStack‑based architecture, advantages over traditional PCs, challenges encountered, the evolution to a decoupled design, resource over‑commit strategies, networking issues, and the operational tools and automated testing that ensure stability.

Automation, DevOps, OpenStack

13 min read

Qunar Tech Salon

Feb 3, 2016 · Backend Development

The Value, Modes, and Practices of Performance Optimization

This article explains the benefits and drawbacks of performance optimization, distinguishes between single‑application and structural optimization approaches, outlines common steps, tools, and techniques for each, and presents case studies illustrating architectural evolution for improved scalability and stability.

Architecture, Caching, Optimization

7 min read
