Search


Results

Matches for “observability”

668 results
Cloud Native May 31, 2024 Alibaba Cloud Infrastructure

Best Practices for Deploying AI Model Inference on Knative

This guide explains how to efficiently deploy AI model inference services on Knative by externalizing model data, using Fluid for accelerated loading, configuring secrets, ImageCache, graceful shutdown, probes, autoscaling parameters, mixed ECS/ECI resources, shared GPU scheduling, and observability features to achieve fast scaling, low cost, and high elasticity.

cloud-native, Serverless, Autoscaling, Best Practices, GPU, Knative, AI Model Inference
Backend Development May 31, 2024 Bilibili Tech

Design and High‑Availability Practices of Bilibili's Video Submission System

Bilibili’s video submission platform uses a layered micro‑service architecture with a DAG‑based scheduler, extensive observability, and HA tactics such as sharding, 64‑bit ID migration, full‑link stress tests, chaos engineering, and multi‑active data‑center deployment, while tooling like trace correlation and automated alerts ensures stability and guides future hybrid‑cloud migration.

Backend Architecture, Microservices, DAG, High Availability, Observability, Bilibili, Video Submission
Operations May 21, 2024 Efficient Ops

What Is an SRE? Roles, Skills, and Best Practices Explained

This article demystifies Site Reliability Engineering (SRE) by explaining its origins, core responsibilities, essential skill sets, and key practices such as observability, incident response, testing, capacity planning, automation, user support, on‑call duties, and the definition of SLI/SLO/SLA, providing a comprehensive guide for modern operations teams.

Automation, operations, observability, SRE, capacity planning, incident response
Cloud Native May 14, 2024 Yang Money Pot Technology Team

Optimizing CI/CD Pipeline and Release Strategies for Microservices in a Cloud‑Native Environment

This article details a comprehensive overhaul of a company's CI/CD workflow for Java, Python, Go, and Node.js microservices, introducing automated pipelines, parallel builds, and rolling, canary, and blue‑green deployments on Kubernetes with Istio to improve release speed, stability, and observability.

Cloud Native, CI/CD, microservices, Automation, Kubernetes, DevOps, Release Management
Operations May 9, 2024 ByteDance SYS Tech

How Large‑Model Agents Transform AIOps: From Automation to Self‑Healing Operations

The presentation explains how large‑model agents empower AIOps by automating routine tasks and enhancing anomaly detection, fault diagnosis, and remediation. It also outlines architectural components, multi‑agent collaboration, and future directions for building self‑healing, observability‑driven operations platforms.

Observability, Agent, AIOps, Self‑Healing, Operations Automation
Big Data Apr 30, 2024 DataFunTalk

Vivo's Evolution of Large‑Scale Distributed Messaging Middleware Architecture and Practices

This technical presentation details Vivo's end‑to‑end big‑data architecture, the evolution from Kafka to Pulsar for massive message processing, deployment strategies, high‑availability mechanisms, observability practices, and future plans for cloud‑native, containerized messaging middleware.

big data, observability, high availability, Kafka, vivo, Pulsar, distributed messaging
Artificial Intelligence Apr 29, 2024 Rare Earth Juejin Tech Community

Building Enterprise‑Grade Retrieval‑Augmented Generation (RAG) Systems: Challenges, Fault Points, and Best Practices

This comprehensive guide explores the complexities of building enterprise‑level Retrieval‑Augmented Generation (RAG) systems, detailing common failure points, architectural components such as authentication, input guards, query rewriting, document ingestion, indexing, storage, retrieval, generation, observability, caching, and multi‑tenant considerations, and provides actionable best‑practice recommendations for developers and technical leaders.

LLM, Observability, RAG, Caching, Vector Search, Enterprise AI
Backend Development Apr 12, 2024 Bilibili Tech

Design and Optimization of a High‑Throughput Long‑Connection Service for Live Streaming

The article details a Golang‑based high‑throughput long‑connection service for live‑streaming, describing its five‑layer architecture, multi‑protocol support, load‑balancing, message‑queue decoupling, aggregation with brotli compression, multi‑region deployment, priority channels, and future enhancements for observability and intelligent endpoint selection.

Backend Architecture, Golang, Load Balancing, Streaming, Long Connection, High Throughput, Message Compression
Backend Development Apr 11, 2024 NetEase Cloud Music Tech Team

Design and Implementation of an Online Configurable Data Consumption Service for NetEase Cloud Music Frontend Performance Monitoring (Corona)

The article details NetEase Cloud Music’s end‑to‑end, online‑configurable data‑consumption service and schema‑driven visualization platform that transform raw client logs into ClickHouse records, automatically generate tables and dashboards, and provide observability, dramatically reducing manual effort while supporting over twenty performance metrics for frontend monitoring.

frontend, data pipeline, performance monitoring, ClickHouse, visualization, Online Configuration
Cloud Native Apr 8, 2024 Ops Development Stories

Mastering Kubernetes Event Monitoring: Alerts, Collection, and Analysis

This guide explains how to monitor Kubernetes events, differentiate normal and warning events, and use tools like kube-eventer and kube-event-exporter to collect, alert on, and analyze cluster events through webhook, Kafka, Logstash, and Elasticsearch, enabling comprehensive observability and troubleshooting.

Cloud Native, Elasticsearch, Kubernetes, Alerting, Logstash, Event Monitoring