
How HTTPDNS Edge Migration Boosted Performance and Cut Costs by 35%

This article details the end‑to‑end migration of ByteDance's HTTPDNS service from a central cloud to edge nodes: the technical challenges of service placement and traffic scheduling, the edge‑native solutions built on a visual evaluation model and GTM, and the resulting performance, cost, and reliability gains.

Volcano Engine Developer Services

Background of HTTPDNS

Traditional DNS sends UDP queries to a local DNS server, which can suffer from cache staleness, hijacking, cross‑network resolution, and timeouts. HTTPDNS replaces UDP with HTTP/HTTPS, providing domain‑level hijack protection, precise traffic scheduling, and immediate propagation of resolution changes. ByteDance's major apps (Douyin, Toutiao, Xigua Video, etc.) generate tens of trillions of DNS queries daily, requiring a highly scalable solution.
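The core idea can be sketched in a few lines: the client fetches resolution results over HTTPS instead of UDP port 53, then caches the returned IPs itself. The endpoint URL and JSON schema below are illustrative assumptions, not ByteDance's actual API.

```python
import json

def parse_httpdns_response(body: str) -> dict:
    """Parse a hypothetical HTTPDNS JSON answer into host, IPs, and TTL.

    Real HTTPDNS providers differ in payload shape; this assumes a
    minimal {"host": ..., "ips": [...], "ttl": ...} schema.
    """
    answer = json.loads(body)
    return {
        "host": answer["host"],
        "ips": list(answer["ips"]),   # resolved IPs; the client picks or rotates
        "ttl": int(answer["ttl"]),    # client-side cache lifetime in seconds
    }

# A client would fetch this over HTTPS, bypassing the local DNS path, e.g.:
#   GET https://httpdns.example.com/resolve?host=www.example.com
sample = '{"host": "www.example.com", "ips": ["1.2.3.4", "5.6.7.8"], "ttl": 60}'
result = parse_httpdns_response(sample)
```

Because the answer travels inside an authenticated HTTPS session, an on-path resolver cannot tamper with it, which is the source of the hijack protection described above.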

Edge Migration Motivation

To reduce cost and improve performance, the HTTPDNS team decided to move the service to edge locations. The rest of this article examines the migration along three dimensions: the practical challenges, the edge‑side solution, and the performance and cost benefits.

Edge Migration Challenges

1. Service Placement

Edge nodes are heterogeneous and geographically dispersed, turning placement into a resource‑constrained optimization problem. Existing research addresses cost (multi‑cloud minimal redundancy), quality (service dependency models, redundancy algorithms), and traffic (migration, device mobility, spatio‑temporal trajectories).

Cost‑focused studies model minimal redundant cost for dynamic resource allocation.

Quality‑focused studies build service dependency graphs and dynamic redundancy algorithms.

Traffic‑focused studies handle traffic migration, device movement, and pre‑deployment of resources.

In practice, traffic and device migration patterns, node performance variance, and stability fluctuations complicate these models, requiring placement decisions based on real‑time traffic features and edge node resource quality.
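The placement problem described above can be pictured as a constrained selection: fill forecast demand using nodes that clear a quality floor, at minimum cost. The greedy sketch below is a deliberately simplified illustration of that framing; all field names and thresholds are assumptions, and the article's real model is driven by live traffic features rather than static inputs.

```python
from dataclasses import dataclass

@dataclass
class EdgeNode:
    name: str
    capacity_qps: float   # spare query capacity at this node
    quality: float        # 0..1 composite of latency/loss/stability probes
    cost_per_qps: float   # unit resource cost

def place(nodes, demand_qps, min_quality=0.8):
    """Greedy sketch of resource-constrained placement: satisfy demand
    with the cheapest nodes that clear a quality floor."""
    plan, remaining = {}, demand_qps
    eligible = sorted(
        (n for n in nodes if n.quality >= min_quality),
        key=lambda n: n.cost_per_qps,
    )
    for n in eligible:
        if remaining <= 0:
            break
        take = min(n.capacity_qps, remaining)
        plan[n.name] = take
        remaining -= take
    return plan, remaining  # remaining > 0 means falling back to the center

nodes = [
    EdgeNode("edge-a", capacity_qps=100, quality=0.90, cost_per_qps=1.0),
    EdgeNode("edge-b", capacity_qps=200, quality=0.70, cost_per_qps=0.5),
    EdgeNode("edge-c", capacity_qps=150, quality=0.85, cost_per_qps=2.0),
]
plan, unmet = place(nodes, demand_qps=200)
```

Here edge-b is cheapest but fails the quality floor, so demand is split across edge-a and edge-c; in production the quality and capacity inputs would themselves fluctuate, which is exactly the complication the paragraph above describes.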

2. Traffic Scheduling

Edge traffic scheduling includes end‑edge, cloud‑edge, and edge‑edge scenarios. Mature techniques involve DNS‑based end‑edge scheduling, BGP ANYCAST for backbone routing, and cloud‑native traffic scheduling in cloud networks.

Challenges arise from massive numbers of heterogeneous 5G and industrial IoT devices, traffic volatility caused by device migration, and the difficulty of scheduling precisely while exploiting client‑edge collaboration. Machine‑learning‑based real‑time traffic perception is costly at scale.

Cloud‑edge scheduling also faces resource scarcity at the edge, mismatched optimization metrics, and inconsistent infrastructure across regions, affecting disaster recovery and fault tolerance.

Edge Solution

1. Visual Evaluation Model for Service Placement

The team built a full‑link probing and data collection pipeline, using real‑time data‑driven online placement algorithm simulation to create a visual evaluation model that guides node selection and service placement in edge data centers.
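One building block of such an evaluation model is collapsing raw probe samples into a per-node score that the placement simulation can rank. The weights and normalization constants below are illustrative assumptions, not the article's actual model.

```python
def node_quality(probes):
    """Collapse full-link probe samples into one score in [0, 1].

    `probes` is a list of (rtt_ms, loss_rate) samples. The 200 ms RTT
    ceiling, 10% loss ceiling, and 70/30 weighting are made up for
    illustration; a real model would calibrate these from data.
    """
    if not probes:
        return 0.0
    avg_rtt = sum(p[0] for p in probes) / len(probes)
    avg_loss = sum(p[1] for p in probes) / len(probes)
    rtt_score = max(0.0, 1.0 - avg_rtt / 200.0)   # 0 ms -> 1.0, 200 ms+ -> 0.0
    loss_score = max(0.0, 1.0 - avg_loss * 10.0)  # 10%+ loss -> 0.0
    return 0.7 * rtt_score + 0.3 * loss_score

score = node_quality([(40.0, 0.01)])  # one sample: 40 ms RTT, 1% loss
```

Feeding continuously refreshed scores like this into the online simulation is what makes the evaluation "visual" and data-driven: operators can see node rankings change as probe data arrives.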

2. GTM‑Based Traffic Scheduling

To address scheduling, the team integrated Global Traffic Manager (GTM) to provide intelligent DNS resolution, directing each user to the nearest edge node. GTM offers distributed health checks, failover, multi‑cloud traffic orchestration, and visual analysis of health‑check data.

Key applications:

Intelligent scheduling: capacity‑first mode generates global low‑latency DNS rules, ensuring automatic disaster recovery during edge node failures.

Fault tolerance and disaster recovery: edge nodes report capacity via console, API, or agent; GTM updates and distributes scheduling policies accordingly.
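The two applications above combine into one decision rule: among healthy nodes with spare capacity, answer with the lowest-latency one for the user's region; nodes that are down or full are skipped, which is the automatic failover path. The sketch below illustrates that rule with assumed field names; it is not GTM's actual policy engine.

```python
def resolve(user_region, nodes):
    """Capacity-first scheduling sketch: pick the lowest-latency
    healthy node that still has headroom for the user's region."""
    candidates = [
        n for n in nodes
        if n["healthy"] and n["load"] < n["capacity"]
    ]
    if not candidates:
        return None  # all edges unavailable: fall back to the central cloud
    return min(candidates, key=lambda n: n["latency_ms"][user_region])

nodes = [
    # lowest latency, but failed its health check -> skipped (failover)
    {"name": "edge-x", "healthy": False, "load": 0,  "capacity": 10,
     "latency_ms": {"us": 5}},
    # healthy with headroom -> eligible
    {"name": "edge-y", "healthy": True,  "load": 9,  "capacity": 10,
     "latency_ms": {"us": 20}},
    # healthy but at capacity -> skipped (capacity-first)
    {"name": "edge-z", "healthy": True,  "load": 10, "capacity": 10,
     "latency_ms": {"us": 1}},
]
chosen = resolve("us", nodes)
```

Because capacity reports and health checks both feed the candidate filter, a node failure or overload changes the DNS answer without any client-side change, matching the automatic disaster recovery behavior described above.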

Benefits

Performance improvements include a 5% overall latency reduction, a 4.1% decrease in TTFB, and a 15% reduction in traffic‑scheduling drift. Cost reductions comprise a 10% cut in bandwidth cost and a 35% optimization of total cost (50% savings on load balancers, 30% on compute, 70% on bandwidth).
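The per-category savings blend into the total in proportion to each category's share of the baseline bill. The article does not give that baseline cost mix, so the shares below are purely illustrative, chosen only to show the arithmetic; they land near, not exactly at, the reported 35%.

```python
# Per-category savings reported in the article.
savings = {"load_balancer": 0.50, "compute": 0.30, "bandwidth": 0.70}

# Assumed baseline cost shares (hypothetical; not from the article).
cost_share = {"load_balancer": 0.10, "compute": 0.80, "bandwidth": 0.10}

total_saving = sum(savings[k] * cost_share[k] for k in savings)
# 0.50*0.10 + 0.30*0.80 + 0.70*0.10 = 0.36, the same ballpark as
# the reported 35% total cost optimization.
```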

Disaster recovery achieved 100% automatic failover within 60 seconds, and edge fault detection time reduced to 60 seconds. Capacity scheduling enabled a 20%+ performance boost across regions.

Conclusion

The HTTPDNS team continuously explored feedback‑driven GTM scheduling models to mitigate edge resource constraints, improve traffic scheduling efficiency, and lower operational complexity. Migrating HTTPDNS to edge computing demonstrated significant cost savings, performance gains, and validated edge‑native cloud‑native practices for large‑scale internet services.

Tags: Performance optimization · Traffic Scheduling · HTTPDNS
Written by

Volcano Engine Developer Services

The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
