
Design and Evolution of Bilibili Intranet DNS Service

The article details Bilibili’s internal DNS service evolution—from an initial BIND9 master‑slave setup to a multi‑level caching architecture that boosts QPS to over 1.5 million—while describing comprehensive host, business, and client monitoring, key configuration pitfalls, and best‑practice recommendations for a low‑latency, reliable intranet DNS.

Bilibili Tech

The Domain Name System (DNS) acts as the Internet's address book, mapping easy‑to‑remember domain names to complex IP addresses and providing services such as load balancing. An internal (intranet) DNS service additionally offers private domains, DNS hijacking for internal routing and security, support for specific business logic, and ultra‑low latency with high throughput.

This article shares the practice of building Bilibili's internal DNS service.

Architecture Evolution

Initially, the team chose BIND9, the most widely used DNS implementation, and deployed two roles:

Authoritative Name Server (primary and secondary) for final domain resolution.

Caching Name Server (Resolver) to handle recursive queries, cache responses, and reduce load on authoritative servers.
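The split between the two roles can be sketched with illustrative named.conf fragments for each server type (zone names, file paths, and addresses below are hypothetical):

```
# Authoritative name server (named.conf) -- answers only for its own zones
options {
    recursion no;                      # authoritative only, no recursion
};
zone "internal.example" {
    type master;
    file "/etc/bind/zones/internal.example.db";
};

# Caching name server / resolver (named.conf) -- recurses and caches
options {
    recursion yes;
    allow-query { 10.0.0.0/8; };       # intranet clients only
};
zone "internal.example" {
    type forward;
    forwarders { 10.0.0.10; 10.0.0.11; };  # the authoritative pair
};
```

Keeping recursion off on the authoritative tier and pointing resolvers at it via forward zones is a common BIND pattern; the exact layout here is a sketch, not Bilibili's actual configuration.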

The first‑generation architecture used a simple master‑slave model with VIP‑based load balancing across IDC data centers. As traffic grew, latency spikes and the limited scalability of the secondary servers prompted a redesign.
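In a master‑slave model, zone data replicates through standard zone transfers; the relevant zone stanzas might look like this (addresses are placeholders):

```
# Primary (master) -- named.conf zone stanza
zone "internal.example" {
    type master;
    file "/etc/bind/zones/internal.example.db";
    allow-transfer { 10.0.0.11; };     # restrict AXFR to the secondary
    also-notify { 10.0.0.11; };        # push NOTIFY on zone changes
};

# Secondary (slave) -- pulls the zone from the primary
zone "internal.example" {
    type slave;
    masters { 10.0.0.10; };
    file "/var/cache/bind/internal.example.db";
};
```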

The second‑generation architecture introduced multi‑level caching: dedicated Caching Name Servers for horizontal scalability, plus NSCD as a client‑side cache for high‑QPS services (e.g., big data and AI workloads). BIND9 was upgraded to a newer version supporting reuseport and log buffering, raising single‑instance QPS from ~100k to >1.5 million.
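On high‑QPS clients, NSCD caches the hosts database locally so repeated lookups never leave the machine. A sketch of the relevant /etc/nscd.conf settings (the TTL and size values are illustrative, not Bilibili's):

```
# /etc/nscd.conf -- enable local caching of host lookups
enable-cache            hosts   yes
positive-time-to-live   hosts   60     # seconds to cache successful lookups
negative-time-to-live   hosts   5      # keep NXDOMAIN results only briefly
suggested-size          hosts   1009   # hash table size (a prime number)
```

Short negative TTLs matter here: caching NXDOMAIN too long can mask a record that was just added.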

DNS Service Monitoring

Monitoring is divided into three layers:

Host layer – CPU, memory, network, disk usage; alerts for single‑core CPU or single NIC overload.

Business layer – BIND internal metrics via statistics‑channels (custom exporter replaces bind_exporter), zone record change rate alerts, and BIND error‑log monitoring.

Client layer – Probes from multiple data centers simulate real user requests, checking availability, content correctness, latency, and packet loss; also monitor public DNS stability.
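The client‑layer probing described above can be sketched with a minimal, stdlib‑only Python script that hand‑builds an RFC 1035 query, sends it over UDP, and measures round‑trip latency. The server address and domain used are placeholders, and this is an illustrative sketch rather than Bilibili's actual probe:

```python
import socket
import struct
import time

def build_query(name, qtype=1, txid=0x1234):
    """Build a minimal DNS query packet per RFC 1035 (QTYPE 1 = A record)."""
    # Header: ID, flags (RD=1), QDCOUNT=1, AN/NS/AR counts all zero
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte
    qname = b"".join(bytes([len(label)]) + label.encode("ascii")
                     for label in name.rstrip(".").split("."))
    question = qname + b"\x00" + struct.pack(">HH", qtype, 1)  # QCLASS 1 = IN
    return header + question

def probe(server, name, timeout=1.0):
    """Send one UDP query; return (rtt_ms, raw_response), or (None, None) on loss."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        start = time.monotonic()
        sock.sendto(build_query(name), (server, 53))
        data, _ = sock.recvfrom(4096)
        return (time.monotonic() - start) * 1000.0, data
    except socket.timeout:
        return None, None  # treat a timeout as packet loss for alerting
    finally:
        sock.close()
```

A real probe would run something like `probe("10.0.0.2", "internal.example")` from every data center against each resolver VIP, alerting when latency exceeds a threshold or losses accumulate; a production version would also parse the answer section to verify content correctness.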

Pitfalls and Best Practices

Ensure both UDP and TCP ports are reachable; a classic DNS response over UDP is limited to 512 bytes, so larger responses require EDNS0 or a fallback to TCP.

After zone changes, increment the SOA serial number; otherwise master‑slave synchronization fails.
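The serial lives in the zone's SOA record and must increase on every edit; a date‑based YYYYMMDDnn serial is a common convention (zone name and timers below are illustrative):

```
; internal.example zone file -- bump the serial on every change
@   IN  SOA  ns1.internal.example. admin.internal.example. (
        2024061502  ; serial: date + revision, must increase monotonically
        3600        ; refresh
        600         ; retry
        604800      ; expire
        300 )       ; negative-caching TTL
```

Secondaries compare this serial against their copy; if it does not increase, they conclude nothing changed and skip the transfer.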

Avoid using rndc flush to clear the entire cache; prefer flushname or flushtree for targeted refreshes.

Use wildcard records cautiously; adding a specific record (e.g., TXT) without an A/AAAA or CNAME can break access.
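A sketch of this pitfall with a hypothetical zone: once any record exists at a name, that name "exists" and the wildcard no longer applies to it for other record types:

```
; wildcard covers every otherwise-undefined name under svc.internal.example
*.svc.internal.example.    IN  A    10.0.1.100

; defining only a TXT record here makes the name exist on its own,
; so the wildcard A record no longer matches foo.svc.internal.example
foo.svc.internal.example.  IN  TXT  "owner=team-a"
; result: A queries for foo.svc.internal.example now return NODATA
```

To keep the name resolving, add an explicit A/AAAA or CNAME record alongside the TXT record.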

Conclusion

Robust infrastructure services like internal DNS act as levers for business efficiency, reducing development and operational costs. Continuous evolution based on business needs ensures a stable, reliable, and easy‑to‑use DNS service.

References include the BIND 9 Administrator Reference Manual, RFC 1035, RFC 1912, and ISC knowledge‑base articles.
