Investigation and Resolution of Octavia API Slow Response Issue
This article details the background, architecture, step‑by‑step troubleshooting, analysis of network and server queues, and the final configuration changes that resolved the intermittent slow response times of the Octavia load‑balancer API in an OpenStack environment.
Octavia provides a high‑availability load‑balancing solution for OpenStack clusters, exposing a REST API that creates VIPs and routes traffic through HAProxy and LVS. In practice, the API response time varied widely from 0.2 s to 50 s, causing VIP creation and query failures.
The service architecture uses keepalived VRRP for HA and multiple Octavia API nodes behind HAProxy. The deployment diagram shows the high‑availability setup and the flow of requests.
Problem Investigation
Packet capture was performed to pinpoint where latency occurred. Custom HTTP headers distinguished requests, and analysis revealed long‑lasting packets with retransmissions, indicating issues beyond the client‑HAProxy hop.
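The tagging step can be sketched in a few lines of Python. The header name X-Debug-Trace-Id and the VIP address are illustrative placeholders, not the values used in the original investigation.

```python
import urllib.request
import uuid

# Tag each API request with a unique header so a slow response can be
# matched to its exact TCP stream in the packet capture.
# "X-Debug-Trace-Id" and the VIP 203.0.113.10 are placeholder values.
trace_id = str(uuid.uuid4())
req = urllib.request.Request("http://203.0.113.10:9876/v2.0/lbaas/loadbalancers")
req.add_header("X-Debug-Trace-Id", trace_id)
# In the capture, filter on the trace id, e.g.:
#   tcpdump -A -i any port 9876 | grep <trace_id>
```

Because every request carries a distinct id, a single slow client-side call can be traced through HAProxy to the backend without ambiguity.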
Further analysis showed that the client‑HAProxy interaction was fast; the delay originated between HAProxy and the Octavia API, ruling out HAProxy as the bottleneck.
Hardware checks confirmed no packet loss or network jitter, and the three Octavia API servers showed low load and normal connection counts.
Moving to the API hosts themselves, packet inspection showed that the Octavia API sometimes failed to answer incoming SYNs with a SYN/ACK, prompting a check of the listening socket on port 9876.
netstat -s | grep -i listen    # both counters kept growing
    1173805 times the listen queue of a socket overflowed
    1175909 SYNs to LISTEN sockets dropped

The growing counters correspond to the half‑open (SYN) queue and the fully established accept queue. The accept queue size is min(backlog, net.core.somaxconn), and the behavior on overflow is controlled by /proc/sys/net/ipv4/tcp_abort_on_overflow: with 0 (the default) the kernel silently drops the client's final ACK, causing the client to retransmit, and with 1 it replies with an RST.
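Rather than eyeballing the output, the two counters can be extracted programmatically and diffed between samples to confirm they are still growing. A minimal sketch, matching the exact wording of the lines above:

```python
import re

def parse_listen_overflow(netstat_s_output):
    """Extract the accept-queue overflow and SYN-drop counters
    from `netstat -s` text; returns (overflowed, dropped)."""
    overflowed = re.search(
        r"(\d+) times the listen queue of a socket overflowed",
        netstat_s_output)
    dropped = re.search(
        r"(\d+) SYNs to LISTEN sockets dropped",
        netstat_s_output)
    return (int(overflowed.group(1)) if overflowed else None,
            int(dropped.group(1)) if dropped else None)

sample = """\
    1173805 times the listen queue of a socket overflowed
    1175909 SYNs to LISTEN sockets dropped
"""
print(parse_listen_overflow(sample))  # (1173805, 1175909)
```

Sampling twice a few seconds apart and comparing the tuples shows whether overflow is ongoing or historical.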
Setting tcp_abort_on_overflow to 1 made overflowing connections fail immediately with an RST instead of retransmitting, confirming the link between the observed retransmissions and accept‑queue overflow.
Solution
The accept queue overflow was caused by the small backlog (5) used when Octavia API creates its WSGI server via wsgiref.simple_server.make_server. The backlog comes from SocketServer.TCPServer.server_activate, which calls listen() with the class‑level default request_queue_size = 5.
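This default is easy to confirm from the standard library itself (the module is socketserver in Python 3, SocketServer in Python 2):

```python
import socketserver
import wsgiref.simple_server

# wsgiref's WSGIServer inherits request_queue_size from
# socketserver.TCPServer; server_activate() then calls
# self.socket.listen(self.request_queue_size), i.e. listen(5).
print(socketserver.TCPServer.request_queue_size)            # 5
print(wsgiref.simple_server.WSGIServer.request_queue_size)  # 5
```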
Increasing request_queue_size to 128 raised the backlog and eliminated the queue overflow. Response times, however, remained high because the Octavia API still ran as a single‑process wsgiref server.
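A minimal sketch of that interim fix, assuming a plain wsgiref deployment (the subclass name and demo app are illustrative): subclass WSGIServer so the listening socket is created with a backlog of 128, which the kernel still caps at net.core.somaxconn.

```python
from wsgiref.simple_server import WSGIServer, make_server

class LargeBacklogWSGIServer(WSGIServer):
    # Raise the accept-queue backlog from the inherited default of 5;
    # the kernel caps the effective value at net.core.somaxconn.
    request_queue_size = 128

def app(environ, start_response):
    # Trivial placeholder WSGI app for the demo.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

# Port 0 lets the OS pick a free port for this demo; the real API
# listens on 9876.
server = make_server("127.0.0.1", 0, app,
                     server_class=LargeBacklogWSGIServer)
```

The constructor binds and activates the socket, so listen(128) has already been issued by the time make_server returns.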
The fix involved replacing the wsgiref server with OpenStack’s oslo_service WSGI server, allowing multiple worker processes (defaulting to CPU count) and a default backlog of 128. After this change, API latency dropped to under 1 second.
Alternatively, deploying Octavia behind Apache httpd with mod_wsgi can improve throughput, as wsgiref is intended only for demonstration and not production use.
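For reference, a mod_wsgi deployment could look roughly like the fragment below; the script path, process count, and user are placeholders to adapt to the actual installation, not values from the original environment.

```apache
<VirtualHost *:9876>
    # processes/threads/user and the script path are placeholders.
    WSGIDaemonProcess octavia-api processes=4 threads=2 user=octavia
    WSGIProcessGroup octavia-api
    WSGIScriptAlias / /var/www/octavia/octavia-wsgi
</VirtualHost>
```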
References:
https://www.cnblogs.com/Alexkk/p/12101950.html
http://jm.taobao.org/2017/05/25/525-1/
Linux 3.10.0‑957.27.2.el7 kernel source
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.