I started running a public NTP server behind my ER-X recently. The server is part of the NTP Pool Project and serves many clients over IPv6 only. At its peak, the load can reach about 1000 requests per second; in extreme cases, it can go as high as 5000 requests per second. The average hovers between 50 and 100 requests per second. Since NTP requests are single UDP packets on port 123, those numbers translate directly into 1000, 5000, and 50 to 100 packets per second respectively. For the NTP server itself, such workloads are very light.
I would expect the same workloads to be light for the EdgeRouter-X too, but it turns out the ER-X consumes far more CPU cycles than it should (as seen in the graph below). In the extreme case, one of its four cores stays at 100% utilization. With careful observation, I narrowed the issue down to a few user-space processes, namely
- nsm - routing agent
- ribd - routing agent
- charon - part of the strongSwan IPsec daemon
- dnsmasq - the DHCP server and DNS forwarder
The issue looks weird because UDP port 123 has nothing to do with any of these four processes as far as I'm aware. For clarity: my public NTP server is discrete hardware sitting behind the ER-X. The ER-X does have an internal NTP daemon, but I don't use it and it isn't even started. When the issue happens, WAN traffic is predominantly NTP requests at the aforementioned rates, with very little other WAN or LAN activity, so I would expect these processes to be mostly idle or sleeping.
That's not the case. Whenever NTP packets come in, all four processes consistently wake up. I'm not sure what they are doing, but each process shows CPU utilization between zero and a few percent, and that figure increases with the rate of incoming NTP packets. In severe cases, either charon or dnsmasq pegs one core at 100%, and occasionally the utilization does not drop even after NTP traffic stops completely. In extreme cases, the EdgeRouter-X reboots itself - one such instance was shortly after 12:00 in the following graphs.
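Per-process CPU figures like these can be sampled directly from /proc rather than eyeballed in top. Below is a minimal sketch of the idea for any Linux box with a /proc filesystem; the process list matches the four daemons above, but the 2-second interval and the script itself are purely illustrative, not what I actually ran:

```python
#!/usr/bin/env python3
"""Sample per-process CPU time from /proc to correlate daemon
wake-ups with incoming packet bursts. Linux-only sketch."""
import os
import time

NAMES = {"nsm", "ribd", "charon", "dnsmasq"}  # daemons under suspicion
INTERVAL = 2  # seconds between samples (illustrative)

def cpu_jiffies(pid):
    """Return utime+stime (in clock ticks) for a PID, or None if gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            # comm may contain spaces, so split after the closing paren
            fields = f.read().rsplit(")", 1)[1].split()
        # after comm: fields[11] is utime, fields[12] is stime (see proc(5))
        return int(fields[11]) + int(fields[12])
    except (OSError, IndexError, ValueError):
        return None

def find_pids(names):
    """Map process name -> PID by scanning /proc/*/comm."""
    pids = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                comm = f.read().strip()
        except OSError:
            continue  # process exited while scanning
        if comm in names:
            pids[comm] = int(entry)
    return pids

if __name__ == "__main__":
    hz = os.sysconf("SC_CLK_TCK")  # jiffies per second
    pids = find_pids(NAMES)
    before = {n: cpu_jiffies(p) for n, p in pids.items()}
    time.sleep(INTERVAL)
    for name, pid in pids.items():
        after = cpu_jiffies(pid)
        if after is not None and before[name] is not None:
            pct = (after - before[name]) / hz / INTERVAL * 100
            print(f"{name:10s} {pct:5.1f}% CPU over {INTERVAL} s")
```

Running it while replaying NTP traffic at different packet rates would make the correlation between packet rate and per-process CPU visible as numbers rather than a hunch.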
For completeness, I've already disabled conntrack on UDP port 123. While an NTP request rate of even 6000 p/s isn't considered high, the number of clients with unique IPv6 addresses is tremendous: over a span of 24 hours, I saw millions of them. Disabling conntrack on that UDP port eliminates unnecessary state tracking in the ER-X. It eases memory usage to some extent, but the CPU hogging remains, so we can rule out conntrack as a possible cause.
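For reference, a conntrack bypass of this kind looks roughly like the following on the EdgeOS CLI. The rule number and description are arbitrary, and I'm writing the node names from memory - they may differ between firmware versions, so treat this as a sketch rather than exact commands:

```shell
configure
# Skip connection tracking for inbound NTP requests over IPv6
# (rule number 10 and the description are arbitrary choices)
set system conntrack ignore ipv6 rule 10 description 'no conntrack for NTP'
set system conntrack ignore ipv6 rule 10 protocol udp
set system conntrack ignore ipv6 rule 10 destination port 123
commit
save
exit
```

With a rule like this in place, NTP flows no longer occupy entries in the conntrack table, which matters when millions of unique client addresses would otherwise each create short-lived state.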
My gut feeling tells me the four processes must share some common attribute that couples with this particular version of the Linux kernel to cause the issue. I first observed the issue on firmware v1.10.7, where the v1.10.y versions had recently undergone a kernel upgrade. I went back to v18.104.22.168 - the legacy left behind by the previous firmware team at Ubiquiti. Unfortunately, it made no difference.
If I stop charon with "sudo ipsec stop" and/or dnsmasq with "sudo service dnsmasq stop", the CPU utilization is noticeably alleviated, and the 100% hogging that leads to a reboot is no longer observed. nsm and ribd still exhibit the same CPU hogging as before, though.
I didn't test the setup on IPv4, so I cannot tell whether the CPU hogging exists there. I have, however, tested both native IPv6 and Hurricane Electric's 6in4 tunnel; both exhibit the issue. I also didn't test on any EdgeRouter other than the EdgeRouter-X, so I cannot tell whether Cavium-based EdgeRouters are affected.
I also cannot find any prior reports of CPU hogging of a similar nature on EdgeRouters, which is a bit surprising. Possibly: 1) very few people run public NTP servers behind EdgeRouters; 2) if they do, perhaps they do it over IPv4 only; 3) perhaps they use Cavium-based EdgeRouters?
It's a very weird issue.