From 57439ab15f1f3c8ab482f03045cba381d2224563 Mon Sep 17 00:00:00 2001
From: uvok cheetah
Date: Thu, 9 Jan 2025 14:09:23 +0100
Subject: Bird CPU usage, cnt

---
 _drafts/bird-cpu-usage.md | 46 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 38 insertions(+), 8 deletions(-)

diff --git a/_drafts/bird-cpu-usage.md b/_drafts/bird-cpu-usage.md
index 136a475..36bad4e 100644
--- a/_drafts/bird-cpu-usage.md
+++ b/_drafts/bird-cpu-usage.md
@@ -14,9 +14,8 @@ alt="RRD tool graphic showing a high CPU usage" %}
 
 I found it strange, but had no time to inspect further.
 
-Recently, I tried to investigate what happens on the server there.
-htop showed high CPU usage for both bird and dnsmasq (always
-together) in these times.
+Recently, I tried to investigate what happens on the server there. htop showed
+high CPU usage for both bird and dnsmasq (always together) in these times.
 
 Fuming a bit, I went with a brute-force approach:
 
@@ -97,14 +96,45 @@ debug symbols](https://wiki.debian.org/HowToGetABacktrace) - it doesn't for
 dnsmasq (yet) in bookworm. Luckily, the nice people at labs.nic.cz provide a
 dbgsym package in their Bird(2) Debian repository.
 
-Now, stracing dnsmasq (when "idle") reveals some recvmsg of type `RTM_NEWROUTE`.
-I have *no idea* why dnsmasq would need that. But I already *assume* the high
-CPU usage occurs when Bird exports lots of routes to the kernel.
+Now, stracing dnsmasq (when "idle") reveals some recvmsg of type
+`RTM_NEWROUTE`. I have *no idea* why dnsmasq would need that. But I already
+*assume* the high CPU usage occurs when Bird exports lots of routes to the
+kernel.
 
 Also, in journalctl, I see lots of the infamous `Kernel dropped some netlink
 messages, will resync on next scan.` messages at times - the message apparently
 nobody has a solution to, and even though there are mailing list posts telling
 to sysctl `net.core.rmem_default`, I doesn't seem to yield a solution.
 
-[1] At least when I want to see the binaries function names.
- Kernel symbols seem to show up fine.
+[1] At least when I want to see the binaries' function names. Kernel symbols
+seem to show up fine.
+
+So… step-by-step? I have both DN42 and clearnet bird running on this VPS, in
+parallel. So maybe start by disabling one of these? Or, I have a better idea,
+keep it enabled, and disable all protocols! (`birdc disable \"*\"`).
+
+That helped, until midnight. Then suddenly the CPU went up again. WTF? Let's
+have a look at `birdc s p`. All protocols up. Huh!?!?
+
+Let's investigate further:
+
+* Oooo, log rotation happens at midnight
+* Fuck, I specified the same log files for both birds
+
+Well, log rotation. Which I added manually. This does a `birdc configure`
+afterward. Which means the protocols go up again, because I disabled them on
+the command line, not in the config.
+
+Ungh. Okay, this is getting really ugly. `systemctl disable --now bird-clear`.
+Now let's run this for a few days...
+
+That seems to have helped. I now decided to edit the clearnet config and disable
+all "external" protocols (to IXPs), keeping Babel, RPKI, and IBGP enabled.
+Immediately after starting bird-clearnet, the CPU usage went up again. To be
+expected, for the initial sync. But it stayed high. So, I disabled the RPKI
+protocols as well and... suddenly, the CPU usage is down???
+
+I actually did a `perf record` of bird when the CPU usage was high, and I saw
+that `rt_event` was running 33% of the time. I don't know what to make of that.
+
+
-- 
cgit v1.2.3