---
layout: post
title: Bird CPU usage
date: 2025-01-07 19:06 +0100
categories: tech
lang: en
---

Several times already, I noticed this in my Munin monitoring:

{% image img="https://pics.uvokchee.de/upload/2025/01/07/20250107180512-c1453895.png" alt="RRD tool graphic showing a high CPU usage" %}

I found it strange, but had no time to inspect further. Recently, I tried to investigate what happens on the server there. htop showed high CPU usage for both bird and dnsmasq (always together) at these times.

Fuming a bit, I went with a brute-force approach:

```
#!/bin/bash

# Configuration
THRESHOLD=1.0
# 3 hours
DURATION=$((3*60*60))
MAIL_TO="lolnope"
SUBJECT="High Load Average Alert"
BODY="The load average has been above ${THRESHOLD} for more than 3 hours."
REBOOT_CMD="/sbin/reboot"

# Function to check the load average
check_load() {
    # 15 min loadavg
    loadavg=$(awk '{print $3}' /proc/loadavg)
    echo "$(date): Current Load Average: $loadavg"
    if (( $(echo "$loadavg > $THRESHOLD" | bc -l) )); then
        echo "$(date): Load average is above threshold."
        return 0
    else
        echo "$(date): Load average is below threshold."
        return 1
    fi
}

# Monitor the load average
start_time=$(date +%s)
while true; do
    if check_load; then
        current_time=$(date +%s)
        elapsed_time=$((current_time - start_time))
        if [ "$elapsed_time" -gt "$DURATION" ]; then
            echo "$(date): Load average has been above threshold for more than 3 hours."
            # Send alert email
            (echo "$BODY"; ps -e -o %cpu,%mem,cmd --sort pcpu | tail) | mail -s "$SUBJECT" "$MAIL_TO"
            # Reboot the server
            # systemctl stop bird
            # systemctl start bird
            $REBOOT_CMD
            break
        fi
    else
        start_time=$(date +%s)
    fi
    sleep 300 # Check every 5 minutes
done
```

Specifically, the output of `ps`:

```
22.7  2.7 /usr/sbin/bird -f -u bird -g bird
33.3  0.1 ps -e -o %cpu,%mem,cmd --sort pcpu
37.4  0.0 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new,.bak --local-service --trust-anchor=.,20326,8,2,e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d
```

confirmed the suspicion - although the "percentage" is a bit weird. From the manpage:

> Currently, it is the CPU time used divided by the time the process has been
> running (cputime/realtime ratio), expressed as a percentage.

(So if the process runs "long enough" and only starts misbehaving after a year, it won't show up?)

I asked an LLM what to do, in addition to strace, and it suggested perf. Unfortunately, this requires debug symbols [1]. And while [Debian does provide debug symbols](https://wiki.debian.org/HowToGetABacktrace), it doesn't for dnsmasq (yet) in bookworm. Luckily, the nice people at labs.nic.cz provide a dbgsym package in their Bird(2) Debian repository.

Now, stracing dnsmasq (when "idle") reveals some recvmsg calls of type `RTM_NEWROUTE`. I have *no idea* why dnsmasq would need those. But I already *assume* the high CPU usage occurs when Bird exports lots of routes to the kernel.

Also, in journalctl, I see lots of the infamous `Kernel dropped some netlink messages, will resync on next scan.` messages at times - the message apparently nobody has a solution for. Even though there are mailing list posts suggesting to raise the sysctl `net.core.rmem_default`, that didn't seem to yield a solution.

[1] At least when I want to see the binaries' function names. Kernel symbols seem to show up fine.

So… step-by-step?
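As an aside on that `ps` percentage caveat: sampling `/proc/<pid>/stat` twice gives the *current* CPU usage over a short window, instead of the lifetime cputime/realtime ratio. A minimal sketch (the `cpu_now` helper is my own, not part of any tool mentioned here; it assumes the process name contains no spaces, otherwise the stat fields shift):

```
#!/bin/bash
# Measure current CPU usage of a PID over a short window.
# Usage: cpu_now <pid> [interval_seconds]
cpu_now() {
    local pid=$1 interval=${2:-1}
    local hz; hz=$(getconf CLK_TCK)   # clock ticks per second
    # utime and stime are fields 14 and 15 of /proc/<pid>/stat
    local t1 t2
    t1=$(awk '{print $14 + $15}' "/proc/$pid/stat")
    sleep "$interval"
    t2=$(awk '{print $14 + $15}' "/proc/$pid/stat")
    # ticks consumed in the window, scaled to a percentage
    awk -v d=$((t2 - t1)) -v hz="$hz" -v iv="$interval" \
        'BEGIN { printf "%.1f\n", 100 * d / (hz * iv) }'
}

cpu_now "$$"   # the (idle) shell itself should print a near-zero value
```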
I have both a DN42 and a clearnet bird running on this VPS, in parallel. So maybe start by disabling one of these? Or, I have a better idea: keep it enabled, and disable all protocols! (`birdc disable \"*\"`)

That helped - until midnight, when suddenly the CPU went up again. WTF? Let's have a look at `birdc s p`. All protocols up. Huh!?!? Let's investigate further:

* Oooo, log rotation happens at midnight
* Fuck, I specified the same log files for both birds

Well, log rotation. Which I added manually. It does a `birdc configure` afterward. Which means the protocols go up again, because I disabled them on the command line, not in the config. Ungh.

Okay, this is getting really ugly. `systemctl disable --now bird-clear`. Now let's run this for a few days...

That seems to have helped. I then decided to edit the clearnet config and disable all "external" protocols (to IXPs), keeping Babel, RPKI, and IBGP enabled. Immediately after starting bird-clearnet, the CPU usage went up again. To be expected, for the initial sync. But it stayed high. So, I disabled the RPKI protocols as well and... suddenly, the CPU usage is down???

I actually did a `perf record` of bird when the CPU usage was high, and I saw that `rt_event` was running 33% of the time. I don't know what to make of that.
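In hindsight, the logrotate surprise could have been avoided by disabling protocols in the config file instead of via `birdc`: a `disabled;` statement in the protocol block survives a `birdc configure`. A hypothetical snippet (protocol name, ASNs, and neighbor address are made up for illustration):

```
protocol bgp ixp_peer1 {
    # unlike `birdc disable ixp_peer1`, this survives a reconfigure
    disabled;
    local as 64512;
    neighbor 192.0.2.1 as 64496;
    ipv4 {
        import all;
        export none;
    };
}
```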