diff options
Diffstat (limited to '_drafts/bird-cpu-usage.md')
-rw-r--r-- | _drafts/bird-cpu-usage.md | 170 |
1 files changed, 170 insertions, 0 deletions
diff --git a/_drafts/bird-cpu-usage.md b/_drafts/bird-cpu-usage.md new file mode 100644 index 0000000..c0fa5b7 --- /dev/null +++ b/_drafts/bird-cpu-usage.md @@ -0,0 +1,170 @@ +--- +layout: post +title: Bird CPU usage +date: 2025-01-07 19:06 +0100 +categories: tech +lang: en +--- + +**Preface**: This is kind of a "bug hunt log". I extensively posted on Mastodon +about it, but wanted to keep the progress in writing as well. This might lead +like random rambling in some parts, because I edited and rearranged the +paragraphs of this post a few times. Sorry for that. + +**Preface 2**: Starting point: In September 2024 I upgraded to bird2 +`2.15.1-1~bpo12+1` - probably a self-built backport of mine. This, I "had to +to", since I wanted the Babel metrics measurement, which is not present in +Debian stable (`2.0.12-7`). Yeah, I won't complain. For 99 % of the stuff I do, +Debian stable is *perfect*. I want a server without regularly breaking things, +after all. Only recently (2015-01-07), I added the bird / network.cz repos and +installed bird2 from there. + +Several times already, I noticed this in my Munin monitoring: + +{% image +img="https://pics.uvokchee.de/upload/2025/01/07/20250107180512-c1453895.png" +alt="RRD tool graphic showing a high CPU usage" %} + +I found it strange, but had no time to inspect further. + +Recently, I tried to investigate what happens on the server there. htop showed +high CPU usage for both bird and dnsmasq (always together) in these times. + +Fuming (angry) a bit, I went with a brute-force approach: + +``` +#!/bin/bash + +# Configuration +THRESHOLD=1.0 +# 3 hours +DURATION=$((3*80*80)) +MAIL_TO="lolnope" +SUBJECT="High Load Average Alert" +BODY="The load average has been above ${THRESHOLD} for more than 3 hours." +REBOOT_CMD="/sbin/reboot" + +# Function to check the load average +check_load() { + # 15 min loadavg + loadavg=$(awk '{print $3}' /proc/loadavg) + echo "$(date): Current Load Average: $loadavg" + + if (( $(echo "$loadavg > $THRESHOLD" | bc -l) )); then + echo "$(date): Load average is above threshold." + return 0 + else + echo "$(date): Load average is below threshold." + return 1 + fi +} + +# Monitor the load average +start_time=$(date +%s) +while true; do + if check_load; then + current_time=$(date +%s) + elapsed_time=$((current_time - start_time)) + + if [ "$elapsed_time" -gt "$DURATION" ]; then + echo "$(date): Load average has been above threshold for more than 3 hours." + + # Send alert email + (echo "$BODY"; ps -e -o %cpu,%mem,cmd --sort pcpu | tail) | mail -s "$SUBJECT" "$MAIL_TO" + + # Reboot the server - room for improvement +# systemctl stop bird +# systemctl start bird + $REBOOT_CMD + break + fi + else + start_time=$(date +%s) + fi + sleep 300 # Check every 5 minutes +done + +``` + +Specifically, the output of ps + +``` +22.7 2.7 /usr/sbin/bird -f -u bird -g bird +33.3 0.1 ps -e -o %cpu,%mem,cmd --sort pcpu +37.4 0.0 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new,.bak --local-service --trust-anchor=.,20326,8,2,e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d +``` + +confirmed the quick look at htop - although the "percentage" is a bit weird. +From the manpage: + +> Currently, it is the CPU time used divided by the time the process has been +> running (cputime/realtime ratio), expressed as a percentage. + +(So if the process runs "long enough" and only starts misbehaving after a year, +it won't show up?). + +I asked an LLM what to do, in addition to strace, and it suggested perf. +Unfortunately, this requires debug symbols [1]. [And while Debian does provide +debug symbols](https://wiki.debian.org/HowToGetABacktrace) - it doesn't for +dnsmasq (yet) in bookworm. Luckily, the nice people at labs.nic.cz provide a +dbgsym package in their Bird(2) Debian repository. + +Aside, the script above could benefit from some improvements, instead of +rebooting, I could actually record and strace and perf when this happens, and +only then reboot / restart the service. + +Now, stracing dnsmasq (when "idle") reveals some recvmsg of type +`RTM_NEWROUTE`. I have *no idea* why dnsmasq would need that. But I already +*assume* the high CPU usage occurs when Bird exports lots of routes to the +kernel. *I have no hecking clue why dnsmasq would need to use the CPU at the +same time. If someone can enlighten me, **please** message me.* + +Also, in journalctl, I see lots of the infamous `Kernel dropped some netlink +messages, will resync on next scan.` messages at times - the message apparently +nobody has a solution to, and even though there are mailing list posts telling +to sysctl `net.core.rmem_default`, I doesn't seem to yield a solution. + +I actually did a `perf record` of bird when the CPU usage was high, and I saw +that `rt_event` was running 33% of the time. I don't know what to make of that. + +So… step-by-step? I have both DN42 and clearnet bird running on this VPS, in +parallel. So maybe start by disabling one of these? Or, I have a better idea, +keep it enabled, and disable all protocols! (`birdc disable \"*\"`). That +helped, until midnight. When suddenly the CPU went up again. WTF? Let's have a +look at `birdc s p`. All protocols up. Huh!?!? + +Let's investigate further: + +* Oooo, log rotation happens at midnight +* Fuck, I specified the same log files for both bird's + +Well, log rotation. Which I added myself. This does a `birdc configure` +afterward. Which means the protocols go up again, because I disabled them on the +command line, not in the config. + +Ungh. Okay, this is getting really ugly. `systemctl disable --now bird-clear`. +Now let's run this for a few days... + +That seem to have helped. I now decided to edit the clearnet config and disable +all "external" protocols (to IXPs), keeping Babel, RPKI, and IBGP enabled. +Immediately after starting bird-clearnet, the CPU usage went up again. To be +expected, for the initial sync. But it kept being high. So, I disabled the RPKI +protocols as well and... suddenly, the CPU usage is down??? + +(2025-01-09) So, now, the RPKI stuff... I saw that the connection to one of my +RPKI servers wouldn't work (Fort Validator), and Fort complained about `Bad RTR +version: 2`. The [Bird +docs](https://bird.network.cz/?get_doc&v=20&f=bird-6.html) mention a `max +version` option, and with that, the connection worked. And the CPU usage was +down. *However*, this was not "the root issue". This option is only present +since bird 2.16, which I only installed on 2025-01-07 [2]. And I had the CPU +problems back in November last year! So this is "something completely +different", it appears. I currently have the BGP sessions to virtual IXPs +disabled, and will run it like this for a few days. The average load is 0.4. +Still too high, compared to my other VPS. + +[1] At least when I want to see the binaries function names. Kernel symbols +seem to show up fine. \ +[2] Which is pretty annoying, when you chase one bug, you change something in +your config, and then something else breaks. That's on me. + |