summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authoruvok cheetah2025-01-09 20:16:52 +0100
committeruvok cheetah2025-01-09 20:16:52 +0100
commit946de8ee08ef5ed703046221b6eda9dc9a6f0245 (patch)
tree57869453f83d03264ba4c88e5cb496ecc5236097
parent8673b658e0c12c2288ac2935a91184140c00fc66 (diff)
Update bird cpu bug log.HEADmaster
-rw-r--r--_drafts/bird-cpu-usage.md64
1 files changed, 47 insertions, 17 deletions
diff --git a/_drafts/bird-cpu-usage.md b/_drafts/bird-cpu-usage.md
index 36bad4e..c0fa5b7 100644
--- a/_drafts/bird-cpu-usage.md
+++ b/_drafts/bird-cpu-usage.md
@@ -6,6 +6,19 @@ categories: tech
lang: en
---
+**Preface**: This is kind of a "bug hunt log". I extensively posted on Mastodon
+about it, but wanted to keep the progress in writing as well. This might lead
+like random rambling in some parts, because I edited and rearranged the
+paragraphs of this post a few times. Sorry for that.
+
+**Preface 2**: Starting point: In September 2024 I upgraded to bird2
+`2.15.1-1~bpo12+1` - probably a self-built backport of mine. This, I "had to
+to", since I wanted the Babel metrics measurement, which is not present in
+Debian stable (`2.0.12-7`). Yeah, I won't complain. For 99 % of the stuff I do,
+Debian stable is *perfect*. I want a server without regularly breaking things,
+after all. Only recently (2015-01-07), I added the bird / network.cz repos and
+installed bird2 from there.
+
Several times already, I noticed this in my Munin monitoring:
{% image
@@ -17,7 +30,7 @@ I found it strange, but had no time to inspect further.
Recently, I tried to investigate what happens on the server there. htop showed
high CPU usage for both bird and dnsmasq (always together) in these times.
-Fuming a bit, I went with a brute-force approach:
+Fuming (angry) a bit, I went with a brute-force approach:
```
#!/bin/bash
@@ -59,7 +72,7 @@ while true; do
# Send alert email
(echo "$BODY"; ps -e -o %cpu,%mem,cmd --sort pcpu | tail) | mail -s "$SUBJECT" "$MAIL_TO"
- # Reboot the server
+ # Reboot the server - room for improvement
# systemctl stop bird
# systemctl start bird
$REBOOT_CMD
@@ -81,8 +94,8 @@ Specifically, the output of ps
37.4 0.0 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new,.bak --local-service --trust-anchor=.,20326,8,2,e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d
```
-confirmed the suspicion - although the "percentage" is a bit weird. From the
-manpage:
+confirmed the quick look at htop - although the "percentage" is a bit weird.
+From the manpage:
> Currently, it is the CPU time used divided by the time the process has been
> running (cputime/realtime ratio), expressed as a percentage.
@@ -96,45 +109,62 @@ debug symbols](https://wiki.debian.org/HowToGetABacktrace) - it doesn't for
dnsmasq (yet) in bookworm. Luckily, the nice people at labs.nic.cz provide a
dbgsym package in their Bird(2) Debian repository.
+Aside, the script above could benefit from some improvements, instead of
+rebooting, I could actually record and strace and perf when this happens, and
+only then reboot / restart the service.
+
Now, stracing dnsmasq (when "idle") reveals some recvmsg of type
`RTM_NEWROUTE`. I have *no idea* why dnsmasq would need that. But I already
*assume* the high CPU usage occurs when Bird exports lots of routes to the
-kernel.
+kernel. *I have no hecking clue why dnsmasq would need to use the CPU at the
+same time. If someone can enlighten me, **please** message me.*
Also, in journalctl, I see lots of the infamous `Kernel dropped some netlink
messages, will resync on next scan.` messages at times - the message apparently
nobody has a solution to, and even though there are mailing list posts telling
to sysctl `net.core.rmem_default`, I doesn't seem to yield a solution.
-[1] At least when I want to see the binaries function names. Kernel symbols
-seem to show up fine.
+I actually did a `perf record` of bird when the CPU usage was high, and I saw
+that `rt_event` was running 33% of the time. I don't know what to make of that.
So… step-by-step? I have both DN42 and clearnet bird running on this VPS, in
parallel. So maybe start by disabling one of these? Or, I have a better idea,
-keep it enabled, and disable all protocols! (`birdc disable \"*\"`).
-
-That helped, until midnight. When suddenly the CPU went up again. WTF? Let's
-have a look at `birdc s p`. All protocols up. Huh!?!?
+keep it enabled, and disable all protocols! (`birdc disable \"*\"`). That
+helped, until midnight. When suddenly the CPU went up again. WTF? Let's have a
+look at `birdc s p`. All protocols up. Huh!?!?
Let's investigate further:
* Oooo, log rotation happens at midnight
* Fuck, I specified the same log files for both bird's
-Well, log rotation. Which I added manually. This does a `birdc configure`
-afterward. Which means the protocols go up again, because I disabled them on
-the command line, not in the config.
+Well, log rotation. Which I added myself. This does a `birdc configure`
+afterward. Which means the protocols go up again, because I disabled them on the
+command line, not in the config.
Ungh. Okay, this is getting really ugly. `systemctl disable --now bird-clear`.
Now let's run this for a few days...
-That seem to have helped. I not decided to edit the clearnet config and disable
+That seem to have helped. I now decided to edit the clearnet config and disable
all "external" protocols (to IXPs), keeping Babel, RPKI, and IBGP enabled.
Immediately after starting bird-clearnet, the CPU usage went up again. To be
expected, for the initial sync. But it kept being high. So, I disabled the RPKI
protocols as well and... suddenly, the CPU usage is down???
-I actually did a `perf record` of bird when the CPU usage was high, and I saw
-that `rt_event` was running 33% of the time. I don't know what to make of that.
+(2025-01-09) So, now, the RPKI stuff... I saw that the connection to one of my
+RPKI servers wouldn't work (Fort Validator), and Fort complained about `Bad RTR
+version: 2`. The [Bird
+docs](https://bird.network.cz/?get_doc&v=20&f=bird-6.html) mention a `max
+version` option, and with that, the connection worked. And the CPU usage was
+down. *However*, this was not "the root issue". This option is only present
+since bird 2.16, which I only installed on 2025-01-07 [2]. And I had the CPU
+problems back in November last year! So this is "something completely
+different", it appears. I currently have the BGP sessions to virtual IXPs
+disabled, and will run it like this for a few days. The average load is 0.4.
+Still too high, compared to my other VPS.
+[1] At least when I want to see the binaries function names. Kernel symbols
+seem to show up fine. \
+[2] Which is pretty annoying, when you chase one bug, you change something in
+your config, and then something else breaks. That's on me.