---
layout: post
title: Bird CPU usage
date: 2025-01-07 19:06 +0100
categories: tech
lang: en
---

**Preface**: This is kind of a "bug hunt log". I posted extensively on Mastodon
about it, but wanted to keep the progress in writing as well. This might read
like random rambling in some parts, because I edited and rearranged the
paragraphs of this post a few times. Sorry for that.

**Preface 2**: Starting point: In September 2024 I upgraded to bird2
`2.15.1-1~bpo12+1` - probably a self-built backport of mine. This I "had to
do", since I wanted the Babel metrics measurement, which is not present in
Debian stable (`2.0.12-7`). Yeah, I won't complain. For 99 % of the stuff I do,
Debian stable is *perfect*. I want a server without regularly breaking things,
after all. Only recently (2025-01-07) did I add the bird / network.cz repos and
install bird2 from there.

Several times already, I noticed this in my Munin monitoring:

{% image
img="https://pics.uvokchee.de/upload/2025/01/07/20250107180512-c1453895.png"
alt="RRD tool graphic showing a high CPU usage" %}

I found it strange, but had no time to inspect further.

Recently, I tried to investigate what was happening on the server. htop showed
high CPU usage for both bird and dnsmasq (always together) during these times.

Fuming a bit, I went with a brute-force approach:

```
#!/bin/bash

# Configuration
THRESHOLD=1.0
# 3 hours
DURATION=$((3*60*60))
MAIL_TO="lolnope"
SUBJECT="High Load Average Alert"
BODY="The load average has been above ${THRESHOLD} for more than 3 hours."
REBOOT_CMD="/sbin/reboot"

# Function to check the load average
check_load() {
    # 15 min loadavg
    loadavg=$(awk '{print $3}' /proc/loadavg)
    echo "$(date): Current Load Average: $loadavg"

    if (( $(echo "$loadavg > $THRESHOLD" | bc -l) )); then
        echo "$(date): Load average is above threshold."
        return 0
    else
        echo "$(date): Load average is below threshold."
        return 1
    fi
}

# Monitor the load average
start_time=$(date +%s)
while true; do
    if check_load; then
        current_time=$(date +%s)
        elapsed_time=$((current_time - start_time))

        if [ "$elapsed_time" -gt "$DURATION" ]; then
            echo "$(date): Load average has been above threshold for more than 3 hours."

            # Send alert email
            (echo "$BODY"; ps -e -o %cpu,%mem,cmd --sort pcpu | tail) | mail -s "$SUBJECT" "$MAIL_TO"

            # Reboot the server - room for improvement
#            systemctl stop bird
#            systemctl start bird
            $REBOOT_CMD
            break
        fi
    else
        start_time=$(date +%s)
    fi
    sleep 300 # Check every 5 minutes
done

```

Specifically, the output of ps:

```
22.7  2.7 /usr/sbin/bird -f -u bird -g bird
33.3  0.1 ps -e -o %cpu,%mem,cmd --sort pcpu
37.4  0.0 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new,.bak --local-service --trust-anchor=.,20326,8,2,e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d
```

confirmed the quick look at htop - although the "percentage" is a bit weird.
From the manpage:

> Currently, it is the CPU time used divided by the time the process has been
> running (cputime/realtime ratio), expressed as a percentage.

(So if the process runs "long enough" and only starts misbehaving after a year,
it won't show up?).
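
To see what a process is doing *right now* rather than averaged over its
lifetime, something like pidstat (from the sysstat package) samples over an
interval instead - a minimal sketch:

```
# %CPU in ps is cputime/realtime since process start, so a long-running
# process that only recently went haywire can still show a low average.
# pidstat samples actual usage instead: 5-second intervals, 3 samples.
pidstat -p "$(pidof bird)" 5 3
```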

I asked an LLM what to do, in addition to strace, and it suggested perf.
Unfortunately, this requires debug symbols [1]. And while [Debian does provide
debug symbols](https://wiki.debian.org/HowToGetABacktrace), it doesn't (yet)
for dnsmasq in bookworm. Luckily, the nice people at labs.nic.cz provide a
dbgsym package in their Bird(2) Debian repository.
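
In case someone wants to follow along, a rough sketch of getting symbols on
bookworm (package name assumed; the bird dbgsym comes from the labs.nic.cz
repo, not from Debian's archive):

```
# Debian's debug symbol archive (see the wiki page linked above);
# note it does NOT contain dnsmasq-dbgsym for bookworm
echo 'deb http://deb.debian.org/debian-debug bookworm-debug main' \
    | sudo tee /etc/apt/sources.list.d/debug.list
sudo apt update
# package name assumed to follow the usual <pkg>-dbgsym scheme
sudo apt install bird2-dbgsym
```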

As an aside, the script above could benefit from some improvements: instead of
rebooting, I could actually record an strace and a perf profile when this
happens, and only then reboot / restart the service.
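
Something like this in place of `$REBOOT_CMD` - a minimal sketch, untested:

```
# capture 30 seconds of syscall statistics from bird before acting
# (strace -c writes a summary table when it detaches on SIGTERM)
timeout 30 strace -c -p "$(pidof bird)" -o /root/bird-strace.txt
# then restart just the service instead of the whole box
systemctl restart bird
```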

Now, stracing dnsmasq (when "idle") reveals some recvmsg of type
`RTM_NEWROUTE`.  I have *no idea* why dnsmasq would need that. But I already
*assume* the high CPU usage occurs when Bird exports lots of routes to the
kernel. *I have no hecking clue why dnsmasq would need to use the CPU at the
same time. If someone can enlighten me, **please** message me.*
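
For the record, this is roughly how I watched it (needs root; recent strace
versions decode netlink payloads like `RTM_NEWROUTE`):

```
# attach to the running dnsmasq and only show recvmsg calls
strace -f -e trace=recvmsg -p "$(pidof dnsmasq)"
```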

Also, in journalctl, I see lots of the infamous `Kernel dropped some netlink
messages, will resync on next scan.` messages at times - the message apparently
nobody has a solution to, and even though there are mailing list posts
suggesting to raise the sysctl `net.core.rmem_default`, that doesn't seem to
yield a solution.
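
For completeness, this is what those posts suggest - the value here is just an
example, not a recommendation, and it didn't help in my case:

```
# enlarge the default socket receive buffer (bytes)
sysctl -w net.core.rmem_default=26214400
```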

I actually did a `perf record` of bird when the CPU usage was high, and I saw
that `rt_event` was running 33% of the time. I don't know what to make of that.
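
Roughly what I ran, for reference (requires the dbgsym package mentioned above
to get useful symbol names):

```
# sample bird's call stacks for a minute, then browse the result
perf record -g -p "$(pidof bird)" -- sleep 60
perf report
```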

So… step-by-step? I have both a DN42 and a clearnet bird running on this VPS,
in parallel. So maybe start by disabling one of these? Or, I have a better
idea: keep it enabled, and disable all protocols! (`birdc disable \"*\"`). That
helped - until midnight, when suddenly the CPU went up again. WTF? Let's have a
look at `birdc s p`. All protocols up. Huh!?!?
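
(The backslashes are there because birdc itself wants to see the quotes around
the wildcard pattern when called from a shell:)

```
# runtime-only: disables every protocol until the next `birdc configure`
birdc disable \"*\"
# `birdc s p` is the short form of this
birdc show protocols
```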

Let's investigate further:

* Oooo, log rotation happens at midnight
* Fuck, I specified the same log files for both birds

Well, log rotation. Which I added myself. It does a `birdc configure`
afterward. Which means the protocols come up again, because I disabled them on
the command line, not in the config.
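
My logrotate snippet looks roughly like this (reconstructed from memory, so
treat it as a sketch):

```
/var/log/bird.log {
    daily
    rotate 7
    compress
    postrotate
        # re-reads the config - and thereby re-enables protocols that
        # were only disabled at runtime via birdc
        /usr/sbin/birdc configure > /dev/null
    endscript
}
```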

Ungh. Okay, this is getting really ugly. `systemctl disable --now bird-clear`.
Now let's run this for a few days...

That seems to have helped. I then decided to edit the clearnet config and
disable all "external" protocols (to IXPs), keeping Babel, RPKI, and iBGP
enabled. Immediately after starting bird-clearnet, the CPU usage went up again.
To be expected, for the initial sync. But it stayed high. So I disabled the
RPKI protocols as well and... suddenly, the CPU usage is down???

(2025-01-09) So, now, the RPKI stuff... I saw that the connection to one of my
RPKI servers (Fort Validator) wouldn't work, and Fort complained about `Bad RTR
version: 2`. The [Bird
docs](https://bird.network.cz/?get_doc&v=20&f=bird-6.html) mention a `max
version` option, and with that, the connection worked. And the CPU usage was
down. *However*, this was not "the root issue". This option has only been
present since bird 2.16, which I only installed on 2025-01-07 [2]. And I had
the CPU problems back in November last year! So this is "something completely
different", it appears. I currently have the BGP sessions to virtual IXPs
disabled, and will run it like this for a few days. The average load is 0.4.
Still too high, compared to my other VPS.
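
For reference, the relevant stanza now looks roughly like this (names, tables,
and timers are placeholders, not my actual config; `max version` needs
bird ≥ 2.16):

```
protocol rpki rpki_fort {
    remote "rtr.example.org" port 323;  # placeholder for my Fort validator
    max version 1;                      # Fort rejected RTR version 2
    roa4 { table roa_v4; };
    roa6 { table roa_v6; };
    retry keep 90;
    refresh keep 900;
    expire keep 172800;
}
```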

[1] At least when I want to see the binaries' function names. Kernel symbols
seem to show up fine. \
[2] Which is pretty annoying: you chase one bug, you change something in your
config, and then something else breaks. That's on me.