_drafts/bird-cpu-usage.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140

---
layout: post
title: Bird CPU usage
date: 2025-01-07 19:06 +0100
categories: tech
lang: en
---

Several times already, I noticed this in my Munin monitoring:

{% image
img="https://pics.uvokchee.de/upload/2025/01/07/20250107180512-c1453895.png"
alt="RRD tool graphic showing a high CPU usage" %}

I found it strange, but had no time to inspect further.

Recently, I tried to investigate what happens on the server there.  htop showed
high CPU usage for both bird and dnsmasq (always together) in these times.

Fuming a bit, I went with a brute-force approach:

```
#!/bin/bash

# Configuration
THRESHOLD=1.0
# 3 hours
DURATION=$((3*80*80))
MAIL_TO="lolnope"
SUBJECT="High Load Average Alert"
BODY="The load average has been above ${THRESHOLD} for more than 3 hours."
REBOOT_CMD="/sbin/reboot"

# Function to check the load average
check_load() {
    # 15 min loadavg
    loadavg=$(awk '{print $3}' /proc/loadavg)
    echo "$(date): Current Load Average: $loadavg"

    if (( $(echo "$loadavg > $THRESHOLD" | bc -l) )); then
        echo "$(date): Load average is above threshold."
        return 0
    else
        echo "$(date): Load average is below threshold."
        return 1
    fi
}

# Monitor the load average
start_time=$(date +%s)
while true; do
    if check_load; then
        current_time=$(date +%s)
        elapsed_time=$((current_time - start_time))

        if [ "$elapsed_time" -gt "$DURATION" ]; then
            echo "$(date): Load average has been above threshold for more than 3 hours."

            # Send alert email
            (echo "$BODY"; ps -e -o %cpu,%mem,cmd --sort pcpu | tail) | mail -s "$SUBJECT" "$MAIL_TO"

            # Reboot the server
#            systemctl stop bird
#            systemctl start bird
            $REBOOT_CMD
            break
        fi
    else
        start_time=$(date +%s)
    fi
    sleep 300 # Check every 5 minutes
done

```

Specifically, the output of ps

```
22.7  2.7 /usr/sbin/bird -f -u bird -g bird
33.3  0.1 ps -e -o %cpu,%mem,cmd --sort pcpu
37.4  0.0 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new,.bak --local-service --trust-anchor=.,20326,8,2,e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d
```

confirmed the suspicion - although the "percentage" is a bit weird. From the
manpage:

> Currently, it is the CPU time used divided by the time the process has been
> running (cputime/realtime ratio), expressed as a percentage.

(So if the process runs "long enough" and only starts misbehaving after a year,
it won't show up?).

I asked an LLM what to do, in addition to strace, and it suggested perf.
Unfortunately, this requires debug symbols [1]. [And while Debian does provide
debug symbols](https://wiki.debian.org/HowToGetABacktrace) - it doesn't for
dnsmasq (yet) in bookworm.  Luckily, the nice people at labs.nic.cz provide a
dbgsym package in their Bird(2) Debian repository.

Now, stracing dnsmasq (when "idle") reveals some recvmsg of type
`RTM_NEWROUTE`.  I have *no idea* why dnsmasq would need that. But I already
*assume* the high CPU usage occurs when Bird exports lots of routes to the
kernel.

Also, in journalctl, I see lots of the infamous `Kernel dropped some netlink
messages, will resync on next scan.` messages at times - the message apparently
nobody has a solution to, and even though there are mailing list posts telling
to sysctl `net.core.rmem_default`, I doesn't seem to yield a solution.

[1] At least when I want to see the binaries function names.  Kernel symbols
seem to show up fine.

So… step-by-step? I have both DN42 and clearnet bird running on this VPS, in
parallel. So maybe start by disabling one of these? Or, I have a better idea,
keep it enabled, and disable all protocols! (`birdc disable \"*\"`).

That helped, until midnight. When suddenly the CPU went up again. WTF?  Let's
have a look at `birdc s p`. All protocols up. Huh!?!?

Let's investigate further:

* Oooo, log rotation happens at midnight
* Fuck, I specified the same log files for both bird's

Well, log rotation. Which I added manually. This does a `birdc configure`
afterward. Which means the protocols go up again, because I disabled them on
the command line, not in the config.

Ungh. Okay, this is getting really ugly. `systemctl disable --now bird-clear`.
Now let's run this for a few days...

That seem to have helped. I not decided to edit the clearnet config and disable
all "external" protocols (to IXPs), keeping Babel, RPKI, and IBGP enabled.
Immediately after starting bird-clearnet, the CPU usage went up again. To be
expected, for the initial sync. But it kept being high. So, I disabled the RPKI
protocols as well and... suddenly, the CPU usage is down???

I actually did a `perf record` of bird when the CPU usage was high, and I saw
that `rt_event` was running 33% of the time. I don't know what to make of that.