1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
|
---
layout: post
title: Bird CPU usage
date: 2025-01-07 19:06 +0100
categories: tech
lang: en
---
Several times already, I noticed this in my Munin monitoring:
{% image
img="https://pics.uvokchee.de/upload/2025/01/07/20250107180512-c1453895.png"
alt="RRD tool graphic showing a high CPU usage" %}
I found it strange, but had no time to inspect further.
Recently, I tried to investigate what happens on the server there.
htop showed high CPU usage for both bird and dnsmasq (always
together) in these times.
Fuming a bit, I went with a brute-force approach:
```
#!/bin/bash
# Configuration
THRESHOLD=1.0
# 3 hours
DURATION=$((3*80*80))
MAIL_TO="lolnope"
SUBJECT="High Load Average Alert"
BODY="The load average has been above ${THRESHOLD} for more than 3 hours."
REBOOT_CMD="/sbin/reboot"
# Function to check the load average
check_load() {
# 15 min loadavg
loadavg=$(awk '{print $3}' /proc/loadavg)
echo "$(date): Current Load Average: $loadavg"
if (( $(echo "$loadavg > $THRESHOLD" | bc -l) )); then
echo "$(date): Load average is above threshold."
return 0
else
echo "$(date): Load average is below threshold."
return 1
fi
}
# Monitor the load average
start_time=$(date +%s)
while true; do
if check_load; then
current_time=$(date +%s)
elapsed_time=$((current_time - start_time))
if [ "$elapsed_time" -gt "$DURATION" ]; then
echo "$(date): Load average has been above threshold for more than 3 hours."
# Send alert email
(echo "$BODY"; ps -e -o %cpu,%mem,cmd --sort pcpu | tail) | mail -s "$SUBJECT" "$MAIL_TO"
# Reboot the server
# systemctl stop bird
# systemctl start bird
$REBOOT_CMD
break
fi
else
start_time=$(date +%s)
fi
sleep 300 # Check every 5 minutes
done
```
Specifically, the output of ps
```
22.7 2.7 /usr/sbin/bird -f -u bird -g bird
33.3 0.1 ps -e -o %cpu,%mem,cmd --sort pcpu
37.4 0.0 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new,.bak --local-service --trust-anchor=.,20326,8,2,e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d
```
confirmed the suspicion - although the "percentage" is a bit weird. From the
manpage:
> Currently, it is the CPU time used divided by the time the process has been
> running (cputime/realtime ratio), expressed as a percentage.
(So if the process runs "long enough" and only starts misbehaving after a year,
it won't show up?).
I asked an LLM what to do, in addition to strace, and it suggested perf.
Unfortunately, this requires debug symbols [1]. [And while Debian does provide
debug symbols](https://wiki.debian.org/HowToGetABacktrace) - it doesn't for
dnsmasq (yet) in bookworm. Luckily, the nice people at labs.nic.cz provide a
dbgsym package in their Bird(2) Debian repository.
Now, stracing dnsmasq (when "idle") reveals some recvmsg of type `RTM_NEWROUTE`.
I have *no idea* why dnsmasq would need that. But I already *assume* the high
CPU usage occurs when Bird exports lots of routes to the kernel.
Also, in journalctl, I see lots of the infamous `Kernel dropped some netlink
messages, will resync on next scan.` messages at times - the message apparently
nobody has a solution to, and even though there are mailing list posts telling
to sysctl `net.core.rmem_default`, I doesn't seem to yield a solution.
[1] At least when I want to see the binaries function names.
Kernel symbols seem to show up fine.
|