How to investigate high CPU usage on FortiGate

This article describes how to analyze high CPU usage on a FortiGate.

Scope

FortiGate, FortiOS.

Solution

It is important to understand how CPU usage is measured:

CPU usage is a time-based measurement: it is the amount of time during which the CPU has not been idle, i.e. has been executing instructions, over a given interval.

The CPU is mainly used by three distinct spaces:

  • user space -> time spent running application daemons.
  • system -> time spent executing kernel instructions.
  • interrupts (irq/softirq) -> time spent handling interrupts.
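For illustration, these percentages can be reproduced from standard Linux CPU accounting: each state is a cumulative counter of jiffies, and usage over an interval is the per-state delta divided by the total delta. A minimal sketch with hypothetical counter snapshots (FortiOS reports the final percentages directly; this only shows the arithmetic behind them):

```python
# Each /proc/stat "cpu" line holds cumulative jiffies per state
# (user, nice, system, idle, iowait, irq, softirq, ...). Usage over an
# interval is the per-state delta divided by the total delta.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]

def parse(line):
    return [int(v) for v in line.split()[1:8]]

def cpu_states(before, after):
    deltas = [a - b for b, a in zip(parse(before), parse(after))]
    total = sum(deltas)
    return {f: round(100 * d / total) for f, d in zip(FIELDS, deltas)}

# Two hypothetical snapshots taken one second apart.
t0 = "cpu  1000 0 500 8000 0 100 50"
t1 = "cpu  1010 0 505 8090 0 102 50"
print(cpu_states(t0, t1))
```

A CPU that spends most of its delta in the idle counter reports low usage, which is why high idle percentages in the outputs below mean the unit is healthy.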

Starting an investigation into high CPU usage with ‘diagnose system top’ is not a good idea, as the top command only shows the CPU usage of user-space daemons and therefore does not provide the whole picture of the actual CPU usage.

Better options to start investigating are ‘get sys performance status’ and ‘diagnose sys mpstat <refresh seconds>’.

Both commands show the actual percentage usage for each space for each CPU.

get sys performance status
CPU states: 0% user 0% system 0% nice 99% idle 0% iowait 1% irq 0% softirq
CPU0 states: 0% user 0% system 0% nice 99% idle 0% iowait 1% irq 0% softirq
Memory: 2055760k total, 1093300k used (53.2%), 567356k free (27.6%), 395104k freeable (19.2%)
Average network usage: 243 / 23 kbps in 1 minute, 241 / 62 kbps in 10 minutes, 223 / 4 kbps in 30 minutes
Average sessions: 54 sessions in 1 minute, 48 sessions in 10 minutes, 44 sessions in 30 minutes
Average session setup rate: 0 sessions per second in last 1 minute, 0 sessions per second in last 10 minutes, 0 sessions per second in last 30 minutes
Virus caught: 0 total in 1 minute
IPS attacks blocked: 0 total in 1 minute
Uptime: 1 days, 0 hours, 55 minutes


diagnose sys mpstat 2
Gathering data, wait 2 sec, press any key to quit.
..0..1
TIME CPU %usr %nice %sys %iowait %irq %soft %steal %idle
11:42:42 AM all 0.00 0.00 0.00 0.00 0.50 0.00 0.00 99.50
0 0.00 0.00 0.00 0.00 0.50 0.00 0.00 99.50

Once the main type of usage has been identified, the focus can be moved to that specific space.

User space

At this point, ‘diagnose system top’ can be used to identify the application which is consuming all the CPU time and debug it with the specific application debug commands.

Additional information about a process (such as the last CPU it ran on, its status, syscalls, memory usage, etc.) can be gathered from the following /proc entries:

fnsysctl cat /proc/<pid>/status
fnsysctl cat /proc/<pid>/stat
fnsysctl cat /proc/<pid>/statm
fnsysctl cat /proc/<pid>/syscall
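As one example of what these files contain, the 'processor' field of /proc/&lt;pid&gt;/stat (field 39 per the Linux proc(5) man page) records the last CPU the process ran on. A sketch parsing a hypothetical, truncated stat line (the process name and values are made up for illustration):

```python
# Field 2 of /proc/<pid>/stat (comm) may contain spaces, so fields are
# counted from after the closing parenthesis.
def last_cpu(stat_line):
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); field 39 ("processor") is rest[36]
    return int(rest[36])

# Hypothetical, truncated /proc/<pid>/stat contents.
sample = ("1234 (ipsengine) S 1 1234 1234 0 -1 4194560 "
          + " ".join(["0"] * 29) + " 3 0 0 0 0")
print(last_cpu(sample))  # this process last ran on CPU 3
```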

System

If high CPU usage is seen in system space, CPU profiling can show what the most frequently called functions in the source code are, whether they are Linux kernel functions or Fortinet-specific system functions. This gives an idea of what is happening at the system level during a specific period of time. After identifying which CPU is experiencing high system usage, run the following commands:

diag sys profile cpumask <ID of the CPU to analyze>
diag sys profile start

Wait approximately 10 to 30 seconds.

diag sys profile stop
diag sys profile show order

‘diag sys profile show order’ will print something like this:

0xffffffff8025fb50: 454 vprintk_emit+0x190/0x4f0
0xffffffff80208b68: 345 poll_idle+0x18/0x30
0xffffffffa0009f73: 6 del_ip_session_expect_list+0x1283/0x1390
0xffffffff802bf1c8: 1 copy_pte_range+0x228/0x440
0xffffffff802e7558: 1 select_estimate_accuracy+0xe8/0x130
0xffffffff804f68e8: 1 __copy_skb_header+0x438/0x4e0
0xffffffff80602608: 1 __schedule+0x2b8/0x910
0xffffffffa0164feb: 1 np6_clean_rx+0xcb/0xe30
0xffffffffa080dc4b: 1 wcmod_uc_status_dump+0x311/0x156b

Note that poll_idle is the kernel function executed when no other function needs the CPU; it corresponds to the idle time described at the beginning.
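When comparing several profiling runs, it can help to rank the samples while excluding poll_idle, since those samples only represent idle time. A minimal sketch (the input lines are copied from the hypothetical output above; the line format is "&lt;address&gt;: &lt;count&gt; &lt;symbol&gt;"):

```python
# Rank profile samples by hit count, skipping idle samples.
def hottest(lines, n=3):
    samples = []
    for line in lines:
        _, rest = line.split(":", 1)
        count, symbol = rest.split()
        if symbol.startswith("poll_idle"):
            continue  # idle samples, not real load
        samples.append((int(count), symbol))
    return sorted(samples, reverse=True)[:n]

profile_output = [
    "0xffffffff8025fb50: 454 vprintk_emit+0x190/0x4f0",
    "0xffffffff80208b68: 345 poll_idle+0x18/0x30",
    "0xffffffffa0009f73: 6 del_ip_session_expect_list+0x1283/0x1390",
]
print(hottest(profile_output))
```

In this example, vprintk_emit (kernel console logging) dominates the non-idle samples, which would be the first lead to investigate.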

For further investigation, it will be necessary to involve Fortinet support.

IRQ / Interrupts

Interrupts (hardware or software) are special functions triggered when certain events occur. They cause the CPU to suspend its current operation and execute the handler of the raised interrupt. This means the CPU does not wait for the event to occur; instead, the interrupt signals the CPU when it is triggered and the interrupt-related code is executed (a real-life analogy is a doorbell: when the bell rings, the people inside stop what they are doing to check who is at the door).

On a firewall or network devices in general, the main source of interrupts is network packets received on interfaces.

High IRQ usage could indicate an L2 issue such as a broadcast storm, a high number of non-offloaded sessions, or simply a high session setup rate.

To investigate high softirq or irq, obtain useful information about session setup rate and offloaded vs non-offloaded sessions with ‘get sys performance status’.

get sys performance status
CPU states: 1% user 2% system 0% nice 87% idle 0% iowait 0% irq 10% softirq
CPU0 states: 1% user 3% system 0% nice 96% idle 0% iowait 0% irq 0% softirq
..
CPU55 states: 0% user 1% system 0% nice 85% idle 0% iowait 0% irq 14% softirq
Memory: 264290500k total, 46488320k used (17.6%), 217336196k free (82.2%), 465984k freeable (0.2%)
Average network usage: 86273109 / 86082338 kbps in 1 minute, 87085803 / 87039054 kbps in 10 minutes, 86707706 / 86756164 kbps in 30 minutes
Maximal network usage: 102850930 / 102515500 kbps in 1 minute, 102850930 / 102515500 kbps in 10 minutes, 102850930 / 102515500 kbps in 30 minutes
Average sessions: 7238388 sessions in 1 minute, 6294477 sessions in 10 minutes, 6278432 sessions in 30 minutes
Maximal sessions: 7581448 sessions in 1 minute, 7581448 sessions in 10 minutes, 7871998 sessions in 30 minutes
Average session setup rate: 22146 sessions per second in last 1 minute, 24011 sessions per second in last 10 minutes, 23829 sessions per second in last 30 minutes
Maximal session setup rate: 24399 sessions per second in last 1 minute, 28758 sessions per second in last 10 minutes, 30005 sessions per second in last 30 minutes
Average NPU sessions: 2702855 sessions in last 1 minute, 3782150 sessions in last 10 minutes, 3875431 sessions in last 30 minutes
Maximal NPU sessions: 3057107 sessions in last 1 minute, 4233725 sessions in last 10 minutes, 4250696 sessions in last 30 minutes
Average nTurbo sessions: 0 sessions in last 1 minute, 0 sessions in last 10 minutes, 0 sessions in last 30 minutes
Maximal nTurbo sessions: 0 sessions in last 1 minute, 0 sessions in last 10 minutes, 0 sessions in last 30 minutes
Virus caught: 0 total in 1 minute
IPS attacks blocked: 0 total in 1 minute
Uptime: 112 days, 10 hours, 0 minutes

Additionally, use the command ‘diagnose sys session stat’:

diagnose sys session stat
misc info: session_count=170 setup_rate=7 exp_count=0 clash=0
memory_tension_drop=0 ephemeral=0/4259840 removeable=0 extreme_low_mem=0
npu_session_count=0
nturbo_session_count=0
delete=2, flush=0, dev_down=29/348
session walkers: active=0, vf-138, dev-29, saddr-0, npu-0, wildcard-0
TCP sessions:
8 in ESTABLISHED state
96 in SYN_SENT state
firewall error stat:
error1=00000000
error2=00000000
error3=00000000
error4=00000000
tt=00000000
cont=00000000
ips_recv=00000000
policy_deny=00007ac8
av_recv=00000000
fqdn_count=000001f7
fqdn6_count=00000000
global: ses_limit=0 ses6_limit=0 rt_limit=0 rt6_limit=0

These outputs offer an approximation of the actual network statistics and of how much traffic is being offloaded.
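One useful derived figure is the offload ratio: the share of sessions handled by the NPU rather than the CPU. A low ratio means most traffic is processed in software, which drives up softirq time. A sketch using the 1-minute averages from the sample output above:

```python
# 1-minute averages from 'get sys performance status' (sample output above).
avg_sessions = 7238388
avg_npu_sessions = 2702855

# Share of sessions offloaded to the NPU; the remainder hits the CPU.
offload_ratio = avg_npu_sessions / avg_sessions
print(f"offloaded: {offload_ratio:.0%}")  # roughly 37% of sessions
```

Here only about a third of sessions are offloaded, consistent with the elevated softirq percentages shown on several CPUs.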

Additionally, a system profile (as described above for high system usage) can be used to investigate offloading-related functions, for example:

0xffffffffa0164feb: 1 np6_clean_rx+0xcb/0xe30

To gain an idea of which interrupts are raised and on which CPU, run the following:

diagnose hardware sysinfo interrupts or fnsysctl cat /proc/interrupts

fnsysctl cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
29: 0 0 0 0 0 0 0 0 GIC Edge arch_timer
30: 8539503 8539503 8539501 8539503 8539503 8539503 8539503 8539503 GIC Edge arch_timer
37: 0 0 0 0 0 0 0 0 GIC Edge watchdog
38: 10043 0 0 0 0 0 0 0 GIC Edge IICBTIS
39: 50213 0 0 0 0 0 0 0 GIC Edge IICBTIA
40: 416 0 0 0 0 0 0 0 GIC Edge IICBTIS
41: 694 0 0 0 0 0 0 0 GIC Edge IICBTIA
42: 176 0 0 0 0 0 0 0 GIC Level serial
47: 0 0 0 0 0 0 0 0 GIC Level arm-smmu global fault
48: 431429 0 0 0 0 0 0 0 GIC Level linux-kernel-bde
80: 2621820 0 0 0 0 0 0 0 GIC Edge np6xlite-hif0
81: 0 2752922 0 0 0 0 0 0 GIC Edge np6xlite-hif1
82: 0 0 1565775 0 0 0 0 0 GIC Edge np6xlite-hif2
83: 0 0 0 1763550 0 0 0 0 GIC Edge np6xlite-hif3
84: 0 0 0 0 5551906 0 0 0 GIC Edge np6xlite-hif4
85: 0 0 0 0 0 1312881 0 0 GIC Edge np6xlite-hif5
86: 0 0 0 0 0 0 2354701 0 GIC Edge np6xlite-hif6
87: 0 0 0 0 0 0 0 2710748 GIC Edge np6xlite-hif7
88: 0 0 0 0 0 0 0 0 GIC Edge np6xlite-err
97: 0 190392 0 0 0 0 0 0 GIC Edge np6xlite_0-nturbo-tx-rx0
...
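The counts above show how interrupt handling is distributed: each np6xlite-hif queue is pinned to one CPU, so a single hot queue can saturate a single core. A sketch that parses this kind of output to find which CPU services each interrupt source (simplified: it assumes single-word source names and uses a hypothetical two-CPU excerpt of the output above):

```python
# Map each interrupt source to the CPU that has serviced it most.
# Rows have the form "<irq>: <count per CPU>... <chip> <type> <name>".
def parse_interrupts(text):
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())            # header: CPU0 CPU1 ...
    busiest = {}
    for line in lines[1:]:
        _, rest = line.split(":", 1)
        fields = rest.split()
        counts = [int(c) for c in fields[:ncpus]]
        name = fields[-1]                    # assumes single-word names
        busiest[name] = counts.index(max(counts))
    return busiest

sample = """CPU0 CPU1
80: 2621820 0 GIC Edge np6xlite-hif0
81: 0 2752922 GIC Edge np6xlite-hif1"""
print(parse_interrupts(sample))  # {'np6xlite-hif0': 0, 'np6xlite-hif1': 1}
```

Comparing two snapshots taken a few seconds apart (rather than raw totals) shows which sources are currently generating interrupts.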