This article describes how to analyze high CPU usage on a FortiGate.
Scope
FortiGate, FortiOS.
Solution
It is important to understand how CPU usage is measured:
CPU usage is a time-based measurement: it is the proportion of time during which the CPU has been executing instructions rather than sitting idle.
CPU time is mainly consumed by three distinct spaces (a short sketch of how these percentages are derived follows the list below):
- user space -> Time spent running application daemons.
- system -> Time spent executing kernel instructions.
- interrupts (irq/softirq) -> Time spent handling interrupts.
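These percentages are derived from per-CPU time counters: a sampling tool takes two snapshots of the counters and divides each space's delta by the total delta. As an illustration only (this is not a FortiGate command), the following Python sketch computes the same kind of percentages from the standard Linux /proc/stat counters; the field order and the 2-second interval are assumptions.

# Illustration only: derive per-space CPU percentages from two snapshots
# of /proc/stat counters (standard Linux field order assumed:
# user nice system idle iowait irq softirq steal).
import time

FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]

def read_cpu_counters(path="/proc/stat"):
    counters = {}
    with open(path) as f:
        for line in f:
            if line.startswith("cpu"):
                name, *values = line.split()
                counters[name] = [int(v) for v in values[:len(FIELDS)]]
    return counters

before = read_cpu_counters()
time.sleep(2)                      # sampling interval, arbitrary choice
after = read_cpu_counters()
for cpu, new in after.items():
    deltas = [n - o for n, o in zip(new, before[cpu])]
    total = sum(deltas) or 1
    print(cpu, {f: round(100.0 * d / total, 2) for f, d in zip(FIELDS, deltas)})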
Starting a high CPU investigation with 'diagnose system top' is not a good idea, as the top command only shows the CPU usage of user space daemons and therefore does not provide the whole picture of actual CPU usage.
Better options to start investigating are ‘get sys performance status’ and ‘diagnose sys mpstat <refresh seconds>’.
Both commands show the percentage of time spent in each space, for every CPU.
get sys performance status
CPU states: 0% user 0% system 0% nice 99% idle 0% iowait 1% irq 0% softirq
CPU0 states: 0% user 0% system 0% nice 99% idle 0% iowait 1% irq 0% softirq
Memory: 2055760k total, 1093300k used (53.2%), 567356k free (27.6%), 395104k freeable (19.2%)
Average network usage: 243 / 23 kbps in 1 minute, 241 / 62 kbps in 10 minutes, 223 / 4 kbps in 30 minutes
Average sessions: 54 sessions in 1 minute, 48 sessions in 10 minutes, 44 sessions in 30 minutes
Average session setup rate: 0 sessions per second in last 1 minute, 0 sessions per second in last 10 minutes, 0 sessions per second in last 30 minutes
Virus caught: 0 total in 1 minute
IPS attacks blocked: 0 total in 1 minute
Uptime: 1 days, 0 hours, 55 minutes

diagnose sys mpstat 2
Gathering data, wait 2 sec, press any key to quit.
..0..1
TIME          CPU   %usr  %nice   %sys  %iowait   %irq  %soft  %steal   %idle
11:42:42 AM   all   0.00   0.00   0.00     0.00   0.50   0.00    0.00   99.50
              0     0.00   0.00   0.00     0.00   0.50   0.00    0.00   99.50
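On platforms with many CPUs, it can help to post-process a saved copy of this output and flag only the busy cores. A minimal Python sketch, assuming the per-CPU line format shown above and an arbitrary file name and 20% idle threshold:

# Flag CPUs that are less than 20% idle in a saved capture of
# 'get sys performance status' (file name and threshold are assumptions).
import re

IDLE_THRESHOLD = 20

with open("perf_status.txt") as f:
    for line in f:
        m = re.match(r"(CPU\d+) states:.*?(\d+)% idle", line)
        if m and int(m.group(2)) < IDLE_THRESHOLD:
            print("busy:", line.strip())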
Once the main type of usage has been identified, the focus can move to that specific space.
User space
At this point, ‘diagnose system top’ can be used to identify the application which is consuming all the CPU time and debug it with the specific application debug commands.
Additional information about a process (such as the last CPU it ran on, its state, syscalls, and memory usage) can be gathered from the following /proc entries:
fnsysctl cat /proc/<pid>/status
fnsysctl cat /proc/<pid>/stat
fnsysctl cat /proc/<pid>/statm
fnsysctl cat /proc/<pid>/syscall
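For reference, the stat file is a single line of space-separated fields following the standard Linux proc(5) layout: fields 14 and 15 hold the user and kernel CPU time consumed by the process (in clock ticks) and field 39 holds the CPU it last ran on. A minimal Python parsing sketch (reading a local /proc entry here is only for illustration):

# Decode a /proc/<pid>/stat line (standard Linux proc(5) field layout).
# The comm field may contain spaces, so split around the closing parenthesis.
def parse_stat(line):
    head, _, tail = line.rpartition(")")
    fields = tail.split()
    return {
        "pid": int(head.split("(")[0]),
        "comm": head.split("(", 1)[1],
        "state": fields[0],              # field 3
        "utime_ticks": int(fields[11]),  # field 14: user-mode CPU time
        "stime_ticks": int(fields[12]),  # field 15: kernel-mode CPU time
        "last_cpu": int(fields[36]),     # field 39: CPU last executed on
    }

with open("/proc/self/stat") as f:
    print(parse_stat(f.read()))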
System
If high CPU usage is seen in the system space, CPU profiling can show the most frequently called functions in the source code, whether they are Linux kernel functions or Fortinet-specific system functions. This gives an idea of what is going on at the system level during a specific window of time. After identifying which CPU is experiencing high system usage, run the following commands:
diag sys profile cpumask <ID of the CPU to analyze>
diag sys profile start
Wait approximately 10 to 30 seconds.
diag sys profile stop
diag sys profile show order
'diag sys profile show order' will print something like this:
0xffffffff8025fb50:   454  vprintk_emit+0x190/0x4f0
0xffffffff80208b68:   345  poll_idle+0x18/0x30
0xffffffffa0009f73:     6  del_ip_session_expect_list+0x1283/0x1390
0xffffffff802bf1c8:     1  copy_pte_range+0x228/0x440
0xffffffff802e7558:     1  select_estimate_accuracy+0xe8/0x130
0xffffffff804f68e8:     1  __copy_skb_header+0x438/0x4e0
0xffffffff80602608:     1  __schedule+0x2b8/0x910
0xffffffffa0164feb:     1  np6_clean_rx+0xcb/0xe30
0xffffffffa080dc4b:     1  wcmod_uc_status_dump+0x311/0x156b
Note that poll_idle is the kernel function executed when the CPU has nothing else to run; as explained at the beginning, this time counts as idle rather than actual CPU usage.
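When the sample list is long, it can be easier to read each function's share of the non-idle samples. The following Python sketch assumes the 'show order' output has been saved to a file in the format shown above (the file name is an assumption) and treats poll_idle samples as idle time:

# Summarize profiler samples per function; poll_idle is reported separately
# because it represents idle time (see note above).
import re
from collections import Counter

samples = Counter()
with open("profile.txt") as f:
    for line in f:
        m = re.match(r"0x[0-9a-f]+:\s+(\d+)\s+(\S+?)\+", line)
        if m:
            samples[m.group(2)] += int(m.group(1))

print("idle samples (poll_idle):", samples.pop("poll_idle", 0))
total = sum(samples.values()) or 1
for fn, count in samples.most_common(10):
    print(f"{fn}: {count} samples ({100.0 * count / total:.1f}% of non-idle)")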
For further investigation, it will be necessary to involve Fortinet support.
IRQ / Interrupts
Interrupts (hardware or software) are special functions that are triggered when certain events occur. They tell the CPU to interrupt its current operation and execute the handler for the raised interrupt. This means the CPU does not wait for the event to occur; instead, the interrupt signals the CPU when it is triggered and the related interrupt handler code is executed (a real-life analogy is a doorbell: when the doorbell rings, the people inside stop what they are doing to check who is at the door).
On a firewall or network devices in general, the main source of interrupts is network packets received on interfaces.
High IRQ usage could indicate an L2 issue such as a broadcast storm, a high number of non-offloaded sessions, or simply a high session setup rate.
To investigate high softirq or irq, obtain useful information about session setup rate and offloaded vs non-offloaded sessions with ‘get sys performance status’.
get sys performance status
CPU states: 1% user 2% system 0% nice 87% idle 0% iowait 0% irq 10% softirq
CPU0 states: 1% user 3% system 0% nice 96% idle 0% iowait 0% irq 0% softirq
..
CPU55 states: 0% user 1% system 0% nice 85% idle 0% iowait 0% irq 14% softirq
Memory: 264290500k total, 46488320k used (17.6%), 217336196k free (82.2%), 465984k freeable (0.2%)
Average network usage: 86273109 / 86082338 kbps in 1 minute, 87085803 / 87039054 kbps in 10 minutes, 86707706 / 86756164 kbps in 30 minutes
Maximal network usage: 102850930 / 102515500 kbps in 1 minute, 102850930 / 102515500 kbps in 10 minutes, 102850930 / 102515500 kbps in 30 minutes
Average sessions: 7238388 sessions in 1 minute, 6294477 sessions in 10 minutes, 6278432 sessions in 30 minutes
Maximal sessions: 7581448 sessions in 1 minute, 7581448 sessions in 10 minutes, 7871998 sessions in 30 minutes
Average session setup rate: 22146 sessions per second in last 1 minute, 24011 sessions per second in last 10 minutes, 23829 sessions per second in last 30 minutes
Maximal session setup rate: 24399 sessions per second in last 1 minute, 28758 sessions per second in last 10 minutes, 30005 sessions per second in last 30 minutes
Average NPU sessions: 2702855 sessions in last 1 minute, 3782150 sessions in last 10 minutes, 3875431 sessions in last 30 minutes
Maximal NPU sessions: 3057107 sessions in last 1 minute, 4233725 sessions in last 10 minutes, 4250696 sessions in last 30 minutes
Average nTurbo sessions: 0 sessions in last 1 minute, 0 sessions in last 10 minutes, 0 sessions in last 30 minutes
Maximal nTurbo sessions: 0 sessions in last 1 minute, 0 sessions in last 10 minutes, 0 sessions in last 30 minutes
Virus caught: 0 total in 1 minute
IPS attacks blocked: 0 total in 1 minute
Uptime: 112 days, 10 hours, 0 minutes
Additionally, use the command ‘diagnose sys session stat’:
diagnose sys session stat
misc info: session_count=170 setup_rate=7 exp_count=0 clash=0
    memory_tension_drop=0 ephemeral=0/4259840 removeable=0 extreme_low_mem=0
    npu_session_count=0
    nturbo_session_count=0
delete=2, flush=0, dev_down=29/348
session walkers: active=0, vf-138, dev-29, saddr-0, npu-0, wildcard-0
TCP sessions:
     8 in ESTABLISHED state
     96 in SYN_SENT state
firewall error stat:
error1=00000000
error2=00000000
error3=00000000
error4=00000000
tt=00000000
cont=00000000
ips_recv=00000000
policy_deny=00007ac8
av_recv=00000000
fqdn_count=000001f7
fqdn6_count=00000000
global: ses_limit=0 ses6_limit=0 rt_limit=0 rt6_limit=0
These outputs offer an approximation of the actual network load and of how much traffic is being offloaded.
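To put the offloading counters into perspective, the share of hardware-offloaded sessions can be calculated directly from the values above (npu_session_count plus nturbo_session_count, divided by session_count). A tiny arithmetic sketch in Python using the example counters shown earlier:

# Ratio of offloaded sessions, using the counters from the
# 'diagnose sys session stat' example above.
session_count = 170
npu_session_count = 0
nturbo_session_count = 0

offloaded = npu_session_count + nturbo_session_count
ratio = 100.0 * offloaded / session_count if session_count else 0.0
print(f"{offloaded}/{session_count} sessions offloaded ({ratio:.1f}%)")
# A low ratio combined with high softirq usage suggests traffic is being
# processed by the CPU instead of the NPU.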
Additionally, a system profile (as described in the System section above) can be used to investigate offloading-related kernel functions, for example:
0xffffffffa0164feb: 1 np6_clean_rx+0xcb/0xe30
To gain an idea of which interrupts are raised and on which CPU, run the following:
diagnose hardware sysinfo interrupts
or:
fnsysctl cat /proc/interrupts

fnsysctl cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
 29:           0          0          0          0          0          0          0          0   GIC Edge   arch_timer
 30:     8539503    8539503    8539501    8539503    8539503    8539503    8539503    8539503   GIC Edge   arch_timer
 37:           0          0          0          0          0          0          0          0   GIC Edge   watchdog
 38:       10043          0          0          0          0          0          0          0   GIC Edge   IICBTIS
 39:       50213          0          0          0          0          0          0          0   GIC Edge   IICBTIA
 40:         416          0          0          0          0          0          0          0   GIC Edge   IICBTIS
 41:         694          0          0          0          0          0          0          0   GIC Edge   IICBTIA
 42:         176          0          0          0          0          0          0          0   GIC Level  serial
 47:           0          0          0          0          0          0          0          0   GIC Level  arm-smmu global fault
 48:      431429          0          0          0          0          0          0          0   GIC Level  linux-kernel-bde
 80:     2621820          0          0          0          0          0          0          0   GIC Edge   np6xlite-hif0
 81:           0    2752922          0          0          0          0          0          0   GIC Edge   np6xlite-hif1
 82:           0          0    1565775          0          0          0          0          0   GIC Edge   np6xlite-hif2
 83:           0          0          0    1763550          0          0          0          0   GIC Edge   np6xlite-hif3
 84:           0          0          0          0    5551906          0          0          0   GIC Edge   np6xlite-hif4
 85:           0          0          0          0          0    1312881          0          0   GIC Edge   np6xlite-hif5
 86:           0          0          0          0          0          0    2354701          0   GIC Edge   np6xlite-hif6
 87:           0          0          0          0          0          0          0    2710748   GIC Edge   np6xlite-hif7
 88:           0          0          0          0          0          0          0          0   GIC Edge   np6xlite-err
 97:           0     190392          0          0          0          0          0          0   GIC Edge   np6xlite_0-nturbo-tx-rx0
...
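Since this table can be wide and long, summing each column shows at a glance which CPU services the most interrupts and which sources dominate. A minimal Python sketch, assuming the output above has been saved to a file containing only the /proc/interrupts table (standard Linux layout: a header row of CPU names, then one row per interrupt source; the file name is an assumption):

# Per-CPU interrupt totals and top interrupt sources from a saved copy of
# /proc/interrupts.
with open("interrupts.txt") as f:
    cpus = f.readline().split()           # header row: CPU0 CPU1 ...
    per_cpu = [0] * len(cpus)
    per_source = {}
    for line in f:
        parts = line.split()
        if len(parts) < len(cpus) + 2:    # skip truncated or non-IRQ lines
            continue
        counts = [int(p) for p in parts[1:1 + len(cpus)] if p.isdigit()]
        if len(counts) != len(cpus):
            continue
        label = " ".join(parts[1 + len(cpus):])
        for i, c in enumerate(counts):
            per_cpu[i] += c
        per_source[label] = per_source.get(label, 0) + sum(counts)

for cpu, total in zip(cpus, per_cpu):
    print(cpu, total)
for label, total in sorted(per_source.items(), key=lambda kv: -kv[1])[:5]:
    print(total, label)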