IOS-XE controllers System Health Check for Debugging Issues

Below checks are a good starting point for troubleshooting issues with IOS-XE based controllers (5760, 3850, 3650). Its a good practice to keep an eye on some of these stats even during normal operation of the system. It will be handy to have the below info handy before opening TAC case.

1. CPU Utilization:

  • Monitor the cpu utilization history and make sure that there are no CPU spikes for longer period (more than couple of minutes).
  • The hourly average CPU utilization of any core for any given minute going higher than ~70% needs further close monitoring. Identify the process that is taking most of the CPU and see the spike can be attributed to any kind of network activity.
  • NOTE: If the controller is added to the Prime Infrastructure(PI), then there could be periodic spikes of some of the processes listed below:
    • snmp_subagent, eicored, eicore_ipc(wcm) and SNMP ENGINE (iosd)
    • The SNMP ENGINE utilization should drop down if there is any other higher priority activity that requires the CPU.
  • Commands

    show processes cpu sorted detailed | exclude 0.00 # Sorted list of processes to the thread level based on the CPU utilization.
    show processes cpu history

    Sample output with PI activity:
    #sh proc cpu sort det | e 0.00
    Core 0: CPU utilization for five seconds: 45%; one minute: 16%; five minutes: 13%
    Core 1: CPU utilization for five seconds: 44%; one minute: 14%; five minutes: 10%
    Core 2: CPU utilization for five seconds: 52%; one minute: 15%; five minutes: 11%
    Core 3: CPU utilization for five seconds: 44%; one minute: 15%; five minutes: 11%
    PID    T C  TID    Runtime(ms) Invoked uSecs  5Sec      1Min     5Min     TTY   Process
                                                   (%)       (%)      (%)
    5805   L           2621250     1821404 85     17.69     3.14    1.03    0     snmp_subagent
    5805   L 2  5805   1074        1478    0      15.99     1.47    0.39    0     snmp_subagent
    5805   L 0  6626   1389355     7685676 0      8.42      7.46    0.41    0     snmp_subagent
    5805   L 2  6586   762745      9850229 0      3.94      2.63    0.18    0     CMI default xdm
    5805   L 1  13461  874         33226   0      2.56      1.84    1.79    0     SNMP SA CACHE
    5805   L 2  6497   276868      724659  0      0.13      0.09    0.09    0     snmp_subagent
    8127   L           3936457     4684750 100    16.91     5.53    4.32    0     iosd
    8127   L 2  8127   600155      2944832 0      13.54     3.74    2.66    0     iosd
    8127   L 3  8742   2624104     1473118 0      2.47      1.69    1.62    0     iosd.fastpath
    8127   L 2  8743   707123      2610018 0      0.90      0.11    0.03    0     CMI Thread
    281    I           89336       1250203 0      24.44     4.11    1.11    0       SNMP ENGINE
    276    I           859231      18049   0      11.66     0.99    0.22    0       SNMP Timers
    22     I           1967404     1384683 0      8.55      1.66    0.44    0       CMI IOSd task
    59     I           1908371     5190050 0      3.00      2.99    2.88    0       ARP Snoop
    312    I           1547565     190253  0      2.77      1.33    1.33    0       QoS stats process
    274    I           671863      5192881 0      0.88      0.88    0.88    0       DAI Packet Process
    30     I           2510652     2756779 0      0.66      0.55    0.44    0       ARP Input
    200    I           911297      1344167 0      0.66      0.22    0.11    0       Tunnel IOSd shim DB
    195    I           1717254     4141977 0      0.33      0.22    0.22    0       IP Host Track Proce
    232    I           1066616     5559004 0      0.33      0.22    0.22    0       UDLD
    223    I           1590722     5162371 0      0.33      0.33    0.33    0       IP Input
    219    I           977672      3720297 0      0.11      0.11    0.11    0       DHCP Snooping
    8123   L           3603041     7383618 63     4.31      2.04    2.04    0     wcm
    8123   L 1  8808   2457789     5526106 0      2.34      4.40    0.40    0     eicore_ipc
    
    
    #show processes cpu history
    History information for system:
    
    
         11111     11111     11111     11111     11111          1111
        811111999991111177777111118888800000888880000099999999990000
    100
     90
     80
     70
     60
     50
     40
     30
     20
     10 ************************************************************
       0....5....1....1....2....2....3....3....4....4....5....5....
                 0    5    0    5    0    5    0    5    0    5
                   CPU% per second (last 60 seconds)
    
    
        164111111111111111454411116611111111111111135441113651111111
        157124331111111613114211143611111104010000091641006601002200
    100
     90
     80
     70  *                         *                       *
     60  *                        **                       #
     50  **                *      **                **     #*
     40  #*               *##*    ##               *##*   *#*
     30  #*               ###*    ##               *##*   *#*
     20  #*            *  ####    ##               ####   *##
     10 ############################################################
       0....5....1....1....2....2....3....3....4....4....5....5....
                 0    5    0    5    0    5    0    5    0    5
                   CPU% per minute (last 60 minutes)
                  * = maximum CPU%   # = average CPU%
    
    
        677776677677777766656666666565554555555555556575556556565555555655665644
        634359546523207032491111314839989215282111231924203084021809628696108257
    100
     90
     80     *   *     *
     70 ****************                              *                *
     60 ********************************   * *      ***   * ** * * ** ********
     50 ************************************************************************
     40 ************************************************************************
     30 ************************************************************************
     20 #*##*#**#########**********************************************#********
     10 ########################################################################
       0....5....1....1....2....2....3....3....4....4....5....5....6....6....7.
                 0    5    0    5    0    5    0    5    0    5    0    5    0
                           CPU% per hour (last 72 hours)
                        * = maximum CPU%   # = average CPU%
    
2. Memory Utilization:

  • Monitor the RSS column of the individual processes. If this component is monotonically growing then it is alarming and further investigation needed based on the process that is growing.
  • Monitor the free memory component of the system level memory. Don’t be alarmed if it drops down suddenly as IOSd usage could spike up and the memory freed will be held back in IOSd itself. If the system level free memory goes down below 40% it is alarming and further investigation needed.
  • If the IOSd is process is growing, the run IOSd process specific command and look for Processor 8E2F3008 1610612736 344548600 1266064136 1114693628 1102149236 line in the output. Make sure the Lowest and Largest Columns are not too much apart.
    Commands

    show processes memory sorted
    show memory detailed process iosd allocating-process totals # If the IOSd process RSS is growing.

    Sample output
    #show processes memory sorted
    System memory  : 7995664K total, 2601088K used, 5394576K free, 2227952K kernel reserved
    Lowest(b)      : 568393300
    PID     Text         Data      Stack     Heap      RSS       Total     Process
    8123    20048        1140084   88        34448     748636    1312796   wcm
    8127    63252        1601300   104       932       740208    1924924   iosd
    5240    9272         422520    92        115968    486208    783940    fed
    5782    4260         157088    88        81200     139288    256792    ffm
    5242    612          93880     88        1184      117536    306532    stack-mgr
    5805    760          219200    88        22908     102992    283952    snmp_subagent  
    
    #show memory detailed process iosd allocating-process totals
    System memory  : 7995664K total, 2601040K used, 5394624K free, 2227952K kernel reserved
    Lowest(b)      : 568393300
    
    Process iosd, type L, PID = 8127
        1924924K total, 63252K text, 1601300K data, 104K stack, 932K heap
    
    
                    Head    Total(b)     Used(b)     Free(b)   Lowest(b)  Largest(b)
    Processor   8E2F3008   1610612736   344548600   1266064136   1114693628   1102149236
    IOS Proce   8D2F2008    16777216     7691620     9085596     8954256     8466680
    
    
    Allocator PC Summary for: Processor
    
        PC          Total   Count  Name
    0xF4CFA0B7   43112624   19401  *Packet Data*
    0xF4CFA036   23628624   19560  *Packet Header*
    0xF5936761   20985680     320  AAA AttrL Sub
    0xF5A01BE7   12586816  168267  Init
    
3. AP and Client stats

  • Monitor the number of APs, clients and make sure there is no sudden drop in the count at any point.
  • Monitor the APs up time to make sure there are no AP drop offs.
  • Monitor the client stats for learn-ip state and clients with self-assigned IP addresses.
  • Monitor the AP load and channel utilization of all the APs.
    Commands

    show ap uptime # Look at AP uptime and Association uptime columns and make sure those are not too much apart, otherwise there are AP flappings
    show wcdb database all | include Num # Client count
    show wcdb database all | count LEARN|169.254. # Clients with self assigned IP, there may be a problem with the DHCP
    show ap dot11 24ghz load-info # Make sure the channel utilization is not high
    show ap dot11 5ghz load-info
    show wireless exclusion # Make sure the clients are not listed if disabled, Make sure the clients are not stuck if enabled with 0 timer value.
    show wireless client summary | i Idle # Make sure the clients are not stuck in this state.

    Sample output
    #show ap uptime
    Number of APs : 498
    AP Name               Ethernet MAC       AP Up Time                                Association Up Time
    ---------------------------------------------------------------------------------------------------------------------------
    tsim-01-1             00aa.0100.02f0     1 day 8 hours 42 minutes 15 seconds       1 day 8 hours 38 minutes 57 seconds
    tsim-01-2             00aa.0100.03f0     1 day 8 hours 42 minutes 15 seconds       1 day 8 hours 38 minutes 57 seconds
    tsim-01-3             00aa.0100.04f0     1 day 8 hours 42 minutes 14 seconds       1 day 8 hours 38 minutes 56 seconds
    #show wcdb database all | include Number
            Total Number of Wireless Clients = 3486
    #show ap dot11 24ghz load-info
    AP Name                           Radio MAC       Slot  Channel Utilization (%)  Clients
    ----------------------------------------------------------------------------------------
    ap-01-1                         00aa.0100.0200     0                       40        10
    ap-01-2                         00aa.0100.0300     0                       30        5
    #show wireless exclusionlist
    Manually Excluded Clients
    
    Dynamically Excluded Clients
    
    
4. Monitor System traffic / logs

  • Look for % in the logs to get all the syslog messages.
  • Look for the keywords like Trace, CAPWAP, AVL in the system logs
  • Monitor the traffic handled by the CPU and watch for any buffer exhaustion/suspension happening frequently.
    Commands

    show logging
    show platform punt client # Look for failures column for any packets punted to CPU, should be zero. It should not increase fast if it is non-zero
    show controllers ethernet-controller port-asic statistics exceptions # Look for any drop counter increments

Sample output
show logging | include Trace|%|AVL|CAPWAP
*May  8 11:50:53.940 PDT: %LINEPROTO-5-UPDOWN: Line protocol on Interface Capwap3, changed state to up
-Traceback= 1#b1a3e3fd1d3138ac635fe72eb77c44c3  :F39AF000+20DA8A3 :F39AF000+1380F83 :F39AF000+1382D17 :F39AF000+138C274

#show platform punt client
  tag      buffer        jumbo    fallback     packets   received   failures
                                            alloc   free  bytes    conv  buf
 27       0/1024/2048     0/5       0/5        0     0          0     0     0
 65536    0/1024/1600     0/0       0/512  2687494 2687494  171999616     0     0
 65537    0/ 512/1600     0/0       0/512  174604 174604   34383056     0     0
 65538    0/   5/5        0/0       0/5        0     0          0     0     0
 65539    0/2048/1600     0/16      0/512  60657819 60657819 2060395227     0     0
 65540    0/ 128/1600     0/8       0/0        0     0          0     0     0
 65541    0/ 128/1600     0/16      0/32   1490044 1490729  340860488     0     0
 65542    0/ 768/1600     0/4       0/0    5623288 5646925  337397580     0     0
 65544    0/  96/1600     0/4       0/0        0     0          0     0     0
 65545    0/  96/1600     0/8       0/32       0     0          0     0     0
 65546    0/ 512/1600     0/32      0/512  2482982 2914147  240875710     0     0
 65547    0/  96/1600     0/8       0/32       0     0          0     0     0
 65548    0/ 512/1600     0/32      0/256      0     0          0     0     0
 65551    0/ 512/1600     0/0       0/256      0     0          0     0     0
 65556    0/  16/1600     0/4       0/0        0     0          0     0     0
 65557    0/  16/1600     0/4       0/0        0     0          0     0     0
 65558    0/  16/1600     0/4       0/0        0     0          0     0     0
 65559    0/  16/1600     0/4       0/0        0     0          0     0     0
 65560    0/  16/1600     0/4       0/0        0     0          0     0     0
 65561    0/ 512/1600     0/0       0/256  100170460 129432612 1711903376     0     2
 65562    0/ 512/1600     0/0       0/256  3740090 9643238 1266448617     0     0
 65563    0/ 512/1600     0/0       0/256      0     0          0     0     0
 65565    0/ 512/1600     0/16      0/256      0     0          0     0     0
 65566    0/ 512/1600     0/16      0/256      0     0          0     0     0
 65567    0/ 512/1600     0/16      0/256      0     0          0     0     0
 65568    0/ 512/1600     0/16      0/256      0     0          0     0     0
 65583    0/   1/1        0/0       0/0        0     0          0     0     0
 131071    0/  96/1600     0/4       0/0        0     0          0     0     0
fallback pool: 0/1500/1600
jumbo pool:    0/128/9300



comments powered by Disqus