Nexus 7000 SUP2E Compact Flash Failure Recovery

This article describes one of the procedures to recover from a compact flash failure on a Cisco Nexus 7000 with a SUP2E supervisor. Cisco tracks this issue under bug ID CSCus22805 (CCO account required) in their bug documentation. Before showing the procedure and the CLI output captured during the recovery, we summarize how the Cisco documentation explains the issue.

Background

According to the documentation, each N7K Supervisor 2/2E is equipped with two eUSB flash devices in a RAID1 configuration, one primary and one mirror. Together they provide non-volatile repositories for boot images, the startup configuration, and persistent application data. Over months or years in service, one of these devices may get disconnected from the USB bus, causing the RAID software to drop the device from the configuration. The supervisor can still function normally with one of the two devices. However, when the second device also drops out of the array, the bootflash is remounted as read-only, meaning you cannot save the configuration or files to the bootflash, or allow the standby to sync to the active in the event it is reloaded.

Symptoms

  • Compact flash diagnostic failure:
    N7K-SUP2E# show diagnostic result module 5
    
     Current bootup diagnostic level: complete
     Module 5: Supervisor module-2  (Standby)
    
             Test results: (. = Pass, F = Fail, I = Incomplete,
             U = Untested, A = Abort, E = Error disabled)
    
              1) ASICRegisterCheck-------------> .
              2) USB---------------------------> .
              3) NVRAM-------------------------> .
              4) RealTimeClock-----------------> .
              5) PrimaryBootROM----------------> .
              6) SecondaryBootROM--------------> .
              7) CompactFlash------------------> F  <=====
              8) ExternalCompactFlash----------> U
              9) PwrMgmtBus--------------------> U
             10) SpineControlBus---------------> .
             11) SystemMgmtBus-----------------> U
             12) StatusBus---------------------> U
             13) StandbyFabricLoopback---------> .
             14) ManagementPortLoopback--------> .
             15) EOBCPortLoopback--------------> .
             16) OBFL--------------------------> .
  • Unable to perform ‘copy run start’:
    N7K-SUP2E# copy running-config startup-config
     [########################################] 100%
     Configuration update aborted: request was aborted
  • eUSB becomes read-only or is non-responsive
  • ISSU failures, usually when trying to fail over to the standby supervisor

Problem Analysis

To diagnose the current state of the compact flash cards, you need to use some internal commands Cisco provides in the documentation: show system internal raid | grep -A 1 “Current RAID status info” and show system internal file /proc/mdstat. If you have more than one supervisor, you can check the other one by prefixing the command with slot x, where x is the SUP2/2E slot position. Note that because these commands are internal, you have to type them out in full; tab completion does not work for them. Below is the output of these internal commands for our case.

N7K-SUP2E# show system internal raid | grep -A 1 "Current RAID status info"
 Current RAID status info:
 RAID data from CMOS = 0xa5 0xc3

In this output you want to look at the value next to 0xa5, which here is 0xc3. You can then use the table below to determine whether the primary flash, the mirror flash, or both have failed. The value 0xc3 tells us that both the primary and the mirror compact flashes have failed.

RAID Status Info   Description
0xf0               No failures reported
0xe1               Primary flash failed
0xd2               Alternate (or mirror) flash failed
0xc3               Both primary and alternate failed

N7K-SUP2E# show system internal file /proc/mdstat
Personalities : [raid1]
md6 : active raid1 sdb6[2](F) sdc6[1]
      77888 blocks [2/1] [_U]
      
md5 : active raid1 sdb5[2](F) sdc5[1]
      78400 blocks [2/1] [_U]
      
md4 : active raid1 sdb4[2](F) sdc4[1]
      39424 blocks [2/1] [_U]
      
md3 : active raid1 sdb3[2](F) sdc3[1]
      1802240 blocks [2/1] [_U]

In this scenario you can see that the primary compact flash is not up ([_U]). A healthy output shows all arrays as [UU]. Below is a sample of a healthy compact flash state, taken from my other SUP2E in slot 2.

N7K-SUP2E# slot 2 show system internal file /proc/mdstat
Personalities : [raid1] 
md6 : active raid1 sdc6[0] sdb6[1]
      77888 blocks [2/2] [UU]
      
md5 : active raid1 sdc5[0] sdb5[1]
      78400 blocks [2/2] [UU]
      
md4 : active raid1 sdc4[0] sdb4[1]
      39424 blocks [2/2] [UU]
      
md3 : active raid1 sdc3[0] sdb3[1]
      1802240 blocks [2/2] [UU]
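
If you have dual supervisors, you will need the RAID status code from both supervisors for the scenario tables in the next section. As mentioned above, the slot prefix also works for the RAID status command; a minimal sketch, with the slot number purely illustrative:

N7K-SUP2E# slot 2 show system internal raid | grep -A 1 "Current RAID status info"
N7K-SUP2E# slot 2 show system internal file /proc/mdstat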

Scenarios

To determine which scenario you are facing, Cisco defines several lettered scenarios. Use the commands from the “Problem Analysis” section above to match your situation to a scenario letter below.

Single supervisor:

Scenario Letter   Active Supervisor   Active Supervisor Code
A                 1 Fail              0xe1 or 0xd2
B                 2 Fail              0xc3

Dual supervisor:

Scenario Letter   Active Supervisor   Standby Supervisor   Active Supervisor Code   Standby Supervisor Code
C                 0 Fail              1 Fail               0xf0                     0xe1 or 0xd2
D                 1 Fail              0 Fail               0xe1 or 0xd2             0xf0
E                 1 Fail              1 Fail               0xe1 or 0xd2             0xe1 or 0xd2
F                 2 Fail              0 Fail               0xc3                     0xf0
G                 0 Fail              2 Fail               0xf0                     0xc3
H                 2 Fail              1 Fail               0xc3                     0xe1 or 0xd2
I                 1 Fail              2 Fail               0xe1 or 0xd2             0xc3
J                 2 Fail              2 Fail               0xc3                     0xc3

In the tables above, scenario F is the one that applies to our case, and it is the scenario we will use to show how we carried out this recovery for our client.

Recovery Procedure

Cisco has published a procedure for every scenario listed in the document. For scenario F, a non-impacting recovery is possible. Below is a summary of the procedure for scenario F:

  • Back up the running configuration of all VDCs externally. You can use the logging facility of your SSH terminal while running the “show running-config vdc-all” command.
  • Compare the running configuration (show running-config vdc-all) with the startup configuration (show startup-config vdc-all) and identify any configuration that is missing from the startup configuration (a short sketch of these pre-checks follows this list).
  • Perform supervisor switchover using “system switchover“.
  • The new standby supervisor will begin rebooting. During this time, add any missing configuration back on the new active.
  • The new standby should reach the “ha-standby” state. Use the “show module” command to verify it; alternatively, use “show redundancy status” and make sure all states under “Other supervisor” show “HA standby”.
  • If the new standby comes up in a “powered-up” state, you will need to bring it back online manually. This can be done by issuing the following commands, where “x” is the slot of the standby module stuck in the “powered-up” state:
  • (config)# out-of-service module x
    (config)# no poweroff module x
  • If the standby keeps getting stuck in the powered-up state and ultimately keeps power cycling after the steps above, this is likely because the active is reloading the standby for not coming up in time. To resolve this, configure the following, using “x” for the slot of the standby that is stuck in powered-up:
    (config)# system standby manual-boot
    (config)# reload module x force-dnld
  • Once the standby is back online in the “ha-standby” state, run the recovery tool to make sure the recovery is complete. The tool can be downloaded at the following link:
    recovery tool
  • Unzip the recovery tool, upload it to the bootflash of the box, and execute the following command: “load bootflash:n7000-s2-flash-recovery-tool.10.0.2.gbin”
  • Check the recovery status with the “show system internal file /proc/mdstat” command.
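
As a minimal sketch of the first two steps (with session logging already enabled in your SSH client; the terminal length command simply keeps the output from being paged):

N7K-SUP2E# terminal length 0
N7K-SUP2E# show running-config vdc-all
N7K-SUP2E# show startup-config vdc-all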

Procedure Output

OK, let’s move on to the execution. To avoid any confusion about supervisor status, I will name the supervisors as follows: Sup1Active means supervisor one in the active state, and Sup2Standby means supervisor two in the standby state. The state of each supervisor changes during the procedure, so keep this in mind.

Switchover Supervisor

On Sup1Active, perform the supervisor switchover. Sup1 will start to reboot and become Sup1Standby.

N7K-SUP2E# system switchover 
N7K-SUP2E# 
User Access Verification
N7K-SUP2E login: 
>>>
>>>
>>>
NX7k SUP BIOS version ( 2.11 ) : Build - 01/09/2013 18:16:20
PM FPGA Version : 0x00000024 
Power sequence microcode revision - 0x00000009 : card type - 10156EEA0
Booting Spi Flash : Primary 
  CPU Signature - 0x000106e4: Version - 0x000106e0 
  CPU - 2 : Cores - 4 : HTEn - 1 : HT - 2 : Features - 0xbfebfbff 
  FSB Clk - 532 Mhz :  Freq - 2143 Mhz - 2128 Mhz 
  MicroCode Version : 0x00000002 
  Memory - 32768 MB : Frequency - 1067 MHZ 
  Loading Bootloader: Done 
  IO FPGA Version   : 0x1000d 
  PLX Version       : 861910b5
Bios digital signature verification - Passed
USB bootflash status : [1-1:0-0]
...

Below is the output from Sup2Active, previously Sup2Standby.

N7K-SUP2E(standby)# 2017 Apr 22 01:58:02  %$ VDC-1 %$ Apr 22 01:58:02 %KERN-2-SYSTEM_MSG: [18173381.026292] Switchover started by redundancy driver - kernel
2017 Apr 22 01:58:02  %$ VDC-1 %$ %SYSMGR-2-HASWITCHOVER_PRE_START: This supervisor is becoming active (pre-start phase).
2017 Apr 22 01:58:02  %$ VDC-1 %$ %SYSMGR-2-HASWITCHOVER_START: Supervisor 2 is becoming active.
2017 Apr 22 01:58:02  %$ VDC-1 %$ %SYSMGR-2-SWITCHOVER_OVER: Switchover completed.
N7K-SUP2E# show module
Mod  Ports  Module-Type                         Model              Status
---  -----  ----------------------------------- ------------------ ----------
1    0      Supervisor module-2                                    powered-up
2    0      Supervisor module-2                 N7K-SUP2E          active *
3    48     1000 Mbps Optical Ethernet XL Modul N7K-M148GS-11L     ok
4    24     10 Gbps Ethernet Module             N7K-M224XP-23L     ok
...

In my case, Sup1Standby was not able to come back online. If you see lines like the ones below during the bootup process, it is a sign that your Sup has failed to boot, and it will end up in switch boot mode.

...
RAID assembly failed. Stopping all RAID partitions...
Trying to mount bootflash /dev/sdd3...
mount: block device /dev/sdd3 is write-protected, mounting read-only
mount: wrong fs type, bad option, bad superblock on /dev/sdd3,
       or too many mounted file systems
/dev/sdd3 mount failed, trying /dev/sdc3...
/dev/sdc3: Input/output error
mount: block device /dev/sdc3 is write-protected, mounting read-only
/dev/sdc3: Input/output error
mount: /dev/sdc3 is not a valid block device
Cannot find any valid bootflash partitions.
....
switch(boot)#

Even in switch boot mode you are not able to load the kickstart or system image, since the Sup is not aware of any flash storage containing them.

switch(boot)# dir 

Usage for bootflash: filesystem 
   98643968 bytes used
  320786432 bytes free
  419430400 bytes total

Hence, we need to move on to the next step to bring Sup1Standby online. On Sup2Active, issue the commands below.

N7K-SUP2E(config)# out-of-service module 1
N7K-SUP2E(config)# 2017 Apr 22 02:00:46  %$ VDC-1 %$ %PLATFORM-2-MOD_PWRDN: Module 1 powered down (Serial number )
2017 Apr 22 02:00:46 N7K-SUP2E-VDC-4 %$ VDC-4 %$ %PLATFORM-2-MOD_PWRDN: Module 1 powered down (Serial number )
2017 Apr 22 02:00:46 N7K-SUP2E-VDC-2 %$ VDC-2 %$ %PLATFORM-2-MOD_PWRDN: Module 1 powered down (Serial number )
2017 Apr 22 02:00:46 N7K-SUP2E-VDC-3 %$ VDC-3 %$ %PLATFORM-2-MOD_PWRDN: Module 1 powered down (Serial number )
N7K-SUP2E(config)# no poweroff module 1

From the Sup1Standby console, you will see it begin to boot up. When you see lines like the ones below during the bootup process, it is a sign that your Sup is in a good state.

...
Trying to mount bootflash /dev/sdd3...
Mounted primary /dev/sdd3 as /bootflash
Existing bootflash found, saving files...
Saving n7000-s2-dk9-npe.6.1.1.bin
Saving n7000-s2-dk9.6.1.2.bin
Saving n7000-s2-kickstart-npe.6.1.1.bin
Saving n7000-s2-kickstart.6.1.2.bin
Initializing the system...
Unmounting file systems...
Making partitions on physical devices...
Initializing RAID services...
Initializing startup-config and licenses...
mke2fs 1.35 (28-Feb-2004)
Checking for bad blocks (read-only test): done                        
mke2fs 1.35 (28-Feb-2004)
Checking for bad blocks (read-only test): done                        
Formatting PSS:
mke2fs 1.35 (28-Feb-2004)
Checking for bad blocks (read-only test): done                        
Formatting bootflash...
mke2fs 1.35 (28-Feb-2004)
Checking for bad blocks (read-only test): done                        
Fri Jan 3 19:04:29 2017: RAIDMON: Data(0x0) provided saved successfully to CMOS
Initialization completed - No reinit of CMOS/NVRAM
Copying saved files back to bootflash...
Checking obfl filesystem.
Checking all filesystems..... done.
Warning: switch is starting up with default configuration
rLoading system software
/bootflash//n7000-s2-dk9.6.1.2.bin read done
System image digital signature verification successful.
Uncompressing system image: bootflash:/n7000-s2-dk9.6.1.2.bin Fri Jan 3 19:06:12 UTC 2017
blogger: nothing to do.

..done Fri Jan 3 19:06:15 UTC 2017
Load plugins that defined in image conf: /isan/plugin_img/img.conf
Loading plugin 0: core_plugin...
num srgs 1
0: swid-core-sup2dc3, swid-core-sup2dc3
num srgs 1
0: swid-sup2dc3-ks, swid-sup2dc3-ks
INIT: Entering runlevel: 3



User Access Verification
N7K-SUP2E(standby) login:

Now we need to wait until Sup1Standby reaches the “ha-standby” state. In this situation we prefer the “show redundancy status” command over “show module” on Sup2Active, because it lets us follow Sup1Standby’s progress towards the “ha-standby” state.

N7K-SUP2E# show redundancy status 
Redundancy mode
---------------
      administrative:   HA
         operational:   None

This supervisor (sup-2)
-----------------------
    Redundancy state:   Active
    Supervisor state:   Active
      Internal state:   Active with HA standby

Other supervisor (sup-1)
------------------------
    Redundancy state:   Standby

    Supervisor state:   Unknown
      Internal state:   Other
...
N7K-SUP2E# show redundancy status 
Redundancy mode
---------------
      administrative:   HA
         operational:   None

This supervisor (sup-2)
-----------------------
    Redundancy state:   Active
    Supervisor state:   Active
      Internal state:   Active with HA standby

Other supervisor (sup-1)
------------------------
    Redundancy state:   Standby

    Supervisor state:   HA standby
      Internal state:   HA synchronization in progress
...
N7K-SUP2E# show redundancy status 
Redundancy mode
---------------
      administrative:   HA
         operational:   HA

This supervisor (sup-2)
-----------------------
    Redundancy state:   Active
    Supervisor state:   Active
      Internal state:   Active with HA standby

Other supervisor (sup-1)
------------------------
    Redundancy state:   Standby

    Supervisor state:   HA standby
      Internal state:   HA standby
...

Sup1Standby is the problematic Sup with the flash failure. Once the login prompt appears, log in to Sup1Standby and execute “show system internal file /proc/mdstat” to watch the recovery progress on this Sup (we do not need to load the recovery tool on Sup1Standby; the reload procedure automatically recovers its flash).

N7K-SUP2E(standby)#  show system internal file /proc/mdstat
Personalities : [raid1] 
md6 : active raid1 sdd6[2] sdc6[1]
      77888 blocks [2/1] [_U]
        resync=DELAYED
      
md5 : active raid1 sdd5[2] sdc5[1]
      78400 blocks [2/1] [_U]
        resync=DELAYED
      
md4 : active raid1 sdd4[2] sdc4[1]
      39424 blocks [2/1] [_U]
        resync=DELAYED
      
md3 : active raid1 sdd3[2] sdc3[1]
      1802240 blocks [2/1] [_U]
      [=========>...........]  recovery = 45.4% (819648/1802240) finish=1.2min s
peed=13142K/sec

Repeat the command above until you see a result like the one below; at that point, Sup1Standby is ready.

N7K-SUP2E(standby)#  show system internal file /proc/mdstat
Personalities : [raid1] 
md6 : active raid1 sdd6[0] sdc6[1]
      77888 blocks [2/2] [UU]
      
md5 : active raid1 sdd5[0] sdc5[1]
      78400 blocks [2/2] [UU]
      
md4 : active raid1 sdd4[0] sdc4[1]
      39424 blocks [2/2] [UU]
      
md3 : active raid1 sdd3[0] sdc3[1]
      1802240 blocks [2/2] [UU]

Execute Recovery Tool

Since we are following the scenario F procedure, it should not be necessary to execute the recovery tool on Sup2Active, because Sup1Standby is supposed to be the only problematic Sup with a flash failure. In our case, however, after the supervisor switchover, and even though the RAID status info showed 0xf0, we found that the Sup2Active RAID status was not in the [UU] state. You can already save the configuration to startup at this point.

N7K-SUP2E# show system internal raid 
Current RAID status info:
RAID data from CMOS = 0xa5 0xf0
RAID data from driver disks 0 bad 0 name 
Bootflash: /dev/sdc
Mirrorflash: /dev/sdd

Current RAID status:
Personalities : [raid1] 
md6 : active raid1 sdc6[0]
      77888 blocks [2/1] [U_]
      
md5 : active raid1 sdc5[0]
      78400 blocks [2/1] [U_]
      
md4 : active raid1 sdc4[0]
      39424 blocks [2/1] [U_]
      
md3 : active raid1 sdc3[0]
      1802240 blocks [2/1] [U_]

Hence we need to execute the recovery tool. When you execute the tool, it automatically copies itself to the standby Sup if you have a redundant Sup. Notice in the output that, since Sup1Standby has already recovered, the tool does not attempt any recovery action on it. Execute the command below to run the tool.

N7K-SUP2E# load bootflash:n7000-s2-flash-recovery-tool.10.0.2.gbin
Loading plugin version 10.0(2)
###############################################################
  Warning: debug-plugin is for engineering internal use only!
  For security reason, plugin image has been deleted.
###############################################################
INFO: Running on active slot 2, checking if a ha-standby is available...
INFO: Standby present in slot 1. Copying the recovery tool...
###############################################################
  Warning: debug-plugin is for engineering internal use only!
  For security reason, plugin image has been deleted.
###############################################################
INFO: Running on the standby in slot 1, Checking RAID status...
INFO: Both disks are found to be healthy.
INFO: Verifying RAID configuration. Got primary=sdb Secondary=sdd
INFO: RAID device md3 is healthy.
INFO: RAID device md4 is healthy.
INFO: RAID device md5 is healthy.
INFO: RAID device md6 is healthy.
INFO: No recovery was attempted on module 1. All flashes left intact.
INFO: A detailed copy of the this log was saved as volatile:flash_repair_log_mod1.tgz.
INFO: Recovery procedures complete on module 1.
INFO: Please check for any errors in previous messages.
INFO: Run 'show system internal file /proc/mdstat' and check 'up status' [UU] for all disks.
INFO: Run 'show diagnostic result module ' on all available supervisor slots.
INFO: And restart CompactFlash test (7) instances if not in running state.
Loading plugin version 10.0(2)
INFO: Now starting the flash recovery procedures on active.
INFO: Primary=sdc(sdc) Secondary=sdd(sdd) Working=sdc
WARNING: Attempting recovery of secondary device sdd
INFO: Removing /dev/sdd from RAID configuration...
INFO: Resetting secondary flash...
INFO: Found secondary device sdd in 9 seconds.
INFO: Running health checks on the recovered device /dev/sdd...
INFO: Basic I/O tests passed. /dev/sdd looks healthy and responsive.
INFO: Verifying RAID configuration. Got primary=sdc Secondary=sdd
INFO: sdc3 is already a part of md3.
INFO: Adding sdd3 back into md3 RAID configuration...
INFO: sdc4 is already a part of md4.
INFO: Adding sdd4 back into md4 RAID configuration...
INFO: sdc5 is already a part of md5.
INFO: Adding sdd5 back into md5 RAID configuration...
INFO: sdc6 is already a part of md6.
INFO: Adding sdd6 back into md6 RAID configuration...
INFO: Resetting RAID status in CMOS...
WARNING: Flash recovery attempted on module 2.
INFO: A detailed copy of the this log was saved as volatile:flash_repair_log_mod2.tgz.
INFO: Recovery procedures complete on module 2.
INFO: Please check for any errors in previous messages.
INFO: Run 'show system internal file /proc/mdstat' and check 'up status' [UU] for all disks.
INFO: Run 'show diagnostic result module ' on all available supervisor slots.
INFO: And restart CompactFlash test (7) instances if not in running state.
N7K-SUP2E# show system internal file /proc/mdstat
Personalities : [raid1] 
md6 : active raid1 sdd6[2] sdc6[0]
      77888 blocks [2/1] [U_]
        resync=DELAYED
      
md5 : active raid1 sdd5[2] sdc5[0]
      78400 blocks [2/1] [U_]
        resync=DELAYED
      
md4 : active raid1 sdd4[2] sdc4[0]
      39424 blocks [2/1] [U_]
        resync=DELAYED
      
md3 : active raid1 sdd3[2] sdc3[0]
      1802240 blocks [2/1] [U_]
      [==>..................]  recovery = 14.7% (265984/1802240) finish=2.0min s
peed=12665K/sec

Wait until all arrays have resynchronized. At that point, all of your flashes are working again.
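
The tool’s closing messages also remind you to restart the CompactFlash test (test 7) if it is not in a running state. A minimal sketch, assuming the on-demand GOLD diagnostic syntax on this platform and using module 2 as the example:

N7K-SUP2E# diagnostic start module 2 test 7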

N7K-SUP2E# show diagnostic result module 2

Current bootup diagnostic level: complete
Module 2: Supervisor module-2  (Active)

        Test results: (. = Pass, F = Fail, I = Incomplete,
        U = Untested, A = Abort, E = Error disabled)

         1) ASICRegisterCheck-------------> .
         2) USB---------------------------> .
         3) NVRAM-------------------------> .
         4) RealTimeClock-----------------> .
         5) PrimaryBootROM----------------> .
         6) SecondaryBootROM--------------> .
         7) CompactFlash------------------> .
         8) ExternalCompactFlash----------> U
         9) PwrMgmtBus--------------------> .
        10) SpineControlBus---------------> .
        11) SystemMgmtBus-----------------> .
        12) StatusBus---------------------> .
        13) StandbyFabricLoopback---------> .
        14) ManagementPortLoopback--------> .
        15) EOBCPortLoopback--------------> .
        16) OBFL--------------------------> .

Don’t forget to save all configuration to startup config.

N7K-SUP2E# copy running-config startup-config vdc-all
[########################################] 100%
Copy complete.

Source:

Nexus 7000 Supervisor 2/2E Compact Flash Failure Recovery

Contributor:
Muhammad Benny
Network Engineer 

Dirga Bramantyo
Network Engineer - CCNP

Ananto Yudi Hendrawan
Network Engineer - CCIE Service Provider #38962, RHCE, VCP6-DCV
nantoyudi@gmail.com

Cisco ISR 4331 Throughput Capacity

This article describes an issue we faced in the past regarding Cisco router throughput capacity. The issue is quite interesting, since I did not know that some Cisco routers are delivered with a throughput license feature.

It started when we received a report from our client that they were experiencing slow data transfer when the link reached 90-95 Mbps. The throughput graph is shown in the picture below.

As a basic troubleshooting step, we checked the router CPU and saw that all processes were normal. There were also no packet drops on the interfaces. One key thing we discovered was that the slowness only affected traffic going through the congested link (I say congested because our customer has a 1 Gbps link, yet it never reached 100 Mbps). Another question that came to mind was what the evidence for the slowness actually was. Our customer sent a ping comparison: when the traffic was around 40-60 Mbps, pings through the router had an average delay of about 2-3 ms; when the congestion occurred, pings through the device had an average delay of about 40-44 ms. According to the graph above the traffic never even reached 90 Mbps, but when we verified it from the CLI it did.

After several tests on the network, we started to dig into the Cisco documentation. According to Cisco, the aggregate throughput handled by the ISR4331 is 100 Mbps to 300 Mbps. By default the router runs at 100 Mbps of throughput, and you can increase it to a maximum of 300 Mbps using a throughput license. A summary of the throughput levels for each ISR4000 series model is shown in the picture below.

At this point we could not increase the router throughput capacity without buying the throughput license. Fortunately, Cisco provides an evaluation (trial) license, so we could apply a temporary remediation and let the current traffic use more bandwidth.

Before we start to activate the temporary license, let’s do some verification on the license status.

Current Throughput Level

ISR4331#show platform hardware throughput level 
The current throughput level is 100000 kb/s

Current License Status

ISR4331#sh license feature  
Feature name             Enforcement  Evaluation  Subscription   Enabled  RightToUse 
!
!output omitted for brevity
!
throughput               yes          yes         no             no       yes        
internal_service         yes          no          no             no       no
ISR4331#show license 
!
!output omitted for brevity
!
Index 7 Feature: throughput                     
        Period left: Not Activated
        Period Used: 0  minute  0  second  
        License Type: EvalRightToUse
        License State: Active, Not in Use, EULA not accepted
        License Count: Non-Counted
        License Priority: None

Now let’s enable the temporary throughput license on the router. It will be available for the next 60 days. Don’t forget to save your configuration and reload the chassis for the change to take effect (a short sketch follows the license output below).

ISR4331(config)#platform hardware throughput level 300000
         Feature Name:throughput
 
PLEASE  READ THE  FOLLOWING TERMS  CAREFULLY. INSTALLING THE LICENSE OR
LICENSE  KEY  PROVIDED FOR  ANY CISCO  PRODUCT  FEATURE  OR  USING SUCH
PRODUCT  FEATURE  CONSTITUTES  YOUR  FULL ACCEPTANCE  OF  THE FOLLOWING
TERMS. YOU MUST NOT PROCEED FURTHER IF YOU ARE NOT WILLING TO  BE BOUND
BY ALL THE TERMS SET FORTH HEREIN.
 
Use of this product feature requires  an additional license from Cisco,
together with an additional  payment.  You may use this product feature
on an evaluation basis, without payment to Cisco, for 60 days. Your use
of the  product,  including  during the 60 day  evaluation  period,  is
subject to the Cisco end user license agreement
http://www.cisco.com/en/US/docs/general/warranty/English/EU1KEN_.html
If you use the product feature beyond the 60 day evaluation period, you
must submit the appropriate payment to Cisco for the license. After the
60 day  evaluation  period,  your  use of the  product  feature will be
governed  solely by the Cisco  end user license agreement (link above),
together  with any supplements  relating to such product  feature.  The
above  applies  even if the evaluation  license  is  not  automatically
terminated  and you do  not receive any notice of the expiration of the
evaluation  period.  It is your  responsibility  to  determine when the
evaluation  period is complete and you are required to make  payment to
Cisco for your use of the product feature beyond the evaluation period.
 
Your  acceptance  of  this agreement  for the software  features on one
product  shall be deemed  your  acceptance  with  respect  to all  such
software  on all Cisco  products  you purchase  which includes the same
software.  (The foregoing  notwithstanding, you must purchase a license
for each software  feature you use past the 60 days evaluation  period,
so  that  if you enable a software  feature on  1000  devices, you must
purchase 1000 licenses for use past  the 60 day evaluation period.)   
 
Activation  of the  software command line interface will be evidence of
your acceptance of this agreement.

ACCEPT? (yes/[no]): yes
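
As noted earlier, the new throughput level only takes effect after the configuration is saved and the chassis is reloaded. A minimal sketch of that step:

ISR4331#copy running-config startup-config
ISR4331#reload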

Now let’s verify the license status after enabling the temporary throughput license.

ISR4331#show license feature 
Feature name             Enforcement  Evaluation  Subscription   Enabled  RightToUse 
!
!output omitted for brevity
!        
throughput               yes          yes         no             yes      yes        
internal_service         yes          no          no             no       no    
ISR4331#show license         
!
!output omitted for brevity
!                         
Index 7 Feature: throughput                     
        Period left: 8  weeks 4  days 
        Period Used: 0  day  0 hours 
        License Type: EvalRightToUse
        License State: Active, In Use
        License Count: Non-Counted
        License Priority: Low

Finally, here is the throughput graph after enabling the temporary throughput license.

Contributor:

Ananto Yudi Hendrawan
Network Engineer - CCIE Service Provider #38962, RHCSA, VCP6-DCV
nantoyudi@gmail.com

High CPU due to OSPF-100 Router process

Hello, our recent issue at a client was high CPU due to the OSPF-100 Router process.

Below is my box info

PID: CISCO2811
OS: 12.4(24)T8

When we checked the box, we found the following processes at the top of the list.

ROUTER01#sh proc cpu sort
CPU utilization for five seconds: 100%/18%; one minute: 99%; five minutes: 98%
PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process 
 212  3959892793  60425944      65533 72.21% 68.38% 66.57%   0 OSPF Router 100    
 321  1821538078  93942136      19390  1.99%  1.98%  1.90%   0 SSH Process
 122   2511766172961610201          0  1.07%  1.27%  1.20%   0 IP Input 
ROUTER01#

I tried to check the log messages but found nothing. Some sources I found on the internet say this is caused by link flaps, but I could not find any unstable link on this box. The only thing that looked suspicious was the interface status:

ROUTER01#sh ip inter br
Interface                  IP-Address      OK? Method Status                Protocol
FastEthernet0/0            10.23.5.214     YES NVRAM  up                    up      
FastEthernet0/1            unassigned      YES NVRAM  up                    down    
Serial0/1/0:0              10.21.200.93    YES NVRAM  up                    up      
Serial0/1/1:0              10.21.200.105   YES NVRAM  administratively down down    
Serial0/2/0:0              10.21.20.1      YES NVRAM  down                  down    
Serial0/2/1:0              10.21.200.249   YES NVRAM  up                    up      
Serial0/3/0:0              10.21.201.73    YES NVRAM  up                    up      
Serial0/3/1:0              10.20.200.166   YES NVRAM  up                    up      
Multilink1                 10.21.201.209   YES NVRAM  up                    up      
Loopback0                  10.23.0.39      YES NVRAM  up                    up      
ROUTER01#

I tried to shut down interface Fa0/1 and the CPU utilization decreased.
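
For reference, this was nothing more than the standard interface shutdown commands:

ROUTER01(config)#interface FastEthernet0/1
ROUTER01(config-if)#shutdown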

ROUTER01#sh processes cpu history  

ROUTER01   10:16:01 AM Monday Dec 2 2013 WIB

                                                                
    111111111111111111111111111111111111111111111111111111111111
    777666662222288888444447777799999888889999999999444442222266
100                                                             
 90                                                             
 80                                                             
 70                                                             
 60                                                             
 50                                                             
 40                                                             
 30                                                             
 20 ********     *****     *************************          **
 10 ************************************************************
   0....5....1....1....2....2....3....3....4....4....5....5....6
             0    5    0    5    0    5    0    5    0    5    0
               CPU% per second (last 60 seconds)

                                                                
    121112999999999999999899999999998999999999999799999999999999
    947760999939999999999899999999995999998999999799999999999999
100       **** ********** ********** ************ **************
 90       *************************************** **************
 80       ******************************************************
 70       ******************************************************
 60       *********#********************************************
 50       #****###*######*****#*#*#***##*****#******#**#**#*****
 40       ####*##########*#*#*###*##**##***#*#****######**###**#
 30       ####*##########*#*########*###*########*#######*####*#
 20 ##***#######################################################
 10 ############################################################
   0....5....1....1....2....2....3....3....4....4....5....5....6
             0    5    0    5    0    5    0    5    0    5    0
               CPU% per minute (last 60 minutes)
              * = maximum CPU%   # = average CPU%

I still don’t have a full explanation for this, but I do know it worked in my case.

contributor,

Ananto Yudi, CCIE Service Provider #38962
nantoyudi@gmail.com

Nexus7000 – DEVICE_TEST-STANDBY-2-PWR_MGMT_BUS_FAIL

Here we go, another log message on our Nexus 7000

2013 Sep 21 20:27:28 Nexus7000-DIS %DEVICE_TEST-STANDBY-2-PWR_MGMT_BUS_FAIL: Module 10 has failed test SpineControlBus 20 times on device Power Mgmt Bus on slot 19 due to error Spine control test failed error number 0x00000002

Below is my box info

PID: N7K-C7018
SUP: N7K-SUP1
LineCard: N7K-F248XP-25
OS: 6.0(3)

We have one box with two SUP1s, and this log belongs to the standby SUP. We tried to verify in detail what caused the error.

Nexus7000-DIS# sh diagnostic result module 10

Current bootup diagnostic level: complete
Module 10: Supervisor module-1X  (Standby)

        Test results: (. = Pass, F = Fail, I = Incomplete,
        U = Untested, A = Abort, E = Error disabled)

         1) ASICRegisterCheck-------------> .
         2) USB---------------------------> .
         3) CryptoDevice------------------> .
         4) NVRAM-------------------------> .
         5) RealTimeClock-----------------> .
         6) PrimaryBootROM----------------> .
         7) SecondaryBootROM--------------> .
         8) CompactFlash------------------> .
         9) ExternalCompactFlash----------> .
        10) PwrMgmtBus--------------------> U
        11) SpineControlBus---------------> E
        12) SystemMgmtBus-----------------> U
        13) StatusBus---------------------> U
        14) StandbyFabricLoopback---------> .
        15) ManagementPortLoopback--------> .
        16) EOBCPortLoopback--------------> .
        17) OBFL--------------------------> .

Nexus7000-DIS#

We can see from the diagnostic result that the SpineControlBus diagnostic test has failed and gone into the error-disabled state, so I looked at the details of this diagnostic test.

Nexus7000-DIS# sh diagnostic result module 10 test SpineControlBus detail 

Current bootup diagnostic level: complete
Module 10: Supervisor module-1X  (Standby)

  Diagnostic level at card bootup: complete

        Test results: (. = Pass, F = Fail, I = Incomplete,
        U = Untested, A = Abort, E = Error disabled)

        ______________________________________________________________________

        11) SpineControlBus E

                Error code ------------------> DIAG TEST ERR DISABLE
                Total run count -------------> 1056091
                Last test execution time ----> Sat Sep 21 20:27:28 2013
                First test failure time -----> Fri Sep 21 00:46:24 2012
                Last test failure time ------> Sat Sep 21 20:27:28 2013
                Last test pass time ---------> Sat Sep 21 20:26:28 2013
                Total failure count ---------> 27
                Consecutive failure count ---> 2
                Last failure reason ---------> Spine control test failed
                Next Execution time ---------> Sat Sep 21 20:27:58 2013

        XBar      1  2  3  4  5
         ---------------------------------------------------------------------
                  F  F  F  I  .

        ______________________________________________________________________


Nexus7000-DIS#

We opened a TAC case with Cisco and got the explanation below:

“The current version running on the box is 6.0(3).The switch is hitting a well-known cosmetic software bug CSCuc72466. It is fixed in 6.2(2). The SpineControlBus tests active and standby access to the Spine card in order to determine if the spine works. However, that access can only be done one at a time. When both active and standby run the test at the same time, one of the tests (usually the standby test) fails. The failure is a false alarm and not an indication of an actual hardware failure. It does not have any impact on the data traffic in the switch.”

The details are tracked under bug ID CSCuc72466, which is fixed in 6.2(2).

contributor,

Ananto Yudi, CCIE Service Provider #38962
nantoyudi@gmail.com

Nexus7000 – KERN-0-SYSTEM_MSG

Another log message came on our Nexus 7000 several days ago

Nov 4 12:15:27 %KERN-0-SYSTEM_MSG: [29333429.955846] NVRAM Error: (line 871):Invalid cksum for block 1 expected 0xea2 got 0xea4 – kernel

Below is my box info

PID: N7K-C7018
SUP: N7K-SUP1
LineCard: N7K-F248XP-25
OS: 6.0(3)

We opened a case with Cisco to address this issue, and the TAC engineer advised replacing the SUP in slot 10. One interesting thing about this log: it did not show up in the log buffer, it only appeared on our CLI terminal.

contributor,

Ananto Yudi, CCIE Service Provider #38962
nantoyudi@gmail.com

High CPU due to IP RIB Update process

Hi, our recent issue at a client was high CPU due to the IP RIB Update process.

Below is my box info

PID: WS-C3750G-24TS-S
OS: 12.2(35)SE2

When we checked the box, we found the following processes at the top of the list.

SW01#sh proc cpu sort
CPU utilization for five seconds: 100%/18%; one minute: 99%; five minutes: 98%
PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process 
 202  3959892793  60425944      65533 72.21% 68.38% 66.57%   0 IP RIB Update    
 117  1821538078  93942136      19390  1.99%  1.98%  1.90%   0 HL3U bkgrd proce 
 116   2511766172961610201          0  1.07%  1.27%  1.20%   0 Hulc LED Process 

We checked the total number of routes received on the box and found a large number of prefixes.

SW01#sh ip route sum
IP routing table name is Default-IP-Routing-Table(0)
IP routing table maximum-paths is 32
Route Source    Networks    Subnets     Overhead    Memory (bytes)
connected       0           8           1028        1216
static          0           0           0           0
ospf 100        3           26829       1717376     4079484
  Intra-area: 151 Inter-area: 1216 External-1: 10807 External-2: 14658
  NSSA External-1: 0 NSSA External-2: 0
internal        24                                  28128
Total           27          26837       1718404     4108828
SW01#

When dealing with this issue, we had two options to mitigate it: first, change the OSPF normal area to a stub area; second, change the SDM template to “desktop routing”. A rough sketch of the first option is shown below.
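
For illustration only, the first option would have looked roughly like the following on this box (the area number is just an example, and the same change would be needed on the other OSPF routers in that area):

SW01(config)#router ospf 100
SW01(config-router)#area 1 stub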

Since changing OSPF would require re-engineering all the network devices behind this box, we chose the second option and changed the SDM template.

Our concern with changing the SDM template was that the desktop routing template only supports 11K IPv4 routes, while the routes actually received on the box reach 26K.

SW01#sh sdm prefer
The current template is "desktop routing" template.
The selected template optimizes the resources in
the switch to support this level of features for
8 routed interfaces and 1024 VLANs. 

  number of unicast mac addresses:                  3K
  number of IPv4 IGMP groups + multicast routes:    1K
  number of IPv4 unicast routes:                    11K
    number of directly-connected IPv4 hosts:        3K
    number of indirect IPv4 routes:                 8K
  number of IPv4 policy based routing aces:         0.5K
  number of IPv4/MAC qos aces:                      0.5K
  number of IPv4/MAC security aces:                 1K
SW01#

We moved forward and changed the SDM template.

SW01(config)#sdm prefer routing 
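
On the 3750, an SDM template change does not take effect until the switch is reloaded, so the change was followed by a save and a reload:

SW01(config)#end
SW01#copy running-config startup-config
SW01#reload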

After the reload, we saw good results on the box.

SW01#sh proc cpu hist
                                                              
                               6666655555222226666644442222211
    7766666777777777788888777771111133333444446666699994444400
100                                                           
 90                                                           
 80                                                           
 70                                           *****           
 60                            *****          *****           
 50                            **********     *********       
 40                            **********     *********       
 30                            **********     *********       
 20                            *****************************  
 10 **********************************************************
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5    
               CPU% per second (last 60 seconds)
SW01#sh proc cpu sort | ex 0.00
CPU utilization for five seconds: 7%/0%; one minute: 25%; five minutes: 35%
PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process 
 117      121285      5168      23468  1.27%  1.34%  1.33%   0 HL3U bkgrd proce 
 221      347043     33414      10386  0.47% 12.47%  7.88%   0 OSPF-100 Router  
 170        3097     28694        107  0.31%  0.06%  0.01%   0 IP Input         
  51        9192    253360         36  0.15%  0.07%  0.06%   0 Fifo Error Detec 
 125        6109      2945       2074  0.15%  0.11%  0.10%   0 HRPC qos request 
SW01#

contributor,

Ananto Yudi, CCIE Service Provider #38962
nantoyudi@gmail.com

%EC-5-UNBUNDLE: Interface Te 1/49 left the port-channel Po1

I faced this issue several weeks after installation. When it came to me, I simply thought it was a straightforward layer 1 issue, but the truth turned out to be different.

Below is my box info

PID: WS-C4948E
OS: 15.0(2)SG2

A. Our Devices Topology

1. High level design

2. Physical connectivity

B. Problem Description

When the issue occurred, the only information we received was from the logs on the access switch.

Sep 20 04:07:13.238 WIB: %EC-5-UNBUNDLE: Interface Te1/52 left the port-channel Po1
Sep 20 04:07:20.878 WIB: %EC-5-BUNDLE: Interface Te1/52 joined port-channel Po1
Sep 20 04:07:39.478 WIB: %EC-5-UNBUNDLE: Interface Te1/52 left the port-channel Po1
Sep 20 04:07:46.499 WIB: %EC-5-BUNDLE: Interface Te1/52 joined port-channel Po1

During this condition we experienced intermittent traffic to and from the servers through this access switch.

C. Troubleshooting Action

1. Shut no shut interface Te1/49 on Access SW01 → failed
2. Reconfigure interface Te1/49 on Access SW01 and Eth2/27 on Dist SW01 → failed
3. Change SFP on Te1/49 Access SW01 with the new one → failed
4. Move the cable from Te1/49 to Te1/52 as shown in the pictures below → failed

before

after

5. Move the cable from Te1/49 to Te1/52 and Te1/52 to Te1/49 → Success

before

after

D. Summary

We opened a case with Cisco after doing steps 1-4, and Cisco suggested step 5 to isolate whether there was a potential hardware failure. Since the issue has not appeared again, Cisco does not have a root cause conclusion for it.

contributor,

Ananto Yudi, CCIE Service Provider #38962
nantoyudi@gmail.com