Nexus 7000 SUP2E Compact Flash Failure Recovery

This article describes one procedure for recovering from a compact flash failure on a Cisco Nexus 7000 with a SUP2E supervisor. Cisco tracks the issue under bug ID CSCus22805 (CCO account required) in its bug documentation. Before walking through the procedure and the CLI output captured during the recovery, we summarize how the Cisco documentation explains the issue.

Background

According to the documentation, each N7K Supervisor 2/2E is equipped with two eUSB flash devices in a RAID1 configuration, one primary and one mirror. Together they provide non-volatile repositories for boot images, startup configuration and persistent application data. Over a period of months or years in service, one of these devices may be disconnected from the USB bus, causing the RAID software to drop the device from the configuration. The supervisor can still function normally with one of the two devices. However, when the second device also drops out of the array, the bootflash is remounted as read-only, meaning you cannot save configuration or files to the bootflash, and the standby cannot sync to the active in the event it is reloaded.

Symptoms

  • Compact flash diagnostic failure:
     N7K-SUP2E# show diagnostic result module 1
    
     Current bootup diagnostic level: complete
     Module 5: Supervisor module-2  (Standby)
    
             Test results: (. = Pass, F = Fail, I = Incomplete,
             U = Untested, A = Abort, E = Error disabled)
    
              1) ASICRegisterCheck-------------> .
              2) USB---------------------------> .
              3) NVRAM-------------------------> .
              4) RealTimeClock-----------------> .
              5) PrimaryBootROM----------------> .
              6) SecondaryBootROM--------------> .
              7) CompactFlash------------------> F  <=====
              8) ExternalCompactFlash----------> U
              9) PwrMgmtBus--------------------> U
             10) SpineControlBus---------------> .
             11) SystemMgmtBus-----------------> U
             12) StatusBus---------------------> U
             13) StandbyFabricLoopback---------> .
             14) ManagementPortLoopback--------> .
             15) EOBCPortLoopback--------------> .
             16) OBFL--------------------------> .
  • Unable to perform ‘copy run start’:
     N7K-SUP2E# copy running-config startup-config
     [########################################] 100%
     Configuration update aborted: request was aborted
  • eUSB becomes read-only or is non-responsive
  • ISSU failures, usually when trying to failover to the standby supervisor

Problem Analysis

To diagnose the current state of the compact flash cards, use two internal commands that Cisco provides in the documentation: show system internal raid | grep -A 1 "Current RAID status info" and show system internal file /proc/mdstat. If you have more than one supervisor, you can check a specific one by adding slot x before the internal command, where x is the slot position of the SUP2/2E. Note that because these commands are internal, you need to type them out in full; tab completion does not work for them. Below is the output of these internal commands in my case.

N7K-SUP2E# show system internal raid | grep -A 1 "Current RAID status info"
 Current RAID status info:
 RAID data from CMOS = 0xa5 0xc3

In this output, look at the second value, next to 0xa5, which here is 0xc3. This code tells you whether the primary compact flash, the secondary, or both have failed. The output above shows 0xc3, which means that both the primary and the secondary compact flash have failed. The reference table below lists the possible codes.

RAID Status Info   Description
0xf0               No failures reported
0xe1               Primary flash failed
0xd2               Alternate (or mirror) flash failed
0xc3               Both primary and alternate failed

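The lookup in the table above can be automated when checking many supervisors. The following is a minimal sketch in Python, assuming you have captured the CLI output as text; the function name decode_raid_status is hypothetical and not part of any Cisco tool, and the codes are copied from the table above.

```python
import re

# Status codes and meanings copied from the reference table above.
RAID_STATUS = {
    0xF0: "No failures reported",
    0xE1: "Primary flash failed",
    0xD2: "Alternate (or mirror) flash failed",
    0xC3: "Both primary and alternate failed",
}


def decode_raid_status(cli_output):
    """Return (code, description) from 'show system internal raid' text.

    Hypothetical helper: it reads the second hex value on the
    'RAID data from CMOS' line, as described in the article.
    """
    match = re.search(
        r"RAID data from CMOS\s*=\s*0x[0-9a-fA-F]+\s+(0x[0-9a-fA-F]+)",
        cli_output,
    )
    if match is None:
        raise ValueError("no 'RAID data from CMOS' line found")
    code = int(match.group(1), 16)
    return code, RAID_STATUS.get(code, "Unknown status code")


if __name__ == "__main__":
    sample = " Current RAID status info:\n RAID data from CMOS = 0xa5 0xc3\n"
    code, meaning = decode_raid_status(sample)
    print(hex(code), meaning)  # prints: 0xc3 Both primary and alternate failed
```

Codes that are not in the table are reported as unknown rather than guessed.
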
N7K-SUP2E# show system internal file /proc/mdstat
Personalities : [raid1]
md6 : active raid1 sdb6[2](F) sdc6[1]
      77888 blocks [2/1] [_U]
      
md5 : active raid1 sdb5[2](F) sdc5[1]
      78400 blocks [2/1] [_U]
      
md4 : active raid1 sdb4[2](F) sdc4[1]
      39424 blocks [2/1] [_U]
      
md3 : active raid1 sdb3[2](F) sdc3[1]
      1802240 blocks [2/1] [_U]

In this scenario the primary compact flash is not up: every array shows [_U]. A healthy output shows all arrays as [UU]. Below is a sample from the healthy compact flash on my secondary SUP2E.

N7K-SUP2E# slot 2 show system internal file /proc/mdstat
Personalities : [raid1] 
md6 : active raid1 sdc6[0] sdb6[1]
      77888 blocks [2/2] [UU]
      
md5 : active raid1 sdc5[0] sdb5[1]
      78400 blocks [2/2] [UU]
      
md4 : active raid1 sdc4[0] sdb4[1]
      39424 blocks [2/2] [UU]
      
md3 : active raid1 sdc3[0] sdb3[1]
      1802240 blocks [2/2] [UU]
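If you collect this output from several supervisors, the [UU] check can be scripted. Below is a small Python sketch, assuming the mdstat text has already been saved from the CLI; parse_mdstat and degraded_arrays are hypothetical names chosen for illustration.

```python
import re


def parse_mdstat(text):
    """Map each md array name to its member flags, e.g. 'UU' or '_U'."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line looks like: '77888 blocks [2/1] [_U]'
        status = re.search(r"\[\d+/\d+\]\s+\[([U_]+)\]", line)
        if status and current:
            arrays[current] = status.group(1)
            current = None
    return arrays


def degraded_arrays(text):
    """Return the sorted names of arrays with at least one missing member."""
    return sorted(name for name, flags in parse_mdstat(text).items() if "_" in flags)


if __name__ == "__main__":
    sample = (
        "Personalities : [raid1]\n"
        "md6 : active raid1 sdb6[2](F) sdc6[1]\n"
        "      77888 blocks [2/1] [_U]\n"
        "md3 : active raid1 sdc3[0] sdb3[1]\n"
        "      1802240 blocks [2/2] [UU]\n"
    )
    print(degraded_arrays(sample))  # prints: ['md6']
```

An empty list means every array reported [UU].
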

Scenarios

To help you determine which situation you are facing, Cisco assigns a letter to each scenario. Use the commands in the “Problem Analysis” section above to match your RAID status codes to a scenario letter below.

Single supervisor:

Scenario Letter   Active Supervisor   Active Supervisor Code
A                 1 Fail              0xe1 or 0xd2
B                 2 Fail              0xc3

Dual supervisor:

Scenario Letter   Active Supervisor   Standby Supervisor   Active Supervisor Code   Standby Supervisor Code
C                 0 Fail              1 Fail               0xf0                     0xe1 or 0xd2
D                 1 Fail              0 Fail               0xe1 or 0xd2             0xf0
E                 1 Fail              1 Fail               0xe1 or 0xd2             0xe1 or 0xd2
F                 2 Fail              0 Fail               0xc3                     0xf0
G                 0 Fail              2 Fail               0xf0                     0xc3
H                 2 Fail              1 Fail               0xc3                     0xe1 or 0xd2
I                 1 Fail              2 Fail               0xe1 or 0xd2             0xc3
J                 2 Fail              2 Fail               0xc3                     0xc3
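The two scenario tables can be expressed as a small lookup. The Python sketch below assumes the RAID status codes have already been read from each supervisor; scenario_letter is a hypothetical helper, and it returns None when no listed scenario applies (for example, when no flash has failed).

```python
# Number of failed flash devices implied by each RAID status code.
FAILED_COUNT = {0xF0: 0, 0xE1: 1, 0xD2: 1, 0xC3: 2}

# Single-supervisor scenarios, keyed by failures on the active supervisor.
SINGLE = {1: "A", 2: "B"}

# Dual-supervisor scenarios, keyed by (active failures, standby failures).
DUAL = {
    (0, 1): "C", (1, 0): "D", (1, 1): "E", (2, 0): "F",
    (0, 2): "G", (2, 1): "H", (1, 2): "I", (2, 2): "J",
}


def scenario_letter(active_code, standby_code=None):
    """Return the scenario letter, or None if no listed scenario applies."""
    active = FAILED_COUNT[active_code]
    if standby_code is None:
        return SINGLE.get(active)
    return DUAL.get((active, FAILED_COUNT[standby_code]))


if __name__ == "__main__":
    print(scenario_letter(0xC3, 0xF0))  # prints: F
    print(scenario_letter(0xE1))        # prints: A
```
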

Scenario F in the table above is the one this article covers; the rest of the article shows how we carried out the recovery for our client in this scenario.

Recovery Procedure

Cisco has published a procedure for every scenario listed in the document. In scenario F a non-impacting recovery is possible. Below is a summary of the procedure for scenario F:

  • Back up the running configuration for all VDCs externally. You can use the logging facility of your SSH terminal to capture the output of the “show running-config vdc-all” command.
  • Compare the running configuration (show running-config vdc-all) with the startup configuration (show startup-config vdc-all) and note any differences between the two.
  • Perform a supervisor switchover using “system switchover”.
  • The new standby supervisor will begin rebooting. During this time, add any missing configuration back to the new active.
  • The new standby should reach the “ha-standby” state. Use the “show module” command to verify this; alternatively, use “show redundancy status” and check that all states under “Other supervisor” are “HA standby”.
  • If the new standby comes up in a “powered-up” state, you will need to manually bring it back online. This can be done by issuing the following commands, where “x” is the standby module stuck in a “powered-up” state:
    (config)# out-of-service module x
    (config)# no poweroff module x
  • If the standby keeps getting stuck in the powered-up state and ultimately keeps power cycling after the steps above, this is likely due to the active reloading the standby for not coming up in time. To resolve this, configure the following, using “x” for the standby slot that is stuck in powered-up:
    (config)# system standby manual-boot
    (config)# reload module x force-dnld
  • Once the standby is back online in an “ha-standby” state, run the recovery tool to ensure that the recovery is complete. The tool can be downloaded at the following link:
    recovery tool
  • Once you have unzipped the recovery tool and uploaded it to the bootflash of the box, execute the following command: “load bootflash:n7000-s2-flash-recovery-tool.10.0.2.gbin”.
  • Check the recovery status with the “show system internal file /proc/mdstat” command.
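The comparison step in the list above can be done offline once both configurations have been captured to text. This is a minimal Python sketch using the standard difflib module; the function name unsaved_lines and the sample configuration are illustrative assumptions, not output from the device.

```python
import difflib


def unsaved_lines(startup_text, running_text):
    """Return lines present in the running config but not in the startup config."""
    diff = list(
        difflib.unified_diff(
            startup_text.splitlines(),
            running_text.splitlines(),
            fromfile="startup-config",
            tofile="running-config",
            lineterm="",
        )
    )
    # The first two entries are the '---' and '+++' file headers; skip them.
    return [line[1:] for line in diff[2:] if line.startswith("+")]


if __name__ == "__main__":
    # Illustrative sample configurations (hypothetical content).
    startup = "hostname N7K-SUP2E\nvlan 10\n"
    running = "hostname N7K-SUP2E\nvlan 10\nvlan 20\n"
    print(unsaved_lines(startup, running))  # prints: ['vlan 20']
```

Lines returned here are candidates to re-apply on the new active after the switchover; review them by hand, since ordering and context matter in a real configuration.
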

Procedure Output

OK, let’s move on to the execution. To avoid any confusion about supervisor status, I will name the supervisors as follows: Sup1Active means supervisor one in the active state, and Sup2Standby means supervisor two in the standby state. The state of each supervisor changes during the procedure, so keep this in mind.

Switchover Supervisor

On Sup1Active, perform the supervisor switchover. Sup1 will start to reboot and will become Sup1Standby.

N7K-SUP2E# system switchover 
N7K-SUP2E# 
User Access Verification
N7K-SUP2E login: 
>>>
>>>
>>>
NX7k SUP BIOS version ( 2.11 ) : Build - 01/09/2013 18:16:20
PM FPGA Version : 0x00000024 
Power sequence microcode revision - 0x00000009 : card type - 10156EEA0
Booting Spi Flash : Primary 
  CPU Signature - 0x000106e4: Version - 0x000106e0 
  CPU - 2 : Cores - 4 : HTEn - 1 : HT - 2 : Features - 0xbfebfbff 
  FSB Clk - 532 Mhz :  Freq - 2143 Mhz - 2128 Mhz 
  MicroCode Version : 0x00000002 
  Memory - 32768 MB : Frequency - 1067 MHZ 
  Loading Bootloader: Done 
  IO FPGA Version   : 0x1000d 
  PLX Version       : 861910b5
Bios digital signature verification - Passed
USB bootflash status : [1-1:0-0]
...

Below is the output from Sup2Active, previously Sup2Standby.

N7K-SUP2E(standby)# 2017 Apr 22 01:58:02  %$ VDC-1 %$ Apr 22 01:58:02 %KERN-2-SYSTEM_MSG: [18173381.026292] Switchover started by redundancy driver - kernel
2017 Apr 22 01:58:02  %$ VDC-1 %$ %SYSMGR-2-HASWITCHOVER_PRE_START: This supervisor is becoming active (pre-start phase).
2017 Apr 22 01:58:02  %$ VDC-1 %$ %SYSMGR-2-HASWITCHOVER_START: Supervisor 2 is becoming active.
2017 Apr 22 01:58:02  %$ VDC-1 %$ %SYSMGR-2-SWITCHOVER_OVER: Switchover completed.
N7K-SUP2E# show module
Mod  Ports  Module-Type                         Model              Status
---  -----  ----------------------------------- ------------------ ----------
1    0      Supervisor module-2                                    powered-up
2    0      Supervisor module-2                 N7K-SUP2E          active *
3    48     1000 Mbps Optical Ethernet XL Modul N7K-M148GS-11L     ok
4    24     10 Gbps Ethernet Module             N7K-M224XP-23L     ok
...

In my case, Sup1Standby was not able to come back online. If you see lines like the ones below during the bootup process, it is a sign that your Sup has failed to boot and will end up in switch boot mode.

...
RAID assembly failed. Stopping all RAID partitions...
Trying to mount bootflash /dev/sdd3...
mount: block device /dev/sdd3 is write-protected, mounting read-only
mount: wrong fs type, bad option, bad superblock on /dev/sdd3,
       or too many mounted file systems
/dev/sdd3 mount failed, trying /dev/sdc3...
/dev/sdc3: Input/output error
mount: block device /dev/sdc3 is write-protected, mounting read-only
/dev/sdc3: Input/output error
mount: /dev/sdc3 is not a valid block device
Cannot find any valid bootflash partitions.
....
switch(boot)#
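If you log the console session to a file, a simple text check can flag this failure. The Python sketch below only looks for two messages that appear in the output above; the function name boot_failed is hypothetical, and other failure modes may print different messages.

```python
# Messages taken from the failed boot output shown above.
FAILURE_MARKERS = (
    "RAID assembly failed",
    "Cannot find any valid bootflash partitions",
)


def boot_failed(console_text):
    """Return True if the console log contains a known bootflash failure message."""
    return any(marker in console_text for marker in FAILURE_MARKERS)


if __name__ == "__main__":
    bad = "RAID assembly failed. Stopping all RAID partitions...\nswitch(boot)#\n"
    good = "Trying to mount bootflash /dev/sdd3...\nMounted primary /dev/sdd3 as /bootflash\n"
    print(boot_failed(bad), boot_failed(good))  # prints: True False
```
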

Even in switch boot mode you are not able to load the kickstart image, since the Sup is not aware of any flash storage containing the kickstart and system images.

switch(boot)# dir 

Usage for bootflash: filesystem 
   98643968 bytes used
  320786432 bytes free
  419430400 bytes total

Hence, we need to move on to the next step of the procedure to bring Sup1Standby online. On Sup2Active, run the commands below.

N7K-SUP2E(config)# out-of-service module 1
N7K-SUP2E(config)# 2017 Apr 22 02:00:46  %$ VDC-1 %$ %PLATFORM-2-MOD_PWRDN: Module 1 powered down (Serial number )
2017 Apr 22 02:00:46 N7K-SUP2E-VDC-4 %$ VDC-4 %$ %PLATFORM-2-MOD_PWRDN: Module 1 powered down (Serial number )
2017 Apr 22 02:00:46 N7K-SUP2E-VDC-2 %$ VDC-2 %$ %PLATFORM-2-MOD_PWRDN: Module 1 powered down (Serial number )
2017 Apr 22 02:00:46 N7K-SUP2E-VDC-3 %$ VDC-3 %$ %PLATFORM-2-MOD_PWRDN: Module 1 powered down (Serial number )
N7K-SUP2E(config)# no poweroff module 1

From the Sup1Standby console, you will see it begin to boot up. If you see lines like the ones below during the bootup process, it is a sign that your Sup is in a good state.

...
Trying to mount bootflash /dev/sdd3...
Mounted primary /dev/sdd3 as /bootflash
Existing bootflash found, saving files...
Saving n7000-s2-dk9-npe.6.1.1.bin
Saving n7000-s2-dk9.6.1.2.bin
Saving n7000-s2-kickstart-npe.6.1.1.bin
Saving n7000-s2-kickstart.6.1.2.bin
Initializing the system...
Unmounting file systems...
Making partitions on physical devices...
Initializing RAID services...
Initializing startup-config and licenses...
mke2fs 1.35 (28-Feb-2004)
Checking for bad blocks (read-only test): done                        
mke2fs 1.35 (28-Feb-2004)
Checking for bad blocks (read-only test): done                        
Formatting PSS:
mke2fs 1.35 (28-Feb-2004)
Checking for bad blocks (read-only test): done                        
Formatting bootflash...
mke2fs 1.35 (28-Feb-2004)
Checking for bad blocks (read-only test): done                        
Fri Jan 3 19:04:29 2017: RAIDMON: Data(0x0) provided saved successfully to CMOS
Initialization completed - No reinit of CMOS/NVRAM
Copying saved files back to bootflash...
Checking obfl filesystem.
Checking all filesystems..... done.
Warning: switch is starting up with default configuration
Loading system software
/bootflash//n7000-s2-dk9.6.1.2.bin read done
System image digital signature verification successful.
Uncompressing system image: bootflash:/n7000-s2-dk9.6.1.2.bin Fri Jan 3 19:06:12 UTC 2017
blogger: nothing to do.

..done Fri Jan 3 19:06:15 UTC 2017
Load plugins that defined in image conf: /isan/plugin_img/img.conf
Loading plugin 0: core_plugin...
num srgs 1
0: swid-core-sup2dc3, swid-core-sup2dc3
num srgs 1
0: swid-sup2dc3-ks, swid-sup2dc3-ks
INIT: Entering runlevel: 3



User Access Verification
N7K-SUP2E(standby) login:

We now need to wait until Sup1Standby reaches the “ha-standby” state. In this situation we prefer the “show redundancy status” command over “show module” on Sup2Active, because it lets us follow the progress of Sup1Standby until it reaches “ha-standby”.

N7K-SUP2E# show redundancy status 
Redundancy mode
---------------
      administrative:   HA
         operational:   None

This supervisor (sup-2)
-----------------------
    Redundancy state:   Active
    Supervisor state:   Active
      Internal state:   Active with HA standby

Other supervisor (sup-1)
------------------------
    Redundancy state:   Standby

    Supervisor state:   Unknown
      Internal state:   Other
...
N7K-SUP2E# show redundancy status 
Redundancy mode
---------------
      administrative:   HA
         operational:   None

This supervisor (sup-2)
-----------------------
    Redundancy state:   Active
    Supervisor state:   Active
      Internal state:   Active with HA standby

Other supervisor (sup-1)
------------------------
    Redundancy state:   Standby

    Supervisor state:   HA standby
      Internal state:   HA synchronization in progress
...
N7K-SUP2E# show redundancy status 
Redundancy mode
---------------
      administrative:   HA
         operational:   HA

This supervisor (sup-2)
-----------------------
    Redundancy state:   Active
    Supervisor state:   Active
      Internal state:   Active with HA standby

Other supervisor (sup-1)
------------------------
    Redundancy state:   Standby

    Supervisor state:   HA standby
      Internal state:   HA standby
...
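The three outputs above show the “Other supervisor” fields moving toward “HA standby”. As an illustration, the Python sketch below extracts those fields from one captured output and reports whether the standby is ready; other_supervisor_state and standby_ready are hypothetical names, and the check assumes the field labels shown above.

```python
def other_supervisor_state(text):
    """Return the fields listed under the 'Other supervisor' heading as a dict."""
    fields = {}
    in_other = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Other supervisor"):
            in_other = True
            continue
        if stripped.startswith(("This supervisor", "Redundancy mode")):
            in_other = False
            continue
        if in_other and ":" in stripped:
            key, _, value = stripped.partition(":")
            fields[key.strip()] = value.strip()
    return fields


def standby_ready(text):
    """True when both supervisor state and internal state read 'HA standby'."""
    state = other_supervisor_state(text)
    return (
        state.get("Supervisor state") == "HA standby"
        and state.get("Internal state") == "HA standby"
    )


if __name__ == "__main__":
    sample = (
        "Other supervisor (sup-1)\n"
        "------------------------\n"
        "    Redundancy state:   Standby\n"
        "\n"
        "    Supervisor state:   HA standby\n"
        "      Internal state:   HA synchronization in progress\n"
    )
    print(standby_ready(sample))  # prints: False
```
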

Sup1Standby is the problematic Sup with the flash failure. After the login prompt appears, log in to Sup1Standby and execute “show system internal file /proc/mdstat” to see the recovery progress on this Sup. (We don’t need to load the recovery tool on Sup1Standby; the reload procedure recovers its flash automatically.)

N7K-SUP2E(standby)#  show system internal file /proc/mdstat
Personalities : [raid1] 
md6 : active raid1 sdd6[2] sdc6[1]
      77888 blocks [2/1] [_U]
        resync=DELAYED
      
md5 : active raid1 sdd5[2] sdc5[1]
      78400 blocks [2/1] [_U]
        resync=DELAYED
      
md4 : active raid1 sdd4[2] sdc4[1]
      39424 blocks [2/1] [_U]
        resync=DELAYED
      
md3 : active raid1 sdd3[2] sdc3[1]
      1802240 blocks [2/1] [_U]
      [=========>...........]  recovery = 45.4% (819648/1802240) finish=1.2min speed=13142K/sec

Repeat the command above until you see a result like the one below; when you do, your Sup1Standby is ready.

N7K-SUP2E(standby)#  show system internal file /proc/mdstat
Personalities : [raid1] 
md6 : active raid1 sdd6[0] sdc6[1]
      77888 blocks [2/2] [UU]
      
md5 : active raid1 sdd5[0] sdc5[1]
      78400 blocks [2/2] [UU]
      
md4 : active raid1 sdd4[0] sdc4[1]
      39424 blocks [2/2] [UU]
      
md3 : active raid1 sdd3[0] sdc3[1]
      1802240 blocks [2/2] [UU]
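While the arrays are rebuilding, the progress line in the mdstat output carries the percentage and the estimated time left. Here is a hedged Python sketch that pulls both numbers out of captured text; recovery_progress is a hypothetical helper, and it tolerates the terminal wrapping the line in the middle.

```python
import re


def recovery_progress(text):
    """Return (percent_done, minutes_left) from mdstat text, or None if idle."""
    match = re.search(
        r"recovery\s*=\s*([\d.]+)%.*?finish=([\d.]+)min", text, re.DOTALL
    )
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    sample = (
        "md3 : active raid1 sdd3[2] sdc3[1]\n"
        "      1802240 blocks [2/1] [_U]\n"
        "      [=========>...........]  recovery = 45.4% (819648/1802240) finish=1.2min s\n"
        "peed=13142K/sec\n"
    )
    print(recovery_progress(sample))  # prints: (45.4, 1.2)
```

A return value of None means no recovery line was found, which is what you expect once all arrays show [UU].
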

Execute Recovery Tool

Since we are following the procedure for scenario F, it is not necessary to execute the recovery tool on Sup2Active, because Sup1Standby is the only problematic Sup with a flash failure. In our case, however, even though the RAID status info showed 0xf0 after the supervisor switchover, we found that the RAID arrays on Sup2Active were not in the [UU] state. You can already save the configuration to startup at this stage.

N7K-SUP2E# show system internal raid 
Current RAID status info:
RAID data from CMOS = 0xa5 0xf0
RAID data from driver disks 0 bad 0 name 
Bootflash: /dev/sdc
Mirrorflash: /dev/sdd

Current RAID status:
Personalities : [raid1] 
md6 : active raid1 sdc6[0]
      77888 blocks [2/1] [U_]
      
md5 : active raid1 sdc5[0]
      78400 blocks [2/1] [U_]
      
md4 : active raid1 sdc4[0]
      39424 blocks [2/1] [U_]
      
md3 : active raid1 sdc3[0]
      1802240 blocks [2/1] [U_]

Hence we need to execute the recovery tool. When you execute the tool, it automatically copies itself to the standby Sup if you have a redundant Sup. Notice in the output that, since Sup1Standby has already recovered, the tool does not attempt any recovery action on it. Execute the command below to run the tool.

N7K-SUP2E# load bootflash:n7000-s2-flash-recovery-tool.10.0.2.gbin
Loading plugin version 10.0(2)
###############################################################
  Warning: debug-plugin is for engineering internal use only!
  For security reason, plugin image has been deleted.
###############################################################
INFO: Running on active slot 2, checking if a ha-standby is available...
INFO: Standby present in slot 1. Copying the recovery tool...
###############################################################
  Warning: debug-plugin is for engineering internal use only!
  For security reason, plugin image has been deleted.
###############################################################
INFO: Running on the standby in slot 1, Checking RAID status...
INFO: Both disks are found to be healthy.
INFO: Verifying RAID configuration. Got primary=sdb Secondary=sdd
INFO: RAID device md3 is healthy.
INFO: RAID device md4 is healthy.
INFO: RAID device md5 is healthy.
INFO: RAID device md6 is healthy.
INFO: No recovery was attempted on module 1. All flashes left intact.
INFO: A detailed copy of the this log was saved as volatile:flash_repair_log_mod1.tgz.
INFO: Recovery procedures complete on module 1.
INFO: Please check for any errors in previous messages.
INFO: Run 'show system internal file /proc/mdstat' and check 'up status' [UU] for all disks.
INFO: Run 'show diagnostic result module ' on all available supervisor slots.
INFO: And restart CompactFlash test (7) instances if not in running state.
Loading plugin version 10.0(2)
INFO: Now starting the flash recovery procedures on active.
INFO: Primary=sdc(sdc) Secondary=sdd(sdd) Working=sdc
WARNING: Attempting recovery of secondary device sdd
INFO: Removing /dev/sdd from RAID configuration...
INFO: Resetting secondary flash...
INFO: Found secondary device sdd in 9 seconds.
INFO: Running health checks on the recovered device /dev/sdd...
INFO: Basic I/O tests passed. /dev/sdd looks healthy and responsive.
INFO: Verifying RAID configuration. Got primary=sdc Secondary=sdd
INFO: sdc3 is already a part of md3.
INFO: Adding sdd3 back into md3 RAID configuration...
INFO: sdc4 is already a part of md4.
INFO: Adding sdd4 back into md4 RAID configuration...
INFO: sdc5 is already a part of md5.
INFO: Adding sdd5 back into md5 RAID configuration...
INFO: sdc6 is already a part of md6.
INFO: Adding sdd6 back into md6 RAID configuration...
INFO: Resetting RAID status in CMOS...
WARNING: Flash recovery attempted on module 2.
INFO: A detailed copy of the this log was saved as volatile:flash_repair_log_mod2.tgz.
INFO: Recovery procedures complete on module 2.
INFO: Please check for any errors in previous messages.
INFO: Run 'show system internal file /proc/mdstat' and check 'up status' [UU] for all disks.
INFO: Run 'show diagnostic result module ' on all available supervisor slots.
INFO: And restart CompactFlash test (7) instances if not in running state.
N7K-SUP2E# show system internal file /proc/mdstat
Personalities : [raid1] 
md6 : active raid1 sdd6[2] sdc6[0]
      77888 blocks [2/1] [U_]
        resync=DELAYED
      
md5 : active raid1 sdd5[2] sdc5[0]
      78400 blocks [2/1] [U_]
        resync=DELAYED
      
md4 : active raid1 sdd4[2] sdc4[0]
      39424 blocks [2/1] [U_]
        resync=DELAYED
      
md3 : active raid1 sdd3[2] sdc3[0]
      1802240 blocks [2/1] [U_]
      [==>..................]  recovery = 14.7% (265984/1802240) finish=2.0min speed=12665K/sec
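The tool asks you to check its messages for errors. If the session was logged to a file, a short script can summarize what the tool did per module. The Python sketch below relies only on message wording that appears in the output above; summarize_recovery_log is a hypothetical name, and other tool versions may word their messages differently.

```python
import re


def summarize_recovery_log(text):
    """Collect WARNING lines and the modules the tool did or did not touch."""
    lines = [line.strip() for line in text.splitlines()]
    return {
        "warnings": [line for line in lines if line.startswith("WARNING:")],
        "attempted": [
            int(m) for m in re.findall(r"Flash recovery attempted on module (\d+)", text)
        ],
        "untouched": [
            int(m) for m in re.findall(r"No recovery was attempted on module (\d+)", text)
        ],
    }


if __name__ == "__main__":
    sample = (
        "INFO: No recovery was attempted on module 1. All flashes left intact.\n"
        "WARNING: Attempting recovery of secondary device sdd\n"
        "WARNING: Flash recovery attempted on module 2.\n"
    )
    summary = summarize_recovery_log(sample)
    print(summary["attempted"], summary["untouched"])  # prints: [2] [1]
```
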

Wait until all arrays have recovered. All your flashes are now working.

N7K-SUP2E# show diagnostic result module 2

Current bootup diagnostic level: complete
Module 2: Supervisor module-2  (Active)

        Test results: (. = Pass, F = Fail, I = Incomplete,
        U = Untested, A = Abort, E = Error disabled)

         1) ASICRegisterCheck-------------> .
         2) USB---------------------------> .
         3) NVRAM-------------------------> .
         4) RealTimeClock-----------------> .
         5) PrimaryBootROM----------------> .
         6) SecondaryBootROM--------------> .
         7) CompactFlash------------------> .
         8) ExternalCompactFlash----------> U
         9) PwrMgmtBus--------------------> .
        10) SpineControlBus---------------> .
        11) SystemMgmtBus-----------------> .
        12) StatusBus---------------------> .
        13) StandbyFabricLoopback---------> .
        14) ManagementPortLoopback--------> .
        15) EOBCPortLoopback--------------> .
        16) OBFL--------------------------> .
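To confirm that no test is still failing across several supervisors, the diagnostic result can be parsed from captured text. The following Python sketch assumes the line layout shown above; failed_tests is a hypothetical helper name.

```python
import re


def failed_tests(text):
    """Return the names of diagnostic tests marked 'F' (Fail)."""
    results = re.findall(
        r"^\s*\d+\)\s+(\w+?)-+>\s+([.FIUAE])", text, re.MULTILINE
    )
    return [name for name, flag in results if flag == "F"]


if __name__ == "__main__":
    sample = (
        "          6) SecondaryBootROM--------------> .\n"
        "          7) CompactFlash------------------> F  <=====\n"
        "          8) ExternalCompactFlash----------> U\n"
    )
    print(failed_tests(sample))  # prints: ['CompactFlash']
```
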

Don’t forget to save the configuration of all VDCs to the startup config.

N7K-SUP2E# copy running-config startup-config vdc-all
[########################################] 100%
Copy complete.

Source:

Nexus 7000 Supervisor 2/2E Compact Flash Failure Recovery

Contributor:
Muhammad Benny
Network Engineer 

Dirga Bramantyo
Network Engineer - CCNP

Ananto Yudi Hendrawan
Network Engineer - CCIE Service Provider #38962, RHCE, VCP6-DCV
nantoyudi@gmail.com