Monday, November 26, 2012

3par V-class arrays, Code 37 reset and data corruption

Last week one of the nodes in our 3par V800 reset with a Code 37, and a few seconds before the node reset Oracle started complaining about corrupt blocks. While digging into the issue, we learned that there is a known hardware problem on the V-class arrays: it stems from the PCI-E interface chipset used on the system board and the fibre channel cards.

We were told that we were the only customer to see data corruption with a Code 37 reset, but your mileage may vary. If you've had similar problems, I'd love to hear about it.

The following output from showeeprom shows a bad board and a good board:

Node: 5
--------
      Board revision: 0920-200009.A3
            Assembly: SAN 2012/03 Serial 3978
       System serial: 1405629
        BIOS version: 2.9.8
          OS version: 3.1.1.342
        Reset reason: PCI_RESET
           Last boot: 2012-11-17 20:51:43 EST
   Last cluster join: 2012-11-17 20:52:25 EST
          Last panic: 2012-03-23 08:56:46 EDT
  Last panic request: Never
   Error ignore code: 00
         SMI context: 00
       Last HBA mode: 2a100700
          BIOS state: 80 ff 24 27 28 29 2a 2c
           TPD state: 34 40 ff 2a 2c 2e 30 32
Code 27 (Temp/Voltage Failure) - Subcode 0x3 (1)        2012-11-17 20:47:34 EST
Code 31 (GPIO Failure) - Subcode 0x3 (1)                2012-11-17 20:43:45 EST
Code 37 (GEvent Triggered) - Subcode 0x80002001 (0)     2012-11-17 20:41:43 EST
Code 27 (Temp/Voltage Failure) - Subcode 0x3 (1)        2012-04-05 15:33:01 EDT
Code 27 (Temp/Voltage Failure) - Subcode 0x3 (1)        2012-03-26 17:59:21 EDT
Code 38 (Power Supply Failure) - Subcode 0x13 (0)       2012-03-26 17:06:41 EDT

I'm told that boards with revision D2 contain the fix for this issue:

Node: 0
--------
      Board revision: 0920-200009.D2
            Assembly: SAN 2012/38 Serial 6349
       System serial: 1405629
        BIOS version: 2.9.8
          OS version: 3.1.1.342
        Reset reason: ALIVE_L
           Last boot: 2012-11-23 16:44:12 EST
   Last cluster join: 2012-11-23 16:44:47 EST
          Last panic: 2012-10-23 21:30:25 EDT
  Last panic request: Never
   Error ignore code: 00
         SMI context: 00
       Last HBA mode: 2a100700
          BIOS state: 80 ff 24 27 28 29 2a 2c
           TPD state: 34 40 ff 2a 2c 2e 30 32
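
If you want to check your own nodes, the board revision is easy to pull out of the same command. Here's a rough sketch of how I'd audit an array for old boards; the host name, user, and the assumption that showeeprom with no arguments reports every node are mine, so adjust for your environment:

 # Hypothetical: SSH into the array's CLI (adjust user/host for your setup)
 # and pull just the node and board revision lines from the EEPROM report.
 $ ssh 3paradm@v800 showeeprom | egrep 'Node:|Board revision'
 # Boards still at an A-series revision (e.g. 0920-200009.A3) have the
 # problem PCI-E chipset; D2 boards contain the fix.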

Linux multipath (device-mapper) optimizations with 3par storage

Earlier this year we bought a 6-node 3par V800 array and deployed it with Oracle RAC clusters running OEL 6.x.

We discovered that 3par's default multipath.conf configuration would yield a 30-second I/O stall whenever we failed one of the paths.
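
The 30 seconds lines up with the FC transport's default dev_loss_tmo on our hosts. If you want to see what your HBAs are currently using, the value is exposed through sysfs (standard scsi_transport_fc attributes; the rport names will differ on your systems):

 # Print the device-loss timeout currently applied to each FC remote port;
 # a value of 30 here matches the 30-second stall we observed on path failure.
 $ grep . /sys/class/fc_remote_ports/rport-*/dev_loss_tmo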

Eventually we were able to get support to recommend adding "dev_loss_tmo 1" to multipath.conf, like so:

defaults {
    user_friendly_names yes
    polling_interval    5
    dev_loss_tmo        1
}

With that in place, we only observed a 1-second I/O stall during a path failure.
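
If you need to pick up the change without a reboot, something like the following should work on OEL 6.x; the reload command and sysfs check are my own notes rather than something 3par support provided:

 # Re-read multipath.conf and reconfigure the existing multipath maps.
 $ multipathd -k"reconfigure"
 # Each FC remote port backing a multipath device should now report 1.
 $ grep . /sys/class/fc_remote_ports/rport-*/dev_loss_tmo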