Solaris 11.3 Bug# 21207532 System or DB hang in zil_commit() at shutdown


The issue is generated due to ZFS bug present in version lower than 11.3.2.4.0, those symptoms produce that some errors related with IO wait are generated in the trace log file of the Oracle Data Base, shutting down the mmon process as you see below


Trace log example:

Thu Sep 01 08:51:11 2016
WARNING: aiowait timed out 2 times
Thu Sep 01 08:57:38 2016
minact-scn: got error during useg scan e:12751 usn:4
minact-scn: useg scan erroring out with error e:12751
Suspending MMON action 'Block Cleanout Optim, Undo Segment Scan' for 82800 seconds
Thu Sep 01 09:01:11 2016
WARNING: aiowait timed out 3 times
Thu Sep 01 09:07:20 2016
Suspending MMON action 'undo usage' for 82800 seconds
Thu Sep 01 09:11:11 2016
WARNING: aiowait timed out 4 times
Thu Sep 01 09:12:03 2016
Shutting down instance (immediate)
Stopping background process SMCO
Thu Sep 01 09:12:34 2016
Background process SMCO not dead after 30 seconds
Killing background process SMCO
Shutting down instance: further logons disabled


You can not shutdown normally and you will need to perform a shutdown abort and when you try to start the Oracle Data Base again, you will receive the same error in the screen


ORA-01102: cannot mount database in EXCLUSIVE mode 

SQL> startup
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.

Total System Global Area 4745003008 bytes
Fixed Size 2167200 bytes
Variable Size 2399145568 bytes
Database Buffers 2332033024 bytes
Redo Buffers 11657216 bytes
ORA-01102: cannot mount database in EXCLUSIVE mode  


When you check the server you can not detect any issue related with IO wait

r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 72.0 0.0 191.0 0.0 0.0 0.0 0.4 0 1 c1d3
0.0 67.0 0.0 309.5 0.0 0.0 0.0 0.2 0 0 c1d4
0.0 130.0 0.0 1452.5 0.0 0.0 0.0 0.2 0 1 c1d5
0.0 120.0 0.0 1413.0 0.0 0.0 0.0 0.3 0 1 c1d6
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 1.0 0.0 20.0 0.0 0.0 0.0 0.4 0 0 c1d4
0.0 1.0 0.0 20.0 0.0 0.0 0.0 0.3 0 0 c1d5
0.0 1.0 0.0 20.0 0.0 0.0 0.0 0.4 0 0 c1d6

The recommendation is take a coredump when the issue is in progress, for my experience it is not permit start the oracle database until the server is booted or you can  apply before   the SRU recommended since 11.3.2.4.0 or greater where the  issue is fixed.

Execute the following command to take the it (coredump), if you need to reboot because you can not apply the SRU at this moment do a "reboot -d"

#savecore -L      (Take the core information without reboot)

We posted for you previously how to update Solaris 11 in the link below, follow the steps in order to do the upgrade

http://unixaddiction.blogspot.com/2016/05/updating-solaris-11x-to-113-and-sru-to.html


 You can open a case in the MOS, in order to be analyzed the coredump generated, they should see something like this

pc: genunix:cv_wait+0x3c: call unix:swtch

void genunix:cv_wait+0x3c((kcondvar_t *)0xc4014b00def6, (kmutex_t *)0xc4014c071b80)
void zfs:zil_commit+0xd4((zilog_t *)0xc4014c071b80, (uint64_t), (uint64_t)0xf7b, (uint64_t)0xf7b)
int zfs:zfs_write+0x910((vnode_t *)0xc401666a8b40, (uio_t *)0x2a10d51b248, (int)0x40, (cred_t *)0xc40146364fd0, (caller_context_t *)0)
int genunix:fop_write+0x84((vnode_t *)0xc401666a8b40, (uio_t *)0x2a10d51b248, (int)0x40, (cred_t *)0xc40146364fd0, (caller_context_t *)0)
ssize_t genunix:pwrite+0x208((int), (void *), (size_t)0x100000?, (off_t))
unix:_syscall_no_proc_exit+0x58()
-- switch to user thread's user stack --

address translation failed for thread.t_prev: 8 bytes @ 0x2a10a995cf8

  1 matching thread found
  with function "zil_commit" in its stack


All values are 0 in the corresponding zl_itxg struct.

CAT(vmcore.1/11V)> sdump 0xc4014c071b80 zilog_t zl_itxg
  itxg_t [4] zl_itxg = [ {
  kmutex_t itxg_lock = {
  void *[1] _opaque = [ NULL ]
  }


SOLUTION OR WORKAROUND  (Doc ID 2116637.1)

Bug 21207532 is fixed in Solaris 11.3 SRU 2.4 (or later) 


Comments