Saturday, December 11, 2010

hangcheck_timer module for monitoring Linux kernel hangs

Make sure hangcheck_timer is running on all cluster nodes.

crs.oracle.pind22>/sbin/lsmod |grep -i hang
Module Size Used by
hangcheck_timer 37465 0
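
The same check can be scripted so that it returns a usable exit status, which is handy when looping over several nodes. This is only a minimal sketch: the function name hangcheck_loaded is made up for illustration, and it assumes the usual lsmod layout with the module name in the first column.

```shell
#!/bin/sh
# hangcheck_loaded: hypothetical helper, not part of any Oracle or kernel tooling.
# Reads lsmod-style output on stdin and succeeds (exit 0) only if a line
# whose first column is exactly "hangcheck_timer" is present.
hangcheck_loaded() {
    awk '$1 == "hangcheck_timer" { found = 1 } END { exit (found ? 0 : 1) }'
}

# Usage on a node (assumes lsmod lives in /sbin, as in the listing above):
#   /sbin/lsmod | hangcheck_loaded && echo "hangcheck_timer is loaded"
```

Matching on the first column exactly avoids false hits from other modules whose names merely contain "hang".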


If necessary, add an entry to /etc/rc.local on all nodes so that the hangcheck_timer module is loaded at boot:
crs.oracle.pind22>vi /etc/rc.local

modprobe hangcheck-timer hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1

hangcheck_tick: defines how often, in seconds, the hangcheck_timer module checks the node for hangs. The default value is 60 seconds.

hangcheck_margin: defines how long, in seconds, the timer waits for a response from the kernel. The default value is 180 seconds.

hangcheck_reboot: determines whether the hangcheck_timer module restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If hangcheck_reboot >= 1, the module restarts the system. If hangcheck_reboot = 0, the module does not reboot the node even if a hang is detected.

For 10g, make sure that the cluster (CSS) misscount is greater than the sum of hangcheck_tick and hangcheck_margin.
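
The rule above is simple arithmetic, and it can be sketched as a small shell check. The function name check_misscount is hypothetical, and the misscount of 60 used in the example call is only an assumed placeholder; take the real value from your own cluster (on 10gR2 and later, crsctl get css misscount should report it).

```shell
#!/bin/sh
# check_misscount: hypothetical helper illustrating the rule
#   misscount > hangcheck_tick + hangcheck_margin
# Arguments: tick margin misscount (all in seconds).
check_misscount() {
    tick=$1; margin=$2; misscount=$3
    timeout=$((tick + margin))   # worst-case time before hangcheck_timer acts
    if [ "$misscount" -gt "$timeout" ]; then
        echo "OK: misscount $misscount > tick+margin $timeout"
        return 0
    else
        echo "WARNING: misscount $misscount <= tick+margin $timeout"
        return 1
    fi
}

# tick=1 and margin=10 come from the modprobe line in this post;
# the misscount of 60 is an assumed placeholder, not a measured value.
check_misscount 1 10 60
```

With the modprobe settings from this post the sum is 11 seconds, so any misscount of 12 seconds or more passes the check.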

REF:
Hangcheck-timer module requirements for Oracle9i, 10g, and 11g RAC on Linux.
