Kbase P6363: Batch jobs kill -9 causing microtransactions to go bad
Author: Progress Software Corporation - Progress
Access: Public
Published: 1/9/2006
Status: Verified
FACT(s) (Environment):
UNIX
SYMPTOM(s):
Batch process interrupted with UNIX kill -9
SYSTEM ERROR: User died during microtransaction. (2256)
SYSTEM ERROR: rmundochg: size <rec-size>, expected <rec-size> (5543)
Error messages occur ONLY for cron jobs
CPU ~100% usage when error messages appear
CHANGE:
A time window for batch processes to run was introduced. If this time window is exceeded, a routine is called to kill the procedure, because overrunning batch jobs were known to slow down other messages that need processing. Such delays could halt business processes in the 24/6 operation, and factory downtime, for whatever reason, is expensive.
CAUSE:
The database corruption results from a microtransaction that has 'gone bad' because of an abnormal disconnection from the database. Under 'normal' conditions, "It is abnormal and unusual (not an application error) for a microtransaction not to be completed successfully." Forcibly disconnecting batch clients, however, puts the 'corruption' in a very different light. UNIX does not allow any program to detect (catch) a kill -9 signal, so when kill -9 is used to interrupt the batch process, the Progress server process cannot detect what is happening when the signal is received.
For example, this error was caused by issuing a kill -9 in the following script:
check+kill)
    prog=$2
    LOGF=log/check+kill.log
    # PIDs of processes matching $prog, excluding grep and this script
    processes=`ps -ef | grep "$prog" | grep -v grep |\
        grep -v check+kill | awk '{print $2}'`
    if [ ! "Z$processes" = "Z" ] ; then
        ps -ef | grep "$prog" | grep -v grep | grep -v check+kill >>$LOGF
        # First attempt, without -9
        kill_named_proc $prog "check+kill" >>$LOGF
        sleep 5
        # Second attempt, only 5 seconds later, with kill -9
        kill_named_proc $prog "check+kill" -9 >>$LOGF
        sleep 20
        processes=`ps -ef | grep "$prog" | grep -v grep | grep -v check+kill`
        echo "Proc = " $processes > tmp/stat.err
        if [ "Z$processes" = "Z" ] ; then
            echo "`$CTIME` param=$prog All killed" >>$LOGF
            rm -f tmp/stat.err
        fi
    fi
    ;;
The timestamp of the 2256 error exactly matched the time at which the kill -9 was issued.
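The statement above that a process cannot detect kill -9 can be illustrated with a small generic shell experiment. It is only a sketch and does not involve Progress at all: the function name demo_signal and the marker file are hypothetical, and the child is a plain shell loop standing in for a batch client. The child installs a handler for SIGTERM that records a clean-up; the handler runs when SIGTERM is sent, but never gets a chance to run when SIGKILL (kill -9) is sent.

```shell
#!/bin/sh
# Hypothetical demo: a child process that tries to clean up when it is signalled.
demo_signal() {
    sig=$1                      # signal name to send: TERM or KILL
    marker=$(mktemp)            # the child writes "cleaned" here if its handler runs
    sh -c 'trap "echo cleaned > \"$0\"; exit 0" TERM
           while :; do sleep 1; done' "$marker" >/dev/null 2>&1 &
    pid=$!
    sleep 1                     # give the child time to install its trap
    kill -s "$sig" "$pid"
    wait "$pid" 2>/dev/null     # wait for the child to terminate
    result=$(cat "$marker")
    rm -f "$marker"
    echo "${result:-no cleanup}"
}

term_result=$(demo_signal TERM)   # handler runs, marker is written
kill_result=$(demo_signal KILL)   # SIGKILL cannot be trapped, marker stays empty
echo "TERM: $term_result"
echo "KILL: $kill_result"
```

With SIGKILL the child is removed by the kernel without running any of its own code, which is why a client interrupted this way cannot finish or back out whatever it was doing.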
FIX:
Do NOT use kill -9. This is a drastic measure, and all other avenues should be explored first.
The supported way of interrupting a process while the database is online is either:
proshut dbname -> option 1: Disconnect a User
or
proshut dbname -C disconnect {userid}
WHERE:
userid = the _connect-usr value of the connection whose _connect-pid is the process ID of the client to be disconnected
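As a minimal sketch, the proshut -C disconnect form above could replace the kill call in a watchdog script along the following lines. The function name disconnect_user, the DRY_RUN switch and the database name sports are assumptions made for illustration; the user number must still be looked up separately (for example from _connect-usr, as described above). By default the sketch only prints the command it would run.

```shell
#!/bin/sh
# Sketch: disconnect a batch client by database user number instead of kill -9.
# DRY_RUN=1 (the default here) only prints the command; set DRY_RUN=0 to run it.
DRY_RUN=${DRY_RUN:-1}

disconnect_user() {
    db=$1        # database name, as passed to proshut
    usrnum=$2    # user number (_connect-usr) of the client to disconnect
    case $usrnum in
        ''|*[!0-9]*)
            echo "usage: disconnect_user dbname usernum" >&2
            return 2
            ;;
    esac
    if [ "$DRY_RUN" = 1 ]; then
        echo "proshut $db -C disconnect $usrnum"
    else
        proshut "$db" -C disconnect "$usrnum"
    fi
}

# Hypothetical example: user number 24 on a database called "sports".
disconnect_user sports 24
```

Rejecting a non-numeric user number before calling proshut keeps a scripting mistake, such as passing a PID string with trailing text, from reaching the database utility.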
If a process must be killed, Progress Solution P14679, "Guidelines on the use of UNIX kill command to stop a process", describes the sequence in which the kill signals should be issued.
In this particular instance, the batch code was investigated. As an initial workaround, the "sleep" was increased to 60 seconds, which decreased the need to stop the cron job. In addition, batch clients were started with a lower priority (using UNIX 'nice') to prevent them from taking too much CPU capacity, for example with nice -19:
xbpro_nice()
{
    # nice -19 is the traditional form of nice -n 19 (lowest scheduling priority)
    nice -19 _progres -p $PROG -param "$HOST" -b $DBS $CPSTREAM $DATE $STDOUT
}
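The longer grace period mentioned above can also be expressed as a polling loop rather than a fixed sleep, so that the script continues as soon as the client has gone. The following is only a sketch under stated assumptions: wait_for_exit and GRACE are hypothetical names, and a plain sleep 300 stands in for the batch client. If the process is still present after the grace period, the sketch reports it and leaves escalation to a supported disconnect rather than kill -9.

```shell
#!/bin/sh
# Wait up to $2 seconds for process $1 to exit.
# Returns 0 if the process has gone, 1 if it is still present after the timeout.
wait_for_exit() {
    pid=$1
    timeout=$2
    elapsed=0
    while kill -0 "$pid" 2>/dev/null; do      # kill -0 only tests for existence
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

GRACE=60                 # grace period in seconds, matching the workaround above

sleep 300 &              # stand-in for a batch client
batch_pid=$!
kill "$batch_pid"        # default signal (SIGTERM), which a process can handle
if wait_for_exit "$batch_pid" "$GRACE"; then
    status="stopped"
else
    status="still running"   # escalate to a supported disconnect, not kill -9
fi
echo "batch client $status"
```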