Kbase P74287: What could prevent agents from starting while the broker is running?
Author: Progress Software Corporation - Progress
Access: Public
Published: 3/24/2009
Status: Verified

GOAL:

What could prevent agents from starting while the broker is running?

GOAL:

I can successfully start the broker with the initial number of agents set to 0, but when I try to add agents, the broker starts failing. Why?

GOAL:

What steps do the agent and the broker go through when an agent tries to start?

GOAL:

What tool can I use when troubleshooting agent startup problems?

FACT(s) (Environment):

WebSpeed 3.x

FIX:

npp_poll is a communication-layer function that handles TCP/IP connections. The WebSpeed broker and agent go through this function when communicating. The agent first checks the return value from npp_poll. If the return value is not -1, the agent also checks that the broker process's PID is valid. If that check fails (for example, the PID is invalid), the agent dies without writing the (6401) message to the log, and the broker stays up.

To verify that the broker is still running, the agent calls kill() on the broker PID with a signal of 0, i.e. kill(pid, 0). kill() with a signal of 0 does not actually send a signal to the process.
From the kill man page:
"If sig is 0 (the null signal), error checking is performed but no signal is actually sent. This can be used to check the validity of pid."
The result of this kill() call is used to determine whether the broker is still alive. If the call indicates that the broker is not alive, the agent shuts itself down, and no messages are written to the log.
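The null-signal check described above can be sketched in C as follows. This is only an illustration under stated assumptions, not the agent's actual code: the names probe_result, probe_pid and probe_name are hypothetical. The errno values are the ones POSIX documents for kill(): ESRCH means no such process exists, and EPERM means the process exists but the caller is not permitted to signal it. Whether the agent distinguishes these two cases is not stated in this article.

```c
#include <errno.h>      /* errno, ESRCH, EPERM */
#include <signal.h>     /* kill() */
#include <sys/types.h>  /* pid_t */
#include <unistd.h>     /* getpid(), handy when trying the helper out */

/* Hypothetical result codes for this sketch; not part of WebSpeed. */
enum probe_result {
    PROBE_ALIVE,            /* kill(pid, 0) returned 0 */
    PROBE_NO_SUCH_PROCESS,  /* errno == ESRCH */
    PROBE_NO_PERMISSION,    /* errno == EPERM: process exists, caller may not signal it */
    PROBE_OTHER_ERROR       /* any other errno */
};

/* Probe a process with the null signal and classify the outcome. */
enum probe_result probe_pid(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return PROBE_ALIVE;

    switch (errno) {
    case ESRCH: return PROBE_NO_SUCH_PROCESS;
    case EPERM: return PROBE_NO_PERMISSION;
    default:    return PROBE_OTHER_ERROR;
    }
}

/* Human-readable label for a probe result. */
const char *probe_name(enum probe_result r)
{
    switch (r) {
    case PROBE_ALIVE:           return "alive";
    case PROBE_NO_SUCH_PROCESS: return "no such process";
    case PROBE_NO_PERMISSION:   return "exists, but permission denied";
    default:                    return "other error";
    }
}
```

For example, probe_pid(getpid()) probes the calling process itself. A caller that only tests "did kill() return 0?" treats the permission-denied case the same as a missing process, so the errno value is worth recording when diagnosing a failed check.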

The small program below performs the same check that the agent performs to determine whether the broker is alive. The idea is to have the broker launch this program in place of the agent, start the broker, and look at the output files the program generates.

Steps to run the debug program:

- Build the program and put it in the working directory:

Save the program as chkpid.c in $WRKDIR
cc -c chkpid.c
cc -o chkpid chkpid.o

- Add the following line to the broker's entry in the ubroker.properties file:

srvrExecFile=$DLC/bin/jvmStart $WRKDIR/chkpid

This tells the broker to start an agent using the chkpid executable, instead of the normal $DLC/bin/_progres executable.

Note: chkpid is the name of the compiled executable. If you give it a different name, use that name here.

- For the purposes of this exercise, set initialSrvrInstance=1. Only one agent needs to start to reproduce the problem.

- Try to start the broker. The broker will start and then remain in a starting state while it waits for the "agent" to start. Because chkpid is being run in place of the agent, the "agent" never starts: chkpid finishes executing and terminates. The broker will eventually shut down on its own. This is expected behavior.

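- When chkpid runs, it creates a file called "out<pid>.txt" in the working directory, where <pid> is the PID of the chkpid ("agent") process, zero-padded to at least five digits. These files are short, about 2 lines long: the first line reports the broker PID that was passed in, and the second reports the return value of kill() and errno. The output should help determine what the problem is.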
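The second line of each output file has the form "Return from kill is <n>, errno is <n>". As a hedged aid for reading it, the sketch below parses that line and turns the errno number into text with strerror(). The helper names parse_kill_line and describe_kill_result are hypothetical and not part of WebSpeed. errno numbers are platform-specific, so this should be compiled and run on the same machine that produced the file. The errno value is meaningful only when the return value is -1.

```c
#include <stdio.h>   /* sscanf() */
#include <string.h>  /* strerror() */

/*
 * Parse a line of the form "Return from kill is R, errno is E".
 * Stores R in *rc and E in *err. Returns 1 on success, 0 if the
 * line does not match that format.
 */
int parse_kill_line(const char *line, int *rc, int *err)
{
    return sscanf(line, "Return from kill is %d, errno is %d", rc, err) == 2;
}

/*
 * Describe the recorded result. A return value of 0 means the null-signal
 * check succeeded; otherwise the errno number is decoded with strerror().
 */
const char *describe_kill_result(int rc, int err)
{
    if (rc == 0)
        return "broker PID is valid and the caller may signal it";
    return strerror(err);
}
```

A small driver could read the file line by line with fgets(), pass each line to parse_kill_line(), and print describe_kill_result() for the line that matches.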

chkpid.c

#include <errno.h>      /* errno */
#include <signal.h>     /* kill() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>  /* pid_t */
#include <unistd.h>     /* getpid() */

#define DEBUG 0

/* Return the value that follows "-ubpid" on the command line, or -1 if absent. */
pid_t getubpid(int argc, char * argv[])
{
    pid_t ubpid = -1;
    int i;

    for (i = 0; i < argc; i++)
    {
        if (DEBUG) printf("argv[%d] is %s\n", i, argv[i]);
        /* require a value after -ubpid so argv[i+1] is never read past the end */
        if (strcmp("-ubpid", argv[i]) == 0 && i + 1 < argc)
        {
            ubpid = (pid_t)atol(argv[i+1]);
            if (DEBUG) printf("Found ubpid, %s %ld\n", argv[i+1], (long)ubpid);
            break;
        }
    }

    return ubpid;
}

int main(int argc, char * argv[])
{
    pid_t ubpid;
    pid_t mypid;
    int rc;
    int last_error;
    FILE * fp;
    char msg[255];

    /* open a file based on our PID */
    mypid = getpid();
    snprintf(msg, sizeof msg, "out%05ld.txt", (long)mypid);
    if ((fp = fopen(msg, "w")) != NULL)
    {
        ubpid = getubpid(argc, argv);
        fprintf(fp, "ubpid is %ld\n", (long)ubpid);
        if (ubpid > 0)
        {
            /* signal 0: error checking only, no signal is actually sent */
            errno = 0;
            rc = kill(ubpid, 0);
            last_error = errno;   /* save errno before any other library call */
            fprintf(fp, "Return from kill is %d, errno is %d\n", rc, last_error);
            if (DEBUG) printf("Return from kill is %d, errno is %d\n", rc, last_error);
        }
        else
        {
            if (DEBUG) printf("Failed to get ubpid\n");
            fprintf(fp, "Failed to get ubpid\n");
        }
        fclose(fp);
    }
    return 0;
}