PumpSockets accept error: Socket Error: Error: 72: Software caused connection abort

jfu000 · December 16, 2016, 1:20am

Hello,
Sometimes we see the following error message written repeatedly to BESRelay.log:

PumpSockets accept error: Socket Error: Error: 72: Software caused connection abort

While this message is being written, the relay appears not to be functioning, and we frequently see "winsock error -10" in the client logs.
Our relay runs on AIX 7.1 (VIOC) and serves 200-700 clients.
The netstat command shows many sockets in CLOSE_WAIT, but there are 200-300 at most, so I think that may be normal.
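To tell whether the CLOSE_WAIT count is stable or keeps growing, it can help to sample it over time. A minimal sketch, assuming netstat -an prints one socket per line with the TCP state in the last column and addresses ending in .port (AIX) or :port (Linux). The function name count_close_wait is hypothetical:

```shell
# Count TCP sockets in CLOSE_WAIT whose local or foreign address uses the
# given port (default 52311). Reads `netstat -an` output on stdin; the state
# is taken from the last column and the port from the end of an address
# field, written as host.port (AIX) or host:port (Linux).
count_close_wait() {
  awk -v p="${1:-52311}" '
    $NF == "CLOSE_WAIT" {
      for (i = 1; i < NF; i++)
        if ($i ~ ("[.:]" p "$")) { n++; break }
    }
    END { print n + 0 }'
}

# Example: netstat -an | count_close_wait 52311
```

Each socket is counted once even if both its addresses use port 52311, and the port is matched only at the end of an address so that IP octets or other ports containing the same digits are not counted.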

Any help would be appreciated.

gpoliafico · December 19, 2016, 7:48am

The PumpSockets error does not, in general, indicate a problem; it simply means the agent closed the connection on its side. The number of CLOSE_WAIT sockets also sounds normal.
The only thing I'd be worried about is the relay that ‘looks like it is not functioning’.

As a test, can you check from the agent whether http://relayhostname:52311/rd answers correctly (using both the IP address and the fully qualified name)? If it doesn't, could it be a DNS issue? A firewall? A proxy?
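The /rd check can also be scripted from the agent side. A sketch, assuming a POSIX shell with curl available on the agent; relayhostname, relayhostname.example.com and 192.0.2.10 are placeholders, and relay_rd_url and check_relay are hypothetical names:

```shell
# Build the relay diagnostics URL for a host (port defaults to 52311).
relay_rd_url() {
  printf 'http://%s:%s/rd\n' "$1" "${2:-52311}"
}

# Request /rd from each host given and print "host -> HTTP status".
# curl prints 000 when no HTTP response was received at all
# (name resolution failure, refused connection, or timeout).
check_relay() {
  for h in "$@"; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$(relay_rd_url "$h")")
    printf '%s -> %s\n' "$h" "$code"
  done
}

# Example with placeholder names:
#   check_relay relayhostname relayhostname.example.com 192.0.2.10
```

If the IP address answers but the host name returns 000, DNS is the first suspect; if both return 000, a firewall, a proxy, or a relay that is not listening is more likely.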

If these checks don't point to anything, it's better to open a PMR so the support team can look at the relay logs and the ‘client diagnostics’ and dig deeper into the issue.

MattPeterson · January 5, 2017, 10:22pm

We occasionally see this on some AIX relays as well. I normally notice it when clients stop reporting and/or move to another relay.

I haven’t found the cause or a real solution. Restarting the client and relay services seems to get things working normally again, though sometimes it takes a few restarts.

I’ve also had success simply killing the CLOSE_WAIT sockets with the command below:

for i in $(netstat -Aan | grep 52311 | grep CLOSE_WAIT | awk '{print $1}'); do rmsock $i tcpcb; done
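A somewhat more cautious variant of that one-liner, as a sketch: it assumes the AIX netstat -Aan layout (PCB address in the first column, addresses written as host.port), matches port 52311 only at the end of an address field instead of anywhere in the line, and by default only prints the rmsock commands so they can be reviewed first. The names close_wait_pcbs and release_close_wait and the APPLY variable are hypothetical:

```shell
# Print the PCB address (first column of AIX `netstat -Aan`) of each
# socket in CLOSE_WAIT whose local or foreign address ends in .<port>.
close_wait_pcbs() {
  awk -v p="${1:-52311}" '
    $NF == "CLOSE_WAIT" {
      for (i = 2; i < NF; i++)
        if ($i ~ ("\\." p "$")) { print $1; break }
    }'
}

# Dry run by default: print the rmsock commands for review.
# Set APPLY=1 to actually run them (AIX only).
release_close_wait() {
  netstat -Aan | close_wait_pcbs "${1:-52311}" | while read -r pcb; do
    if [ "${APPLY:-0}" = 1 ]; then
      rmsock "$pcb" tcpcb
    else
      echo "rmsock $pcb tcpcb"
    fi
  done
}
```

Running release_close_wait first shows which sockets would be affected; APPLY=1 release_close_wait then removes them.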

jfu000 · January 6, 2017, 3:36am

I found that when this error occurs, CPU utilization on our relay server drops to around 1-2%, whereas it is always above 10% when the relay is working normally.
So I suspect that a slowdown of the relay, for whatever reason, causes this PumpSockets error, but I have no idea what the root cause is …
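One way to test this hypothesis is to log the relay's CPU consumption next to the number of PumpSockets errors at a fixed interval and see whether the two move together. A rough sketch, assuming a POSIX-style ps that accepts -o time= -o comm= and a relay process whose command name is BESRelay; the function names and the log path in the usage comment are hypothetical:

```shell
# Sum the CPU time, in seconds, of every process with the given command
# name (default BESRelay). Reads "TIME COMM" lines on stdin, where TIME is
# [[dd-]hh:]mm:ss as printed by ps.
sum_cpu_seconds() {
  awk -v name="${1:-BESRelay}" '
    $2 == name {
      t = $1; d = 0
      if (index(t, "-")) { split(t, a, "-"); d = a[1]; t = a[2] }
      n = split(t, f, ":"); s = 0
      for (i = 1; i <= n; i++) s = s * 60 + f[i]
      total += d * 86400 + s
    }
    END { print total + 0 }'
}

# Every INTERVAL seconds (default 60), print a timestamp, the CPU seconds
# the relay used since the previous sample, and the cumulative number of
# "PumpSockets accept error" lines in LOG. Stop with Ctrl-C.
watch_relay() {
  log="$1"; interval="${2:-60}"
  prev=$(ps -e -o time= -o comm= | sum_cpu_seconds)
  while sleep "$interval"; do
    now=$(ps -e -o time= -o comm= | sum_cpu_seconds)
    errs=$(grep -c 'PumpSockets accept error' "$log")
    # The delta goes negative if the relay restarted between samples.
    echo "$(date '+%Y-%m-%d %H:%M:%S') cpu_s=$((now - prev)) errors=$errs"
    prev=$now
  done
}

# Usage (hypothetical log path):
#   watch_relay /var/opt/BESRelay/BESRelay.log 60 >> /tmp/relay_watch.txt
```

It takes the difference in cumulative CPU time between samples rather than reading ps's %CPU column, because on many systems %CPU is averaged over the process's whole lifetime and would hide a sudden drop. Note that it measures the relay process only, not the whole server.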