Relays go offline

JasonWalker · June 26, 2018, 2:37pm

Yes, blocking ICMP pings to the top-level relays can prevent child relays from selecting them (even with the child relays configured for manual relay selection).
In this case, you should add the FailoverRelayList client setting, configured on the child relays, with values directing them to your top-level relays.

Clients (including child relays) first attempt to “ping” potential parent relays to determine which are available. If none respond to the ping requests, the client (or child relay) attempts to contact the BES Root Server defined in the masthead (even without a ping response). Defining the FailoverRelay or FailoverRelayList client setting overrides that behavior: the client or child relay will contact the relay(s) listed in that setting instead of connecting to the root server.
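The fallback order described above can be sketched roughly as follows. This is a simplified model for illustration only, not the BigFix client's actual implementation; the function name and parameters are hypothetical, and real relay selection weighs more than a plain ping response.

```python
from typing import Callable, List, Sequence


def hosts_to_try(
    candidates: Sequence[str],
    failover_relays: Sequence[str],
    root_server: str,
    responds_to_ping: Callable[[str], bool],
) -> List[str]:
    """Return the hosts a client or child relay would try, in order.

    Hypothetical helper that mirrors the fallback order described in the
    post above; it is not taken from the BigFix client itself.
    """
    # Step 1: keep only the candidate relays that answer a ping.
    reachable = [relay for relay in candidates if responds_to_ping(relay)]
    if reachable:
        return reachable
    # Step 2: nothing answered, so use the failover setting if one is defined.
    if failover_relays:
        return list(failover_relays)
    # Step 3: no failover setting, so fall back to the root server
    # named in the masthead.
    return [root_server]
```

With ICMP blocked and no failover setting, every lookup ends at step 3, which is why child relays end up contacting the root server instead of the top-level relays.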

tonyjar · June 27, 2018, 4:18am

Hi Jason,
Thanks for the info. Can I take your advice this way: if the TLR is made pingable again, it might improve the client (and lower-level relay) disconnection issue?

Tony

JasonWalker · June 27, 2018, 6:10am

Yes, that should improve things.

For initial registration, you’d still need a FailoverRelay set, or RelayServer1 / RelayServer2 at installation time (before the client has obtained the relay list). After initial registration, allowing ICMP or setting FailoverRelayList should maintain relay selection capability.

tonyjar · June 27, 2018, 10:04am

Hi Jason,
Many thanks for the info. We will try it. Thanks again,

Tony

JasonWalker · June 27, 2018, 12:07pm

Glad I could help. I hope it goes well for you.

tonyjar · June 29, 2018, 3:04am

Hi Jason,
My colleagues have tested it. Even with the TLR pingable and reachable by tracert, the relays still go offline easily. I also looked into the relay log file of one of the disconnected relays, in Program Files\BigFix Enterprise\BES Relay\, and found a lot of “No buffer space” errors. Is that also a cause of the issue? And how do I tune the buffer space?

10.82.29.115 is the TLR in the log below:

Sat, 26 May 2018 22:44:58 +0800 - PeriodicTasks (1896) - GetExpectedVersionOfParent Error: HTTP Error 7: Couldn’t connect to server: Failed to connect to 10.82.29.115: No buffer space
Sat, 26 May 2018 22:44:58 +0800 - PeriodicTasks (1896) - Error running task UpdateAndSendRelayStatus: HTTP Error 7: Couldn’t connect to server: Failed to connect to 10.82.29.115: No buffer space
Sat, 26 May 2018 22:46:43 +0800 - /cgi-bin/bfenterprise/clientregister.exe (16492) - Uncaught exception in plugin ClientRegister with client 10.70.70.3: HTTP Error 7: Couldn’t connect to server: Failed to connect to 10.82.29.115: No buffer space
Sat, 26 May 2018 22:47:20 +0800 - /cgi-bin/bfenterprise/clientregister.exe (11948) - Uncaught exception in plugin ClientRegister with client 10.70.70.3: HTTP Error 7: Couldn’t connect to server: Failed to connect to 10.82.29.115: No buffer space
Sat, 26 May 2018 22:47:25 +0800 - /cgi-bin/bfenterprise/clientregister.exe (16444) - Uncaught exception in plugin ClientRegister with client 10.70.70.3: HTTP Error 7: Couldn’t connect to server: Failed to connect to 10.82.29.115: No buffer space
Sat, 26 May 2018 22:48:10 +0800 - /cgi-bin/bfenterprise/clientregister.exe (13452) - Uncaught exception in plugin ClientRegister with client 10.70.70.3: HTTP Error 7: Couldn’t connect to server: Failed to connect to 10.82.29.115: No buffer space

Million thanks.
Tony

JasonWalker · June 29, 2018, 3:32am

Is the top-level relay itself doing OK? Do you have a PMR open? (You’ll probably need one.)

If your top-level relay is healthy and not giving error messages, I expect there may be something wrong in your network path or in the network configuration on your child relay. Are you doing anything to restrict TCP/IP sockets (like defining a small ephemeral port range)?
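On a Windows relay, one way to see the ephemeral (dynamic) port range is the command `netsh int ipv4 show dynamicport tcp`. The following is a minimal Python sketch that runs that command and parses the result. It assumes the usual English-language output with “Start Port” and “Number of Ports” lines; the function names are hypothetical.

```python
import re
import subprocess
from typing import Tuple


def parse_dynamic_port_range(netsh_output: str) -> Tuple[int, int]:
    """Extract (start_port, number_of_ports) from netsh dynamicport output.

    Assumes English-language output containing 'Start Port' and
    'Number of Ports' lines; other locales would need different patterns.
    """
    start = re.search(r"Start Port\s*:\s*(\d+)", netsh_output)
    count = re.search(r"Number of Ports\s*:\s*(\d+)", netsh_output)
    if not start or not count:
        raise ValueError("unexpected netsh output format")
    return int(start.group(1)), int(count.group(1))


def read_dynamic_port_range() -> Tuple[int, int]:
    """Run netsh on the local Windows host and parse what it prints."""
    result = subprocess.run(
        ["netsh", "int", "ipv4", "show", "dynamicport", "tcp"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_dynamic_port_range(result.stdout)
```

Calling `read_dynamic_port_range()` on the child relay returns the start port and the number of ports. A range much smaller than the Windows default of 16384 ports (starting at 49152) would be worth raising with the network or server team.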

tonyjar · June 29, 2018, 3:46am

Hi Jason,
I asked my reseller, but they didn’t provide me with any channel to IBM BigFix support. How do I actually submit a PMR? Can you give me some info?

I guess my TLRs are healthy, since not all 6 would go wrong at the same time, right? I don’t know whether my network colleagues restrict anything, as I am no network expert either. Can you suggest any command I can try to see the current settings?

million thanks.

Tony

JasonWalker · June 29, 2018, 1:51pm

You’ll need an IBM ID to log in and open support PMRs (which I think have been renamed to TS now to be more confusing).

If you don’t have an IBM ID, you should be able to create one and register for support using your customer number or agreement number. If you don’t have those and your reseller is defunct or uncooperative, the IBM licensing folks should be able to retrieve your customer number given the serial number in your masthead file.

Let us know your current standing so we can determine where best to direct you.

tonyjar · July 3, 2018, 2:31pm

Hi Jason,
Yes, I am asking my reseller to provide the customer number or agreement number. Once I have the number, where do I create the ID and open a PMR (or TS, I am already confused)?

many thanks.

Tony

tonyjar · July 5, 2018, 6:50am

Hi Jason,
I got a 10-digit OEM software agreement number from my reseller, but I can’t fit it into the 7-digit field on the page. Any suggestions?

Tony

tonyjar · July 5, 2018, 7:45am

wrong pic, should be as below

tvn · July 6, 2018, 3:14pm

I am having the same issue on a relay as well. It was reporting fine until June 29, and now I see winsock error 4294967288 in the logs and the relay has gone offline.

tonyjar · July 7, 2018, 6:06pm

Hi TVN,
My environment is 9.5.6. What is yours? Can you share?

Tony

JasonWalker · July 7, 2018, 7:57pm

Technically they are now called “TS” (Tech Support requests), but by convention a lot of us still use the former term “PMR”.

JasonWalker · July 7, 2018, 7:57pm

Click that link for support resource help to send them an email and they should help you.

tvn · July 9, 2018, 12:22pm

Hi tonyjar, same version here, 9.5.6.63 to be more precise. I have tried a new relay install. Without the relay installed, the client started reporting back to the console. When I installed the relay, it stopped posting reports after about 20 minutes.

tonyjar · July 10, 2018, 2:54am

Hi TVN,
The symptoms are quite similar to my environment.
I just checked with the network team: they set a bandwidth control on outbound data. I am guessing it limited the outbound data and resulted in the clients disconnecting. See if that is a hint for you too. Cheers,

Tony

tvn · July 12, 2018, 8:36pm

Hi tonyjar,

In my case it was a little bit different.

The netstat command showed me that something was holding port 52311. I was pointed there by the logs:
Wed, 11 Jul 2018 10:55:14 -0400 - 1669719840 - Unable to start http server: socket may already be in use on port 52311, next retry in 30 seconds
Wed, 11 Jul 2018 10:50:16 -0400 - PeriodicTasks (650106624) - Error running task UpdateAndSendRelayStatus: HTTP Error 28: Timeout was reached: Connection timed out after 10010 milliseconds
Wed, 11 Jul 2018 10:50:44 -0400 - 1669719840 - Can’t listen on address [0.0.0.0]: 15SocketAddrInUse

And the command:
[root@ ~]# netstat -apn |grep 52311
tcp 0 78661 <ip_number>:52311 <ip_number>:56829 FIN_WAIT1 -
udp 0 0 0.0.0.0:52311 0.0.0.0:* 9332/BESClient

The FIN_WAIT1 entry was an orphaned socket (no owning process, hence the “-”), which I could clear using the kernel’s tcp_max_orphans setting:
cat /proc/sys/net/ipv4/tcp_max_orphans

This makes me think that at some point the TLR of this relay sent a message but did not get a response back, which left the orphaned socket.

I wrote the value 0 to the file:
echo 0 > /proc/sys/net/ipv4/tcp_max_orphans

The FIN_WAIT1 entry went away.

So I wrote the original value back to the file:
echo 262144 > /proc/sys/net/ipv4/tcp_max_orphans
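The restore step above hard-codes 262144, which may differ from the value on another host. As a hedged sketch of the same workaround in Python, the helper below saves whatever value is currently set and writes it back afterwards. It assumes Linux, root privileges, and the standard /proc path; the function name and the wait time are hypothetical choices, not part of any BigFix tooling.

```python
import time
from pathlib import Path

# Standard Linux location of the setting used in the post above.
ORPHANS_PATH = Path("/proc/sys/net/ipv4/tcp_max_orphans")


def flush_tcp_orphans(path: Path = ORPHANS_PATH, wait_seconds: float = 5.0) -> int:
    """Temporarily set tcp_max_orphans to 0, then restore the saved value.

    Returns the original value. The wait time is an arbitrary assumption
    that gives the kernel a moment to drop orphaned sockets.
    """
    original = path.read_text().strip()
    try:
        path.write_text("0\n")
        time.sleep(wait_seconds)
    finally:
        # Always write the saved value back, even if something fails above.
        path.write_text(original + "\n")
    return int(original)
```

Run as root, `flush_tcp_orphans()` repeats the two echo commands but restores the value the host actually had.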

After that, I restarted the client and relay, and it has been up for about 10 hours.
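To spot lingering sockets on the relay port before they block a restart, the netstat output can be summarized by TCP state. This is a minimal Python sketch that assumes `netstat -apn`-style columns (protocol, Recv-Q, Send-Q, local address, foreign address, state); the function name is hypothetical.

```python
from collections import Counter


def tcp_states_on_port(netstat_output: str, port: int) -> Counter:
    """Count TCP sockets by state where either endpoint uses the given port.

    Expects Linux `netstat -apn` style rows; UDP rows and header lines
    are skipped.
    """
    states: Counter = Counter()
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows: proto, recv-q, send-q, local, foreign, state, [pid/program]
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if local.endswith(suffix) or foreign.endswith(suffix):
            states[state] += 1
    return states
```

Feeding it the text printed by `netstat -apn` with `port=52311` gives a count per state, so a growing number of FIN_WAIT1 entries stands out without reading every row.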

tonyjar · July 13, 2018, 1:50am

Hi TVN,
It is really not similar to my case then…
Anyway, thanks for sharing,

cheers,
Tony