SOCKET RECEIVE (winsock error 4294967286)

The relay in question has about 1,200 active clients. It is an OEL 7.1 VM with 2 CPUs and 4 GB RAM.

Looking at connections, “netstat -an | grep -i 52311 | wc -l” only returns anywhere from 80 to 200, so I wouldn’t think the OS has any port exhaustion problems.
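A single line count from netstat lumps every TCP state together (and plain `wc` prints lines, words and bytes, not just lines). A per-state breakdown for port 52311 is a little more informative: a pile-up in a state such as CLOSE_WAIT or SYN_RECV might be worth a look even when the total stays low. This is a minimal sketch, assuming the usual Linux `netstat -an` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); `count_states` is a hypothetical name.

```python
from collections import Counter

PORT = "52311"  # BigFix port used in this thread


def count_states(netstat_output: str, port: str = PORT) -> Counter:
    """Tally TCP states for rows whose local or foreign address uses `port`.

    Assumes Linux `netstat -an` columns:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, UDP rows and Unix sockets: only TCP rows carry a State column.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        # Match the port exactly instead of grepping for the digits anywhere in the line.
        if local.endswith(":" + port) or foreign.endswith(":" + port):
            states[state] += 1
    return states
```

On the relay, `count_states(subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout)` would feed it live output.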

The BESRelay.log file doesn’t show any errors from the time range when the client had the issue.
Mon, 27 May 2019 18:01:58 +0000 - 1834936064 - 403: 20NoMastheadMatchesURL
Mon, 27 May 2019 18:20:08 +0000 - 1834936064 - 189: 17NotASignedMessage

Ok, that all looks good. I’m out of suggestions. I think you said you had a PMR open, right? I’d keep following up with support.

We really need good direction from support and are challenged in this area. We have tried the 10000 debug log and sent event tracing data for winsock communications, but we haven’t gotten to the root cause. In some cases a reboot of the server makes the issue go away, which tells us it’s not network related but something corrupted on the client, either in the BES code or somewhere in the TCP stack. You can see our frustration, as this has been going on for almost 2 years.

Not sure whether it works in your case, but here is what I did when I received winsock error 4294967290. On further troubleshooting on the client side, the client wasn’t able to connect to the BigFix relay/server using its FQDN, while connecting by IP address worked with no errors. I added a local hosts entry on the client, and that fixed it.


Unfortunately, with tens of thousands of clients, that won’t be doable for us. The rotten thing is that if you restart the machine, it contacts the relay just fine.

Instead of restarting the entire machine, do things resolve if you simply restart the Client service?

No, they typically don’t, unless a different relay is auto-selected, and that rarely happens.

If the issue is not resolved by restarting the BigFix Client (but is by restarting the endpoint), it seems likely that the issue resides somewhere external to the Client, but on the endpoint itself.

When this issue occurs, prior to restarting the endpoint, have you performed network connectivity troubleshooting between the endpoint and the Relay in question? DNS resolution, ping, telnet on the BigFix port, network packet captures, etc.?
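While the problem is happening, it helps to know which stage fails: name resolution, or the TCP connect to the BigFix port. The sketch below separates the two. It is a generic Python check, not a BigFix tool; the name `probe` and the 5-second default timeout are assumptions.

```python
import socket


def probe(host: str, port: int = 52311, timeout: float = 5.0) -> dict:
    """Resolve `host`, then try a TCP connect; report which stage failed."""
    result = {"host": host, "port": port, "dns_ok": False,
              "connect_ok": False, "tried": [], "error": None}
    try:
        # Name resolution only (DNS or hosts file); nothing is sent to the relay yet.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"name resolution failed: {exc}"
        return result
    result["dns_ok"] = True
    for family, socktype, proto, _canon, sockaddr in infos:
        result["tried"].append(sockaddr[0])
        try:
            # Roughly what "telnet <relay> 52311" checks: a bare TCP handshake.
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
        except OSError as exc:
            result["error"] = f"connect to {sockaddr[0]} failed: {exc}"
            continue
        result["connect_ok"] = True
        result["error"] = None
        break
    return result
```

Running it once with the relay’s FQDN and once with its IP address distinguishes a name-resolution problem (the situation the hosts-entry workaround above addressed) from a path or firewall problem.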

The biggest issue we have is the inability to consistently repro this. We cannot afford to turn on 10000 debug client logging and have netmon tracing running on tens of thousands of clients. It can happen on one client and then, 5 minutes later, clear up on the same client with no restart. As previously mentioned, we can restart the machine and then the issue goes away.

It’s one of the worst moving targets we have ever seen. We have had 2 cases open with IBM on this in the past and we simply gave up due to lack of root cause identification.

I wouldn’t suggest enabling debug logging on the Client to troubleshoot winsock errors as it will not give you additional information/details beyond the standard Client log. As suggested earlier, the winsock errors are indicative of a network communications issue that the Client encountered when attempting to perform a network operation.

If you’re seeing winsock errors in a Client log that clear up on their own without a restart of any kind, then for sure the issue is entirely external to the Client, and that suggests intermittent network issues.

Some winsock errors (such as BAD_SERVERNAME, which is related to DNS resolution) might be addressed at the network level, or by different BigFix Client configurations. The rest tend to be network connectivity issues that need to be isolated when they occur, and can result from configurations/applications on the endpoint itself and/or network configurations/conditions.

I would suggest focusing troubleshooting efforts on specific instances, and testing network connectivity during these periods.
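If debug logging everywhere is too expensive, one low-cost option is to run a lightweight probe on a handful of endpoints that have already shown the error and record only the failures, with UTC timestamps that can be lined up against relay, firewall and switch logs. This is a sketch: `watch` and `probe_fn` are hypothetical names, and `probe_fn` stands for any check returning `(ok, detail)`, for example a DNS lookup followed by a TCP connect to port 52311.

```python
import time
from datetime import datetime, timezone
from typing import Callable, List, Tuple


def watch(probe_fn: Callable[[], Tuple[bool, str]], interval: float = 60.0,
          iterations: int = 60,
          sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Run `probe_fn` every `interval` seconds; keep a timestamped line per failure."""
    failures = []
    for i in range(iterations):
        ok, detail = probe_fn()
        if not ok:
            # UTC timestamp in the same style as the BigFix log lines in this thread.
            stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S +0000")
            failures.append(f"{stamp} - {detail}")
        if i < iterations - 1:
            sleep(interval)  # injectable so the loop can be tested without waiting
    return failures
```

Successful checks are not recorded, so the output stays small enough to collect from many machines.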

That’s what will be difficult. We don’t know which machine will act up next out of tens of thousands, and we cannot afford to turn on debugging on tens of thousands of machines.

As Aram states, turning on additional BESClient debugging is not likely to help the situation, as it’s a transient problem likely related to your client or the network, and not to BigFix itself.

You’ll need to be looking for other clues in the Event Logs of your server & relays, looking at your firewall/router logs, network switch logs, etc.

If left alone, does the client ever resolve itself without your intervention?

From your other post I saw there are ten network hops from your client to your relay; that is likely a case where I’d recommend locating a relay closer to your client, or enabling Persistent Connections from your client to the relay, depending on your network architecture.

What is your total deployment size, total number of relays, number of locations, number of endpoints per location?

You mentioned earlier that you need better guidance from support. To be clear, this is not a support forum but a peer enthusiast / self-help forum. I don’t know whether anyone from the Support team reads the forum, and if so it’s on their own time; responding on the forum is not anyone’s job. For actual support, you’ll need to work with them through the PMR / Support Request process. I’m honestly not sure how much that will help, since diagnosing what’s going on with your network might be beyond their scope. The winsock errors could be useful from a developer standpoint, but often a winsock error just means “I could not connect”, which is less than helpful. The codewords that go with the number are often more useful than the number itself: SOCKET RECEIVE, BAD SERVERNAME, etc.
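Following that advice, it can help to tally codewords rather than numbers across whatever client logs you can collect. The regular expression below is an assumption based only on the log excerpt quoted in this thread (`SOCKET RECEIVE (winsock error 4294967286)`); other log lines may need a different pattern, and `tally_codewords` is a hypothetical name.

```python
import re
from collections import Counter

# Assumed shape, taken from the excerpt in this thread:
#   SOCKET RECEIVE (winsock error 4294967286)
# Codewords are upper-case words, optionally joined by spaces or underscores.
WINSOCK_RE = re.compile(r"\b([A-Z][A-Z_]*(?: [A-Z_]+)*) \(winsock error (\d+)\)")


def tally_codewords(log_text: str) -> Counter:
    """Count (codeword, error number) pairs found in BESClient log text."""
    return Counter((m.group(1), int(m.group(2)))
                   for m in WINSOCK_RE.finditer(log_text))
```

Summing the counters from several endpoints shows which codeword dominates, which says more about the failure class (DNS versus dropped traffic) than the raw number does.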

BAD SERVERNAME usually means your DNS isn’t working properly. SOCKET RECEIVE usually means your network or operating system is dropping packets along the way. The first doesn’t usually fix itself, unless maybe you are seeing that while rebooting DNS servers; the second usually resolves itself with retries in time.
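The BigFix client’s own retry schedule isn’t described in this thread. The sketch below only illustrates the general idea behind “resolves itself with retries in time”: exponential backoff with jitter spreads attempts out, so a short-lived packet-loss episode is eventually outlasted, whereas a persistent DNS failure would fail every attempt. `with_retries` is a hypothetical helper, not a BigFix API.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(op: Callable[[], T], attempts: int = 5, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `op`, retrying on OSError with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return op()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last network error
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ... by default
            sleep(delay + random.uniform(0, delay / 2))  # jitter avoids lockstep retries
    raise ValueError("attempts must be at least 1")
```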

I don’t want to hijack this thread, since I have one going on about SOCKET RECEIVE (winsock error 4294967286), but all of the commands you suggested there (post #20 by Aram) check out. We have been trying to resolve this for almost 2 years and have run those commands over and over with no resolution.

One suggestion (IBM/HCL, are you listening?): if a GetURL fails because of winsock errors, try a different relay.
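That suggestion amounts to per-request relay failover. Whether and how the BigFix client does this for GetURL isn’t covered in this thread, so this is only a sketch of the requested behavior; `fetch_with_failover`, the `fetch` callable, and any relay names passed to it are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def fetch_with_failover(
    relays: Iterable[str], fetch: Callable[[str], bytes]
) -> Tuple[Optional[str], Optional[bytes], List[Tuple[str, str]]]:
    """Try `fetch` against each relay in order; return the first one that works."""
    errors: List[Tuple[str, str]] = []
    for relay in relays:
        try:
            return relay, fetch(relay), errors
        except OSError as exc:
            # Remember why this relay was skipped, then move on to the next one.
            errors.append((relay, str(exc)))
    return None, None, errors  # every relay failed
```

Keeping the list of skipped relays and their errors also makes it possible to log which relay failed and why.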

My current winsock issue is winsock error 4294967286, but we get winsock error 4294967290 as well. If we ever fix one, we will most likely fix both. The message from the log is:
At 16:59:23 +0000 -
Error posting report to: ‘http://xxxxxxxxx.thomsonreuters.com:52311/cgi-bin/bfenterprise/PostResults.exe’ (General transport failure.
SOCKET RECEIVE (winsock error 4294967286)
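Both numbers are far outside the range of standard Winsock error codes, which start at 10000 (WSAECONNRESET, for example, is 10054). They look like small negative values printed as unsigned 32-bit integers: 4294967286 is 2^32 − 10 and 4294967290 is 2^32 − 6. The conversion below is plain two’s-complement arithmetic; treating −10 and −6 as BigFix-internal error codes is an assumption, and their meaning isn’t documented in this thread.

```python
def as_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    # If the top bit is set, the value represents a negative number.
    return value - (1 << 32) if value & 0x80000000 else value


for code in (4294967286, 4294967290):
    print(code, "->", as_signed32(code))
# prints:
# 4294967286 -> -10
# 4294967290 -> -6
```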

Here is the output of those commands.

We usually have automatic relay selection set to 1; I will try 0 and see if that helps at all. Do you know what 0 does versus 1?

I can also get to https://xxxxxxxx.thomsonreuters.com:52311/rd just fine from same machine.

Relay select 0 vs 1 is “manual relay select” vs “automatic select”. It should have no bearing, as this does not appear to be occurring during relay selection.

Is it normal to have ten network hops between the client and the relay? That’s a lot of hops, and a case where you might want to distribute a relay closer to the client.

What kind of logical distribution do you have (number of clients, number of relays, distance from client to relay)? Many customers who have the model of many small, distributed sites (i.e., mall storefronts) will add a relay to one of the workstations at each location.

Have you looked at the Persistent Connection options available as of 9.5.11 or 9.5.12?

Hi cstoneba

Has this problem ever been resolved, and is there any suggestion for fixing it? It is happening to me as well on version 9.5.13. Thanks.

No, never found a cause.

Hi cstoneba

I finally found the solution for this. Check the OS baseline policy and see whether TLS is enabled or blocked; if it is blocked, enable it. You can test this by turning off the baseline policy and then installing the client. Thank you.

What versions of TLS did you have issues with?

For me it was TLS 1.2. Thank you.
