Multiple problems in Relay - Clients not registering

Hello guys,

I have a problem with one of my relays. This particular relay can’t do anything: if I manually register clients on it, after a period of time they connect to the failover relay, which is the root server.
When set to automatic relay selection, clients avoid this relay. Currently only two clients are connected to it, and they have not sent a report for more than 30 days.

I’ll probably have to open a support case to solve it completely, but for now, do you guys have any idea how I can troubleshoot it? I’ve never heard of HTTP error 55.

I’ll leave a screenshot of some logs from the relay.

Looks like the Relay can’t talk to the Server (whichever one you have marked out).
Check that port 52311 (TCP and UDP) is opened for the full path between the Relay and the Server.

Also be sure it’s allowed on the Relay’s local firewall.
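A quick way to confirm the TCP side of this from the Relay (or from a client toward its Relay) is a plain socket connect. This is only a minimal sketch: the function name is made up for illustration, and it says nothing about UDP 52311, which cannot be verified with a simple connect.

```python
import socket


def tcp_port_open(host: str, port: int = 52311, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Example (host name is a placeholder, run it on the Relay):
# print(tcp_port_open("rootserver.domain.com"))
```

A False result from the Relay toward the Server, combined with a True result from another machine, would point at something on the path or on the Relay itself rather than at the Server.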

Agree that the issue is that this Relay is unable to connect to its Server or upstream Relay.

For reference, both of these particular error codes (28 and 55) are coming from libcurl, a widely-used HTTPS client used in our Client and Relay services. Reference at https://curl.se/libcurl/c/libcurl-errors.html.

The log is returning libcurl errors rather than HTTP statuses because the HTTP connection is not being established at all, so there can be no HTTP responses from the server.
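For convenience, here is a small lookup for the libcurl codes most likely to show up in this kind of connectivity problem. The names come from the libcurl error reference linked above; the helper function and the short descriptions are my own wording, so treat this as a sketch rather than anything shipped with BigFix.

```python
# Subset of libcurl error codes (see libcurl-errors.html for the full list).
LIBCURL_ERRORS = {
    6: ("CURLE_COULDNT_RESOLVE_HOST", "the host name could not be resolved"),
    7: ("CURLE_COULDNT_CONNECT", "could not connect to the host or proxy"),
    28: ("CURLE_OPERATION_TIMEDOUT", "the operation timed out"),
    55: ("CURLE_SEND_ERROR", "failed sending network data"),
    56: ("CURLE_RECV_ERROR", "failure receiving network data"),
}


def describe_curl_error(code: int) -> str:
    """Translate a libcurl error number into its symbolic name and meaning."""
    name, meaning = LIBCURL_ERRORS.get(code, ("unknown", "not in this short table"))
    return f"libcurl {code} ({name}): {meaning}"


for code in (28, 55):
    print(describe_curl_error(code))
```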

Is there any way to test a bidirectional connection between my relay and a server, and between a client and a relay?
Pinging works, of course, and they answer each other.

The problem here is that as soon as I install a relay, 1000 clients register to it, and after a few hours or days this number gets close to zero.

I know that this is a network problem; I just want to find ways to identify exactly what the error is.

Maybe a download test between a client and relay?

Not sure if BigFix has any way to check the health of the communication between agents and relays.

I notice that we’re having many download failures on big files. For example, we’re deploying Windows 10 22H2 to our entire environment and the ISO file is around 5 GB. Computers often fail to download it, even when they are on the same subnet as their relay or client. Yes, the relays have all the files; I’ve checked.

I’ve already sent a network traffic guide to our infrastructure and firewall team, but I am looking to show them the actual failure.

Thank you.

You can use this URL in curl or a browser -

http://relayservername:52311/cgi-bin/bfenterprise/clientregister.exe?RequestType=Version

If this is not an Authenticating Relay, you should get an HTML response with the version. If it’s an Authenticating Relay, you’ll get a 401:Forbidden response which also indicates a good connection.
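If you would rather script the check than use a browser, something along these lines should work. It is a sketch under a few assumptions: the function names are invented for illustration, the host name is a placeholder, and the empty ProxyHandler is there so that Python ignores any system proxy settings and talks to the Relay directly.

```python
import urllib.error
import urllib.request


def version_url(relay_host: str, port: int = 52311) -> str:
    """Build the version-check URL for a given Relay."""
    return (f"http://{relay_host}:{port}"
            "/cgi-bin/bfenterprise/clientregister.exe?RequestType=Version")


def interpret(status: int) -> str:
    """Map an HTTP status from the version check to a short diagnosis."""
    if status == 200:
        return "relay answered with its version (connection is good)"
    if status == 401:
        return "authentication required (Authenticating Relay, connection is good)"
    if status == 503:
        return "service unavailable (relay or something in between refused the request)"
    return f"unexpected HTTP status {status}"


def check_relay(relay_host: str, port: int = 52311, timeout: float = 10.0) -> str:
    # An empty ProxyHandler disables proxy auto-detection for this request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(version_url(relay_host, port), timeout=timeout) as resp:
            return interpret(resp.status)
    except urllib.error.HTTPError as err:
        # 4xx/5xx answers still prove that an HTTP conversation took place.
        return interpret(err.code)
    except OSError as err:
        # DNS failure, refused connection, timeout: no HTTP response at all.
        return f"no HTTP response at all: {err}"


# Example: print(check_relay("relayservername"))
```

Running it once with the proxy bypass and once through a browser that uses the corporate proxy settings can help show whether the two paths behave differently.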

I did the test and I’m getting:

503 – Service Unavailable

More than one relay is showing this result.

Testing from a client to relay should work, but when I try to do the same test from the relay to a client computer, it loads forever and eventually fails.

The connection from a relay to a client is not expected to work - the client doesn’t listen for / respond to connections.

Is this test working when you connect to a Relay? If it’s not, I’d suspect a proxy or endpoint security product is blocking the connection.

It is known that before updating and migrating BigFix, at least 3 relays were configured as a “Proxy Agent”, but the security team is entirely new and they don’t know the reason behind this setting.
Just for background, the company’s infrastructure suffered a cyber-attack 2 years ago and the previous security team is no longer working with us.

The computers have SEP as AV and they are all Server 2012 R2. Is it possible that the rejection is happening in the relay server? Because Clients use another AV and apparently this is not the problem.

Just to say that these 3 relays were converted to regular relays, no longer proxy agent.

Thank you.

Ah, ok, in this context ‘proxy’ and ‘proxy agent’ have entirely different meanings.

I meant a ‘proxy’ as in the common usage - a web filtering engine such as squid, nginx, etc. that performs web filtering / firewall filtering functions that might allow, reject, or re-write web traffic.

The BigFix usage of ‘proxy agent’ refers to a device (usually a Relay) that connects to a service and emulates BES Computers for that service - such as the VMWare Proxy that allows manipulating VM profiles at the VCenter server, or the Bare Metal proxy that manipulates OSD Bare Metal Server, or the Cloud or MCM plugins.

If there is a web-filtering proxy between your client and relay, or between relay and parent relay/root server, that proxy might be blocking the connection.

If the 503 response is actually coming from the relay itself, you should see some related error messages in the relay’s logfile.txt. If the 503 response is actually coming from a proxy, you might not see errors in the relay log, because the connection test never reached the relay and was answered by the proxy instead.
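One rough way to tell the two cases apart is to look at the response headers that come back with the 503. The sketch below is only a heuristic: the header list is my own pick of headers that intermediaries commonly add (`Via` is the standard one, the `X-` ones are typical of Squid-style caches), and the absence of these headers does not prove that no proxy is involved.

```python
# Response headers commonly added by proxies or caches (heuristic list).
PROXY_HEADERS = ("via", "x-cache", "x-cache-lookup", "x-squid-error")


def proxy_hints(headers: dict) -> list:
    """Return 'name: value' strings for headers that suggest an intermediary."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return [f"{name}: {lowered[name]}" for name in PROXY_HEADERS if name in lowered]


# Example with headers captured from a failing request, for instance
# dict(err.headers.items()) on a urllib.error.HTTPError:
sample = {"Server": "squid", "Via": "1.1 filter01 (squid)", "Content-Type": "text/html"}
print(proxy_hints(sample))
```

It is also worth reading the `Server` header by eye and comparing it with what a known-good Relay returns for the same request.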

I just want to come back to this again. The problem is still unsolved, but I have observed some strange behaviour on multiple relays.

Here are some logs from the logfile.txt of the relay:

Mon, 15 Jan 2024 14:23:55 -0300 - 3992 - Failed to create plugin thread for client 999.100.1.100: class WindowsPlatform::ThreadError
Mon, 15 Jan 2024 14:23:58 -0300 - 3992 - Failed to create plugin thread for client 999.100.1.101: class WindowsPlatform::ThreadError
Mon, 15 Jan 2024 14:23:59 -0300 - 3992 - Failed to create plugin thread for client 999.100.1.102: class WindowsPlatform::ThreadError
Mon, 15 Jan 2024 14:23:59 -0300 - 3992 - Failed to create plugin thread for client 999.100.1.103: class WindowsPlatform::ThreadError
Mon, 15 Jan 2024 14:24:01 -0300 - 3992 - Failed to create plugin thread for client 999.100.1.104: class WindowsPlatform::ThreadError
Mon, 15 Jan 2024 14:24:02 -0300 - 3992 - Failed to create plugin thread for client 999.100.1.105: class WindowsPlatform::ThreadError
Mon, 15 Jan 2024 14:24:14 -0300 - 4868 - PostResultsForwarder ( CURL Error 6 ): HTTP Error 6: Couldn't resolve host name: getaddrinfo() thread failed to start
Mon, 15 Jan 2024 14:24:15 -0300 - 6928 - 23: GetURL failure on http://rootserver.domain.com:52311/bfmirror/bfsites/manydirlists_1/__fullsite_3a85d217366a374a5acab7e0361b6e8: bad allocation
Mon, 15 Jan 2024 14:24:19 -0300 - 1848 - 23: GetURL failure on http://rootserver.domain.com:52311/cgi-bin/bfenterprise/BESGatherMirror.exe?url=http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/actionsite&ManyVersionSHA1=84a7a5a5f98ac948cca74f93&ExpectedManyVersionCRC=2883422624&Time=1705339459: bad allocation
Mon, 15 Jan 2024 14:24:19 -0300 - 3992 - Failed to create plugin thread for client 999.100.1.110: class WindowsPlatform::ThreadError
Mon, 15 Jan 2024 14:24:22 -0300 - 1976 - 23: GetURL failure on http://rootserver.domain.com:52311/cgi-bin/bfenterprise/BESGatherMirror.exe?url=http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/actionsite&ManyVersionSHA1=cb3aaa7a5a5f98ab3048cca74f93&ExpectedManyVersionCRC=2883422624&Time=1705339462: bad allocation
Mon, 15 Jan 2024 14:24:23 -0300 - 4868 - PostResultsForwarder: bad allocation
Mon, 15 Jan 2024 14:24:24 -0300 - /cgi-bin/bfenterprise/clientregister.exe (2916) - Uncaught exception in plugin ClientRegister with client 999.100.1.111: bad allocation
Mon, 15 Jan 2024 14:24:24 -0300 - /cgi-bin/bfenterprise/clientregister.exe (3348) - Uncaught exception in plugin ClientRegister with client 999.100.1.112: bad allocation
Mon, 15 Jan 2024 14:24:25 -0300 - 6864 - 23: GetURL failure on http://rootserver.domain.com:52311/cgi-bin/bfenterprise/BESGatherMirror.exe?url=http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/actionsite&ManyVersionSHA1=8b8cb3aaa5f98ab30b5b948c74f93&ExpectedManyVersionCRC=2883422624&Time=1705339465: bad allocation
Mon, 15 Jan 2024 14:24:34 -0300 - 4868 - PostResultsForwarder: bad allocation
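To show the infrastructure team how often each kind of failure occurs, the relay log can be summarised with a few lines of Python. This is a sketch that assumes every entry follows the "timestamp - source - message" layout seen above; the category names are mine, and the file path in the usage comment is a placeholder.

```python
import re
from collections import Counter

# Matches e.g. "Mon, 15 Jan 2024 14:23:55 -0300 - 3992 - message text"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}, \d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2} [+-]\d{4})"
    r" - (?P<src>.+?) - (?P<msg>.*)$"
)


def categorize(msg: str) -> str:
    """Assign a coarse failure category to a relay log message."""
    if "Failed to create plugin thread" in msg or "thread failed to start" in msg:
        return "thread creation failed"
    if "bad allocation" in msg:
        return "bad allocation"
    return "other"


def summarize(lines) -> Counter:
    """Count log entries per category, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[categorize(match.group("msg"))] += 1
    return counts


# Usage (path is a placeholder):
# with open("logfile.txt", encoding="utf-8", errors="replace") as fh:
#     print(summarize(fh))
```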

Now, the behaviour is the following. The clients choose their specified relay. After a while, which could be 1, 3 or 5 days (but it always happens), they suddenly stop communicating with that relay and I see these strange logs. That particular relay also stops reporting to the server (judging by its last report time), and there’s nothing we can do to wake it from the console.

When I access the relay, I see that the BigFix Relay service and the BESClient service are both running, which I find intriguing.

On the client logs of the relay I see the following:

********************************************
Current Date: January 15, 2024
   Client version 10.0.9.21 built for WINVER 6.0 i386 running on WINVER 10.0.20348 x86_64
   Current Balance Settings: Use CPU: True Entitlement: 0 WorkIdle: 10 SleepIdle: 480
   IP Address 0: 172.31.1.96
   Host name: REDACTED
   Computer ID: 9999999999
   Executable Location: C:\Program Files (x86)\BigFix Enterprise\BES Client\BESClient.exe
   File Log Location: C:\Program Files (x86)\BigFix Enterprise\BES Client\__BESData\__Global\Logs
   ICU 54.2 init status: SUCCESS
   Agent internal character set: UTF-8
   ICU report character set: UTF-8 - Transcoding Disabled
   ICU fxf character set: windows-1252 (Latin 1 / Western European) - Transcoding Enabled
   ICU local character set: windows-1252 (Latin 1 / Western European) - Transcoding Enabled
********************************************
At 14:10:04 -0300 - 
   Starting client version 10.0.9.21
   FIPS mode disabled by default.
At 14:10:05 -0300 - 
   Cryptographic module initialized successfully.
   Using crypto library libBEScrypto - OpenSSL 1.0.2zg  7 Feb 2023
   Initializing Site: actionsite
   Restricted mode
   Initializing Site: BES Support
   Initializing Site: BigFix Labs
   Initializing Site: CustomSite_Windows
   Initializing Site: Enterprise Security
   Initializing Site: IBM License Reporting
   Initializing Site: Patching Support
   Initializing Site: Updates for Windows Applications
   Initializing Site: Virtual Endpoint Manager
At 14:10:06 -0300 - 
   Initializing Site: mailboxsite
   Initializing Site: opsite10
   Processing Download plugins
   Setting _BESClient_Download_FastHashVerify enabled: Off
   Beginning Relay Select
   GetRelayInfo: checking 'http://127.0.0.1:52311/cgi-bin/bfenterprise/clientregister.exe?RequestType=Version'
   GetRelayInfo: Valid Relay
At 14:10:07 -0300 - 
   RegisterOnce: Attempting secure registration with 'https://127.0.0.1:52311/cgi-bin/bfenterprise/clientregister.exe?RequestType=RegisterMe60&ClientVersion=10.0.9.21&Body=1083399088&SequenceNumber=7489&MinRelayVersion=7.1.1.0&CanHandleMVPings=1&Root=http://rootserver.domain.com%3a52311&AdapterInfo=00-50-56-a0-3b-63_IPADDRESSHERE_0'
   Unrestricted mode
   Configuring listener without wake-on-lan
   Registered with url 'https://127.0.0.1:52311/cgi-bin/bfenterprise/clientregister.exe?RequestType=RegisterMe60&ClientVersion=10.0.9.21&Body=1083399088&SequenceNumber=7489&MinRelayVersion=7.1.1.0&CanHandleMVPings=1&Root=http://rootserver.domain.com%3a52311&AdapterInfo=00-50-56-a0-3b-63_
   Registration Server version 10.0.9.21 , Relay version 10.0.9.21
   Relay does not require authentication.
   Client has an AuthenticationCertificate
   Using localhost. Parent Relay selected: rootserver.domain.com:52311. at: 999.100.1.999:52311 on: IPV4 (Using setting IPV4ThenIPV6)
At 14:10:11 -0300 - 
   Entering Service Loop.
   Starting Service Loop.
   A2AServer::Start().
   FAILED to Synchronize - General transport failure. - 'http://127.0.0.1:52311/cgi-bin/bfenterprise/BESGatherMirror.exe?url=http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/actionsite&Time=15Jan14:10:11&rand=feaef22a&ManyVersionSha1=84cbb8cb http failure code 503 - gather url - http://127.0.0.1:52311/cgi-bin/bfenterprise/BESGatherMirror.exe?url=http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/actionsite&Tim=5a1:01&adfae2&MyVerionSha1=84cbb8cb3aaa7a
   Successful Synchronization with site 'mailboxsite' (version 9) - 'http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/mailboxsite1083398'
   Successful Synchronization with site 'BES Support' (version 1486) - 'http://sync.bigfix.com/cgi-bin/bfgather/bessupport'
   Successful Synchronization with site 'BigFix Labs' (version 55) - 'http://sync.bigfix.com/cgi-bin/bfgather/bigfixlabs'
At 14:10:12 -0300 - 
   Successful Synchronization with site 'CustomSite_Windows' (version 117) - 'http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/CustomSite_Windows'
   Successful Synchronization with site 'Enterprise Security' (version 4316) - 'http://sync.bigfix.com/cgi-bin/bfgather/bessecurity'
   Successful Synchronization with site 'IBM License Reporting' (version 155) - 'http://sync.bigfix.com/cgi-bin/bfgather/ibmlicensereporting'
   Successful Synchronization with site 'Patching Support' (version 1083) - 'http://sync.bigfix.com/cgi-bin/bfgather/patchingsupport'
At 14:10:13 -0300 - 
   Successful Synchronization with site 'Updates for Windows Applications' (version 2074) - 'http://sync.bigfix.com/cgi-bin/bfgather/updateswindowsapps'
   Successful Synchronization with site 'Virtual Endpoint Manager' (version 70) - 'http://sync.bigfix.com/cgi-bin/bfgather/virtualendpointmanager'
   Successful Synchronization with site 'opsite10' (version 352) - 'http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/opsite10'
   ActiveDirectory: Refreshed Computer Information - Domain: Domain
At 14:10:14 -0300 - 
   [ThreadTime:14:10:11] SetupListener success: IPV4/6
   Encryption: optional encryption with no certificate; reports in cleartext
At 14:10:15 -0300 - 
   Report posted successfully
At 14:11:16 -0300 - 
   Successful Synchronization with site 'actionsite' (version 1403) - 'http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/actionsite'
At 14:23:18 -0300 - 
   Report posted successfully
At 14:31:21 -0300 - 
   Report posted successfully
At 14:39:23 -0300 - 
   Report posted successfully
At 14:47:21 -0300 - 
   Report posted successfully
At 14:55:32 -0300 - 
   Report posted successfully
At 15:05:54 -0300 - 
   ForceRefresh command received.  Version difference, gathering action site.
At 15:05:56 -0300 - 
   Successful Synchronization with site 'actionsite' (version 1403) - 'http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/actionsite'
   Gathering all operator/mailbox sites.
At 15:05:57 -0300 - 
   Successful Synchronization with site 'mailboxsite' (version 9) - 'http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/mailboxsite1083399088'
At 15:05:58 -0300 - 
   Successful Synchronization with site 'opsite10' (version 352) - 'http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/opsite10'
At 15:05:59 -0300 - 
   Error posting report to: 'http://127.0.0.1:52311/cgi-bin/bfenterprise/PostResults.exe' (General transport failure.
'http://127.0.0.1:52311/cgi-bin/bfenterprise/PostResults.exe' http failure code 503)
At 15:07:06 -0300 - 
   ForceRefresh command received.  Version difference, gathering action site.
At 15:07:08 -0300 - 
   Successful Synchronization with site 'actionsite' (version 1403) - 'http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/actionsite'
   Gathering all operator/mailbox sites.
At 15:07:10 -0300 - 
   Successful Synchronization with site 'mailboxsite' (version 9) - 'http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/mailboxsite1083399088'
At 15:07:12 -0300 - 
   Successful Synchronization with site 'opsite10' (version 352) - 'http://rootserver.domain.com:52311/cgi-bin/bfgather.exe/opsite10'
At 15:07:25 -0300 - 
   Error posting report to: 'http://127.0.0.1:52311/cgi-bin/bfenterprise/PostResults.exe' (General transport failure.
'http://127.0.0.1:52311/cgi-bin/bfenterprise/PostResults.exe' http failure code 503)
At 15:09:48 -0300 - 
   Full Report posted successfully
At 15:15:37 -0300 - 
   Report posted successfully
At 15:23:28 -0300 - 
   Error posting report to: 'http://127.0.0.1:52311/cgi-bin/bfenterprise/PostResults.exe' (General transport failure.
'http://127.0.0.1:52311/cgi-bin/bfenterprise/PostResults.exe' http failure code 503)
At 15:30:54 -0300 - 
   Report posted successfully
At 15:36:37 -0300 - 
   ActiveDirectory: User logged in - Domain: Domain User: 9999999
   ActiveDirectory: Refreshed User Information - Domain: Domain User: 09999999
At 15:36:41 -0300 - 
   User interface process started for user '9999999'
At 15:38:40 -0300 - Patching Support (http://sync.bigfix.com/cgi-bin/bfgather/patchingsupport)
   Fixed - Task: Windows Update Service - Start the service (fixlet:12003)
   Relevant - Task: Windows Update Service - Stop the service (fixlet:12004)
At 15:38:46 -0300 - 
   Report posted successfully
At 15:40:09 -0300 - 
   Client shutdown (Service manager stop request)

The last part, “Client shutdown (Service manager stop request)”, happens all the time, but when I check the server the service is running.
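To line up the 503 failures in the client log with the errors in the relay log, the timestamps can be pulled out with a short script. A minimal sketch, assuming the client log keeps the "At HH:MM:SS -0300 -" header format shown above; the function name is made up for illustration.

```python
import re

# Header lines look like "At 15:05:59 -0300 - " and may carry extra text.
TIMESTAMP_RE = re.compile(r"^At (\d{2}:\d{2}:\d{2}) [+-]\d{4} -")
FAILURE_RE = re.compile(r"http failure code (\d+)")


def find_http_failures(lines):
    """Return (time, status) pairs for every 'http failure code N' entry."""
    failures = []
    current_time = None
    for line in lines:
        stamp = TIMESTAMP_RE.match(line)
        if stamp:
            # Remember the most recent header; following lines belong to it.
            current_time = stamp.group(1)
        failure = FAILURE_RE.search(line)
        if failure:
            failures.append((current_time, int(failure.group(1))))
    return failures


# Usage (path is a placeholder for the client's daily log file):
# with open("20240115.log", encoding="utf-8", errors="replace") as fh:
#     for when, status in find_http_failures(fh):
#         print(when, status)
```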

Any idea where I should start investigating? It’s worth noting that after this happens, the clients on this relay choose their failover relay or the root server.

The event viewer does not display any BESRelay stoppage.

This thing has been going on for so many months that I started losing my hair.

Anyway, I hope to find a solution to this.

It still looks like an endpoint security tool issue to me, preventing the Relay service from launching threads.

The two basic approaches I know of are either to start removing antivirus, EDR, etc. from the Relay until it starts working, or to build a new, clean Relay server with just the OS and BigFix Relay, verify it works as expected, and then start layering on your GPOs, antivirus, EDR, etc., one application at a time, until you find the one that breaks it.
Once you know which application is causing the issue, that should point us in the direction of how to configure whatever policy you need.

That said, I’ve never seen this issue before. All I can say for sure is that it is not expected behavior, and we would want you to reproduce it starting from a “clean” system.
