40k Client Infrastructure going down and won't come back - TCP Connection Timeout

We have been dealing with this for 3 weeks and are out of ideas after working with support. We have a Windows Server 2019 install that has been up for about 45 days, with 19 relays (all Windows Server 2019) on 10.0.1.

We had the clients auto-selecting relays and had it relatively well balanced, under 5,000 clients per relay, with a fallback relay set and advertisement lists hitting all lower relays, selecting the top-level relays if the lower levels are busy, and finally going to the fallback if those are busy as well. The first symptom was that we could not connect to the Console: we would get timeouts and "Connection refused" errors, and after repeatedly trying to log back on we realized we had 14k clients registered directly against the root server. We were eventually able to bring the relays back up by shutting down about 40% of the environment and firewalling off our root server to nothing but the fallback relay. Our DB is healthy according to L3 support. Our Minimum Report Interval is set to 180. We have even firewalled off the root server to talk only to the top-level relays, and even in this state we are not always able to connect to the Console while on the root server. We are seeing lots of timeouts and pump socket errors in the FillDB log.

Our entire infrastructure, 40k clients, is going down. Why might clients be getting rejected by FillDB? We see mostly timeouts, pump socket errors, and the following in the FillDB log:

Unable to parse chunk of compressed file in buffer; discarding chunk. (Client report has no verified signer. Discarding message from computer 1608209)

We are also unable to connect to the console. We have increased TCPTimedWaitDelay in the registry to 30. We have even increased the number of available ports since we suspected port exhaustion, but effectively we are just having communication issues. We had our network team do a pcap, and we can see the plugins going red in the BigFix Diagnostics tool.
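For reference, the tweaks we made were along these lines (a rough sketch of what we set; the port-range numbers are examples, not necessarily our exact values):

# Rough sketch of the registry/port tweaks described above (run as admin; TcpTimedWaitDelay needs a reboot to apply)
$params = 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters'
Set-ItemProperty -Path $params -Name 'TcpTimedWaitDelay' -Value 30 -Type DWord   # seconds a socket sits in TIME_WAIT
# On Server 2019 the ephemeral (dynamic) port range is widened with netsh rather than the legacy MaxUserPort value:
netsh int ipv4 set dynamicport tcp start=10000 num=55535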

We have been able to shut down 40% of the clients by stopping them with SCCM and slowly bring the relays back one by one, but it is only stable briefly before cascading to complete failure under load.

There seems to be a cascading failure effect where a relay gets overwhelmed and everything fails upward; it happens within an hour. Even when we lock down the firewall so the root server talks only to the top-level relays, we can only intermittently connect to the Console and do not see client data making it up; the plugin services just seem non-operational. Things just stop reporting and the console stops updating. Getting desperate and hoping some heavy hitters from the forums might be able to help @jgstew @Aram @AlanM. We do have a remote database, and I have never seen a BigFix implementation show this kind of cascading failure, even when overloaded. Any ideas or input would be great. Happy to provide any further data as well. Thanks!

How many relays?
How many layers of relays?
Are all the relays Windows?
Does your Windows AV have exclusions so it does not scan the Relay file spaces?
Single server DB, or remote DB?

How many relays?
19
How many layers of relays?
We are now at a top tier of 2 relays plus a fallback, a secondary tier of 4 relays, and 13 relays below that. All relays have 4 cores, 8-16 GB RAM, and disk on enterprise datacenter clusters.
Are all the relays Windows?
Yes. All Windows Server 2019
Does your Windows AV have exclusions so it does not scan the Relay file spaces?
We removed all AV to troubleshoot with no change.
Single server DB, or remote DB?
Remote DB on an enterprise cluster (Availability Group) with 64 cores. We had our DB teams analyze the DB and all is fine performance-wise. The data just stops getting there because the plugin services start timing out on requests under load.

Are your root server and top-level relays on gigabit or faster networking? I ask because if there are any 100 Mbps links involved I'd look at network speed & duplex.
I've also seen at least one report where Server 2019 may put inappropriately low TCP timeouts on the BigFix connections due to the new TCP Transport Filters in the network stack, which can cause some mayhem; let me check the details on that and I'll post back shortly on some things you could try.

These settings for TCP Transport Filters & Tuning are very deep in the Windows OS and should be used with care. Be sure to document what you change and be ready to back them out. Here's a description for background information:
https://argonsys.com/microsoft-cloud/library/tcp-templates-for-windows-server-2019-how-to-tune-your-windows-server-transports-advanced-users-only-😉/

You can see the settings that apply to each Traffic Template via the following PowerShell command:
Get-NetTCPSetting | Format-List -Property SettingName, MinRto

With the Relay or Root Server service running, you can see which Profile is being dynamically selected and applied to the TCP connections:

Get-NetTCPConnection -LocalPort 52311 -State Established
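To pull out just the applied template for each connection, you can select the AppliedSetting property explicitly (this assumes there are established connections at the moment you run it):

# Show which TCP setting template Windows has applied to each established BigFix connection
Get-NetTCPConnection -LocalPort 52311 -State Established |
    Format-Table -Property LocalAddress, LocalPort, RemoteAddress, RemotePort, AppliedSetting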

If the MinRto value is being set too low for the connection, you'd see a high rate of TCP Retransmissions in that WireShark capture. You can use this display filter to see the TCP Retransmissions in WireShark:

tcp.analysis.retransmission
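If you'd rather get a quick count from the capture file than scroll through it in the GUI, tshark can apply the same display filter (this assumes tshark is installed and the capture is saved as capture.pcapng):

# Count retransmitted packets involving the BigFix port in a saved capture
tshark -r .\capture.pcapng -Y "tcp.analysis.retransmission && tcp.port == 52311" | Measure-Object -Line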

If you find there are too many TCP retransmits on port 52311/tcp, that could indicate that the MinRTO value being used is too small. You can read some background on configuring the transport settings at https://docs.microsoft.com/en-us/powershell/module/nettcpip/set-nettcpsetting?view=win10-ps
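For completeness: the built-in Internet and Datacenter templates are read-only, so if you wanted to change a MinRTO directly you would edit one of the *Custom templates and then point a transport filter at it (the filters below simply map BigFix traffic to the built-in Internet template, which is usually sufficient). A rough sketch, with 300 ms as an example value only:

# Example only: raise the minimum retransmission timeout on the custom datacenter template.
# The built-in Internet/Datacenter templates cannot be modified, so changes go on a *Custom template.
Set-NetTCPSetting -SettingName DatacenterCustom -MinRtoMs 300
Get-NetTCPSetting -SettingName DatacenterCustom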

The following PowerShell statements can add a filter that will force the BigFix traffic to use the "Internet" connection profile, with its longer MinRTO timeout values. You may need to configure this on the Root Server and on the Top-Level Relays:

Filter for Outbound BigFix connections:

New-NetTransportFilter -SettingName Internet -LocalPortStart 0 -LocalPortEnd 65535 -RemotePortStart 52311 -RemotePortEnd 52311

Filter for Inbound BigFix connections:

New-NetTransportFilter -SettingName Internet -LocalPortStart 52311 -LocalPortEnd 52311 -RemotePortStart 0 -RemotePortEnd 65535
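To confirm the filters are in place, and to back them out later if needed (per the caution above), something like this should do it:

# List the transport filters currently defined
Get-NetTransportFilter
# Remove the Internet-template filters again if you need to revert the change
Remove-NetTransportFilter -SettingName Internet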

If you can post your case number I'd be happy to have a look at it as well.

Thank you. I will look at this. CS0179292 if you have some time. Again, much appreciated.

Get-NetTCPConnection -LocalPort 52311 -State Established

Get-NetTCPConnection : No matching MSFT_NetTCPConnection objects found by CIM query for instances of the ROOT/StandardCimv2/MSFT_NetTCPConnection class on the CIM server: SELECT * FROM MSFT_NetTCPConnection WHERE ((LocalPort = 52311)) AND ((State = 5)). Verify query parameters and retry.
At line:1 char:1
+ Get-NetTCPConnection -LocalPort 52311 -State Established
    + CategoryInfo          : ObjectNotFound: (MSFT_NetTCPConnection:String) [Get-NetTCPConnection], CimJobException
    + FullyQualifiedErrorId : CmdletizationQuery_NotFound,Get-NetTCPConnection

OK, you'll have to run the command while you actually have an established connection. It may take a few tries, but with 40k endpoints I'd expect a number of open connections.
Are you running BigFix on the default port, 52311?

My mistake. I had firewalled off the root server. I reran it and it showed established connections. We actually already cut everything over to the Internet RTO settings with L3 Support, and it does not seem to have resolved our issue. The really wild thing to me is that we have firewalled all clients off except 4 relays, and we can view the connections to the root server in Resource Monitor. The server plugins become unresponsive:

Test Failed: Client Register Plugin
Reason: The HTTP request failed with error HTTP Error 28: Timeout was reached: Connection timed out after 10001 milliseconds

Test Failed: Post Results Plugin
Reason: The HTTP request failed with error HTTP Error 28: Timeout was reached: Connection timed out after 10001 milliseconds

Test Failed: BESGatherMirror Plugin
Reason: The HTTP request failed with error HTTP Error 28: Timeout was reached: Connection timed out after 10001 milliseconds

Test Failed: BESGatherMirrorNew Plugin
Reason: The HTTP request failed with error HTTP Error 28: Timeout was reached: Connection timed out after 10001 milliseconds

Test Failed: BESMirrorRequest Plugin
Reason: The HTTP request failed with error HTTP Error 28: Timeout was reached: Connection timed out after 10001 milliseconds

Test Not Run: Checking that TCP/IP is enabled on SQL server
Reason: The server is configured to use a remote database

Result

6 out of 12 tests passed, 1 ignored

This is likely due to: https://bigfix.me/fixlet/details/3959

You need to make sure to set this timeout to 120 seconds, instead of the default 10 seconds, on the root server and all console systems.

Until this timeout is fixed, nothing is going to work.

If you set the new timeout and you get the same error, just after 120 seconds instead, that suggests something is down entirely. Check the connection from the root server to the DB.
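A quick sanity check of that path from the root server would be something like this (assuming the default SQL listener port of 1433; substitute your actual server name and port):

# Verify the root server can reach the remote SQL Server listener
Test-NetConnection -ComputerName sqlserver.example.com -Port 1433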

What is the connectivity between the root server and the DB server?

This doesn't sound like an ideal configuration for BigFix generally due to latency, but it seems like the bigger issue at the moment is timeouts and failures.

Thanks, @jgstew. I actually saw that fixlet earlier and tried it out. Still having issues, so we are going to try out a local DB, as that has been a really stable experience for me with BigFix in comparable environments. I will update here when complete.

We moved the database locally and there was no change. What we landed on with support was that the Web Reports server, which was co-located, was spawning excessive (2,250+) TCP connections on 127.0.0.1. That was ultimately filling up the available connections and making the root server services non-responsive. We disabled Web Reports and you can watch the TCP connections drop. Keep in mind Web Reports was not in use at all. If you want to see what I am talking about, try running netstat -abn | find /c "127.0.0.1" on a server where Web Reports and the root server are co-located, then disable the Web Reports service and watch that number drop dramatically. Ultimately, we are moving Web Reports off and HCL is looking at the reason for this.
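If you prefer PowerShell for the same check, this is roughly equivalent (it simply counts TCP connections bound to the loopback address, most of which were the Web Reports connections in our case):

# Count loopback TCP connections on this server
(Get-NetTCPConnection -LocalAddress 127.0.0.1 -ErrorAction SilentlyContinue | Measure-Object).Count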
