BigFix clients not receiving UDP communication from the server

Hi,

My team and I are currently conducting a POC with BigFix within our business and have come to a standstill, as we’re having some agent communication issues (particularly with UDP).

Send refresh commands and tasks from the server are not reaching the BES Client via UDP, and upon investigating with Wireshark we’ve found this to be down to a bad UDP payload length:

This occurs immediately after the client receives the initial UDP packet with destination port 52311, and there is no further communication between the client and server.

During testing we suspected this could be due to MTU fragmentation, however adjusting the MTU did not bear any fruit. We then decided to test whether communication was being affected by an internal security product, so we installed Windows from a vanilla ISO onto a virtual machine, joined it to the domain and pushed the BES Client to it - and it worked. UDP notifications were suddenly arriving.

The issue was recreated in two other instances where the machines did not go through our in-house customisation steps in MDT / Windows Deployment. We’re scratching our heads as to the cause, as all steps customising the OS were disabled, with the exception of the built-in scripts running within MDT.

This may already be a case of TL;DR, but I’m reaching out to this forum for any ideas on what we could try to fix this communication issue.

In short: we built one virtual machine and ran it through the MDT build process steps - UDP communication fails on it.
We built another VM (using the same ISO that MDT uses) but installed Windows manually, without running it through MDT - this one communicates properly.

When comparing the two, I could not find any differences in NIC settings or firewall - a complete mind-boggler. We’re at the point of presuming that MDT’s built-in scripts are customising Windows in some unfavourable way, but at this point it’s guesswork.

2 Likes

This is very odd. I’m not really sure what would be causing this, and it isn’t easy to investigate except to take a problematic system and just eliminate possible causes one by one.

I do have a fixlet to measure the time since last UDP: https://github.com/jgstew/bigfix-content/blob/master/fixlet/Test%20time%20since%20last%20UDP%20message%20-%20Universal.bes

The idea is that running the fixlet against a set of computers should generate a UDP message while also returning how long ago the last one was received, so you should get a very small number back, usually between -2 and 2 - unless, of course, UDP isn’t working at all, in which case the action will take a long time and return either a large number or -999. I should note that it is normal to see a delay between the UDP message and the action running if another action is already executing on the system and happens to take a while. That is NOT a sign of an issue.

If UDP notifications are not working, then you should not only investigate the cause, but also enable command polling on all affected machines and set the interval to roughly 1 hour.
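For reference, command polling is controlled by a couple of client settings along these lines (these are the usual setting names, but double-check against the current docs before rolling them out widely):

    _BESClient_Comm_CommandPollEnable = 1
    _BESClient_Comm_CommandPollIntervalSeconds = 3600

You can apply them from the console via right-click > Edit Computer Settings on the affected machines.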

I also have a relevance property that measures the time since the last UDP message; I need to find my improved version.

1 Like

I’m not certain that the packet in your capture has anything to do with BigFix?

Besides the source/destination ports not looking like BigFix traffic, the UDP length that is flagged as incorrect (12,471 bytes) is far larger than anything BigFix would need for a ‘force refresh’ command.

Is there any kind of VPN or IPSec tunneling that might be in play?

Just to clarify a few things: I work with @BenM. We have tested this on a variety of machines, including VMs and physical machines. When an empty action or manual refresh command is sent from the console, the UDP packet is received on port 52311 on each machine we have tested, regardless of how it was built.

On the VMs, as @BenM says, the next sequential packet after the initial UDP communication from the server is a UDP response sent back to the BigFix server on some random port with the ‘Bad length value’ error shown in the original post. There is no further communication between the client and server.

On physical machines we see the same UDP error, except that there is a lot of subsequent TCP communication, including the correct response on 52311. However, the ‘Last Report Time’ is still not updated.

We build our machines from a vanilla MS Windows 10 image which is imported into MDT; we do not use a reference image. We also have a VM and a physical machine that were manually built from a standard Win 10 Enterprise MS image. If we run a capture on those while sending an empty action, the UDP ‘Bad length’ packet is NOT sent from the client to the server, the TCP response is sent, and the ‘Last Report Time’ updates on demand. So essentially we have to figure out why this packet is being sent, as it seems to be the key.

We have done a lot of testing and have ruled out GPO, Windows Firewall, web proxy, antivirus and the other security software we use. We built another VM via MDT with just a default task sequence applied - literally the bare minimum needed to install a working OS - and manually joined it to the domain in the same place as the other test VMs. We see the same issue as on the other machines not manually built from an MS ISO, i.e. the initial UDP packet is received on port 52311, the next packet attempts to send back to the BigFix server on a random port, a TCP response is sent to the server on 52311, and the Last Report Time does not update.

We have also used a Dell OEM out-of-the-box image, manually joined it to the domain in the same place as the other test machines, and it also shows the UDP error. We are already working closely with HCL to try to understand this, but if anyone has any ideas, they are very welcome to let us know, as we are running out of them! It’s a shame, as BigFix looks to be a great product that would be super helpful if we can overcome this issue!

So the UDP packet in the earlier screenshot is sent from the client to the server, and is being captured on the client?

I’d warn that the error shown in the capture could be an artifact of how Wireshark interacts with the Windows network stack. If the network driver supports Large Send Offload or any kind of UDP offload, the UDP packet sizes and checksums are calculated on the network card itself, after Wireshark has already seen the packet with bogus placeholder values. So Wireshark can show spurious errors there that aren’t real. You could try capturing the packet “on the wire” with a port mirror to see the difference between how Wireshark on Windows sees the packet and what’s actually sent on the network.
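If you want to quickly see whether offload is in play on a suspect machine, the NetAdapter PowerShell module (built into Windows 8/Server 2012 and later) can report it; a rough sketch, with “Ethernet” as a placeholder adapter name:

    Get-NetAdapterLso -Name "Ethernet"
    Get-NetAdapterChecksumOffload -Name "Ethernet"

Disable-NetAdapterLso / Set-NetAdapterChecksumOffload can turn those features off temporarily while you capture, though the port-mirror comparison above is the more reliable test.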

(This is the same reason Wireshark can show checksum errors on outbound packets, and I see you already have checksum validation turned off in that capture.)

This isn’t to say there’s no problem, just that you shouldn’t get too wrapped up in that outbound UDP message. The client would not normally send a UDP message to the server at all - when a UDP Gather or ForceRefresh message is received, the client should initiate a TCP connection to its relay or server; no UDP reply is necessary.

Am I understanding correctly that this only affects your domain-joined clients, regardless of how they are initially built? I’d still be looking at Windows Firewall, both the allow/deny rules and the Connection Security (IPSec) rules, and maybe at the content of that UDP message from the client (aside from the size warning from Wireshark).
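If it helps, a quick way to dump what the firewall and connection security layers think they are doing on a broken machine versus a working one (just a sketch; search the output files for 52311 or anything IPSec-related):

    netsh advfirewall firewall show rule name=all > firewall-rules.txt
    netsh advfirewall consec show rule name=all > consec-rules.txt

For the packet content, a Wireshark display filter of ip.addr == <BigFix server IP> && udp (the server IP is a placeholder) should show both the inbound notification and that odd reply in one conversation.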

Hi Jason,

Thanks for the response and the explanation regarding Wireshark. I’m no networking expert, so that is useful info. Yes, HCL advised us to try turning off checksum validation as a test. It made no difference (not surprisingly, as it is on by default on all machines, including those built from the MS ISO that work).

We’re only seeing this issue on machines built from an MDT image of the vanilla OS imported from the MS ISO (we don’t use a reference image taken from a single machine) or on an out-of-the-box manufacturer image manually joined to the domain. It does not affect machines built manually from the MS ISO and manually joined to the domain. Regrettably, we have already tried adding explicit rules to allow UDP 52311 inbound and TCP 52311 outbound, and have also tested with Windows Firewall turned off completely - the error persists.

We’ll take another look at the content of the packet itself and see if I can get the network team involved again.

Cheers,

Steve

1 Like

Turning off checksum validation is just an option in how Wireshark displays the packets; it doesn’t have any effect on the actual network communication. It simply hides Wireshark’s warnings about checksum mismatches (Wireshark sees a mismatch because the actual checksums are calculated and added to the packet by the NIC after Wireshark has seen the packet come through the Windows network stack).

Just to check, you are using the default BigFix port 52311?

Are these machines all VMs? If so, what kind of virtual network are they using? Specifically, is it a NAT interface or a bridged interface, and are they the same network type between all of the test cases?

It’s been years since I used MDT, but my recollection was that a default MDT task sequence included some hardening using SCM templates. I’ve not known those to interfere before, but that might be worth checking.

Another possibility, though admittedly rare, might be a bad network card driver. MDT would automatically load a driver based on the driver library in your deployment share, and the vendor image might include OEM drivers in its media. Do the network driver versions look the same, or different, between your two deployment methods?
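If it’s quicker than clicking through Device Manager on each box, something like this in PowerShell (assuming the built-in NetAdapter module) will list the driver details for comparison:

    Get-NetAdapter | Format-List Name, InterfaceDescription, Driver*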

It sounds like you’ve already covered the Windows Firewall piece. As a last check there, though, I’d bring up Resource Monitor (from the Task Manager -> Performance tab). In the Network tab, expand the bottom pane for ‘Listening Ports’. It should show BESClient.exe listening on udp/52311. The right-most column shows the firewall status, which you’d want to be “Allowed, Not Restricted”.
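The command-line equivalent, if that’s easier to compare across machines (the PID is a placeholder for whatever netstat returns):

    netstat -ano -p UDP | findstr ":52311"
    tasklist /FI "PID eq <pid from netstat>"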

I’m convinced we’re missing something in the configuration. Since this is a proof-of-concept for you, are you engaged with our TA team who can help with planning and initial setup?

Leaving aside the UDP communication (for the moment), we can also look at the workarounds for UDP traffic blocking, which include Command Polling (where the client checks in with the relay on a regular schedule to look for new actions/content) or Persistent Connections (where the client keeps an open TCP connection to the relay).

One more thought, in the same Resource Monitor / Network tab, is to make sure that udp/52311 is held by BESClient and not by some other process. It would be rare for this to be a problem on a client, especially a new client, but on high-load servers like DNS servers (which reserve several thousand UDP ports at startup) another process could grab port 52311 before BigFix starts up. If that turns out to be a problem, we do have a task in the BES Support content site to reserve the port, taking it out of Windows’ “ephemeral” port range.
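If you do need to go down that road, I believe the underlying Windows mechanism these days is the excluded port range, which you can inspect and set by hand (the BES Support task may well implement it differently, so treat this as a sketch only):

    netsh int ipv4 show excludedportrange protocol=udp
    netsh int ipv4 add excludedportrange protocol=udp startport=52311 numberofports=1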

Thanks @JasonWalker. Yes, we are using the default port 52311. No, not all the machines are VMs. We have a test VM that doesn’t work (built from a basic MDT image) and one that does (built from an MS ISO). We also have physical laptops with the UDP issue that were built from an MDT image, and a physical machine built from the MS ISO where, again, comms work correctly.

The VMs are all VMware ESXi 6.7 VMs with default settings (bridged interface - they all have their own IP, not that of the VM host). We have looked at the drivers: a VM built via MDT (which does not communicate correctly) and a VM built from the MS ISO with the same build (which does communicate correctly) have the same network driver. As these are VMs they can’t load our device-specific drivers for the physical laptops, so I guess these are the default drivers that come with the OS - either way, I think we’ve ruled that out based on the above.

I’ve checked Resmon on a machine with the problem and it does indeed show the BES Client listening on UDP/52311 with a firewall status of “Allowed, Not Restricted”. We’ve also checked that no other process is listening on UDP/52311.

We don’t have any SCM templates as far as I can determine. We are indeed dealing with HCL’s pre-sales techies, who are aware of the problem but also scratching their heads, having not seen this before. They have given us a fixlet for command polling as a workaround, but obviously we need to figure out the problem before we can realistically consider purchase.

I’m speaking with our network guys to run a trace across the wire later, so we’ll see how that goes.

This is indeed a head-scratcher. I’m suspecting that some other endpoint security / antivirus / HIPS system is in effect and possibly blocking the traffic.

Going back to that earlier packet capture: when the client doesn’t accept the UDP message and sends back its own UDP message, is the destination port always 21297? Is the destination IP address on that packet the client’s server or relay? Can we check that destination server/relay and see what process, if any, is listening on udp/21297? Is this client itself listening on udp/21297, and if so, what process is that?

I could speculate that perhaps you have another endpoint security product that only allows client-to-client communication if the sender is also running the security product, and that the UDP reply might be a way of checking that.

Yes, it appears it is replying from port 16965 to port 21297 on each machine where this is a problem. It is sending back to the BigFix server.

We had Gwyn in for a couple of days last week working with your engineering team to troubleshoot and try to get to the bottom of this, and the source of this UDP response could not be traced. What he has done to get round it is set up a local relay and enable the persistent connection feature, which seems to be doing the trick.

In the end it transpired that it was CyberArk EPM causing the issue. It was not originally identified because the VM that was communicating correctly also had CyberArk installed. We added an exception in the CyberArk EPM agent configuration for the file location of the BESClient, and the UDP comms errors on the ports mentioned above ceased; the Last Report Time updated within seconds of sending a refresh.

Great to hear!

Any idea of the feature name or documentation link of what CyberArk was actually trying to do? I’m guessing it’s something along the lines of the CyberArk agent on the endpoint trying to communicate with a CyberArk agent on the Relay to see if the incoming connection is from a trusted machine, or something like that? Just a guess on my part though.

TBH, I don’t know. CyberArk is only installed on the client machines (not on the BF server), so I’m not sure that would be the case. I tried configuring an exception in CyberArk to allow the BESClient.exe service as a Trusted Software Distributor, as we already have set up for N-Central (another product we use for software distribution), but that made no difference, so I went for the sledgehammer fix of excluding it from inspection by the CyberArk agent entirely, since we trust the app. By the time our CyberArk CSM had added me so I could raise a support ticket with them, I’d figured it out for myself, and as this is a PoC, that was good enough for me!

2 Likes