Endpoints not reporting in large quantities

heagsta · October 1, 2019, 9:24pm

Hey folks, I'm having an issue where tasks are not getting down to my endpoints. This has happened occasionally before, but it has been pretty consistent this entire week. I am using Server Automation; let me know if you would like this moved there. I didn’t file it under SA because the plan itself runs just fine as far as I can tell. The steps target a group, and most of the targets come back as not reported. I am targeting a group of around 1,600 servers, and it normally runs just fine. This particular task just uploads the Computer ID and a few other details, 1k through the upload manager. Most of the 1,600 are not getting the action at all. Not sure where to start troubleshooting this one. In the morning, if we re-run against the endpoints that did not report by targeting them in a manual group, they run just fine. If I take a blank action on all 1,600 manually, they all return within minutes.

In besrelay.log on the main server I see socket errors when it talks to the 6 upper level relays, but they are frequent in the log and always have been. Any help would be appreciated, thank you in advance.

DBAs see only 75-85% utilization; however, the newer version seems to use the DB temp tables more than it used to.

9.5.11 for the main server and upper level relays (x6). Mixed versions below that, legacy OSs.
15,600 endpoints total

jgstew · October 1, 2019, 10:32pm

Is there a common relay for the 1600 that are having trouble? It could be that one of the relays in the chain from main server to the 1600 is getting intermittently overloaded, causing the problem.

Are the clients all getting UDP commands from their parent relay? https://bigfix.me/relevance/details/3021681

Anything else common about these endpoints? Are they all on the same subnet? Are they VMs on the same host or the same storage?

Do you have command polling enabled?
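
One way to check for a common relay is to export the affected computers with their parent relay and tally the failures per relay. The following is only a rough sketch: the tuple layout, computer names, and relay names are made up for illustration and do not come from any BigFix export format.

```python
from collections import Counter


def failures_by_relay(endpoints):
    """Count not-reported endpoints per parent relay.

    `endpoints` is an iterable of (computer_name, parent_relay, reported)
    tuples. This layout is an assumption for the sketch, not a BigFix format.
    Returns (relay, failure_count) pairs, highest count first.
    """
    counts = Counter()
    for _name, relay, reported in endpoints:
        if not reported:
            counts[relay] += 1
    return counts.most_common()


# Hypothetical sample data, for illustration only.
sample = [
    ("srv-001", "relay-a", False),
    ("srv-002", "relay-a", True),
    ("srv-003", "relay-b", False),
    ("srv-004", "relay-a", False),
]

for relay, failed in failures_by_relay(sample):
    print(relay, failed)
```

If one relay accounts for most of the failures, that relay (or something in the chain above it) is the place to look first; an even spread points away from a single overloaded relay.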

heagsta · October 1, 2019, 10:59pm

No, the failures are spread across about 6 different upper level relays. I checked UDP with that analysis, and all endpoints are receiving UDP just fine as of today. We added a few processors to the DB just in case, to see if that will yield any results. The not reported failures are at 800 locations spread across different subnets, with 2 servers per subnet, one of which is the local relay for that subnet.

itsmpro92 · October 2, 2019, 1:24am

What do you see in the log of an affected client? How about the besrelay log on one of the upper level relays? Are other SA plans working as expected?

heagsta · October 2, 2019, 4:01pm

The client never receives the action, from what I can see. Yes, the remainder of that plan and other plans run just fine. In the relay logs I just see the normal errors that have always been there, mostly find site ID for URL errors on the upper level relays, and very few Http Error 28 errors. On the main server, I do see frequent socket errors to the upper level relays. I am not sure what could cause those network errors, but they have been there for quite some time. Normally whenever I inquire about them, if I am not experiencing any issues at that time, I am told not to worry about them.

itsmpro92 · October 3, 2019, 12:25am

I wonder if there is some kind of network conflict at the time of this Plan’s execution. Have you considered enabling command polling on some of the affected machines? It is useful in situations where the UDP notification packet is not being received by the client (for any variety of reasons).

bradsexton81 · October 3, 2019, 12:01pm

My guess is you have a network issue. The best thing to do is send a blank action to those machines. If the majority do not complete that action in 30 minutes, it means you have something blocking UDP 52311. Check your firewall and make sure the Windows firewall isn’t turned on.

heagsta · November 21, 2019, 11:14pm

This was caused by security port scanning that overlapped my polling windows. After they moved their scans to a different time, it works flawlessly now. When in doubt, look at the security phantom scanning. Blank actions during the day worked just fine outside of the polling window.
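
For anyone hitting the same thing, comparing the scan schedule against the polling window is a quick check. Here is a minimal sketch of that comparison; the function names and the example times are hypothetical (the real schedules are not given in this thread), and it assumes both windows repeat daily.

```python
from datetime import time


def _minutes(t):
    """Minutes since midnight for a datetime.time value."""
    return t.hour * 60 + t.minute


def _segments(start, end):
    """Split a daily window into non-wrapping [start, end) minute ranges.

    A window whose end is not after its start is treated as crossing
    midnight, so equal start and end means the whole day.
    """
    s, e = _minutes(start), _minutes(end)
    if s < e:
        return [(s, e)]
    return [(s, 24 * 60), (0, e)]


def windows_overlap(a_start, a_end, b_start, b_end):
    """Return True if two daily time windows share at least one minute."""
    return any(
        s1 < e2 and s2 < e1
        for s1, e1 in _segments(a_start, a_end)
        for s2, e2 in _segments(b_start, b_end)
    )


# Hypothetical schedules, for illustration only.
scan_window = (time(1, 0), time(3, 0))
polling_window = (time(2, 30), time(4, 0))
print(windows_overlap(*scan_window, *polling_window))
```

Windows that only touch at a boundary (one ends exactly when the other starts) are reported as not overlapping.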