We have many machines pointing to this relay, and this one is not the only one that randomly shows greyed out; I am just using it as an example. In fact, this happens to random machines regardless of whether they contact the primary server, the inside relay, or the DMZ relay. Again, the firewall is open on TCP/UDP 52311 both ways.
Now back to this machine, I am trying to deploy the 2 fixlets you mentioned but it’s just not getting to this particular client. No Gather log entries and send refresh seems to do nothing. I have restarted the BES service and even restarted the machine itself but no luck; action status is .
I would really like to get to the bottom of this, since even when we had an MSP handling our bfix instance they couldn’t figure out why we had this issue.
I see the issue; according to this screenshot, it takes your client an average of 1.5 hours to post a report, which suggests that it is busy with something.
You can use the link below to enable the debug & profiler logs manually.
I am curious, how did you get to 1.5 hours? I’ll check out that link.
Also, I was poking around the relay server log (logfile.txt) and there are many lines like this one but with different IPs; this is the one for the machine in question:
Tue, 23 Jan 2024 05:48:08 -0500 - /cgi-bin/bfenterprise/clientregister.exe (5704) - Uncaught exception in plugin ClientRegister with client 192.168.184.42: HTTP Error 55: Failed sending data to the peer: Connection died, tried 5 times before giving up
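Since the relay log contains many lines of this shape with different IPs, a quick tally per client can show whether a handful of machines or a large share of them fail to register. The following is only a minimal sketch: the regular expression is modeled on the single sample line quoted above, so treating every failure line as having exactly that format is an assumption, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the one sample line from logfile.txt quoted above.
# Other relay versions may word this message differently (assumption).
FAILURE = re.compile(
    r"Uncaught exception in plugin ClientRegister with client "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}): (?P<error>HTTP Error \d+)"
)

def count_register_failures(lines):
    """Return a Counter mapping client IP -> number of failed registrations."""
    counts = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts

# The sample line from this thread; in practice you would iterate over
# the lines of the relay's logfile.txt instead.
sample = [
    "Tue, 23 Jan 2024 05:48:08 -0500 - /cgi-bin/bfenterprise/clientregister.exe (5704) - "
    "Uncaught exception in plugin ClientRegister with client 192.168.184.42: "
    "HTTP Error 55: Failed sending data to the peer: Connection died, "
    "tried 5 times before giving up",
]

failures = count_register_failures(sample)
for ip, hits in failures.most_common():
    print(ip, hits)
```

Comparing the number of distinct IPs in the tally against the number of machines registered to the relay gives a rough idea of how widespread the registration failures are.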
It appears that your internet-facing client attempted to connect but was unsuccessful. There may be a number of reasons for this, but I don’t believe this error message has anything to do with your problematic computer. You can safely disregard that if the majority of your clients and your relay are able to report correctly in the console.
Common culprits would be some custom property that recurses through the whole disk using one of the ‘descendant’ inspectors, or something that collects hashes of many files or of large files.
I was looking at that last night, good article you put together. I’ll look at it more today.
Could it be a resource issue on the relay server? There are no resources alerts and we have 830 machines registered.
I was sorting the computers by last report time and by this particular relay, and I see that there are 466 machines on that relay and 55 with today’s date that are showing up gray. I did a spot check and some are not replying to ping, so they may not be available, but there are many that do reply and still show gray.
@vk.khurava I have 10 usageprofiler.txt.000#.log files on BES client folder. Any advice on how to read them or do I post them here?
If you examine these log files, you’ll see something like this below. The elapsed timing should be your primary focus; if it’s increasing in seconds or minutes, you need to verify the content & fix it.
0000) : Serial numbering
28.407 : Timing in seconds; here it’s 28 seconds
actionsite.1368 : Site name & content ID
Evaluate Property 1 : Appears to be a retrieved property evaluation
Background Evaluation : Fixlet/task/Baseline background evaluation
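To make the field breakdown above concrete, here is a minimal Python sketch that splits profiler lines into those fields and lists the slowest entries first. The exact line layout (serial, a closing parenthesis, seconds, then site.ID and the evaluation type separated by colons) is an assumption pieced together from the fragments quoted in this thread, and the names ProfilerEntry, parse_line and slow_entries are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

class ProfilerEntry(NamedTuple):
    serial: int
    seconds: float
    site: str
    content_id: int
    kind: str

# Assumed layout: "0000) 28.407: actionsite.1368:Evaluate Property 1"
# (whitespace around the colons is tolerated).
LINE = re.compile(
    r"^\s*(?P<serial>\d+)\)\s*(?P<seconds>\d+\.\d+)\s*:\s*"
    r"(?P<site>.+)\.(?P<content_id>\d+)\s*:\s*(?P<kind>.+?)\s*$"
)

def parse_line(line: str) -> Optional[ProfilerEntry]:
    """Parse one profiler line; return None for headers and other text."""
    match = LINE.match(line)
    if match is None:
        return None
    return ProfilerEntry(
        serial=int(match["serial"]),
        seconds=float(match["seconds"]),
        site=match["site"].strip(),
        content_id=int(match["content_id"]),
        kind=match["kind"],
    )

def slow_entries(lines, threshold_seconds=1.0):
    """Return entries at or above the threshold, slowest first."""
    parsed = (parse_line(line) for line in lines)
    slow = [e for e in parsed if e is not None and e.seconds >= threshold_seconds]
    return sorted(slow, key=lambda e: e.seconds, reverse=True)

# Illustrative lines built from values mentioned in this thread.
sample = [
    "0000) 28.407: actionsite.1368:Evaluate Property 1",
    "0001) 187.204: Enterprise Security.281736901:Background Evaluation",
    "Start:Wed, 24 Jan 2024 09:36:31 -0500",
]

for entry in slow_entries(sample, threshold_seconds=1.0):
    print(f"{entry.seconds:8.3f}s  {entry.site}  ID {entry.content_id}  {entry.kind}")
```

If your files use a different layout, only the LINE pattern should need adjusting.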
ok, so there’s definitely an issue with evaluations. First of all I found this query to use with q&a on this machine: Analyzing profiler output
lines whose (((following text of first ") " of preceding text of first "." of it) as integer) >= 1) of files whose (name of it contains "usageprofiler.txt" and modification time of it = (maximum of modification times of files whose (name of it contains "usageprofiler.txt") of parent folder of regapp "BESClient.exe")) of parent folder of regapp "BESClient.exe"
Looking at the last file with the above query (is that called a query?) it’s taking up to 280 seconds!
Start:Wed, 24 Jan 2024 09:36:31 -0500
Elapsed Time:01:00:04
Tracking: Top 100
Samples:8837
Elapsed Evaluation Time:00:36:11
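As a rough way to read this header, the two elapsed values can be turned into a share of the sampled window. This is a small sketch that assumes 'Elapsed Evaluation Time' is the portion of 'Elapsed Time' the client spent evaluating content; parse_hms is a helper name made up for illustration.

```python
from datetime import timedelta

def parse_hms(text: str) -> timedelta:
    """Convert an HH:MM:SS string from the profiler header to a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

elapsed = parse_hms("01:00:04")     # Elapsed Time
evaluating = parse_hms("00:36:11")  # Elapsed Evaluation Time

# Dividing one timedelta by another yields a plain float ratio.
share = evaluating / elapsed
print(f"{share:.0%} of the sampled window was spent evaluating")  # prints 60%
```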
So there’s obviously a big delay on this particular client, can we also presume that other clients will be experiencing the same delays? And this delay is what is causing the bes client to not properly report on time and make the machines look offline?
That longest-evaluating entry is from ‘actionsite’, i.e. the Master Action Site. That could be an Action, or a custom Fixlet/Task/Baseline/Analysis.
Because the evaluation type is ‘Background Evaluation’, we know that this is the Relevance of a fixlet/task/action, and not an Analysis Property. If this were a Property, it would have a message like ‘Evaluate Property 2’ on lines 26 & 27.
In the Console’s Fixlets/Baselines views, be sure to add a column for ‘ID’ and then you can see whether this is a Fixlet, Task, or whatever, and can see whether the Relevance for it can be optimized.
The next two long-running entries are from the ‘Enterprise Security’ site; this is the internal name of ‘Patches for Windows’. You won’t be able to do much to tune that, unless you’ve configured the clients to ‘EnableSupersededEvaluation’ and they’re continuing to spend time evaluating superseded fixlets; you could return that to the default of Disabled.
But there are some optimizations that HCL can, and does, do on external sites like Patches for Windows that apply to External Content but not so much to your Custom Content. For example, we can configure fixlets to only evaluate once a day, or every six hours, or other intervals for fixlets we know will be expensive; a fixlet in the external site that takes 180 seconds but only evaluates once a day, is much better than something that takes 120 seconds but re-evaluates for every client loop.
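The arithmetic behind that comparison can be sketched as follows. The 90-minute loop length is only an assumption for illustration (it is not a measured value for any particular client), and daily_cost_seconds is a made-up helper name.

```python
def daily_cost_seconds(eval_seconds: float, evaluations_per_day: int) -> float:
    """Total seconds per day a client spends on one piece of content."""
    return eval_seconds * evaluations_per_day

loop_minutes = 90                        # assumed length of one client loop
loops_per_day = 24 * 60 // loop_minutes  # 16 loops in a day

once_daily = daily_cost_seconds(180, 1)              # slower, but throttled
every_loop = daily_cost_seconds(120, loops_per_day)  # faster, but unthrottled

print(once_daily, every_loop)  # 180 1920
```

With a shorter loop the gap grows further, since the unthrottled item is evaluated more often while the throttled one still runs once.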
Your Row 1 entry 187.204: Enterprise Security.281736901:Background Evaluation corresponds to a Fixlet in Patches for Windows that is an update for Office 2010 released in 2014. It might be worth checking those problematic machines to see if they still have Office 2010 installed, or registry traces showing that it used to be; this relevance evaluation runs fast on my machine, but I can see it might take much longer if the machine actually had Office 2010 installed (because there are a lot more files/registry values to check, which are short-circuited if the Office 2010 registry paths don’t exist).
It’s really difficult to say what a ‘good’ evaluation time is, it’s all about your expectations. I’d suggest using the link I had earlier to retrieve the client performance analysis, import it, and activate it, so you can start getting a baseline of how your machines are performing.
What you’d want to watch for, are outliers (some machine evaluating much more slowly than others); as well as changes over time (good evaluation times in December and much longer times now might indicate a new bad property/relevance added).
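Once the analysis results are exported, spotting outliers can be as simple as comparing each machine to the median. This is a minimal sketch with invented machine names and cycle times, and the factor of 3 is an arbitrary starting point rather than a recommended threshold.

```python
from statistics import median

def find_outliers(cycle_seconds: dict, factor: float = 3.0) -> list:
    """Return machines whose evaluation cycle exceeds factor x the median."""
    typical = median(cycle_seconds.values())
    return sorted(
        name for name, seconds in cycle_seconds.items()
        if seconds > factor * typical
    )

# Hypothetical average evaluation cycle per machine, in seconds.
sample = {
    "pc-01": 900,
    "pc-02": 1100,
    "pc-03": 950,
    "pc-04": 5400,
    "pc-05": 1000,
}

print(find_outliers(sample))  # ['pc-04']
```

Running the same check on exports taken a month apart would also show the change over time mentioned above.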
Once you know your expected eval times, you can tune the Console’s graying-out to match your times. I am aware of some large customers with hundreds of thousands of computers, with a ton of properties, that take two or three hours to complete an evaluation cycle. Those properties are important to them though, so they live with those times and accept that the actions they issue can take longer to respond (generally they issue patching actions several days ahead of their maintenance windows anyway).
I also want to stress the importance of https://help.hcltechsw.com/bigfix/10.0/platform/Platform/Config/c_real_time_av.html - I have a customer who complained about a huge average evaluation cycle (more than 10 hours). We started with a clean image with just the BigFix Client, and the average evaluation cycle was up to 15 minutes, so we knew that something was causing the client to work a lot harder. After he installed Carbon Black, the average evaluation cycle skyrocketed to 15 hours.
Of course, at first he said that he had already added the exclusions as we asked, but we confirmed that he had not.
Always make sure to take the baseline with a clean image and the BigFix Client.
@JasonWalker thank you for all that information, I am trying to digest this all to make sense of it, improve our environment and learn along the way.
Now for some Q’s…
^I found ID 6170 in the Master Action Site and it was a Baseline created back in 2019 when we were first getting started with BigFix. It was no longer needed, so it was removed; that shouldn’t be an issue for any other BES client.
^I found the Analyses, I changed it from evaluating every hour to every day.
^I don’t see EnableSupersededEvaluation in the settings of this particular client, nor on other clients I spot-checked. I was trying to find a fixlet\task that would tell me which computers have it, but I was unable to. Is there such a fixlet\task, or how can I find out across the environment whether this is set? I searched the forum and it seems like it’s not a good idea to have this set; am I correct in this?
^So I was not finding this fixlet until I decided to click on Show Hidden Content and then Show Non-Relevant Content…that’s when I saw both ID’s 281736901 and 281736903.
There should be no machines with Office 2010, and both of these fixlets show that they’re applicable to 0 machines. Why would these come up in the logs as being evaluated? Then I searched for some more IDs from the log and there’s an old Adobe Reader DC fixlet from 2020, yet the machine in question has Reader XI. Why would these old fixlets from Patches for Windows or Updates for Windows Applications be scanned by the BES client? These applications are not even on the machine I am currently troubleshooting.
I know that’s a lot of Q’s but thanks for your time.
@orbiton thanks for that link. I am not going to say that I remember having any exclusions configured, but when I checked…there were NO exclusions anyways. Let’s just say I’ve added them to the server side and will have another team exclude for the desktop side. This should definitely help.
So I’ve been looking at the Patches for Windows external site (we do not have a custom site for patches for windows), and when showing non-relevant content there are 20557 fixlets and tasks! I sorted by applicability and they go back to 1998!!! Going back to why the client would be wasting valuable time evaluating old non-relevant fixlets: is it normal practice to go into Patches for Windows, show non-relevant content, select all the non-relevant content, right-click and select "Hide Globally"?
As you can see from the screenshot, the 1998 fixlet comes first, and after that I start seeing the ones that are relevant.
If you scroll to the bottom, I see the ones relevant to most machines, and some of those fixlets are not really current, but at least it’s better than having the agents go through old-as-heck 1998 fixlets\tasks, correct?
@JasonWalker I imported the first link provided on your other post (2994765) and it’s activated. Seeing lots of data but I can’t make sense of it not knowing what exactly to look for. Should I open a ticket with HCL? Would they be able to provide some guidance on this?
That first analysis is good for getting an overall idea of the client health, but the second analysis at https://bigfix.me/analysis/details/2998424 will be more useful for checking which specific pieces of content are increasing the eval loop (if indeed it’s even a content problem - maybe this is already resolved with the antivirus exceptions you added?)
From this second analysis, one would start with the results from these three properties to see what fixlets/baselines/tasks might be taking too long on the client -
Ah, that analysis may have been abandoned by the owner. I’ve created a new analysis that retrieves some of the properties we need, please have a look instead at https://bigfix.me/analysis/details/2998690
Thanks for that quick turnaround, I loaded the analysis and I am seeing the data, I see the columns you mentioned and it’s showing data, for example slower than 1 min:
Now, what do I do with it?
And another question regarding your other Client Responsiveness post, you say to enable Client polling on both clients and relays because in the beginning of the paragraph you mention clients and later you mention:
If so, do I set the same polling interval I used on my clients, 300 seconds (5 minutes), and would this be for the main server and both the internal and DMZ relays?