Computers stuck in <not reported>, have to manually restart BES service

It’s really difficult to say what a ‘good’ evaluation time is; it all depends on your expectations. I’d suggest using the link I posted earlier to retrieve the client performance analysis, then import and activate it so you can start getting a baseline of how your machines are performing.

What you’d want to watch for are outliers (some machines evaluating much more slowly than others), as well as changes over time (good evaluation times in December and much longer times now might indicate a bad property or relevance was recently added).

Once you know your expected eval times, you can tune the Console’s graying-out to match them. I am aware of some large customers with hundreds of thousands of computers, and a ton of properties, whose machines take two or three hours to complete an evaluation cycle. Those properties are important to them, though, so they live with those times and accept that the actions they issue can take longer to respond (generally they issue patching actions several days ahead of their maintenance windows anyway).


I also want to add the importance of https://help.hcltechsw.com/bigfix/10.0/platform/Platform/Config/c_real_time_av.html - I had a customer who complained about a huge average evaluation cycle (more than 10 hours). We started with a clean image with just the BigFix client, and the average evaluation cycle was up to 15 minutes, so we knew something was causing the client to work a lot harder. After he installed Carbon Black, the average evaluation cycle skyrocketed to… 15 hours.
Of course, at first he said that he had already added the exclusions we asked for, but we confirmed that he had not… :slight_smile:

Always make sure to establish the baseline with a clean image and just the BigFix client installed.
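
For reference, the exclusions that page discusses are roughly of this shape (paths assume a default 64-bit Windows install - treat these as examples and use the HCL page as the authoritative list):

    Exclude directory: C:\Program Files (x86)\BigFix Enterprise\BES Client\__BESData
    Exclude process:   C:\Program Files (x86)\BigFix Enterprise\BES Client\BESClient.exe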


@JasonWalker thank you for all that information. I am trying to digest it all to make sense of it, improve our environment, and learn along the way.

Now for some Q’s…

^I found ID 6170 in the Master Action Site; it was a baseline created back in 2019 when we were first getting started with BigFix. It was no longer needed and was removed, so that shouldn’t be an issue for any other besclient.

^I found the analysis and changed it from evaluating every hour to every day.

^I don’t see EnableSupersededEvaluation in the settings of this particular client, nor on the other clients I spot-checked. I was trying to find a fixlet\task that would tell me which computers have it, but I am unable to. Is there such a fixlet\task, or how can I find out whether this is set anywhere in the environment? I searched the forum for it and it seems like it’s not a good idea to have this set; am I correct about that?
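
For example, I was thinking a retrieved property roughly like the following might show it across the environment (just a sketch on my part - it assumes the setting’s full name is _BESClient_WindowsOS_EnableSupersededEvaluation, so please correct the name if that’s wrong):

    if (exists value of setting "_BESClient_WindowsOS_EnableSupersededEvaluation" of client) then (value of setting "_BESClient_WindowsOS_EnableSupersededEvaluation" of client) else "Not Set"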

^So I was not finding this fixlet until I decided to click on Show Hidden Content and then Show Non-Relevant Content; that’s when I saw both IDs 281736901 and 281736903.
There should be no machines with Office 2010, and both of these fixlets show that they’re applicable to 0 machines. Why would they come up in the logs as being evaluated? Then I searched for some more IDs from the log and there’s an old Adobe Reader DC fixlet from 2020, yet the machine in question has Reader XI. Why would these old fixlets from Patches for Windows or Updates for Windows Applications be scanned by the besclient? These applications are not even on the machine I am currently troubleshooting with.

I know that’s a lot of Q’s but thanks for your time.

@orbiton thanks for that link. I won’t claim to remember having any exclusions configured, but when I checked… there were NO exclusions anyway. Let’s just say I’ve added them on the server side and will have another team add them on the desktop side. This should definitely help. :+1:


So I’ve been looking at the Patches for Windows external site (we do not have a custom site for Patches for Windows), and when showing non-relevant content there are 20557 fixlets and tasks! Sorting by applicability, they go back to 1998! Going back to why the client would be wasting valuable time evaluating old non-relevant fixlets: is it normal practice to go into Patches for Windows, show non-relevant content, select all of the non-relevant items, right-click, and select "Hide Globally"?

In the screenshot you can see the 1998 fixlet, and after that I start seeing the ones that are relevant.

If you scroll to the bottom, most machines are relevant, and some of those fixlets are not really current, but at least it’s better than having the agents go through ancient 1998 fixlets\tasks, correct?

@JasonWalker I imported the first link provided in your other post (2994765) and it’s activated. I’m seeing lots of data but I can’t make sense of it without knowing exactly what to look for. Should I open a ticket with HCL? Would they be able to provide some guidance on this?

That first analysis is good for getting an overall idea of client health, but the second analysis at https://bigfix.me/analysis/details/2998424 will be more useful for checking which specific pieces of content are increasing the eval loop (if indeed it’s even a content problem - maybe it’s already resolved by the antivirus exceptions you added?)

From this second analysis, one would start with the results from these three properties to see what fixlets/baselines/tasks might be taking too long on the client -

  • Evaluations slower than 1 minute
  • Evaluations slower than 15 minutes
  • Evaluations slower than 60 minutes

I had to contact the owner to get access to that analysis. As soon as I get it I’ll try it out.

I’ll look into those.

Ah, that analysis may have been abandoned by the owner. I’ve created a new analysis that retrieves some of the properties we need; please have a look instead at https://bigfix.me/analysis/details/2998690


Thanks for that quick turnaround. I loaded the analysis and I see the columns you mentioned; it’s showing data, for example slower than 1 min:

(screenshot)

Now, what do I do with it? :confused:

And another question regarding your other Client Responsiveness post: do you mean to enable client polling on both clients and relays? At the beginning of the paragraph you mention clients, and later you mention:

If so, do I set the same polling interval I used on my clients, 300 seconds (5 minutes), and does that apply to the main server and both the internal and DMZ relays?
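
For reference, what I applied on the clients was along these lines - an approximation from memory, so correct me if these aren’t the right setting names:

    // enable command polling and poll every 300 seconds
    setting "_BESClient_Comm_CommandPollEnable"="1" on "{parameter "action issue date" of action}" for client
    setting "_BESClient_Comm_CommandPollIntervalSeconds"="300" on "{parameter "action issue date" of action}" for client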

Thanks!

What is Fixlet 5953 in that site? What is the relevance?

Persistent Connections is a different thing. We may look at that later but your problem right now is evaluation cycle time.


It’s a custom site. The fixlet is a 2019 Excel 2016 security update, and it’s applicable to a single machine, which is actually a server that probably had Office installed at some point but no longer does.


I would recommend deleting a lot of the old duplicated patch content from that shared site. I am also a little confused about why that content is being duplicated in the first place, but the relevance evaluation of duplicated patch content is going to be much worse in a custom site because it loses some optimizations that can only be done in external sites and do not apply within custom sites.

Globally hiding old content can improve Console performance for NMOs or MOs who are not showing hidden content, but it does not affect the client evaluation loop.

It is generally ideal to keep the actionsite as small as possible and minimize the content stored within it.

If you create a new custom site that all computers subscribe to and put everything there instead of the actionsite, it can create the same problem to some degree, but it is still generally better than putting things in the actionsite because of how frequently the actionsite gets updated for other reasons.

I generally recommend having something like:

Shared
Shared/Windows
Shared/WindowsServer
Shared/WindowsDesktop
Shared/Linux

and then put the appropriate content into the appropriate site and limit the site subscription criteria.
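
As a rough sketch of what the subscription relevance could look like (the OS-name checks here are illustrative - tune them to the platforms you actually have):

    Shared           : all computers
    Shared/Windows   : name of operating system as lowercase starts with "win"
    Shared/Linux     : name of operating system as lowercase starts with "linux"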

I am concerned that the issue could actually be the report processing speed on the relay and/or root server, especially if all of the machines reporting to the same relay have the same issue while machines reporting to a different relay do not. You may need to tune some of the settings on the relay.

Optimizing the client eval loop is a very good idea as well, but you might have more than one issue going on simultaneously.

Do your relays have slow disks or slow NICs? Ideally the volume that hosts FillDB on the root server, or that handles incoming reports from clients on the relay, would be on faster disks (SSDs).

I’d like to understand how to do that. I am assuming that if I select that ID 5953 and choose Remove while in Shared Content > Fixlets and Tasks, it will completely remove it from BigFix, correct? And I would have to export the fixlet and re-import it to get it back?

This is what our Shared site looks like, and I wish I actually knew why the MSP did this. What are the advantages/disadvantages of this approach?:
(screenshot)

Good to know, thank you.

Here’s our Master Action Site:
(screenshot)

We are on a VM environment running on all-flash attached disk storage, with 10Gb networking on the VM hosts.

Thanks @jgstew


That looks like quite a high number of actions. Are you deploying content as master operator accounts?

One situation we had that impacted performance was using MO accounts to deploy baselines from a custom site. Because they were deployed by an MO, the baseline actions were created in the Master Action site, and since the baselines had 100-200 components each, a single action could be up to 12 MB in size. That was repeated over several baselines, so the size of content in the Master Action site grew and became a performance hit on the agents. We stopped using MO accounts for day-to-day activity, which reduced the size of the actionsite and avoided the performance issues.

You could check the size of the “C:\Program Files (x86)\BigFix Enterprise\BES Client\__BESData\actionsite” folder on some endpoints. I recall seeing it mentioned that keeping this below 25 MB is good practice.
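
If you want to check that at scale instead of logging on to endpoints, a retrieved property along these lines should report the size in MB (a sketch - it relies on the client’s own reference to its actionsite folder, so sanity-check the result on a machine or two first):

    (sum of sizes of descendants of client folder of site "actionsite") / 1048576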


There are a few select things you should deploy as an MO targeting “all computers”, for instance content that sets initial client settings (if unset) to improve client performance as early in the process as possible.

Otherwise, as mentioned here, you should generally avoid taking action as an MO unless it is something that actually needs to target all computers, or is so important that it is warranted.

Patching baselines are definitely things that should NOT be taken as an MO in the ideal case due to their size and complexity.

Almost 8000 actions is quite a lot and almost certainly needs some cleanup.

Yes, that is correct. You can export the content and save it somewhere. Then delete it. Then if you ever need it again you could import it again.

Having a shared content site is a good idea in general, but you should not keep things in it that you definitely do not need any longer.


Everyone, thanks for helping out - I don’t wanna lose you now… thanks for sticking with me.

Yes, we push fixlets and such using MO accounts, but only 2 users actually use the system for this purpose. I’ll have to put a pin in this one until after I get this ‘not reported’ issue under control.

So I tried to apply that fixlet, ID 5953, which was the Excel 2016 update, on the server that at some point had Office installed for whatever reason, and it failed with exit code 1605:

ERROR_UNKNOWN_PRODUCT 1605 This action is only valid for products that are currently installed.

And indeed, after checking the log file, it did mention the product not being installed, so the fixlet is picking up some registry entries left behind.

I exported the fixlet for ‘safe keeping’ and removed it from the Shared Content site. But why would one need a fixlet that is no longer in use? We’re not going to use Office 2016; do I need to save that fixlet for any reason?

And some additional questions. I know that when you create a fixlet you can select the site, and that’s how fixlets, analyses, etc. get into the Shared Content site, but how can a regular fixlet end up there? For example, how did a Patches for Windows fixlet (the Excel one) end up in Shared Content?
I started going through the analysis that @JasonWalker provided, and I exported and then removed the fixlets, tasks, and analyses from the Shared Content site. That should obviously stop the workstations that were getting ‘stuck’ on those, and I expect to see evaluation times go down unless others come up taking more than 1 min/15 min. I am not done cleaning up, but I should expect to see new fixlet IDs at some point.

@jgstew you recommended to create custom sites that target the OS:

Thinking of it like a folder structure with nested sites within Shared, I don’t see the ability to do that. If that’s the case, I would assume the following structure instead, with multiple sites named with a “Shared-” prefix:

Custom Sites
Shared
Shared-Windows
Shared-WindowsServer
Shared-WindowsDesktop
Shared-Linux

Thanks


You can export a fixlet from one site and import it into another; not sure why you would do that for patch content, though.

Sites can’t actually have a folder structure, but you can effectively create a hierarchy of sites by how you structure the relevance for the site subscription criteria.

I guess that’s how the msp managed to do this then. <— retracting that sentence… this was done internally when this was first implemented about 5 years ago, by a colleague who was also trying to understand the application.

Yeah, I would have relevance for each custom site matching what I am targeting: if one custom site is for Windows desktops it targets Windows desktops, if another is for Windows servers it targets Windows servers, and so on.
I hope that made sense.
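
For example, the desktop/server split could be relevance roughly like this (a sketch - the product type comparison is the usual way I’ve seen Windows servers and workstations told apart, but verify it against a few known machines):

    Shared-WindowsDesktop : (name of operating system as lowercase starts with "win") and (product type of operating system = nt workstation product type)
    Shared-WindowsServer  : (name of operating system as lowercase starts with "win") and (not (product type of operating system = nt workstation product type))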

I deleted the IDs coming up as taking X minutes from the Evaluation Cycle analysis, but I still have machines reporting on those IDs. I made sure to click Remove, and I searched BigFix again after selecting the Show Non-Relevant and Show Hidden Items buttons.