Computers stuck in <not reported>, have to manually restart BES service

So I've been looking at the Patches for Windows external site (we do not have a custom site for patches for Windows), and when showing non-relevant content there are 20,557 fixlets and tasks! When I sort by applicability, they go back to 1998! Coming back to why the client would be wasting valuable time evaluating old non-relevant fixlets: is it normal practice to go into Patches for Windows, show non-relevant content, select all the non-relevant items, right-click, and select "Hide Globally"?

As you can see from the screenshot, the 1998 fixlet appears first, and only after that do I start seeing the ones that are relevant.

If you scroll to the bottom, I see most machines relevant, and some of those fixlets are not really current, but at least that's better than having the agents churn through old-as-heck 1998 fixlets/tasks, correct?

@JasonWalker I imported the first link provided in your other post (2994765) and it's activated. I'm seeing lots of data, but I can't make sense of it without knowing exactly what to look for. Should I open a ticket with HCL? Would they be able to provide some guidance on this?

That first analysis is good for getting an overall idea of client health, but the second analysis at https://bigfix.me/analysis/details/2998424 will be more useful for checking which specific pieces of content are lengthening the eval loop (if indeed it's even a content problem; maybe this is already resolved by the antivirus exceptions you added?).

From this second analysis, one would start with the results from these three properties to see which fixlets/baselines/tasks might be taking too long on the client:

  • Evaluations slower than 1 minute
  • Evaluations slower than 15 minutes
  • Evaluations slower than 60 minutes

I had to contact the owner to get access to that analysis. As soon as I have it, I'll try it out.

I’ll look into those.

Ah, that analysis may have been abandoned by the owner. I've created a new analysis that retrieves some of the properties we need; please have a look instead at https://bigfix.me/analysis/details/2998690


Thanks for the quick turnaround. I loaded the analysis and I'm seeing data in the columns you mentioned; for example, slower than 1 min:

[screenshot of analysis results]

Now, what do I do with it? :confused:

And another question regarding your other Client Responsiveness post: you say to enable command polling on both clients and relays, since at the beginning of the paragraph you mention clients and later you mention relays.

If so, do I set the same polling interval I used on my clients, 300 seconds (5 mins)? And this would be for the main server and both the internal and DMZ relays, correct?
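For reference, a minimal action script sketch of what I'd be applying (assuming the standard client setting names for command polling; 300 seconds is just the value I used on the clients):

    // enable command polling and set the poll interval (illustrative values)
    setting "_BESClient_Comm_CommandPollEnable"="1" on "{parameter "action issue date" of action}" for client
    setting "_BESClient_Comm_CommandPollIntervalSeconds"="300" on "{parameter "action issue date" of action}" for client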

Thanks!

What is Fixlet 5953 in that site? What is the relevance?

Persistent Connections is a different thing. We may look at that later but your problem right now is evaluation cycle time.


It's a custom site, and the fixlet is a 2019 security update for Excel 2016. It's applicable to a single machine, which is actually a server that at some point probably had Office installed but no longer does.


I would recommend deleting a lot of the old duplicated patch content from that shared site. I am also a little confused as to why that content was duplicated in the first place, but beyond that, relevance evaluation of duplicated patch content is going to be much worse in a custom site, because it loses optimizations that can only be applied to external sites and do not work within custom sites.

Globally hiding old content can improve the performance of the console for MOs or NMOs who are not showing hidden content, but it does not affect the client evaluation loop.

It is generally ideal to keep the actionsite as small as possible and to minimize the content stored within it.

If you create a new custom site that all computers subscribe to and put everything in there instead of in the actionsite, it can create the same problem to some degree, but it is still generally better than putting things in the actionsite, given how frequently the actionsite gets updated for other reasons.

I generally recommend having something like:

Shared
Shared/Windows
Shared/WindowsServer
Shared/WindowsDesktop
Shared/Linux

and then put the appropriate content into the appropriate site and limit the site subscription criteria.
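As a rough sketch, the subscription relevance for each site could look something like the following (illustrative only; "product type of operating system" is the usual Windows inspector for telling servers apart from workstations, so verify it against your inspector reference):

    Shared/Windows:        name of operating system as lowercase starts with "win"
    Shared/WindowsServer:  (name of operating system as lowercase starts with "win") and (product type of operating system != nt workstation product type)
    Shared/WindowsDesktop: (name of operating system as lowercase starts with "win") and (product type of operating system = nt workstation product type)
    Shared/Linux:          name of operating system as lowercase starts with "linux"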

I am concerned that the issue could actually be the report processing speed from the relay and/or root server, especially if all of the machines reporting to the same relay have the same issue, while machines reporting to a different relay do not. You may need to tune some of the settings on the relay.

Optimizing the client eval loop is a very good idea as well, but you might have more than one issue going on simultaneously.

Do your relays have slow disks or slow NICs? Ideally, the volume that hosts FillDB on the root server, or that handles incoming reports from clients on the relay, would be on faster disks (SSDs).

I'd like to understand how to do that. I am assuming that if I select that ID 5953 and choose Remove while in Shared Content > Fixlets and Tasks, it will completely remove it from BigFix, correct? And I would have to export the fixlet and re-import it to get it back?

This is what our Shared site looks like, and I wish I actually knew why the MSP did this. What are the advantages/disadvantages of this?:
[screenshot]

Good to know, thank you.

Here’s our Master Action Site:
[screenshot]

We are in a VM environment running on all-flash attached disk storage with 10Gb networking on the VM hosts.

Thanks @jgstew


That looks like quite a high number of actions. Are you deploying content as master operator accounts?

One situation we had that impacted performance: we were using MO accounts to deploy baselines from a custom site. Because they were deployed by an MO, the baseline actions were created in the Master Action Site, and since the baselines had between 100 and 200 components, each action was up to 12MB in size. That was repeated across several baselines, so the content in the Master Action Site grew and became a performance hit on the agents. We stopped using MO accounts for day-to-day activity, which reduced the size of the actionsite and avoided the performance issues.

You could check the size of the "C:\Program Files (x86)\BigFix Enterprise\BES Client\__BESData\actionsite" folder on some endpoints. I recall seeing it mentioned that keeping this below 25MB is good practice.
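If you want to track that from an analysis, a relevance property along these lines is one sketch (it assumes the actionsite content sits directly in the "actionsite" folder of the client's data folder, and "files of" is not recursive, so it undercounts if there are subfolders):

    (sum of sizes of files of folder "actionsite" of data folder of client) / 1048576

That should give an approximate size in MB.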


There are a few select things you should deploy as an MO targeting "all computers", for instance tasks that set initial client settings (if unset) to enhance client performance as early in the process as possible.

Otherwise, as is mentioned here, you should generally avoid taking action on things as an MO unless it is something that actually needs to target all computers, or is so important that it is warranted.

Patching baselines are definitely things that should NOT be taken as an MO in the ideal case due to their size and complexity.

Almost 8,000 actions is quite a lot and almost certainly needs some cleanup.

Yes, that is correct. You can export the content and save it somewhere. Then delete it. Then if you ever need it again you could import it again.

Having a shared content site is a good idea in general, but you should not keep things in it that you definitely do not need any longer.


Everyone, thanks for helping out- I don’t wanna lose you now…thanks for sticking here with me.

Yes, we push fixlets and such using MO accounts, but only 2 users actually use the system for this purpose. I'll have to put a pin in this one until after I get this not-reported issue under control.

So I tried to apply that fixlet (ID 5953), the Excel 2016 security update, on the server that at some point had Office installed for whatever reason, and it failed with exit code 1605:

ERROR_UNKNOWN_PRODUCT 1605 This action is only valid for products that are currently installed.

And indeed, after checking the log file, it did mention the product not being installed, so the fixlet is picking up registry entries left behind.
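Presumably the applicability relevance checks the registry uninstall keys rather than the binaries themselves; hypothetically, something along these lines (a sketch, not the actual relevance of fixlet 5953):

    exists keys whose (value "DisplayName" of it as string contains "Microsoft Office") of key "HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Uninstall" of native registry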

I exported the fixlet for 'safe keeping' and removed it from the Shared Content site. But why would one need a fixlet that is no longer in use? We're not going to use Office 2016; do I need to save that fixlet for any reason?

And some additional questions: I know that when you create a fixlet you can select the site, and that's how fixlets, analyses, etc. get into the Shared Content site. But how can a regular fixlet end up there? For example, how did a Patches for Windows fixlet (the Excel one) end up in Shared Content?
I started going through the analysis that @JasonWalker provided, and I exported and then removed the offending fixlets, tasks, and analyses from the Shared Content site. That should obviously stop the workstations that were getting 'stuck' on those, and I expect to see evaluation times go down unless others come up taking more than 1 min/15 min. I am not done cleaning up, but I should expect to see new fixlet IDs at some point.

@jgstew you recommended to create custom sites that target the OS:

I was thinking of it like a folder structure, with nested sites within Shared, but I don't see the ability to do that. If so, I would assume the following structure instead, with multiple sites prefixed Shared-:

Custom Sites
Shared
Shared-Windows
Shared-WindowsServer
Shared-WindowsDesktop
Shared-Linux

Thanks


You can export a fixlet from one site and import it into another. Not sure why you would do that for patch content, though.

Sites can't actually have a folder structure, but you can effectively have a hierarchy of sites by how you structure the relevance for the site subscription criteria.

I guess that's how the MSP managed to do this then. <-- retracting that sentence: this was done internally when this was first implemented about 5 years ago, by a colleague who was also trying to understand the application.

Yeah, I would have relevance for each custom site to match what it targets: if one custom site is for Windows desktops, its relevance targets Windows desktops; if another is for Windows servers, it targets Windows servers; and so on. I hope that makes sense.

I deleted the IDs coming up as taking X minutes in the Evaluation Cycle analysis, but I still have machines reporting on those IDs. I made sure to click Remove, and I searched BigFix again after selecting the Show Non-Relevant and Show Hidden Items buttons.

The analyses that report those results can be work-intensive; I think by default they are set to update only every 6 or maybe 12 hours.

You can right-click one or a few of those computers and use the "Force Refresh" option to re-evaluate all the properties now. Give that fifteen minutes or so and check the results again (in the client log you'll see a message similar to "Full Report posted successfully" when the client completes the re-evaluation of everything).
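If you ever want to do that at scale rather than via the console right-click, a one-line action script sketch (assuming the notify command is available in your client version):

    notify client ForceRefresh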


@JasonWalker I think it's been just about 24 hours since you recommended changing the refresh to "every report"…yet I am still seeing IDs of items that are no longer in the Custom Sites > Shared Content site. I explicitly exported and then deleted them by right-clicking each item and selecting Remove. I feel like I am chasing my tail here.

I ended up increasing the mark-as-offline threshold under Preferences to 90 mins, so at least we get better visibility of machines that are actually online and reachable…and even then I find some that are past 90 mins :confused:

For example, I was able to ping this one while it was grey; check out the "Report posted successfully" times…

and to show that I did change it to every report :slight_smile:

Oh my… I apologize if I wasn't clear, but I didn't mean for you to make those properties evaluate on "every report". I recommend you change those back to 6 hours or 24 hours as soon as you can.

What I meant was to right-click a Computer and use the “Force Refresh” option to make that computer reevaluate all properties and create a new report (one-time refresh).

lol…I thought it was weird that you were recommending changing all the properties to Every Report considering the machines were taking long to eval, but yeah…I TOTALLY misread that. I thought it was to get data from the machines faster, and that after 24 hours I would have updated data, which I obviously do not.

Either way, I just changed them back to their default values and am refreshing on 3 different systems that I am currently testing with. I'll see what comes up for those.

Thanks!

Edit: I also found this article regarding performance counters. By the looks of it, it reports the same data about cycle times: https://support.hcltechsw.com/csm?id=kb_article&sysparm_article=KB0023415 It shouldn't make a difference whether I use the one you recommended or this one, right?