Issues with populating machines in automatic computer groups

Heisenbug · June 19, 2019, 12:53pm

Hi all,

Relative newbie to BigFix, but have spent quite a bit of time reading, googling, testing etc. For my sins, I’ve inherited a large-scale, established environment (~100,000 endpoints) which has been performing poorly.

I’ve made quite a few advancements over the last few weeks, but I think I’m beating myself to death with automatic groups and thought it best to ask for a couple of pointers here.

The current environment is 9.5.9. I’ve already asked for an upgrade to 9.5.12 (or even 13) so hopefully that is on the cards.

The main issue I am currently facing is with automatic group population. I’ve tried to simplify it as much as possible, but I cannot get it working consistently.

So:

Custom site with a subscription rule of: DEVICE TYPE contains WORKSTATION (no apparent issues)

Site has an automatic group - “All Workstations”: DEVICE TYPE contains WORKSTATION (no apparent issues)
Site has an automatic group - “Workstations beginning with M”:
Computer is member of: All Workstations
and
Relevance expression is true: (computer name as lowercase starts with "m")

Probably around 10% of the machines whose names begin with M are not joining the 2nd group. They have joined the “All Workstations” group. They have been running for a few days and checking in/reporting regularly. Clients have a wide range of uptimes - some hours, some weeks.

I’ve run various reports, checked logs, but cannot see anything obvious. Some of the clients have existed for years, others are new. All reporting to various combinations of different relays. Some W7, some W10.

Can anyone help save my sanity?

Many thanks!

JasonWalker · June 19, 2019, 1:24pm

Is there any commonality to the machines that are not updating? In particular I’d check the Relay to which they’re reporting; it’s possible they have corrupted site data and are not propagating the change (to the single site). Or could those clients be stuck running an action? (Check the Action History on those clients for actions stuck in a ‘Running’ status.)

Heisenbug · June 19, 2019, 1:56pm

Nothing obvious - I did an extract of the data and reported on the relays the machines were connecting to. No significant clusters, fairly even distribution. And the other 90% of PCs are connecting to the same relays without issues.
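For anyone wanting to repeat that kind of check, here is a rough Python sketch of the per-relay comparison. It assumes the export has already been reduced to (computer, relay, in-group) rows; the function name, the relay names and the sample machines are all made up for illustration.

```python
from collections import Counter

def miss_rate_by_relay(rows):
    """rows: iterable of (computer_name, relay_name, in_group) tuples."""
    total = Counter()
    missing = Counter()
    for _name, relay, in_group in rows:
        total[relay] += 1
        if not in_group:
            missing[relay] += 1
    # Fraction of each relay's clients that are absent from the group.
    return {relay: missing[relay] / total[relay] for relay in total}

# Hypothetical sample data, not taken from the real environment.
rows = [
    ("M-PC-001", "relay-a", True),
    ("M-PC-002", "relay-a", False),
    ("M-PC-003", "relay-b", True),
    ("M-PC-004", "relay-b", True),
]
rates = miss_rate_by_relay(rows)
```

A relay whose rate sits well above the overall ~10% would be worth a closer look; a flat spread points away from a relay problem.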

Same for action status - a spot check shows some actions running, but not a significant number.

But thanks for those ideas - at the moment I’m open to any suggestions!

Aram · June 19, 2019, 2:16pm

I’d suggest checking two primary things (assuming that the devices are reporting in):

Have the Clients gathered the latest site in question (i.e. do they have the automatic group file in their site data)? This can be checked within the filesystem (search for the Group’s ID within the folders under __BESData), or confirmed via regular Client logs (check that the Client has successfully synchronized with the latest version of the site containing the automatic group).
Have the Clients evaluated the content in question, and subsequently reported? This will require enabling debug logging…
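If you do get filesystem access, the first check could be scripted along these lines. This is only a sketch: it assumes nothing about how the client names or lays out its site files, so it simply looks for the group ID in both file names and file contents. The function name is made up, and the root path and group ID are whatever you pass in.

```python
import os

def find_group_files(besdata_root, group_id):
    """Return paths under besdata_root whose name or contents mention group_id."""
    needle = str(group_id)
    hits = []
    for dirpath, _dirs, filenames in os.walk(besdata_root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if needle in filename:
                hits.append(path)
                continue
            try:
                # Read as text and ignore undecodable bytes; fine for a quick search.
                with open(path, "r", errors="ignore") as handle:
                    if needle in handle.read():
                        hits.append(path)
            except OSError:
                # Skip files that are locked or otherwise unreadable.
                continue
    return sorted(hits)
```

An empty result for a client that should be in the group would suggest it has not gathered the site version that contains the group.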

And if you haven’t already opened a support ticket, I’d recommend doing so

Heisenbug · June 19, 2019, 2:26pm

The clients are definitely reporting in - I run a report and set “last report time” and can see them coming in. They also show up as live in the console.

As for evaluating the content… that’s the million dollar question.

I’m going to set up an analysis:
(names of it, last gather time of it, version of it) of sites

This should help me see if the site versions/dates etc are correct as it will be pulled back to the console and web reports.

Debug logging is currently an issue as I don’t have access to the actual machines or file system access to the server. Yet.

I’m holding off on opening a support ticket until we get moved to the later version. I have looked at the PMR/APAR changelist and can’t see any obvious fix for this, but it wouldn’t surprise me if it’s related.

JasonWalker · June 19, 2019, 2:38pm

I don’t recall which version introduced BigFix WebUI and the Query app…do you have it available? I’ve found it useful to use Query to retrieve client log from endpoints I’m troubleshooting, when I don’t have direct access to the endpoint.

Heisenbug · June 19, 2019, 2:54pm

I’ve never heard of Query - I’ll ask if that’s available. Looking at the online docs for that, it looks like it would be incredibly helpful. Many thanks for that tip!

Heisenbug · June 19, 2019, 4:04pm

Just checked Query. How on earth did I not know about this tool? It’s amazing!

Wow. Just wow. This opens up whole new realms of troubleshooting and investigation. Thank you so much for this information.

Admittedly I still need to figure out why the groups aren’t populating, but I can do some real testing now. I’ll either be posting with more puzzling information or hopefully posting what went wrong so others can learn from the mistakes I have here.

Once again - thank you so much.

JasonWalker · June 19, 2019, 4:27pm

Glad to hear Query may be helpful for you!

While you’re at it, here’s a Query I wrote to retrieve the last 50 lines from the current BESClient log. You should be able to adapt it to look at the debug logs if you enable them -

(locked lines of item 1 of it, item 0 of it) whose (line number of item 0 of it > item 1 of it - 50) of (number of locked lines of it, it) of files ((year of it as string & month of it as two digits & day_of_month of it as two digits) of date (local time zone) of now & ".log") of folders "__Global\Logs" of (data folder of client)
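For anyone who does have direct access to the endpoint, roughly the same thing can be sketched in Python. It mirrors the query above: build today's file name as year, two-digit month and two-digit day plus ".log", then keep the last 50 lines. The function names are made up, and the logs folder is left as a parameter because the location of the client's data folder varies by install.

```python
from collections import deque
from datetime import date
from pathlib import Path

def todays_log_name(today=None):
    """Build the YYYYMMDD.log name the query above constructs."""
    today = today or date.today()
    return today.strftime("%Y%m%d") + ".log"

def tail_log(logs_folder, lines=50, today=None):
    """Return the last `lines` lines of today's log in logs_folder."""
    path = Path(logs_folder) / todays_log_name(today)
    with open(path, "r", errors="replace") as handle:
        # A bounded deque keeps only the most recent lines while streaming the file.
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]
```

Note that the query uses "locked lines" because the client has the log open; a plain open() like this may not behave the same way on a file the client is actively writing to, so treat it as a starting point only.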

Heisenbug · June 20, 2019, 1:28pm

Today’s update - thanks to the joys of Query I managed to do some direct targeting of machines. I’m seeing some interesting results, e.g. when I click on the machine in Query I can see it reports locally as being a member of some of the groups. But that’s not what I see in Web Reports or the console.

I tried force refreshing the machine - still no change in group membership status.

I then had the brainwave of trying to determine which system is lying by checking the relevance directly through Query. However I get the following issue:

(member of group 2634554 of site "Name_of_site_that_computer_is_in")

and get

The expression could not be evaluated: class InspectorSiteContextError

Based on that, am I correct to infer that Query doesn’t use the Local Client Evaluator? If so, is there a way to either switch to the local client evaluator or is there an alternate way of figuring out the results?

JasonWalker · June 20, 2019, 2:04pm

I don’t think that the Local Client Evaluator is available yet, but this has been brought up as an enhancement and I expect it in the future (you’d have to upgrade your deployment to take advantage when it becomes available though).

The two things I’d check would be whatever relevance is used in your group criteria directly (it should be available on the Properties page of the group), as well as the Registry entries that get added when a computer reports itself to be a member of the group (is that what you were checking locally?). I think that’s what the ‘member of group’ inspector references anyway.

Also FYI, in current versions the Fixlet Debugger itself can use Query to run relevance against a remote machine, you don’t have to use WebUI for that. That might be helpful for anyone else reading this post.

Heisenbug · June 20, 2019, 4:32pm

Thanks for those tips. I’ll try the individual relevance queries for each group within the nest tomorrow.

I’m still a little stumped as to why the machine “locally” appears to report as being a member of the group but it’s definitely not at console level.

Given the changes in 9.5.13 (being able to run Query against the agent), I’m going to push harder to get the upgrade done as this will significantly simplify the debugging.

Tonymoroni · June 21, 2019, 3:52pm

Would love to know how it turns out.

Heisenbug · June 21, 2019, 4:11pm

At the moment each machine I have investigated shows as a member of the relevant groups when queried via WebUI Query. However, they definitely don’t show up in the group in the console/Web Reports.

This suggests that they have correctly evaluated locally, but some of the results have failed to reach back to the database. It’s not a single relay issue.

I’m going to get an analysis to evaluate all the groups locally on the clients and return them into a string and then report against that (while waiting to get upgraded to 9.5.13).
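The comparison itself could look something like this rough Python sketch. It assumes both sides have already been exported into dictionaries mapping computer name to a set of group names; the function name and that data shape are just for illustration, and getting the data out of the analysis and the console is not covered here.

```python
def missing_on_server(local_groups, server_groups):
    """Both arguments: dict mapping computer name -> set of group names.

    Returns the groups each client reports locally that the server
    does not show for it.
    """
    gaps = {}
    for computer, local in local_groups.items():
        absent = local - server_groups.get(computer, set())
        if absent:
            gaps[computer] = sorted(absent)
    return gaps

# Hypothetical sample data.
local = {
    "M-PC-001": {"All Workstations", "Workstations beginning with M"},
    "M-PC-002": {"All Workstations", "Workstations beginning with M"},
}
server = {
    "M-PC-001": {"All Workstations", "Workstations beginning with M"},
    "M-PC-002": {"All Workstations"},
}
gaps = missing_on_server(local, server)
```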

I’ve tried using the refresh computer icon, and also sending a refresh task, but although it processes, it doesn’t change the missing groups on the server. This possibly suggests that refresh doesn’t reevaluate everything.

I need to figure out a way to tell the machine to fully reevaluate preferably without nuking the client in the process. I’ll report back when I get more information.

JasonWalker · June 21, 2019, 6:56pm

It’s a long shot, but…have you tried clearing your Console Cache? It’s in the File->Preferences dialog. You have to restart the console for it to take effect.

Heisenbug · June 24, 2019, 9:46am

Yup, it was a long-shot - and unfortunately not the cause

I cleared the cache and restarted. No change. The discrepancy also shows in Web Reports.

But thank you for the suggestion!

Aram · June 25, 2019, 1:59pm

A refresh should in fact perform a full re-evaluation (you’ll see “Full Report Posted Successfully” in the Client log when this completes). If that is not working, one drastic measure to try is to stop the Client service, delete the __BESData directory, then restart the Client service. This will cause it to re-gather/synchronize, but will also cause it to lose application usage information (and other historical context).

JasonWalker · September 2, 2021, 12:42am

Does the file path include “Program Files” or “Windows\system32” ? I wonder if 32-bit redirection is an issue.

gk_why · September 2, 2021, 12:45am

Of course, it reported right after I posted and sent a refresh. Another device took a solid 3 hours. My Client eval time looks good, around 16 minutes on average. I have seen this a few times where I have had to reboot a client. Does it prioritize running actions over evaluation, or does it run evaluation constantly in the background at the same time, as I understood it?

JasonWalker · September 2, 2021, 2:42am

Evaluation continues in the background, but…joining a computer group is itself an action, just a hidden action. If another long-running action is active, the client cannot join a computer group until the other action completes.