Endpoints temporarily leaving sites due to CRC Mismatch, giving "not reported" results

Heisenbug · May 14, 2020, 11:18am

Hi all,

Wasn’t entirely sure if I should put this post here or under platform usage, so please forgive me if it should have been in another category.

Looking for a little help with machines “bunny hopping” out of sites and therefore losing reporting data. Apologies for the length of the post, but I think the information is relevant and may also show BigFix newbies how to perform methodical debugging.

Background: I was originally trying to track down a number of “<not reported>” statuses in Web Reports, and with the help of HCL ended up at the article here:

https://help.hcltechsw.com/bigfix/9.5/platform/Platform/Web_Reports/c_tuning_webreport.html

This lovely article enlightened me to the fact that “not reported” can also effectively mean “yes, it did report previously, but it’s actually become non-relevant so now I’ll change the original result to not reported”. The previously reported results from a client can still be seen in the console.

Now, the easy thing to do would be to flick the registry setting as per the article, but in my opinion that would just mask the problem. I want to know if machines have really stopped reporting, as that’s a problem requiring resolution.

Soooo - I created an analysis to show when machines last subscribed to the site containing the analysis that periodically shows “<not reported>”:

(minimum of subscribe times of site whose(name of it = "name_of_my_site")) as universal string

When I report against this new analysis and sort by time, it shows that roughly 1.5% of our machines “bunny hop” out of the site at any one time and then rejoin it.
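
For anyone wanting to repeat this on their own estate, here is a rough Python sketch of how the exported analysis results could be crunched. It assumes the report has been exported as (computer name, subscribe time) pairs and that the time strings look like `Thu, 14 May 2020 02:35:44 +0000`. Both the export shape and the function name `recent_resubscribers` are my own assumptions for illustration, not anything BigFix provides.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def recent_resubscribers(rows, now, window=timedelta(hours=24)):
    """Return (names, fraction) for computers that subscribed within `window` of `now`.

    rows: iterable of (computer_name, subscribe_time_string) pairs (assumed export shape).
    Rows whose time cannot be parsed, such as "<not reported>", are skipped.
    """
    parsed = []
    for name, stamp in rows:
        try:
            when = parsedate_to_datetime(stamp)
        except (TypeError, ValueError):
            continue  # not a usable timestamp
        if when.tzinfo is None:
            # Assumption: a string without a usable zone offset is treated as UTC.
            when = when.replace(tzinfo=timezone.utc)
        parsed.append((name, when))

    recent = sorted(
        name for name, when in parsed if timedelta(0) <= now - when <= window
    )
    fraction = len(recent) / len(parsed) if parsed else 0.0
    return recent, fraction
```

Pass in a timezone-aware `now` (for example `datetime.now(timezone.utc)`) and the returned fraction is the share of parseable machines that rejoined the site inside the window, which is the figure I was eyeballing in the sorted report.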

Using a little Query to pull back log files, I see the following happening on clients:

—
(…standard client startup routine snipped …. )

At 10:35:44 +0800 -
    Initializing Site: name_of_my_site
At 10:36:03 +0800 -
    CRC mismatch while loading site ‘name_of_my_site

(… normal relay selection stuff snipped …)

At 10:36:27 +0800 -
    Processing Download plugins
    Adding custom site (name_of_my_site)

(… life goes on, site gets reprocessed…)

—

Times obviously differ per client.
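
If you have pulled client logs back as text, something like the Python sketch below could pick out the mismatch events. The line formats are inferred purely from the log excerpt above (an `At HH:MM:SS +ZZZZ -` header followed by an indented message), and `find_crc_mismatches` is a hypothetical helper name, so treat it as a starting point rather than a definitive parser.

```python
import re

# Timestamp header lines look like "At 10:35:44 +0800 -" in the excerpt.
STAMP = re.compile(r"^\s*At (\d{2}:\d{2}:\d{2} [+-]\d{4}) -")
# The site name is taken as whatever follows the phrase on the same line.
MISMATCH = re.compile(r"CRC mismatch while loading site\s+(.*)$")


def find_crc_mismatches(log_text):
    """Return a list of (timestamp, site_name) for each CRC mismatch entry."""
    hits = []
    current_stamp = None
    for line in log_text.splitlines():
        stamp = STAMP.match(line)
        if stamp:
            # Remember the most recent timestamp header for the messages below it.
            current_stamp = stamp.group(1)
            continue
        found = MISMATCH.search(line)
        if found:
            # Drop surrounding straight or curly quotes, if any.
            site = found.group(1).strip().strip("'\"‘’“”")
            hits.append((current_stamp, site))
    return hits
```

Running this over each client's log gives a per-machine list of when the mismatch fired and for which site, which makes it easier to check whether sites other than the one in question are affected.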

The machines are on various versions of Windows 10, and the clients are 9.5.14. The logs don’t show any abnormal shutdowns of the client prior to the event. It happens across the whole estate (not just one or two particular relays): different clients, different times. Due to the size of the estate, this is affecting roughly 1,000 machines every day. Most of the machines reprocess the site and continue to report; however, around 300-400 don’t have time to finish, and this is the root cause of them showing up periodically as “not reported”. If we hadn’t chased down the machines using the last-subscribed analysis, we would probably be none the wiser that this issue was happening.

I’ve googled this and I’m not seeing any results for BigFix CRC mismatch.

I don’t know if it affects other sites apart from “name_of_my_site”, although I plan on creating more analyses to scan specifically for this issue. There are no other apparent problems with the system.

Ideally we want to stop the CRC mismatches from happening, but if that’s not possible, is there a way to stop the machine abandoning the whole site and bunny hopping out (and then back in)? Getting physical access to a faulting machine is hard due to the nature of the business, so I’m currently limited to using Query and analyses.

Thanks for reading, and I’ll be interested in any theories/thoughts/offers of strong drink.

Heisenbug · May 28, 2020, 8:23am

Would anyone be willing to do a similar analysis on their setup? I’ve got a PMR open with HCL and as this is hard to track down it would be interesting to know if anyone else spots this (or if we have a particular issue with this one site). Ta!