All endpoints stop checking in

Over the last few months we’ve seen a couple of instances where all endpoints suddenly stop checking in and fail to communicate again until we reboot the root server. We’ve got a few top-level relays and a few thousand endpoints in our environment. Since it’s all endpoints rather than just the endpoints connecting to a specific top-level relay, I’m guessing something is going wrong with the root server, or maybe the database, but I don’t really know how to go about determining the true cause. I have looked through some of the documented log files at https://help.hcltechsw.com/bigfix/9.5/platform/Platform/Installation/c_logfiles.html but it’s a bit mucky. Any pointers would be appreciated.

Is there any pattern to the timings when the check-ins stop?

If it is occurring right at midnight, I’d recommend disabling the UpdateHistoricalCounts procedure, as described in the thread “All Endpoints stop updating for a specific time of the day”.

If it is at some other time of day, it would be useful to check how that aligns to SQL maintenance jobs such as backups, reindexing, or consistency checks (DBCC).

If it’s not occurring predictably, I’d recommend opening a support incident and enabling FillDB logging.

I’ve seen this before. It usually occurred when our SQL index/cleanup processes were scheduled in a way that overlapped with external SQL backups. Too many concurrent jobs caused SQL to lose sync until a reboot. Most of my SQL jobs ran Saturday evening at 10pm and backups ran nightly at midnight, so sometimes the jobs would run long and overlap.
My root and SQL servers are also separate.
Moving the cleanup jobs to 8pm resolved it for me.
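
A rough way to sanity-check a schedule like this is to compare the actual run windows of the jobs. The sketch below is only an illustration: the dates and durations are made-up values, and `windows_overlap` is a hypothetical helper, not anything provided by BigFix or SQL Server. Real start and end times would have to come from your own job history.

```python
from datetime import datetime, timedelta


def windows_overlap(start_a, end_a, start_b, end_b):
    """Return True if the half-open windows [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a


# Hypothetical example: cleanup starts Saturday 10pm and runs long (2.5 hours).
cleanup_start = datetime(2024, 1, 6, 22, 0)
cleanup_end = cleanup_start + timedelta(hours=2, minutes=30)

# Hypothetical nightly backup starting at midnight and taking one hour.
backup_start = datetime(2024, 1, 7, 0, 0)
backup_end = backup_start + timedelta(hours=1)

# Same cleanup job moved two hours earlier, to 8pm.
moved_start = datetime(2024, 1, 6, 20, 0)
moved_end = moved_start + timedelta(hours=2, minutes=30)

print("10pm cleanup overlaps backup:",
      windows_overlap(cleanup_start, cleanup_end, backup_start, backup_end))
print("8pm cleanup overlaps backup:",
      windows_overlap(moved_start, moved_end, backup_start, backup_end))
```

With these assumed durations, the 10pm run spills past midnight into the backup window, while the 8pm run finishes well before it.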

I’ve also seen it happen when the partition holding the relays’ uploads directory fills up.

On Linux, this is /var. In our case, growth from normal rpm patching was filling the partition. We wound up adding a separate partition and mounting it under the /var location where BESRelay looks for its stuff.
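
If you want to keep an eye on this, a small script can report how full the partition is. This is a minimal sketch using Python’s standard `shutil.disk_usage`; the `/var` path comes from the description above, while the 90% threshold and the helper names are assumptions chosen for illustration.

```python
import shutil


def percent_used(path):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def is_nearly_full(path, threshold_pct=90.0):
    """Return True if the filesystem holding `path` is at or above the threshold."""
    return percent_used(path) >= threshold_pct


if __name__ == "__main__":
    # "/var" is the partition mentioned above; 90% is an arbitrary example threshold.
    path = "/var"
    try:
        print(f"{path}: {percent_used(path):.1f}% used, nearly full: {is_nearly_full(path)}")
    except FileNotFoundError:
        print(f"{path} does not exist on this machine")
```

Running something like this from cron on each relay and alerting on the result would flag the problem before the partition actually fills.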
