Relay space self-management

atlauren · August 10, 2022, 7:53pm

We had an interesting problem on our Linux relays. The monitoring systems reported that the partition holding the BESRelay data was over its threshold. By the time we managed to investigate, the partition was 100% full.

This was caused by an action using a prefetch payload over 20GB. But why didn’t the infrastructure self-manage the space and clear the way?

Partition size: 148GB
_BESGather_Download_CacheLimitMB: 128000MB

As of this writing,
bfmirror: 131G
bfmirror/bfsites: 5.4G
bfmirror/downloads: 126G
bfmirror/downloads/sha1: 126G

So far, so good, right? Downloads is less than the configured cache limit. There’s roughly 17GB of delta between the bfmirror usage and the partition size.

Still good.

BUT…

When I investigated last night, the partition was 100% full.

The bfmirror/downloads/ActiveDownloads directory contained a dynamic... file of (at the time) over 17GB. To clear the condition, I stopped both BESClient and BESRelay, deleted several GB of data from bfmirror/downloads/sha1, then deleted the dynamic file from ActiveDownloads. After I restarted the services, the relay completed the download, then moved the file over to the sha1 directory.
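For anyone following the arithmetic, here is a rough sketch in Python using the figures from this post. It assumes bfmirror is essentially the only consumer of the partition and that the cache was near its current size when the download started; neither is confirmed. The function names are purely illustrative.

```python
# Figures from the post, in GB as reported (rounded).
PARTITION_GB = 148
BFMIRROR_GB = 131
PAYLOAD_GB = 20  # the payload was "over 20GB", so this is a lower bound


def headroom_gb(partition_gb, used_gb):
    """Space left on the partition after current usage."""
    return partition_gb - used_gb


def fits(payload_gb, partition_gb, used_gb):
    """True if the payload fits in the remaining headroom."""
    return payload_gb <= headroom_gb(partition_gb, used_gb)


print(headroom_gb(PARTITION_GB, BFMIRROR_GB))       # 17
print(fits(PAYLOAD_GB, PARTITION_GB, BFMIRROR_GB))  # False
```

In other words, with the cache near its limit there is about 17GB of headroom, and a payload of 20GB or more cannot be staged in ActiveDownloads unless something is evicted first.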

WHY?

Why did ActiveDownloads stream in a file that would exceed the configured cache limit (and thereby fill the partition)?
Does _BESGather_Download_CacheLimitMB only govern bfmirror/downloads/sha1?
Does the service not perform space maintenance before slurping data into ActiveDownloads?

Is there a setting I should have that would have prevented this circumstance?

Thanks all!

-Andrew

FatScottishGuy · August 10, 2022, 7:58pm

Interestingly, I had something almost identical on two relays last week. Following along with interest.

atlauren · August 10, 2022, 8:03pm

Also: Would I not have this problem if the relays were Windows?

On the one hand, I like using Linux on relays because of the small footprint. On the other hand, computers should run errands for humans, not the other way around.

If there’s something inherent about relays on one platform or another, I’ll happily chuck them and rebuild so that this doesn’t happen.

AlanM · August 10, 2022, 8:27pm

A note that seems to be missing from the official documentation for _BESGather_Download_CacheLimitMB is that it needs approximately 2x its value in actual drive space, because both the action folders and the cached files use space.

https://help.hcltechsw.com/bigfix/9.5/platform/Platform/Config/r_client_set.html

What you are asking for, I believe, is a relay equivalent of _BESClient_Download_MinimumDiskFreeMB, which would protect the relay from filling its drive the way that setting does for the client. This would most likely be the same on Windows or Linux.
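Until something like that exists for relays, the same idea can be approximated from outside BigFix. Below is a minimal Python sketch of a "minimum disk free" check. It is not a BigFix setting or API, just a hypothetical external guard. The function names and the `"."` path are placeholders; point it at whatever partition holds the relay data.

```python
import shutil


def free_mb(path):
    """Free space, in MB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def download_allowed(path, payload_mb, min_free_mb):
    """True if a payload of `payload_mb` can land on `path`
    and still leave at least `min_free_mb` free afterwards."""
    return free_mb(path) - payload_mb >= min_free_mb


# "." stands in for the relay's data partition (hypothetical).
# Would a 20GB payload still leave 1GB free?
print(download_allowed(".", payload_mb=20 * 1024, min_free_mb=1024))
```

A check like this could feed a monitoring alert before a large action is deployed; it does not stop the relay itself from downloading.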

atlauren · August 10, 2022, 8:44pm

Alan, do you mean that [disk size] should be 2x _BESGather_Download_CacheLimitMB? Or unfilled [disk space]?

atlauren · August 10, 2022, 8:47pm

Vote for this idea!
https://bigfix-ideas.hcltechsw.com/ideas/BFLCM-I-129

AlanM · August 10, 2022, 8:53pm

In practice, your cache can use up to 2x the configured value. So if you have only 120GB available for downloads, I might set the value to 60GB to allow for the duplication. The same is technically true on the client, as the cache expands into the running downloads, but _BESClient_Download_MinimumDiskFreeMB helps protect against that.
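As a rough sketch of that rule of thumb in Python (the 2x factor is the approximation described in this thread, not a documented constant, and the function names are made up for illustration):

```python
def suggested_cache_limit_mb(available_mb, overhead_factor=2):
    """Cache limit that leaves room for the cache to grow to
    `overhead_factor` times its configured value."""
    return available_mb // overhead_factor


def worst_case_usage_mb(cache_limit_mb, overhead_factor=2):
    """Approximate peak disk usage for a given cache limit."""
    return cache_limit_mb * overhead_factor


# 120GB available for downloads, expressed as 120000MB
print(suggested_cache_limit_mb(120 * 1000))  # 60000

# The original setting of 128000MB on a 148GB partition
print(worst_case_usage_mb(128000))           # 256000
```

By this estimate, the 128000MB limit from the first post could need around 256GB in the worst case, well beyond the 148GB partition.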