We had an interesting problem on our Linux relays. The monitoring systems reported that the partition holding the BESRelay data was over its threshold. By the time we managed to investigate, the partition was 100% full.
This was caused by an action using a prefetch payload of over 20GB. But why didn’t the infrastructure manage the space itself and clear the way?
Partition size: 148GB
_BESGather_Download_CacheLimitMB: 128000MB
As of this writing,
bfmirror: 131G
bfmirror/bfsites: 5.4G
bfmirror/downloads: 126G
bfmirror/downloads/sha1: 126G
So far, so good, right? Downloads is less than the configured cache limit. There’s roughly 17GB of delta between the bfmirror usage and the partition size.
Still good.
BUT…
When I investigated last night, the partition was 100% full.
The bfmirror/downloads/ActiveDownloads directory contained a dynamic... file of (at the time) over 17GB. To clear the condition, I stopped both BESClient and BESRelay, deleted several GB of data from bfmirror/downloads/sha1, then deleted the dynamic file from ActiveDownloads. After I restarted the services, the relay completed the download and then moved the file over to the sha1 directory.
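To make the arithmetic explicit, here is a rough sketch using only the figures quoted above. It assumes the sizes can be treated loosely as GB and that the cache limit of 128000MB reads as about 128GB. The listing was taken after the cleanup, so this is illustrative rather than a reconstruction of the exact state at the time. The function names are made up for the example.

```python
# Figures from the post; units are treated loosely as GB (an assumption).
PARTITION_GB = 148
CACHE_LIMIT_GB = 128000 / 1000  # _BESGather_Download_CacheLimitMB read as decimal GB
BFMIRROR_GB = 131               # bfsites + downloads, per the listing


def headroom_gb(partition_gb, used_gb):
    """Space left on the partition once the existing data is counted."""
    return partition_gb - used_gb


def fits(in_flight_gb, partition_gb=PARTITION_GB, used_gb=BFMIRROR_GB):
    """True if an in-flight download fits in the remaining headroom."""
    return in_flight_gb <= headroom_gb(partition_gb, used_gb)


if __name__ == "__main__":
    print(headroom_gb(PARTITION_GB, BFMIRROR_GB))  # 17
    print(fits(10))  # True
    print(fits(20))  # False: a 20GB payload does not fit in 17GB
```

With a nearly full cache, anything larger than the ~17GB gap can only land safely if older cached files are evicted before or while the new file streams in.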
WHY?
Why did ActiveDownloads stream in a file that would exceed the configured cache limit (and thereby fill the partition)?
Does the _BESGather_Download_CacheLimitMB only govern bfmirror/downloads/sha1?
Does the service not perform space maintenance before slurping data into ActiveDownloads?
Is there a setting I should have that would have prevented this circumstance?
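While waiting on answers, one stopgap is an external check that reports the sha1 cache and ActiveDownloads separately alongside free space. The sketch below is only that: a hypothetical monitoring helper, not a BigFix feature or setting. The bfmirror root path is passed in by the caller, and reading the limit as MiB (1024 * 1024 bytes) is an assumption.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the apparent sizes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks and anything that vanished mid-walk.
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def report(mirror_root, cache_limit_mb):
    """Return sizes of sha1 and ActiveDownloads plus free space, in bytes."""
    downloads = os.path.join(mirror_root, "downloads")
    sha1_bytes = dir_size_bytes(os.path.join(downloads, "sha1"))
    active_bytes = dir_size_bytes(os.path.join(downloads, "ActiveDownloads"))
    limit_bytes = cache_limit_mb * 1024 * 1024  # assumes MB means MiB
    return {
        "sha1_bytes": sha1_bytes,
        "active_bytes": active_bytes,
        "free_bytes": shutil.disk_usage(mirror_root).free,
        "limit_bytes": limit_bytes,
        # Whether ActiveDownloads counts against the limit is the open question,
        # so only the sha1 cache is compared here.
        "sha1_over_limit": sha1_bytes > limit_bytes,
    }
```

Calling `report(<path to bfmirror>, 128000)` from cron and alerting when `active_bytes` approaches `free_bytes` would at least flag the condition before the partition fills.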
Thanks all!
-Andrew