Different minimum reporting interval for a Client on regular endpoints vs. a Client on a Relay

Hey guys, I have been working at this for quite some time and haven’t had much luck. We are running 9.2.9.36, but only as of last week; before that we had 9.2.2.21. The same issue occurs on both, and probably did before 9.2.2.21 as well.

I need the clients in our setup to report back fairly fast. I have two types of machines: normal servers with just a client, and servers with both a relay and a client on them. Each location has one of each, and the normal client’s parent relay is the device at that location with the client/relay installed. It is the same story at every location: even though both clients use the same relay, the device without the relay installed reports back way faster than the client that is local to the relay.

Normally, after the GatherHashMV shows up in the log, the action is retrieved and run fairly fast, but then the report after that can take up to 12 minutes or so on the clients with a relay installed. The devices with only clients on them normally report back within 5 minutes of the action, or faster. It seems like having a relay installed on a device is delaying that client’s reporting. The two have essentially the same settings, other than the extra relay settings. Any help would be appreciated, thanks folks!

Here is a link to my last thread on this; sorry for opening up a new one, I forgot I had posted it.

I already cranked up the CPU usage, which seems to help things evaluate faster. I also set the minimum reporting interval to 45 seconds, which doesn’t appear to affect anything; the reporting still takes a long time compared to the non-relays.
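
For reference, these are the settings I’ve been adjusting (the names are the standard client settings as I understand them; other than the 45, the values here are just illustrations of what “cranking up CPU” looks like, not recommendations, and if I remember right the defaults are 10, 480, and 60 respectively):

_BESClient_Resource_WorkIdle = 20 (milliseconds of work per evaluation slice; higher means more CPU for evaluation)
_BESClient_Resource_SleepIdle = 300 (milliseconds of sleep between slices; lower means more CPU)
_BESClient_Report_MinimumInterval = 45 (seconds; the floor on time between consecutive reports)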

If a Client on endpoint A reports to a Relay on endpoint A, and a Client on endpoint B reports to the Relay on endpoint A, their latency should be about the same.

The main difference is that the client on the endpoint with a Relay may have more content to evaluate and more to report on than the other client, even if they are the exact same OS, etc.

Compare the average evaluation loop time for the two, the sites they are subscribed to, etc., and you may see differences.

Are there any other settings I can change to tweak things, other than CPU usage and the minimum reporting interval? I will check the loop time on the two, or maybe I will try upping the CPU usage yet again.

No, those are the only things you can tweak. You can check the following to see the difference in how much work the endpoints are doing:

average duration of evaluation cycle of client

https://developer.bigfix.com/relevance/reference/evaluation-cycle.html#average-duration-of-evaluation-cycle-time-interval
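
If you run that in the Fixlet Debugger (QnA) on both kinds of endpoints, you’ll get something like the following; the answer shown here is made up, but a client hosting a relay will often show a noticeably larger value:

q: average duration of evaluation cycle of client
A: 00:02:07

The longer the average evaluation cycle, the longer the client can take to get around to generating its next report.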

It is very possible that you have content that only relays evaluate that is slow. Look at analyses that are only applicable to relays. I wouldn’t be surprised if one of the old ones I wrote is the cause.

You could also look at reducing the site membership for relays so that they have less content to evaluate.

If the relays have spinning disks, it is possible that they are more IO constrained than the clients reporting to them.

One of the items would be a relay health property that sums up the files in the upload, download, and other directories for the relay health check. Any disk operation like this could be fairly slow.

Thanks for the replies folks, I appreciate it.

I checked the three relay analyses that are activated:

BES Relay Cache Information
BES Relay Status
BES Relay Cache

I pasted all of the properties from the analyses above into the debugger and ran them on a few endpoints. If I evaluate using the local debugger, it calculates very fast (only takes a second) and shows an average of 160 ms for all of the properties combined. If I evaluate using the Client, it takes over a minute to process, but then shows an average of 5 ms. Not sure why it takes longer but then reports a quicker evaluation time.

Looks like the property with the longest evaluation time is from the BES Relay Cache analysis:

if (exists folder "bfmirror/downloads/sha1" of folder (value of setting "_BESRelay_HTTPServer_ServerRootPath" of client)) then ((name of it & "|~|" & size of it as string & "|~|" & accessed time of it as string) of files of folder "bfmirror/downloads/sha1" of folder (value of setting "_BESRelay_HTTPServer_ServerRootPath" of client)) else nothing

Should I just turn on verbose logging on one of these so I can see what is evaluating in the loop? Will it give me the actual timings of each property evaluated? That might be the way to go. I’m not sure what the repercussions would be if I disabled the analysis above that contains the property with the longest evaluation time.

That seems pretty quick.

This is due to the way client evaluation works. You don’t actually evaluate against the client directly; instead, your relevance is added to the client’s queue, and the client gets around to it when it finishes what it is working on.

yes

yes
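
For the logging, the usual knobs are the EMsg debug settings on the client, along these lines (the path is just an example; restart the BES Client service after setting them, and turn them back off when you’re done since the log grows fast):

_BESClient_EMsg_Detail = 10000
_BESClient_EMsg_File = C:\bigfix_debug\client_debug.txt

At a detail level of 10000 the client logs pretty much everything it evaluates, which should include the per-item timings you’re after.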


You can also look at how much of the client’s eval loop is spent on the different types of work using inspectors, as well as the average eval loop time.

See here:

(I edited your post to put the code formatting around the variable names)

The analyses you cite can drastically impact Server and Relay performance. They will have little or no impact on other clients. I recently encountered that in my deployment as well. If you check what that relevance is doing,

if (exists folder "bfmirror/downloads/sha1" of folder (value of setting "_BESRelay_HTTPServer_ServerRootPath" of client))

…this setting only exists on Relays and BES Servers, so any other client stops evaluating right there.

((name of it & "|~|" & size of it as string & "|~|" & accessed time of it as string) of files of folder "bfmirror/downloads/sha1"

…examines every file in the sha1 cache folder, retrieving its name, size, and last accessed time. This can be expensive, especially if you have configured large cache sizes. (Because my environment is air-gapped, I have to keep a large cache size; I’m at 1.5 TB now, which translates to a couple hundred thousand files in the sha1 folder.)
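
If you want to gauge that cost on a given relay before deciding what to do with the analyses, a cheap spot check is to just count the files, using the same folder reference the property uses:

number of files of folder "bfmirror/downloads/sha1" of folder (value of setting "_BESRelay_HTTPServer_ServerRootPath" of client)

If that comes back in the hundreds of thousands, any property that walks that folder is doing a lot of filesystem work every time it evaluates.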

If you look at those analyses further, you’ll also see that those properties are configured to be processed On Every Report. So every single time it goes through the evaluation loop, it processes these files.

(IBM folks, maybe that default analysis interval could be turned down to daily, or a warning added to the description that it should be used for debugging only?)

Also, if you have the sites activated, look in both BES Support and BigFix Labs for these kinds of analyses. If I recall correctly, there are analyses in both sites that monitor the sha1 cache, retrieve overlapping information, and are set to Every Report, so you might actually be cycling through the sha1 folder several times on every report.

If the var names are within relevance or other code, you should instead put all of the code in code blocks, not just the var names. I re-edited the post to reflect this.

It is a minor thing, but I also edited your post to use something with syntax highlighting, which I prefer because I’m picky. Now I see myself making another video.

I doubt daily would fly, but possibly slowing it down a bit would; every report seems a bit over the top. I’ll see what people think.

Much appreciated!
The analyses in BES Support look ok; not sure if I was mistaken or it’s already fixed, but “BES Relay Cache Information” has all of the report periods set to 1 day.

BigFix Labs might benefit from an update. “BES Relay Cache” is all set to Every Report.

I can’t affect the Labs site, but I’ll point it out. I thought the Server/Relay content was already set to fairly infrequent evaluation periods, so that’s good to know.

Ok guys, I think I found it. I cranked the EMsg logging up to 11 and found an analysis that was applicable to both my Primary and Backup servers, and the property in question was querying a folder with hundreds of thousands of files on the Backup servers on every report. There are only a few hundred files on the Primary servers, which is why only my Backup servers were affected. Notepad++ helped with viewing the log; I didn’t realize it formatted the log so nicely, very easy to read. Now I need to go back and possibly change the minimum reporting interval on the Backup servers back to normal, 60 seconds I think.
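
For anyone following along, my plan is to set it back with a settings action along these lines (standard action script syntax; 60 seconds is what I believe the default is):

setting "_BESClient_Report_MinimumInterval"="60" on "{parameter "action issue date" of action}" for client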

Thanks for the help guys, much appreciated. This frigging property in the analysis was causing upwards of 10-minute delays in status reporting back to the main server.

Does the minimum reporting interval only apply during an action? From what I read, it looks like the client only reports action status while the action is running, not after it completes.

The minimum report interval simply controls how short the time between consecutive reports can be. Reports go up at varying intervals, and the higher the minimum interval, the less granularity you will see on action status, etc. For example, with a 60-second minimum, a client that finishes an action a few seconds after its last report will hold the next report (and the action result) until the 60 seconds are up.