Relays reporting back slower than normal clients

Hey guys, quick one for anyone who may have an answer. I have about 1500 stores with 2 servers at each location, a Primary and a Backup. The Backup server is the internal relay for each location; it talks to one of 5 parent top-level relays before reporting back to the main server. I have tons of jobs that run each night through Server Automation, but I don’t believe the issue is with Server Automation itself.

Each night I have between 20 and 70 failures, and about 95% of those failures are Backup servers. The only difference between the Primary and the Backup is that the Backup has the relay installed on it. The Primary server at each store almost always completes on time. Note: when I say failures, I just mean that the endpoint didn’t complete its action within the timeout specified in Server Automation.

I am going to restart the relay and the BES client on each of the servers that I found to be near-daily repeats and see what happens. Has anyone experienced anything like this before, where clients on a box with a relay tend to respond minutes slower than another client that flows through the same relay at the same location? Thanks in advance :smile:

Be aware that relays run some content that other endpoints don’t; it is there to help keep the deployment healthy. If your action windows are too small, you can see this type of issue. Also, as a reminder, content won’t run precisely at the scheduled time, so how big are these windows of time you have?

I have turned up the CPU usage on all of these machines so they evaluate content faster, up to 10%. Do you think it would be wise to crank up just the Backup servers a little more, maybe to 15%?

The window for each job is very small, around 10 minutes, but it has to be that way to get the result we need and still run as many jobs a night as we do. It is not pretty, but it serves its purpose and accomplishes what we need to get completed each night.

10 minutes is way too small.

What is the average duration of evaluationcycle of client?

If that is greater than 10 minutes, then the content will often not even notice that its time window has opened before the window closes again.

Is this a legitimate way of telling?

less significance 3 of (average of evaluationcycle of client as floating point / 1000 / 60)

I don’t have the option to make the windows larger because of the number of jobs that need to run in a relatively small amount of time. Do you think the CPU usage increase would give it the ability to chug through and make that evaluation-loop average smaller? Thanks for the help, Alan.

average is the old one, while average duration returns a time interval, which formats better and has higher granularity:

https://developer.bigfix.com/relevance/reference/evaluation-cycle.html#average-duration-of-evaluation-cycle-time-interval
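Putting the two side by side (a sketch based on the linked reference; paste into the Fixlet Debugger in QnA mode to compare on a client):

```
// Older form: average in milliseconds, converted to minutes by hand
q: less significance 3 of (average of evaluationcycle of client as floating point / 1000 / 60)

// Newer form: returns a formatted time interval directly
q: average duration of evaluationcycle of client
```

If the second query returns something larger than your action window (e.g. more than 10 minutes), the client may not even notice the window before it closes.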

CPU usage would come back to the same problem: if you have a 10-minute window and you increase the CPU time via an action, the action may not become relevant and execute within that 10-minute window.

I would apply the CPU usage setting outside of that short window. By the time the small 10-minute windows hit, the CPU usage would already be higher.
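One way to do that is a policy action that raises the client settings ahead of the job windows. This is a sketch in BigFix action script; the specific values are assumptions, based on the rule of thumb that CPU% ≈ WorkIdle / (WorkIdle + SleepIdle), so 50/500 ≈ 10%:

```
// Sketch: raise client CPU usage to roughly 10% via client settings.
// _BESClient_Resource_WorkIdle and _BESClient_Resource_SleepIdle are in milliseconds.
setting "_BESClient_Resource_WorkIdle"="50" on "{parameter "action issue date" of action}" for client
setting "_BESClient_Resource_SleepIdle"="450" on "{parameter "action issue date" of action}" for client
```

Applied as an open-ended policy action well before the nightly windows, the higher CPU budget is already in effect when the jobs fire, instead of depending on another action becoming relevant inside the 10-minute window.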

Yes, that will indeed help if the client is running faster.

Just a heads up: I found that the majority of the failures were repeats from some point within the past week. After restarting the relay service and the BES client on those endpoints, failures dropped by 75% and things are looking good. I did not have to modify CPU usage any further.
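For reference, the restart itself can be scripted. These are the default Windows service names, which is an assumption about your builds (verify in services.msc):

```
net stop BESRelay
net start BESRelay
net stop BESClient
net start BESClient
```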

I would bet that some of these jobs could run concurrently, or don’t need Server Automation at all and could instead be deployed as a baseline that would run them back to back more quickly, without as much waiting around.

Also, as for your original problem, I would bet you have some expensive relevance in a relay-specific analysis whose reporting period is too aggressive.