RFE - Site Gather Scheduling

Today I created an RFE to allow scheduling of site gathering. The idea is to control when clients receive site updates, so the downloads can happen outside business hours, bandwidth to remote sites can be managed, and the SAN performance impact in VM farms can be reduced.

When either IBM or internal custom site updates are published, they immediately go to all endpoints. For organizations with many remote sites and WAN links, this can generate significant WAN traffic, and/or SAN traffic in highly virtualized environments, at inopportune times. Allowing site data to be gathered according to the client's local time would 1) naturally distribute site updates by timezone, 2) allow off-hours download of metadata, and 3) avoid hammering SANs that host VMs.

In our case, we are a global company with over 1000 sites spanning Asia and North America. We get beat up about WAN performance when sites are updated during the business day. The reason for scheduling based on the client's local time is that the business day at our Hong Kong sites, for example, falls during off hours for our North American sites, and vice versa.

Please vote for this request for enhancement.

This is similar to requests raised in other threads on this forum.

3 Likes

The Bandwidth Throttling Wiki specifically calls out site gathering as not throttled:

“When the Client asks the Relay “please tell me the latest contents of site X” (logged as ‘GatherActionMV command received. Version difference, gathering’), the interaction is not throttled. The response of the Relay is typically small (anywhere from 0-~40k). If absolutely necessary, you can turn down the gather intervals on Clients to get this information less frequently, but this traffic should usually be negligible.”

[Image: BESsiteWAN - WAN traffic spike observed when a large site is published]

This is a real-world example of what happens on a large WAN when a large site is published (either by IBM or internally). This is why we’re looking for site gather scheduling ability. Please vote for my RFE.

2 Likes

You can limit this by limiting how quickly Relays send UDP notifications to clients, or by not using UDP at all and relying on command polling instead, though I wouldn't generally recommend that approach.

It seems that limiting the speed of UDP notifications is a very good thing for large environments where this is likely to be an issue, and it helps mitigate exactly the problem you are referring to with SANs and WANs.


Also, to clarify, it sounds like you are asking for scheduling of update notifications at the Relay level, so that Relays in one region would tell clients about new things on one schedule, while Relays in another region would do so on a different schedule, to match region-specific business hours.

One problem with this approach: if the Relay doesn't notify any clients of any changes at all until a window is hit, then the number of changes and the number of clients notified when that window arrives will be much larger than it would be otherwise, so you would still need to limit the UDP notifications as mentioned above; otherwise this situation will be even worse. What you really need is to spread the load out more evenly over time, by slowing down UDP or using command polling.
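For reference, command polling is controlled by a pair of client settings; here is a minimal ActionScript sketch, with an illustrative once-a-day interval (not a recommendation):

// Enable command polling as a fallback for UDP notifications
// (the 86400-second interval is illustrative only)
setting "_BESClient_Comm_CommandPollEnable"="1" on "{parameter "action issue date" of action}" for client
setting "_BESClient_Comm_CommandPollIntervalSeconds"="86400" on "{parameter "action issue date" of action}" for client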

BigFix by default wants to get changes to clients as quickly and efficiently as possible. This is a case of BigFix working too well and overwhelming a WAN (due to the lack of a Relay behind it) or a SAN (due to increased disk I/O).

Client Settings for Relays to Limit UDP speed:

_Enterprise Server_ClientRegister_BatchCount

  • ( unique value of (it as integer) of values of settings "_Enterprise Server_ClientRegister_BatchCount" of clients | 100 )

_Enterprise Server_ClientRegister_BatchDelay

  • ( unique value of (it as integer) of values of settings "_Enterprise Server_ClientRegister_BatchDelay" of clients | 1000 )
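As a sketch of how these could be deployed (the values here are illustrative, not recommendations), an action like the following could be targeted at Relay computers only:

// Relax UDP notification batching on Relays: 10 clients per batch, 10 seconds between batches
setting "_Enterprise Server_ClientRegister_BatchCount"="10" on "{parameter "action issue date" of action}" for client
setting "_Enterprise Server_ClientRegister_BatchDelay"="10000" on "{parameter "action issue date" of action}" for client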


1 Like

The following relevance will give you approximately how long it will take a particular Relay to notify 2000 clients of changes with UDP, assuming all clients receive their UDP notifications successfully:

Q: (it * second) of (it / 1000) of ( ( /* Number of Clients on a Relay: */ 2000 / ( unique value of (it as integer) of values of settings "_Enterprise Server_ClientRegister_BatchCount" of clients | 100 ) ) * ( unique value of (it as integer) of values of settings "_Enterprise Server_ClientRegister_BatchDelay" of clients | 1000 ) )
A: 00:00:20
T: 0.221 ms
I: singular time interval

By default, the time taken to notify 2000 clients is ~20 seconds.

If you do the same calculation with 10 clients per batch and 10 seconds between each batch, then you get a time of 33 minutes for all of the UDP notifications to go out:

Q: (it * second) of (it / 1000) of ( ( /* Number of Clients on a Relay: */ 2000 / ( /* Endpoints Per Batch: */ 10 ) ) * ( /* Batch Delay in ms: */ 10000 ) )
A: 00:33:20
T: 0.055 ms
I: singular time interval

Using Session Relevance:

Number of Clients per Relay:

(multiplicity of it, it) of unique values of relay servers of bes computers

Approximate time for UDP notifications per 1000 endpoints per relay:

(it * second) of (it / 1000) of ( (1000 / item 0 of it) * item 1 of it ) of (unique value of (it as integer) of values of client settings whose(name of it = "_Enterprise Server_ClientRegister_BatchCount") of it | 100, unique value of (it as integer) of values of client settings whose(name of it = "_Enterprise Server_ClientRegister_BatchDelay") of it | 1000) of bes computers whose(relay server flag of it OR root server flag of it)

If you take this value and multiply it by the (number of endpoints per relay / 1000) then you would get the approximate total time for all UDP notifications by that relay.
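As a quick sanity check in the Fixlet Debugger, for a hypothetical relay with 3000 registered clients using the 10 / 10000 values from above:

Q: (it * second) of (it / 1000) of ( ( /* Number of Clients on a Relay: */ 3000 / ( /* Endpoints Per Batch: */ 10 ) ) * ( /* Batch Delay in ms: */ 10000 ) )
A: 00:50:00

That is 3 times the ~16.7 minutes per 1000 endpoints that the session relevance above would report for those settings.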

Thanks for the helpful post. I had tried some minor adjustments to those settings before, but was afraid to make too drastic a change. I'll continue testing. Thanks!

1 Like

The worst case should be clients responding more slowly to new actions / new content, but it sounds like that is what you want.

You could make the changes more significant on Relays serving many clients that sit on the same SANs or behind slower WAN links. You may not need to make these changes on all Relays, but it does seem like a good idea to relax the values at least a little everywhere, and more so on the problem Relays.

You’ll have to experiment a little to figure out what works best for your environment: either smaller batches with smaller delays between them, or larger batches with larger delays between them. In either case, look at the calculation of the total time required per 1000 endpoints and make sure that time is “slow enough”, since the default of 10 seconds per 1000 endpoints is clearly too fast in these cases.

Yes, I’ll experiment more with those settings.

Even if they work, I’m still advocating for the RFE as it would be nice to schedule times - non-business hours - for certain updates.

Slowing the notifications is also universal across all sites - which cuts both ways. It would be handy to be able to prioritize by site.

1 Like

I do agree that having some sites be more responsive than others in terms of updates makes sense, but I still think there are problems with having all updates wait for a certain window of time. It could make the situation worse instead of better if UDP is not also limited.

Also, it is a good idea to have a link to this forum post in the RFE itself, and vice versa.

Also, what happens if a client is never awake during the gather window? Should there be a maximum delay that triggers an immediate gather?

This is not perfect and will miss some relays in some cases, but this session relevance attempts to give the time per 1000 endpoints for UDP notifications, plus the number of computers per relay, plus the relay itself:

unique values of (it as string) of (items 1 of items 0 of it, tuple string items 0 of items 1 of it, tuple string items 1 of items 1 of it) of (items 1 of it, elements of items 0 of it) whose(item 1 of it contains item 0 of item 0 of it) of ( it, ( relay hostname of it | hostname of it, (it * second) of (it / 1000) of ( (1000 / item 0 of it) * item 1 of it ) of (unique value of (it as integer) of values of client settings whose(name of it = "_Enterprise Server_ClientRegister_BatchCount") of it | 100, unique value of (it as integer) of values of client settings whose(name of it = "_Enterprise Server_ClientRegister_BatchDelay") of it | 1000) of it ) of bes computers whose(relay server flag of it OR root server flag of it) ) of sets of (it as string) of (multiplicity of it, it) of unique values of relay servers of bes computers

This goes a step further and attempts to give the time for UDP notifications to go out based upon the actual number of clients using a particular relay:

( (item 0 of it * item 1 of it as integer) / 1000, items 1 of it, items 2 of it) of (items 1 of items 0 of it, tuple string items 0 of items 1 of it, tuple string items 1 of items 1 of it) of (items 1 of it, elements of items 0 of it) whose(item 1 of it contains item 0 of item 0 of it) of ( it, ( relay hostname of it | hostname of it, (it * second) of (it / 1000) of ( (1000 / item 0 of it) * item 1 of it ) of (unique value of (it as integer) of values of client settings whose(name of it = "_Enterprise Server_ClientRegister_BatchCount") of it | 100, unique value of (it as integer) of values of client settings whose(name of it = "_Enterprise Server_ClientRegister_BatchDelay") of it | 1000) of it ) of bes computers whose(relay server flag of it OR root server flag of it) ) of sets of (it as string) of (multiplicity of it, it) of unique values of relay servers of bes computers

A further refinement is to only get this info for computers that have checked into BigFix within the past 48 hours (computers that have not checked in for too long will not have a client registration and will not get UDP messages sent to them):

( (item 0 of it * item 1 of it as integer) / 1000, items 1 of it, items 2 of it) of (items 1 of items 0 of it, tuple string items 0 of items 1 of it, tuple string items 1 of items 1 of it) of (items 1 of it, elements of items 0 of it) whose(item 1 of it contains item 0 of item 0 of it) of ( it, ( relay hostname of it | hostname of it, (it * second) of (it / 1000) of ( (1000 / item 0 of it) * item 1 of it ) of (unique value of (it as integer) of values of client settings whose(name of it = "_Enterprise Server_ClientRegister_BatchCount") of it | 100, unique value of (it as integer) of values of client settings whose(name of it = "_Enterprise Server_ClientRegister_BatchDelay") of it | 1000) of it ) of bes computers whose(relay server flag of it OR root server flag of it) ) of sets of (it as string) of (multiplicity of it, it) of unique values of relay servers of bes computers whose(last report time of it > (now - 2*day) )

I would say you probably don’t want any relay to have a result over an hour unless you really know you need it to be that slow. Given that the default works out to around 20 seconds, you probably don’t need to go to 1 hour+ to get a noticeable reduction in peak load.

The approach @jgstew laid out is logical and rational. As our business was closed for the holiday, it was the perfect time to test adjustments without impacting anyone. The test results, however, deviated significantly from my expectations.

Some background on my test: We have 5 central relays in an affiliation hierarchy. We have just under 1100 remote locations on a large WAN. The design is simple with remote relays reporting to central relays that report to the root server. I have groups of systems by role. Each role has one or more systems per location. Each site has at least one relay (larger sites have more). In my testing, I would send a blank action to a role group which contained one computer per remote location.

I initially focused my testing efforts on the central/top level relays as notifications to the remote sites go through them to the field. After each change in value, I would cycle the relay service to ensure the new value was effective.

Values I tried for the top-level relays for BatchCount and BatchDelay respectively:
100, 1000 (this is the default)
10, 10000
10, 30000
5, 60000
5, 120000
5, 180000
5, 300000
5, 600000
4, 900000

The earlier values should have spread the UDP notifications out by a few minutes, and the later values by more than two days. The surprising thing was that the values seemed to have little effect on the actual notification results: in each case, all ~1100 targeted machines received their UDP notification within about 5 minutes. Using a WAN monitoring tool, I was able to see the communications spike.
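For reference, checking the most extreme values against the calculation from earlier in the thread:

Q: (it * second) of (it / 1000) of ( ( /* Number of Clients: */ 1100 / ( /* Endpoints Per Batch: */ 4 ) ) * ( /* Batch Delay in ms: */ 900000 ) )

works out to 1100 / 4 = 275 batches × 900 seconds = 247,500 seconds, or roughly 2.9 days, so the expectation itself was sound.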

Then I thought perhaps this needed to be at the root server as well. Values I tried on the root server (cycling the root service between each):
100, 1000 (default)
20, 10000
10, 30000
10, 60000

Again, in each case, all ~1100 targeted machines received their UDP notification within about 5 minutes, and the WAN monitoring tool continued to show the communications spike.

So I feel like I’m missing something here.

Hey JonL,

I read over your thread and have an odd idea you might want to consider…

  1. Set up a scheduled task in Windows / cron job in Linux (or even a BigFix task that runs at a given time) that sets a firewall rule to block outbound UDP traffic on your relay infrastructure, then re-enables outbound UDP traffic at another time.
    (Keep in mind that if you use Windows Task Scheduler or cron, you will need to do the math to account for timezones if your BigFix environment spans several of them, to ensure you are disabling and re-enabling at the correct times to meet your needs.)

The reason I recommend this (granted, it is a very odd solution): by making the “disable rule” at the relay level, should something occur in your environment where you need to get a task from BigFix down to the clients ASAP, you would only need to remove your “disable rule” at the relay level (changing a setting on a few devices instead of potentially thousands of clients).
Also, by disabling only OUTBOUND UDP traffic on your relays, you could send that “enable rule” to your relays via BigFix. That way, if you need something in an emergency, you just issue your “enable rule” action, then immediately issue your other action.
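For example, on a Windows relay, the “disable rule” could look something like this hypothetical ActionScript sketch (assuming the default BES UDP notification port of 52311; the rule name is made up):

// Hypothetical "disable rule": block this relay's outbound UDP notifications (BES default port 52311)
waithidden netsh advfirewall firewall add rule name="Block BES UDP notify" dir=out protocol=UDP remoteport=52311 action=block

And the matching “enable rule” would simply delete it:

// Hypothetical "enable rule": restore outbound UDP notifications
waithidden netsh advfirewall firewall delete rule name="Block BES UDP notify"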

Another positive of this odd approach is that it would not require restarting the BESClient / BESRelay services. When those services start, they immediately call out to their relay, which could be part of the reason you got the results you did when testing over the holiday. (At least the clients in our BigFix environment do, but that could potentially be a client setting.)

Also, this approach lets you be very granular about when you disable outbound UDP traffic: you could disable it during all business hours, only during lunch hours, or even every other hour if there is an incident in your company’s data center and you need to lessen the load.

It might not be the right solution for your needs, but I figured I’d at least mention it.

Happy Holidays! Let me know how you make out either way as I find this topic interesting!

I appreciate the idea and I’ve considered it several times. The challenge is that not all notifications are ‘bad’; some are good and necessary. I’d love to be able to be selective and schedule when notifications are sent based on site membership or other criteria. That brings me full circle back to the RFE that I posted. Please vote if you would like granular, configurable notifications.

I have a case open now to explore the BatchCount and BatchDelay settings further. I feel like something is missing when I set these values. Does anyone have experience setting them in a mid-sized environment? Are they actually working?

As for those settings, unfortunately I do not have experience with them.

It’s possible I’m missing some notifications that are sent to devices via UDP in your environment but not used in mine.

I think this might be where you are running into an issue. A top-level relay needs to send only a few UDP messages: it only informs the downstream relays, or the clients that have registered directly with it, so staggering those would have little effect. I believe you’d only see a benefit from staggering UDP notification batches on relays that have a large number of clients directly registered to them; in your case, the remote relays.

(I’m not certain, but the relay-to-relay notifications might not even use UDP; they could be going over the persistent TCP connection that is established between relays.)

I still think there’s something we are both missing, though. There shouldn’t be that large a WAN hit for site gathers, since only the relays should need to traverse the WAN for gathers, and only a single instance of each file download per relay should be necessary for action downloads. Where I’ve needed to stagger UDP notifications, it wasn’t for WAN bandwidth; it was because heavily subscribed client VMs hammered their shared SAN connections as they all wrote the site content locally.

1 Like

Ah, that’s very much related and I missed that on first reading.

The UDP notification batches apply individually to each relay; they aren’t sent individually from the root server, and they don’t coordinate between relays. When one endpoint in each site is targeted, the remote relay only needs to send one UDP notification, which is sent immediately.
Whatever stagger is happening is just the stagger from the top-level relays to the remote relays.

If your 1100 remote relays are evenly spread among your 5 central relays, that’s 220 child relays per parent. In effect you would be staggering 220 UDP notifications, not 1100 of them.

But again, as this is relay-to-relay, I’m not sure UDP is even used at that level. Once the remote relays know about the site update, each is still staggering only one UDP notification, to the one targeted endpoint in its site.

You might try setting the UDP delay on a remote relay in one site, then sending an empty action to all clients in that site to observe the difference.

1 Like

Interesting thinking, Jason. I agree that the UDP notification settings seem designed for tuning local relay-to-client communications, not top-level-to-remote-relay communications. That would explain why nothing seems to change even when I set extreme values for BatchCount and BatchDelay.

Does anyone know how to affect the staggering of notifications between relay levels?

As Jason observed, a key ingredient in our environment is remote locations with relays that service a small number of local clients. We have over 2200 remote relays, many of which serve 6-20 local clients. (There is a small number of larger sites too.) That’s why I focused my testing on dynamic groups that span nearly 1100 locations.

Can you try a network capture at a top-level or remote relay to see whether relay-to-relay notifications actually use UDP? I’m not sure which protocol is used, but if it behaves just like the client UDP notifications, then I think you are on the right track, though you may need extreme values on the root server and top-level relays to account for the notifications occurring in parallel across the relays.

I think a good RFE would be to have the bandwidth throttling options apply to site gathers as well.

Going back to your original RFE - have you considered emulating a gather schedule manually?

You could block your root server’s access to sync.bigfix.com via a firewall, a proxy, or by putting a dummy address for it in your server’s hosts file. If you are ready to upgrade to 9.5.11, there’s a new server setting to block syncing from the Internet, which would remove the need for blocking access manually.

Then you’d use the AirgapTool to manually download the site content and import it into your root server. This can be automated through scheduled tasks or via BigFix itself.

You could also try the dummy IP address in the hosts file, with a scheduled action to add/remove the entry on demand, and restart the BESGather service after changing it.
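For instance, the dummy entry could be as simple as pointing the gather host at the loopback address (an illustrative hosts line, not a specific recommendation):

127.0.0.1    sync.bigfix.com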

I’ve considered a manual gather process, but have decided against it (at least for now). It would only partially address the issue and would be a hassle to manage.

Sources for notifications can be any of the following: Actionsite, client mailboxes, internal custom site changes/additions, or external site changes/additions. Manual gather of external sites wouldn’t address the other items.

That’s why, in the RFE, I’m suggesting a scheduling and/or staggering option per site. If I had that choice, all the external sites would be relegated to gathering outside business hours, and many custom sites would be as well. However, I’d probably keep a small ‘high priority’ custom site that would always sync immediately and be available to deal with mid-day emergencies.

Understood, and yes I agree those limitations would apply.

I’d expect the actionsite, custom sites, and operator sites to be much smaller than the external content sites, though, so hopefully gathering updates to those sites would have less impact on your bandwidth. Just something to consider for the short term.

I have policy actions in my deployment to automate the airgap tool; once it’s set up, it runs silently and only needs to be modified when you change the list of sites to gather or upgrade root server versions.

I’d really like for bandwidth throttling to apply to site gathers; I think that would offer a solution requiring less configuration and maintenance once set.