Stagger External site distribution

JaviCO · November 22, 2016, 5:19pm

Hello,

I’m looking for a way to stagger the distribution of ‘external sites’.
Does anybody know whether there is a server setting or a parameter in the ‘administration tool’ that would achieve this?

Our VM department has grouped hosts into centers by OS, and now any update to the associated external sites causes performance issues (high disk I/O load).

Any suggestion is welcome!
Regards.

JasonWalker · November 24, 2016, 2:23am

There are two custom settings that can be applied to their Relays to delay their notifications to clients that new site content exists.
One setting controls how many clients are notified per batch, and the other controls a delay between each notification batch.
I encountered this same issue and was able to work around it by manipulating these values to stagger client notifications over a 15-minute period.
But I’m out of the office this week, can’t recall the setting names, and they’re not listed on the Client Configuration Settings wiki page.

Maybe this will jog someone else’s memory.

jgstew · November 28, 2016, 7:26pm

CC: @AlanM @BigFixNinja

JasonWalker · November 29, 2016, 4:18pm

_BESRelay_ClientRegister_BatchCount
This setting controls the number of UDP pings the IEM relay will send before delaying for a period of time. The length of the delay is controlled by _BESRelay_ClientRegister_BatchDelay. This setting could be used to limit the rate at which an IEM relay sends out UDP pings if this network traffic is harmful in some way.

Default Value: 100
Setting Type: Numeric (number of pings)
Value Range: 1-4,294,967,296
Task Available: No

_BESRelay_ClientRegister_BatchDelay
This setting controls how long the IEM relay will wait between batches of UDP pings sent to IEM clients. This setting could be used to limit the rate at which an IEM relay sends out UDP pings if this network traffic is harmful in some way.

Default Value: 1000
Setting Type: Numeric (milliseconds)
Value Range: 1-4,294,967,296
Task Available: No

For Relays supporting large sets of Virtual Clients, the default values for these settings are modified:

_BESRelay_ClientRegister_BatchCount Set batchcount to '5'
_BESRelay_ClientRegister_BatchDelay  Set batchdelay to '6000'

JaviCO · November 29, 2016, 4:39pm

Hello Jason,

I am out of office too! I will test your solution as soon as possible.

Many thanks!

jgstew · December 1, 2016, 8:30pm

Here is some relevance to help calculate the approximate number of batches and delay required for a relay to reach all of the registered endpoints with UDP notifications.

Ideally there would be an easier way to get the actual number of endpoints currently registered to a relay to make the numbers accurate for the particular relay. It would also be useful if this could be calculated with Session Relevance.

Number of batches:

(it / ( ((it as integer) of value of settings "_BESRelay_ClientRegister_BatchCount" of clients) | 100 ) ) of /* Assumed number of registered clients to relay -> */ 1000

Time to reach all endpoints:

(it * millisecond) of (it * ( ((it as integer) of value of settings "_BESRelay_ClientRegister_BatchDelay" of clients) | 1000 ) ) of (it / ( ((it as integer) of value of settings "_BESRelay_ClientRegister_BatchCount" of clients) | 100 ) ) of /* Assumed number of registered clients to relay -> */ 1000
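The same arithmetic can be sketched outside of relevance. The Python below is only an illustration: the function name and the 1000-client figure are assumptions, the client count has to be supplied by hand, and it rounds the batch count up (the relevance above uses integer division, which drops a final partial batch). It also assumes, like the relevance, that every batch is followed by one full delay.

```python
import math

def notification_spread(clients, batch_count=100, batch_delay_ms=1000):
    """Estimate (batches, seconds) for one relay to ping all of its clients.

    Defaults mirror the documented defaults of
    _BESRelay_ClientRegister_BatchCount and _BESRelay_ClientRegister_BatchDelay.
    """
    # Round up so a final partial batch is still counted.
    batches = math.ceil(clients / batch_count)
    # Approximation: one full delay per batch, as in the relevance above.
    seconds = batches * batch_delay_ms / 1000
    return batches, seconds

# Assumed 1000 registered clients on the relay.
print(notification_spread(1000))            # default settings
print(notification_spread(1000, 5, 6000))   # values used for virtual clients
```

With the defaults, 1000 clients are covered in 10 batches over roughly 10 seconds; with BatchCount 5 and BatchDelay 6000 the same clients take 200 batches over roughly 20 minutes.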

After being notified, the client could have its own delays due to other processing occurring at the same time. As the clients may not all delay the same amount of time, this could cause overlaps when the clients actually gather the site.
The clients may take longer to write the site contents to disk, so that as new clients begin, earlier clients are still working on it.
The clients may be reporting to different relays. Each relay handles its own set of client notifications, so if the clients are reporting to different relays their notification schedules can overlap.
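To put a rough number on the multiple-relay point, here is a small Python sketch of the worst case. It assumes every relay starts notifying at the same moment and runs its own batch schedule independently; the function name and the relay counts are made up for illustration.

```python
def peak_notifications_per_minute(relays, batch_count=100, batch_delay_ms=1000):
    """Worst-case combined notification rate when all relays notify at once."""
    # Each relay sends batch_count pings, then waits batch_delay_ms.
    per_relay = batch_count * 60_000 / batch_delay_ms
    # Assumption: the relays' schedules overlap completely.
    return relays * per_relay

print(peak_notifications_per_minute(1, 5, 6000))  # one relay
print(peak_notifications_per_minute(4, 5, 6000))  # same clients split over four relays
```

With BatchCount 5 and BatchDelay 6000, one relay notifies about 50 clients per minute, while four relays running in parallel can notify about 200 per minute, which is why the client-to-relay mapping matters as much as the batch settings.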

You have a few different options you can investigate. Reference the settings list at https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Tivoli%20Endpoint%20Manager/page/Configuration%20Settings

Ensure that the clients contending for SAN resources are reporting to the same relay. You can use manual relay selection, or the client setting _BESClient_Register_Affiliation_SeekList and the Relay setting _BESRelay_Register_Affiliation_AdvertisementList, to “tune” the Auto-Selection process.
Increase the BatchDelay setting and reduce the BatchCount setting to further stagger the UDP notifications to clients.
Disable the client-side UDP notification entirely using _BESClient_Comm_ListenEnable, or disable the Relays’ sending of UDP notifications via Enterprise Server_ClientRegister_DisableChildUDPMessages. If you do this, you should also configure the clients to use Command Polling, using the _BESClient_Comm_CommandPollEnable and _BESClient_Comm_CommandPollIntervalSeconds client settings.

The settings that work for you will depend heavily on your SAN performance and how heavily you’ve allocated VMs to it… Unfortunately, I don’t think that any of the bandwidth throttling options apply to site gathers; I think they only apply to downloads triggered by actions (but I’d welcome IBM correcting me on this). It’s counter-intuitive, but you might force the Relay’s network configuration to run at a lower speed (say, forcing a 100 Mbps network card speed on a 1 Gbps Ethernet link) to make the network speed the bottleneck rather than the SAN.

I’ve not seen this issue (yet) with the Windows sites, but my Windows VMs aren’t as heavily provisioned as are our Linux VMs. But I do think part of the problem is specific to how IBM is publishing the Patches for RHEL sites - the site content is a small set of very large files, and each time one of the site files is updated the client has to download the entire file.

jgstew · December 29, 2016, 8:25pm

This could actually make the situation worse and harder to control if the clients all end up polling around the same time, particularly if the systems are rarely shut down and all have polling enabled at the same time. Command polling should end up being more randomly distributed over time in most cases, so it might be okay… even so, I don’t love this solution.

JasonWalker · December 29, 2016, 8:28pm

I don’t love this one either. If command polling has to be used, I’d put the interval more in the range of 8 - 12 hours and accept that clients will not respond quickly to new content or actions.

My first course would be to increase the batch options and check the client-to-relay mappings.

JaviCO · January 11, 2017, 3:17pm

Hello everybody.

Thanks for all the suggested solutions. In the end we took an alternate path: we created a custom site with the RHSA patches and disassociated the endpoints from the external site. That way we control when the content is updated.

Again, thanks to all.

fermt · January 11, 2017, 3:30pm

That adds more management for the BigFix Admin: if there are updates to the original Fixlets, the custom copies will need to be updated.

JasonWalker · January 13, 2017, 2:57am

On a conference call with L2 and L3 support, they warned me against unsubscribing clients from the RHEL sites. They indicated that some portion of the functionality requires the client to be subscribed to the Patches for Red Hat (Native Tools) site. Some of the fixlets seem to have dependencies on that site name.

I’m not sure what those dependencies would be, but take care, esp. with advanced features like multi-patch baselines.

WurliTzurn · January 23, 2023, 6:48pm

Any changes since 2017?
We would like to stagger the volume of external site content downloads (relay to endpoint) to spread out the network impact within our VM environment.
We see a very large spike in network load (for 1 hour) just after patch Tuesday content updates. If we could spread that load out over a few hours it would greatly improve our tenant performance.

JasonWalker · January 23, 2023, 10:42pm

The BatchCount and BatchDelay settings referenced earlier are still effective. They stagger client notifications of “all the things”, like new site versions or newly issued actions, and can prevent clients from all gathering at once.

I’d recommend staggering over minutes though, rather than hours.
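For picking values, the calculation can be run in reverse: choose a BatchCount and a target window, then derive the BatchDelay. This Python sketch is only a rough guide; the function name and the client count are assumptions, and it treats each batch as being followed by one full delay.

```python
import math

def batch_delay_for_window(clients, window_minutes, batch_count=5):
    """Suggest a _BESRelay_ClientRegister_BatchDelay value (milliseconds)
    that spreads notifications for `clients` over about `window_minutes`."""
    if clients <= 0 or batch_count <= 0:
        raise ValueError("clients and batch_count must be positive")
    # Number of batches the relay needs, counting a final partial batch.
    batches = math.ceil(clients / batch_count)
    # Divide the window evenly across the batches.
    return math.ceil(window_minutes * 60_000 / batches)

# Assumed 1000 clients on one relay, staggered over 15 minutes.
print(batch_delay_for_window(1000, 15))
```

For an assumed 1000 clients and BatchCount 5, a 15-minute window works out to a BatchDelay of 4500 ms, and a 20-minute window to 6000 ms.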
