Stagger External site distribution

Hello,

I’m looking for a way to stagger the distribution of ‘external sites’.
Does anybody know whether there is a server setting or a parameter in the ‘administration tool’ that does this?

Our VM department has grouped hosts into centers by OS, and now any update to the associated external sites causes performance issues (high disk I/O load).

Any suggestion is welcome!
Regards.

There are two custom settings that can be applied to Relays to delay their notifications to clients that new site content exists.
One setting controls how many clients are notified per batch, and the other controls a delay between each notification batch.
I encountered this same issue and was able to work around it by manipulating these values to stagger client notifications over a 15-minute period.
But I’m out of the office this week, can’t recall the setting names, and they’re not listed on the Client Configuration Settings wiki page.

Maybe this will jog someone else’s memory.

CC: @AlanM @BigFixNinja

_BESRelay_ClientRegister_BatchCount
This setting controls the number of UDP pings the IEM relay will send before delaying for a period of time. The length of the delay is controlled by _BESRelay_ClientRegister_BatchDelay. This setting could be used to limit the rate at which an IEM relay sends out UDP pings if this network traffic is harmful in some way.

  • Default Value: 100
  • Setting Type: Numeric (number of pings)
  • Value Range: 1-4,294,967,296
  • Task Available: No

_BESRelay_ClientRegister_BatchDelay
This setting controls how long the IEM relay will wait between sending out batches of UDP pings to IEM clients. This setting could be used to limit the rate at which an IEM relay sends out UDP pings if this network traffic is harmful in some way.

  • Default Value: 1000
  • Setting Type: Numeric (milliseconds)
  • Value Range: 1-4,294,967,296
  • Task Available: No

For Relays supporting large sets of Virtual Clients, the default values for these settings are modified:

  • _BESRelay_ClientRegister_BatchCount: set to 5
  • _BESRelay_ClientRegister_BatchDelay: set to 6000
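
To put those two sets of values side by side, here is a rough Python sketch of the notification rate each one implies for a single relay. It assumes the relay simply sends one batch, waits for the delay, and repeats, and it ignores the time spent sending the pings themselves. The function name is made up for illustration and is not part of the product.

```python
def notifications_per_minute(batch_count: int, batch_delay_ms: int) -> float:
    """Approximate steady-state UDP notifications per minute for one relay.

    Assumption: the relay sends batch_count pings, waits batch_delay_ms,
    and repeats. Time spent sending each batch is ignored.
    """
    if batch_count < 1 or batch_delay_ms < 1:
        raise ValueError("both settings must be at least 1")
    return batch_count * 60_000 / batch_delay_ms


default_rate = notifications_per_minute(100, 1000)  # documented defaults
vm_rate = notifications_per_minute(5, 6000)         # values quoted for virtual clients

print(f"default:  {default_rate:.0f} clients/minute")
print(f"VM-tuned: {vm_rate:.0f} clients/minute")
```

Under this simple model the defaults notify roughly 6000 clients per minute, while the virtual-client values notify roughly 50 per minute.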

Hello Jason,

I am out of office too! I will test your solution as soon as possible.

Many thanks!

Here is some relevance to help calculate the approximate number of batches and delay required for a relay to reach all of the registered endpoints with UDP notifications.

Ideally there would be an easier way to get the actual number of endpoints currently registered to a relay to make the numbers accurate for the particular relay. It would also be useful if this could be calculated with Session Relevance.

Number of batches:

(it / ( ((it as integer) of value of settings "_BESRelay_ClientRegister_BatchCount" of clients) | 100 ) ) of /* Assumed number of registered clients to relay -> */ 1000

Time to reach all endpoints:

(it * millisecond) of (it * ( ((it as integer) of value of settings "_BESRelay_ClientRegister_BatchDelay" of clients) | 1000 ) ) of (it / ( ((it as integer) of value of settings "_BESRelay_ClientRegister_BatchCount" of clients) | 100 ) ) of /* Assumed number of registered clients to relay -> */ 1000
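
For anyone who wants to try different numbers outside the console, the same arithmetic can be sketched in Python. This is only an approximation of what the relevance above computes: it rounds partial batches up, and like the relevance it multiplies the batch count by the delay, so it slightly overestimates (there is no wait after the last batch). The function names are illustrative, and the client count is an assumed input, as in the relevance.

```python
import math


def batches_needed(registered_clients: int, batch_count: int = 100) -> int:
    """Number of notification batches needed to cover every registered client."""
    return math.ceil(registered_clients / batch_count)


def seconds_to_reach_all(registered_clients: int,
                         batch_count: int = 100,
                         batch_delay_ms: int = 1000) -> float:
    """Approximate time to notify all clients: batches multiplied by the delay."""
    return batches_needed(registered_clients, batch_count) * batch_delay_ms / 1000


clients = 1000  # assumed number of clients registered to the relay

print(batches_needed(clients), seconds_to_reach_all(clients))
print(batches_needed(clients, 5), seconds_to_reach_all(clients, 5, 6000) / 60)
```

With 1000 assumed clients, the defaults give 10 batches over about 10 seconds, and the 5 / 6000 values give 200 batches over about 20 minutes.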

Hello again!

I configured the relays following @JasonWalker’s recommendation, and at first it seemed to work fine (no more performance issues were reported).

Until this morning. The synchronization of external patches for RHEL6 Native Tools (version 260) caused a serious performance problem, because the distribution wrote 250 MB on some endpoints and it seems that more than five endpoints were attempting to write at the same time.

I have set up an analysis with the useful properties suggested by @jgstew (thanks James!), but the configured values do not prove how the relays actually behave.

Does anyone know how I can check that the relays behave as configured? The only way I know is to pull the logs from individual endpoints and check them.

If all relays are configured with 5 for _BESRelay_ClientRegister_BatchCount, then each relay notifies 5 endpoints per batch, so the limit is 5 endpoints per relay rather than 5 overall.

Also, you may have to restart the relay service, and probably also the BES Client service, for the new settings to take effect.

I’m not certain, but you could at least see that the setting is in place in the console for the relays themselves.

With that configuration, the relay would notify 5 clients every 6 seconds about the site update. There are several reasons you could still have more than five clients writing simultaneously:

  • After being notified, the client could have its own delays due to other processing occurring at the same time. As the clients may not all delay the same amount of time, this could cause overlaps when the clients actually gather the site.
  • The clients may take longer than the batch delay to write the site contents to disk, so earlier clients are still writing when newly notified clients begin.
  • The clients may be reporting to different relays. Each relay handles its own set of client notifications, so if the clients are reporting to different relays their notification schedules can overlap.
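
The last two points can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a description of how the relay or client actually schedules work: it assumes every client starts writing the moment it is notified, every write takes the same time, and each relay runs its own independent batch schedule. The 30-second write time is an invented figure for illustration.

```python
import math
from typing import Optional


def peak_concurrent_writers(batch_count: int,
                            batch_delay_ms: int,
                            write_seconds: int,
                            relays: int = 1,
                            clients_per_relay: Optional[int] = None) -> int:
    """Estimate the worst-case number of clients writing at the same moment.

    Assumptions (hypothetical model): clients write immediately on
    notification, each write lasts write_seconds, relays do not coordinate.
    """
    # Number of batches whose writes are still in progress at the same time.
    overlapping = max(1, math.ceil(write_seconds * 1000 / batch_delay_ms))
    per_relay = batch_count * overlapping
    if clients_per_relay is not None:
        per_relay = min(per_relay, clients_per_relay)
    return per_relay * relays


# 5 clients every 6 seconds, writes assumed to take 30 seconds each.
print(peak_concurrent_writers(5, 6000, 30))            # one relay
print(peak_concurrent_writers(5, 6000, 30, relays=3))  # three relays on one SAN
```

Under these assumptions a single relay already has about 25 clients writing at once, and three relays sharing the same storage would have about 75, which is why a batch count of 5 does not by itself cap concurrent writers at 5.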

You have a few different options you can investigate. Reference the settings list at https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Tivoli%20Endpoint%20Manager/page/Configuration%20Settings

  1. Ensure that the clients contending for SAN resources are reporting to the same relay. You can use manual relay selection or the client setting _BESClient_Register_Affiliation_SeekList and Relay setting _BESRelay_Register_Affiliation_AdvertisementList to “tune” the Auto-Selection process.

  2. Increase the BatchDelay setting and reduce the BatchCount setting to further stagger the UDP notifications to clients.

  3. Disable the client-side UDP notification entirely using _BESClient_Comm_ListenEnable, or disable the Relays’ sending of UDP notifications via Enterprise Server_ClientRegister_DisableChildUDPMessages. If you do this, you should also configure the clients to use Command Polling, using the _BESClient_Comm_CommandPollEnable and _BESClient_Comm_CommandPollIntervalSeconds client settings.

The settings that work for you will depend heavily on your SAN performance and how heavily you’ve allocated VMs to it… Unfortunately I don’t think that any of the bandwidth throttling options apply to site gathers; I think they only apply to downloads triggered by actions (but I’d welcome IBM correcting me on this). It’s counter-intuitive, but you might force the Relay’s network configuration to run at a lower speed (say, forcing a 100 Mbps network card speed on a 1 Gbps Ethernet link) to make the network speed the bottleneck rather than the SAN.

I’ve not seen this issue (yet) with the Windows sites, but my Windows VMs aren’t as heavily provisioned as are our Linux VMs. But I do think part of the problem is specific to how IBM is publishing the Patches for RHEL sites - the site content is a small set of very large files, and each time one of the site files is updated the client has to download the entire file.

Relying on command polling could actually make the situation worse and harder to control if the clients all end up polling around the same time, particularly if the systems are rarely shut down and all had polling enabled at the same time. Command polling should end up being more randomly distributed over time in most cases, so it might be okay… even so, I don’t love this solution.

I don’t love this one either. If command polling has to be used, I’d put the interval more in the range of 8 - 12 hours and accept that clients will not respond quickly to new content or actions.

My first course would be to stagger further with the batch settings (higher BatchDelay, lower BatchCount) and check the client-to-relay mappings.

Hello everybody.

Thanks for all the suggested solutions. In the end we took an alternate path: a custom site with the RHSA patches, with the endpoints unsubscribed from the external site. That way we control when the content is updated.

Again, thanks to all.

That adds more management work for the BigFix admin: if the original Fixlets are updated, the custom copies will need to be updated too.

On a conference call with L2 and L3 support, they warned me against unsubscribing clients from the RHEL sites. They indicated that some portion of the functionality requires the client to be subscribed to the Patches for Red Hat (Native Tools) site. Some of the Fixlets seem to have dependencies on that site name.

I’m not sure what those dependencies would be, but take care, especially with advanced features like multi-patch baselines.

Any changes since 2017?
We would like to stagger the volume of external site content downloads (relay to endpoint) to spread out the network impact within our VM environment.
We see a very large spike in network load (for 1 hour) just after Patch Tuesday content updates. If we could spread that load out over a few hours it would greatly improve our tenant performance.

The BatchCount and BatchDelay settings referenced earlier are still effective. They stagger client notifications of “all the things”, like new site versions or newly issued actions, and can prevent clients from all gathering at once.

I’d recommend staggering over minutes though, rather than hours.
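
As a rough way to choose values for a window measured in minutes, the sketch below works backwards from a target window to a BatchDelay. It uses the same approximation as the relevance earlier in the thread (total time is about batches multiplied by delay), and the client count, batch size and function name are assumptions for illustration, not product behaviour.

```python
import math


def batch_delay_for_window(registered_clients: int,
                           batch_count: int,
                           window_minutes: int) -> int:
    """Suggest a BatchDelay (ms) so notifications span roughly window_minutes.

    Assumption: total time is about (number of batches) * (batch delay).
    """
    batches = math.ceil(registered_clients / batch_count)
    window_ms = window_minutes * 60_000
    return max(1, window_ms // batches)


# Assumed: 1000 clients on the relay, 5 clients per batch.
print(batch_delay_for_window(1000, 5, 15))  # delay in ms for a 15-minute window
print(batch_delay_for_window(1000, 5, 20))  # delay in ms for a 20-minute window
```

For an assumed 1000 clients and 5 per batch, a 15-minute window works out to a delay of about 4500 ms and a 20-minute window to about 6000 ms. The real number of clients registered to each relay would need to be substituted.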