Investigating high bandwidth usage

Hello guys

I have a complaint from a customer: "BigFix is using lots of bandwidth."
There was a recent event where 2 mid relays used 800-900 MB of bandwidth, saturating the network (that amount is very high for this client's network). The thing that has me scratching my head is that practically no actions were Open when the saturation event occurred, yet the network team has shared statistics showing that BigFix port 52311 did in fact cause the saturation.
I have reviewed the logs from the BigFix clients involved in the event (the clients receiving data) and from the mid relays, and I see no actions that could have caused this.
NOTE: Windows patching activities were executed regularly in the days and weeks before. In fact, we had another saturation event one week ago, where we found that a huge amount of patches had been cached down to the bottom relays; once we stopped the mid relay services, the high bandwidth usage stopped.
It would be a great help if you could point me in the right direction.

Environment is:
BigFix Root server
2 Top Relays
—WAN—
2 Mid Relays
1000 bottom relays

Do you have a network graphic showing high WAN utilization on port 52311 between the root BES server, top-level, mid-level and bottom-level relays?

Do you have any monitoring on your root BES server and top-level relays that shows/tracks network I/O that can corroborate or disprove the statements by your network team?

Depending on your enterprise requirements, you may wish to enable outbound and download bandwidth throttling on your mid-level and bottom-level relays respectively in order to prevent future finger pointing at BES. :wink:

If you go that route, make sure to work with your network team to define exactly how much bandwidth BigFix is allowed to consume at each WAN layer/level. There’s a fine line between protecting the WAN and choking BigFix.
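
If it helps, here's a rough sketch of what those throttles could look like as client settings in ActionScript. The setting names (`_BESRelay_HTTPServer_ThrottleKBPS` for relay outbound, `_BESClient_Download_LimitBytesPerSecond` for downloads) and the values are from memory, so verify them against the current BigFix documentation and size them for your WAN before deploying anything:

```
// Illustrative values only -- verify the setting names against the current
// BigFix client-settings documentation and size them for your WAN.

// On the mid-level relays: cap the relay's total outbound traffic
// (value is in KB per second).
setting "_BESRelay_HTTPServer_ThrottleKBPS"="512" on "{parameter "action issue date" of action}" for client

// On the bottom-level relays: cap how fast each relay downloads from its
// parent (value is in bytes per second; 131072 = 128 KB/s).
setting "_BESClient_Download_LimitBytesPerSecond"="131072" on "{parameter "action issue date" of action}" for client
```

Target the first setting only at the mid-level relay machines and the second only at the bottom-level relays, so you're throttling exactly the hops that cross the WAN.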

Best,
@cmcannady

I agree with @cmcannady; enabling bandwidth throttling will help prevent future events like these.
On the other hand, if there were no open actions during that period of time, it is odd that the network got overloaded by the BigFix application.

For how long did this spike last?
One possibility is a large number of clients selected for a “Send Refresh” operation. That would appear in the client logs, but not in action history at the server.

Another would be a site refresh for a large content site. For example, the "Patches for Red Hat" sites have a couple of large content files totaling about 200 MB. When those sites are updated we've experienced SAN hits on our virtual Linux systems and had to tune some of the UDP notifications so the gathers are staggered. When those sites refresh you might see high bandwidth use, but it should only last for a short period of time.

Thanks guys.

- I got no graphs from the network team, only a text report showing port 52311 usage across the network.
- I had already enabled bandwidth throttling, and that seems to have fixed the issue.
- The trigger for the very high bandwidth usage was the massive batch of Windows patches sent by operators.

The mid relay services stayed stopped for 5 days…
Now I've found that the large number of bottom relays with an outdated ActionSite version could have been the cause of the issue; I'm almost sure of it. I'm interested in your opinions: do you agree?

Thank you so much.

What is the minimum network bandwidth needed between the BigFix root server and the relays?

Our network team is saying that we have an 8 MB switch enabled. Does that mean only an 8 MB file can be downloaded from the BigFix root server to the relays? If so, what is the minimum bandwidth needed for an infrastructure of 5k machines?

That's going to depend very much on your network architecture. With bandwidth that small, you would want to enable large caches on your Relays and keep Relays close to your clients. You should also check into the settings that allow a Relay to do direct Internet patch downloads rather than downloading through the root server, if that's an option, but again that depends on the Internet bandwidth at your remote sites.
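
As a rough sketch (setting names from memory; please double-check them in the documentation for your BigFix version before using), the cache and direct-download pieces would be client settings along these lines:

```
// Sketch only -- confirm setting names/values in the documentation for
// your BigFix version before deploying.

// Grow the relay download cache so large patch payloads stay cached at the
// remote site instead of crossing the slow link again (value in MB).
setting "_BESGather_Download_CacheLimitMB"="20480" on "{parameter "action issue date" of action}" for client

// Let the relay fetch payloads directly from the Internet rather than
// pulling them through its parent / the root server.
setting "_BESGather_Download_CheckInternetFlag"="1" on "{parameter "action issue date" of action}" for client
setting "_BESGather_Download_CheckParentFlag"="0" on "{parameter "action issue date" of action}" for client
```

Whether the direct-download flags make sense depends entirely on the Internet breakout at those remote sites; if there isn't one, just grow the cache and let the payloads cross the slow link once.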

To clarify, do your network admins mean 8 MB (megaBYTES) or 8 Mb (megaBITS)? Network folks usually talk in terms of megabits.

If I'm calculating this right (not a guarantee!), 8 Mb is still pretty constrained for payloads this size. This month's patch rollup for Windows 10 build 1709 is 913,832,228 bytes, so it would take roughly 15 minutes to download at 8 Mbps, or about 2 minutes at 8 MBps, saturating the link for that whole time. (And that assumes it is the only patch downloading at a time, which is unlikely.)
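
For the record, here's the back-of-the-envelope arithmetic, taking 8 Mbps as 8 × 10⁶ bits/s and 8 MBps as 8 × 10⁶ bytes/s:

$$
t_{8\,\mathrm{Mbps}} = \frac{913{,}832{,}228 \times 8\ \mathrm{bits}}{8 \times 10^{6}\ \mathrm{bits/s}} \approx 914\ \mathrm{s} \approx 15\ \mathrm{minutes},
\qquad
t_{8\,\mathrm{MBps}} = \frac{913{,}832{,}228\ \mathrm{bytes}}{8 \times 10^{6}\ \mathrm{bytes/s}} \approx 114\ \mathrm{s} \approx 2\ \mathrm{minutes}
$$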

Once the patch is cached on your Relay, you would want to ensure a higher bandwidth between the Relay and the Clients to which it is distributing the file.

Do a forum search for the "BigFix Capacity and Planning Guide"; there is a lot of good guidance there on tuning performance.