Relay server: Linux Redhat or Windows Server 2012 R2

steini44 · January 12, 2016, 8:33am

Hi Guys

We are setting up our new infrastructure (because we want to go from 5000 endpoints to 20 000 endpoints), but we need to decide for our relay servers if we are going to use Linux or Windows.

Since BigFix listens on only one port, we need to know how many simultaneous TCP connections that port can handle. We heard from IBM Support US that Windows has a limit of 1024 connections and Linux has no limit.

Is this true? Is there still a limit on Windows Server 2012 R2? What has your experience been? I know there are a couple of guys here with more than 20 000 endpoints… I would like to hear your opinions on this one.

Thanks guys!

mtrain · January 12, 2016, 2:20pm

I’m not aware of a Windows 2012 Server limit on the number of simultaneous TCP connections. I thought I saw entries such as 65536 or even 16 million, but you’d probably never get that high in reality.

For the Relay server, I’d simply go with whatever you’re more comfortable supporting. A Relay server can handle more than 1,000 connections, but it handles them less efficiently beyond that point, so it is a best practice to have one Relay server for every 500 - 1000 connections you expect to have.
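As a rough illustration of the 500 - 1000 clients-per-relay guideline, the relay count for the 20 000 endpoints in the original question can be worked out with a ceiling division. This is only a sizing sketch: the function name `relays_needed` is made up for this example, and it assumes clients spread evenly across relays.

```python
def relays_needed(endpoints: int, clients_per_relay: int) -> int:
    """Smallest relay count such that no relay exceeds clients_per_relay,
    assuming clients are spread evenly (a simplification)."""
    if endpoints < 0 or clients_per_relay <= 0:
        raise ValueError("endpoints must be >= 0 and clients_per_relay > 0")
    # Ceiling division using integers only.
    return -(-endpoints // clients_per_relay)


# The two ends of the guideline, applied to 20 000 endpoints.
for target in (500, 1000):
    print(f"{target} clients per relay -> {relays_needed(20_000, target)} relays")
# 500 clients per relay -> 40 relays
# 1000 clients per relay -> 20 relays
```

So the guideline alone puts a 20 000-endpoint deployment somewhere between 20 and 40 relays, before network layout and workload are taken into account.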

See https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Tivoli+Endpoint+Manager/page/Relay+Health, the section labeled “There are fewer than 1000 BigFix Clients using any BigFix Relay”.

–Mark

jgstew · January 12, 2016, 9:41pm

There is a limit on the number of simultaneous TCP connections on Windows Server by default, but I believe you can increase that limit. I think the default limit is 2048.

If you want to handle the highest number of clients possible with the lowest number of relays, then Linux is your best bet.

That said, BigFix infrastructure used to be Windows Only, so most mature and large BigFix environments are using Windows Relays.

Even so, there isn’t a reason you can’t have a mix of both kinds of relays.

You can have many more than 1000 endpoints per relay, even on Windows relays, particularly if you have good network connections in and out of the relay. SSD storage may be a good idea, and redundancy is not required, except for the top level relay where it might be a good idea.

Relays are relatively disposable.

steini44 · January 13, 2016, 6:30am

Thanks @mtrain and @jgstew, we’ll definitely take this into consideration and I’ll let you guys know what we chose!

Wouter · January 13, 2016, 8:39am

Hi @jgstew,

Do you know a technet article where this limit is documented?
I’ve searched for this but I cannot find any official statement.

gearoid · January 13, 2016, 2:01pm

Relays are supported on CentOS - something to consider if OS license costs are important for you.

steini44 · January 13, 2016, 2:11pm

Thanks for your input, but the OS license costs are not important.

gearoid · January 13, 2016, 2:16pm

There are a number of successful, large deployments with Linux relays, and BigFix supported relays on Linux before the BigFix root server (platform) was supported on Linux.

BigFixNinja · January 14, 2016, 7:45am

I did an OpenMic on this. While you can tweak settings in Windows to accept more network connections, and Linux by default allows a greater number of concurrent networking sockets (both dependent on memory resources), IBM BigFix Support, as a best practice, still recommends keeping the number of clients assigned to any given relay to 1000 or fewer (and ideally 500 to 800 per relay) to allow for redundancy in failover events. It is also advised to keep the number of clients connecting directly to the root server as close to 0 as possible.

See slide # 14 from the OpenMic:
http://www-01.ibm.com/support/docview.wss?uid=swg27046968&aid=1

The OpenMic video is still being prepared for upload to YouTube.
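To make the failover argument concrete, here is a small sketch of what happens to the per-relay load when a relay drops out. It assumes, purely for illustration, that the clients of a failed relay redistribute evenly over the surviving relays; the function name `load_after_failures` is hypothetical.

```python
def load_after_failures(endpoints: int, relays: int, failed: int = 1) -> int:
    """Worst-case clients per surviving relay after `failed` relays go down,
    assuming an even redistribution (a simplification)."""
    surviving = relays - failed
    if surviving <= 0:
        raise ValueError("at least one relay must survive")
    # Ceiling division using integers only.
    return -(-endpoints // surviving)


# 20 000 endpoints sized at 800 per relay (25 relays): one failure stays under 1000.
print(load_after_failures(20_000, 25))  # 834
# Sized at exactly 1000 per relay (20 relays): one failure pushes relays past 1000.
print(load_after_failures(20_000, 20))  # 1053
```

Sizing at 500 to 800 clients per relay leaves room to absorb a failed relay without any survivor going over the 1000-client mark.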

steini44 · January 14, 2016, 8:07am

Hi @BigFixNinja

Thanks for your answer.

We’ve designed our Server infrastructure so that we have around 1000 connections to each relay server. We also don’t have any clients connecting to the root server. We have a root server, then a top relay server and then other relay servers (for which we still need to decide between Windows and Linux). They are all in the same data center, so that is no problem.

Do you have any idea which settings you can tweak in Windows? In case it’s necessary… Or do you have any documentation where it states that there is (no) limit on Windows? Our architect needs that if we want to choose Linux…

Thanks!

sbl · January 14, 2016, 9:52pm

Chiming in as a person who manages 65K+ endpoints, architected our BigFix installation, and has 10+ years of BigFix experience (no really, I do).

I would factor some other considerations into planning your relay infrastructure.

Will all 20,000 endpoints be online and connecting to BigFix 24 hours a day, 7 days a week?
*If you are managing 20,000 servers, then probably. If you are managing 20,000 desktops/laptops, then probably not.

*What are you planning to do with BigFix? Are you going to be pushing large updates to all 20,000 endpoints at one time, every day? Or are you going to be mostly monitoring and reviewing machine status and inventory, and pushing updates once a month? How fast a response do you expect from your clients? 500 clients per relay (500-1) might give you a <1 min response, but 1500-1 might give you a <5 min response. I would consider how responsive you need results in the console to be.

Let’s just say you plan to deploy BigFix for Power Management only and you will be monitoring and setting PM settings once in a while. I bet you can easily go with 4-5 relays for 20,000 clients, given the low demand.

If you are planning to push config updates, enforce updates, and push software packages constantly to new and existing machines, then sure, definitely beef up the relays and have more of them. Also make sure they have GOOD network connectivity and are not doing anything else but being a BigFix relay.

In my case, around 60-70% of my endpoints on average are talking to my BigFix deployment at any one time, with 48 dedicated Windows relays that all have fast network connections and SSDs. I am in one site with a dedicated logical VLAN spread across multiple locations, and all of my relays are within +/- 100 clients of each other. Relay planning needs to take the network into consideration: if you build 20 relays (with auto relay selection enabled) and 10 of the relays are 3 hops away from your 20,000 endpoints and the other 10 are 5 hops away, then the 10 with less hops will be utilized much more. I’ve had relays in a distributed network (that I didn’t control) suddenly get 2000+ clients connecting to them instead of to my centrally managed ones.
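The 3-hop versus 5-hop scenario can be sketched with a toy model. This is not BigFix's actual relay selection algorithm. It only assumes that every client sees the same hop counts and always picks among the relays with the fewest hops, spreading evenly between them. The relay names and the function `assign_by_fewest_hops` are invented for the example.

```python
def assign_by_fewest_hops(client_count: int, relay_hops: dict[str, int]) -> dict[str, int]:
    """Toy model: every client picks a relay with the minimum hop count,
    with ties shared round-robin. Returns clients per relay."""
    if not relay_hops:
        raise ValueError("at least one relay is required")
    fewest = min(relay_hops.values())
    nearest = sorted(name for name, hops in relay_hops.items() if hops == fewest)
    load = {name: 0 for name in relay_hops}
    for i in range(client_count):
        load[nearest[i % len(nearest)]] += 1
    return load


# 20 relays: 10 are 3 hops away, 10 are 5 hops away (hypothetical names).
relays = {f"near-{i:02d}": 3 for i in range(1, 11)}
relays.update({f"far-{i:02d}": 5 for i in range(1, 11)})

load = assign_by_fewest_hops(20_000, relays)
print(load["near-01"], load["far-01"])  # 2000 0
```

In this simplified model the ten nearer relays each end up with 2000 clients while the farther ten sit idle, which is the imbalance described above.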

You can go with the rule of thumb of 1000-1 if that makes your life easier, but you can scale up as your installation grows. The network design is also a key consideration in conjunction with your relay design.
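Using the figures from this post, a quick calculation shows how the share of endpoints that are online changes the effective load per relay. It takes "65K+" as exactly 65 000 for the sake of the arithmetic and assumes an even spread over the 48 relays; `clients_per_relay` is a hypothetical helper.

```python
def clients_per_relay(endpoints: int, relays: int, online_percent: int = 100) -> int:
    """Clients per relay (rounded up) when only online_percent of the
    endpoints are connected, assuming an even spread across relays."""
    if relays <= 0 or not 0 <= online_percent <= 100:
        raise ValueError("relays must be > 0 and online_percent within 0..100")
    online = endpoints * online_percent // 100
    # Ceiling division using integers only.
    return -(-online // relays)


ENDPOINTS = 65_000  # assumption: "65K+" taken as exactly 65 000
RELAYS = 48

print(clients_per_relay(ENDPOINTS, RELAYS))      # 1355 if everything were online
print(clients_per_relay(ENDPOINTS, RELAYS, 60))  # 813 at 60% online
print(clients_per_relay(ENDPOINTS, RELAYS, 70))  # 948 at 70% online
```

Counted by assigned endpoints, each relay is above the 1000-1 rule of thumb, but counted by endpoints that are online at once it stays below it.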

What design did you go with to not allow any clients to talk to your root server?

/Stacy

steini44 · January 15, 2016, 7:19am

Thanks @sbl for your answer. I will try to explain it as well as possible.
First of all, this is what we use from BigFix:

SUA
SCA
Endpoint Protection with SPS
System Lifecycle
Patch Management
Remote Control
WebReports

None of them are Relay servers. We have a different server for SUA, SCA and SPS. SUA, SCA, Remote Control and WebReports connect to the database. SPS connects to the root server. Endpoint, Patch and System Lifecycle are modules on the root server.

Secondly, you need to know we have a dynamic environment and that some devices are connecting over 3G/4G (that’s one group, 2500 online at the same time out of a group of 5000). Then we have the normal desktops/laptops (that’s another group, 9000 devices) and we also have a special group (2000). Those are computers that we’ll only do patching and monitoring on.

We have set different settings for each group (depending on its needs), and some have manual relay selection. The 3G/4G group is one of these: those devices can be anywhere in the country and travel a lot, so the nearest relay could change every 15-30 min and repeatedly searching for it would cause too much load. They are also online only for a limited time, so they connect straight to the relays in the data center. The special group also connects to the data center, but less frequently than the 3G/4G group.

We have around 250 computers in the field that we use as “relay” servers too. Those are computers that are less used (like in rooms to chill, etc.) but still have good hardware specs. The normal computers will connect to those for reporting, patches and small software distribution. Those have automatic relay selection, depending on their location/hops. When we distribute large software packages, the first line in our action will change the relay server so that the client connects to a relay server in the data center. We can’t distribute large software packages through those computers: imagine having 200 computers in need of a piece of software, all connected to a different “chill” computer as relay; then you need to cache that software 200 times. That will have a big impact on the network. So we’ll have them connect to 2-3 relay servers in the data center for that software package, so the software only needs to be cached 2 or 3 times. We have also enlarged the cache on the relay servers (I think the standard is 10 GB?) to 60 GB, so they don’t need to pull it in from the server every time.
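The caching argument comes down to simple multiplication, sketched below. The 2 GB package size is a made-up figure for illustration, and the model counts only the traffic needed to fill relay caches from upstream, not the downloads from relay to client.

```python
def cache_fill_traffic_gb(package_gb: float, caching_relays: int) -> float:
    """Data pulled from upstream to fill relay caches: each relay that
    serves the package has to fetch and store one copy of it."""
    if package_gb < 0 or caching_relays < 0:
        raise ValueError("package size and relay count must be non-negative")
    return package_gb * caching_relays


PACKAGE_GB = 2  # hypothetical package size

# 200 clients, each behind a different field relay: 200 cached copies.
print(cache_fill_traffic_gb(PACKAGE_GB, 200))  # 400
# The same clients pointed at 3 data center relays: 3 cached copies.
print(cache_fill_traffic_gb(PACKAGE_GB, 3))    # 6

# Packages of this size that fit in the enlarged 60 GB relay cache.
print(60 // PACKAGE_GB)                        # 30
```

Under these assumptions, routing the large package through a few data center relays cuts the cache-fill traffic from 400 GB to 6 GB.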

This design is validated by IBM Support too and in our eyes it also looks good.
What do you think about it? If you need further information, feel free to ask!

jgstew · January 15, 2016, 9:45pm

If you have a bunch of computers in the same location, then they should all connect to a local relay or 2, and that relay connects to the data center. This will save a lot of traffic going over the WAN to the data center. It would be good for the relays behind each WAN link that have many devices connecting to them to have a decent-sized cache.

I’d recommend that either the root or the top level relays have a larger than normal cache. If you push a lot of software and patches, then 100GB or more is a good idea.

mtrain · January 18, 2016, 5:38pm

Hi @sbl; I’m not following this. Wouldn’t the 10 relays 3 hops away be utilized more because they’re closer as opposed to the other 10 relays further (5 hops) away?

–mark

jgstew · January 18, 2016, 10:25pm

That is what that statement says.

mtrain · January 18, 2016, 10:29pm

Ahhhh, I see it now - there’s a misspelling in the text, at least when I read it. It says “then the 10 with less hope” but he meant “then the 10 with less hops” … I thought that the 10 with less hope meant the ones that were further away!!! As Bugs Bunny used to say, “don’t be so dang literal.”

–Mark

jgstew · January 18, 2016, 10:59pm

I fixed @sbl 's original reply, though I hate to take away hope, and I don’t generally like the bitterness of adding hops.

sbl · January 18, 2016, 11:59pm

Reading the BigFix forum on my day off work “MLK day”.
@jgstew: thanks for responding on my behalf.

@mtrain: that is what I meant to say: clients will try to connect to relays with fewer hops. It sounds as if you have a highly distributed network of endpoints with a lot of different purposes. The default cache size is 1GB so definitely increasing the size is a good idea.
There are trade-offs in going with manual relay selection vs. auto. Relay affiliation is something to consider if you haven’t looked into that already.

A well run infrastructure will make all the difference to how BigFix performs.

steini44 · January 19, 2016, 6:39am

hi @sbl

We have a mix of manual and auto relay selection; we’ve looked into that.
We have right now about 11 000 devices in our Console (and 25 operators, though only 4 or 5 have full rights) and we don’t have any performance issues right now.

We also ordered the new servers (we opted for Windows), so normally it won’t be a problem.

Thank you all for your help/input!

jgstew · January 19, 2016, 8:06am

When it comes to buying hardware for BigFix, I’d say single-threaded performance typically matters much more than multi-threaded performance. Also, storage tends to be the main bottleneck, and network can be a bottleneck as well.