Migrating bfxmaster... Advice from the Pros?

I’ve been referencing this online guide for migration: How to Migrate the IBM Endpoint Manager Server

Current setup:

  • Current bfxmaster server BFX01* replicates to DSA servers BFX02 and BFX03
  • We do not currently have a top-level relay and hence all of our relays - and a couple of thousand clients with restricted access and no local relay - report directly to BFX01 (which is not ideal)

I’m proposing a change that will hopefully kill two birds with one stone: move the master server to a better server, and re-purpose the former master server as a top-level relay (as all the global firewall access, etc. is already in place to its IP), taking pressure off the new master. Plan:

  • Set up new master server BFX04 (more powerful hardware in the same VLAN)
  • Remove replication from BFX01 to BFX03 (as no longer required)
  • At time of switch-over, stop BES root services on BFX01 and install BES Relay component (to change this server from a master server to a relay)
  • Modify the hosts file on BFX01 to point the bfxmaster DNS alias to the IP of BFX04 (new master server). My goal here is to make the change transparent to the clients and only have BFX01 know that the master server has moved from it to BFX04
  • Manually change Primary Server for all relays to BFX01 (new top-level relay) and keep bfxmaster as the Secondary Relay
  • Check on BFX04 that machines are reporting in correctly and then change replication to now go from BFX04 (new master) to BFX02

Few things I’d appreciate some feedback on:

(i) Is it possible to stop - but not uninstall - the root service/component on BFX01 and install the relay component, and just have this server running as a relay? I’d like to wait before uninstalling the root server component from BFX01, until I’m sure that I don’t need to roll-back for any reason

(ii) Regarding replication… How easy is it to add/remove servers and/or change the direction of replication? It was set up in my environment before I joined, so I’ve never had to modify it. Browsing online makes it seem pretty involved and there doesn’t seem to be a GUI tool. I’m obviously not recovering from a server failure, so I’m hoping that planning changes - rather than reacting to a failure - means that it is easier to modify. Thoughts?

Articles I’ve seen:

DSA Disaster Recovery: Version 19

How to remove a secondary DSA server from TEM Administration Tool

(iii) Regarding modifying the hosts file on BFX01 to point the bfxmaster DNS alias to the IP of BFX04 (new master server)… presumably this will work rather than specifically rerouting the DNS alias to the new server? It suits our purposes, as I am happy for all other servers to think BFX01 is still the master server, as long as the hosts file hack is enough for BFX01 to know that it’s not and that BFX04 actually is the new master. How will that affect things like the assignment of relays and what the console thinks of as the “Main IBM BigFix Server”, and are there any other pitfalls?

Whilst I’m looking forward to the positive effects of these changes, it’s also not a question of choice… Due to factors out of my control, it’s a migration that MUST be completed ASAP. To that end, any advice regarding simplification - or possibly my over-simplification! - would be much appreciated. :slight_smile:

Thanks.

I do not believe that this is possible – the root server installation includes relay components, so I would not recommend installing the relay over it.

I would recommend the following Infrastructure changes:

  1. Create a DNS entry called BFX-TL-Relay.domain.tld. Point this DNS entry at BFX01
  2. Point all the clients to that new DNS entry
  3. Setup a new server to be the BigFix Top-Level Relay

To migrate to the new Top-Level Relay just:

  1. Shorten the TTL on BFX-TL-Relay to 5 minutes
  2. Change the IP on BFX01 to be something else
  3. Set the IP on your new bigfix relay to be the old BFX01 IP
  4. Make sure BFX-TL-Relay is pointing at the top level relay

This maintains all of your bigfix root servers, has everything go through a relay, doesn’t require a super fast removal of root server components and install of relay, and doesn’t modify any hosts files.
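
To confirm step 4 after the swap, a quick check along these lines might help. It is only a sketch: the function name and the expected IP are placeholders I made up, and the resolver is passed in so it can be tried without touching real DNS.

```python
import socket


def alias_points_at(alias, expected_ip, resolve=socket.gethostbyname):
    """Return True if `alias` currently resolves to `expected_ip`."""
    try:
        return resolve(alias) == expected_ip
    except OSError:
        # Resolution failed (unknown name, no DNS server reachable, ...).
        return False


# Hypothetical usage; 192.0.2.10 stands in for the old BFX01 IP.
# alias_points_at("BFX-TL-Relay.domain.tld", "192.0.2.10")
```

Keep in mind that a cached answer can linger until the old TTL expires, which is why the TTL is shortened first.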

Then I would decommission bfx01 (if you still wanted to).

Thanks @strawgate! I should have prefaced this by saying I’ve been having a really hard time getting a new server set up for anything, and that goes for a new top-level relay too. Hence the desire to re-purpose what we have. Getting anything done with DNS is a pain too, as our org has recently been segmented and everything has to go through ticketing; you can’t just ask someone to shorten the TTL, etc.

Regarding this:

Point all the clients to that new DNS entry

What exactly do you mean here? I’m a little confused by the target (“all clients”, as in all 50k+) as well as the method; how would you do it?

If I absolutely had to re-use BFX01 as the top-level relay, I’m presuming I would need to uninstall the root server component and then install the relay component. While not ideal, it’s doable… the database is the important thing, and in case of a major issue I could always just reinstall the root server component.

Thoughts?

Sorry – I misunderstood the scope slightly.

When I say point the clients to the DNS, I mean set their primary relay to the root server. However, now that I re-read your question, I think I’d recommend:

  1. Create a DNS entry called “BFX-TL-Relay.domain.tld”

  2. Configure a global failover relay. This is the last resort relay before the client tries talking to the root server. Take this Fixlet and modify it to point to the new DNS

  3. Apply this Fixlet globally to make sure any clients talk to the DNS entry instead of your root server directly

  4. Create a fixlet to change primary relay and secondary relay. Take these fixlets and modify them to set your primary relay and secondary relay

  5. Apply these fixlets to all your relays that talk to the root server so that they will point to the DNS instead of directly to the root

  6. Make BFX02 the Master server, remove the server components from BFX01 and install the relay components.
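
For steps 2 to 5, here is a rough Python sketch of what the modified Fixlets would end up setting. The setting names (`__RelayServer1`, `__RelayServer2`, `_BESClient_RelaySelect_FailoverRelay`), the relay URL form and the ActionScript line are written from memory of the BigFix documentation, so treat them as assumptions and check them against the Fixlets you start from. The helper names are made up for illustration.

```python
RELAY_PORT = 52311  # default BigFix port; adjust if your deployment differs


def relay_url(host, port=RELAY_PORT):
    """Build the relay URL form used in client relay settings (assumed format)."""
    return f"http://{host}:{port}/bfmirror/downloads/"


def setting_line(name, value):
    """Render one ActionScript line that sets a client setting (assumed syntax)."""
    return (f'setting "{name}"="{value}" '
            'on "{parameter "action issue date" of action}" for client')


def relay_settings(primary, secondary, failover):
    """Map the (assumed) setting names to relay URLs for the given hosts."""
    return {
        "__RelayServer1": relay_url(primary),
        "__RelayServer2": relay_url(secondary),
        "_BESClient_RelaySelect_FailoverRelay": relay_url(failover),
    }


# Illustrative only: point relays at the DNS entry, keep bfxmaster as secondary.
for name, value in relay_settings("BFX-TL-Relay.domain.tld", "bfxmaster",
                                  "BFX-TL-Relay.domain.tld").items():
    print(setting_line(name, value))
```

The point is simply that every setting carries the DNS name rather than a host name or IP, so the machine behind it can change later without touching the clients again.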

All that being said, I’d strongly recommend provisioning a new relay to be the top level relay and another one to be the “naughty relay” (for clients that do not have a relay they can talk to). Trying to shuffle infrastructure like this is always kinda scary.

I agree with this approach as far as how to accomplish a new top level relay.

I would not do both quite like this. Get the new root server in place and make that transition happen first before doing anything else. There is enough that could go wrong here already.


I would create a new CNAME DNS record for something like BFX-Temp-Relay.domain.tld and point it at the new BFX04 and set that as the failover relay for everything, in particular the other relays.

Migrate everything to BFX04 and test it to make sure it is working with just a few clients.

Once you are ready to make the switch, you can bring down the root on BFX01 and everything should failover to BFX04 through the CNAME of BFX-Temp-Relay.domain.tld.

Once the migration has completed and everything is working, you can turn the root server DNS name in your masthead, which is probably BFX01, into a CNAME record pointed at BFX04, just so that newly installed clients without relay configuration will be able to get to the root server. This doesn’t have to be done right away, and eventually it could be switched to a “fake root top level relay” instead of the actual root server.

Only once the migration to BFX04 is completely successful and you have made an archival backup of BFX01 should you reimage that system and repurpose it as a new top-level relay. You can use its existing IP, but can give it a new DNS name.

For some further info. When you install a new client with only the masthead and the client installer, the only “relay” it knows about is the “root” server FQDN entry in the masthead file. There are ways around this, but generally that is the case.

The new client doesn’t actually ever need to talk to the real root server, only a relay, which could be on the root server, or it could be any other relay.

The concept of “fake root” is to have the DNS in the masthead file resolve to a relay and NOT the actual root server. This allows new clients to find a relay when they are not able to talk to the root server directly, and without needing to make your root server accessible to them.


An added oddity is that you can have a single relay act as if it were multiple relays. It can have a regular DNS A record, as well as an unlimited number of CNAME records and all resolve to the same relay, plus the relay’s IP addresses work as well.

It doesn’t look great in the reporting of how many clients are using what relays when the same relay is actually being used but being reported differently.

In my bigfix dev environment, I only have 1 relay and I actually have the primary relay of all clients set to the local private IP of the top level relay and the secondary relay set to the FQDN of that same relay, which resolves to its public IP, not its private IP. This means that all clients try to communicate over the local LAN first, but if that fails, they try the public IP, which works across other networks. I could actually use 2 different DNS records to do the same instead of the private IP, but there is some advantage to having a relay with a public IP configured as a failover option, because if for some reason DNS doesn’t work, then clients will still be able to communicate.
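
The ordering described above boils down to "try each address in turn and use the first one that answers". A toy Python sketch of that idea follows; the function, the addresses and the `can_connect` callback are all hypothetical, and this is not how the client is actually implemented.

```python
def pick_relay(candidates, can_connect):
    """Return the first candidate the client can reach, or None if none answer."""
    for address in candidates:
        if can_connect(address):
            return address
    return None


# Primary = private IP of the relay, secondary = its FQDN (public IP).
candidates = ["10.0.0.5", "relay.example.com"]

on_lan = {"10.0.0.5", "relay.example.com"}   # both reachable from the LAN
off_lan = {"relay.example.com"}              # only the public name is reachable

print(pick_relay(candidates, on_lan.__contains__))
print(pick_relay(candidates, off_lan.__contains__))
```

On the LAN the private IP wins; elsewhere the client falls through to the FQDN.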

This is all really helpful, thanks.

Couple of questions…

@strawgate, by this:

Make BFX02 the Master server, remove the server components from BFX01 and install the relay components.

Did you mean "Make BFX04 " the Master server? (I.e. the planned new master server, rather than the DSA server, right?)

Trying to shuffle infrastructure like this is always kinda scary.

Truth…especially when there is little room to move!

@jgstew

Once you are ready to make the switch, you can bring down the root on BFX01 and everything should failover to BFX04 through the CNAME of BFX-Temp-Relay.domain.tld.

So, say our IP setup is like so:

BFX01 = 10.10.10.1
BFX04 = 10.10.10.4

There are global firewall rules in place, so most businesses have a route to 10.10.10.1 (but not all; some just use a fake root to point bfxmaster to their local relay and only the relay can access bfxmaster). There are no firewall rules currently in place for clients to route to new master server 10.10.10.4. (We can request them and they will be added eventually, but not by the time this migration takes place.) So my concern would be in setting .4 as the failover relay, we would lose connectivity. Hence the idea to leave the bfxmaster DNS alias and IP - both of which are reachable - on BFX01 and have it then act as a top-level relay whose hosts file points bfxmaster to .4.
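
To make the hosts file idea concrete, here is a small Python model of it. It assumes the usual resolver behaviour where a local hosts entry takes precedence over DNS; the function names are made up and the hosts line is just what I would expect to add on BFX01.

```python
def parse_hosts(text):
    """Parse hosts-file text into a {name: ip} mapping, ignoring comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name.lower(), ip)  # first entry for a name wins
    return mapping


def resolve(name, hosts, dns):
    """Resolve `name`, letting local hosts entries take precedence over DNS."""
    key = name.lower()
    return hosts.get(key) or dns.get(key)


DNS = {"bfxmaster": "10.10.10.1"}  # what every other machine sees
BFX01_HOSTS = parse_hosts("10.10.10.4  bfxmaster  # new master (BFX04)")

client_view = resolve("bfxmaster", {}, DNS)          # clients still reach BFX01
bfx01_view = resolve("bfxmaster", BFX01_HOSTS, DNS)  # BFX01 goes on to BFX04
```

So everything else keeps resolving bfxmaster to 10.10.10.1, and only BFX01 sees 10.10.10.4.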

So I guess in a nutshell, rather than looking for alternative ways to do this - which would be better! - I need to know of any possible pitfalls with my current plan. Not ideal, but like I said - this is a migration under pressure :slight_smile:

Couldn’t you just re-ip BFX04 as 10.10.10.1 after the migration?

No matter what, I would leave the computer currently called BFX01 untouched as far as the software / hardware is concerned until the migration has taken place completely for a while.

Make sure you have backups of everything. I also don’t know what the added complexities of DSA are. I have never used DSA.

Why is this under pressure?

Under pressure for political reasons - nothing I particularly care about!

OK, so I held off on implementing the top-level relay and just switched the bfxmaster IP to the new master server (and switched the bfxmaster DNS alias to the new hostname). All is going well per se, but there’s still a lot of lag, which I guess is to be expected when the platform is trying to catch up with a full day’s worth of reports. (We’re also adding a lot of clients at the moment.)

There was initially an issue with the actionsite… the version number was less than it was before the migration and I couldn’t send out any actions initially, but IBM were able to show me how to reset that.*
FillDB was also overloaded to begin with, but I believe that’s due to the number of reports it was trying to process… IBM recommended some REG keys to aid with this* and FillDB is still clearing out the temp files, so I think we’re good, but time will tell.

* If anyone is interested in further details, let me know and I will post.

I still want to implement a top-level relay and would love ideally to use the fake root option for all clients to think it is bfxmaster. So once things settle down and I’m sure I don’t need to roll-back, I may look to switch the bfxmaster DNS alias and the 10.10.10.1 IP back to BFX01.

To confirm:

  1. Is the real master server ok with the top-level relay advertising itself as bfxmaster? Doesn’t cause any confusion?
  2. Will there be a lag while the machines are all shifting over from the master to the top-level relay? Would it be worth reducing the heartbeat frequency to make sure the top-level relay doesn’t become overloaded with all the new connections at once?
  3. All I need to do for the information to get from the fake root is to add the hosts file entry on it pointing bfxmaster to the real master server’s IP, correct?

This should only happen if you took action on the old root server setup after taking the DB snapshot and restoring it to the new server.

Ideally you would turn off all of the bigfix processes on the old root server, then backup and copy everything over to the new one, start it up, and make the switch.

It does make sense to make the heartbeat, minimum report interval and minimum analysis time less aggressive before doing the switch-over, and then ramp them back up once everything seems to be working well on the other side. Too late now, though.
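
As a back-of-the-envelope illustration of why the interval matters, this sketch works out the average heartbeat rate for a given client count. The 50,000 figure is taken from the "50k+" mentioned earlier; the 15 and 60 minute intervals are just example values, and real load is burstier and includes more than heartbeats.

```python
def heartbeats_per_second(clients, interval_seconds):
    """Average heartbeats per second if every client reports once per interval."""
    return clients / interval_seconds


every_15_min = heartbeats_per_second(50_000, 15 * 60)
every_60_min = heartbeats_per_second(50_000, 60 * 60)

print(f"15 min heartbeat: {every_15_min:.2f}/s")  # 55.56/s
print(f"60 min heartbeat: {every_60_min:.2f}/s")  # 13.89/s
```

Quadrupling the interval cuts the average rate to a quarter, which gives FillDB more room to work through a backlog.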

What is the storage speed on the new server where FillDB is, as well as the MSSQL/DB2 instance? It isn’t a bad idea to put that on SSD storage because it is sensitive to IOPS more so than the rest of the root server stuff. If it doesn’t catch up then you’ll probably need to consider faster storage.

How many endpoints? How many operators?


Yep, that is fine. The only thing that could get confused is the bigfix client on the root server itself, but it should connect to localhost and not to an actual relay FQDN so that should not be a problem, and even if it was, it should still work correctly, just would be in an odd case of reporting to the fake root first, then things getting passed along to the real root.

They may have to run relay autoselection again or something. I’ve never actually done this before, but the clients will think they are still talking to the same relay, when in fact they are not. I think the relay should make them reregister, or perhaps just accept their connections silently as if nothing changed.

You could set up a test client and relays to play around with this scenario. Set the client for manual relay selection to talk to a specific relay, then swap out which host that FQDN resolves to a different relay and see what happens.

Probably not a terrible idea, but probably not required either. It is usually the root that is more likely to be affected by things like this, not usually the relays, but maybe. How many clients are talking to the root?

You could do that, but if the real master server has an FQDN that doesn’t match what is in the masthead, you can just tell the fake root that its parent relay is the real master server’s FQDN (BFX04)… the fake root doesn’t need to know that it is talking directly to the real master/root server. It only cares that it tries to talk to a relay and succeeds. (The root is also a relay.)


There is another thing that I would also recommend trying, which you should probably do first. If you are using Relay autoselection, then you should set the following on the real root server: _BESRelay_Selection_AutoSelectableRelay to 0

Read more here: Legacy Communities - IBM TechXchange Community

The real root server and the fake root should both not be set to be autoselectable to encourage clients to talk to other relays instead. Clients will always failover to the root server (FQDN in the masthead) if they fail to connect any other way, but you only want them to try this if they have no better option.
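
A simplified Python model of the effect is below. Real autoselection ranks relays by network distance, which this sketch ignores; it only shows that a relay flagged as not autoselectable is skipped and that the masthead FQDN remains the last resort. The class, the function and the `site-relay.example` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Relay:
    name: str
    auto_selectable: bool  # mirrors _BESRelay_Selection_AutoSelectableRelay (1/0)
    reachable: bool


def choose_upstream(relays, masthead_fqdn):
    """Pick the first reachable auto-selectable relay, else the masthead FQDN."""
    for relay in relays:
        if relay.auto_selectable and relay.reachable:
            return relay.name
    return masthead_fqdn


relays = [
    Relay("BFX04", auto_selectable=False, reachable=True),  # real root
    Relay("BFX01", auto_selectable=False, reachable=True),  # fake root
    Relay("site-relay.example", auto_selectable=True, reachable=True),
]
print(choose_upstream(relays, "bfxmaster"))  # site-relay.example
```

If the site relay becomes unreachable, the same call returns `bfxmaster`, i.e. the client ends up at whatever the masthead name resolves to.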