Database replication worked fine; however, we encountered some problems during the failover test.
Scenario:
The primary server was shut down to simulate an outage and to allow the BES relays and BES clients to switch automatically to the secondary server. After more than 2 hours, only about 1.5% of machines had successfully switched over to the secondary server.
From my knowledge, it takes 6 hours by default for the relays and clients to fully switch over.
2nd issue:
When we brought the primary server back up and shut down the secondary server to allow clients and relays to switch back to the primary server, some of the clients did not go back to the primary server and are still reporting to the secondary server.
Could anyone advise whether there is anything inappropriate in the approach or the configuration? Thanks.
The time it takes to fail over to the replica server depends heavily on how your clients and relays are configured. We recommend that you not try to have your clients individually fail over to the replica server, as that process could take several hours with default settings. Your clients should just be set to report to their relays, as normal. Then you set up your relays to fail over to the replica, which would happen much more quickly.
If you have top-level relays, then only those need to be configured to fail over (set the primary relay to the main server, and the secondary or failover relay to the replica server). When an outage occurs, the relays should fail over in about 10 minutes. Since all other relays and clients are reporting up through those relays, your entire deployment has failed over at that point.
If you don’t have top-level relays (i.e. all relays report to the main server), then just set them all to use the replica server as their secondary or failover relay. Again, they should fail over much more quickly than individual clients because they communicate with their parent much more frequently than an individual client does.
Please note that relays must be set to use manual relay selection in order for the failover process to work correctly.
Thanks for your response; I understand the idea you were putting across.
So am I right to say that, for a more effective failover, the relay selection should be set to manual instead of automatic, as this will only affect the top-level relays and not the clients below the relays? This will facilitate a smoother failover, correct?
However, due to the complexity of the customer environment, the clients are set to automatic relay selection. If they cannot reach a relay during the failover, they would naturally try to connect to the server directly, and I foresee this might be an issue. Let me see if we can work this out. I appreciate your assistance on this.
I would still recommend setting the clients to use Automatic relay selection, but all relays should be set to use Manual, which will allow them to fail over in the manner I described. As long as the relays fail over to the replica server quickly, most clients will never even notice that the server went down.
For clients that fail to find their local relay due to network complexity or a network outage, you should set a failover relay for them that points to a relay, not one of the servers. So an example setup would be:
The clients in this setup are using Automatic relay selection with the FailoverRelay setting set to FORelay. The low-level relays (LLRelays) are all manually pointing to the top-level relays (TLRelay1 & TLRelay2). The top-level relays and the failover relay are all manually pointing to Server1 as primary and Server2 as secondary.
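The reporting chain in this example setup can be sketched as a small model. This is only an illustration, not BigFix code: the names LLRelay1 and LLRelay2, the preference order between TLRelay1 and TLRelay2, and the helper functions are all assumptions made for the sketch.

```python
# Illustrative model of the example topology; not BigFix code.
# Each node lists its parents in order of preference (primary first, then failover).
PARENTS = {
    "TLRelay1": ["Server1", "Server2"],
    "TLRelay2": ["Server1", "Server2"],
    "FORelay": ["Server1", "Server2"],
    "LLRelay1": ["TLRelay1", "TLRelay2"],  # assumed name and order for a low-level relay
    "LLRelay2": ["TLRelay1", "TLRelay2"],  # assumed name and order for a low-level relay
}


def resolve_root(node, down=frozenset()):
    """Follow the first reachable parent upward; return the root server or None."""
    if node in down:
        return None
    parents = PARENTS.get(node)
    if parents is None:  # a node without parents is a root server
        return node
    for parent in parents:
        root = resolve_root(parent, down)
        if root is not None:
            return root
    return None


def client_root(local_relays, down=frozenset()):
    """A client tries the relays it found automatically, then FORelay."""
    for relay in list(local_relays) + ["FORelay"]:
        root = resolve_root(relay, down)
        if root is not None:
            return root
    return None


print(resolve_root("LLRelay1"))                    # Server1
print(resolve_root("LLRelay1", down={"Server1"}))  # Server2
print(client_root([], down={"Server1"}))           # Server2 (via FORelay)
```

The model only captures who reports to whom. It says nothing about timing, which depends on the relay selection settings discussed in this thread.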
Note that in the event of a failure that causes the top-level relays to fail over to Server2, they will not instantly fall back to Server1 as soon as it is back online. That will occur once the relays do their relay selection process again (every 6 hours, by default), or you can force it using the task ‘Force BES Clients to Run Manual Relay Selection’.
I would like to check with you: if the _BESClient_RelaySelect_IntervalSeconds setting is set to 600 seconds to reduce the time it takes for the relays to go back to the primary server once it is up again after the failover test, would you recommend that value? Would it cause any major impact?
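For a rough sense of the trade-off, the arithmetic can be sketched as follows. It assumes relay selection runs once per interval and that fall-back happens at the next run; the function names are hypothetical, and the sketch does not measure the actual load on relays or servers.

```python
# Back-of-the-envelope comparison of relay selection intervals.
# Assumption: one selection run per interval, fall-back at the next run.

def worst_case_wait_minutes(interval_seconds):
    """Longest wait until the next relay selection run, in minutes."""
    return interval_seconds / 60


def selection_runs_per_day(interval_seconds):
    """How many selection runs a single relay performs in 24 hours."""
    return 86400 / interval_seconds


DEFAULT_INTERVAL = 6 * 60 * 60  # the 6-hour default, i.e. 21600 seconds
PROPOSED_INTERVAL = 600         # the value asked about above

for label, interval in (("default", DEFAULT_INTERVAL), ("proposed", PROPOSED_INTERVAL)):
    print(label, worst_case_wait_minutes(interval), selection_runs_per_day(interval))
# default 360.0 4.0
# proposed 10.0 144.0
```

So the shorter interval cuts the worst-case wait from 6 hours to 10 minutes, at the cost of 36 times as many selection runs per relay.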
Clients reporting to the main BES Server when using auto relay selection means they were not able to successfully reach a relay via ICMP or HTTP. You should specify a failover relay using the setting _BESClient_RelaySelect_FailoverRelay (http://support.bigfix.com/bes/misc/besconfigsettings.html) to give them another option before trying the main server. You can investigate the responses clients get from the relays by enabling debug logging and using the Run BES Client Diagnostics task with the relay selector option (action 2).
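As a quick first check before turning on debug logging, a plain TCP connection test from an affected client machine can show whether the relay's port is reachable at all. This is a minimal sketch, not part of BigFix: the function name is hypothetical, port 52311 is assumed to be the port your deployment uses, and a successful connection does not prove that relay selection itself will succeed (ICMP is not tested here).

```python
import socket


def relay_tcp_reachable(host, port=52311, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


# Example use, with a hypothetical relay host name:
#   relay_tcp_reachable("llrelay1.example.com")
```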
If you need additional assistance with this, I would suggest opening a support case.
There are quite a number of clients that are reporting to the main server, so troubleshooting the clients one by one would be very tedious. However, we have confirmed that the relays can be reached by the clients.
For now, we manually assigned relays to these clients and then changed them back to auto selection, and it seems to be working.
Likely you would discover a few general issues (e.g. ICMP is blocked on some subnets) after investigating a few different clients, but you’re welcome to use manual selection.
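One way to pick which clients to investigate first is to group the affected clients by subnet, so that clusters stand out. A minimal sketch, assuming you can export the IP addresses of the clients reporting directly to the main server; the /24 grouping, the function name, and the sample addresses are illustrative assumptions.

```python
import ipaddress
from collections import Counter


def count_by_subnet(client_ips, prefix_len=24):
    """Count client addresses per subnet, most affected subnet first."""
    counts = Counter()
    for ip in client_ips:
        # strict=False lets a host address stand in for its containing network
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(net)] += 1
    return counts.most_common()


# Made-up addresses, for illustration only
affected = ["10.1.2.15", "10.1.2.40", "10.1.3.7"]
print(count_by_subnet(affected))
# [('10.1.2.0/24', 2), ('10.1.3.0/24', 1)]
```

Checking one client from each of the largest groups is usually enough to tell whether a subnet-wide cause is involved.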