Have you tried yet using a browser from the WebUI machine to connect to masthead.server.name:52315? I think it would be useful to see what certificate the root is presenting (or whether there’s a transparent proxy overriding it).
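For anyone who would rather script this check than use a browser, here is a minimal sketch using only the Python standard library. It fetches whatever certificate a port presents, without validating it, and prints a SHA-256 fingerprint. Running it on the root itself and again from the WebUI machine and comparing the output is one way to spot a transparent proxy swapping the certificate. The function names are made up for illustration, and masthead.server.name is the placeholder from this thread.

```python
import hashlib
import ssl


def pem_fingerprint(pem: str) -> str:
    """Return the SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()


def presented_fingerprint(host: str, port: int) -> str:
    """Fetch the certificate a server presents (no validation) and fingerprint it."""
    pem = ssl.get_server_certificate((host, port))
    return pem_fingerprint(pem)


# Example use (needs network access to the server; names are placeholders):
#   for port in (52311, 52315):
#       print(port, presented_fingerprint("masthead.server.name", port))
```

A fingerprint only shows whether two machines see the same certificate. To read the CN itself you still need a browser or a certificate-parsing tool.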
Just did that on 52311 with HCL Support last night, just tried 52315. Both show… ready?… a.domain.com!!
So, from the migration of data from a to b, some certificate from a is still there confusing everything…
Now what?
And… "a.domain.com" is an individual server hostname, not the masthead name of the deployment? That’s confusing; I’ve not seen our self-signed certificate use the real hostname before. It should be issued based on the masthead name.
That’s what I thought. I was very surprised to see the server name as the CN in the cert.
How about the old server? Does it also have its real hostname in the self-signed cert on port 52315 ? It shouldn’t…
Any chance that something was entered wrong when installing the new server? If I understand your scenario correctly, you should have started by installing a new root server, using the option to “Install using an existing masthead” and providing the masthead.afxm / actionsite.afxm from the original installation. In that scenario I don’t think it would present an install option for you to type in the gather URL, but if so it should have been "http://mastheadname.domain.com:52311", not an individual server name. After installing the new root you’d continue by restoring the database and files backups from the original root server to it…
(I emailed this response, but it hasn’t shown up yet, so manually posting a copy…)
There’s every possibility that something was entered wrong, but under no circumstances would anyone have entered the old server name as part of the process.
On the current production server:
https://localhost:52311/ (Our own certificate)
Issued to:
Common Name (CN) masthead.server.name
Organization (O) Duke University
Organizational Unit (OU)
Issued by:
Common Name (CN) InCommon RSA Server CA 2
Organization (O) Internet2
Organizational Unit (OU)
https://localhost:52315/ (Self-signed for WebUI)
Issued to:
Common Name (CN) a.domain.com
Organization (O) IBM
Organizational Unit (OU) platform
Issued by:
Common Name (CN) WebUI Certificate Authority
Organization (O) IBM
Organizational Unit (OU) WebUI Certificate Authority
So, how do I change the 52311 and 52315 certificates? Or where are they stored? Google and HCL search aren’t being much help…
The certificate on 52311 is what’s changed on the root server when you generate your own certificate and apply the client settings on the root server to use the custom certificate as described at https://help.hcltechsw.com/bigfix/10.0/platform/Platform/Config/c_restapi_https_settings.html
As for the certificate on 52315…I’m afraid I don’t know. I don’t think it’s supposed to be customizable…but it’s a really key thing that you should be sure is in the support ticket, I think the Platform Dev team may have some details on it, and I think that’s almost certainly where your WebUI connection is going wrong.
Update: We did some reviewing and brainstorming this afternoon and managed to figure out a workaround that gets the WebUI server (and therefore SAML authentication) running: we added the old server address (instead of just the masthead’s “service” address) to the local Windows hosts file so that the new server thinks that the old server name points at itself. The WebUI server started and ran fine, also handling SAML authentication requests for the WebUI, Web Reports, and the Console.
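To double-check a workaround like this, a small sketch along the following lines can confirm which address the hosts file maps the old name to. It parses hosts-file text in the usual "address name [aliases] # comment" layout. On Windows the file normally lives at C:\Windows\System32\drivers\etc\hosts. The function name and the sample address 10.0.0.5 are hypothetical.

```python
def hosts_addresses(hosts_text: str, name: str) -> list[str]:
    """Return the addresses that a hosts-file body maps to `name` (case-insensitive)."""
    wanted = name.lower()
    found = []
    for line in hosts_text.splitlines():
        # Drop any trailing comment, then split into address + names.
        entry = line.split("#", 1)[0].split()
        if len(entry) >= 2 and wanted in (alias.lower() for alias in entry[1:]):
            found.append(entry[0])
    return found


# Hypothetical hosts-file content: the old server name points at the new server.
sample = "127.0.0.1 localhost\n10.0.0.5 a.domain.com  # old root name\n"
print(hosts_addresses(sample, "a.domain.com"))
```

An empty list means the name is not overridden locally and resolution falls through to DNS.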
Granted, this is only a workaround. We’d still like to get the matter truly fixed before we move forward with our migration from the old hardware to the new. We’re eagerly awaiting next steps from HCL to remove/replace the old server name certificate that comes over as part of the migration.
Seems too obvious not to have already been checked, but I’ll ask anyway about the table in BFEnterprise called dbo.WEBUI_CERTIFICATE, with columns for Host, Serial, Certificate, isRevoked, and Revoked Time. Is it related to your issue?
They’re checking on that. Using besadmin.exe, I’d already revoked the existing certificates and created a new one (using a different-but-valid server name), but the name on the cert offered up on port 52315 never changed. It’s either pulling the oldest (and revoked!) certificate from the table or it’s pulling a cert from somewhere else.
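If the rows of dbo.WEBUI_CERTIFICATE are exported (for example from a query tool), a quick sketch like the one below can list which hosts still have unrevoked certificates and flag any that do not match the name you expect. The row layout here is an assumption: dictionaries keyed by the Host, Serial, and isRevoked column names mentioned above, with made-up sample values. It only inspects that table, so it says nothing about certificates built from data elsewhere.

```python
def unrevoked_hosts(rows):
    """Return the Host value of every row whose isRevoked flag is falsy."""
    return [row["Host"] for row in rows if not row["isRevoked"]]


def unexpected_hosts(rows, expected_name):
    """Return unrevoked hosts that differ from the expected name (case-insensitive)."""
    expected = expected_name.lower()
    return [host for host in unrevoked_hosts(rows) if host.lower() != expected]


# Hypothetical exported rows; the values are illustrative only.
rows = [
    {"Host": "a.domain.com", "Serial": 1, "isRevoked": 1},
    {"Host": "masthead.server.name", "Serial": 2, "isRevoked": 0},
]
print(unexpected_hosts(rows, "masthead.server.name"))
```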
So, we found the cause, the question is now what exactly to do about it.
The certificate created for use on port 52315 is based on the value found in the REPLICATION_SERVERS table. Once that value was updated and certificates were recreated, everything worked fine.
Now, what to do with that knowledge? There are two concerns that I’m working through with HCL support:
- Step 5 of the “Server Backup” documentation begins “If leveraging DSA”, then goes on to state that “If DNS aliases are being leveraged for the servers, this should not change. If is using hostnames, and the hostnames are changing, these column values may need manual modification after the restore.” Since we’re no longer using DSA and based on the “if leveraging DSA” wording, this step was skipped. We now see, however, that it applies to non-DSA environments as well. This should be changed in the documentation.
- Exactly when during a migration should the BigFix software be installed? I had thought, based on the statement that “The new BigFix server has been built” under the “Before you begin” section of the “Migrating the BigFix root server” documentation, that the proper time was “first”, resulting in the SQL and file restores being performed on a fully operational system. However, the “Server Recovery” documentation seems to indicate that the proper order is: Install SQL, Restore DBs, Restore files, Re-encrypt decrypted keys, THEN INSTALL THE BIGFIX SOFTWARE. Which is correct? Am I putting too much into the statement that “The new BigFix server has been built”, assuming that means “functioning BigFix server” rather than just “the hardware is ready”?
Hoping that @JasonWalker hasn’t forgotten about this thread and will speak to the thoughts above.
This is a DSA installation??? Ohhhhhh. I’m out.
(Our installation is entirely virtualized, relying on VMware features to handle resiliency.)
It’s actually NOT a DSA installation. That’s why we skipped that step, but it turns out that step applies whether or not it’s DSA.
I’m here…that’s…really interesting. I may need some time to think about how to use this info, great find, thanks for sharing.
I may have been real close to finding this…I had raised internally that our instructions for WebUI failover for DSA servers might be invalid since the server name in the certificate may not match the masthead name, but I never considered that we might build the cert from the replication_servers table entries.
Ah! Indeed and you’d actually said that! Sorry.
That seems like something that’s pretty important. I hope your engagement spins out a documentation bug.
I’ll defer to Support on these, since they’ve been able to do some analysis from your logs…
… I’m not sure that we document the behavior for “a system that was once DSA, but has had the DSA functionality removed”. I think you’re correct though, and a system that has ever been DSA will probably need to consider DSA ramifications ever after.
I’ve done it both ways. If there’s already a clean BigFix install, restoring the data over it (including re-encrypting the keys for the new host) overwrites the default data and works fine. If you do the restore first and then the new install, the new install reuses the existing data when it runs. (In fact, each server upgrade performs an uninstall of the old version followed by an install of the new version, which is kind of scary to watch when you do it manually.)
I always like to do dry-runs first, testing that we can get the database connected and the new server reporting to itself, then have a separate outage for a “final” restore of the files and database to the new machine (skipping the encrypted keys on that final copy, as they should have already been restored & fixed up during the dry-runs).
Edit:
If you do the “restore first, then install the software”, I don’t recall whether you’re prompted for masthead options on the new install; if so, be sure to specify install using an existing masthead.
This is exactly what we’ve been doing.