Initial DSA replication

Is there a reason why an initial DSA replication would cause the bufferdir on the primary to hang? BFEnterprise is 200GB on the primary, so the initial replication takes hours, and while it runs the bufferdir contents on the primary stop being processed, which causes an outage on our primary. Is that normal?

The description of your problem is not clear to me; however, this behaviour could be connected to the large database size. I suggest you open a case with the BigFix support team.

The issue is that when the DSA server replicates from the primary, the primary’s FillDB stops processing bufferdir files, which I don’t think is normal.
The primary’s BFEnterprise is about 200GB and the audit trail cleaner is run every 7 days.

Yes, working with support but hoping for some experience from the knowledge forum.

How frequently is it set to replicate, and are you getting lock timeout messages in filldb.log?
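To spot those lock timeout messages quickly, something like the following sketch can scan FillDB.log for lock/timeout/deadlock lines. The log path and the exact message text vary by platform and version (the path below is an assumption for a Windows root server), so it matches loosely on keywords:

```python
"""Sketch: scan a FillDB log for lock/timeout messages.

Assumptions: the log location and exact wording of the messages are not
from this thread; adjust the path and pattern for your deployment.
"""
import re
from pathlib import Path

# Loose keyword match; refine once you see the actual message text.
LOCK_PATTERN = re.compile(r"lock|timeout|deadlock", re.IGNORECASE)


def find_lock_messages(log_text: str) -> list[str]:
    """Return log lines that mention locks, timeouts, or deadlocks."""
    return [line for line in log_text.splitlines() if LOCK_PATTERN.search(line)]


if __name__ == "__main__":
    # Hypothetical default log location; change to match your install.
    log_path = Path(r"C:\Program Files (x86)\BigFix Enterprise\BES Server\FillDBData\FillDB.log")
    if log_path.exists():
        for line in find_lock_messages(log_path.read_text(errors="replace")):
            print(line)
```

Counting matches per replication window can show whether the lock contention lines up with the DSA schedule.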

You may need to increase the replication interval in BESAdminTool if it’s too frequent, and you may need to increase UnInterruptibleReplicationSeconds (linked above). If the replication cannot complete a full table before getting interrupted, I believe it may start over from the beginning at the next replication period.

If you adjust UnInterruptibleReplicationSeconds, start gradually - maybe 60 or 120 seconds rather than the default 30. In my post I went as high as 720 seconds, but I was replicating between two datacenters over a high-latency link (which is not recommended at all, by the way).

I have it set to 15min just for now, but usually it is 2hr. Yes, I can look into UnInterruptibleReplicationSeconds, but my bigger question is: should the primary’s FillDB stop processing (and the bufferdir fill up) when the DSA replicates from it? I wouldn’t think so, because it causes us an outage on the primary.
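One way to quantify the outage is to watch the bufferdir backlog directly - file count and the age of the oldest file - and see whether it climbs only during replication windows. A minimal sketch (the BufferDir path is your deployment’s, not assumed here):

```python
"""Sketch: measure bufferdir backlog (file count and oldest-file age).

Point buffer_dir at your server's BufferDir; the path is deployment-
specific and intentionally not hard-coded here.
"""
import time
from pathlib import Path


def backlog_stats(buffer_dir: Path) -> tuple[int, float]:
    """Return (file_count, oldest_file_age_seconds) for files in buffer_dir."""
    files = [p for p in buffer_dir.iterdir() if p.is_file()]
    if not files:
        return 0, 0.0
    oldest_mtime = min(p.stat().st_mtime for p in files)
    return len(files), time.time() - oldest_mtime
```

Sampling this every minute and graphing it against the replication schedule makes the correlation (or lack of it) easy to show to Support.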

Opinion here:
DSA in general can require significant tuning to prevent database locking issues, and at a large enough scale that can become untenable.

If there is contention for database resources, yes, that can cause locking problems where FillDB can’t update rows because the rows are locked for replication by the DSA partner. The Support team can help, but the usual tuning guidance from the Capacity and Planning guide applies - including FillDB tuning, console refresh times, client reporting intervals, offloading processing to top-level relays, database and hardware tuning (keeping DB transaction logs, tempdb, and BFEnterprise on separate volumes; high-performance storage such as SSD RAID volumes), and high-capacity links between the DSA partners.

In large deployments, I haven’t seen anyone recommend DSA in quite a while due to the overhead of tuning. For deployments of over 45k endpoints or so, we usually do a traditional DR instead - having the BigFix server services preinstalled at a backup site, and copying wwwrootbes files and SQL database backups to it as frequently as needed (usually daily) for a recovery.

Yeah, we’re finding that too - DSA isn’t a great fit for our multiple 50k+ deployments. Our next tech refresh will use standard SQL replication practices for HA and some sort of HA for the root server (TBD).
Hopefully support can help. Thanks.


If not already done, to further investigate your problem you may want to enable the FillDB debug and FillDB performance logs. See: !/wiki/Tivoli%20Endpoint%20Manager/page/BigFix%20Logging%20Guide

Once you have the evidence of the problem, remember to disable the above logging, as it could affect performance.

I was able to tighten up the audit trail cleaner values and removed a bunch of rows. This allowed DSA to replicate past the ACTION DEFS table. I think I’ll have to run the audit trail cleaner daily, as we have some days with 5000+ jobs.

Thanks for the help.