I recently turned on FillDB performance logging, in order to get insight on whether my BigFix server could be better sized/configured. However, is there any guidance on what are considered “optimal” vs “bad” metrics?
Guidance or recommendations on ‘optimal’ versus ‘bad’ performance metrics really require more information (for instance, the requirements or goals of the implementation). What might be considered poor performance metrics in one environment may be perfectly adequate for another (based, for instance, on the number of endpoints as well as the subscribed sites/content). This may be an obvious statement, but at a minimum, FillDB’s insertion rates should be higher than the rate of incoming data, preferably both when an environment is ‘idle’ and during periods of higher activity such as large-scale deployments. Otherwise, backlogs will build up in the BufferDir ‘queue’, which will lead to reduced responsiveness and delays in reporting results.
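As a minimal sketch of one way to watch for that kind of backlog, you could sample the BufferDir a couple of times and see whether the file count is growing. The path below is an assumption (the BufferDir location varies by installation), and this is only an illustration, not part of any BigFix tooling:

```python
# Minimal sketch: sample the FillDB BufferDir and report file count / total size.
# The path below is an assumption -- adjust to your server's actual BufferDir location.
import os
import time

BUFFERDIR = r"C:\Program Files (x86)\BigFix Enterprise\BES Server\FillDBData\BufferDir"  # assumed path

def sample_bufferdir(path):
    """Return (file_count, total_bytes) for everything under the buffer directory."""
    count, size = 0, 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                size += os.path.getsize(os.path.join(root, name))
                count += 1
            except OSError:
                pass  # a file may be consumed by FillDB between listing and stat
    return count, size

if __name__ == "__main__":
    # Take two samples a minute apart; a growing backlog suggests insertion
    # rates are not keeping up with incoming report volume.
    c1, s1 = sample_bufferdir(BUFFERDIR)
    time.sleep(60)
    c2, s2 = sample_bufferdir(BUFFERDIR)
    print(f"files: {c1} -> {c2}  bytes: {s1} -> {s2}")
    print("backlog growing" if c2 > c1 else "backlog stable or draining")
```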
One of the nice things about the FillDB performance logging is that it provides some insight into the types (and volumes) of data being processed. This not only provides context around the amount of particular incoming data (such as ActionResults versus FixletResults versus Property/QuestionResults), but also shows how quickly those data types are processed. Large differences between the insertion rates of these data types can be an indication of an issue (though LongPropertyResults are expected to be slower to process). Comparing the row volumes of the different data types can also provide context around the usage patterns of the environment.
When measuring FillDB performance, it’s important to focus mainly on metrics associated with larger batches (i.e., those with at least 500 messages). Smaller batches typically do not provide accurate insight into throughput.
While I’m going to refrain from suggesting “good” versus “bad” metrics here for the time being, I will say that we tend to like seeing insertion rates in the thousands of rows per second rather than the hundreds (with LongPropertyResults being a potential exception). As I’ve attempted to describe above, however, please keep in mind that this is really only one perspective on the data that the FillDB performance logging provides.
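To make that kind of analysis a bit more concrete, here is a rough sketch that aggregates per-type throughput from already-parsed performance records, keeping only batches with at least 500 messages and flagging types below roughly 1,000 rows per second. The record fields (result_type, rows, seconds, messages) and the sample values are placeholders rather than the actual log format, so the parsing step is deliberately left out:

```python
# Rough sketch of the aggregation described above, operating on already-parsed
# FillDB performance records. Field names and sample data are assumptions for
# illustration only; the real log format varies by version.
from collections import defaultdict

MIN_BATCH_MESSAGES = 500        # ignore small batches; they skew throughput numbers
SLOW_THRESHOLD_ROWS_SEC = 1000  # rough rule of thumb: thousands, not hundreds

def summarize(batches):
    """Aggregate rows and elapsed time per result type for large batches only."""
    totals = defaultdict(lambda: {"rows": 0, "seconds": 0.0})
    for b in batches:
        if b["messages"] < MIN_BATCH_MESSAGES:
            continue
        totals[b["result_type"]]["rows"] += b["rows"]
        totals[b["result_type"]]["seconds"] += b["seconds"]
    for rtype, t in sorted(totals.items()):
        rate = t["rows"] / t["seconds"] if t["seconds"] else 0.0
        flag = ""
        if rate < SLOW_THRESHOLD_ROWS_SEC and rtype != "LongPropertyResults":
            flag = "  <-- below ~1000 rows/sec"
        print(f"{rtype:25s} {t['rows']:>10,d} rows  {rate:>9,.0f} rows/sec{flag}")

if __name__ == "__main__":
    # Hypothetical parsed batches, purely for illustration.
    summarize([
        {"result_type": "FixletResults",       "messages": 800, "rows": 120000, "seconds": 35.0},
        {"result_type": "QuestionResults",     "messages": 650, "rows": 90000,  "seconds": 110.0},
        {"result_type": "LongPropertyResults", "messages": 510, "rows": 20000,  "seconds": 60.0},
        {"result_type": "ActionResults",       "messages": 120, "rows": 4000,   "seconds": 2.0},  # skipped: <500 msgs
    ])
```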
I may be wrong, but I would think the performance metrics are most meaningful when there is a slight backlog, since that is when maximum throughput is really being tested (unless the backlog is being caused by some other process or bottleneck, which adds complexity).
I think we run into a general issue: we would like better performance out of IEM/BigFix on the whole, but it is unclear how to get it. Where are the bottlenecks? Which metrics are the most meaningful to improve, and how will that translate into the perceived performance of the system?
In the case of the root server, what is the right combination and configuration of CPU/RAM to give us the best performance? Could certain configurations actually cause the system to be slower? (For instance, is a single 16-core CPU better than two 8-core CPUs?)
What is most important? Single-threaded performance? Multi-threaded performance? RAM throughput? Latency? Do NUMA nodes matter? What matters more for storage: IOPS, throughput, or both? And frankly, all of these questions should be answered separately for the root server processes and the DB processes, both when they share a single server and when they run on separate servers.
If we wanted to set up a root server with the idea that it could handle 100,000 endpoints, what would that look like? What about 500,000? I think knowing what matters most for a very large environment is useful even for a much smaller one, since you could configure the root server to take advantage of whatever optimizations matter at the high end and scale it up as needed.