Large data set question

(imported topic written by djorlett91)


I have an environment that has 200,000 nodes

I need to retrieve netstat information, vuln information (patches and fixlets), and DISA STIG information per host

On average, let's estimate that a given host has 20 open ports; netstat will return on average three records for each open port (port/protocol/host, owner, process name)

Let's then say that each record is about 256 bytes on average

200,000 hosts * 20 ports * 3 records * 256 bytes = ~3.07 billion bytes = ~3 GB
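A quick sanity check of that back-of-the-envelope estimate (the per-host numbers are the assumptions stated above, not measured values):

```python
# Assumed averages from the estimate above.
hosts = 200_000
ports_per_host = 20
records_per_port = 3
bytes_per_record = 256

total_bytes = hosts * ports_per_host * records_per_port * bytes_per_record
print(total_bytes)               # 3_072_000_000 bytes
print(total_bytes / 1024 ** 3)   # ~2.86 GiB, i.e. roughly 3 GB
```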

I have two methods on the WSDL from which to pull the data into my system.




Current Solution idea:

My feeling is I want to pull the data host by host,

so for instance

foreach group cGroup in bes groups
    foreach host cHost in cGroup
        do netstat query
        do vuln query
        do stig query

However, there is a high possibility that some hosts will not be in any group.
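The loop above, plus a second pass for ungrouped hosts, could be sketched roughly like this (the data and the `query_host` stub are hypothetical placeholders, not the real BigFix WSDL calls):

```python
# Hypothetical stand-ins for the group structure and host inventory.
groups = {
    "web": ["host-01", "host-02"],
    "db": ["host-03"],
}
all_hosts = ["host-01", "host-02", "host-03", "host-04"]  # host-04 is in no group

def query_host(host):
    # Placeholder for the per-host netstat / vuln / STIG queries.
    return {"host": host, "netstat": [], "vulns": [], "stig": []}

results = []
seen = set()

# First pass: walk every group, then every host in the group.
for group_name, hosts in groups.items():
    for host in hosts:
        results.append(query_host(host))
        seen.add(host)

# Second pass: pick up any host that never appeared in a group.
for host in all_hosts:
    if host not in seen:
        results.append(query_host(host))
```

The second pass is one way to close the "hosts not in groups" gap: diff the full host list against the hosts already visited via groups.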

SOAP is not the fastest thing in the universe.

I am looking to get the best optimized way to pull this data.

I’ve thought perhaps of a packet system, but I am not certain how to tell BigFix to give me results for hosts

iCurrentHost through iCurrentHost + iPacketSize
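The packet idea amounts to walking the full host list in fixed-size windows so each call only returns one batch's worth of data. A minimal sketch (the batch size and the comment about scoping a relevance query to a batch are assumptions, not verified BigFix syntax):

```python
def batches(hosts, packet_size):
    """Yield consecutive slices of the host list, each at most packet_size long."""
    for i in range(0, len(hosts), packet_size):
        yield hosts[i:i + packet_size]

# Hypothetical host inventory at the scale discussed above.
hosts = [f"host-{n:06d}" for n in range(200_000)]

for batch in batches(hosts, 500):
    # In the real system this would be one SOAP / relevance query scoped
    # to just these hosts (e.g. a "computer name is contained in set of ..."
    # style clause -- an assumption, not confirmed syntax), keeping each
    # response small enough to serialize.
    pass
```

This keeps every individual response bounded regardless of total node count, which is the scalability property the question is after.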

Ultimately my challenge comes down to the scalability of the solution.

Anyone have thoughts / suggestions / questions?



(imported comment written by BenKus)

Interesting challenge… and at the scale of 200,000 agents, you definitely are right to make sure you properly architect your approach to avoid scaling issues…

How often do you want to query the data?


(imported comment written by djorlett91)

The query is user driven, typically on the same time frame as a scan would be for a system.

Think in terms of a vulnerability scanner (Nessus, McAfee Vulnerability Manager, etc.)

The goal here is to use the current node base as a scan utility.

That is the long way to say that it will be whenever the client does a scan of their environment for vulnerabilities. Sometimes it's once a week, sometimes once a month, or even once a quarter.

Obviously I can't hit the whole system at once because the string array will not serialize and send across the proxy. So it's really about getting as close to optimal as I can.

I appreciate any insight into the BigFix system, or any thoughts about using the async get-relevance-results call in creative ways. I've thought about it a bit, but my thinking is geared toward working with SQL and batching result sets, so this is a little different.


(imported comment written by BenKus)

Hey Devon,

Hmmm… I thought I knew what you were trying to accomplish based on your first post, but then I was confused by your follow-up post… Do you mind summarizing again what you were looking to accomplish to help me get clarity on the use-case?


(imported comment written by djorlett91)


So here it is in a nutshell

I am integrating BigFix with a tool suite that builds data from a vulnerability scanner.

So what I need is a “snapshot” of the existing environment from which this other tool suite will build a model.

While I can use SQL to pull the data, relevance is really the only way to get at most of this information, at least as far as I am aware.

So if I use the web service to pull my data, I am faced with the possibility that the amount of data coming down the HTTP (or HTTPS) pipe will be too large. I know from experience that at a certain size the serializer will bomb out while building the response, and there is no native subdivision or packeting functionality.

I have to build it for the largest case first and then look at some shortcuts if I have a smaller sample set. Either way, I am looking at a very possible 500,000 nodes in the system, even though my previous example was 200,000.

So on request I need to get a snapshot of the existing environment, which includes hosts, the ports open on those hosts, and the vulnerabilities associated with those hosts (patches needed or fixlets).

I am also looking at pulling the current grouping structure, and the DISA STIG information for each of the hosts.

I hope that clears up what I am trying to accomplish.