Metabase inspection - BESClient and Relevance Debugger Hangs

(imported topic written by MrFixit)

Today I ran into a BESClient that I just couldn’t keep from hanging. I traced the hang to the moment it starts evaluating an analysis that looks at some Metabase values for IIS configuration specifics.

I also tried a number of Relevance debugger versions to see whether this had been fixed at some point, and it hangs even with the latest version (7.2.5.22).

The hang occurs even when evaluating something as simple as “exists metabase”.

Due to the live production status of the system, I’m not able to run procmon or do much more than what I have already done to get to this point.

Any ideas? Could I change or add some relevance to avoid this hang?

(imported comment written by BenKus)

Interesting… I have never seen the metabase inspectors hang… but in general, if we make an API call that doesn’t return, the agent will wait for a response and will appear to hang… So it is true that if your system has some sort of issue that causes the metabase APIs not to work properly, it can lead to the agent not working properly.

You might try accessing the metabase through something else (like vbscript) and see if it also hangs…
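
A rough sketch of that kind of outside check, written in Python rather than vbscript: it runs a probe command in a child process with a timeout, so a hung metabase call shows up as "hung" instead of blocking whatever launched it. The function name probe, the timeout value, and the probe script named in the comment are all hypothetical; substitute whatever reads the metabase on your system.

```python
import subprocess
import sys


def probe(cmd, timeout_s):
    """Run cmd in a child process and classify the result as 'ok', 'error', or 'hung'."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so the caller is not left waiting.
        return "hung"
    except OSError:
        # The command could not be started at all (for example, it was not found).
        return "error"
    return "ok" if result.returncode == 0 else "error"


# Hypothetical usage on the affected server (metabase_probe.vbs is an assumed script name):
#   print(probe(["cscript", "//nologo", "metabase_probe.vbs"], timeout_s=30))
```

If the probe comes back "hung" or "error" outside the agent as well, that points at the metabase on that system rather than at the inspector.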

Ben

(imported comment written by MrFixit)

The system does have some issues, and apparently anything we have tried to use to look at the metabase either hangs or returns an error.

I’ve excluded it from that analysis for the time being and the BESClient stays up.

I was surprised that the BESClient doesn’t have a timeout for stuck calls for this type of situation. Is this the case for all or the majority of inspectors?

Also, does the BES Client Helper Service help if a BESClient is hung in this state? I don’t have the helper deployed everywhere yet. The service shows as running but is actually hung, and the only indication that there is an issue is that the log stops updating and the client stops reporting to the console.

thanks,

-Gary

(imported comment written by BenKus)

Hey Gary,

Well… it is tricky… Some inspectors have timeouts built in where we think they might have a hanging issue, but we don’t have a general-purpose way to tell when an inspector is just taking a long time and when it is actually hung (this is the famous computer science question known as the “halting problem”: http://en.wikipedia.org/wiki/Halting_problem).

The helper service wouldn’t have been able to detect the situation (because the agent service was still running).

Ben

(imported comment written by MrFixit)

Perhaps an “is it really alive” check would be a great new feature for the client helper. We watch for updating logs with an Ops Manager management pack, but it would be great if the helper could catch these too.
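
To make the idea concrete, here is a minimal sketch of a log-freshness check in Python. It is purely illustrative and is not how the helper service or the management pack actually works; the function name log_is_stale, the log path, and the threshold in the usage comment are assumptions.

```python
import os
import time


def log_is_stale(log_path, max_age_s, now=None):
    """Return True if the log file has not been modified within max_age_s seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(log_path)
    except OSError:
        # A missing or unreadable log is treated as stale (an assumption of this sketch).
        return True
    return (now - mtime) > max_age_s


# Hypothetical usage; the path and the one-hour threshold are made up for illustration:
#   if log_is_stale(r"C:\path\to\client\logs\today.log", max_age_s=3600):
#       print("client log has stopped updating; the agent may be hung")
```

A check like this only says that the log went quiet. It cannot tell a hung agent from one that is busy with a very slow inspector, so the threshold has to be generous.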

-Gary