Issue with persistent open tasks

I have a few tasks/fixlets which must be run nightly within a range of about 12 AM - 4 AM. To accomplish this, I have created open-ended jobs such as this one:

This action starts 1/19/2016 6:45:00 AM UTC.
This action will never expire.
It will run between 6:45:00 AM and 7:45:00 AM UTC, on any day of the week.
If the action becomes relevant after it has successfully executed, the action will be periodically reapplied an unlimited number of times, waiting 1 day between reapplications.
If the action fails, it will be retried up to 999 times, waiting 1 day between attempts.
The action’s start times will be staggered over 5 minutes to reduce network load.

The problem is, after a week of a job like this running, it slowly gets off track and ends up with 30%+ of the endpoints sitting in a waiting state. I have tried playing with several options, such as giving it a larger window to run in, using more and less stagger (more would be preferable to reduce network load), and changing my waits to runs, but nothing seems to help. It is not always the same 30%+ of endpoints sitting in a waiting state each day, or even between jobs. One job may have 500 endpoints sitting in waiting while another job may have 200 different endpoints sitting in waiting. If I give the job a much larger window of, say, 12 hours (which is not desirable, as these jobs really need to run overnight only), then I get a few more completions, but the run times are all over the place even with a low stagger.

For information's sake, these jobs are typically run against approximately 1,800 endpoints, which use 1,300 local relays that are in turn pointed at 2 top-level server relays.

I have checked the logs on the endpoints that are supposed to be running the jobs, and also on the relays, and do not see any issues. At the times the job is supposed to run, I often see a relevance check complete in the logs, but I do not see anything actually associated with the job. The jobs are not transferring large files, so I do not think it is an issue with poor bandwidth or network congestion. I have also removed pretty much all relevance from some of the jobs, just in case there was something weird with the relevance.

What am I doing wrong here? What is the best method to have something run consistently on a daily basis?

Try reducing the waiting time between reapplications. For example, try 4 hours or any value greater than your action window, but smaller than 1 day.


You mentioned 1,300 relays; I am guessing 1,300 different locations? Either way, how do you have those split across your two top-level relays? Just checking load per relay. Also, comparing the endpoints that worked against the ones that did not, do you see a trend between the ones not working and the relay they are pointing to? Finally, what is the speed of the networks at these locations and the size of the file(s) you are pushing?

Sorry, I neglected to note that I have toyed with the reapply setting. I tried 12 hours, which really had no effect, considering I originally had only a 4-hour window for these tasks. I reduced it to reapply every hour while relevant, and then even to reapply immediately while relevant, which I would assume would cause some of the endpoints to run the job multiple times within that window, which is not ideal.

Yes, 1,300 different remote locations. Speeds are mostly low-tier DSL, so at bare minimum 1.5 Mbps connections, with most being 3 Mbps+. The local relays are manually “load balanced” by pointing half of them to one top-level relay as primary and the other half to the other top-level relay as primary. I just verified that the 500 endpoints stuck in waiting from last night's job are using a mix of the two top-level relays.
As for the size of the files, I have reduced the number of daily file transfers everywhere possible. Many of the overnight jobs simply run a single command, and the results are read later in an analysis. The two jobs that still transfer a file are around 100 KB each. There are still one-time deployments sent out overnight as needed for other things that may come up during the week, but that is not a nightly or even weekly occurrence.

It would be helpful to know a little more about the relevance being used. If the relevance always evaluates to true, then your concerns about the immediate reapply may be (pardon the pun) relevant. You mentioned that you toyed with the reapply settings some. Was that for the “if the action becomes relevant” clause, or the “if the action fails” clause? One thing that might be happening is that on some endpoints the action is failing and going into a wait state for the retry multiple times, which could then conflict with the expected once-a-day reapplication. You might cut your retry count of 999 down to something smaller, in both count and attempt interval: maybe “it will be retried up to 24 times, waiting 1 hour between attempts”, to make sure it won't conflict with the next daily run. Just my $.02…


My relevance right now for the two main jobs that are having issues is actually none. I completely cleared the relevance in hopes of somehow improving the results. The relevance that was there prior was still very simple relevance, “member of group XXX of site “xxxx”” type of stuff.
The reapply settings I toyed with are for the “if the action becomes relevant” clause. I had not thought about the fact that the job could be failing and then immediately going into a waiting state while waiting for the next day. The only problem there is that on the following day that particular endpoint will work just fine, with absolutely no changes made to the job or the endpoint. One of the jobs with problems only issues a run command: “run xxx.exe”. The executable exists everywhere by default, so it shouldn't fail due to the file being missing, but just in case I even tried adding an ‘if then’ check for the file, and had previously used relevance to verify that the file exists. I may try changing the “if action fails” setting to see if I'm somehow getting failures.
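For reference, that kind of ‘if then’ existence guard can be sketched in BigFix ActionScript roughly like this (a sketch only; the path shown is illustrative and the real job's executable name may differ):

```
// Sketch of an existence guard in BigFix ActionScript.
// The path below is illustrative; substitute the job's real executable.
if {exists file "c:\nightly\xxnightly.exe"}
    run c:\nightly\xxnightly.exe
else
    // Exit with a non-zero code so the action reports a failure
    // and the "if the action fails" retry settings can catch it.
    exit 1
endif
```

Exiting explicitly on the missing-file branch makes the failure visible in the action status rather than leaving the endpoint looking like it simply never ran.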

I would take out all of the retry and reapply stuff (criticality permitting) and just send a normal action (not a policy) to these endpoints during your same window and see what you get. Possibly also monitor FillDB on your main server and top-level relays to see if they are backing up. You will at least be able to tell whether the action is getting to your endpoints like it is supposed to, and it takes the complexity out of the equation. Have you checked some of the endpoints that were showing waiting and looked at the times when the fixlet was evaluated? Also, are all of these endpoints in the US?

It doesn’t sound like you have a bandwidth issue, you have more bandwidth than I normally work with and I push a lot more files/larger sizes out every night without those issues. Hope some of that helps.

I have sent other jobs to these same endpoints around the time the daily jobs should be running and get much better results with the one-time jobs. When I first schedule these daily jobs, the first couple of days result in 95%+ completion. It is after this point that the percentage starts to gradually dip, until eventually only half of the endpoints are actually completing nightly.

I have checked logs on a handful of the endpoints that were stuck in waiting, and it does look like the relevance was checked within the window the job was supposed to run:

At 01:49:45 -0600 - actionsite (http://xxxxIEM:52311/cgi-bin/bfgather.exe/actionsite)
Relevant - xxnightly (fixlet:22533)

However, I do not see the actual job attempt to run. The previous and following days, it will run fine:

At 01:57:48 -0600 -
ActionLogMessage: (action:22533) Distributed - time has arrived
ActionLogMessage: (action:22533) Action signature verified for Execution
ActionLogMessage: (action:22533) starting action
At 01:57:49 -0600 - actionsite (http://xxxxIEM:52311/cgi-bin/bfgather.exe/actionsite)
Command succeeded run c:\nightly\xxnightly.exe (action:22533)

I do not see any failures anywhere in the logs on the days where it does not run.

Yes all endpoints are in the US and do not show any network drops or saturation on the network monitoring solution I use.
I have not checked the FillDB.

This has already been pointed out, but if you want the action to reapply every day within a specific time window, then the reapplication interval needs to be less than one day.


What was your reapplication set to? Was it reapply while relevant, waiting? It sounds like it was.

Did you try it without this:

In theory, I get what this should do, but I’m not really clear on it, and I think it could be the source of your issue if you haven’t tried it without this.

So I changed some of the settings as suggested, and so far there is an improvement. There are still a few endpoints which are not always running nightly, but that could be from factors outside of BigFix.

Here are the settings I am going with for now:

This action will never expire.
It will run between 7:45:00 AM and 8:45:00 AM UTC, on any day of the week.
If the action becomes relevant after it has successfully executed, the action will be periodically reapplied an unlimited number of times, waiting 30 minutes between reapplications.
If the action fails, it will be retried up to 2 times, waiting 10 minutes between attempts.

I really do not like running a job against that many endpoints without a stagger, but at least the two main jobs that I need to run nightly are very light and shouldn’t cause too much congestion.

Thanks everyone for your help.


It is pretty normal for a small percentage of endpoints to not be awake or reachable at the scheduled time.