Root API "server too busy"

We occasionally get the response “server too busy” when pulling back applicable content from a computer via the Root Server REST API:

GET https://root:52311/api/computer/computerID/tasks

Does the Root server do rate limiting of some sort? If so, what are those thresholds?
Is there anyplace where these events are logged?
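While waiting for an answer on thresholds, one client-side mitigation is to retry with exponential backoff when the body says "server too busy". This is only a sketch under my own assumptions (the `fetch` callable and its `(status, body)` return shape are hypothetical stand-ins for a real `requests.get()` against the Root Server endpoint above):

```python
import time

def get_with_retry(fetch, max_attempts=5, base_delay=1.0):
    """Retry a zero-argument `fetch` callable while the server is too busy.

    `fetch` returns a (status, body) tuple; in real code it would wrap an
    HTTP GET against the Root Server REST API endpoint shown above.
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200 and "server too busy" not in body.lower():
            return body
        # Exponential backoff: wait 1s, 2s, 4s, ... before the next attempt.
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("Root Server still too busy after retries")
```

This does not answer the rate-limiting question, but it keeps a polling script from hammering a server that is already struggling.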

Hi cstoneba,
If this happens to you a lot, it would be useful if you could collect the deadlock log (SQL Server: Turn On Deadlock Trace Flag & DB2: Collecting data: DB2 Deadlocks ). Also I think your problem could be related or similiar to this one https://www.ibm.com/support/pages/users-logged-webui-are-suddenly-logged-out. Take a look and let me know if you need something else.

Hi, I am seeing some deadlock errors in the filldb.log.

Fri, 04 Oct 2019 22:01:36 -0500 -- 2416 -- Encountered error during long property results update: Database Error: [Microsoft][SQL Server Native Client 11.0][SQL Server]Transaction (Process ID 61) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction. (40001: 1,205) (RESULTS LOST)

Fri, 04 Oct 2019 22:01:36 -0500 -- 2416 -- Error storing reports: Database Error: [Microsoft][SQL Server Native Client 11.0][SQL Server]Transaction (Process ID 61) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction. (40001: 1,205)

I’ll have to investigate those in SQL. But do you think they would cause the Root server API to respond with “server too busy”?

Most likely yes; let me know what you find in the SQL deadlock log.

@ 2019-10-11 22:05:00.247 :

spid 61 is being blocked by spid 195, and in turn spid 195 is being blocked by spid 61; spid 61 was chosen as the deadlock victim.

spid 61 query:

merge into LONGQUESTIONRESULTS as q
using (values(@P1,@P2,@P3,@P4,@P5,@P6,@P7,@P8,@P9,@P10))
  as v(siteid, analysisid, propertyid, computerid, isfailure, isplural, resultscount, resultstext, reportnumber, webuisiteid)
on q.SiteID=v.siteid and q.AnalysisID=v.analysisid and q.PropertyID=v.propertyid and q.ComputerID=v.computerid
when matched then update set IsFailure=v.isfailure, IsPlural=v.isplural, ResultsCount=v.resultscount, ResultsText=v.resultstext
when not matched then insert(SiteID, AnalysisID, PropertyID, ComputerID, IsFailure, IsPlural, ResultsCount, ResultsText, WebuiSiteID)
  values(v.siteid, v.analysisid, v.propertyid, v.computerid, v.isfailure, v.isplural, v.resultscount, v.resultstext, v.webuisiteid);

spid 195 query:

DELETE TOP (@BatchSize) LONGQUESTIONRESULTS
FROM LONGQUESTIONRESULTS L
WHERE SiteID = L.SiteID AND AnalysisID = L.AnalysisID AND PropertyID = L.PropertyID
  AND ( NOT EXISTS ( select C.ComputerID FROM Computers C WHERE C.ComputerID = L.ComputerID )
     OR EXISTS ( select C.ComputerID FROM Computers C
                 WHERE C.IsDeleted = 1 AND C.ComputerID = L.ComputerID
                   AND DateDiff(day, C.LastReportTime, GetUTCDate()) > @InactiveDays ) )

It looks like the BES Computer Remover is running. If so, it should be scheduled for when FillDB is less loaded, and the batch size should not be set too large, in order to avoid deadlocks.

BTW, this is for HCL incident CS0052264.

Yes, we have the BES Computer Remover running every 6 hours with a batch size of 500,000. This BigFix deployment/SQL server is used very heavily, so there is no known low-use window.
Do you think that batch size is too large?

Yes, I think that is quite a large value.

Do you have any recommendations?

The default batch size is 10,000 and works well for most customers. The computers matched by the filters still get removed; they are just deleted in several smaller batches, so other queries are not blocked by database locking for as long.
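To make that trade-off concrete, here is a minimal sketch (plain Python, not product code, and the row totals in the comments are hypothetical) of how batch size maps to the number of short lock-holding delete statements:

```python
def delete_in_batches(rows, batch_size):
    """Simulate a batched delete: return how many DELETE TOP (@BatchSize)
    statements are needed to remove `rows` rows. Each statement holds its
    locks only for its own batch, so other queries can run in between."""
    return -(-rows // batch_size)  # ceiling division

# Hypothetical totals for illustration:
# 500,000 rows with batch size 500,000 -> 1 long lock-holding statement;
# the same rows at the default 10,000 -> 50 short statements.
```

The total work is the same either way; the smaller batch size just releases locks 50 times along the way instead of once at the end.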

Are there any reports one could run that would help determine what the batch size should actually be?

We’ve set the batch size for the scheduled Computer Remover and the scheduled Audit Trail Cleaner from 500,000 down to 10,000, and we’ll see if that helps with the deadlocks.
Thanks.

We’re still getting BES Root API responses where there is no relevant task list for a computer ID, yet we have no deadlock occurrences in the SQL logs.
Any other suggestions?

I think you should work through the support ticket, so they can look at your specific configuration.

I am doing that in parallel. Nothing there yet though.

Hi @cstoneba,
When you drive the REST API, do you know the degree of concurrency of the requests?
I.e., how many requests are you issuing in parallel?

It’s a little hard for me to tell, but looking at my root server’s server_audit.log at an arbitrary time, there are 94 API connections within a one-minute window.
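If many of those 94 requests overlap, one option while investigating is to cap client-side concurrency. A sketch under my own assumptions (`fetch` is a hypothetical stand-in for the real API call, and the `MAX_IN_FLIGHT` value is a guess to be tuned, not a documented Root Server limit):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 4  # assumption: tune to what the Root Server tolerates
_gate = threading.Semaphore(MAX_IN_FLIGHT)

def throttled(fetch, computer_id):
    """Run fetch(computer_id) with at most MAX_IN_FLIGHT calls in flight."""
    with _gate:
        return fetch(computer_id)

def fetch_all(fetch, computer_ids):
    # More workers than permits: the semaphore, not the pool, sets the cap.
    with ThreadPoolExecutor(max_workers=16) as pool:
        return list(pool.map(lambda cid: throttled(fetch, cid), computer_ids))
```

Even if the server does its own rate limiting, keeping the client below that ceiling avoids piling up requests that will just be rejected as busy.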

I think we have this resolved. “server too busy” was just one symptom of the root problem. Our DSA SQL server was out of disk space so when it tried to replicate from the Primary Root server every 2 hours, it caused a db lock on the primary SQL. That in turn caused things like the filldb to fill up, webreports to hang, BES root server service API to hang, etc.
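For anyone hitting the same failure mode, a simple scheduled free-space check on the replica could catch this earlier. A hedged sketch (the path and threshold are assumptions; point it at the volume holding the DSA SQL data files):

```python
import shutil

def free_space_gb(path):
    """Return free space, in GiB, of the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024 ** 3)

def warn_if_low(path, threshold_gb=50.0):
    # threshold_gb is an assumption; size it to your replication volume.
    free = free_space_gb(path)
    if free < threshold_gb:
        print(f"WARNING: only {free:.1f} GiB free on {path}")
    return free
```

Run it from a scheduler (or a BigFix analysis) so the replica filling up surfaces as an alert instead of as deadlocks and hung APIs on the primary.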

Thanks so much for posting the update! I’ve been trying to get my head around what the problem could be, as I have some customers doing very heavy API calls without seeing that message.

(I hadn’t even thought about asking about DSA).