Have you ever racked your brain as to why a server looks busy, but the overall health of the server is fine? There is no obvious answer for what could be causing the issue you are experiencing.
This issue has plagued me for a month regarding one server. We have a server that was built as a Data Warehouse, but has evolved into a DW/OLTP/Reporting/ODS/UDM and any other acronym you can think of box. The plan for the box was to take data points from all different products and convert the data into a single language and single point of truth for data. Once that was in place, our product teams decided to take the easy way out and take what they needed from this server. Now all products are dependent on a DW server. That is not practical for many, many reasons and we have been trying to sever the dependencies for over a year. However, following recent outages and performance issues, we have been working to break the dependencies.
Now, the focus of this blog: We have hundreds of SSIS packages running on this server every 5 minutes, every 15 minutes, every hour and some jobs run once a day. The overnight jobs are time dependent and must be done by certain times to meet financial reporting and other dependencies. Our Operations Center has a run book that shows jobs should be complete by a certain time. One minute after the scheduled end time has expired they notify the on-call. I was on-call this past week and was not going to get woken up in the middle of the night.
I have not been able to find a specific reason for the sporadic nature of the long running jobs I was going to investigate and attempt to resolve the issue(s) this week.
The DBA rookie came out in me and I scheduled a SQL Agent Job to run at 10:00 PM before the nightly jobs kicked off.
Note/Caution: This is not an ideal fix as all execution plans will be cleared and can cause issues on busy production systems because the execution plans will have to be recompiled on execution.
Guess what? I did not get any calls relating to long running jobs!!!
Now that got me to thinking deeper about the issue. When running sp_blitzcache it always shows “parameter sniffing” for these stored procedures. While I am familiar with the concept and term, I have had a hard time understanding why it happens and how to fix.
In this particular situation we use the same SSIS packages and underlying stored procedures for the 15 minute jobs and the daily jobs. I thought maybe the 15 minute runs, which are incremental and only pull partial data were generating an execution plan for the small data runs. Then when the longer nightly runs attempt to execute they get a bad plan because they are running for larger volumes of data.
I read The Elephant and the Mouse, or, Parameter Sniffing in SQL Server from Jes Borland (t | b) previously and went through the examples. It was very informative, but I still did not know what caused the errors or solution.
Summary – Working Towards a Solution
Now that I understand the issue more thoroughly, I can discuss my findings with the ETL development team and work with them to resolve the issue.
And keep the rookie/temporary fix in place so we do not miss any SLA’s in the short term until we can identify and test a long term solution.