Two weeks ago I was on-call and at 4:45AM Saturday morning my phone rang and the dreaded contact “NOC” was on the screen. After finding my phone and answering, the NOC technician explained that we had a Tier 2 server that had the log drive out of space. After letting them know I would check in to it and hanging up, I collected myself and continued to wake up.
After getting my laptop booted up and VPN connected, I was able to login to the server and see what was going on. Indeed, the log drive was almost full. Crap, what is going on with this server. TLog backup job is not failing. The differential backup job is currently running. Let me take a look and see what database is currently being backed up. After checking sp_whoisactive I show that our SolarWinds network monitor database was currently being backed up. I look and that is the same database logfile that filled up the drive.
Let me shrink that sucker and go back to sleep. Voila, log file size back to normal size, good night for a couple more hours.
I go on about my weekend with my family, what a great Saturday we had. Then it happened, 4:45AM on Sunday morning my phone rang with that same pesky contact on the screen. OMG, what is going on!!
Guess what, the same servers log drive has run out of space. Ok, let me get up and see what is going on. Same scenario as Saturday morning. This time I did some other digging. I found that our Index Maintenance job had failed at 4:45AM the past two mornings. Well that is interesting, this is about the same time I am getting called. This time the weekly full backup is running for the SolarWinds database.
Ok, lets spend a little more time figuring this out. So I query the CommandLog table where index maintenance history is stored. Sure enough, we had a table hit the threshold to kick in an index reorganize. This operation uses the log file to complete. But that does not explain to me why the transaction log backup job is not freeing the log records to be reused.
For a workaround, let me reorganize this table after the full backup finishes. This means I should not get called Monday morning because the table will be below the threshold for reorganize and the log file should not fill up.
Issue
Ok, so now it is Monday morning. I am awake and actually ready to use my mind to figure this issue out.
So we have a database that is being backed up by either a full or differential. The log records are getting used in the transaction log and not being marked inactive after the transaction log backup finishes.
Let us do some research to see if I am missing something pretty obvious.
I ran across this TechNet Magazine article from 2009 by Paul Randal (b |t) that covered the topic and a similar blog on the SQLskills site. He is the foremost expert in the topic of transaction log file in my opinion.
So after reading the article, I have a little more understanding on what was happening in this scenario.
When a full or differential backup is started there is a seven step timeline of events that happen during the full/differential backup and any transaction logs that occur.
Normally I’d size my log files to the size of the largest index in the database. That way you can do a full rebuild/reorg without filling it. If the space isn’t used 80% of the time that’s fine, the other 20% I’m probably not around, or don’t want to be disturbed 🙂
I also ensure that the log file is on a seperate filesystem, with some headroom for emergency growth. I’ve come across logs so full you can’t so anything. Having that extra 10-20gb can be a life save if your SAN team aren’t around at 2 in the morning.
Great info Stuart, thanks for the suggestion.