04 March 2011

OMG What Do I Do? - "The database is slow!" - Part II

Okay, when we last left our heroine, she was bravely dodging the arrows of slow databasity.  Well, here's the thing.  When someone says their database is slow, what does this mean?  I don't mean what's causing it, I mean what is the client really trying to say?  I have complained of what I have called incorrectly a "slow database".  My last job, and in some ways my part-time current job, is Senior Administrator of our enterprise scheduling system.  This system is Applications Manager by UC4, (formerly AppWorx) and I have gone through a couple different application upgrades and two database upgrades.  And I have complained about a slow "database".  My personal experience when *I* was a client went as follows:

Me:  "Oh my god! My database!  It's just... creeping!" (note: complete tone of desperation in my voice. also note: this scheduling system, of which I am/was the Senior Administrator, runs ALL of our billing batch jobs, runs ALL of our payroll, including the top management, runs ALL of our financials, and well, pretty much runs everything in our company.  Normally, I block this all out of my day to day thoughts which helps me cope with all the awesome and scary responsibility.)

Sr. DBA DE:  "Hmm.  When did this start? (As he types on something completely unrelated in the background.)

Me: "Um, when? Um, it just FLIPPED, it was running okay and now everything is going really really slow, like 10 times as slow as it normally does!!"

Sr. DBA DE: "Is the database slow or is your application slow?"

(kim comment - See, this is what DBAs do.  Blame the application.  Typical comment:  It's not the database.  It's the application.  The database is perfectly healthy! All the data is there, the listener is up and running, the archive logs are shuffling off as planned, there is no heavy I/O or cpu, and nothing is crashed. So therefore the database is FINE, except of course the frickin' phone is ringing off the hook because everything that actually touches the database, YES, application wise, is going 100x slower than normal.)

Me:  Trying not to have the panic show in my voice, I say: "Okay, you're right, it's the application, the database is fine.  Can't you put some indexes in or something? Or run statistics or something?"

Although this sounded perfectly reasonable to me, my Sr. DBA snorts.

Me: "Okay, here's the situation, I can make it run at normal speed.  I can stop the application stack, truncate one table, delete all the data out of another table, start it again and then it runs at normal speed.  UNTIL the truncated table starts filling up the table again.  So it's recreatable!"

Sr. DBA DE:  Swivels around his chair, ears all perked up and suddenly shows interest.  "Really?  Well, then we can probably do something that could fix this... have you ever heard of OEM?"

Me: "No. What does it do?"

Sr. DBA DE: "Magical stuff.  Basically, if we can find that sql statement that is causing us trouble, we can lock it into the good twin behavior when it is playing nice and keep it running fast.  Maybe.  No promises.  But maybe, if you can recreate it.  But really, no promises."

(kim comment - there is an art to giving a client hope but not actually promising anything. A skill I need to work on.)

Me: "Okay, let's do it!  I can totally recreate it!"

I use words like totally a lot because I grew up on the East Coast on the beach.  It's sort of mid 80's Valley Talk.  Some people find it endearing.

So then I found out about OEM and magical things like SQL Advisor, which really is magical.  Of course, it took some time to get access hooked up and in the meantime I did a lot of truncating and bouncing.  But I watched my now Hero Sr DBA DE trap the offending sql statement, tune it by locking it in as its good twin so NO MATTER WHAT, it ALWAYS runs this way!  And just like that my application loved my database again, and they danced together as fast as they were supposed to.

Now wouldn't it be great if I could do that with this database that my clients were meeping about?  Well, a couple of things.  We don't have everything in OEM yet, and of course this one particular database is not in OEM.  So that tool is ruled right out.  It will all be manual and eyeballing.  But here are some magical questions that I followed up with.  I think, that in any troubleshooting scenarios there are a couple of basic fundamentals that will work when trying to figure out what's going on.

  • Journalistic Questions.
  • Sherlock Holmes Most Famous Quotation.
Remember journalism in middle school?  There were basic questions that were taught in that class that have helped me troubleshoot practically everything.

Who?
Who is this affecting? Is it affecting everyone? Are you sure? Have you tried it with different permissions?

What?
What is it affecting? Is it doing the same thing on every query? What is it doing exactly? I need times, dates, milliseconds. What is your definition of "it's taking forever"? I mean, forever is a long time.

Where?
Is it doing this everywhere or only on one desktop or one floor?

When?
When did it start happening exactly? Does it happen all the time? Is it recreatable?

Why?
You know, the client is the closest thing to their application other than the database.  Ask them what their opinion is as to why it's happening.  They have feelings towards their app.  They may have seen it before. Clients have actually pretty good intuition.  And besides which, it makes them feel valued.

How?
How did you discover it? How does it make you feel?  Okay the last question is kinda touchy feely, but I think empathy is something all database administrators should cultivate. I mean, you can go to the best doctor in the world, but there is something lacking if, when you are really sick, he doesn't hold your hand and murmurs, "There, there...we will try and make it better..."

And then Sherlock Holmes Most Famous Quotation.

When you have eliminated the impossible, whatever remains, however improbable, must be the truth.

So using these basic tools, I found out the following:

The whole application is fine, except for two update statements that occur every day at about 1:30 p.m.  These are the only things that are slow. I also learned that this didn't start happening last week, this started happening a couple of weeks ago. Interesting!  You know why?  We have something called Veritas Clustering for high availability databases (databases that have to be up all the time).  To analogize, it's something like having a duplex where there is apartment A on the left side and apartment B on the right side. Usually, each database has their own duplex, own fridge, own bathroom and their individual space and privacy.  Which is important to databases. They like being by themselves as much as possible.  But sometimes, something happens to one side of the duplex that forces that database to temporarily move over to the same side that the other database is living in.  Not ideal, but good in a pinch.  But just like any crunched living situation, it can get a little crowded.

Guess what happened a couple of weeks right when the database started getting slow?  Yup, one database moved in with another and never moved back.  Could this be the problem?  Maybe. No-one likes sharing the bathroom.  Even if it doesn't fix the slowness issue, we can't really start analyzing the situation in great depth until we get the databases back to their respective homes.  So that's happening on Monday.  So everybody, cross your fingers for me!!

In the meantime give good thoughts to all the DBAs in Egypt, Tunisia, and Libya.  They exist and are going through a lot right now, good and bad.  My thoughts and well wishes to them all.

No comments:

Post a Comment