any database admins? [Archive] - Glock Talk



Nyper
05-11-2007, 07:14
I have a rather large database that I'm loading up in SQL Server (2005).

It's approximately 300 million rows.

We are going to be running a lot of queries on this database. Nothing but counts & selects. No updates or deletes.

The problem I have is that some of the fields have multiple values in the field.

For instance, one field may be your GlockTalk code that looks something like DDDDDDDDGAAC.

In this case, DDDDDDDD might be the date you registered. G would be your Gender. AA would be your Age. C would be a true/false flag of whether you are a contributor. So, an actual GlockTalk code might look like 20011215M33F... meaning you registered 12/15/2001, are Male, 33 years old, and not a contributor.

Many of our queries are going to be things like select * where GlockTalkCode LIKE '%M%'. So, if there is an M anywhere in the GlockTalkCode, that record gets returned. We can't use a substring to look for the M in the 9th position because the registration date may not be present, the M may be in the first position, etc. If substring would provide a performance increase, I could write a program to read all of the data in and format it properly, so instead of a blank date it would be like 00000000M33F and gender would always be in the 9th position. Obviously this would be a lot of work and processing, and it isn't my first choice.

Does anyone have suggestions for ways to speed up counts/selects for fields like this? I've indexed those fields, and performance isn't horrible when selecting on a single field, but when you start doing the same thing on multiple fields in a single query, performance really dies.
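
Just to make the comparison concrete, here are the two query shapes side by side (the table name Members is made up; our real schema differs):

-- what we run today: match anywhere in the code
SELECT COUNT(*) FROM Members WHERE GlockTalkCode LIKE '%M%';

-- what it would look like after padding every code to the fixed 12-character layout
SELECT COUNT(*) FROM Members WHERE SUBSTRING(GlockTalkCode, 9, 1) = 'M';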

Nyper
05-11-2007, 07:30
Edit: After running some very elementary performance tests, substring seems much faster than LIKE.

Nyper
05-11-2007, 08:28
Another update -

Does SQL Server 2005 support function-based indexes? I know Oracle does, but that doesn't do me any good.

In other words, I have a function called CustomFunction1 that returns a substring.

Can I build an index on the GlockTalkCode field using the CustomFunction1 function, which might return the 3rd character of that field? Yes, it's exactly like the SubString function... but I'm just using this for testing purposes at the moment.
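
From what I've read so far, the closest SQL Server equivalent seems to be an indexed computed column; this is roughly what I plan to test (Members is just a placeholder table name, and I haven't verified this against our data yet):

-- computed column exposing the character I care about
ALTER TABLE Members ADD GenderCode AS SUBSTRING(GlockTalkCode, 9, 1) PERSISTED;

-- which can then be indexed like any ordinary column
CREATE INDEX IX_Members_GenderCode ON Members (GenderCode);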

WhatYouWant
05-11-2007, 11:57
Is it possible for you to split the field into multiple fields? Then you could always do SEX = 'M' AND CONTRIBUTOR = 'F' and it would be much faster.
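
Something along these lines, for instance (table and column names are only guesses at your schema):

-- one column per attribute, plus a composite index on the columns you filter on most
CREATE INDEX IX_Members_Sex_Contributor ON Members (Sex, Contributor);

SELECT COUNT(*)
FROM Members
WHERE Sex = 'M'
  AND Contributor = 'F';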

stooxie
05-13-2007, 13:00
You need to be careful with operators like LIKE (and with wrapping columns in scalar functions). A database is most efficient when it can use an index to find data and least efficient when it has to do a full table scan. When you use a leading-wildcard LIKE '%...%', you force the db to evaluate that pattern against the field in every row (unless you otherwise limit the rows), and that takes forever.
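
Roughly speaking (Members is a made-up table name), the leading wildcard is the killer:

-- leading wildcard: an index on GlockTalkCode can't be seeked, so every row gets checked
SELECT COUNT(*) FROM Members WHERE GlockTalkCode LIKE '%M%';

-- a prefix pattern, by contrast, can seek on that same index
SELECT COUNT(*) FROM Members WHERE GlockTalkCode LIKE '2001%';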

Your best bet, as WhatYouWant suggested, is to split the field out and expose the data. Why make the database do all that work? An efficient, scalable database is one where smart queries result in as little work as possible to determine the result set.

Joins, inner joins, outer joins, SQL divisions, scalars, etc, etc. All of it is great exercise for your db. Just go get yourself a plate of chicken wings while you wait for the query to finish.

-Stooxie

Deanster
05-13-2007, 16:46
Rule #1 for databases - each field represents one and only one thing...

Having a mixed field like that makes lookups MUCH slower than they need to be, makes any kind of global changes very dicey, makes doing mathematical operations impossible without doing an extract first, etc...

Splitting the fields into their appropriate component parts is the correct solution. However, if this is a pre-existing DB, I'm sure there's all kinds of code, queries, etc. that depend on the current formats, which makes changes a pain.

You may be stuck with a truly terrible fundamental design, and all you can do is work around it...

As Stooxie noted, "LIKE" is a nightmare for your current usage - the DB has to search every character of 300M rows for 'M', and since it's a semi-random string of numbers (at least as you've described it) you get minimal advantage from indexing. You're essentially doing a full search every time.

I'd say you're MUCH better off doing the extract you described and breaking everything out once. Even with crappy, unoptimized code for doing the extraction, you're only doing a single run over the whole database; you'll probably make that time up after a half-dozen searches on the resulting properly organized, fully indexed database.
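
If the codes have already been padded out to the fixed 12-character layout, the one-time extraction could be as simple as something like this (the new column names are invented for the sake of the example, and they'd need to be added to the table first):

UPDATE Members
SET RegDate     = NULLIF(SUBSTRING(GlockTalkCode, 1, 8), '00000000'),
    Sex         = SUBSTRING(GlockTalkCode, 9, 1),
    Age         = SUBSTRING(GlockTalkCode, 10, 2),
    Contributor = SUBSTRING(GlockTalkCode, 12, 1);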

stooxie
05-13-2007, 18:15
Does SQL Server do triggers (I'm more familiar with Oracle)? If so, set a trigger for that field on the insert. When a new row comes in, trigger a stored procedure that splits out the info into different fields. Assuming that you aren't inserting that much, that could work nicely.
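
If it does, here's a rough sketch of what that might look like, assuming the split-out columns already exist and the table has a MemberId key (both assumptions on my part):

CREATE TRIGGER trg_Members_SplitCode
ON Members
AFTER INSERT
AS
BEGIN
    -- split the code for just the newly inserted rows
    UPDATE m
    SET Sex         = SUBSTRING(m.GlockTalkCode, 9, 1),
        Age         = SUBSTRING(m.GlockTalkCode, 10, 2),
        Contributor = SUBSTRING(m.GlockTalkCode, 12, 1)
    FROM Members m
    JOIN inserted i ON i.MemberId = m.MemberId;
END;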

-Stooxie