Wednesday, August 26, 2009

Approach to Estimate SQL Query RunTime



Disclaimer:

This blog is not from an expert who claims to know everything, but is from a curious developer who wants to find a way to know more…

The abstract, written here may sound really weird to experts, but at the same time they may feel good about the curiosity I have for the subject.

The content of this blog is just a theoretical/conceptual description of the solution, which uses the concepts of statistics and mathematics.

Please feel free to add your comments/criticism, so that we can either discard this whole theory with some other suggestion from the readers or move forward in order to implement this theory in practical.

Thanks in advance for your comments/criticism, if you bother to write one….


Problem:

To estimate the time, which a query may take when we execute it over a huge database.

Concepts:

Divide and Rule, Sampling, Basic Mathematics


Approach/Theory:

Assumptions: We have updated estimate of the size of sample space.


Lets, say we have a finite sample space (S1)

And there be an experiment (E) that we want to carry out on this space.

(S2) be a sample space and has been derived (randomly or intentionally, based on our requirement) from (S1) in such a way that (S2) divides (S1) completely.

i.e. There be a finite number (N) of (S2) sample spaces present in (S1) and completely constituting (S1).

We can denote this relationship as

(N)*(S2) ~ (S1)

Now, lets say that (T2) be the time required by the experiment (E) in order to get executed completely over (S2).

(This, we will calculate by executing the experiment (E) on (S2), as (S2) is very small so response time should be very small too.)

Now based on relationship (N)*(S2) ~ (S1)


We may say that if (T1) were the total time which experiment (E) would have taken, if it would have been executed completely over sample space S1 then


(N)*(T2) ~ (T1)


And if we have a constant (C)(behavior of which should be defined latter), then we can say that


((N)*(T2)) – (C) = (T1)

Proposed/Hypothetical Practical Use:

Think the whole table/tables on which we want to run the query as sample space (S1)

Constitute (S2) based on it

(More explanation on how to derive should be properly discussed/ would be given latter, as it is only the first draft of the abstract of the concept)

Run the query on (S2) and get the Time (T2)

Since (N) will be known so we can estimate the time using the formula above.

Constant (C), need more research to decide and is expected to vary for a particular type of DB etc.


Challenges:

The proper consideration of database behavior has to be concerned

Need to research a lot before going for practical implementation (Which in turn will require proper Time and tools)

The theory is very simple and not at all takes into consideration of a lot of related factors, which may come into picture while implementation.

Active participation from experts/readers who really want to solve this problem, which, at most of the sites, has been declared as not possible to break.

List is long as it is just the first draft…let see, how is the response from you guys!