Consultor Eletrônico

Status: Verified

GOAL:

SQL Optimization for performance in SQL-89

FACT(s) (Environment):

Progress 7.x
Progress 8.x

FIX:

The SQL optimizer is used by the Progress 4GL as well as by Report Builder.
There are essentially three different SQL optimizers in existence in the V7 and
V8 releases - each characterized by different limitations and performance
characteristics. Each are described briefly below.

I. Pre-V7, 7.3a through 7.3c01

In this code base, optimization worked as follows.

Each AND field in the WHERE clause is scanned to identify which tables it
references. Then the tables are reordered, placing those that are referenced in
AND fields that reference the fewest tables first.
Using the example query:

SELECT ..
FROM cust, order
WHERE cust.cust-num = order.cust-num AND order-num < 10.

The FROM clause will be rearranged so that the order table comes first.
After this is done, the AND fields in the WHERE clause are moved to be attached
to the highest place in the FROM clause possible, that is, immediately after the
last table referenced in the WHERE clause.
In the example, the branch for ORDER-NUM < 10 would be moved up to be attached
to the ORDER table. This is why we put tables referenced in AND fields that
reference fewer tables first.

A fundamental restriction in the algorithm's implementation in these code bases
is that it could handle up to about 28 ANDed fields in total. This meant that at
some point, the optimizer would "give up"and not optimize the query at all.
As a result, the join and all record selection would be process "manually", by
the client. To give you some indication of the magnitude of the problem of
processing a query manually, consider the following query:

SELECT customer.cust-num
FROM customer,order,order-line,item
WHERE order.cust-num = customer.cust-num
AND order-line.order-num = order.order-num
AND item.item-num = order-line.item-num
AND (order-line.qty = 1
AND order-line.price = 100
AND (customer.postal-code BEGINS "75"
OR (customer.postal-code >= "91" AND
customer.postal-code <= "95"
)
)
AND customer.contact = "Sabine" AND
(item.item-name <> "003" AND
item.item-name <> "009" AND
item.item-name <> "019" AND
item.item-name <> "060" AND
item.item-name <> "069" AND
item.item-name <> "007" AND
item.item-name <> "008" AND
item.item-name <> "068" AND
item.item-name <> "018" AND
item.item-name <> "018" AND
&nb.sp; item.item-name <> "058" AND
item.item-name <> "318" AND
item.item-name <> "317" AND
item.item-name <> "320"
)
).

This query core dumps in 7.2e and 7.3a and "hangs" in 7.3c01.
Indeed, in 7.3c01, it doesn't hang, it takes a long time.
The reason is that the optimizer has exceeded it's algorithm and gave up - it
resorts to a sequential read of each table in an attempt to resolve the query.

In the above example, this means that it reads an item record, then an order-line
record, then an order record, then a customer record, and then applies the WHERE
criteria to the result. If that doesn't succeed, it reads the second item
record, the first order-line record again, the first order record again and the
first customer record again, and so on. With the following number of records in
the sports database:

customer 83
order 207
order-line 873
item 55

this will take a grand total of 824,945,715 record reads!
Now, assuming 200 record reads per second (a reasonable assumption), this works
out to be 4,124,728 seconds, or 68,745 minutes, or 1145 hours, or 47.74 days.

Thus, it doesn't really hang, it just takes an awful long time!
It's IMPORTANT to point out that the same query above can be rewritten using
different SQL constructs (the IN-LIST clause) to reduce the number of entries in
the WHERE clause (under the limit mentioned above) and thereby work with the
old optimizer:

SELECT customer.cust-num
FROM customer, order, order-line, item
WHERE order.cust-num = customer.cust-num
AND order-line.order-num = order.order-num
AND item.item-num = order-line.item-num
AND (order-line.qty = 1
AND order-line.price = 100
AND (customer.postal-code BEGINS "75"
OR (customer.postal-code >= "91" AND
customer.postal-code <= "95"
)
)
AND customer.contact = "Sabine" AND
item.item-name NOT IN ( "003", "009","019", "060",
"069", "007", "008", "068", "018", "018",
"058", "318", "317", "320")
).

Same query, near instantaneous results...

II. Versions 8, 9

As a result of problems such as the above, the SQL optimizer algorithm was
completely rewritten for version 8 -it now behaves as follows:

First, if it's a join, it tries to re-order the FROM list.
This is done by going through all the ANDed expressions in the WHERE clause and
picking ones with the form: <from-column operator "external">.
Something is "exter.nal" if it refers to elements above or outside the current
tables in the join. E.G. customer.cust-num = 5. (The 5 is external; constant is
another way to consider it).
If there are two or more such ANDed items, we pick the best -- the equals
operator("=") is best.
If there are two ANDed items with the equals operator, the one with unique index
is best. Then range operators, such as greater than ("<"), or less than
(">"), IN etc. are chosen.
Once the best ANDed item is chosen, its table is moved to the "top". At that
point, all references to that table in the remaining AND-items are marked
"external" and we iterate the same logic on the remaining tables.

Here is an example for the re-ordering of the FROM clause:

SELECT .. FROM item,customer,order,order-line WHERE
order.order-num = order-line.order-num AND
item.item-num = order-line.item-num AND
customer.cust-num = order.cust-num AND
customer.cust-num = 5.

this will become:

SELECT .. FROM customer,order, order-line, item ...

because "customer.cust-num = 5" is marked as external (the field is marked as a
global reference) for the customer file, so customer becomes first.

then "customer.cust-num = order.cust-num" gets changed to "external" =
order.cust-num so order comes second...

and so on.

Next, all ANDed expressions in the WHERE clause are examined, and associated each
one with the highest position possible among the tables in the join. For example:

SELECT customer.cust-num, order.order-num
FROM customer,order
WHERE customer.cust-num = order.cust-num AND
customer.cust-num > 10.

The above gets translated to the QUERY:

OPEN QUERY _SQL-qry1 FOR EACH customer
WHERE customer.cust-num > 10,
EACH order WHERE customer.cust-num =
order.cust-num.

Then the PROGRESS optimization takes over -- each WHERE clause is inspected for
index references and the best index is chosen according to documented rules (e.g..
equality of unique index fields is best, then equality of non-unique index
fields, then range specifications, then sort order, etc.).
The QUERY is assumed to be distributed in PROGRESS, so if there is no PASS-THRU,
on execution, the first record from the first level is retrieved (with the server
having done bracketing and selection using the first level's WHERE expression).
Then the join fields are extracted by the client, and at that point the first
record from the second level is retrieved (with a potentially different server
having done bracketing and selection using the second level's WHERE expression).
Only when all second level+records have been retrieved, the second record from
the first level is retrieved, and so on.

III. Version 7.3c02 through 7.3d

Some of the V8 optimizer was retrofitted into this code base. The issue is that
the work which was occurring in V8 was not complete, so the optimizer which was
retrofitted was only partially complete.

The positive effect is that there's no real restriction on the number of AND
conditions (the large query in section I does execute to completion), the
downside is that the overall performance. will not be the same as for queries in

7.3a - that is, customers may experience anoverall performance degradation
between 7.3a and 7.3c02 (through 7.3d).
In this instance where we have a "partial optimizer", people who are badly
affected may re-order the FROM clause themselves. If the SQL statement writer
re-orders according to the algorithms described under version 8 above, they will
gain the benefits of the V8 optimizer.

IV. Other Performance Problems

Another performance slowdown was detected by the Report Builder in the SQL
implementation, this involved the processing of fields with extents, so-called
vector references.
For example, given a vector reference such as month-sales[5], the compiler will
"explore" all syntax permutations of an <expression> in order to determine
that "5" is a simple digit literal. This led to excessive compilation times
(compare to the above sections which were concerned with execution times). This
has been fixed.
This fix is only for vector references and will only show marked improvement if
there are a large number of such references in the same query. If there's only
one or a few vector references, you'll probably not notice the difference.

.