Consultor Eletrônico

SQL Optimization for Performance in v7 and v8, Index

Progress/SQL Optimization
A Historical Perspective

The purpose of this knowledgebase entry is to provide some
understanding of the SQL optimizer as a function of
releases, especially in V7 and V8. The SQL optimizer is
used by the Progress 4GL as well as by Report Builder.
There are essentially three different SQL optimizers in
existence in the V7 and V8 releases - each characterized by
different limitations and performance characteristics. Each
are described briefly below.

I. Pre-V7, 7.3a through 7.3c01

In this code base, optimization worked as follows.

Each AND field in the WHERE clause is scanned to identify
which tables it references. Then the tables are reordered,
placing those that are referenced in AND fields that
reference the fewest tables first.
Using the example query:

SELECT ..
FROM cust, order
WHERE cust.cust-num = order.cust-num AND order-num < 10.

The FROM clause will be rearranged so that the order table
comes first. After this is done, the AND fields in the
WHERE clause are moved to be attached to the highest place
in the FROM clause possible, that is, immediately after the
last table referenced in the WHERE clause. In the example,
the branch for ORDER-NUM < 10 would be moved up to be
attached to the ORDER table. This is why we put tables
referenced in AND fields that reference fewer tables first.
Note:

A fundamental restriction in the algorithm's implementation
in these code bases is that it could handle up to about 28
ANDed fields in total. This meant that at some point, the
optimizer would "give up"and not optimize the query at all.
As a result, the join and all record selection would be
process "manually", by the client. To give you some indication of
the magnitude of the problem of processing a query manually,
consider the following query:

SELECT customer.cust-num
FROM customer,order,order-line,item
WHERE order.cust-num = customer.cust-num
AND order-line.order-num = order.order-num
AND item.item-num = order-line.item-num
AND (order-line.qty = 1
AND order-line.price = 100
AND (customer.postal-code BEGINS "75"
OR (customer.postal-code >= "91" AND
customer.postal-code <= "95"
)
)
AND customer.contact = "Sabine" AND
(item.item-name <> "003" AND
item.item-name <> "009" AND
item.item-name <> "019" AND
item.item-name <> "060" AND
item.item-name <> "069" AND
item.item-name <> "007" AND
item.item-name <> "008" AND
item.item-name <> "068" AND
item.item-name <> "018" AND
item.item-name <> "018" AND
item.item-name <> "058" AND
item.item-name <> "318" AND
item.item-name <> "317" AND
item.item-name <> "320"
)
).

This query core dumps in 7.2e and 7.3a and "hangs" in
7.3c01.
Indeed, in 7.3c01, it doesn't hang, it takes a long time.
The reason is that the optimizer has exceeded it's algorithm
and gave up - it resorts to a sequential read of each table
in an attempt to resolve the query.

In the above example, this means that it reads an item
record, then an order-line record, then an order record,
then a customer record, and then applies the WHERE criteria
to the result. If that doesn't succeed, it reads the second
item record, the first order-line record again, the first
order record again and the first customer record again, and
so on. With the following number of records in the
sports database:

customer 83
order 207
order-line 873
item 55

this will take a grand totoal of 824,945,715 record reads!
Now, assuming 200 record reads per second (a reasonable
assumption), this works out to be 4,124,728 seconds, or
68,745 minutes, or 1145 hours, or 47.74 days.

Thus, it doesn't really hang, it just takes an awful long
time!
It's IMPORTANT to point out that the same query above can
be rewritten using different SQL constructs (the IN-LIST
clause) to reduce the number of entries in the WHERE clause
(under the limit mentioned above) and thereby work with the
old optimizer:

SELECT customer.cust-num
FROM customer, order, order-line, item
WHERE order.cust-num = customer.cust-num
AND order-line.order-num = order.order-num
AND item.item-num = order-line.item-num
AND (order-line.qty = 1
AND order-line.price = 100
AND (customer.postal-code BEGINS "75"
OR (customer.postal-code >= "91" AND
customer.postal-code <= "95"
)
)
AND customer.contact = "Sabine" AND
item.item-name NOT IN ( "003", "009","019", "060",
"069", "007", "008", "068", "018", "018",
"058", "318", "317", "320")
).

Same query, near instantaneous results...

II. Version 8

As a result of problems such as the above,
the SQL optimizer algorithm was completely rewritten for
version 8 -it now behaves as follows:

First, if it's a join, it tries to re-order the FROM list.
This is done by going through all the ANDed expressions in
the WHERE clause and picking ones with the form:
<from-column operator "external">. Something is "external"
if it refers to elements above or outside the current tables
in the join. E.G. customer.cust-num = 5. (The 5
is external; constant is another way to consider it). If
there are two or more such ANDed items, we pick the best --
the equals operator("=") is best. If there are two ANDed
items with the equals operator, the one with unique index
is best. Then range operators, such as greater than ("<"),
or less than (">"), IN etc. are chosen.
Once the best ANDed item is chosen, its table is moved to
the "top". At that point, all references to that table in
the remaining AND-items are marked "external" and we iterate
the same logic on the remaining tables.

Here is an example for the re-ordering of the FROM clause:

SELECT .. FROM item,customer,order,order-line WHERE
order.order-num = order-line.order-num AND
item.item-num = order-line.item-num AND
customer.cust-num = order.cust-num AND
customer.cust-num = 5.

this will become:

SELECT .. FROM customer,order, order-line, item ...

because "customer.cust-num = 5" is marked as external (the
field is marked as a global reference) for the customer
file, so customer becomes first.

then "customer.cust-num = order.cust-num" gets changed to
"external" = order.cust-num so order comes second...

and so on.

Next, all ANDed expressions in the WHERE clause are
examined, and associated each one with the highest position
possible among the tables in the join. For example:

SELECT customer.cust-num, order.order-num
FROM customer,order
WHERE customer.cust-num = order.cust-num AND
customer.cust-num > 10.

The above gets translated to the QUERY:

OPEN QUERY _SQL-qry1 FOR EACH customer
WHERE customer.cust-num > 10,
EACH order WHERE customer.cust-num =
order.cust-num.

Then the PROGRESS optimization takes over -- each WHERE
clause is inspected for index references and the best index
is chosen according to documented rules (eg. equality of
unique index fields is best, then equality of non-unique
index fields, then range specifications, then sort order,
etc.). The QUERY is assumed to be distributed in PROGRESS,
so if there is no PASS-THRU, on execution, the first record
from the first level is retrieved (with the server having
done bracketing and selection using the first level's WHERE
expression). Then the join fields are extracted by the
client, and at that point the first record from the second
level is retrieved (with a potentially different server
having done bracketing and selection using the second
level's WHERE expression). Only when all second level+
records have been retrieved, the second record from the
first level is retrieved, and so on.

III. Version 7.3c02 through 7.3d

Some of the V8 optimizer was retrofitted into this
code base. The issue is that the work which was occurring
in V8 was not complete, so the optimizer which was
retrofitted was only partially complete.

The positive effect is that there's no real restriction on
the number of AND conditions (the large query in section I
does execute to completion), the downside is that the
overall performance will not be the same as for queries in
7.3a - that is, customers may experience anoverall
performance degradation between 7.3a and 7.3c02
(through 7.3d).
In this instance where we have a "partial optimizer",
people who are badly affected may re-order the FROM clause
themselves. If the SQL statement writer re-orders according
to the algorithms described under version 8 above, they will
gain the benefits of the V8 optimizer.

IV. Other Performance Problems

Another performance slowdown was detected by the Report
Builder in the SQL implementation, this involved the
processing of fields with extents, so-called vector
references.
For example, given a vector reference such as
month-sales[5], the compiler will "explore" all syntax
permutations of an <expression> in order to determine that
"5" is a simple digit literal. This led to excessive
compilation times (compare to the above sections which
were concerned with execution times). This has been fixed.
This fix is only for vector references and will only show
marked improvement if there are a large number of such
references in the same query. If there's only one or a few
vector references, you'll probably not notice the difference.

Progress Software Technical Support Note # 15624