SQL is a fairly complicated language with a steep learning curve. For a large number of people who use SQL, learning to apply it efficiently takes lots of trial and error. Here are some tips on how you can make your SELECT queries better. The majority of these tips are applicable to any relational database management system, but the terminology and exact naming will be taken from PostgreSQL.
A considerable amount of time can be spent optimizing an SQL query, but it may not be needed: for example, if the predicted amount of data is not that high. So, before diving into the optimization routine, please make sure that the queried data will really come in large amounts. Also, make sure that the business domain even allows a large amount of data to be created. For example, in your “Online Shop” data model there may be a “Delivery” entity with a foreign key to an “Order” entity. Despite having a foreign key, it is highly unlikely that there will be more than two or three deliveries for one order.
Also, don’t forget about the software engineering rule of thumb - “profile before you optimize”. PostgreSQL comes with a native profiling tool - the EXPLAIN statement. It can be used both in the basic psql shell and in more advanced environments.
When you add the EXPLAIN statement to a query, by default it shows how the RDBMS will interpret and execute the query, outputting the step-by-step actions and their cost in abstract units. By monitoring where the cost increases, it is possible to detect a bottleneck in the query.
It also has several useful parameters: all of them are described in the documentation, but the most noteworthy are:
ANALYZE - this parameter not only builds the query execution plan, but also runs the actual query. This gives us several benefits, such as being able to see the exact timings.
VERBOSE - this parameter adds extra information to each step of the execution plan. This is very useful if you are working with a large query and having a hard time mapping an execution plan step to a specific part of the query. The VERBOSE parameter will help you figure this out.
I have highlighted only these two parameters, but you should also check the other ones. They may be extremely helpful for your use cases!
Let us consider a simple example to walk through the basic principles of SQL profiling. We will be using this schema:
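A minimal sketch of such a schema (the exact columns are an assumption made for this walkthrough: an integer value and a nullable updated_at timestamp):

```sql
-- Hypothetical schema for the examples below.
CREATE TABLE test_table (
    id         serial PRIMARY KEY,
    value      integer NOT NULL,  -- used as a weight in a later example
    updated_at timestamp          -- nullable on purpose: see the NULLS LAST example
);

-- A plain B-tree index, so that ordering by updated_at can be served from the index.
CREATE INDEX test_table_updated_at_idx ON test_table (updated_at);
```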
And we will generate data as follows:
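For example, with generate_series (the value range and the share of NULLs are arbitrary choices that set up the examples below):

```sql
-- One million rows with random values; roughly 10% of them get a NULL updated_at.
INSERT INTO test_table (value, updated_at)
SELECT
    (random() * 99 + 1)::integer,
    CASE WHEN random() < 0.9
         THEN now() - random() * interval '365 days'
         ELSE NULL
    END
FROM generate_series(1, 1000000);
```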
So now we have one million records in our table, which will allow us to demonstrate some optimization use cases.
First, let us look at the sample output of the profiling commands. The simplest query with EXPLAIN will look like this:
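For instance (the plan and its numbers are illustrative - they depend on your data and settings):

```sql
EXPLAIN SELECT * FROM test_table WHERE value > 50;

--                            QUERY PLAN
-- ------------------------------------------------------------------
--  Seq Scan on test_table  (cost=0.00..17906.00 rows=500000 width=16)
--    Filter: (value > 50)
```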
And with EXPLAIN ANALYZE:
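With ANALYZE the query is actually executed, so the plan is annotated with real row counts and timings (again, illustrative numbers):

```sql
EXPLAIN ANALYZE SELECT * FROM test_table WHERE value > 50;

--  Seq Scan on test_table  (cost=0.00..17906.00 rows=500000 width=16)
--                          (actual time=0.015..70.311 rows=499512 loops=1)
--    Filter: (value > 50)
--    Rows Removed by Filter: 500488
--  Planning Time: 0.090 ms
--  Execution Time: 85.623 ms
```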
And now let us walk through a real profiling case. Suppose we have to order the records by the updated_at field, and we know that there will be lots of records.
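A sketch of such a query, assuming we want the newest records first and fetch them page by page - the LIMIT is what allows the planner to walk the index instead of sorting the whole table:

```sql
EXPLAIN ANALYZE
SELECT * FROM test_table
ORDER BY updated_at DESC
LIMIT 100;

--  Limit  (cost=0.42..8.61 rows=100 width=16)
--         (actual time=0.040..2.512 rows=100 loops=1)
--    ->  Index Scan Backward using test_table_updated_at_idx on test_table
--  Execution Time: 2.901 ms
```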
Please note the “Index Scan” in the query plan. This means that the RDBMS engine has figured out that the query can be executed faster by using the index. But suppose we have records with NULL values in the updated_at column, and we want them at the very end of the selection:
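A minimal sketch: NULLS LAST does not match the ordering of the default index, so the planner falls back to scanning and sorting the entire table (numbers are illustrative):

```sql
EXPLAIN ANALYZE
SELECT * FROM test_table
ORDER BY updated_at DESC NULLS LAST
LIMIT 100;

--  Limit  (cost=38591.84..38592.09 rows=100 width=16)
--         (actual time=952.341..952.360 rows=100 loops=1)
--    ->  Sort  (Sort Method: top-N heapsort  Memory: 32kB)
--          ->  Seq Scan on test_table
--  Execution Time: 952.812 ms
```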
Here you can see that the execution time increased more than 300-fold, and no index was scanned. This may not seem like a problem, since the data will still be available in about a second, but suppose we had a larger table with several JOINs and more records - this can get out of control really fast. The simplest solution is to add an index that handles this case:
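For example (the index name is arbitrary):

```sql
-- An index whose direction and NULL ordering match the query exactly.
CREATE INDEX test_table_updated_at_desc_idx
    ON test_table (updated_at DESC NULLS LAST);
```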
And now the query execution plan:
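Roughly (illustrative numbers once more):

```sql
EXPLAIN ANALYZE
SELECT * FROM test_table
ORDER BY updated_at DESC NULLS LAST
LIMIT 100;

--  Limit  (cost=0.42..8.61 rows=100 width=16)
--         (actual time=0.031..0.098 rows=100 loops=1)
--    ->  Index Scan using test_table_updated_at_desc_idx on test_table
--  Execution Time: 0.141 ms
```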
This is much better!
Indices are commonly used for optimizing JOIN, WHERE and ORDER BY clauses. Some ORMs, such as the Django ORM, even automatically add indices to all foreign keys. Properly planned and well-placed indices dramatically increase performance. However, if you put many indices on a table, modifications become slower - the engine has to update all of the indices when updating, inserting or deleting records. So, the best approach is to find out how the data will be used most often, and apply indices accordingly.
Optimizing SELECTs involving multiple tables
Usually, business data cannot fit into a single table, so let us consider an example with a second table that has a foreign key to test_table.
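A sketch of the second table (the names, columns and row counts are assumptions):

```sql
CREATE TABLE related_nested_table (
    id            serial PRIMARY KEY,
    test_table_id integer NOT NULL REFERENCES test_table (id),
    amount        integer NOT NULL
);

-- The index on the foreign key column.
CREATE INDEX related_nested_table_test_table_id_idx
    ON related_nested_table (test_table_id);

-- A few related rows per test_table record, on average.
INSERT INTO related_nested_table (test_table_id, amount)
SELECT (random() * 999999 + 1)::integer, (random() * 100)::integer
FROM generate_series(1, 3000000);
```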
Please note the index! Without it, any SELECT involving a JOIN between the two tables will be extremely slow on large data samples. However, as mentioned above, the majority of ORMs will generate this index for you.
And now let us consider a simple use case: we want a list of test_table records, each with a mean value of the related related_nested_table.amount entries. The formula for the mean value is SUM(related_nested_table.amount * test_table.value) / SUM(test_table.value). The straightforward way to implement this is:
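A sketch of that straightforward version - note that the subquery aggregates the whole table, and the match on test_table_id happens only afterwards, in the join condition:

```sql
SELECT test_table.*, related.mean_amount
FROM test_table
LEFT JOIN LATERAL (
    -- Groups the ENTIRE related table; only one group survives the join below.
    SELECT
        related_nested_table.test_table_id,
        SUM(related_nested_table.amount * test_table.value)::numeric
            / SUM(test_table.value) AS mean_amount
    FROM related_nested_table
    GROUP BY related_nested_table.test_table_id
) AS related ON related.test_table_id = test_table.id;
```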
This query needs to use a LATERAL JOIN, since we need access to the test_table fields within the join. However, this means that related_nested_table is selected as many times as there are entries in test_table. Therefore, for every entry of test_table (and we have about a million entries), we select the entire related_nested_table, group it by test_table_id (so we have a million such groups) and join only one of them. We can optimize this by adding a filter inside the LATERAL JOIN subquery:
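Something like this - the WHERE clause lets the engine fetch only the relevant rows through the index on test_table_id, and the GROUP BY is no longer needed:

```sql
SELECT test_table.*, related.mean_amount
FROM test_table
LEFT JOIN LATERAL (
    SELECT
        SUM(related_nested_table.amount * test_table.value)::numeric
            / SUM(test_table.value) AS mean_amount
    FROM related_nested_table
    -- Only the rows belonging to the current test_table entry are read.
    WHERE related_nested_table.test_table_id = test_table.id
) AS related ON TRUE;
```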
The estimated cost is 10,000 times lower now.
SQL is difficult to learn and even more difficult to master, but with these tips you are closer to designing an efficient DB schema and writing efficient queries. Using EXPLAIN will be of much help, because it presents a query as the set of basic operations that the database engine performs. This helps you understand how the data is actually extracted and arranged, therefore allowing you to query it in a more efficient manner.