Apr 19, 2022

Improve efficiency of your SELECT queries

SQL is a fairly complicated language with a steep learning curve. For many people who use SQL, learning to apply it efficiently takes a lot of trial and error. Here are some tips on how you can make your SELECT queries better. Most of the tips should apply to any relational database management system, but the terminology and exact names are taken from PostgreSQL.

Before you optimize

A considerable amount of time can be spent optimizing an SQL query, and it may not even be needed: for example, when the expected amount of data is not that high. So, before diving into the optimization routine, make sure that the queried data will really come in large volumes, and that the business domain even allows that much data to be created. For example, in your “Online Shop“ data model there may be a “Delivery“ entity with a foreign key to an “Order“ entity. Despite the foreign key, it is highly unlikely that there will ever be more than two or three deliveries for one order.
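
To make that concrete, here is a minimal sketch of such a pair of tables (the table and column names are invented for this example and are not part of the schema used later): an order can formally have any number of deliveries, yet in practice the per-order row count stays tiny, so heavy query optimization here would be wasted effort.

-- Hypothetical "Online Shop" fragment: despite the foreign key,
-- each order will realistically have only a couple of deliveries.
CREATE TABLE shop_order (
 id serial PRIMARY KEY
);

CREATE TABLE shop_delivery (
 id        serial PRIMARY KEY,
 order_id  integer REFERENCES shop_order(id),
 status    text NOT NULL
);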

Also, don’t forget the software engineering rule of thumb - “profile before you optimize“. PostgreSQL comes with a native profiling tool - the EXPLAIN statement. It can be used both in the basic psql shell and in more advanced environments.

When the EXPLAIN statement is added to a query, by default it shows how the RDBMS will interpret and execute the query, outputting the step-by-step actions and their cost in abstract units. By watching where the cost grows, it is possible to detect a bottleneck in the query.

It also has several useful parameters: all of them can be seen in the documentation, but the most notable are:

ANALYZE - this parameter not only builds the query execution plan, but also runs the actual query. This gives us several benefits, such as being able to see the exact timings.

VERBOSE - this parameter adds extra information about each step of the execution plan. This is very useful if you are working with a large query and having a hard time mapping an execution plan step to the specific part of the query; the VERBOSE parameter will help you figure this out.

I have highlighted only these two parameters, but you should also check the other ones. They may be extremely helpful for your use cases!
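
As a quick sketch of how these parameters fit together: they can be combined in the parenthesized form of EXPLAIN. The query and the BUFFERS option below are just illustrative placeholders; test_table is the table defined in the next section.

-- Run the query for real, show per-step details and buffer usage.
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT name, value
FROM test_table
WHERE id < 100;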

Profiling basics

Let us consider a simple example to review the basic principles of SQL profiling. We will be using this schema:

CREATE TABLE test_table (
 id          serial PRIMARY KEY,
 name        text UNIQUE NOT NULL,
 value       numeric(10, 3) NOT NULL,
 updated_at  timestamp DEFAULT current_timestamp
);
CREATE INDEX updated_at__test_table__idx ON test_table (updated_at);

And we will generate data as follows:

INSERT INTO test_table (
   name, value
)
SELECT
   md5(random()::text),
   random()
FROM generate_series(1, 1000000) s(i);

So now we have one million records in our table, which will allow us to demonstrate some optimization use cases.

Indices and ordering

First, let us see the sample output of profiling commands. The simplest query with EXPLAIN will look like this:

EXPLAIN SELECT * FROM test_table;
                            QUERY PLAN                              
---------------------------------------------------------------------
Seq Scan on test_table  (cost=0.00..20310.00 rows=1000000 width=49)
(1 row)                        ^     ^
                   startup cost     total cost

And with EXPLAIN ANALYZE:

EXPLAIN ANALYZE SELECT * FROM test_table;
                                                     QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
Seq Scan on test_table  (cost=0.00..20310.00 rows=1000000 width=49) (actual time=0.055..298.395 rows=1000000 loops=1)
Planning Time: 0.076 ms
Execution Time: 360.520 ms
(3 rows)

And now let us walk through a more realistic profiling case. Suppose we have to order the records by the updated_at field and we know that there will be lots of records.

EXPLAIN ANALYZE SELECT * FROM test_table ORDER BY updated_at DESC;
                                                                            QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan Backward using updated_at__test_table__idx on test_table  (cost=0.42..35633.43 rows=1000000 width=49) (actual time=0.015..274.276 rows=1000000 loops=1)
Planning Time: 0.131 ms
Execution Time: 337.893 ms

Please note the “Index Scan“ in the query plan. This means that the RDBMS engine understood that the query can be executed faster using the index. But suppose we have records with NULL values in the updated_at column, and we want them at the very end of the selection:

EXPLAIN ANALYZE SELECT * FROM test_table ORDER BY updated_at DESC NULLS LAST;
                                                              QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
Gather Merge  (cost=68614.50..165843.59 rows=833334 width=49) (actual time=350.385..1024.717 rows=1000000 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Sort  (cost=67614.48..68656.15 rows=416667 width=49) (actual time=338.529..441.954 rows=333333 loops=3)
        Sort Key: updated_at DESC NULLS LAST
        Sort Method: external merge  Disk: 21120kB
        Worker 0:  Sort Method: external merge  Disk: 21904kB
        Worker 1:  Sort Method: external merge  Disk: 21808kB
        ->  Parallel Seq Scan on test_table  (cost=0.00..14476.67 rows=416667 width=49) (actual time=0.022..90.912 rows=333333 loops=3)
Planning Time: 0.137 ms
Execution Time: 1099.671 ms
(11 rows)

Here you can see that the execution time increased more than threefold, and no index was scanned. This may not seem like a problem, since the data is still available in about a second, but with a larger table, several JOINs and more records this can get out of control really fast. The simplest solution is to add an index that handles this case:

CREATE INDEX updated_at_nulls_last__test_table__idx ON test_table (updated_at DESC NULLS LAST);

And now the query execution plan:

EXPLAIN ANALYZE SELECT * FROM test_table ORDER BY updated_at DESC NULLS LAST;
                                                                             QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using updated_at_nulls_last__test_table__idx on test_table  (cost=0.42..36309.43 rows=1000000 width=49) (actual time=0.115..315.642 rows=1000000 loops=1)
Planning Time: 0.510 ms
Execution Time: 378.828 ms
(3 rows)

This is much better!

Indices are commonly used for optimizing JOIN, WHERE and ORDER BY clauses. Some ORMs, such as the Django ORM, even add indices to all foreign keys automatically. Properly planned and well-placed indices dramatically increase read performance. However, every index you put on a table makes modifications slower - the database has to update all of the indices when inserting, updating or deleting records. So the best approach is to find out how the data will be queried most often and apply indices accordingly, as sketched below.
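
As a hypothetical illustration of that last point (the status column below does not exist in our schema and is added purely for the example): if the most frequent query were “active records ordered by updated_at“, one composite index could cover both the WHERE and the ORDER BY clause, letting the planner read rows in the required order instead of sorting them.

-- Hypothetical workload: WHERE status = 'active' ORDER BY updated_at DESC.
ALTER TABLE test_table ADD COLUMN status text NOT NULL DEFAULT 'active';

-- One composite index serves both the filter and the ordering,
-- at the price of extra work on every INSERT, UPDATE and DELETE.
CREATE INDEX status_updated_at__test_table__idx
 ON test_table (status, updated_at DESC);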

Optimizing SELECTs involving multiple tables

Usually, business data cannot fit into a single table, so let us consider an example with a second table that has a foreign key to test_table.

CREATE TABLE related_nested_table (
 id serial PRIMARY KEY,
 test_table_id integer references test_table(id),
 amount numeric(10, 3) NOT NULL,
 created_at timestamp DEFAULT current_timestamp
);
CREATE INDEX test_table_id__related_nested_table__idx ON related_nested_table (test_table_id);
INSERT INTO related_nested_table (
 test_table_id, amount
)
SELECT
 id,
 random()
FROM test_table
INNER JOIN generate_series(1, 100) s(i)
 ON True
--feel free to remove this WHERE block, it is written to generate less data
WHERE test_table.id < 1000;

Please note the index! Without it, any SELECT involving a JOIN between the two tables will be extremely slow on large data samples. However, as mentioned above, the majority of ORMs will generate this index for you.

And now let us consider a simple use case: we want a list of test_table records, each with an aggregate computed over its related_nested_table entries: SUM(related_nested_table.amount * test_table.value) / SUM(related_nested_table.amount), aliased as amount_sum below. The straightforward way to implement this is:

EXPLAIN SELECT
 test_table.*,
 related_nested_table_sum.amount_sum
FROM test_table
LEFT OUTER JOIN LATERAL (
 SELECT
   related_nested_table.test_table_id,
   SUM(related_nested_table.amount * test_table.value) / SUM(related_nested_table.amount) AS amount_sum
 FROM related_nested_table
 GROUP BY related_nested_table.test_table_id
) related_nested_table_sum
 ON test_table.id = related_nested_table_sum.test_table_id;
 
                                             QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
Nested Loop Left Join  (cost=2162422.93..2289465367100.63 rows=1000000 width=81)
  ->  Seq Scan on test_table  (cost=0.00..20310.00 rows=1000000 width=49)
  ->  Subquery Scan on related_nested_table_sum  (cost=2162422.93..2289460.61 rows=486 width=36)
        Filter: (test_table.id = related_nested_table_sum.test_table_id)
        ->  GroupAggregate  (cost=2162422.93..2288246.53 rows=97127 width=36)
              Group Key: related_nested_table.test_table_id
              ->  Sort  (cost=2162422.93..2187247.70 rows=9929910 width=8)
                    Sort Key: related_nested_table.test_table_id
                    ->  Seq Scan on related_nested_table  (cost=0.00..736878.10 rows=9929910 width=8)
(9 rows)

This query needs to use a LATERAL JOIN, since we need access to a test_table field inside the joined subquery. However, as written, the related_nested_table is scanned as many times as there are entries in test_table. So for each of roughly a million test_table entries we read the whole related_nested_table, group it by test_table_id and then join only one of the resulting groups. We can optimize this by adding a filter inside the LATERAL JOIN subquery:

EXPLAIN SELECT
 test_table.*,
 related_nested_table_sum.amount_sum
FROM test_table
LEFT OUTER JOIN LATERAL (
 SELECT
   related_nested_table.test_table_id,
   SUM(related_nested_table.amount * test_table.value) / SUM(related_nested_table.amount) AS amount_sum
 FROM related_nested_table
 WHERE related_nested_table.test_table_id = test_table.id
 GROUP BY related_nested_table.test_table_id
) related_nested_table_sum
 ON test_table.id = related_nested_table_sum.test_table_id;
 
                                             QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join  (cost=0.43..399185767.60 rows=1000000 width=81)
  ->  Seq Scan on test_table  (cost=0.00..20310.00 rows=1000000 width=49)
  ->  Subquery Scan on related_nested_table_sum  (cost=0.43..399.16 rows=1 width=36)
        Filter: (test_table.id = related_nested_table_sum.test_table_id)
        ->  GroupAggregate  (cost=0.43..397.88 rows=102 width=36)
              Group Key: related_nested_table.test_table_id
              ->  Index Scan using test_table_id__related_nested_table__idx on related_nested_table  (cost=0.43..395.08 rows=102 width=8)
                    Index Cond: (test_table_id = test_table.id)
(8 rows)

The estimated cost is now more than 5,000 times lower.

Conclusion

SQL is difficult to learn and even more difficult to master, but with these tips you are one step closer to designing an efficient DB schema and writing efficient queries. Using EXPLAIN will be of much help, because it presents the query as the set of basic operations that the database engine performs. This shows how the data is actually extracted and arranged, and therefore lets you query it in a more efficient manner.
