The PostgreSQL Query Cost Model

Slow database queries harm your organization in many ways. They can damage the reputation of otherwise great applications, make background processing painfully slow, and drastically increase the cost of your infrastructure. For any seasoned web developer, learning optimization strategies for the data layer is absolutely essential.

In this article we will explore the cost model of PostgreSQL, how to understand the output of the EXPLAIN command, and, most importantly, how to use that data to improve the throughput of your applications.

Making Use of the PostgreSQL EXPLAIN Command

Before deploying a new query in your application, it is good practice to run it through PostgreSQL’s EXPLAIN command to get an estimate of the performance impact the new query will have on your system.

We’ll start with an example database table to illustrate the usage of EXPLAIN. The table will hold a million records.

db # CREATE TABLE users (id serial, name varchar);

db # INSERT INTO users (name) SELECT 'John'
     FROM generate_series(1, 1000000);

db # SELECT count(*) FROM users;
  count
---------
 1000000
(1 row)

db # SELECT id, name FROM users LIMIT 10;
 id | name
----+------
  1 | John
  2 | John
  3 | John
  4 | John
  5 | John
  6 | John
  7 | John
  8 | John
  9 | John
 10 | John
(10 rows)

Suppose we need to look up a user with a given id, but before we deploy the new code we want to estimate the cost of that operation. Let’s run EXPLAIN with our desired query:

db # EXPLAIN SELECT * FROM users WHERE id = 870123;

                               QUERY PLAN
------------------------------------------------------------------------
 Gather  (cost=1000.00..11614.43 rows=1 width=9)
   Workers Planned: 2
   ->  Parallel Seq Scan on users  (cost=0.00..10614.33 rows=1 width=9)
         Filter: (id = 870123)
(4 rows)

There is a lot of output in the above example, but we can get the gist of it. To run this query, PostgreSQL plans to fire up two parallel workers. Each worker will run a sequential scan over its share of the table, and the Gather node will merge the results from the two workers.
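
Keep in mind that EXPLAIN only shows estimates. If you also want the actual execution times, EXPLAIN ANALYZE runs the query for real and annotates every node of the plan with measured timings and row counts:

db # EXPLAIN ANALYZE SELECT * FROM users WHERE id = 870123;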

In this article, we will focus on the cost values in the plan output and how PostgreSQL calculates them.

To simplify our cost exploration, let’s run the above query, but limit the number of parallel workers to 0.

db # SET max_parallel_workers_per_gather = 0;

db # EXPLAIN SELECT * FROM users WHERE id = 870123;

                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on users  (cost=0.00..17906.00 rows=1 width=9)
   Filter: (id = 870123)
(2 rows)

This is a bit simpler. With parallel workers disabled, the whole query runs in a single backend process, and the estimated cost is 17906.

The Math Behind The Cost Value

The cost, or penalty points, is mostly an abstract concept in PostgreSQL: the numbers have no unit and are only meaningful relative to each other. There are many ways in which PostgreSQL can execute a query, and the planner always chooses the execution plan with the lowest estimated cost.

To calculate the cost, PostgreSQL first looks at the size of your table in bytes. Let’s find out the size of the users table.

db # SELECT pg_relation_size('users');

 pg_relation_size
------------------
         44285952
(1 row)

PostgreSQL adds cost points for each block it has to read sequentially. Knowing that each block is 8 kB by default, we can calculate the number of sequential block reads needed to scan our table.

block_size = 8192 # block size in bytes
relation_size = 44285952

blocks = relation_size / block_size # => 5406

Now that we know the number of blocks, let’s find out how many points PostgreSQL allocates for each block read.

db # SHOW seq_page_cost;

 seq_page_cost
---------------
 1
(1 row)

In other words, PostgreSQL allocates one cost point for each block. That gives 5406 cost points to read the values from the table.
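
We can cross-check the block count against the planner’s own page count, which is stored in the pg_class catalog (the value is maintained by VACUUM and ANALYZE, so it can lag slightly behind the actual size on disk):

db # SELECT relpages FROM pg_class WHERE relname = 'users';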

Reading values from the disk is not everything that PostgreSQL needs to do. It also has to process every row in the CPU and apply the WHERE filter to it. Two settings are relevant for this calculation.

db # SHOW cpu_tuple_cost;

 cpu_tuple_cost
----------------
  0.01

db # SHOW cpu_operator_cost;

 cpu_operator_cost
-------------------
  0.0025

Now we have all the values to reproduce the cost that we got in our EXPLAIN output.

number_of_records = 1000000

block_size    = 8192     # block size in bytes
relation_size = 44285952

blocks = relation_size / block_size # => 5406

seq_page_cost     = 1
cpu_tuple_cost    = 0.01
cpu_operator_cost = 0.0025

cost = blocks * seq_page_cost +
       number_of_records * cpu_tuple_cost +
       number_of_records * cpu_operator_cost

cost # => 5406 + 10000 + 2500 = 17906
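
If you would rather let the database do the arithmetic, the same formula can be written as a single query against pg_class and the current settings. This is just a sketch: reltuples is the planner’s row estimate, so the result can drift from the exact figure until the table has been vacuumed or analyzed.

db # SELECT relpages * current_setting('seq_page_cost')::numeric +
            reltuples::numeric * current_setting('cpu_tuple_cost')::numeric +
            reltuples::numeric * current_setting('cpu_operator_cost')::numeric
     FROM pg_class
     WHERE relname = 'users';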

Indexes and The PostgreSQL Cost Model

Indexing is, and will most probably remain, the most important topic in the life of a database engineer. Does adding an index reduce the cost of our select statements? Let’s find out.

First, we’ll add an index to the users table:

db # CREATE INDEX idx_users_id ON users (id);

Let’s observe the query plan with the new index in place.

db # EXPLAIN SELECT * FROM users WHERE id = 870123;

                                QUERY PLAN
--------------------------------------------------------------------------
 Index Scan using idx_users_id on users  (cost=0.42..8.44 rows=1 width=9)
   Index Cond: (id = 870123)
(2 rows)

The cost dropped significantly. The calculation for an index scan is a bit more involved than the one for sequential scans, and it consists of two phases: estimating the cost of traversing the index itself, and estimating the cost of fetching the matching rows from the table.

PostgreSQL will consider the random_page_cost and cpu_index_tuple_cost variables, and return a value based on the height of the index tree.

db # SHOW random_page_cost;

 random_page_cost
------------------
  4

db # SHOW cpu_index_tuple_cost;

 cpu_index_tuple_cost
----------------------
  0.005

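As a rough back-of-envelope sketch (my simplification, not PostgreSQL’s exact formula): a unique lookup touches about one index leaf page and one table page, both as random reads, while the startup cost of 0.42 covers descending the B-tree.

random_page_cost     = 4
cpu_index_tuple_cost = 0.005

io_cost      = 2 * random_page_cost # one index page + one heap page => 8
startup_cost = 0.42                 # descending the B-tree

cost = startup_cost + io_cost # => 8.42, close to the 8.44 reported above
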
For the actual calculation, please consider reading through the source code of the cost_index function in PostgreSQL’s costsize.c.

The Cost Of Workers

PostgreSQL can start parallel workers to execute the query. However, starting a new worker has a performance penalty.

To calculate the cost of parallel queries, PostgreSQL uses parallel_tuple_cost, which defines the cost of transferring one tuple from a worker to the leader process, and parallel_setup_cost, which signifies the cost of launching the parallel workers in the first place.

db # SHOW parallel_tuple_cost;

 parallel_tuple_cost
---------------------
  0.1

db # SHOW parallel_setup_cost;

 parallel_setup_cost
---------------------
 1000
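
With these two values we can reconstruct the cost of the parallel plan from the beginning of the article. The sketch below reuses the numbers from the sequential scan calculation; the divisor of 2.4 comes from PostgreSQL’s get_parallel_divisor function, which assumes the leader process scans alongside the two workers:

parallel_setup_cost = 1000
parallel_tuple_cost = 0.1

rows_per_process = 1000000 / 2.4 # => 416667

# disk cost is not split between processes, only the CPU cost is
worker_cost = blocks * seq_page_cost +
              rows_per_process * (cpu_tuple_cost + cpu_operator_cost)
worker_cost # => 10614.33, the Parallel Seq Scan cost

gather_cost = worker_cost + parallel_setup_cost +
              parallel_tuple_cost * 1 # a single row reaches the gatherer
gather_cost # => 11614.43, the Gather cost from our first plan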

Did you like this article? Or do you maybe have a helpful hint to share? Please leave it in the comment section below.