Monday, March 7, 2022

PostgreSQL estimate matching rows for values beyond known boundary (low_value and high_value)

Last time we have analyzed behavior in Oracle and found that Oracle treat values outside boundaries in less disruptive manner. (link)

What is the behavior for PostgreSQL optimizer to estimate matching rows for values beyond known boundary? 

Setup demo table

create table t (d date, v int);
create index t_n1 on t(d);

Populate the data (2022-03-01 to 2022-03-31, each day with 10,000 rows)

insert into t 
select '2022-03-01'::date + floor(generate_series/10000)*(interval '1 day') as d,
       generate_series as v
from generate_series(1,31*10000);

Analyze the statistics, and block autovacuum on this table

vacuum analyze t;
alter table t set (autovacuum_analyze_scale_factor=0,autovacuum_analyze_threshold=1000000000);

Populate the data (2022-04-01 to 2022-04-07, each day with 10,000 rows)

insert into t 
select '2022-04-01'::date + floor(generate_series/10000)*(interval '1 day') as d,
       generate_series as v
from generate_series(1,7*10000);
select relname, relkind, reltuples, relpages
from pg_class where relname in ('t','t_n1');
relnamerelkindreltuplesrelpages
tr3799281682
t_n1i310000250
select attname,n_distinct,
       array_to_string(most_common_vals, E'\n') as most_common_vals, 
       array_to_string(most_common_freqs, E'\n') as most_common_freqs
from pg_stats where tablename='t';
attnamen_distinctmost_common_valsmost_common_freqs
d312022-03-200.0344
2022-03-240.033833332
2022-03-190.0332
2022-03-290.0332
2022-03-230.033166666
2022-03-050.032833334
2022-03-170.03266667
2022-03-210.0326
2022-03-120.032566667
2022-03-060.032433335
2022-03-130.0324
2022-03-280.0324
2022-03-110.032233335
2022-03-150.032166667
2022-03-100.032133333
2022-03-250.032133333
2022-03-270.0321
2022-03-160.032066666
2022-03-180.03203333
2022-03-310.032
2022-03-140.031866666
2022-03-080.031833332
2022-03-260.031833332
2022-03-020.0318
2022-03-030.031733334
2022-03-040.031533334
2022-03-090.031433333
2022-03-220.031433333
2022-03-070.031366665
2022-03-010.0313
2022-03-300.0313
v-1
(2 rows)

Now test PostgreSQL optimizer estimation with different values for column "D"

Test 1: Within Boundary (D = 2022-03-15)

explain (analyze, verbose) 
select count(*) from t where d = '2022-03-15'::date;
mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-03-15'::date;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=284.84..284.85 rows=1 width=8) (actual time=1.259..1.260 rows=1 loops=1)
   Output: count(*)
   ->  Index Only Scan using t_n1 on public.t  (cost=0.42..254.29 rows=12221 width=0) (actual time=0.026..0.758 rows=10000 loops=1)
         Output: d
         Index Cond: (t.d = '2022-03-15'::date)
         Heap Fetches: 0
 Planning Time: 0.149 ms
 Execution Time: 1.317 ms
(8 rows)

Math: 379928 * 0.032166667 = 12221.017459976 ~ 12221 rows.

Test 2: At the Boundary (D = 2022-03-31)

explain (analyze, verbose) 
select count(*) from t where d = '2022-03-31'::date;
mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-03-31'::date;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=283.58..283.59 rows=1 width=8) (actual time=1.180..1.181 rows=1 loops=1)
   Output: count(*)
   ->  Index Only Scan using t_n1 on public.t  (cost=0.42..253.19 rows=12158 width=0) (actual time=0.023..0.687 rows=10000 loops=1)
         Output: d
         Index Cond: (t.d = '2022-03-31'::date)
         Heap Fetches: 0
 Planning Time: 0.065 ms
 Execution Time: 1.241 ms
(8 rows)

Math: 379928 * 0.032 = 12157.696 ~ 12158 rows.

Test 3: Outside the Boundary (D = 2022-04-01)

explain (analyze, verbose) 
select count(*) from t where d = '2022-04-01'::date;
mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-04-01'::date;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4.44..4.45 rows=1 width=8) (actual time=1.194..1.194 rows=1 loops=1)
   Output: count(*)
   ->  Index Only Scan using t_n1 on public.t  (cost=0.42..4.44 rows=1 width=0) (actual time=0.024..0.731 rows=10000 loops=1)
         Output: d
         Index Cond: (t.d = '2022-04-01'::date)
         Heap Fetches: 0
 Planning Time: 0.061 ms
 Execution Time: 1.274 ms
(8 rows)

Test 4: Outside the Boundary further (D = 2022-04-07)

explain (analyze, verbose) 
select count(*) from t where d = '2022-04-07'::date;
mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-04-07'::date;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4.44..4.45 rows=1 width=8) (actual time=2.368..2.369 rows=1 loops=1)
   Output: count(*)
   ->  Index Only Scan using t_n1 on public.t  (cost=0.42..4.44 rows=1 width=0) (actual time=0.024..1.537 rows=10000 loops=1)
         Output: d
         Index Cond: (t.d = '2022-04-07'::date)
         Heap Fetches: 0
 Planning Time: 0.081 ms
 Execution Time: 2.427 ms
(8 rows)

Test 5: Outside the Boundary even further (D = 2022-04-30)

explain (analyze, verbose) 
select count(*) from t where d = '2022-04-30'::date;
mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-04-30'::date;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4.44..4.45 rows=1 width=8) (actual time=0.018..0.019 rows=1 loops=1)
   Output: count(*)
   ->  Index Only Scan using t_n1 on public.t  (cost=0.42..4.44 rows=1 width=0) (actual time=0.015..0.015 rows=0 loops=1)
         Output: d
         Index Cond: (t.d = '2022-04-30'::date)
         Heap Fetches: 0
 Planning Time: 0.061 ms
 Execution Time: 0.074 ms
(8 rows)

Observation:

The amount of estimated matching rows becomes 1 for values outside the low_value and high_value boundary.

TestValueEstimate RowsActual Rows
Test 12022-03-151222110000
Test 22022-03-311215810000
Test 32022-04-01110000
Test 42022-04-07110000
Test 52022-04-3010

No comments:

Post a Comment