ORDER BY limit 10比ORDER BY limit 100更慢

问题分析

pg数据库中执行sql时，ORDER BY limit 10比ORDER BY limit 100更慢

执行计划分析

sql 复制代码

 SELECT
			*,
        (select cl.ITEM_DESC from tablelzl2 cl where item_name='name' and cl.ITEM_NO='abcdefg') AS "item"
        FROM
        tablelzl1 RI
        WHERE  RI.column1='AAAA'
		AND  RI.column2 = 'applyno20231112'
        ORDER BY
        RI.column3 DESC    limit 10

复制代码

 Limit  (cost=0.43..1522.66 rows=10 width=990)
   ->  Index Scan Backward using idx_tablelzl1_column3 on tablelzl1 ri  (cost=0.43..158007.45 rows=1038 width=990)
         Filter: (((column1)::text = 'AAAA'::text) AND ((column2)::text = 'applyno20231112'::text))
         SubPlan 1
           ->  Index Scan using uk_tablelzl2_ii on tablelzl2 cl  (cost=0.27..5.29 rows=1 width=18)
                 Index Cond: (((item_no)::text = 'manualSign'::text) AND ((item_name)::text = (ri.manual_sign)::text))

主表没有走到column2索引，而是走column3排序字段索引的Index Scan Backward，scan index的cost非常高，而最终的cost比较低，实际执行需要9s

如果把limit 10改成limit 100，执行计划正常：

sql 复制代码

 SELECT
			*,
        (select cl.ITEM_DESC from tablelzl2 cl where cl.ITEM_NAME = RI.MANUAL_SIGN AND   cl.ITEM_NO='manualSign') AS "manualSign"
        FROM
        tablelzl1 RI
        WHERE  RI.column1='AAAA'
		AND  RI.column2 = 'applyno20231112'
        ORDER BY
        RI.column3 DESC limit 100

复制代码

                                                     QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
Limit  (cost=2632.28..3162.78 rows=100 width=990)
  ->  Result  (cost=2632.28..8138.87 rows=1038 width=990)
        ->  Sort  (cost=2632.28..2634.87 rows=1038 width=474)
              Sort Key: ri.column3 DESC
              ->  Index Scan using idx_cri_column2 on tablelzl1 ri  (cost=0.43..2592.61 rows=1038 width=474)
                    Index Cond: ((column2)::text = 'applyno20231112'::text)
                    Filter: ((column1)::text = 'AAAA'::text)
        SubPlan 1
          ->  Index Scan using uk_tablelzl2_ii on tablelzl2 cl  (cost=0.27..5.29 rows=1 width=18)
                Index Cond: (((item_no)::text = 'manualSign'::text) AND ((item_name)::text = (ri.manual_sign)::text))
(10 rows)

子查询执行计划不变，主表走到column2单列索引，回表后排序再limit，执行非常快。

不仅是limit，如果原sql仅更换column2的值，执行计划也正常。也就是说这个生产的sql只有极个别的column2的值时执行计划是异常的。

执行计划分析：

子查询前后没变可以不用分析，主要是索引选择上的不同。column2是过滤字段，column3是排序字段，两个执行计划分别选择了这2个字段的索引。

异常的limit 10执行计划：反向扫描排序字段索引->回表 ->limit。因为不需要额外排序，反向扫描索引时找到limit个数据就可以不用继续扫描了；扫描排序字段索引的预估代价非常高，最上层的limit最终代价预估很低。
正常的limit 100执行计划：访问过滤字段索引->回表 ->以排序字段排序 ->limit。因为要排序，需要把符合条件的所有索引条目全部找出来；本身访问过滤字段的索引代价预估低。

所以问题的关键在于部分反向扫描排序索引时，代价预估的过低

真实的执行情况

explain (analyze,buffers) 看下真实的执行情况

复制代码

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..1521.93 rows=10 width=990) (actual time=23.311..8122.516 rows=10 loops=1)
   Buffers: shared hit=861100 read=42985 dirtied=7
   I/O Timings: read=6741.003
   ->  Index Scan Backward using idx_tablelzl1_column3 on tablelzl1 ri  (cost=0.43..157932.45 rows=1038 width=990) (actual time=23.309..8122.505 rows=10 loops=1)
         Filter: (((column1)::text = 'AAAA'::text) AND ((column2)::text = 'applyno20231112'::text))
         Rows Removed by Filter: 1521796
         Buffers: shared hit=861100 read=42985 dirtied=7
         I/O Timings: read=6741.003
         SubPlan 1
           ->  Index Scan using uk_tablelzl2_ii on tablelzl2 cl  (cost=0.27..5.29 rows=1 width=18) (actual time=0.005..0.005 rows=0 loops=10)
                 Index Cond: (((item_no)::text = 'manualSign'::text) AND ((item_name)::text = (ri.manual_sign)::text))
                 Buffers: shared hit=6
 Planning:
   Buffers: shared hit=121 read=28
   I/O Timings: read=1.476
 Planning Time: 2.314 ms
 Execution Time: 8122.658 ms

 Limit  (cost=2632.28..3162.78 rows=100 width=990) (actual time=150.101..150.122 rows=14 loops=1)
   Buffers: shared hit=700 read=274
   I/O Timings: read=146.903
   ->  Result  (cost=2632.28..8138.87 rows=1038 width=990) (actual time=150.100..150.119 rows=14 loops=1)
         Buffers: shared hit=700 read=274
         I/O Timings: read=146.903
         ->  Sort  (cost=2632.28..2634.87 rows=1038 width=474) (actual time=150.072..150.073 rows=14 loops=1)
               Sort Key: ri.column3 DESC
               Sort Method: quicksort  Memory: 30kB
               Buffers: shared hit=694 read=274
               I/O Timings: read=146.903
               ->  Index Scan using idx_cri_column2 on tablelzl1 ri  (cost=0.43..2592.61 rows=1038 width=474) (actual time=0.418..149.973 rows=14 loops=1)
                     Index Cond: ((column2)::text = 'applyno20231112'::text)
                     Filter: ((column1)::text = 'AAAA'::text)
                     Rows Removed by Filter: 1218
                     Buffers: shared hit=691 read=274
                     I/O Timings: read=146.903
         SubPlan 1
           ->  Index Scan using uk_tablelzl2_ii on tablelzl2 cl  (cost=0.27..5.29 rows=1 width=18) (actual time=0.002..0.002 rows=0 loops=14)
                 Index Cond: (((item_no)::text = 'manualSign'::text) AND ((item_name)::text = (ri.manual_sign)::text))
                 Buffers: shared hit=6
 Planning Time: 0.334 ms
 Execution Time: 150.257 ms

limit 10的执行计划，执行8s，内存读shared hit=861100 磁盘读read=42985 ，丢弃了1521796行

limit 100的执行计划执行0.1s shared hit=694 read=274，丢弃了1218行

limit 10的执行计划明显是不正常，读了太多的数据才找到符合条件的行，这是sql执行过慢的原因

统计信息分析

本身预估的代价不高，但是实际上需要扫描非常多的索引行，首先想到是否是统计信息是否准确

表的统计信息：

复制代码

[postgres@cnsz381785:7169/(rasesql)phmamp][10-30.15:01:26]M=#  select relpages,reltuples::bigint from pg_class where relname='tablelzl1';
 relpages | reltuples 
----------+-----------
    91172 |   2280874 --count出来差不多

字段的统计信息：

复制代码

[phmampopr@cnsz381785:7169/(rasesql)phmamp][10-27.17:08:48]M=>  select * from pg_stats where tablename='tablelzl1' and attname='column2';
-[ RECORD 1 ]----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
schemaname             | public
tablename              | tablelzl1
attname                | column2
inherited              | f
null_frac              | 0
avg_width              | 18
n_distinct             | -0.11990886
most_common_vals       | {applyno20231112,DY20190723006650,DY20200102012899,DY20180827000557,DY20190524001304,DY20190529001885,DY20190728002359}
most_common_freqs      | {0.0005,0.00026666667,0.00023333334,0.0002,0.0002,0.0002,0.0002}
histogram_bounds       | {CULZF0000121605605,DSNEW0000126854232,DSNEW0000137652871,DY20160516001057,DY20161104005509,DY20170306002677,DY20170703010428,DY20170928013517,DY20180410007383,DY20180615002936,DY20180
correlation            | 0.3131596
most_common_elems      | [null]
most_common_elem_freqs | [null]
elem_count_histogram   | [null]

这个column2 applyno20231112刚好就是排第一的most_common_vals，出现预估概率是0.0005，用预估的行2280874*0.0005=1140，与实际的行数1232差不多

复制代码

[postgres@cnsz381785:7169/(rasesql)phmamp][10-30.15:05:28]M=# select count(*) from tablelzl1 where column2 = 'applyno20231112';
 count 
-------
  1232

说明统计信息是准确的，实际上运行analze收集统计信息也不会解决这个问题

数据分布不均的计算

用当前统计信息计算出来的符合条件的行有1140个，那么预计从排序字段的索引上找到第一条数据平均要扫描2280874/1140=2000个索引行。如果找10条便是20000个索引行，100条便是200000个索引行。

把sort禁用，让limit 100语句强行走排序字段的索引

sql 复制代码

M=# set enable_sort=off;
SET

复制代码

--limit 100的执行计划
 Limit  (cost=0.43..15222.69 rows=100 width=990)
   ->  Index Scan Backward using idx_tablelzl1_column3 on tablelzl1 ri  (cost=0.43..158007.45 rows=1038 width=990)
         Filter: (((column1)::text = 'AAAA'::text) AND ((column2)::text = 'applyno20231112'::text))
         SubPlan 1
           ->  Index Scan using uk_tablelzl2_ii on tablelzl2 cl  (cost=0.27..5.29 rows=1 width=18)
                 Index Cond: (((item_no)::text = 'manualSign'::text) AND ((item_name)::text = (ri.manual_sign)::text))

limit 10改成limit 100后的执行计划，代价从1522.66升到了15222.69，基本上只是简单的*10。limit 100的代价15222.69大于了走过滤字段索引的执行计划cost 3162.78，所以limit 10和limit 100执行计划不同，选择了不同的索引。

以上的估算都是以数据零散的放在排序列的索引上为前提的，实际情况有可能数据在最后一条（反向扫描索引），很快就能找到；也有可能数据全部在索引叶节点前面的几个pages，此时几乎是扫描全部索引并回表，代价便非常高。

那么两个字段的关联度，数据在索引上的分布情况，决定了使用排序字段的索引的效率。

再看下真实的执行扫描了多少行数据：

复制代码

   ->  Index Scan Backward using idx_tablelzl1_column3 on tablelzl1 ri  (cost=0.43..157932.45 rows=1038 width=990) (actual time=23.309..8122.505 rows=10 loops=1)
         Filter: (((column1)::text = 'AAAA'::text) AND ((column2)::text = 'applyno20231112'::text))
         Rows Removed by Filter: 1521796

实际上差不多扫描了1521796行才找到这10条数据，本来预估的是20000，整整相差了76倍！

触发场景

必须有where +order by+limit语句
排序字段和过滤字段都必须有索引
一般limit不会特别大
数据分布不均

解决办法

改写sql语句：添加表达式，不让order by字段走索引即可

sql 复制代码

 SELECT
			*,
        (select cl.ITEM_DESC from tablelzl2 cl where cl.ITEM_NAME = RI.MANUAL_SIGN AND   cl.ITEM_NO='manualSign') AS "manualSign"
        FROM
        tablelzl1 RI
        WHERE  RI.column1='AAAA'
		AND  RI.column2 = 'applyno20231112'
        ORDER BY
        RI.column3 +'0' DESC    limit 10

oracle是怎么做的

执行计划cost的预估差异

从上面的执行计划分析，pg的执行计划cost看起来不太适应，上层的cost小于内层的cost，不像oracle这样阶梯式的累加计算

这里做一个oracle和pg的实验，一张表仅存储colname='x'的数据，看下pg和oracle的对cost计算的区别：

复制代码

[postgres@cnsz381785:7169/(rasesql)dbmgr][10-31.14:32:19]M=# explain select * from testlzl where col1='x' limit 1;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Limit  (cost=0.00..0.02 rows=1 width=2)
   ->  Seq Scan on testlzl  (cost=0.00..17747.20 rows=1048576 width=2)
         Filter: ((col1)::text = 'x'::text)

[postgres@cnsz381785:7169/(rasesql)dbmgr][10-31.14:32:30]M=# explain select * from testlzl where col1='xx' limit 1;
                           QUERY PLAN                            
-----------------------------------------------------------------
 Limit  (cost=0.00..17747.20 rows=1 width=2)
   ->  Seq Scan on testlzl  (cost=0.00..17747.20 rows=1 width=2)
         Filter: ((col1)::text = 'xx'::text)

col1='x'立马就能找到，limit的算法没有推入到全表扫描的成本中，total cost是17747.20，跟扫描完表的成本是一样的。limit的成本cost虽然没有下推到内层的cost做计算，但是rows计算了！

来看下oracle是执行计划是怎么做的：

复制代码

SYS@t8icss1> select * from dbmgr.testlzl where a='x' and rownum<=1;

1 row selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2045386539

------------------------------------------------------------------------------
| Id  | Operation          | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |         |     1 |     2 |     2   (0)| 00:00:01 |
|*  1 |  COUNT STOPKEY     |         |       |       |            |          |
|*  2 |   TABLE ACCESS FULL| TESTLZL |     1 |     2 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=1)
   2 - filter("A"='x')


SYS@t8icss1> select * from dbmgr.testlzl where a='xx' and rownum<=1;

no rows selected


Execution Plan
----------------------------------------------------------
Plan hash value: 2045386539

------------------------------------------------------------------------------
| Id  | Operation          | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |         |     1 |     2 |   302   (2)| 00:00:01 |
|*  1 |  COUNT STOPKEY     |         |       |       |            |          |
|*  2 |   TABLE ACCESS FULL| TESTLZL |     1 |     2 |   302   (2)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=1)
   2 - filter("A"='xx')

对于oracle的计划，a='x'的数据可以立即找到的话，STOPKEY的代价算进了内层的cost中，cost只有2，实际上扫描全表的代价比较高302。

这一点是oracle与pg关于cost计算的一个重要区别：

oracle的外层cost必然>=内层cost；pg则不一定
oracle的内层cost计算包含了外层的算子（比如stopkey）；但是pg不会包含，直接给子路径的全部成本

oracle的数据分布不均问题

知道了数据分布不均的原理，造一条数据把他放在排序索引的开头即可

sql 复制代码

create table tlzl(a char(100) not null,b char(100) not null);
--插入批量数据
begin
for i in 1..100000 loop
insert into tlzl values('test','test');
end loop;
end;
/
--插入特殊数据
insert into tlzl values('aaaa','aaaa');
insert into tlzl values('zzzz','zzzz');
--创建索引
create index idx_a on tlzl(a);
create index idx_b on tlzl(b);
--收集统计信息
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>'SYS',TABNAME=>'TLZL',estimate_percent => 10, degree=>1,METHOD_OPT=>'FOR ALL COLUMNS SIZE AUTO',cascade=>true);

sql 复制代码

select * from (select /*+ index(tlzl idx_a)*/* from tlzl where b='aaaa' order by a) where rownum<=1; 
select * from (select /*+ index(tlzl idx_a)*/* from tlzl where b='zzzz' order by a) where rownum<=1;

复制代码

SYS@t8icss1> select * from (select /*+ index(tlzl idx_a)*/* from tlzl where b='aaaa' order by a) where rownum<=1; 

Execution Plan
----------------------------------------------------------
Plan hash value: 3674066029

---------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |     1 |   204 |  2210   (1)| 00:00:01 |
|*  1 |  COUNT STOPKEY                |       |       |       |            |          |
|   2 |   VIEW                        |       |     1 |   204 |  2210   (1)| 00:00:01 |
|*  3 |    TABLE ACCESS BY INDEX ROWID| TLZL  |     1 |   202 |  2210   (1)| 00:00:01 |
|   4 |     INDEX FULL SCAN           | IDX_A | 98830 |       |   779   (1)| 00:00:01 |
---------------------------------------------------------------------------------------

SYS@t8icss1> select * from (select /*+ index(tlzl idx_a)*/* from tlzl where b='zzzz' order by a) where rownum<=1; 

Execution Plan
----------------------------------------------------------
Plan hash value: 3674066029

---------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |     1 |   204 |  2210   (1)| 00:00:01 |
|*  1 |  COUNT STOPKEY                |       |       |       |            |          |
|   2 |   VIEW                        |       |     1 |   204 |  2210   (1)| 00:00:01 |
|*  3 |    TABLE ACCESS BY INDEX ROWID| TLZL  |     1 |   202 |  2210   (1)| 00:00:01 |
|   4 |     INDEX FULL SCAN           | IDX_A | 98830 |       |   779   (1)| 00:00:01 |
---------------------------------------------------------------------------------------

oracle的优化器也是一样的，优化器并不知道数据到底放在索引的哪个地方，没有办法，放在索引的第一条和最后一条都是估算的同一代价。

不过oracle有很多方法可以解决这个问题，如extended statistic、Automatic Column Group Detection、固化执行计划等。

参考

http://www.postgres.cn/v2/news/viewone/1/717

https://oracle-base.com/articles/12c/automatic-column-group-detection-extended-statistics-12cr1