TiDB 无统计信息时执行计划如何生成

作者： weiyinghua 原文来源： https://tidb.net/blog/4c49ac0d

一、Pseudo 统计信息总体生成规则

TiDB 在表无统计信息时，不会进行动态采样，而是用静态的、预设规则以及经验假设来生成计划。用函数 PseudoTable 创建一个伪统计表对象，通过默认 1万行，并填充列和索引 pseudo 统计信息，提供一个粗略的估算依据。

二、使用 pseudo 统计信息原因

表或索引刚创建还没收集完统计信息，或数据刚导入未收集统计信息；
修改过的行数 / 表总行数比值超过 pseudo-estimate-ratio 默认值 0.8 时会认为统计信息过期而用 pseudo 统计信息；
在 TiDB Server 启动时，由于统计信息加载未完成，可能会使用 pseudo 统计信息。

三、Pseudo 统计信息生成过程

PseudoTable 函数创建一个伪表统计信息对象，用于无统计信息情况下生成统计信息对象。

初始化 pseudo 统计信息

创建一个 HistColl 对象，设置其属性，如行数 RealtimeCount 、物理ID PhysicalID 、是否允许触发加载 CanNotTriggerLoad 等。
初始化列和索引的映射表 columns 和 indices 。

创建统计信息表对象

创建一个 Table 对象，并将其 HistColl 属性设置为前面创建的 HistColl 对象。
初始化列和索引的存在映射表 ColAndIdxExistenceMap 。

生成列信息

遍历表的列信息 tblInfo.Columns 。
对于状态为 StatePublic 且不是隐藏列的列：
- 在存在映射表中插入列信息。
- 如果 allowFillHistMeta 为 true ，则为该列创建一个 Column 对象，并填充直方图元数据。

生成索引信息

遍历表的索引信息 tblInfo.Indices 。
对于状态为 StatePublic 的索引：
- 在存在映射表中插入索引信息。
- 如果 allowFillHistMeta 为 true ，则为该索引创建一个 Index 对象，并填充直方图元数据。

返回统计信息表对象

最后返回创建好的 Table 对象。

源码地址

SQL 复制代码

https://github.com/pingcap/tidb/blob/426ce3e57069afbd8f061d7ae39c79d3f9e2ff5d/pkg/statistics/table.go#L1004

四、Pseudo 估算规则示例

示例表结构如下，并删除统计信息：

sql 复制代码

CREATE TABLE t1 (
    id bigint not null auto_random primary key,
    k char(64),
    v varchar(255),
    update_time datetime,
    key idx_k_update_time(k, update_time),
    key idx_update_time(update_time)
);
drop stats t1;

索引等值查询

普通索引等值查询，基于经验的假设，固定估算为10行，源码中常量 pseudoEqualRate 控制：

sql 复制代码

mysql> explain select count(1) from t1 where k='tom';
+-----------------------------+---------+-----------+---------------------------------------------------+-----------------------------------------------------+
| id                          | estRows | task      | access object                                     | operator info                                       |
+-----------------------------+---------+-----------+---------------------------------------------------+-----------------------------------------------------+
| StreamAgg_17                | 1.00    | root      |                                                   | funcs:count(Column#10)->Column#5                    |
| └─IndexReader_18            | 1.00    | root      |                                                   | index:StreamAgg_9                                   |
|   └─StreamAgg_9             | 1.00    | cop[tikv] |                                                   | funcs:count(1)->Column#10                           |
|     └─IndexRangeScan_16     | 10.00   | cop[tikv] | table:t1, index:idx_k_update_time(k, update_time) | range:["tom","tom"], keep order:false, stats:pseudo |
+-----------------------------+---------+-----------+---------------------------------------------------+-----------------------------------------------------+

如果是组合索引，第一个字段固定估算为全表的 1/1000 ，第二个字段固定估算为 1/100：

sql 复制代码

mysql> explain select count(1) from t1 where k='tom' and update_time='2024-11-25';
+---------------------------+---------+-----------+---------------------------------------------------+---------------------------------------------------------------------------------------------+
| id                        | estRows | task      | access object                                     | operator info                                                                               |
+---------------------------+---------+-----------+---------------------------------------------------+---------------------------------------------------------------------------------------------+
| StreamAgg_10              | 1.00    | root      |                                                   | funcs:count(1)->Column#5                                                                    |
| └─IndexReader_15          | 0.10    | root      |                                                   | index:IndexRangeScan_14                                                                     |
|   └─IndexRangeScan_14     | 0.10    | cop[tikv] | table:t1, index:idx_k_update_time(k, update_time) | range:["tom" 2024-11-25 00:00:00,"tom" 2024-11-25 00:00:00], keep order:false, stats:pseudo |
+---------------------------+---------+-----------+---------------------------------------------------+---------------------------------------------------------------------------------------------+

索引大于或小于

对于大于或小于查询，固定估算为1万行的三分之一，由源码中常量 pseudoLessRate 控制：

sql 复制代码

mysql> explain select count(1) from t1 where update_time < '2024-11-25';
+-----------------------------+---------+-----------+----------------------------------------------+------------------------------------------------------------------+
| id                          | estRows | task      | access object                                | operator info                                                    |
+-----------------------------+---------+-----------+----------------------------------------------+------------------------------------------------------------------+
| HashAgg_12                  | 1.00    | root      |                                              | funcs:count(Column#6)->Column#5                                  |
| └─IndexReader_13            | 1.00    | root      |                                              | index:HashAgg_6                                                  |
|   └─HashAgg_6               | 1.00    | cop[tikv] |                                              | funcs:count(1)->Column#6                                         |
|     └─IndexRangeScan_11     | 3323.33 | cop[tikv] | table:t1, index:idx_update_time(update_time) | range:[-inf,2024-11-25 00:00:00), keep order:false, stats:pseudo |
+-----------------------------+---------+-----------+----------------------------------------------+------------------------------------------------------------------+

如果是组合索引，第一个字段估算为全表的 1/100，第二个字段再估算1/3，即：10000 * 1/100 * 1/3：

sql 复制代码

mysql> explain select count(1) from t1 where k='tom' and update_time>'2024-11-25';
+-----------------------------+---------+-----------+---------------------------------------------------+------------------------------------------------------------------------------+
| id                          | estRows | task      | access object                                     | operator info                                                                |
+-----------------------------+---------+-----------+---------------------------------------------------+------------------------------------------------------------------------------+
| StreamAgg_17                | 1.00    | root      |                                                   | funcs:count(Column#7)->Column#5                                              |
| └─IndexReader_18            | 1.00    | root      |                                                   | index:StreamAgg_9                                                            |
|   └─StreamAgg_9             | 1.00    | cop[tikv] |                                                   | funcs:count(1)->Column#7                                                     |
|     └─IndexRangeScan_16     | 33.33   | cop[tikv] | table:t1, index:idx_k_update_time(k, update_time) | range:("tom" 2024-11-25 00:00:00,"tom" +inf], keep order:false, stats:pseudo |
+-----------------------------+---------+-----------+---------------------------------------------------+------------------------------------------------------------------------------+

索引范围查询

对于 between 写法，固定估算为1万行的 1/40，由源码中常量 pseudoBetweenRate 控制：

sql 复制代码

mysql> explain select count(1) from t1 where update_time BETWEEN '2024-11-25' and '2024-11-26';
+-----------------------------+---------+-----------+----------------------------------------------+---------------------------------------------------------------------------------+
| id                          | estRows | task      | access object                                | operator info                                                                   |
+-----------------------------+---------+-----------+----------------------------------------------+---------------------------------------------------------------------------------+
| StreamAgg_17                | 1.00    | root      |                                              | funcs:count(Column#7)->Column#5                                                 |
| └─IndexReader_18            | 1.00    | root      |                                              | index:StreamAgg_9                                                               |
|   └─StreamAgg_9             | 1.00    | cop[tikv] |                                              | funcs:count(1)->Column#7                                                        |
|     └─IndexRangeScan_16     | 250.00  | cop[tikv] | table:t1, index:idx_update_time(update_time) | range:[2024-11-25 00:00:00,2024-11-26 00:00:00], keep order:false, stats:pseudo |
+-----------------------------+---------+-----------+----------------------------------------------+---------------------------------------------------------------------------------+

普通列

对于非索引的普通列，固定估算全表扫描1万行，并预估过滤10行：

sql 复制代码

mysql> explain select count(1) from t1 where v = 'happy';
+------------------------------+----------+-----------+---------------+---------------------------------+
| id                           | estRows  | task      | access object | operator info                   |
+------------------------------+----------+-----------+---------------+---------------------------------+
| StreamAgg_20                 | 1.00     | root      |               | funcs:count(Column#7)->Column#5 |
| └─TableReader_21             | 1.00     | root      |               | data:StreamAgg_9                |
|   └─StreamAgg_9              | 1.00     | cop[tikv] |               | funcs:count(1)->Column#7        |
|     └─Selection_19           | 10.00    | cop[tikv] |               | eq(test.t1.v, "happy")          |
|       └─TableFullScan_18     | 10000.00 | cop[tikv] | table:t1      | keep order:false, stats:pseudo  |
+------------------------------+----------+-----------+---------------+---------------------------------+

唯一键

对于唯一键，固定返回1行：

sql 复制代码

mysql> explain select count(1) from t1 where id = 100;
+--------------------+---------+------+---------------+--------------------------+
| id                 | estRows | task | access object | operator info            |
+--------------------+---------+------+---------------+--------------------------+
| StreamAgg_9        | 1.00    | root |               | funcs:count(1)->Column#5 |
| └─Point_Get_11     | 1.00    | root | table:t1      | handle:100               |
+--------------------+---------+------+---------------+--------------------------+

其它情形

对于索引第一个字段 k 估算10行，对于索引第二个字符型字段 v 传入数字进行范围查找，在执行计划中出现了 cast(test.t1.v, double BINARY) 表示有隐式类型转换，不属于任何一种预设规则，在10行基础上乘以 0.8 估算得到8行，源码中常量 selectionFactor 控制：

sql 复制代码

mysql> explain select count(1) from t1 where k='tom' and v >10;
+----------------------------------+---------+-----------+---------------------------------------------------+-----------------------------------------------------+
| id                               | estRows | task      | access object                                     | operator info                                       |
+----------------------------------+---------+-----------+---------------------------------------------------+-----------------------------------------------------+
| StreamAgg_10                     | 1.00    | root      |                                                   | funcs:count(1)->Column#5                            |
| └─IndexLookUp_37                 | 8.00    | root      |                                                   |                                                     |
|   ├─IndexRangeScan_34(Build)     | 10.00   | cop[tikv] | table:t1, index:idx_k_update_time(k, update_time) | range:["tom","tom"], keep order:false, stats:pseudo |
|   └─Selection_36(Probe)          | 8.00    | cop[tikv] |                                                   | gt(cast(test.t1.v, double BINARY), 10)              |
|     └─TableRowIDScan_35          | 10.00   | cop[tikv] | table:t1                                          | keep order:false, stats:pseudo                      |
+----------------------------------+---------+-----------+---------------------------------------------------+-----------------------------------------------------+

源码地址

SQL 复制代码

https://github.com/pingcap/tidb/blob/426ce3e57069afbd8f061d7ae39c79d3f9e2ff5d/pkg/planner/cardinality/pseudo.go

五、总结

与 MySQL 8.0 类似，TiDB 7.1 在表缺乏统计信息时，并不会像 Oracle 那样动态采样生成执行计划。总的来说，无统计信息时 TiDB 生成的执行计划基于假设数据规模和经验规则，可能存在较大的误差。TiDB 8.2 将引入统计信息并发加载功能，加载统计信息效率将大幅度提升，有效减少使用 Pseudo 统计信息的可能。
询问AI

TiDB 无统计信息时执行计划如何生成

一、Pseudo 统计信息总体生成规则

二、 使用 pseudo 统计信息原因

三、Pseudo 统计信息生成过程

初始化 pseudo 统计信息

创建统计信息表对象

生成列信息

生成索引信息

返回统计信息表对象

源码地址

四、Pseudo 估算规则示例

索引等值查询

索引大于或小于

索引范围查询

普通列

唯一键

其它情形

源码地址

五、总结

二、使用 pseudo 统计信息原因