读TiDB源码聊设计:浅析HTAP的SQL优化器

版本 日期 备注
1.0 2024.2.18 文章首发

本文的的源码分析全部基于TiDB6.5来做分析。

1.引子

如果让你做一个分布式数据库的优化器,面对以下的SQL,你会想到什么好的方法去执行他们呢?

  • SELECT id, name FROM person WHERE age >= 18 or height > 180 limit 100;:从条件上看,我们看到条件其实是二选一的: age >= 18 or height > 180。基于这种情况,我们肯定会去选择有索引的数据,如果都有索引or都没有,那么肯定选择扫描行数最少的数据。如果有一些算子在里面的话,则额外需要考虑数据的就近原则------有些算子在部分架构下可以充分利用MPP的能力,而有些则不行。
  • SELECT orders.order_id, customers.customer_name, orders.order_date FROM orders LEFT JOIN customers ON orders.customer_id=customers.customer_id;分布式数据库中的join,最优的方式就是小表广播到大表数据所在的地方。那么首先我们得知道谁是小表,谁是大表。

2.统计信息收集

根据前面的两个例子,我们可以发现------如果我们需要找到SQL对应的最佳计划,我们会需要一些表的元数据,或者说是统计信息。从常规的角度来说,以下统计信息是必须收集的:

  • 表的总行数
  • 每列数据的平均大小
  • 每列数据的基数:即NDV(distinct value)
  • 列的NULL值个数

如果是事务型的(行式存储),那么还要考虑索引平均长度、值的分布等等。

如果是分析型的(列式存储),那么还需要考虑相关列的最大值、最小值等等。

而统计方式的收集,会有多种方式。主要是考虑资源和准确性之间的Trade off。常见的有:

  • TopN:相关数据出现次数前 n 的值。
  • 直方图:用于描述数据分布图的工具。按照数据的值大小进行分桶,并用一些简单的数据来描述每个桶,比如落在桶里的值的个数。
  • 2D直方图:由于直方图无法反应列之间的关联,可以用2D直方图(联合分布)做到,但空间占用也比较多。
  • Count-Min Sketch:类似哈希表加上计算器的实现。可以用很少的数据来描述全体数据的特性。
  • HyperLogLog:一种估算海量数据基数的方法。很多情况下,统计并不需要那么精确。工程方面要在使用资源和准确性里找平衡。因此有人提出用HLL,这是一种占用资源少,但能给出较为准确的近似结果的算法。

TiDB收集的统计信息见:docs.pingcap.com/zh/tidb/v6....

3.代价的评估

一个SQL真正的物理执行计划可能是有多个的。在以统计信息为基础之上,我们还需要加入相应的权重,举个例子:

  1. 如果能够在内存中计算完成,就不用去反复发起本地IO
  2. 如果能在本地IO中完成,就不要去发起网络请求

等等...

这在TiDB的代码中也有对应的默认值。

go 复制代码
DefOptCPUFactor                                = 3.0
DefOptCopCPUFactor                             = 3.0
DefOptTiFlashConcurrencyFactor                 = 24.0
DefOptNetworkFactor                            = 1.0
DefOptScanFactor                               = 1.5
DefOptDescScanFactor                           = 3.0
DefOptSeekFactor                               = 20.0
DefOptMemoryFactor                             = 0.001
DefOptDiskFactor                               = 1.5
DefOptConcurrencyFactor                        = 3.0
go 复制代码
var defaultVer2Factors = costVer2Factors{
	TiDBTemp:      costVer2Factor{"tidb_temp_table_factor", 0.00},
	TiKVScan:      costVer2Factor{"tikv_scan_factor", 40.70},
	TiKVDescScan:  costVer2Factor{"tikv_desc_scan_factor", 61.05},
	TiFlashScan:   costVer2Factor{"tiflash_scan_factor", 11.60},
	TiDBCPU:       costVer2Factor{"tidb_cpu_factor", 49.90},
	TiKVCPU:       costVer2Factor{"tikv_cpu_factor", 49.90},
	TiFlashCPU:    costVer2Factor{"tiflash_cpu_factor", 2.40},
	TiDB2KVNet:    costVer2Factor{"tidb_kv_net_factor", 3.96},
	TiDB2FlashNet: costVer2Factor{"tidb_flash_net_factor", 2.20},
	TiFlashMPPNet: costVer2Factor{"tiflash_mpp_net_factor", 1.00},
	TiDBMem:       costVer2Factor{"tidb_mem_factor", 0.20},
	TiKVMem:       costVer2Factor{"tikv_mem_factor", 0.20},
	TiFlashMem:    costVer2Factor{"tiflash_mem_factor", 0.05},
	TiDBDisk:      costVer2Factor{"tidb_disk_factor", 200.00},
	TiDBRequest:   costVer2Factor{"tidb_request_factor", 6000000.00},
}

4.执行计划枚举与择优

当我们可以评估出物理执行计划的代价时,将会枚举所有可以用上物理执行计划,并在其中选择代价最小的物理执行计划。一般枚举分为两个流派:

  1. 自底向上:代表有System R。
  2. 自顶向下:代表有Cascade。

自底向上没法解决一个问题。当上层对下层的输出结果顺序感兴趣时,那就不能只能从底层的视角来寻找局部最优。

举个例子,多表Join。一开始两个表Join可能HashJoin代价很低,但是Join下一个表时,用MergeJoin从整体来看代价更低。从这个case来看,自底向上来做计划取优并不合适。

所以有了Cascade:

  1. 搜索方案是自顶向下的。这意味着它可以避免局部最优而导致全局不优。从Operator Tree 自顶向下遍历时,可以做一系列变换:
    • Implementation:逻辑算子可以转换成物理算子;例如Join转换成NestLoop或者HashJoin等
    • Exploration:逻辑算子可以做等价变换;例如交换Inner Join的两个子节点,即可枚举Join顺序

图片来自于:Cascades Optimizer------zhuanlan.zhihu.com/p/73545345

  1. 它是基于Volcano模型演进而来的。
  2. 用面向对象的方式进行建模,编写规则时不需要关心搜索过程。相比传统优化器中面向过程去一条条的编写,的确是很大的改进。

5.TiDB的实现

大致的代码调用链为:

log 复制代码
-- session/session.go

\-- ExecuteStmt //SQL执行的入口

|-- executor/compiler.go

\-- Compile //将SQL变成可执行的计划

|--planner/planner/optmize.go

\-- Optimize //优化的入口

\-- optimize //这里有两个入口。一种是新的优化器入口,一种是老的优化器入口。根据配置来选择。最终都会返回最优的物理执行计划。

    |-- planner/cascades/optmize.go

        \--FindBestPlan 见5.1
    
        \-- onPhasePreprocessing //见5.3

        \-- implGroup

        |--planner/core/optmizer.go //见5.4

            \-- DoOptimize

            \-- physicalOptimize

            |--planner/core/find_best_task.go

            \-- findBestTask

            \-- enumeratePhysicalPlans4Task

            \-- compareTaskCost
    
            \-- getTaskPlanCost
    
            |-- planner/core/plan_cost_ver2.go

            \-- getPlanCost

5.1 逻辑优化

核心入口为:

go 复制代码
// FindBestPlan is the optimization entrance of the cascades planner. The
// optimization is composed of 3 phases: preprocessing, exploration and implementation.
//
// ------------------------------------------------------------------------------
// Phase 1: Preprocessing
// ------------------------------------------------------------------------------
//
// The target of this phase is to preprocess the plan tree by some heuristic
// rules which should always be beneficial, for example Column Pruning.
//
// ------------------------------------------------------------------------------
// Phase 2: Exploration
// ------------------------------------------------------------------------------
//
// The target of this phase is to explore all the logically equivalent
// expressions by exploring all the equivalent group expressions of each group.
//
// At the very beginning, there is only one group expression in a Group. After
// applying some transformation rules on certain expressions of the Group, all
// the equivalent expressions are found and stored in the Group. This procedure
// can be regarded as searching for a weak connected component in a directed
// graph, where nodes are expressions and directed edges are the transformation
// rules.
//
// ------------------------------------------------------------------------------
// Phase 3: Implementation
// ------------------------------------------------------------------------------
//
// The target of this phase is to search the best physical plan for a Group
// which satisfies a certain required physical property.
//
// In this phase, we need to enumerate all the applicable implementation rules
// for each expression in each group under the required physical property. A
// memo structure is used for a group to reduce the repeated search on the same
// required physical property.
func (opt *Optimizer) FindBestPlan(sctx sessionctx.Context, logical plannercore.LogicalPlan) (p plannercore.PhysicalPlan, cost float64, err error) {
	logical, err = opt.onPhasePreprocessing(sctx, logical)
	if err != nil {
		return nil, 0, err
	}
	rootGroup := memo.Convert2Group(logical)
	err = opt.onPhaseExploration(sctx, rootGroup)
	if err != nil {
		return nil, 0, err
	}
	p, cost, err = opt.onPhaseImplementation(sctx, rootGroup)
	if err != nil {
		return nil, 0, err
	}
	err = p.ResolveIndices()
	return p, cost, err
}

注释+代码很干净,这里一共做三件事

  1. onPhasePreprocessing:注释很实在,说for example Column Pruning,结果里面真的只做了列裁剪。
  2. onPhaseExploration:探索所有逻辑等价存在的可能
  3. onPhaseImplementation:根据代价寻找最优的物理执行计划

这块官网的博客写的非常好,我就不重复了:cn.pingcap.com/blog/tidb-c...

5.2 统计信息收集

这块代码主要集中在stats.go里,里面的核心数据结构是stats_info.go里的StatsInfo。调用链大致为:

log 复制代码
|-- planner/cascades/optimizer.go

\--fillGroupStats

|-- planner/core/stats.go

\--DeriveStats
go 复制代码
type LogicalPlan interface {
	Plan
    //......忽略一些代码
	// DeriveStats derives statistic info for current plan node given child stats.
	// We need selfSchema, childSchema here because it makes this method can be used in
	// cascades planner, where LogicalPlan might not record its children or schema.
	DeriveStats(childStats []*property.StatsInfo, selfSchema *expression.Schema, childSchema []*expression.Schema, colGroups [][]*expression.Column) (*property.StatsInfo, error)
    //......忽略一些代码
}

有很多结构体实现了这个方法:

  1. 收集统计信息主要是analyze.go#Next方法中调用的#analyzeWorker。

5.3 新版本 物理执行计划的选择

代码入口为:

go 复制代码
// implGroup finds the best Implementation which satisfies the required
// physical property for a Group. The best Implementation should have the
// lowest cost among all the applicable Implementations.
//
// g:			the Group to be implemented.
// reqPhysProp: the required physical property.
// costLimit:   the maximum cost of all the Implementations.
func (opt *Optimizer) implGroup(g *memo.Group, reqPhysProp *property.PhysicalProperty, costLimit float64) (memo.Implementation, error) {
	groupImpl := g.GetImpl(reqPhysProp)
	if groupImpl != nil {
		if groupImpl.GetCost() <= costLimit {
			return groupImpl, nil
		}
		return nil, nil
	}
	// Handle implementation rules for each equivalent GroupExpr.
	var childImpls []memo.Implementation
	err := opt.fillGroupStats(g)
	if err != nil {
		return nil, err
	}
	outCount := math.Min(g.Prop.Stats.RowCount, reqPhysProp.ExpectedCnt)
	for elem := g.Equivalents.Front(); elem != nil; elem = elem.Next() {
		curExpr := elem.Value.(*memo.GroupExpr)
		impls, err := opt.implGroupExpr(curExpr, reqPhysProp)
		if err != nil {
			return nil, err
		}
		for _, impl := range impls {
			childImpls = childImpls[:0]
			for i, childGroup := range curExpr.Children {
				childImpl, err := opt.implGroup(childGroup, impl.GetPlan().GetChildReqProps(i), impl.GetCostLimit(costLimit, childImpls...))
				if err != nil {
					return nil, err
				}
				if childImpl == nil {
					impl.SetCost(math.MaxFloat64)
					break
				}
				childImpls = append(childImpls, childImpl)
			}
			if impl.GetCost() == math.MaxFloat64 {
				continue
			}
			implCost := impl.CalcCost(outCount, childImpls...)
			if implCost > costLimit {
				continue
			}
			if groupImpl == nil || groupImpl.GetCost() > implCost {
				groupImpl = impl.AttachChildren(childImpls...)
				costLimit = implCost
			}
		}
	}
	// Handle enforcer rules for required physical property.
	for _, rule := range GetEnforcerRules(g, reqPhysProp) {
		newReqPhysProp := rule.NewProperty(reqPhysProp)
		enforceCost := rule.GetEnforceCost(g)
		childImpl, err := opt.implGroup(g, newReqPhysProp, costLimit-enforceCost)
		if err != nil {
			return nil, err
		}
		if childImpl == nil {
			continue
		}
		impl := rule.OnEnforce(reqPhysProp, childImpl)
		implCost := enforceCost + childImpl.GetCost()
		impl.SetCost(implCost)
		if groupImpl == nil || groupImpl.GetCost() > implCost {
			groupImpl = impl
			costLimit = implCost
		}
	}
	if groupImpl == nil || groupImpl.GetCost() == math.MaxFloat64 {
		return nil, nil
	}
	g.InsertImpl(reqPhysProp, groupImpl)
	return groupImpl, nil
}

这里个函数会找出潜在符合条件的物理执行计划,并不断的搜索。但它有一个剪枝逻辑------会记录当前最小的cost,如果一个执行计划自上向下搜索时,如果超过了这个cost,则直接返回。这就是在第3节提到的自顶向下的优化。

接下来我们看一下memo.Implementation的定义:

go 复制代码
package memo

import (
	plannercore "github.com/pingcap/tidb/planner/core"
)

// Implementation defines the interface for cost of physical plan.
type Implementation interface {
	CalcCost(outCount float64, children ...Implementation) float64
	SetCost(cost float64)
	GetCost() float64
	GetPlan() plannercore.PhysicalPlan

	// AttachChildren is used to attach children implementations and returns it self.
	AttachChildren(children ...Implementation) Implementation

	// GetCostLimit gets the costLimit for implementing the next childGroup.
	GetCostLimit(costLimit float64, children ...Implementation) float64
}

其中CalcCost方法就是用来计算物理执行计划的代价。一共有这么多结构体实现了它:

5.3.1 代价的评估

我们以开头的例子,讲解代价的评估。

select代价计算方式

go 复制代码
// CalcCost calculates the cost of the table scan Implementation.
func (impl *TableScanImpl) CalcCost(outCount float64, _ ...memo.Implementation) float64 {
	ts := impl.plan.(*plannercore.PhysicalTableScan)
	width := impl.tblColHists.GetTableAvgRowSize(impl.plan.SCtx(), impl.tblCols, kv.TiKV, true)
	sessVars := ts.SCtx().GetSessionVars()
	impl.cost = outCount * sessVars.GetScanFactor(ts.Table) * width
	if ts.Desc {
		impl.cost = outCount * sessVars.GetDescScanFactor(ts.Table) * width
	}
	return impl.cost
}

// GetScanFactor returns the session variable scanFactor
// returns 0 when tbl is a temporary table.
func (s *SessionVars) GetScanFactor(tbl *model.TableInfo) float64 {
	if tbl != nil {
		if tbl.TempTableType != model.TempTableNone {
			return 0
		}
	}
	return s.scanFactor 
}


// CalcCost implements Implementation interface.
func (impl *IndexScanImpl) CalcCost(outCount float64, _ ...memo.Implementation) float64 {
	is := impl.plan.(*plannercore.PhysicalIndexScan)
	sessVars := is.SCtx().GetSessionVars()
	rowSize := impl.tblColHists.GetIndexAvgRowSize(is.SCtx(), is.Schema().Columns, is.Index.Unique)
	cost := outCount * rowSize * sessVars.GetScanFactor(is.Table)
	if is.Desc {
		cost = outCount * rowSize * sessVars.GetDescScanFactor(is.Table)
	}
	cost += float64(len(is.Ranges)) * sessVars.GetSeekFactor(is.Table)
	impl.cost = cost
	return impl.cost
}

这里我们以全表扫描表和命中索引的代码为例子。其中默认的scanFactor是1.5。这里可以看到indexScan和tableScan少了一个因数:width。这个变量代表了所需列的平均大小。这么看基本上是indexScan最优了。

这里的代码笔者认为有点不优雅------当Desc时,其实之前的Cost是没必要算一遍的,浪费CPU资源。

join代价计算方式

go 复制代码
// CalcCost implements Implementation CalcCost interface.
func (impl *HashJoinImpl) CalcCost(_ float64, children ...memo.Implementation) float64 {
	hashJoin := impl.plan.(*plannercore.PhysicalHashJoin)
	// The children here are only used to calculate the cost.
	hashJoin.SetChildren(children[0].GetPlan(), children[1].GetPlan())
	selfCost := hashJoin.GetCost(children[0].GetPlan().StatsCount(), children[1].GetPlan().StatsCount(), false, 0, nil)
	impl.cost = selfCost + children[0].GetCost() + children[1].GetCost()
	return impl.cost
}

// CalcCost implements Implementation CalcCost interface.
func (impl *MergeJoinImpl) CalcCost(_ float64, children ...memo.Implementation) float64 {
	mergeJoin := impl.plan.(*plannercore.PhysicalMergeJoin)
	// The children here are only used to calculate the cost.
	mergeJoin.SetChildren(children[0].GetPlan(), children[1].GetPlan())
	selfCost := mergeJoin.GetCost(children[0].GetPlan().StatsCount(), children[1].GetPlan().StatsCount(), 0)
	impl.cost = selfCost + children[0].GetCost() + children[1].GetCost()
	return impl.cost
}

具体的计算都在plan_cost_v1.go里:

go 复制代码
// GetCost computes cost of hash join operator itself.
func (p *PhysicalHashJoin) GetCost(lCnt, rCnt float64, isMPP bool, costFlag uint64, op *physicalOptimizeOp) float64 {
	buildCnt, probeCnt := lCnt, rCnt
	build := p.children[0]
	// Taking the right as the inner for right join or using the outer to build a hash table.
	if (p.InnerChildIdx == 1 && !p.UseOuterToBuild) || (p.InnerChildIdx == 0 && p.UseOuterToBuild) {
		buildCnt, probeCnt = rCnt, lCnt
		build = p.children[1]
	}
	sessVars := p.ctx.GetSessionVars()
	oomUseTmpStorage := variable.EnableTmpStorageOnOOM.Load()
	memQuota := sessVars.MemTracker.GetBytesLimit() // sessVars.MemQuotaQuery && hint
	rowSize := getAvgRowSize(build.statsInfo(), build.Schema().Columns)
	spill := oomUseTmpStorage && memQuota > 0 && rowSize*buildCnt > float64(memQuota) && p.storeTp != kv.TiFlash
	// Cost of building hash table.
	cpuFactor := sessVars.GetCPUFactor()
	diskFactor := sessVars.GetDiskFactor()
	memoryFactor := sessVars.GetMemoryFactor()
	concurrencyFactor := sessVars.GetConcurrencyFactor()

	cpuCost := buildCnt * cpuFactor
	memoryCost := buildCnt * memoryFactor
	diskCost := buildCnt * diskFactor * rowSize
	// Number of matched row pairs regarding the equal join conditions.
	helper := &fullJoinRowCountHelper{
		sctx:            p.SCtx(),
		cartesian:       false,
		leftProfile:     p.children[0].statsInfo(),
		rightProfile:    p.children[1].statsInfo(),
		leftJoinKeys:    p.LeftJoinKeys,
		rightJoinKeys:   p.RightJoinKeys,
		leftSchema:      p.children[0].Schema(),
		rightSchema:     p.children[1].Schema(),
		leftNAJoinKeys:  p.LeftNAJoinKeys,
		rightNAJoinKeys: p.RightNAJoinKeys,
	}
	numPairs := helper.estimate()
	// For semi-join class, if `OtherConditions` is empty, we already know
	// the join results after querying hash table, otherwise, we have to
	// evaluate those resulted row pairs after querying hash table; if we
	// find one pair satisfying the `OtherConditions`, we then know the
	// join result for this given outer row, otherwise we have to iterate
	// to the end of those pairs; since we have no idea about when we can
	// terminate the iteration, we assume that we need to iterate half of
	// those pairs in average.
	if p.JoinType == SemiJoin || p.JoinType == AntiSemiJoin ||
		p.JoinType == LeftOuterSemiJoin || p.JoinType == AntiLeftOuterSemiJoin {
		if len(p.OtherConditions) > 0 {
			numPairs *= 0.5
		} else {
			numPairs = 0
		}
	}
	if hasCostFlag(costFlag, CostFlagUseTrueCardinality) {
		numPairs = getOperatorActRows(p)
	}
	// Cost of querying hash table is cheap actually, so we just compute the cost of
	// evaluating `OtherConditions` and joining row pairs.
	probeCost := numPairs * cpuFactor
	probeDiskCost := numPairs * diskFactor * rowSize
	// Cost of evaluating outer filter.
	if len(p.LeftConditions)+len(p.RightConditions) > 0 {
		// Input outer count for the above compution should be adjusted by SelectionFactor.
		probeCost *= SelectionFactor
		probeDiskCost *= SelectionFactor
		probeCost += probeCnt * cpuFactor
	}
	diskCost += probeDiskCost
	probeCost /= float64(p.Concurrency)
	// Cost of additional concurrent goroutines.
	cpuCost += probeCost + float64(p.Concurrency+1)*concurrencyFactor
	// Cost of traveling the hash table to resolve missing matched cases when building the hash table from the outer table
	if p.UseOuterToBuild {
		if spill {
			// It runs in sequence when build data is on disk. See handleUnmatchedRowsFromHashTableInDisk
			cpuCost += buildCnt * cpuFactor
		} else {
			cpuCost += buildCnt * cpuFactor / float64(p.Concurrency)
		}
		diskCost += buildCnt * diskFactor * rowSize
	}

	if spill {
		memoryCost *= float64(memQuota) / (rowSize * buildCnt)
	} else {
		diskCost = 0
	}
	if op != nil {
		setPhysicalHashJoinCostDetail(p, op, spill, buildCnt, probeCnt, cpuFactor, rowSize, numPairs,
			cpuCost, probeCost, memoryCost, diskCost, probeDiskCost,
			diskFactor, memoryFactor, concurrencyFactor,
			memQuota)
	}
	return cpuCost + memoryCost + diskCost
}


// GetCost computes cost of merge join operator itself.
func (p *PhysicalMergeJoin) GetCost(lCnt, rCnt float64, costFlag uint64) float64 {
	outerCnt := lCnt
	innerCnt := rCnt
	innerKeys := p.RightJoinKeys
	innerSchema := p.children[1].Schema()
	innerStats := p.children[1].statsInfo()
	if p.JoinType == RightOuterJoin {
		outerCnt = rCnt
		innerCnt = lCnt
		innerKeys = p.LeftJoinKeys
		innerSchema = p.children[0].Schema()
		innerStats = p.children[0].statsInfo()
	}
	helper := &fullJoinRowCountHelper{
		sctx:          p.SCtx(),
		cartesian:     false,
		leftProfile:   p.children[0].statsInfo(),
		rightProfile:  p.children[1].statsInfo(),
		leftJoinKeys:  p.LeftJoinKeys,
		rightJoinKeys: p.RightJoinKeys,
		leftSchema:    p.children[0].Schema(),
		rightSchema:   p.children[1].Schema(),
	}
	numPairs := helper.estimate()
	if p.JoinType == SemiJoin || p.JoinType == AntiSemiJoin ||
		p.JoinType == LeftOuterSemiJoin || p.JoinType == AntiLeftOuterSemiJoin {
		if len(p.OtherConditions) > 0 {
			numPairs *= 0.5
		} else {
			numPairs = 0
		}
	}
	if hasCostFlag(costFlag, CostFlagUseTrueCardinality) {
		numPairs = getOperatorActRows(p)
	}
	sessVars := p.ctx.GetSessionVars()
	probeCost := numPairs * sessVars.GetCPUFactor()
	// Cost of evaluating outer filters.
	var cpuCost float64
	if len(p.LeftConditions)+len(p.RightConditions) > 0 {
		probeCost *= SelectionFactor
		cpuCost += outerCnt * sessVars.GetCPUFactor()
	}
	cpuCost += probeCost
	// For merge join, only one group of rows with same join key(not null) are cached,
	// we compute average memory cost using estimated group size.
	NDV, _ := getColsNDVWithMatchedLen(innerKeys, innerSchema, innerStats)
	memoryCost := (innerCnt / NDV) * sessVars.GetMemoryFactor()
	return cpuCost + memoryCost
}

HashJoin要考虑到内存不够的情况,因此在计算到数据不够时,会将对应的数据压入硬盘。但它对数据的顺序并无要求,因此可以并发的去做。见:

go 复制代码
	// Cost of traveling the hash table to resolve missing matched cases when building the hash table from the outer table
	if p.UseOuterToBuild {
		if spill {
			// It runs in sequence when build data is on disk. See handleUnmatchedRowsFromHashTableInDisk
			cpuCost += buildCnt * cpuFactor
		} else {
			cpuCost += buildCnt * cpuFactor / float64(p.Concurrency)
		}
		diskCost += buildCnt * diskFactor * rowSize
	}

而MergeJoin的代价显然会更小,但能够选则到这个计划也有较高的要求:当两个关联表要 Join 的字段需要按排好的顺序读取时,适用 Merge Join 算法。

5.4 老版本 物理执行计划的选择

5.4.1 代价的评估

这块代码主要是在plan_cost_ver1.goplan_cost_ver2.go。v2对代价公式进行了更精确的回归校准,调整了部分代价公式,比此前版本的代价公式更加准确。代码上也更为简洁:v2只暴露出了一个公共方法,内部通过不同的类型做转发。

go 复制代码
// GetPlanCost returns the cost of this plan.
func GetPlanCost(p PhysicalPlan, taskType property.TaskType, option *PlanCostOption) (float64, error) {
	return getPlanCost(p, taskType, option)
}

func getPlanCost(p PhysicalPlan, taskType property.TaskType, option *PlanCostOption) (float64, error) {
	if p.SCtx().GetSessionVars().CostModelVersion == modelVer2 {
		planCost, err := p.getPlanCostVer2(taskType, option)
		return planCost.cost, err
	}
	return p.getPlanCostVer1(taskType, option)
}

根据不同的PhysicalPlan类型,会找到不同绑定方法:

v1的部分方法展示:

select代价计算方式

go 复制代码
// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
// plan-cost = child-cost + filter-cost
func (p *PhysicalSelection) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
	if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
		return p.planCostVer2, nil
	}

	inputRows := getCardinality(p.children[0], option.CostFlag)
	cpuFactor := getTaskCPUFactorVer2(p, taskType)

	filterCost := filterCostVer2(option, inputRows, p.Conditions, cpuFactor)

	childCost, err := p.children[0].getPlanCostVer2(taskType, option)
	if err != nil {
		return zeroCostVer2, err
	}

	p.planCostVer2 = sumCostVer2(filterCost, childCost)
	p.planCostInit = true
	return p.planCostVer2, nil
}

这部分代码简单易读。代价就是子查询的代价+筛选的代价。

那么问题来了,中索引的和不中索引的代价应该是不一样的。这里没有体现出来啊。仔细看childCost, err := p.children[0].getPlanCostVer2(taskType, option),这里是会去获取子物理执行计划的代价。

go 复制代码
// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
func (p *PointGetPlan) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
	if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
		return p.planCostVer2, nil
	}

	if p.accessCols == nil { // from fast plan code path
		p.planCostVer2 = zeroCostVer2
		p.planCostInit = true
		return zeroCostVer2, nil
	}
	rowSize := getAvgRowSize(p.stats, p.schema.Columns)
	netFactor := getTaskNetFactorVer2(p, taskType)

	p.planCostVer2 = netCostVer2(option, 1, rowSize, netFactor)
	p.planCostInit = true
	return p.planCostVer2, nil
}

func netCostVer2(option *PlanCostOption, rows, rowSize float64, netFactor costVer2Factor) costVer2 {
	return newCostVer2(option, netFactor,
		rows*rowSize*netFactor.Value,
		func() string { return fmt.Sprintf("net(%v*rowsize(%v)*%v)", rows, rowSize, netFactor) })
}



// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
// plan-cost = rows * log2(row-size) * scan-factor
// log2(row-size) is from experiments.
func (p *PhysicalTableScan) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
	if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
		return p.planCostVer2, nil
	}

	rows := getCardinality(p, option.CostFlag)
	var rowSize float64
	if p.StoreType == kv.TiKV {
		rowSize = getAvgRowSize(p.stats, p.tblCols) // consider all columns if TiKV
	} else { // TiFlash
		rowSize = getAvgRowSize(p.stats, p.schema.Columns)
	}
	rowSize = math.Max(rowSize, 2.0)
	scanFactor := getTaskScanFactorVer2(p, p.StoreType, taskType)

	p.planCostVer2 = scanCostVer2(option, rows, rowSize, scanFactor)

	// give TiFlash a start-up cost to let the optimizer prefers to use TiKV to process small table scans.
	if p.StoreType == kv.TiFlash {
		p.planCostVer2 = sumCostVer2(p.planCostVer2, scanCostVer2(option, 10000, rowSize, scanFactor))
	}

	p.planCostInit = true
	return p.planCostVer2, nil
}

func scanCostVer2(option *PlanCostOption, rows, rowSize float64, scanFactor costVer2Factor) costVer2 {
	if rowSize < 1 {
		rowSize = 1
	}
	return newCostVer2(option, scanFactor,
		// rows * log(row-size) * scanFactor, log2 from experiments
		rows*math.Log2(rowSize)*scanFactor.Value,
		func() string { return fmt.Sprintf("scan(%v*logrowsize(%v)*%v)", rows, rowSize, scanFactor) })
}

scanFactor的代价默认是40.7,netFactor的代价默认是3.96。结合代码来看,命中索引的代价更优。

join代价计算方式

go 复制代码
// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
// plan-cost = build-child-cost + build-filter-cost +
// (probe-cost + probe-filter-cost) / concurrency
// probe-cost = probe-child-cost * build-rows / batchRatio
func (p *PhysicalIndexJoin) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
	return p.getIndexJoinCostVer2(taskType, option, 0)
}

func (p *PhysicalIndexHashJoin) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
	return p.getIndexJoinCostVer2(taskType, option, 1)
}

func (p *PhysicalIndexMergeJoin) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
	return p.getIndexJoinCostVer2(taskType, option, 2)
}

func (p *PhysicalIndexJoin) getIndexJoinCostVer2(taskType property.TaskType, option *PlanCostOption, indexJoinType int) (costVer2, error) {
	if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
		return p.planCostVer2, nil
	}

	build, probe := p.children[1-p.InnerChildIdx], p.children[p.InnerChildIdx]
	buildRows := getCardinality(build, option.CostFlag)
	buildRowSize := getAvgRowSize(build.Stats(), build.Schema().Columns)
	probeRowsOne := getCardinality(probe, option.CostFlag)
	probeRowsTot := probeRowsOne * buildRows
	probeRowSize := getAvgRowSize(probe.Stats(), probe.Schema().Columns)
	buildFilters, probeFilters := p.LeftConditions, p.RightConditions
	probeConcurrency := float64(p.ctx.GetSessionVars().IndexLookupJoinConcurrency())
	cpuFactor := getTaskCPUFactorVer2(p, taskType)
	memFactor := getTaskMemFactorVer2(p, taskType)
	requestFactor := getTaskRequestFactorVer2(p, taskType)

	buildFilterCost := filterCostVer2(option, buildRows, buildFilters, cpuFactor)
	buildChildCost, err := build.getPlanCostVer2(taskType, option)
	if err != nil {
		return zeroCostVer2, err
	}
	buildTaskCost := newCostVer2(option, cpuFactor,
		buildRows*10*cpuFactor.Value,
		func() string { return fmt.Sprintf("cpu(%v*10*%v)", buildRows, cpuFactor) })
	startCost := newCostVer2(option, cpuFactor,
		10*3*cpuFactor.Value,
		func() string { return fmt.Sprintf("cpu(10*3*%v)", cpuFactor) })

	probeFilterCost := filterCostVer2(option, probeRowsTot, probeFilters, cpuFactor)
	probeChildCost, err := probe.getPlanCostVer2(taskType, option)
	if err != nil {
		return zeroCostVer2, err
	}

	var hashTableCost costVer2
	switch indexJoinType {
	case 1: // IndexHashJoin
		hashTableCost = hashBuildCostVer2(option, buildRows, buildRowSize, float64(len(p.RightJoinKeys)), cpuFactor, memFactor)
	case 2: // IndexMergeJoin
		hashTableCost = newZeroCostVer2(traceCost(option))
	default: // IndexJoin
		hashTableCost = hashBuildCostVer2(option, probeRowsTot, probeRowSize, float64(len(p.LeftJoinKeys)), cpuFactor, memFactor)
	}

	// IndexJoin executes a batch of rows at a time, so the actual cost of this part should be
	//  `innerCostPerBatch * numberOfBatches` instead of `innerCostPerRow * numberOfOuterRow`.
	// Use an empirical value batchRatio to handle this now.
	// TODO: remove this empirical value.
	batchRatio := 6.0
	probeCost := divCostVer2(mulCostVer2(probeChildCost, buildRows), batchRatio)

	// Double Read Cost
	doubleReadCost := newZeroCostVer2(traceCost(option))
	if p.ctx.GetSessionVars().IndexJoinDoubleReadPenaltyCostRate > 0 {
		batchSize := float64(p.ctx.GetSessionVars().IndexJoinBatchSize)
		taskPerBatch := 1024.0 // TODO: remove this magic number
		doubleReadTasks := buildRows / batchSize * taskPerBatch
		doubleReadCost = doubleReadCostVer2(option, doubleReadTasks, requestFactor)
		doubleReadCost = mulCostVer2(doubleReadCost, p.ctx.GetSessionVars().IndexJoinDoubleReadPenaltyCostRate)
	}

	p.planCostVer2 = sumCostVer2(startCost, buildChildCost, buildFilterCost, buildTaskCost, divCostVer2(sumCostVer2(doubleReadCost, probeCost, probeFilterCost, hashTableCost), probeConcurrency))
	p.planCostInit = true
	return p.planCostVer2, nil
}

关键在于:

go 复制代码
	switch indexJoinType {
	case 1: // IndexHashJoin
		hashTableCost = hashBuildCostVer2(option, buildRows, buildRowSize, float64(len(p.RightJoinKeys)), cpuFactor, memFactor)
	case 2: // IndexMergeJoin
		hashTableCost = newZeroCostVer2(traceCost(option))
	default: // IndexJoin
		hashTableCost = hashBuildCostVer2(option, probeRowsTot, probeRowSize, float64(len(p.LeftJoinKeys)), cpuFactor, memFactor)
	}

对应方法:

go 复制代码
func hashBuildCostVer2(option *PlanCostOption, buildRows, buildRowSize, nKeys float64, cpuFactor, memFactor costVer2Factor) costVer2 {
	// TODO: 1) consider types of keys, 2) dedicated factor for build-probe hash table
	hashKeyCost := newCostVer2(option, cpuFactor,
		buildRows*nKeys*cpuFactor.Value,
		func() string { return fmt.Sprintf("hashkey(%v*%v*%v)", buildRows, nKeys, cpuFactor) })
	hashMemCost := newCostVer2(option, memFactor,
		buildRows*buildRowSize*memFactor.Value,
		func() string { return fmt.Sprintf("hashmem(%v*%v*%v)", buildRows, buildRowSize, memFactor) })
	hashBuildCost := newCostVer2(option, cpuFactor,
		buildRows*cpuFactor.Value,
		func() string { return fmt.Sprintf("hashbuild(%v*%v)", buildRows, cpuFactor) })
	return sumCostVer2(hashKeyCost, hashMemCost, hashBuildCost)
}

func newZeroCostVer2(trace bool) (ret costVer2) {
	if trace {
		ret.trace = &costTrace{make(map[string]float64), ""}
	}
	return
}

简单的看一下代码,我们可以发现,从大多数的场景来看,按照代价从小到大来排,这几种Join是MergeJoin<HashJoin<IndexJoin。

5.4.2执行计划枚举与择优

总得来说这块代码较为简单,本质就是枚举所有符合条件的物理执行计划,并挑选出代价最小的执行计划,故不再列举代码。有兴趣的同学可以根据以下大纲自行翻阅:

log 复制代码
|--planner/core/find_best_task.go

\-- findBestTask

\-- enumeratePhysicalPlans4Task

\-- compareTaskCost

\-- getTaskPlanCost

|-- planner/core/plan_cost_ver2.go

\-- getPlanCost

6.其他

6.1 参考与引用的文章

6.2 知识补充:code generation && vectorized execution

数据库引擎执行器中非常出名的两种优化方式,code generation和 vectorized execution。

code generation主要是根据上下文来生成一整段优化过的代码,这与那种嵌套大量if...else、for循环、虚方法的代码完全相反,完全面向性能考虑。

vectorized execution基于拉模型。相比于一次拉一个tuple来说,它的批量拉取减少了多次拉取的开销,同时还可以使用到SIMD。基于这种场景,vectorized execution的优化更加适用于列式数据库。

相关推荐
超级小忍1 小时前
如何配置 MySQL 允许远程连接
数据库·mysql·adb
吹牛不交税1 小时前
sqlsugar WhereIF条件的大于等于和等于查出来的坑
数据库·mysql
hshpy2 小时前
setting up Activiti BPMN Workflow Engine with Spring Boot
数据库·spring boot·后端
文牧之3 小时前
Oracle 审计参数:AUDIT_TRAIL 和 AUDIT_SYS_OPERATIONS
运维·数据库·oracle
篱笆院的狗3 小时前
如何使用 Redis 快速实现布隆过滤器?
数据库·redis·缓存
G探险者4 小时前
《深入理解 Nacos 集群与 Raft 协议》系列五:为什么集群未过半,系统就不可用?从 Raft 的投票机制说起
分布式·后端
G探险者4 小时前
《深入理解 Nacos 集群与 Raft 协议》系列一:为什么 Nacos 集群必须过半节点存活?从 Raft 协议说起
分布式·后端
G探险者4 小时前
《深入理解 Nacos 集群与 Raft 协议》系列四:日志复制机制:Raft 如何确保提交可靠且幂等
分布式·后端
G探险者4 小时前
《深入理解 Nacos 集群与 Raft 协议》系列三:日志对比机制:Raft 如何防止数据丢失与错误选主
分布式·后端
G探险者4 小时前
《深入理解 Nacos 集群与 Raft 协议》系列二:Raft 为什么要“选主”?选主的触发条件与机制详解
分布式·后端