最近遇到一个Oracle的SQL语句引发的线上故障,数据库服务被重启。主机288核,1.5T也没抗住。
select count(*) from dev_db.t1 t1,dev_db.t2 t2 where t1.object_name = t2.object_name or t1.DATA_OBJECT_ID = t2.DATA_OBJECT_ID and t1.OBJECT_ID = 123 and rownum = 1;
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 77 | 31M (1)| 00:20:50 |
| 1 | SORT AGGREGATE | | 1 | 77 | | |
| 2 | COUNT | | | | | |
|* 3 | FILTER | | | | | |
| 4 | MERGE JOIN CARTESIAN| | 6131M| 439G| 31M (1)| 00:20:50 |
| 5 | TABLE ACCESS FULL | T1 | 78306 | 3135K| 411 (1)| 00:00:01 |
| 6 | BUFFER SORT | | 78307 | 2752K| 31M (1)| 00:20:50 |
| 7 | TABLE ACCESS FULL | T2 | 78307 | 2752K| 409 (1)| 00:00:01 |
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("T1"."OBJECT_NAME"="T2"."OBJECT_NAME" OR
"T1"."DATA_OBJECT_ID"="T2"."DATA_OBJECT_ID" AND "T2"."DATA_OBJECT_ID"
IS NOT NULL AND "T1"."DATA_OBJECT_ID" IS NOT NULL AND
"T1"."OBJECT_ID"=123 AND ROWNUM=1)
在执行计划中看到步骤4产生了笛卡尔积连接,成本相当的高,这两个表仅有7万行数据,执行时间超过了5分钟。
数据库中关于NOT、AND、OR的逻辑运算优先级就是按这个顺序进行的,判断是开发人员失误造成的。下面是经过优化后的执行计划。
select count(*) from dev_db.t1 t1,dev_db.t2 t2 where (t1.object_name = t2.object_name or t1.DATA_OBJECT_ID = t2.DATA_OBJECT_ID) and t1.OBJECT_ID = 123 and rownum = 1;
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 77 | 822 (1)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 77 | | |
|* 2 | COUNT STOPKEY | | | | | |
| 3 | NESTED LOOPS | | 1 | 77 | 822 (1)| 00:00:01 |
|* 4 | TABLE ACCESS FULL| T1 | 1 | 41 | 411 (1)| 00:00:01 |
|* 5 | TABLE ACCESS FULL| T2 | 1 | 36 | 411 (1)| 00:00:01 |
-----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter(ROWNUM=1)
4 - filter("T1"."OBJECT_ID"=123)
5 - filter("T1"."OBJECT_NAME"="T2"."OBJECT_NAME" OR
"T2"."DATA_OBJECT_ID" IS NOT NULL AND
"T1"."DATA_OBJECT_ID"="T2"."DATA_OBJECT_ID" AND "T1"."DATA_OBJECT_ID"
IS NOT NULL)
在这个执行计划中,t1表作为驱动表,过滤条件是object_id=123,线上这个条件是主键,因为懒没有创建这个主键,所以是全表扫描。
t2表作为被驱动表只匹配到了一条数据。因为STOPKEY的优化,在取到一条数据后SQL停止,无需扫描全部数据,因此执行时间非常短。
总结
本次问题原因是因为一个逻辑运算的优先级判断错误,以至于数据库资源被使用完导致数据库节点被重启。
对于自己拿不准的SQL语句,查阅官方文档是最权威的参考依据。
其次,在测试环境就算只有少量数据也可以看到执行计划是否合理。
最后,上线时SQL语句经过把关,可以最大限度减少SQL语句引发的线上故障。