理论+应用,带你了解数据库资源池

本文分享自华为云社区《GaussDB(DWS)监控工具指南(三)资源池级监控【绽放吧!GaussDB(DWS)云原生数仓】》,作者:幕后小黑爪。

一、资源池

在数据库最初阶段,是没有资源概念的,给数据库输入SQL语句,数据库输出结果,在简单业务场景下,用户独占数据库是不存在资源争抢问题的。随着数据库业务增长,用户也越来越多,此时不同用户间的SQL会抢占操作系统的资源(CPU、内存、IO、网络等),如果不加限制的话就会影响整个集群的用户,造成集群不可用的情况。为了防止在这种场景发生,需要对用户业务SQL进行区分,对不同的用户需要资源分配和管控。为此,资源池应运而生,资源池作为一种逻辑媒介,连接用户和系统资源,管控每个用户的资源使用,保证集群的可用状态。

二、GaussDB(DWS)中的资源池

当管理员创建用户后,会自动绑定在默认资源池default_pool上,从网页上,可以再创建资源池,然后绑定相应的用户在对应的资源池上。此时用户下发SQL语句执行,下发的语句就会收到资源池配置参数的管控。通过资源池可以划分不同用户的资源使用情况,简化了集群的管理,它可以统一管理所有的系统计算资源。这意味着管理员只需要管理资源池,而不是每个节点上的资源。

GaussDB(DWS)资源池(租户)功能支持通过管控CPU、并发、内存、网络等手段对用户下发的业务语句进行管控,在不同的维度对用户语句进行管控。具体来说支持设置资源池参数的方式控制并发、内存、CPU利用率等能力。当用户的语句是未知新语句时,也可以通过一些异常规则来控制语句情况,比如查杀超过限制的烂SQL等,在此基础上,还支持设置黑名单的方式严格管控用户下发的语句,这些功能后续会进行说明,本文聚焦资源池先不展开。

通过下列语句可查询资源池信息

复制代码
postgres=# select * from pg_resource_pool;

respool_name | mem_percent | cpu_affinity | control_group | active_statements | max_dop | memory_limit | parentid | io_limits | io_priority | nodegroup | is_foreign | short_acc | except_rule | weight

----------------------+-------------+--------------+---------------------+-------------------+---------+--------------+------------+-----------+-------------+------------------+------------+-----------+-------------+--------

default_pool | 0 | -1 | DefaultClass:Medium | -1 | -1 | default | 0 | 0 | None | installation | f | t | None | -1

respool_1 | 0 | -1 | ClassN1:wn1 | 10 | -1 | default | 0 | 0 | None | logical_cluster1 | f | t | None | -1

respool_grp_1 | 20 | -1 | ClassG1 | 10 | -1 | default | 0 | 0 | None | logical_cluster1 | f | t | None | -1

respool_g1_job_1 | 20 | -1 | ClassG1:wg1_1 | 10 | -1 | default | 2147484586 | 0 | None | logical_cluster1 | f | t | None | -1

respool_g1_job_2 | 20 | -1 | ClassG1:wg1_2 | 10 | -1 | default | 2147484586 | 0 | None | logical_cluster1 | f | t | None | -1

respool_0_mempercent | 0 | -1 | DefaultClass:Medium | 10 | -1 | default | 0 | 0 | None | logical_cluster1 | f | t | None | -1

(6 rows)

对于资源池的相关参数,用户可通过管控面进行配置,同时也支持管理员通过语句来修改资源池配置,如下所示,可修改默认资源池的快车道并发限制。其他参数同理,不过该操作有风险,建议用户在GaussDB(DWS)运维人员确认后执行。

复制代码
alter resource pool default_pool with (max_dop=1);

三、资源池监控

GaussDB(DWS)为用户提供了多维度的资源监控视图,可支持从不同维度查询集群状态。

GaussDB(DWS)提供资源池级别的监控能力,监控SQL语句的运行情况,主要包含实时监控和历史监控,跟用户监控类似,资源池监控展示了每个资源池的运行作业数、排队作业数、内存使用、内存使用上限、 CPU使用情况、读写IO情况等,通过下列语句可进行查询获取

复制代码
postgres=# select * from gs_respool_resource_info;

nodegroup | rpname | cgroup | ref_count | fast_run | fast_wait | fast_limit | slow_run | slow_wait | slow_limit | used_cpu | cpu_limit | used_mem | estimate_mem | mem_limit | read_kbytes | write_kbytes | read_counts | write_counts | read_speed | write_speed

-----------+--------------+---------------------+-----------+----------+-----------+------------+----------+-----------+------------+----------+-----------+----------+--------------+-----------+-------------+--------------+-------------+--------------+------------+-------------

lc1 | pool_group | ClassN | 0 | 0 | 0 | -1 | 0 | 0 | 10 | 0 | 312 | 0 | 0 | 116844 | 0 | 0 | 0 | 0 | 0 | 0

lc1 | pool_work | ClassN:wg1 | 0 | 0 | 0 | 10 | 0 | 0 | 10 | 0 | 312 | 0 | 0 | 23364 | 0 | 0 | 0 | 0 | 0 | 0

lc2 | default_pool | DefaultClass:Medium | 0 | 0 | 0 | -1 | 0 | 0 | -1 | 0 | 208 | 0 | 0 | 584220 | 0 | 0 | 0 | 0 | 0 | 0

lc1 | resp_other | DefaultClass:Medium | 0 | 0 | 0 | -1 | 0 | 0 | 100 | 0 | 312 | 0 | 0 | 175260 | 0 | 0 | 0 | 0 | 0 | 0

lc1 | default_pool | DefaultClass:Medium | 0 | 0 | 0 | -1 | 0 | 0 | -1 | 0 | 312 | 0 | 0 | 584220 | 0 | 0 | 0 | 0 | 0 | 0

(5 rows)

其中,nodegroup 代表资源池所属的逻辑集群信息。fast_run、slow_run代表资源池快车道运行数、慢车道运行数,fast_wait、slow_wait为快车道和慢车道的排队作业数。其中,slow_wait也包含了CCN排队的作业。fast_limit和slow_limit代表了快慢车道的并发上限,cpu_limit代表了资源池配置的cpu设置的限额是多少,通过used_cpu可以看到不同资源池的CPU使用率情况。estimate_mem为资源池内用户下发作业的估算内存总和,used_mem为资源池实际使用的内存,mem_limit为资源池设置的可用内存上限。

同样,历史资源监控也提供了历史问题定位的能力,该视图会30s采集一次,timestamp为采集的时刻。

复制代码
postgres=# select * from gs_respool_resource_history;

timestamp | nodegroup | rpname | cgroup | ref_count | fast_run | fast_wait | fast_limit | slow_run | slow_wait | slow_limit | used_cpu | cpu_limit | used_mem | estimate_mem | mem_limit | read_kbytes | write_kbytes | read_counts | write_counts | read_speed | write_speed

-------------------------------+-----------+--------------+---------------------+-----------+----------+-----------+------------+----------+-----------+------------+----------+-----------+----------+--------------+-----------+-------------+--------------+-------------+--------------+------------+-------------

2023-10-20 20:24:14.715107+08 | lc1 | pool_group | ClassN | 0 | 0 | 0 | -1 | 0 | 0 | 10 | 0 | 312 | 0 | 0 | 116844 | 0 | 0 | 0 | 0 | 0 | 0

2023-10-20 20:24:14.715107+08 | lc1 | pool_work | ClassN:wg1 | 0 | 0 | 0 | 10 | 0 | 0 | 10 | 0 | 312 | 0 | 0 | 23364 | 0 | 0 | 0 | 0 | 0 | 0

2023-10-20 20:24:14.715107+08 | lc2 | default_pool | DefaultClass:Medium | 0 | 0 | 0 | -1 | 0 | 0 | -1 | 0 | 208 | 0 | 0 | 584220 | 0 | 0 | 0 | 0 | 0 | 0

2023-10-20 20:24:14.715107+08 | lc1 | resp_other | DefaultClass:Medium | 0 | 0 | 0 | -1 | 0 | 0 | 100 | 0 | 312 | 0 | 0 | 175260 | 0 | 0 | 0 | 0 | 0 | 0

2023-10-20 20:24:14.715107+08 | lc1 | default_pool | DefaultClass:Medium | 0 | 0 | 0 | -1 | 0 | 0 | -1 | 0 | 312 | 0 | 0 | 584220 | 0 | 0 | 0 | 0 | 0 | 0

2023-10-20 20:24:44.791512+08 | lc1 | pool_group | ClassN | 0 | 0 | 0 | -1 | 0 | 0 | 10 | 0 | 312 | 0 | 0 | 116844 | 0 | 0 | 0 | 0 | 0 | 0

2023-10-20 20:24:44.791512+08 | lc1 | pool_work | ClassN:wg1 | 0 | 0 | 0 | 10 | 0 | 0 | 10 | 0 | 312 | 0 | 0 | 23364 | 0 | 0 | 0 | 0 | 0 | 0

2023-10-20 20:24:44.791512+08 | lc2 | default_pool | DefaultClass:Medium | 0 | 0 | 0 | -1 | 0 | 0 | -1 | 0 | 208 | 0 | 0 | 584220 | 0 | 0 | 0 | 0 | 0 | 0

2023-10-20 20:24:44.791512+08 | lc1 | resp_other | DefaultClass:Medium | 0 | 0 | 0 | -1 | 0 | 0 | 100 | 0 | 312 | 0 | 0 | 175260 | 0 | 0 | 0 | 0 | 0 | 0

2023-10-20 20:24:44.791512+08 | lc1 | default_pool | DefaultClass:Medium | 0 | 0 | 0 | -1 | 0 | 0 | -1 | 0 | 312 | 0 | 0 | 584220 | 0 | 0 | 0 | 0 | 0 | 0

...

四、通过资源池监控分析定位问题(待补充)

1、当发现业务反馈语句阻塞严重,不执行作业,可查询实时资源池监控或者历史资源池监控进行分析,看是否作业堆积排队,如果fast_limit上的限制为10,,fast_run字段也为10,然后fast_wait较多,此时可尝试修改资源池参数max_dop,适当调大并发上限。

2、业务反馈跑批业务在正常运行的情况下批量变慢,此时可观察历史资源池监控对比观察,统计劣化前后同一批作业一段时间的内存资源使用情况,也可通过当时资源池作业排队现象来定位问题。

五、更好用的监控视图

为了提升系统可用性,GaussDB(DWS)也提供了更便捷,更易用的视图用以帮用户进行观察系统状态和定位问题。

在内核821版本中,用户可使用gs_query_monitor、gs_user_monitor、gs_respool_monitor视图进行语句级、用户级、资源池的资源监控,这些视图以GaussDB(DWS)监控工具指南系列中所讲的视图为基础,选取常用的定位字段,为现网用户提供更易用的一套实时监控脚本。

具体效果如下:

1. 作业监控

复制代码
postgres=# select * from gs_query_monitor;

usename | nodename | nodegroup | rpname | priority | xact_start | query_start | block_time | duration | query_band | attribute | lane | status | queue | used_mem | estimate_mem | used_cpu | read_speed | write_speed | send_speed | recv_speed | dn_count | stream_count | pid | lw

tid | query_id | unique_sql_id | query

--------------+----------+------------------+--------------+----------+-------------------------------+-------------------------------+------------+----------+------------+-------------+------+---------+-------+----------+--------------+----------+------------+-------------+------------+------------+----------+--------------+-----------------+---

-----+-------------------+---------------+--------------------------------------------------

user_default | cn_5001 | logical_cluster1 | default_pool | Medium | 2023-10-30 16:39:28.754207+08 | 2023-10-30 16:39:28.748855+08 | 59 | 0 | | Complicated | slow | pending | CCN | 0 | 4360 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 139906878865264 | 98

2280 | 72902018968076864 | 2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);

user_default | cn_5001 | logical_cluster1 | default_pool | Medium | 2023-10-30 16:39:28.760305+08 | 2023-10-30 16:39:28.754861+08 | 59 | 0 | | Complicated | slow | pending | CCN | 0 | 4360 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 139906878866632 | 98

2283 | 72902018968076871 | 2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);

user_default | cn_5001 | logical_cluster1 | default_pool | Medium | 2023-10-30 16:39:28.761491+08 | 2023-10-30 16:39:28.756124+08 | 59 | 0 | | Complicated | slow | pending | CCN | 0 | 4360 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 139906878865720 | 98

2281 | 72902018968076872 | 2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);

user_default | cn_5001 | logical_cluster1 | default_pool | Medium | 2023-10-30 16:39:28.768333+08 | 2023-10-30 16:39:28.762653+08 | 59 | 0 | | Complicated | slow | pending | CCN | 0 | 4360 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 139906878867544 | 98

2285 | 72902018968076877 | 2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);

user_default | cn_5001 | logical_cluster1 | default_pool | Medium | 2023-10-30 16:39:28.772288+08 | 2023-10-30 16:39:28.766933+08 | 59 | 0 | | Complicated | slow | pending | CCN | 0 | 4360 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 139906878868912 | 98

2288 | 72902018968076881 | 2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);

user_default | cn_5001 | logical_cluster1 | default_pool | Medium | 2023-10-30 16:39:28.772304+08 | 2023-10-30 16:39:28.766966+08 | 59 | 0 | | Complicated | slow | pending | CCN | 0 | 4360 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 139906878867088 | 98

2284 | 72902018968076882 | 2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);

user_default | cn_5001 | logical_cluster1 | default_pool | Medium | 2023-10-30 16:39:28.777958+08 | 2023-10-30 16:39:28.772572+08 | 59 | 0 | | Complicated | slow | pending | CCN | 0 | 4360 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 139906878868000 | 98

2286 | 72902018968076888 | 2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);

user_default | cn_5001 | logical_cluster1 | default_pool | Medium | 2023-10-30 16:39:28.779373+08 | 2023-10-30 16:39:28.773997+08 | 59 | 0 | | Complicated | slow | pending | CCN | 0 | 4360 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 139906878868456 | 98

2287 | 72902018968076889 | 2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);

user_default | cn_5001 | logical_cluster1 | default_pool | Medium | 2023-10-30 16:39:28.753845+08 | 2023-10-30 16:39:28.748498+08 | 0 | 59 | | Complicated | slow | running | None | 4 | 4360 | .289 | 0 | 0 | 0 | 0 | 0 | 0 | 139906878864808 | 98

2279 | 72902018968076862 | 2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);

user_default | cn_5001 | logical_cluster1 | default_pool | Medium | 2023-10-30 16:39:28.753957+08 | 2023-10-30 16:39:28.748609+08 | 0 | 59 | | Complicated | slow | running | None | 4 | 4360 | .288 | 0 | 0 | 17 | 23 | 0 | 0 | 139906878866176 | 98

2282 | 72902018968076863 | 2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);

(10 rows)

2. 用户监控

复制代码
postgres=# select * from gs_user_monitor;

usename | rpname | nodegroup | session_count | active_count | global_wait | fast_run | fast_wait | slow_run | slow_wait | used_mem | estimate_mem | used_cpu | read_speed | write_speed | send_speed | recv_speed | used_space | space_limit | used_temp_space | temp_space_limit | used_spill_space | spill_space_limit

------------------+---------------+------------------+---------------+--------------+-------------+----------+-----------+----------+-----------+----------+--------------+----------+------------+-------------+------------+------------+------------+-------------+-----------------+------------------+------------------+-------------------

logical_cluster2 | default_pool | logical_cluster2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 | 0 | -1 | 0 | -1

user_grp_1 | respool_grp_1 | logical_cluster1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 | 0 | -1 | 0 | -1

logical_cluster1 | default_pool | logical_cluster1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1834424 | -1 | 0 | -1 | 0 | -1

user_normal | respool_1 | logical_cluster1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 | 0 | -1 | 0 | -1

user_default | default_pool | logical_cluster1 | 10 | 10 | 0 | 0 | 0 | 2 | 8 | 8 | 8720 | .563 | 0 | 15 | 0 | 0 | 640080 | -1 | 0 | -1 | 0 | -1

(5 rows)

3. 资源池监控

复制代码
postgres=# select * from gs_respool_monitor;

rpname | nodegroup | cn_count | short_acc | session_count | active_count | global_wait | fast_run | fast_wait | fast_limit | slow_run | slow_wait | slow_limit | used_mem | estimate_mem | mem_limit | query_mem_limit | used_cpu | cpu_limit | read_speed | write_speed | send_speed | recv_speed

----------------------+------------------+----------+-----------+---------------+--------------+-------------+----------+-----------+------------+----------+-----------+------------+----------+--------------+-----------+-----------------+----------+-----------+------------+-------------+------------+------------

default_pool | logical_cluster2 | 3 | t | 0 | 0 | 0 | 0 | 0 | -1 | 0 | 0 | -1 | 0 bytes | 0 bytes | 11 GB | 4376 MB | 0 | 8 | 0 bytes/s | 0 bytes/s | 0 bytes/s | 0 bytes/s

respool_g1_job_1 | logical_cluster1 | 3 | t | 0 | 0 | 0 | 0 | 0 | -1 | 0 | 0 | 10 | 0 bytes | 0 bytes | 437 MB | 175 MB | 0 | 8 | 0 bytes/s | 0 bytes/s | 0 bytes/s | 0 bytes/s

respool_1 | logical_cluster1 | 3 | t | 0 | 0 | 0 | 0 | 0 | -1 | 0 | 0 | 10 | 0 bytes | 0 bytes | 11 GB | 4376 MB | 0 | 8 | 0 bytes/s | 0 bytes/s | 0 bytes/s | 0 bytes/s

respool_0_mempercent | logical_cluster1 | 3 | t | 0 | 0 | 0 | 0 | 0 | -1 | 0 | 0 | 10 | 0 bytes | 0 bytes | 11 GB | 4376 MB | 0 | 8 | 0 bytes/s | 0 bytes/s | 0 bytes/s | 0 bytes/s

respool_g1_job_2 | logical_cluster1 | 3 | t | 0 | 0 | 0 | 0 | 0 | -1 | 0 | 0 | 10 | 0 bytes | 0 bytes | 437 MB | 175 MB | 0 | 8 | 0 bytes/s | 0 bytes/s | 0 bytes/s | 0 bytes/s

default_pool | logical_cluster1 | 3 | t | 10 | 10 | 0 | 0 | 0 | -1 | 2 | 8 | -1 | 8192 KB | 8720 MB | 11 GB | 4376 MB | .577 | 8 | 0 bytes/s | 17 KB/s | 0 bytes/s | 0 bytes/s

(6 rows)

点击关注,第一时间了解华为云新鲜技术~

相关推荐
Gauss松鼠会2 天前
GaussDB的向量化处理技术
运维·数据库·database·gaussdb
Gauss松鼠会2 天前
GaussDB Ustore存储引擎解读
java·前端·数据库·gaussdb
hong1616884 天前
GaussDB Ustore存储引擎解读
java·数据库·gaussdb
工业甲酰苯胺6 天前
深度解读GaussDB逻辑解码技术原理
数据库·oracle·gaussdb
莫凡的博客7 天前
GaussDB和Oracle的语法对比
数据库·oracle·gaussdb
倾歌为君的018 天前
openGauss开源数据库实战十四
数据库·开源·centos·gaussdb
API开发8 天前
测试华为GaussDB(DWS)数仓,并通过APISQL快速将(表、视图、存储过程)发布为API
数据仓库·sql·postgresql·华为云·api·gaussdb·apisql
kiwixing10 天前
gaussdb hccdp实验练习03 GaussDB数据库应用程序开发实验
数据库·gaussdb
华为云开发者联盟11 天前
深度解读GaussDB逻辑解码技术原理
数据库·sql·gaussdb·华为云drs
kiwixing12 天前
gaussdb hccdp实验练习02 GaussDB数据库开发设计
数据库·数据库开发·gaussdb