Trying out the pg_duckdb extension for PostgreSQL

Repository: https://github.com/duckdb/pg_duckdb/blob/main/README.md

1. Pull the pgduckdb image, start the container, and log in to it

    aaa@kylin-pc:~$ sudo docker pull docker.1ms.run/pgduckdb/pgduckdb:18-v1.1.1
    (enter the sudo password)
    18-v1.1.1: Pulling from pgduckdb/pgduckdb
    3ac51941b358: Pull complete
    8146abf1dff1: Pull complete
    8fad3e9ec6e0: Pull complete
    b6ba526dc589: Pull complete
    8a4a7306158c: Pull complete
    dbfb9c3db61f: Pull complete
    3de8d92fb268: Pull complete
    8dec8597fc9a: Pull complete
    ba2d457458a5: Pull complete
    34c6fdfe850f: Pull complete
    ae8eb2cf7a2d: Pull complete
    ff8919aaa347: Pull complete
    f05c91f4b5ea: Pull complete
    a326b33c168a: Pull complete
    a05b0326ab0a: Pull complete
    435f1ce1604d: Pull complete
    6c5039352746: Pull complete
    ce9e9cc819e9: Download complete
    Digest: sha256:44c88eb9207971af1d7f9804a37429b1bb36f413cd8d7118c81e1288ddde85d1
    Status: Downloaded newer image for docker.1ms.run/pgduckdb/pgduckdb:18-v1.1.1
    docker.1ms.run/pgduckdb/pgduckdb:18-v1.1.1
    aaa@kylin-pc:~$ sudo docker run -d -e POSTGRES_PASSWORD=duckdb -v /home/aaa/par:/par --network host --name pgduckdb docker.1ms.run/pgduckdb/pgduckdb:18-v1.1.1
    522c80d17fa1a863f14b99c6ac0d8c865a511b5498d9dd6613c04de4be4799be
    aaa@kylin-pc:~$ sudo docker exec -it pgduckdb bash
    postgres@kylin-pc:/$ psql
    psql (18.1 (Debian 18.1-1.pgdg12+2))
    Type "help" for help.

2. Test 1: querying a PostgreSQL table

postgres=# -- This is a standard PostgreSQL table
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    product_name TEXT,
    amount NUMERIC,
    order_date DATE
);

INSERT INTO orders (product_name, amount, order_date)
VALUES ('Laptop', 1200.00, '2024-07-01'),
       ('Keyboard', 75.50, '2024-07-01'),
       ('Mouse', 25.00, '2024-07-02');
CREATE TABLE
INSERT 0 3


postgres=# SET duckdb.force_execution = true;
SELECT
    order_date,
    COUNT(*) AS number_of_orders,
    SUM(amount) AS total_revenue
FROM
    orders
GROUP BY
    order_date
ORDER BY
    order_date;
SET
WARNING:  (PGDuckDB/CreatePlan) Prepared query returned an error: Binder Error: No function matches the given name and argument types 'sum(UnsupportedPostgresType(DuckDB requires the precision of a NUMERIC to be set. You can choose to convert these NUMERICs to a DOUBLE by using 'SET duckdb.convert_unsupported_numeric_to_double = true'))'. You might need to add explicit type casts.
	Candidate functions:
	sum(DECIMAL) -> DECIMAL
	sum(BOOLEAN) -> HUGEINT
	sum(SMALLINT) -> HUGEINT
	sum(INTEGER) -> HUGEINT
	sum(BIGINT) -> HUGEINT
	sum(HUGEINT) -> HUGEINT
	sum(DOUBLE) -> DOUBLE
	sum(BIGNUM) -> BIGNUM


LINE 1: SELECT order_date, count(*) AS number_of_orders, sum(amount) AS total_revenue FROM pgduckdb.public.orders...
                                                         ^
 order_date | number_of_orders | total_revenue 
------------+------------------+---------------
 2024-07-01 |                2 |       1275.50
 2024-07-02 |                1 |         25.00
(2 rows)

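The warning reports a type-binding failure: DuckDB requires a NUMERIC column to declare its precision. The query falls back to the PostgreSQL engine and still returns the correct result, and the problem can be resolved by setting a parameter.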

postgres=# SET duckdb.convert_unsupported_numeric_to_double = true;
SET
postgres=# explain SELECT
    order_date,
    COUNT(*) AS number_of_orders,
    SUM(amount) AS total_revenue
FROM
    orders
GROUP BY
    order_date
ORDER BY
    order_date;
                         QUERY PLAN                         
------------------------------------------------------------
 Custom Scan (DuckDBScan)  (cost=0.00..0.00 rows=0 width=0)
   DuckDB Execution Plan: 
 
 ┌───────────────────────────┐
 │          ORDER_BY         │
 │    ────────────────────   │
 │   orders.order_date ASC   │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │       HASH_GROUP_BY       │
 │    ────────────────────   │
 │         Groups: #0        │
 │                           │
 │        Aggregates:        │
 │        count_star()       │
 │          sum(#1)          │
 │                           │
 │         ~512 rows         │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │         PROJECTION        │
 │    ────────────────────   │
 │         order_date        │
 │           amount          │
 │                           │
 │         ~810 rows         │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │  PGDUCKDB_POSTGRES_SCAN   │
 │    ────────────────────   │
 │       Table: orders       │
 │                           │
 │        Projections:       │
 │         order_date        │
 │           amount          │
 │                           │
 │         ~810 rows         │
 └───────────────────────────┘
 
 
(40 rows)

After setting duckdb.convert_unsupported_numeric_to_double to true, the warning disappears and the execution plan switches to DuckDB.
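
One thing to keep in mind with this setting: a DOUBLE is a binary floating-point value with roughly 15 to 17 significant decimal digits, so converting NUMERIC to DOUBLE can lose exactness for long values. The Python sketch below only illustrates that effect with the standard decimal module; it does not call pg_duckdb, and the 19-digit value is a made-up example rather than data from the test table.

```python
from decimal import Decimal

# Amounts from the orders table: these small values survive conversion to a double.
amounts = [Decimal("1200.00"), Decimal("75.50"), Decimal("25.00")]
exact_total = sum(amounts)                     # exact decimal arithmetic
double_total = sum(float(a) for a in amounts)  # double-precision arithmetic

# A hypothetical 19-digit value needs more digits than a double can hold exactly.
big = Decimal("12345678901234567.89")
round_tripped = Decimal(float(big))            # exact value of the nearest double

print(exact_total, double_total)
print("19-digit value preserved:", big == round_tripped)
```

For the three order amounts both totals agree, which is why the query result above is unchanged; the long value is not preserved.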

pg_duckdb behaves this way because the numeric types of the two databases cover different ranges.

The range given in the DuckDB documentation:

--- https://duckdb.org/docs/current/sql/data_types/numeric#fixed-point-decimals

Fixed-Point Decimals
The data type DECIMAL(WIDTH, SCALE) (also available under the alias NUMERIC(WIDTH, SCALE)) represents an exact fixed-point decimal value. When creating a value of type DECIMAL, the WIDTH and SCALE can be specified to define which size of decimal values can be held in the field. The WIDTH field determines how many digits can be held, and the scale determines the number of digits after the decimal point. For example, the type DECIMAL(3, 2) can fit the value 1.23, but cannot fit the value 12.3 or the value 1.234. The default WIDTH and SCALE is DECIMAL(18, 3), if none are specified.

Addition, subtraction and multiplication of two fixed-point decimals returns another fixed-point decimal with the required WIDTH and SCALE to contain the exact result, or throws an error if the required WIDTH would exceed the maximal supported WIDTH, which is currently 38.

Division of fixed-point decimals does not typically produce numbers with finite decimal expansion. Therefore, DuckDB uses approximate floating-point arithmetic for all divisions that involve fixed-point decimals and accordingly returns floating-point data types.

Internally, decimals are represented as integers depending on their specified WIDTH.

Width   Internal   Size (bytes)
1-4     INT16      2
5-9     INT32      4
10-18   INT64      8
19-38   INT128     16

Performance can be impacted by using too large decimals when not required. In particular, decimal values with a width above 19 are slow, as arithmetic involving the INT128 type is much more expensive than operations involving the INT32 or INT64 types. It is therefore recommended to stick with a WIDTH of 18 or below, unless there is a good reason for why this is insufficient.
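
The width-to-storage table in the excerpt can be written as a small lookup. This is only a sketch of the documented mapping; decimal_storage is a hypothetical helper name, not a DuckDB or pg_duckdb function.

```python
def decimal_storage(width: int) -> tuple[str, int]:
    """Return (internal type, size in bytes) for a DuckDB DECIMAL of the given WIDTH."""
    if not 1 <= width <= 38:
        raise ValueError("DuckDB DECIMAL width must be between 1 and 38")
    if width <= 4:
        return ("INT16", 2)
    if width <= 9:
        return ("INT32", 4)
    if width <= 18:
        return ("INT64", 8)
    return ("INT128", 16)


print(decimal_storage(15))  # width of the DECIMAL(15,2) column used later in this post
print(decimal_storage(38))  # the maximum supported width
```

By this table, a DECIMAL(15,2) column stays within the INT64 range that the DuckDB documentation recommends.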

The range given in the PostgreSQL documentation:

--- https://www.postgresql.org/docs/current/datatype-numeric.html

Table 8.2. Numeric Types

Name      Storage Size  Description                      Range
smallint  2 bytes       small-range integer              -32768 to +32767
integer   4 bytes       typical choice for integer       -2147483648 to +2147483647
bigint    8 bytes       large-range integer              -9223372036854775808 to +9223372036854775807
decimal   variable      user-specified precision, exact  up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
numeric   variable      user-specified precision, exact  up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
...

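A PostgreSQL numeric can hold up to 131072 digits before the decimal point (and up to 16383 after it), whereas a DuckDB DECIMAL is limited to 38 digits in total. If no precision is declared, a PostgreSQL numeric value may therefore overflow in DuckDB.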
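
The overflow risk can be made concrete with a small Python check of whether a value fits exactly into DECIMAL(width, scale). This is a simplified sketch using only the standard decimal module: fits_decimal is a hypothetical helper, and a real DuckDB cast may round excess fractional digits instead of rejecting them.

```python
from decimal import Decimal

DUCKDB_MAX_WIDTH = 38  # maximum DECIMAL width per the DuckDB documentation quoted above


def fits_decimal(value: Decimal, width: int, scale: int) -> bool:
    """Return True if value can be stored exactly in DECIMAL(width, scale)."""
    if not (1 <= width <= DUCKDB_MAX_WIDTH and 0 <= scale <= width):
        raise ValueError("invalid DECIMAL(width, scale)")
    if not value.is_finite():
        return False
    _sign, digits, exponent = value.as_tuple()
    unscaled = int("".join(map(str, digits)))
    # value == unscaled * 10**exponent; multiply by 10**scale to get the stored integer.
    shift = exponent + scale
    if shift >= 0:
        stored = unscaled * 10**shift
    else:
        stored, remainder = divmod(unscaled, 10**(-shift))
        if remainder:
            return False  # more fractional digits than the scale allows
    return stored < 10**width


# The examples from the DuckDB documentation for DECIMAL(3, 2).
print(fits_decimal(Decimal("1.23"), 3, 2), fits_decimal(Decimal("12.3"), 3, 2))
# A 39-digit integer is legal in PostgreSQL numeric but exceeds DuckDB's maximum width.
print(fits_decimal(Decimal("1" + "0" * 38), 38, 0))
```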

Drop the old table, recreate it with a DECIMAL that declares its precision, and set duckdb.convert_unsupported_numeric_to_double back to false. The query now runs without a warning.

postgres=# SET duckdb.convert_unsupported_numeric_to_double = false;
SET


postgres=# drop table orders;
DROP TABLE
postgres=# CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    product_name TEXT,
    amount DECIMAL(15,2),
    order_date DATE
);
CREATE TABLE
postgres=# INSERT INTO orders (product_name, amount, order_date)
VALUES ('Laptop', 1200.00, '2024-07-01'),
       ('Keyboard', 75.50, '2024-07-01'),
       ('Mouse', 25.00, '2024-07-02');
INSERT 0 3
postgres=# SELECT
    order_date,
    COUNT(*) AS number_of_orders,
    SUM(amount) AS total_revenue
FROM
    orders
GROUP BY
    order_date
ORDER BY
    order_date;
 order_date | number_of_orders | total_revenue 
------------+------------------+---------------
 2024-07-01 |                2 |       1275.50
 2024-07-02 |                1 |         25.00
(2 rows)

3. Test 2: DuckDB-only functions

postgres=# select * from range(1,3);
ERROR:  function range(integer, integer) does not exist
LINE 1: select * from range(1,3);
                      ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
postgres=# \timing on
Timing is on.
postgres=# select sum(i) from generate_series(1,10000000)t(i);
      sum       
----------------
 50000005000000
(1 row)

Time: 33.870 ms

postgres=# explain select sum(i) from generate_series(1,10000000)t(i);
                         QUERY PLAN                         
------------------------------------------------------------
 Custom Scan (DuckDBScan)  (cost=0.00..0.00 rows=0 width=0)
   DuckDB Execution Plan: 
 
 ┌───────────────────────────┐
 │    UNGROUPED_AGGREGATE    │
 │    ────────────────────   │
 │    Aggregates: sum(#0)    │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │         PROJECTION        │
 │    ────────────────────   │
 │             i             │
 │                           │
 │      ~10,000,000 rows     │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │      GENERATE_SERIES      │
 │    ────────────────────   │
 │         Function:         │
 │      GENERATE_SERIES      │
 │                           │
 │      ~10,000,000 rows     │
 └───────────────────────────┘
 
 
(25 rows)

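The DuckDB-only range function cannot be used: PostgreSQL rejects the call because it has no function with that name and signature.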

generate_series, which exists in both systems, works and runs on the DuckDB engine (duckdb.force_execution is still true from Test 1).

postgres=# select * from '/par/tpch1/region.csv';
ERROR:  syntax error at or near "'/par/tpch1/region.csv'"
LINE 1: select * from '/par/tpch1/region.csv';
                      ^
postgres=# select * from read_csv('/par/tpch1/region.csv');
 r_regionkey |   r_name    |                                                      r_comment                                                      
-------------+-------------+---------------------------------------------------------------------------------------------------------------------
           0 | AFRICA      | ar packages. regular excuses among the ironic requests cajole fluffily blithely final requests. furiously express p
           1 | AMERICA     | s are. furiously even pinto bea
           2 | ASIA        | c, special dependencies around 
           3 | EUROPE      | e dolphins are furiously about the carefully 
           4 | MIDDLE EAST |  foxes boost furiously along the carefully dogged tithes. slyly regular orbits according to the special epit
(5 rows)

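Although read_csv is DuckDB-only, it is listed in the pg_duckdb documentation (https://github.com/duckdb/pg_duckdb/blob/main/docs/functions.md) and therefore works. The shorthand of writing only the file path, without an explicit read_csv call, is not supported.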

postgres=# set duckdb.force_execution=false;
SET

postgres=# explain select * from read_csv('/par/tpch1/region.csv');
                         QUERY PLAN                         
------------------------------------------------------------
 Custom Scan (DuckDBScan)  (cost=0.00..0.00 rows=0 width=0)
   DuckDB Execution Plan: 
 
 ┌───────────────────────────┐
 │         READ_CSV          │
 │    ────────────────────   │
 │     Function: READ_CSV    │
 │                           │
 │        Projections:       │
 │        r_regionkey        │
 │           r_name          │
 │         r_comment         │
 │                           │
 │          ~27 rows         │
 └───────────────────────────┘
 
 
(17 rows)

postgres=# explain select sum(i) from generate_series(1,10000000)t(i);
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 Aggregate  (cost=125000.00..125000.01 rows=1 width=8)
   ->  Function Scan on generate_series t  (cost=0.00..100000.00 rows=10000000 width=4)
 JIT:
   Functions: 5
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)

postgres=# \timing on
Timing is on.
postgres=# select sum(i) from generate_series(1,10000000)t(i);
      sum       
----------------
 50000005000000
(1 row)

Time: 1479.601 ms (00:01.480)

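With duckdb.force_execution explicitly set to false, only read_csv, which is listed in the documentation, runs on the DuckDB engine; generate_series runs on PostgreSQL. The same aggregation took 33.870 ms on DuckDB and 1479.601 ms on PostgreSQL.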
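
As a quick sanity check on these numbers, the Python sketch below verifies the returned sum with the closed-form formula and computes the ratio between the two timings. The timings are the single measurements shown in the psql output above, so the ratio is only indicative, not a benchmark result.

```python
n = 10_000_000
closed_form = n * (n + 1) // 2  # 1 + 2 + ... + n

duckdb_ms = 33.870       # duckdb.force_execution = true
postgres_ms = 1479.601   # duckdb.force_execution = false
speedup = postgres_ms / duckdb_ms

print(closed_form)
print(f"{speedup:.1f}x")
```

The closed form matches the 50000005000000 returned by both engines, and the ratio comes out at roughly 44x for this one run.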
