Trying out the pg_duckdb extension for PostgreSQL

Repository: https://github.com/duckdb/pg_duckdb/blob/main/README.md

1. Pull the pgduckdb image, run the container, and log in to it

aaa@kylin-pc:~$ sudo docker pull docker.1ms.run/pgduckdb/pgduckdb:18-v1.1.1
[enter the sudo password]
18-v1.1.1: Pulling from pgduckdb/pgduckdb
3ac51941b358: Pull complete
8146abf1dff1: Pull complete
8fad3e9ec6e0: Pull complete
b6ba526dc589: Pull complete
8a4a7306158c: Pull complete
dbfb9c3db61f: Pull complete
3de8d92fb268: Pull complete
8dec8597fc9a: Pull complete
ba2d457458a5: Pull complete
34c6fdfe850f: Pull complete
ae8eb2cf7a2d: Pull complete
ff8919aaa347: Pull complete
f05c91f4b5ea: Pull complete
a326b33c168a: Pull complete
a05b0326ab0a: Pull complete
435f1ce1604d: Pull complete
6c5039352746: Pull complete
ce9e9cc819e9: Download complete
Digest: sha256:44c88eb9207971af1d7f9804a37429b1bb36f413cd8d7118c81e1288ddde85d1
Status: Downloaded newer image for docker.1ms.run/pgduckdb/pgduckdb:18-v1.1.1
docker.1ms.run/pgduckdb/pgduckdb:18-v1.1.1
aaa@kylin-pc:~$ sudo docker run -d -e POSTGRES_PASSWORD=duckdb -v /home/aaa/par:/par --network host --name pgduckdb docker.1ms.run/pgduckdb/pgduckdb:18-v1.1.1
522c80d17fa1a863f14b99c6ac0d8c865a511b5498d9dd6613c04de4be4799be
aaa@kylin-pc:~$ sudo docker exec -it pgduckdb bash
postgres@kylin-pc:/$ psql
psql (18.1 (Debian 18.1-1.pgdg12+2))
Type "help" for help.

2. Test 1: querying a PostgreSQL table

postgres=# -- This is a standard PostgreSQL table
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    product_name TEXT,
    amount NUMERIC,
    order_date DATE
);

INSERT INTO orders (product_name, amount, order_date)
VALUES ('Laptop', 1200.00, '2024-07-01'),
       ('Keyboard', 75.50, '2024-07-01'),
       ('Mouse', 25.00, '2024-07-02');
CREATE TABLE
INSERT 0 3


postgres=# SET duckdb.force_execution = true;
SELECT
    order_date,
    COUNT(*) AS number_of_orders,
    SUM(amount) AS total_revenue
FROM
    orders
GROUP BY
    order_date
ORDER BY
    order_date;
SET
WARNING:  (PGDuckDB/CreatePlan) Prepared query returned an error: Binder Error: No function matches the given name and argument types 'sum(UnsupportedPostgresType(DuckDB requires the precision of a NUMERIC to be set. You can choose to convert these NUMERICs to a DOUBLE by using 'SET duckdb.convert_unsupported_numeric_to_double = true'))'. You might need to add explicit type casts.
	Candidate functions:
	sum(DECIMAL) -> DECIMAL
	sum(BOOLEAN) -> HUGEINT
	sum(SMALLINT) -> HUGEINT
	sum(INTEGER) -> HUGEINT
	sum(BIGINT) -> HUGEINT
	sum(HUGEINT) -> HUGEINT
	sum(DOUBLE) -> DOUBLE
	sum(BIGNUM) -> BIGNUM


LINE 1: SELECT order_date, count(*) AS number_of_orders, sum(amount) AS total_revenue FROM pgduckdb.public.orders...
                                                         ^
 order_date | number_of_orders | total_revenue 
------------+------------------+---------------
 2024-07-01 |                2 |       1275.50
 2024-07-02 |                1 |         25.00
(2 rows)

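The warning reports a type-binding failure: DuckDB requires a NUMERIC column to have an explicit precision. The query therefore falls back to the PostgreSQL engine and still returns the correct result. The problem can be resolved by setting the parameter named in the warning.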

postgres=# 

postgres=# SET duckdb.convert_unsupported_numeric_to_double = true;
SET
postgres=# explain SELECT
    order_date,
    COUNT(*) AS number_of_orders,
    SUM(amount) AS total_revenue
FROM
    orders
GROUP BY
    order_date
ORDER BY
    order_date;
                         QUERY PLAN                         
------------------------------------------------------------
 Custom Scan (DuckDBScan)  (cost=0.00..0.00 rows=0 width=0)
   DuckDB Execution Plan: 
 
 ┌───────────────────────────┐
 │          ORDER_BY         │
 │    ────────────────────   │
 │   orders.order_date ASC   │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │       HASH_GROUP_BY       │
 │    ────────────────────   │
 │         Groups: #0        │
 │                           │
 │        Aggregates:        │
 │        count_star()       │
 │          sum(#1)          │
 │                           │
 │         ~512 rows         │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │         PROJECTION        │
 │    ────────────────────   │
 │         order_date        │
 │           amount          │
 │                           │
 │         ~810 rows         │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │  PGDUCKDB_POSTGRES_SCAN   │
 │    ────────────────────   │
 │       Table: orders       │
 │                           │
 │        Projections:       │
 │         order_date        │
 │           amount          │
 │                           │
 │         ~810 rows         │
 └───────────────────────────┘
 
 
(40 rows)

After setting duckdb.convert_unsupported_numeric_to_double, the warning disappears and the execution plan switches to DuckDB.
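Converting NUMERIC to DOUBLE trades exact decimal arithmetic for binary floating point. The following is a minimal Python sketch of that general trade-off, not of pg_duckdb itself; the ten values of 0.10 are made up for illustration (the three amounts in the test table happen to be exactly representable as doubles, so they would not show any difference).

```python
from decimal import Decimal

# Hypothetical amounts, chosen because 0.10 has no exact binary representation.
amounts = ["0.10"] * 10

# Exact decimal arithmetic, analogous to summing a NUMERIC/DECIMAL column.
exact_total = sum(Decimal(a) for a in amounts)

# Binary floating point, analogous to summing the same values as DOUBLE.
double_total = sum(float(a) for a in amounts)

print("decimal total:", exact_total)
print("double total equals 1.0:", double_total == 1.0)
```

For reporting queries the tiny error is often acceptable, but for monetary columns an explicit DECIMAL precision is probably the safer choice.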

pg_duckdb handles NUMERIC this way because the numeric types of the two databases have different ranges.

The range stated in the DuckDB documentation:

--- https://duckdb.org/docs/current/sql/data_types/numeric#fixed-point-decimals

Fixed-Point Decimals
The data type DECIMAL(WIDTH, SCALE) (also available under the alias NUMERIC(WIDTH, SCALE)) represents an exact fixed-point decimal value. When creating a value of type DECIMAL, the WIDTH and SCALE can be specified to define which size of decimal values can be held in the field. The WIDTH field determines how many digits can be held, and the scale determines the number of digits after the decimal point. For example, the type DECIMAL(3, 2) can fit the value 1.23, but cannot fit the value 12.3 or the value 1.234. The default WIDTH and SCALE is DECIMAL(18, 3), if none are specified.

Addition, subtraction and multiplication of two fixed-point decimals returns another fixed-point decimal with the required WIDTH and SCALE to contain the exact result, or throws an error if the required WIDTH would exceed the maximal supported WIDTH, which is currently 38.

Division of fixed-point decimals does not typically produce numbers with finite decimal expansion. Therefore, DuckDB uses approximate floating-point arithmetic for all divisions that involve fixed-point decimals and accordingly returns floating-point data types.

Internally, decimals are represented as integers depending on their specified WIDTH.

Width   Internal   Size (bytes)
1-4     INT16      2
5-9     INT32      4
10-18   INT64      8
19-38   INT128     16

Performance can be impacted by using too large decimals when not required. In particular, decimal values with a width above 19 are slow, as arithmetic involving the INT128 type is much more expensive than operations involving the INT32 or INT64 types. It is therefore recommended to stick with a WIDTH of 18 or below, unless there is a good reason for why this is insufficient.
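
The width-to-storage table quoted above is consistent with picking the smallest signed integer type that can hold the largest WIDTH-digit number, 10^WIDTH - 1. The Python sketch below reproduces the table from that rule. It is only an illustration of the arithmetic; the function name decimal_storage is hypothetical and this is not DuckDB's actual code.

```python
def decimal_storage(width: int) -> tuple[str, int]:
    """Return (internal type, size in bytes) for a DECIMAL of the given width."""
    if not 1 <= width <= 38:
        raise ValueError("width must be between 1 and 38")
    largest = 10**width - 1  # largest unscaled value with `width` digits
    for bits in (16, 32, 64, 128):
        # A signed integer with `bits` bits holds values up to 2**(bits-1) - 1.
        if largest <= 2 ** (bits - 1) - 1:
            return f"INT{bits}", bits // 8
    raise AssertionError("unreachable for widths up to 38")


for w in (4, 5, 9, 10, 15, 18, 19, 38):
    print(w, decimal_storage(w))
```

Under this rule a DECIMAL(15,2) column, as used later in this post, falls in the INT64 row.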

The range stated in the PostgreSQL documentation:

--- https://www.postgresql.org/docs/current/datatype-numeric.html

Table 8.2. Numeric Types

Name	Storage Size	Description	Range
smallint	2 bytes	small-range integer	-32768 to +32767
integer	4 bytes	typical choice for integer	-2147483648 to +2147483647
bigint	8 bytes	large-range integer	-9223372036854775808 to +9223372036854775807
decimal	variable	user-specified precision, exact	up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
numeric	variable	user-specified precision, exact	up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
...

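PostgreSQL's numeric can hold up to 131072 digits before the decimal point, while DuckDB's DECIMAL supports a maximum width of 38 digits. A PostgreSQL numeric column declared without a precision could therefore contain values that overflow in DuckDB.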
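
To make the overflow risk concrete, here is a small Python sketch that checks whether a value fits a DECIMAL(width, scale) under the 38-digit limit. The helper fits_decimal is hypothetical and simplified: it handles finite values only and, following the wording of the DuckDB documentation quoted earlier, treats excess fractional digits as "does not fit", whereas a real cast may round instead.

```python
from decimal import Decimal

DUCKDB_MAX_WIDTH = 38  # maximum DECIMAL width per the DuckDB documentation


def fits_decimal(value: str, width: int, scale: int) -> bool:
    """Check whether a finite decimal literal fits DECIMAL(width, scale)."""
    if not (1 <= width <= DUCKDB_MAX_WIDTH and 0 <= scale <= width):
        raise ValueError("invalid DECIMAL(width, scale)")
    _sign, digits, exponent = Decimal(value).as_tuple()
    frac_digits = max(0, -exponent)               # digits after the point
    int_digits = max(0, len(digits) + exponent)   # digits before the point
    return frac_digits <= scale and int_digits <= width - scale


print(fits_decimal("1.23", 3, 2))      # example from the DuckDB docs
print(fits_decimal("12.3", 3, 2))      # too many integer digits
print(fits_decimal("1200.00", 15, 2))  # an amount from the orders table
print(fits_decimal("1" + "0" * 39, 38, 0))  # 40 digits exceed width 38
```

A 40-digit integer is a perfectly valid PostgreSQL numeric value but cannot be represented by any DuckDB DECIMAL, which is why an unconstrained NUMERIC column has no safe DuckDB equivalent.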

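Drop the old table, recreate it with a DECIMAL column that has an explicit precision, and set duckdb.convert_unsupported_numeric_to_double back to false. The query then runs without any warning.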

postgres=# SET duckdb.convert_unsupported_numeric_to_double = false;
SET


postgres=# drop table orders;
DROP TABLE
postgres=# CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    product_name TEXT,
    amount DECIMAL(15,2),
    order_date DATE
);
CREATE TABLE
postgres=# INSERT INTO orders (product_name, amount, order_date)
VALUES ('Laptop', 1200.00, '2024-07-01'),
       ('Keyboard', 75.50, '2024-07-01'),
       ('Mouse', 25.00, '2024-07-02');
INSERT 0 3
postgres=# SELECT
    order_date,
    COUNT(*) AS number_of_orders,
    SUM(amount) AS total_revenue
FROM
    orders
GROUP BY
    order_date
ORDER BY
    order_date;
 order_date | number_of_orders | total_revenue 
------------+------------------+---------------
 2024-07-01 |                2 |       1275.50
 2024-07-02 |                1 |         25.00
(2 rows)

3. Test 2: DuckDB-only functions

postgres=# select * from range(1,3);
ERROR:  function range(integer, integer) does not exist
LINE 1: select * from range(1,3);
                      ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
postgres=# \timing on
Timing is on.
postgres=# select sum(i) from generate_series(1,10000000)t(i);
      sum       
----------------
 50000005000000
(1 row)

Time: 33.870 ms

postgres=# explain select sum(i) from generate_series(1,10000000)t(i);
                         QUERY PLAN                         
------------------------------------------------------------
 Custom Scan (DuckDBScan)  (cost=0.00..0.00 rows=0 width=0)
   DuckDB Execution Plan: 
 
 ┌───────────────────────────┐
 │    UNGROUPED_AGGREGATE    │
 │    ────────────────────   │
 │    Aggregates: sum(#0)    │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │         PROJECTION        │
 │    ────────────────────   │
 │             i             │
 │                           │
 │      ~10,000,000 rows     │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │      GENERATE_SERIES      │
 │    ────────────────────   │
 │         Function:         │
 │      GENERATE_SERIES      │
 │                           │
 │      ~10,000,000 rows     │
 └───────────────────────────┘
 
 
(25 rows)

The DuckDB-only range function cannot be used: the call fails with PostgreSQL's "function does not exist" error.

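generate_series, which both engines provide, works and is executed by the DuckDB engine automatically (duckdb.force_execution is still true from Test 1).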
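
As a sanity check on the value returned by the query above, the sum of 1 to n has the closed form n(n+1)/2. The short Python snippet below compares the closed form with a brute-force sum for n = 10,000,000; it is independent of either database.

```python
n = 10_000_000

# Closed form for 1 + 2 + ... + n.
closed_form = n * (n + 1) // 2

# Brute-force check, equivalent to sum(i) over generate_series(1, n).
brute_force = sum(range(1, n + 1))

print(closed_form, closed_form == brute_force)
```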

postgres=# select * from '/par/tpch1/region.csv';
ERROR:  syntax error at or near "'/par/tpch1/region.csv'"
LINE 1: select * from '/par/tpch1/region.csv';
                      ^
postgres=# select * from read_csv('/par/tpch1/region.csv');
 r_regionkey |   r_name    |                                                      r_comment                                                      
-------------+-------------+---------------------------------------------------------------------------------------------------------------------
           0 | AFRICA      | ar packages. regular excuses among the ironic requests cajole fluffily blithely final requests. furiously express p
           1 | AMERICA     | s are. furiously even pinto bea
           2 | ASIA        | c, special dependencies around 
           3 | EUROPE      | e dolphins are furiously about the carefully 
           4 | MIDDLE EAST |  foxes boost furiously along the carefully dogged tithes. slyly regular orbits according to the special epit
(5 rows)

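Although read_csv is DuckDB-specific, it is listed in the pg_duckdb documentation (https://github.com/duckdb/pg_duckdb/blob/main/docs/functions.md), so it can be used. DuckDB's shorthand of writing only the file path, without an explicit read_csv call, is not supported.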

postgres=# set duckdb.force_execution=false;
SET

postgres=# explain select * from read_csv('/par/tpch1/region.csv');
                         QUERY PLAN                         
------------------------------------------------------------
 Custom Scan (DuckDBScan)  (cost=0.00..0.00 rows=0 width=0)
   DuckDB Execution Plan: 
 
 ┌───────────────────────────┐
 │         READ_CSV          │
 │    ────────────────────   │
 │     Function: READ_CSV    │
 │                           │
 │        Projections:       │
 │        r_regionkey        │
 │           r_name          │
 │         r_comment         │
 │                           │
 │          ~27 rows         │
 └───────────────────────────┘
 
 
(17 rows)

postgres=# explain select sum(i) from generate_series(1,10000000)t(i);
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 Aggregate  (cost=125000.00..125000.01 rows=1 width=8)
   ->  Function Scan on generate_series t  (cost=0.00..100000.00 rows=10000000 width=4)
 JIT:
   Functions: 5
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)

postgres=# \timing on
Timing is on.
postgres=# select sum(i) from generate_series(1,10000000)t(i);
      sum       
----------------
 50000005000000
(1 row)

Time: 1479.601 ms (00:01.480)

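With duckdb.force_execution explicitly set to false, only read_csv, which is listed in the documentation, still goes through the DuckDB engine; generate_series is executed by PostgreSQL. In this single run the same aggregate took 1479.601 ms on PostgreSQL versus 33.870 ms on DuckDB, roughly 44 times slower.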
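
The engine choices observed in these tests can be summarized as a toy decision function. This Python sketch only models the behavior seen in the transcripts above; the function choose_engine and its parameters are hypothetical and do not reflect pg_duckdb's actual implementation, which may take other conditions into account.

```python
def choose_engine(
    *,
    uses_duckdb_only_function: bool,
    force_execution: bool,
    duckdb_can_plan: bool = True,
) -> str:
    """Toy model of which engine ran a query in the tests above."""
    # read_csv ran on DuckDB even with duckdb.force_execution = false.
    if uses_duckdb_only_function:
        return "duckdb"
    # With force_execution on, DuckDB is tried first; if it cannot plan the
    # query (for example NUMERIC without precision), PostgreSQL takes over.
    if force_execution and duckdb_can_plan:
        return "duckdb"
    return "postgres"


# generate_series with force_execution = true, then false.
print(choose_engine(uses_duckdb_only_function=False, force_execution=True))
print(choose_engine(uses_duckdb_only_function=False, force_execution=False))
# SUM over an unconstrained NUMERIC column with force_execution = true.
print(choose_engine(uses_duckdb_only_function=False, force_execution=True,
                    duckdb_can_plan=False))
```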