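Contents
[1. Doris Core Concepts](#1. Doris Core Concepts)
[1.1 Architecture Components](#1.1 Architecture Components)
[1.2 Data Models](#1.2 Data Models)
[2. Deploying Doris](#2. Deploying Doris)
[2.1 Single-Node Deployment (Test Environment)](#2.1 Single-Node Deployment (Test Environment))
[2.2 Cluster Deployment (Production Environment)](#2.2 Cluster Deployment (Production Environment))
[3. Data Operations Guide](#3. Data Operations Guide)
[3.1 Database and Table Management](#3.1 Database and Table Management)
[3.2 Data Import Methods](#3.2 Data Import Methods)
[3.2.1 Batch Import](#3.2.1 Batch Import)
[3.2.2 Real-Time Import](#3.2.2 Real-Time Import)
[3.3 Query Examples](#3.3 Query Examples)
[4. Performance Optimization in Practice](#4. Performance Optimization in Practice)
[4.1 Partitioning and Bucketing Strategy](#4.1 Partitioning and Bucketing Strategy)
[4.2 Index Optimization](#4.2 Index Optimization)
[4.3 Query Optimization Techniques](#4.3 Query Optimization Techniques)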
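Apache Doris is a high-performance, real-time analytical database that is widely used for big-data analytics, real-time reporting, and similar workloads. This article walks through Doris's core concepts, deployment options, data operations, and optimization techniques.
The official introduction is available here:
Doris website: https://doris.apache.org/zh-CN/docs/dev/gettingStarted/what-is-apache-doris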
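1. Doris Core Concepts
1.1 Architecture Components
- FE (Frontend): manages metadata, accepts client connections, and generates query plans
- BE (Backend): stores data and executes queries
- Broker: an optional process used to access external storage systems (such as HDFS or S3)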
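To see these three roles on a running cluster, the statements below list the registered nodes. This is a minimal sketch that assumes you are connected to an FE through a MySQL-compatible client (the FE query port is 9030 by default) with an account that has administrator privileges.

```sql
-- List FE nodes and their roles (master, follower, observer)
SHOW FRONTENDS;

-- List BE nodes together with their liveness and disk usage
SHOW BACKENDS;

-- List registered Broker processes (the result is empty if none has been added)
SHOW BROKER;
```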
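1.2 Data Models
- Duplicate Key model (detail model): stores rows as they arrive, which suits raw data such as logs and events
- Aggregate Key model: merges rows that share the same key at load time, so aggregate queries read pre-aggregated data
- Unique Key model (primary-key model): keeps one row per key, which makes real-time updates possible
- Merge-on-Write: not a separate data model but a storage implementation of the Unique Key model. It was introduced in version 1.2 and is the default since 2.1. It resolves duplicate keys at write time, so reads on frequently updated tables are faster than with the older Merge-on-Read implementation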
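The update behavior of the Unique Key model can be illustrated with a small table. This is a sketch under stated assumptions: the table `user_profile` and its columns are hypothetical names chosen for illustration, `replication_num` is set to 1 on the assumption of a single-BE test cluster, and the `enable_unique_key_merge_on_write` property requires a Doris version that supports Merge-on-Write.

```sql
CREATE DATABASE IF NOT EXISTS demo_db;

-- Hypothetical table: one row per user_id, newer writes replace older ones
CREATE TABLE demo_db.user_profile (
    user_id    BIGINT NOT NULL,
    nickname   VARCHAR(64),
    updated_at DATETIME
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10
PROPERTIES (
    "replication_num" = "1",
    "enable_unique_key_merge_on_write" = "true"
);

-- Two writes with the same key
INSERT INTO demo_db.user_profile VALUES (1, 'alice', '2023-11-01 10:00:00');
INSERT INTO demo_db.user_profile VALUES (1, 'alice_v2', '2023-11-02 09:00:00');

-- Only the most recently written row for user_id = 1 is returned
SELECT * FROM demo_db.user_profile WHERE user_id = 1;
```

With a Duplicate Key table the same two inserts would leave two rows, and with an Aggregate Key table the result would depend on the aggregation type declared for each value column.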
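2. Deploying Doris
2.1 Single-Node Deployment (Test Environment)
bash
# Download and extract
wget https://apache-doris-releases.oss-accelerate.aliyuncs.com/doris-1.2.4-bin.tar.gz
tar -zxvf doris-1.2.4-bin.tar.gz
# Run the commands below from the directory that contains fe/ and be/.
# Subshells keep the working directory unchanged, so the second cd does not
# start from inside fe/bin/.
# Start the FE
(cd fe/bin/ && ./start_fe.sh --daemon)
# Start the BE
(cd be/bin/ && ./start_be.sh --daemon)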
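Starting both processes does not yet make the BE part of the cluster; it still has to be registered with the FE. The sketch below assumes default ports and a local connection opened with `mysql -h 127.0.0.1 -P 9030 -uroot` (the root account has an empty password on a fresh install). The address `127.0.0.1` is an assumption; if `priority_networks` is configured, use the IP address the BE actually binds to.

```sql
-- Register the local BE (9050 is the default heartbeat_service_port)
ALTER SYSTEM ADD BACKEND "127.0.0.1:9050";

-- Check the Alive column; it should become true shortly after registration
SHOW BACKENDS;
```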
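2.2 Cluster Deployment (Production Environment)
sql
-- Connect to an FE with a MySQL client, then register each BE
-- (9050 is the default BE heartbeat port)
ALTER SYSTEM ADD BACKEND "be1:9050";
ALTER SYSTEM ADD BACKEND "be2:9050";
ALTER SYSTEM ADD BACKEND "be3:9050";
-- Check node status
SHOW PROC '/backends';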
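A production cluster usually also runs more than one FE so that metadata stays available if a node fails. The following is a sketch only: the host names `fe2` and `fe3` are hypothetical, and 9010 is the default FE `edit_log_port`. Each new FE is expected to be started for the first time with the `--helper` option pointing at an existing FE (for example `./start_fe.sh --helper fe1:9010 --daemon`, where `fe1` is again an assumed host name).

```sql
-- Add a follower FE that takes part in leader election
ALTER SYSTEM ADD FOLLOWER "fe2:9010";

-- Add an observer FE that serves reads but does not vote
ALTER SYSTEM ADD OBSERVER "fe3:9010";

-- Verify that the new nodes have joined and are alive
SHOW FRONTENDS;
```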
3. Data Operations Guide
3.1 Database and Table Management
sql
-- Create a database
CREATE DATABASE demo_db;
-- Create a detail (Duplicate Key) table
CREATE TABLE demo_db.user_behavior (
    user_id LARGEINT NOT NULL,
    item_id LARGEINT NOT NULL,
    behavior_type VARCHAR(20),
    ts DATETIME NOT NULL
)
DUPLICATE KEY(user_id, item_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10
PROPERTIES (
    "replication_num" = "3",
    "storage_medium" = "SSD"
);
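-- Create an aggregate (Aggregate Key) table
-- Value columns are declared as: column_name TYPE AGG_TYPE.
-- COUNT is not an aggregation type; load 1 per row into sales_count and SUM it.
CREATE TABLE demo_db.sales_agg (
    dt DATE NOT NULL,
    product_id LARGEINT NOT NULL,
    user_region VARCHAR(50),
    sales_amount BIGINT SUM,
    sales_count BIGINT SUM
)
AGGREGATE KEY(dt, product_id, user_region)
DISTRIBUTED BY HASH(product_id) BUCKETS 10;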
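To make the pre-aggregation concrete, the sketch below loads two rows that share the same key into `sales_agg`. It assumes both value columns (`sales_amount` and `sales_count`) are declared with the SUM aggregation type; the inserted values are made-up sample data.

```sql
-- Two rows with an identical key (dt, product_id, user_region)
INSERT INTO demo_db.sales_agg VALUES
    ('2023-10-01', 1001, 'east', 100, 1),
    ('2023-10-01', 1001, 'east', 250, 1);

-- The table exposes a single merged row for that key:
-- sales_amount = 350 and sales_count = 2
SELECT * FROM demo_db.sales_agg
WHERE dt = '2023-10-01' AND product_id = 1001;
```

Because the merge happens at the storage layer, the original per-row detail cannot be recovered from this table, which is the trade-off against the Duplicate Key model.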
3.2 Data Import Methods
3.2.1 Batch Import
sql
-- Broker Load: import a file from HDFS through a Broker process
LOAD LABEL demo_db.label_20231101
(
    DATA INFILE("hdfs://path/to/file.parquet")
    INTO TABLE user_behavior
    FORMAT AS "parquet"
)
WITH BROKER "hdfs_broker";
bash
# Stream Load (HTTP API): push a local CSV file through the FE HTTP port
curl --location-trusted -u user:passwd \
    -H "Expect:100-continue" \
    -H "column_separator:," \
    -T data.csv \
    http://fe_host:8030/api/demo_db/user_behavior/_stream_load
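Broker Load runs asynchronously, so the `LOAD LABEL` statement returning does not mean the data has arrived. A minimal way to follow up, assuming the label used above, is to query the job by its label. Stream Load, in contrast, reports its result synchronously in the HTTP response.

```sql
-- Check the job; the State column moves through
-- PENDING, ETL, LOADING and ends at FINISHED or CANCELLED
SHOW LOAD FROM demo_db WHERE LABEL = "label_20231101";

-- Cancel the job if it is still running and needs to be stopped
CANCEL LOAD FROM demo_db WHERE LABEL = "label_20231101";
```

A label can be used for only one successful load in a database, which is what lets a failed batch be retried without importing duplicates.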
3.2.2 Real-Time Import
sql
-- Routine Load: continuous ingestion from Kafka
CREATE ROUTINE LOAD demo_db.kafka_load ON user_behavior
COLUMNS(user_id, item_id, behavior_type, ts)
PROPERTIES (
    "desired_concurrent_number" = "3",
    "max_batch_interval" = "20",
    "max_batch_rows" = "300000"
)
FROM KAFKA (
    "kafka_broker_list" = "broker1:9092,broker2:9092",
    "kafka_topic" = "user_events",
    "property.group.id" = "doris_consumer"
);
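Once created, a Routine Load job keeps running in the background, so it is worth knowing how to inspect and control it. The statements below are a short sketch that assumes the job name `demo_db.kafka_load` from the example above.

```sql
-- Inspect job state, consumed offsets and error rows
SHOW ROUTINE LOAD FOR demo_db.kafka_load;

-- Temporarily stop consuming (for example before changing the table schema)
PAUSE ROUTINE LOAD FOR demo_db.kafka_load;

-- Continue from the last committed offsets
RESUME ROUTINE LOAD FOR demo_db.kafka_load;

-- Permanently end the job; a stopped job cannot be resumed
STOP ROUTINE LOAD FOR demo_db.kafka_load;
```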
3.3 Query Examples
sql
-- Basic query: top 10 regions by sales in October 2023
SELECT
    user_region,
    SUM(sales_amount) AS total_sales
FROM sales_agg
WHERE dt BETWEEN '2023-10-01' AND '2023-10-31'
GROUP BY user_region
ORDER BY total_sales DESC
LIMIT 10;
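-- Window function: number of actions per user within each calendar hour.
-- Doris does not accept an INTERVAL offset in a RANGE frame, so the hour is
-- used as a partition key here instead of a sliding one-hour window.
SELECT
    user_id,
    ts,
    behavior_type,
    COUNT(*) OVER (PARTITION BY user_id, DATE_TRUNC(ts, 'hour')) AS hourly_actions
FROM user_behavior;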
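-- Materialized view to accelerate queries.
-- This uses the asynchronous materialized view syntax (Doris 2.1 and later),
-- refreshed on an hourly schedule. DATE_TRUNC takes the datetime first.
CREATE MATERIALIZED VIEW mv_user_behavior_hourly
BUILD IMMEDIATE REFRESH AUTO ON SCHEDULE EVERY 1 HOUR
DISTRIBUTED BY HASH(user_id) BUCKETS 10
AS
SELECT
    user_id,
    DATE_TRUNC(ts, 'hour') AS `hour`,
    COUNT(*) AS action_count,
    SUM(CASE WHEN behavior_type = 'buy' THEN 1 ELSE 0 END) AS buy_count
FROM user_behavior
GROUP BY user_id, DATE_TRUNC(ts, 'hour');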
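To check whether the materialized view actually helps, one option is to query it directly and to inspect the plan of a query on the base table. This is a sketch under the assumption that `mv_user_behavior_hourly` exists as an asynchronous materialized view in the current database; whether the optimizer rewrites a given query to use it depends on the Doris version and on the view being up to date.

```sql
-- Query the materialized view directly, like a regular table
SELECT
    user_id,
    SUM(buy_count)    AS buys,
    SUM(action_count) AS actions
FROM mv_user_behavior_hourly
GROUP BY user_id
ORDER BY buys DESC
LIMIT 10;

-- Inspect the plan of a base-table query; if transparent rewriting applies,
-- the scan node refers to the materialized view instead of user_behavior
EXPLAIN
SELECT user_id, COUNT(*) AS action_count
FROM user_behavior
GROUP BY user_id;
```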
4. Performance Optimization in Practice
4.1 Partitioning and Bucketing Strategy
sql
-- Monthly range partitions plus hash bucketing.
-- storage_cooldown_time takes an absolute timestamp (yyyy-MM-dd HH:mm:ss),
-- not a duration; the value below is a placeholder to adjust.
CREATE TABLE time_series_data (
    ts DATETIME NOT NULL,
    device_id LARGEINT NOT NULL,
    metric_value DOUBLE
)
ENGINE=OLAP
DUPLICATE KEY(ts, device_id)
PARTITION BY RANGE(ts) (
    PARTITION p202301 VALUES LESS THAN ('2023-02-01'),
    PARTITION p202302 VALUES LESS THAN ('2023-03-01'),
    PARTITION p202303 VALUES LESS THAN ('2023-04-01')
)
DISTRIBUTED BY HASH(device_id) BUCKETS 32
PROPERTIES (
    "replication_num" = "3",
    "storage_medium" = "SSD",
    "storage_cooldown_time" = "2030-01-01 00:00:00"
);
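A table with fixed range partitions rejects rows whose `ts` falls outside every partition, so new partitions have to be added as time moves on. The sketch below shows the manual approach and, as an alternative, dynamic partitioning. The property values (three months ahead, prefix `p`, 32 buckets) are assumptions chosen to match the table above, not recommendations.

```sql
-- Manually add the next monthly partition
ALTER TABLE time_series_data
ADD PARTITION p202304 VALUES LESS THAN ('2023-05-01');

-- Review partition ranges, bucket counts and data size
SHOW PARTITIONS FROM time_series_data;

-- Alternative: let Doris create upcoming monthly partitions automatically
ALTER TABLE time_series_data SET (
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "MONTH",
    "dynamic_partition.end" = "3",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "32"
);
```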
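4.2 Index Optimization
sql
-- Add an inverted index (available in Doris 2.0 and later)
ALTER TABLE user_behavior
ADD INDEX idx_behavior_type (behavior_type) USING INVERTED;
-- Add a Bloom Filter index.
-- Bloom filters are configured through the bloom_filter_columns table
-- property, not through ADD INDEX.
ALTER TABLE sales_agg
SET ("bloom_filter_columns" = "product_id");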
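After changing indexes it helps to confirm what the table now carries. The statements below are a sketch that assumes the index name `idx_behavior_type` from the example above. In recent Doris versions an inverted index added with `ADD INDEX` covers only newly written data until `BUILD INDEX` is run, so check the documentation for your version before relying on this step.

```sql
-- List the indexes defined on the table
SHOW INDEX FROM user_behavior;

-- Build the inverted index for data that already exists (Doris 2.0 and later)
BUILD INDEX idx_behavior_type ON user_behavior;

-- The Bloom filter setting appears under PROPERTIES as bloom_filter_columns
SHOW CREATE TABLE sales_agg;

-- Equality filters of this kind are the ones these indexes are meant to speed up
SELECT COUNT(*) FROM user_behavior WHERE behavior_type = 'buy';
```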
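4.3 Query Optimization Techniques
sql
-- Partition pruning: filter on the partition column so only p202303 is scanned
SELECT * FROM time_series_data
WHERE ts BETWEEN '2023-03-15' AND '2023-03-20';
-- Bucket pruning: an equality filter on the bucketing column reads a single bucket
SELECT * FROM user_behavior
WHERE user_id = 10086;
-- Colocate Group: tables in the same group must share the bucketing column
-- type, bucket count and replica count so that joins can run locally
CREATE TABLE colocate_table (
    user_id BIGINT,
    item_id BIGINT
)
DISTRIBUTED BY HASH(user_id) BUCKETS 10
PROPERTIES (
    "colocate_with" = "user_group"
);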
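A colocate group only pays off when a second table joins on the same bucketing column. The following sketch adds a hypothetical table `colocate_orders` to the same group and checks the join plan. The table and its columns are illustrative assumptions, and the exact wording of the plan output varies between Doris versions.

```sql
-- Hypothetical second table in the same group:
-- same bucketing column type (BIGINT) and the same bucket count (10)
CREATE TABLE colocate_orders (
    user_id  BIGINT,
    order_id BIGINT
)
DISTRIBUTED BY HASH(user_id) BUCKETS 10
PROPERTIES (
    "colocate_with" = "user_group"
);

-- Inspect the plan; the join node should be marked as a colocate join,
-- meaning no data is shuffled between BEs
EXPLAIN
SELECT t.user_id, t.item_id, o.order_id
FROM colocate_table t
JOIN colocate_orders o ON t.user_id = o.user_id;

-- Check the group and whether it is currently stable
SHOW PROC '/colocation_group';
```

If the group is reported as unstable (for example while replicas are being repaired or rebalanced), the join falls back to a regular shuffle or broadcast join until the group is stable again.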
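Parts of the technical description in this article are based on the official Apache Doris documentation [1] and on widely accepted community practice; the SQL syntax examples are adapted from the open-source project's documentation.
[1] Official documentation: https://doris.apache.org/docs/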