Data Warehouse Notes, Part 4: The Star Schema Layer (Dimensional Modeling)

What Is a Star Schema?

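A star schema is the table layout at the heart of Ralph Kimball's dimensional modeling method, and it forms the core analysis-facing layer of a data warehouse.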

These notes use SQL Server, and all example scripts are written for it. Implementations on other databases will differ slightly.

A star schema is shaped like a star:

                    ┌─────────────┐
                    │ Fact table  │  ← center of the star
                    │ FACT_XXX    │    stores business measures (sales amount, quantity, ...)
                    │             │    foreign keys point to the dimension tables
                    └──────┬──────┘
                           │
         ┌─────────────────┼─────────────────┐
         │                 │                 │
         ↓                 ↓                 ↓
    ┌─────────┐      ┌─────────┐      ┌─────────┐
    │  Date   │      │Customer │      │ Product │  ← dimension tables (points of the star)
    │DIM_DATE │      │DIM_CUST │      │DIM_PROD │    describe who / what / when / where
    └─────────┘      └─────────┘      └─────────┘

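Many Power BI tutorials demonstrate pulling data directly from this layer, that is, from the fact tables and dimension tables. I generally recommend reading from the next layer instead; see the next article, on the data mart layer, for details.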


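Core Concepts

Fact Table

A fact table stores the measurable numeric values of the business (measures) and is the core of data analysis.

Characteristics:
  ✓ Mostly numeric measures (sales amount, order quantity, profit margin)
  ✓ Contains multiple foreign keys that point to dimension tables
  ✓ Usually very large (millions to hundreds of millions of rows)
  ✓ Supports aggregation (SUM, COUNT, AVG, and so on)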

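Common fact table types:

| Type | Description | Examples |
|---|---|---|
| Transaction fact table | One row per business event | Sales orders, return records |
| Periodic snapshot table | A snapshot taken at a fixed interval | Daily inventory balance, month-end account balance |
| Accumulating snapshot table | Tracks a process from start to finish | Order lifecycle (ordered → paid → shipped → delivered) |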

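Dimension Table

A dimension table supplies the perspectives from which the business is analyzed (who, what, when, where, why).

Characteristics:
  ✓ Mostly descriptive text attributes
  ✓ Relatively few rows (typically hundreds to tens of thousands)
  ✓ Rich set of descriptive attributes
  ✓ Shared by multiple fact tables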

SQL in Practice: Building the Star Schema Layer

Date Dimension Table

-- ============================================================
-- Create the dimension tables in star_db
-- ============================================================

USE star_db;
GO

-- Date dimension table
IF OBJECT_ID('dbo.dim_date', 'U') IS NOT NULL DROP TABLE dbo.dim_date;
GO

CREATE TABLE dbo.dim_date (
    date_key         INT           NOT NULL PRIMARY KEY,  -- format: YYYYMMDD
    date_value       DATE          NOT NULL UNIQUE,
    year             SMALLINT      NOT NULL,
    quarter          SMALLINT      NOT NULL,              -- 1-4
    quarter_name     VARCHAR(10)   NOT NULL,              -- Q1/Q2/Q3/Q4
    month            SMALLINT      NOT NULL,              -- 1-12
    month_name       NVARCHAR(20)  NOT NULL,
    month_name_short VARCHAR(10)   NOT NULL,
    week_of_year     SMALLINT      NOT NULL,
    day_of_week      SMALLINT      NOT NULL,              -- 1-7
    day_name         NVARCHAR(10)  NOT NULL,
    day_of_month     SMALLINT      NOT NULL,              -- 1-31
    day_of_year      SMALLINT      NOT NULL,              -- 1-366
    is_weekend       BIT           NOT NULL,
    is_holiday       BIT           DEFAULT 0,
    fiscal_year      SMALLINT      NULL,
    fiscal_quarter   SMALLINT      NULL
);
GO

CREATE NONCLUSTERED INDEX idx_dim_date_year ON dbo.dim_date(year);
CREATE NONCLUSTERED INDEX idx_dim_date_month ON dbo.dim_date(year, month);
GO

Populate the date dimension (2023-01-01 to 2025-12-31):

-- ============================================================
-- Populate the date dimension (stored procedure with a loop)
-- ============================================================

USE star_db;
GO

IF OBJECT_ID('dbo.sp_populate_dim_date', 'P') IS NOT NULL
    DROP PROCEDURE dbo.sp_populate_dim_date;
GO

CREATE PROCEDURE dbo.sp_populate_dim_date
    @start_date DATE = '2023-01-01',
    @end_date DATE = '2025-12-31'
AS
BEGIN
    SET NOCOUNT ON;
    SET DATEFIRST 7;  -- pin Sunday = 1 so day_of_week and is_weekend do not depend on session settings
    
    DECLARE @cur_date DATE = @start_date;
    DECLARE @y INT, @m INT, @q INT;
    
    WHILE @cur_date <= @end_date
    BEGIN
        SET @y = YEAR(@cur_date);
        SET @m = MONTH(@cur_date);
        SET @q = CAST(CEILING(CAST(@m AS FLOAT) / 3) AS SMALLINT);
        
        INSERT INTO dbo.dim_date (
            date_key, date_value, year, quarter, quarter_name,
            month, month_name, month_name_short,
            week_of_year, day_of_week, day_name,
            day_of_month, day_of_year, is_weekend,
            fiscal_year, fiscal_quarter
        )
        VALUES (
            @y * 10000 + @m * 100 + DAY(@cur_date),   -- date_key: YYYYMMDD
            @cur_date,
            @y,
            @q,
            'Q' + CAST(@q AS VARCHAR),
            @m,
            DATENAME(month, @cur_date),                -- January, February...
            LEFT(DATENAME(month, @cur_date), 3),       -- Jan, Feb...
            DATEPART(week, @cur_date),
            DATEPART(weekday, @cur_date),              -- 1=Sunday, 7=Saturday (when DATEFIRST = 7, the us_english default)
            DATENAME(weekday, @cur_date),              -- Monday, Tuesday...
            DAY(@cur_date),
            DATEPART(dayofyear, @cur_date),
            CASE WHEN DATEPART(weekday, @cur_date) IN (1, 7) THEN 1 ELSE 0 END,
            @y,                                       -- fiscal year defaults to the calendar year
            @q
        );
        
        SET @cur_date = DATEADD(day, 1, @cur_date);
    END;
    
    PRINT N'Date dimension populated: ' + CAST(@start_date AS VARCHAR(10)) + N' ~ ' + CAST(@end_date AS VARCHAR(10));
END;
GO

-- Run the population procedure
EXEC dbo.sp_populate_dim_date;
GO

-- Verify (the default range 2023-01-01 to 2025-12-31 should yield 1,096 rows)
SELECT COUNT(*) AS total_dates,
       MIN(date_value) AS min_date,
       MAX(date_value) AS max_date
FROM dbo.dim_date;
GO

Customer Dimension Table (SCD Type 2)

USE star_db;
GO

-- Customer dimension table
IF OBJECT_ID('dbo.dim_customer', 'U') IS NOT NULL DROP TABLE dbo.dim_customer;
GO

CREATE TABLE dbo.dim_customer (
    customer_key     BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_id      VARCHAR(50)     NOT NULL,
    -- SCD tracking columns
    load_date        DATETIME        NOT NULL,
    end_date         DATETIME        NULL,               -- NULL = currently valid
    is_current       BIT             DEFAULT 1,
    -- Business attributes
    customer_name    NVARCHAR(100)   NOT NULL,
    customer_type    VARCHAR(20)     NULL,                -- individual/enterprise
    email            VARCHAR(100)    NULL,
    phone            VARCHAR(20)     NULL,
    address          NVARCHAR(200)   NULL,
    city             NVARCHAR(50)    NULL,
    region           NVARCHAR(50)    NULL,
    country          NVARCHAR(50)    DEFAULT N'中国',
    register_date    DATE            NULL,
    is_active        BIT             NULL
);
GO

CREATE NONCLUSTERED INDEX idx_dim_customer_id ON dbo.dim_customer(customer_id);
CREATE NONCLUSTERED INDEX idx_dim_customer_curr ON dbo.dim_customer(customer_id)
    WHERE is_current = 1;
GO
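
With load_date, end_date, and is_current in place, a point-in-time lookup becomes possible. The query below is a minimal sketch rather than part of the original scripts: the customer ID 'C001' and the timestamp are made-up sample values, and TOP 1 is there only as a guard in case two version intervals overlap.

```sql
USE star_db;
GO

-- Sketch: which version of a customer was in effect at a given moment?
-- 'C001' and the timestamp are hypothetical sample values.
DECLARE @customer_id VARCHAR(50) = 'C001';
DECLARE @as_of       DATETIME    = '2024-06-30T00:00:00';

SELECT TOP 1
    customer_key, customer_name, city, region, load_date, end_date
FROM dbo.dim_customer
WHERE customer_id = @customer_id
  AND load_date <= @as_of
  AND (end_date IS NULL OR end_date > @as_of)   -- NULL end_date = still current
ORDER BY load_date DESC;
GO
```

If the query returns no row, no version of that customer had been loaded yet at the requested time.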

Product Dimension Table (SCD Type 2)

USE star_db;
GO

-- Product dimension table
IF OBJECT_ID('dbo.dim_product', 'U') IS NOT NULL DROP TABLE dbo.dim_product;
GO

CREATE TABLE dbo.dim_product (
    product_key      BIGINT IDENTITY(1,1) PRIMARY KEY,
    product_id       VARCHAR(50)     NOT NULL,
    load_date        DATETIME        NOT NULL,
    end_date         DATETIME        NULL,
    is_current       BIT             DEFAULT 1,
    product_name     NVARCHAR(200)   NOT NULL,
    category         NVARCHAR(50)    NOT NULL,
    sub_category     NVARCHAR(50)    NULL,
    brand            NVARCHAR(50)    NULL,
    supplier_id      VARCHAR(50)     NULL,
    unit_cost        DECIMAL(10,2)   NULL,
    unit_price       DECIMAL(10,2)   NULL,
    is_active        BIT             NULL
);
GO

CREATE NONCLUSTERED INDEX idx_dim_product_id ON dbo.dim_product(product_id);
CREATE NONCLUSTERED INDEX idx_dim_product_curr ON dbo.dim_product(product_id)
    WHERE is_current = 1;
CREATE NONCLUSTERED INDEX idx_dim_product_cat ON dbo.dim_product(category)
    WHERE is_current = 1;
GO

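Order Status Dimension Table (Small Lookup Dimension)

USE star_db;
GO

-- Order status dimension: a small lookup dimension with its own table.
-- (The degenerate dimension in this model is order_id, which lives on the fact table.)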
IF OBJECT_ID('dbo.dim_order_status', 'U') IS NOT NULL DROP TABLE dbo.dim_order_status;
GO

CREATE TABLE dbo.dim_order_status (
    status_key       INT             NOT NULL PRIMARY KEY,
    status_code      VARCHAR(20)     NOT NULL UNIQUE,
    status_name      NVARCHAR(50)    NOT NULL,
    is_active_status BIT             NOT NULL  -- whether the order is still in an active (open) state
);
GO

INSERT INTO dbo.dim_order_status (status_key, status_code, status_name, is_active_status) VALUES
(1, 'pending',   N'待处理',   1),
(2, 'confirmed', N'已确认',   1),
(3, 'shipped',   N'已发货',   1),
(4, 'cancelled', N'已取消',   0),
(5, 'returned',  N'已退货',   0);
GO

Sales Fact Table

USE star_db;
GO

-- Sales fact table (transaction fact table)
IF OBJECT_ID('dbo.fact_sales', 'U') IS NOT NULL DROP TABLE dbo.fact_sales;
GO

CREATE TABLE dbo.fact_sales (
    fact_sales_id    BIGINT IDENTITY(1,1) PRIMARY KEY,
    -- Surrogate keys (reference the dimension tables)
    date_key         INT             NOT NULL,
    customer_key     BIGINT          NULL,
    product_key      BIGINT          NULL,
    status_key       INT             NULL,
    -- Degenerate dimension
    order_id         VARCHAR(50)     NOT NULL,
    -- Measures
    quantity         INT             NOT NULL,
    unit_price       DECIMAL(10,2)   NOT NULL,
    total_amount     DECIMAL(14,2)   NOT NULL,
    discount_amount  DECIMAL(10,2)   DEFAULT 0,
    -- Cost
    unit_cost        DECIMAL(10,2)   NULL,
    total_cost       DECIMAL(14,2)   NULL,
    -- Derived measures
    gross_profit     DECIMAL(14,2)   NULL,    -- total_amount - total_cost
    profit_margin    DECIMAL(5,2)    NULL,    -- (gross_profit / total_amount) * 100
    -- Metadata
    etl_batch_id     VARCHAR(50)     NULL,
    load_time        DATETIME        DEFAULT GETDATE()
);
GO

CREATE NONCLUSTERED INDEX idx_fact_sales_date ON dbo.fact_sales(date_key);
CREATE NONCLUSTERED INDEX idx_fact_sales_customer ON dbo.fact_sales(customer_key)
    WHERE customer_key IS NOT NULL;
CREATE NONCLUSTERED INDEX idx_fact_sales_product ON dbo.fact_sales(product_key)
    WHERE product_key IS NOT NULL;
CREATE NONCLUSTERED INDEX idx_fact_sales_order ON dbo.fact_sales(order_id);
GO
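
The fact table DDL above declares no FOREIGN KEY constraints and allows NULL in customer_key, product_key, and status_key, so nothing prevents a fact row from pointing at a missing dimension row. A minimal data-quality sketch, assuming the tables are created exactly as shown above, counts such rows after a load:

```sql
USE star_db;
GO

-- Count fact rows whose dimension keys are NULL or do not resolve to a dimension row.
-- On an empty fact table every column returns NULL rather than 0.
SELECT
    SUM(CASE WHEN d.date_key     IS NULL THEN 1 ELSE 0 END) AS missing_date,
    SUM(CASE WHEN c.customer_key IS NULL THEN 1 ELSE 0 END) AS missing_customer,
    SUM(CASE WHEN p.product_key  IS NULL THEN 1 ELSE 0 END) AS missing_product,
    SUM(CASE WHEN s.status_key   IS NULL THEN 1 ELSE 0 END) AS missing_status
FROM dbo.fact_sales f
LEFT JOIN dbo.dim_date         d ON f.date_key     = d.date_key
LEFT JOIN dbo.dim_customer     c ON f.customer_key = c.customer_key
LEFT JOIN dbo.dim_product      p ON f.product_key  = p.product_key
LEFT JOIN dbo.dim_order_status s ON f.status_key   = s.status_key;
GO
```

Non-zero counts usually mean an order date outside the populated dim_date range, or a business key that had not reached the dimension when the fact row was loaded.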

Star Schema ETL Process

Customer Dimension Load (SCD Type 2)

USE star_db;
GO

IF OBJECT_ID('dbo.sp_load_dim_customer', 'P') IS NOT NULL
    DROP PROCEDURE dbo.sp_load_dim_customer;
GO

CREATE PROCEDURE dbo.sp_load_dim_customer
    @batch_id VARCHAR(50)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @start_time DATETIME = GETDATE();
    DECLARE @rows_inserted BIGINT = 0;
    DECLARE @rows_updated BIGINT = 0;
    DECLARE @error_msg NVARCHAR(MAX);
    
    BEGIN TRY
        INSERT INTO etl_db.dbo.etl_log (
            batch_id, layer_name, db_name, table_name, start_time, status
        ) VALUES (
            @batch_id, 'star', 'star_db', 'dim_customer', @start_time, 'RUNNING'
        );
        
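        -- Step 1: close out the old version of records that have changed
        -- Take the latest customer data from the PSA and compare it with the current dimension rows
        -- (only name, city, region and address are tracked for changes)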
        UPDATE d
        SET d.end_date = GETDATE(), d.is_current = 0
        FROM dbo.dim_customer d
        INNER JOIN (
            SELECT customer_id, customer_name, city, region, address,
                customer_type, is_active, email, phone,
                ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY psa_load_time DESC) AS rn
            FROM psa_db.dbo.customers
        ) p ON d.customer_id = p.customer_id
        WHERE d.is_current = 1
          AND d.end_date IS NULL
          AND p.rn = 1
          AND (
              d.customer_name <> p.customer_name
              OR ISNULL(d.city, '') <> ISNULL(p.city, '')
              OR ISNULL(d.region, '') <> ISNULL(p.region, '')
              OR ISNULL(d.address, '') <> ISNULL(p.address, '')
          );
        
        SET @rows_updated = @@ROWCOUNT;
        
        -- Step 2: insert new versions and brand-new records
        INSERT INTO dbo.dim_customer (
            customer_id, load_date, end_date, is_current,
            customer_name, customer_type, email, phone,
            address, city, region, country,
            register_date, is_active
        )
        SELECT 
            p.customer_id, p.psa_load_time, NULL, 1,
            p.customer_name, p.customer_type, p.email, p.phone,
            p.address, p.city, p.region, N'中国',
            p.register_date, p.is_active
        FROM (
            SELECT customer_id, customer_name, customer_type, email, phone,
                address, city, region, register_date, is_active, psa_load_time,
                ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY psa_load_time DESC) AS rn
            FROM psa_db.dbo.customers
        ) p
        WHERE p.rn = 1
          AND NOT EXISTS (
              SELECT 1 FROM dbo.dim_customer d
              WHERE d.customer_id = p.customer_id
                AND d.is_current = 1
                AND d.customer_name = p.customer_name
                AND ISNULL(d.city, '') = ISNULL(p.city, '')
                AND ISNULL(d.region, '') = ISNULL(p.region, '')
                AND ISNULL(d.address, '') = ISNULL(p.address, '')
          );
        
        SET @rows_inserted = @@ROWCOUNT;
        
        UPDATE etl_db.dbo.etl_log
        SET end_time = GETDATE(), rows_inserted = @rows_inserted,
            rows_updated = @rows_updated, status = 'SUCCESS'
        WHERE batch_id = @batch_id AND db_name = 'star_db' AND table_name = 'dim_customer';
        
        PRINT N'dim_customer: inserted ' + CAST(@rows_inserted AS VARCHAR(20))
            + N' rows, closed ' + CAST(@rows_updated AS VARCHAR(20)) + N' old versions';
    END TRY
    BEGIN CATCH
        SET @error_msg = ERROR_MESSAGE();
        UPDATE etl_db.dbo.etl_log
        SET end_time = GETDATE(), status = 'FAILED', error_message = @error_msg
        WHERE batch_id = @batch_id AND db_name = 'star_db' AND table_name = 'dim_customer';
        THROW;
    END CATCH;
END;
GO

Product Dimension Load (SCD Type 2)

USE star_db;
GO

IF OBJECT_ID('dbo.sp_load_dim_product', 'P') IS NOT NULL
    DROP PROCEDURE dbo.sp_load_dim_product;
GO

CREATE PROCEDURE dbo.sp_load_dim_product
    @batch_id VARCHAR(50)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @start_time DATETIME = GETDATE();
    DECLARE @rows_inserted BIGINT = 0;
    DECLARE @rows_updated BIGINT = 0;
    DECLARE @error_msg NVARCHAR(MAX);
    
    BEGIN TRY
        INSERT INTO etl_db.dbo.etl_log (
            batch_id, layer_name, db_name, table_name, start_time, status
        ) VALUES (
            @batch_id, 'star', 'star_db', 'dim_product', @start_time, 'RUNNING'
        );
        
        -- Step 1: close out old versions (tracked attributes: name, unit price, unit cost)
        UPDATE d
        SET d.end_date = GETDATE(), d.is_current = 0
        FROM dbo.dim_product d
        INNER JOIN (
            SELECT product_id, product_name, category, sub_category, brand,
                unit_cost, unit_price, supplier_id, is_active,
                ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY psa_load_time DESC) AS rn
            FROM psa_db.dbo.products
        ) p ON d.product_id = p.product_id
        WHERE d.is_current = 1
          AND d.end_date IS NULL
          AND p.rn = 1
          AND (
              d.product_name <> p.product_name
              OR ISNULL(d.unit_price, 0) <> ISNULL(p.unit_price, 0)
              OR ISNULL(d.unit_cost, 0) <> ISNULL(p.unit_cost, 0)
          );
        
        SET @rows_updated = @@ROWCOUNT;
        
        -- Step 2: insert new versions and brand-new records
        INSERT INTO dbo.dim_product (
            product_id, load_date, end_date, is_current,
            product_name, category, sub_category, brand,
            supplier_id, unit_cost, unit_price, is_active
        )
        SELECT 
            p.product_id, p.psa_load_time, NULL, 1,
            p.product_name, p.category, p.sub_category, p.brand,
            p.supplier_id, p.unit_cost, p.unit_price, p.is_active
        FROM (
            SELECT product_id, product_name, category, sub_category, brand,
                unit_cost, unit_price, supplier_id, is_active, psa_load_time,
                ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY psa_load_time DESC) AS rn
            FROM psa_db.dbo.products
        ) p
        WHERE p.rn = 1
          AND NOT EXISTS (
              SELECT 1 FROM dbo.dim_product d
              WHERE d.product_id = p.product_id
                AND d.is_current = 1
                AND d.product_name = p.product_name
                AND ISNULL(d.unit_price, 0) = ISNULL(p.unit_price, 0)
                AND ISNULL(d.unit_cost, 0) = ISNULL(p.unit_cost, 0)
          );
        
        SET @rows_inserted = @@ROWCOUNT;
        
        UPDATE etl_db.dbo.etl_log
        SET end_time = GETDATE(), rows_inserted = @rows_inserted,
            rows_updated = @rows_updated, status = 'SUCCESS'
        WHERE batch_id = @batch_id AND db_name = 'star_db' AND table_name = 'dim_product';
        
        PRINT N'dim_product: inserted ' + CAST(@rows_inserted AS VARCHAR(20))
            + N' rows, closed ' + CAST(@rows_updated AS VARCHAR(20)) + N' old versions';
    END TRY
    BEGIN CATCH
        SET @error_msg = ERROR_MESSAGE();
        UPDATE etl_db.dbo.etl_log
        SET end_time = GETDATE(), status = 'FAILED', error_message = @error_msg
        WHERE batch_id = @batch_id AND db_name = 'star_db' AND table_name = 'dim_product';
        THROW;
    END CATCH;
END;
GO

Sales Fact Table Load

USE star_db;
GO

IF OBJECT_ID('dbo.sp_load_fact_sales', 'P') IS NOT NULL
    DROP PROCEDURE dbo.sp_load_fact_sales;
GO

CREATE PROCEDURE dbo.sp_load_fact_sales
    @batch_id VARCHAR(50)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @start_time DATETIME = GETDATE();
    DECLARE @rows_inserted BIGINT = 0;
    DECLARE @error_msg NVARCHAR(MAX);
    
    BEGIN TRY
        INSERT INTO etl_db.dbo.etl_log (
            batch_id, layer_name, db_name, table_name, start_time, status
        ) VALUES (
            @batch_id, 'star', 'star_db', 'fact_sales', @start_time, 'RUNNING'
        );
        
        -- Incremental load: insert only orders that are not yet in the fact table
        INSERT INTO dbo.fact_sales (
            date_key, customer_key, product_key, status_key,
            order_id,
            quantity, unit_price, total_amount, discount_amount,
            unit_cost, total_cost,
            gross_profit, profit_margin,
            etl_batch_id, load_time
        )
        SELECT
            -- Date key
            YEAR(o.order_date) * 10000 + MONTH(o.order_date) * 100 + DAY(o.order_date),
            -- Customer surrogate key (current version)
            (SELECT MAX(c.customer_key) FROM dbo.dim_customer c
             WHERE c.customer_id = o.customer_id AND c.is_current = 1),
            -- Product surrogate key (current version)
            (SELECT MAX(p.product_key) FROM dbo.dim_product p
             WHERE p.product_id = o.product_id AND p.is_current = 1),
            -- Status key
            (SELECT s.status_key FROM dbo.dim_order_status s
             WHERE s.status_code = o.status),
            -- Order number (degenerate dimension)
            o.order_id,
            -- Measures
            o.quantity,
            o.unit_price,
            o.total_amount,
            0,
            -- Cost (taken from the product dimension)
            (SELECT TOP 1 dp.unit_cost FROM dbo.dim_product dp
             WHERE dp.product_id = o.product_id AND dp.is_current = 1),
            (SELECT TOP 1 ISNULL(dp.unit_cost, 0) FROM dbo.dim_product dp
             WHERE dp.product_id = o.product_id AND dp.is_current = 1) * o.quantity,
            -- Derived measures
            o.total_amount - (
                (SELECT TOP 1 ISNULL(dp.unit_cost, 0) FROM dbo.dim_product dp
                 WHERE dp.product_id = o.product_id AND dp.is_current = 1) * o.quantity
            ),
            CASE
                WHEN o.total_amount > 0 THEN
                    CAST(ROUND(((
                        o.total_amount - (
                            (SELECT TOP 1 ISNULL(dp.unit_cost, 0) FROM dbo.dim_product dp
                             WHERE dp.product_id = o.product_id AND dp.is_current = 1) * o.quantity
                        )
                    ) / o.total_amount * 100), 2) AS DECIMAL(5,2))
                ELSE 0
            END,
            -- Metadata
            @batch_id,
            GETDATE()
        FROM (
            -- Latest version of each order in the PSA
            SELECT order_id, customer_id, product_id, order_date,
                quantity, unit_price, total_amount, status, psa_load_time,
                ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY psa_load_time DESC) AS rn
            FROM psa_db.dbo.orders
        ) o
        WHERE o.rn = 1
          AND NOT EXISTS (
              SELECT 1 FROM dbo.fact_sales f WHERE f.order_id = o.order_id
          );
        
        SET @rows_inserted = @@ROWCOUNT;
        
        -- Update the status of existing orders when it has changed.
        -- Joining dim_order_status skips unknown status codes, and the IS NULL
        -- check also repairs rows that were loaded before their status resolved
        -- (a plain <> comparison never matches a NULL status_key).
        UPDATE f
        SET f.status_key = s.status_key,
            f.etl_batch_id = @batch_id,
            f.load_time = GETDATE()
        FROM dbo.fact_sales f
        INNER JOIN (
            SELECT order_id, status,
                ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY psa_load_time DESC) AS rn
            FROM psa_db.dbo.orders
        ) o ON f.order_id = o.order_id AND o.rn = 1
        INNER JOIN dbo.dim_order_status s ON s.status_code = o.status
        WHERE f.status_key IS NULL OR f.status_key <> s.status_key;
        
        UPDATE etl_db.dbo.etl_log
        SET end_time = GETDATE(), rows_inserted = @rows_inserted, status = 'SUCCESS'
        WHERE batch_id = @batch_id AND db_name = 'star_db' AND table_name = 'fact_sales';
        
        PRINT N'fact_sales loaded: ' + CAST(@rows_inserted AS VARCHAR(20)) + N' rows';
    END TRY
    BEGIN CATCH
        SET @error_msg = ERROR_MESSAGE();
        UPDATE etl_db.dbo.etl_log
        SET end_time = GETDATE(), status = 'FAILED', error_message = @error_msg
        WHERE batch_id = @batch_id AND db_name = 'star_db' AND table_name = 'fact_sales';
        THROW;
    END CATCH;
END;
GO
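
The load above resolves customer_key and product_key against the current dimension version. When orders arrive late, or history is reloaded, the version that was in effect on the order date may be a different row. The query below is a hedged sketch of a point-in-time lookup, not the procedure's actual logic. It assumes that order_date is comparable with the DATETIME columns load_date and end_date, and that version intervals for one customer do not overlap (TOP 1 picks the latest if they do).

```sql
USE star_db;
GO

-- Sketch: resolve the customer version that was valid on each order's date.
SELECT
    o.order_id,
    o.order_date,
    v.customer_key          -- NULL when no version covers order_date
FROM (
    -- Latest version of each order in the PSA, as in sp_load_fact_sales
    SELECT order_id, customer_id, order_date,
        ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY psa_load_time DESC) AS rn
    FROM psa_db.dbo.orders
) o
OUTER APPLY (
    SELECT TOP 1 dc.customer_key
    FROM dbo.dim_customer dc
    WHERE dc.customer_id = o.customer_id
      AND dc.load_date <= o.order_date
      AND (dc.end_date IS NULL OR dc.end_date > o.order_date)
    ORDER BY dc.load_date DESC
) v
WHERE o.rn = 1;
GO
```

Orders dated before a customer's first load_date come back with a NULL key, so a real load would need a fallback such as the earliest or the current version.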

Star Schema Master ETL Orchestration

USE star_db;
GO

IF OBJECT_ID('dbo.sp_run_star_etl', 'P') IS NOT NULL
    DROP PROCEDURE dbo.sp_run_star_etl;
GO

CREATE PROCEDURE dbo.sp_run_star_etl
    @batch_id VARCHAR(50)
AS
BEGIN
    SET NOCOUNT ON;
    PRINT N'=== Star Schema ETL started ===';
    PRINT N'Batch ID: ' + @batch_id;

    -- Load dimensions first (the fact table depends on their surrogate keys)
    EXEC dbo.sp_load_dim_customer @batch_id;
    EXEC dbo.sp_load_dim_product @batch_id;

    -- Then load the fact table
    EXEC dbo.sp_load_fact_sales @batch_id;

    PRINT N'=== Star Schema ETL finished ===';
END;
GO

Run the ETL and Verify

-- ============================================================
-- Run the Star Schema ETL
-- ============================================================

DECLARE @batch_id VARCHAR(50);
SET @batch_id = 'BATCH_STAR_' + CONVERT(VARCHAR(8), GETDATE(), 112) + '_' 
                + REPLACE(CONVERT(VARCHAR(8), GETDATE(), 108), ':', '');

EXEC star_db.dbo.sp_run_star_etl @batch_id;
GO

-- ============================================================
-- Verify the results
-- ============================================================

USE star_db;
GO

-- View the ETL log
SELECT batch_id, table_name, start_time, end_time,
    DATEDIFF(second, start_time, end_time) AS duration_sec,
    rows_inserted, rows_updated, status
FROM etl_db.dbo.etl_log
WHERE layer_name = 'star'
ORDER BY start_time;
GO

-- Row counts per table
SELECT 'dim_date' AS tbl, COUNT(*) AS cnt FROM dbo.dim_date
UNION ALL SELECT 'dim_customer', COUNT(*) FROM dbo.dim_customer
UNION ALL SELECT 'dim_product', COUNT(*) FROM dbo.dim_product
UNION ALL SELECT 'dim_order_status', COUNT(*) FROM dbo.dim_order_status
UNION ALL SELECT 'fact_sales', COUNT(*) FROM dbo.fact_sales;
GO

-- Sample rows from the fact table
SELECT TOP 5
    f.order_id,
    d.date_value AS order_date,
    c.customer_name,
    p.product_name,
    f.quantity,
    f.total_amount,
    f.gross_profit,
    f.profit_margin
FROM dbo.fact_sales f
LEFT JOIN dbo.dim_date d ON f.date_key = d.date_key
LEFT JOIN dbo.dim_customer c ON f.customer_key = c.customer_key
LEFT JOIN dbo.dim_product p ON f.product_key = p.product_key
ORDER BY f.fact_sales_id;
GO

Star Schema Analysis Queries

-- ============================================================
-- Q1: Monthly sales trend
-- ============================================================

USE star_db;
GO

SELECT
    d.year,
    d.month,
    d.month_name_short,
    SUM(f.total_amount) AS total_sales,
    SUM(f.quantity) AS total_quantity,
    COUNT(DISTINCT f.order_id) AS order_count,
    ROUND(AVG(f.profit_margin), 2) AS avg_margin
FROM dbo.fact_sales f
INNER JOIN dbo.dim_date d ON f.date_key = d.date_key
GROUP BY d.year, d.month, d.month_name_short
ORDER BY d.year, d.month;
GO

-- ============================================================
-- Q2: Sales by customer type
-- ============================================================

SELECT
    c.customer_type,
    COUNT(DISTINCT f.order_id) AS order_count,
    SUM(f.total_amount) AS total_sales,
    ROUND(AVG(f.profit_margin), 2) AS avg_margin
FROM dbo.fact_sales f
INNER JOIN dbo.dim_customer c ON f.customer_key = c.customer_key
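-- No is_current filter: each fact row keeps the key of the customer version it was
-- loaded with, so filtering to current rows would drop sales tied to closed versions.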
GROUP BY c.customer_type
ORDER BY total_sales DESC;
GO

-- ============================================================
-- Q3: Sales ranking by product category
-- ============================================================

SELECT
    p.category,
    p.product_name,
    COUNT(DISTINCT f.order_id) AS order_count,
    SUM(f.quantity) AS total_quantity,
    SUM(f.total_amount) AS total_sales,
    SUM(f.gross_profit) AS total_profit
FROM dbo.fact_sales f
INNER JOIN dbo.dim_product p ON f.product_key = p.product_key
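-- No is_current filter: each fact row keeps the key of the product version it was
-- loaded with, so filtering to current rows would drop sales tied to closed versions.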
GROUP BY p.category, p.product_name
ORDER BY total_sales DESC;
GO

-- ============================================================
-- Q4: Sales by region
-- ============================================================

SELECT
    c.region,
    COUNT(DISTINCT f.order_id) AS order_count,
    SUM(f.total_amount) AS total_sales
FROM dbo.fact_sales f
INNER JOIN dbo.dim_customer c ON f.customer_key = c.customer_key
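-- No is_current filter: each fact row keeps the key of the customer version it was
-- loaded with, so filtering to current rows would drop sales tied to closed versions.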
GROUP BY c.region
ORDER BY total_sales DESC;
GO

-- ============================================================
-- Q5: Order distribution by order status
-- ============================================================

SELECT
    s.status_name,
    COUNT(*) AS order_count,
    SUM(f.total_amount) AS total_sales
FROM dbo.fact_sales f
INNER JOIN dbo.dim_order_status s ON f.status_key = s.status_key
GROUP BY s.status_name
ORDER BY order_count DESC;
GO
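
Because dim_customer is an SCD Type 2 dimension, a fact row's customer_key can point at a historical version of the customer. To report every sale under the customer's current attributes ("as-is" reporting), one option is to hop from the historical row to the current row through the business key. The sketch below assumes exactly one is_current = 1 row per customer_id, which is what the load procedure aims to maintain.

```sql
USE star_db;
GO

-- Sketch: sales by region using each customer's CURRENT region,
-- regardless of which version the fact row was loaded with.
SELECT
    cur.region,
    COUNT(DISTINCT f.order_id) AS order_count,
    SUM(f.total_amount)        AS total_sales
FROM dbo.fact_sales f
INNER JOIN dbo.dim_customer hist ON f.customer_key = hist.customer_key
INNER JOIN dbo.dim_customer cur  ON cur.customer_id = hist.customer_id
                                AND cur.is_current = 1
GROUP BY cur.region
ORDER BY total_sales DESC;
GO
```

Joining only on customer_key instead reports each sale under the attributes the customer had when the fact row was loaded ("as-was" reporting).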

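Star Schema vs. Data Vault

| Aspect | Star Schema | Data Vault |
|---|---|---|
| Design philosophy | Analysis-oriented, performance first | Business-oriented, agile and auditable |
| Structure | Fact tables + dimension tables | Hub + Link + Satellite |
| Primary key | Surrogate key (auto-increment integer) | Business key or hash key |
| History tracking | Optional (via SCD) | Built in (every record carries a load timestamp) |
| Query performance | Fast (few joins) | Slower (many joins) |
| Adaptability to business change | Weaker (may require remodeling) | Strong (insert-only) |
| Typical use case | Mature, stable business analytics | Rapidly changing business environments |
| When to use | Mature stage of the data warehouse | Build-out and middle stages of the data warehouse |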

A Classic Star Schema Pattern

┌──────────────────────────────────────────────────────────────────┐
│                    Sales Analysis Star Schema                    │
│                        (star_db database)                        │
│                                                                  │
│  ┌──────────────────────┐                                        │
│  │    FACT_SALES        │                                        │
│  │  date_key (FK)       │                                        │
│  │  customer_key (FK)   │                                        │
│  │  product_key (FK)    │                                        │
│  │  status_key (FK)     │                                        │
│  │  order_id (DD)       │                                        │
│  │  quantity            │                                        │
│  │  total_amount        │                                        │
│  │  gross_profit        │                                        │
│  └──────────┬───────────┘                                        │
│             │                                                    │
│     ┌───────┴───┬───────────┬───────────────┐                    │
│     ↓           ↓           ↓               ↓                    │
│ ┌────────┐ ┌─────────┐ ┌──────────┐ ┌────────────────┐           │
│ │DIM_DATE│ │DIM_CUST │ │DIM_PROD  │ │DIM_ORDER_STATUS│           │
│ │date_key│ │cust_key │ │prod_key  │ │status_key      │           │
│ │year    │ │name     │ │name      │ │status_code     │           │
│ │quarter │ │city     │ │category  │ │status_name     │           │
│ │month   │ │region   │ │brand     │ └────────────────┘           │
│ │week    │ │type     │ │price     │                              │
│ └────────┘ └─────────┘ └──────────┘                              │
└──────────────────────────────────────────────────────────────────┘

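Summary

| Concept | Description |
|---|---|
| Fact table | Stores business measures; sits at the center and joins to multiple dimension tables |
| Dimension table | Describes the perspectives for analysis (who / what / when / where) |
| Degenerate dimension (DD) | A dimension key with no dimension table of its own, stored directly in the fact table (for example, the order number) |
| Surrogate key | Integer primary key that replaces the business key and speeds up joins |
| SCD Type 2 | Slowly changing dimension; history is tracked through load_date / end_date |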
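Star schema design principles:

  • 📌 One fact table plus multiple dimension tables
  • 📌 Keep dimension tables "wide" (rich descriptive attributes)
  • 📌 Avoid many-to-many relationships where possible
  • 📌 Use degenerate dimensions to simplify queries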