Our advertising business needed per-app statistics across multiple dimensions: device counts, click-through rate, show rate, and related metrics, all with deduplication. My first thought was to use ClickHouse for the aggregation, because our platform's traffic is heavy and MySQL probably wouldn't hold up.
First, I created four tables.
#Click data table
CREATE TABLE raw_click
(
`Date` Date,
`Time` DateTime,
`Hour` Int8,
`AdvertiserID` UInt32 DEFAULT 0,
`AdsID` UInt32 DEFAULT 0,
`DeveloperID` UInt32 DEFAULT 0,
`WebID` UInt32 DEFAULT 0,
`FeeTypeID` UInt32 DEFAULT 0,
`AdvType` UInt8 DEFAULT 0,
`GroupID` UInt32 DEFAULT 0,
`PlatformID` UInt32 DEFAULT 0,
`PlatformNameID` UInt8 DEFAULT 0,
`MaterialId` UInt32 DEFAULT 0,
`DeviceID` Nullable(String) DEFAULT NULL,
`AppOs` UInt8 DEFAULT 1
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(Date)
ORDER BY Date
SETTINGS index_granularity = 8192
#Fill data table
CREATE TABLE raw_fill
(
`Date` Date,
`Time` DateTime,
`Hour` Int8,
`AdvertiserID` UInt32 DEFAULT 0,
`AdsID` UInt32 DEFAULT 0,
`DeveloperID` UInt32 DEFAULT 0,
`WebID` UInt32 DEFAULT 0,
`FeeTypeID` UInt32 DEFAULT 0,
`AdvType` UInt8 DEFAULT 0,
`GroupID` UInt32 DEFAULT 0,
`PlatformID` UInt32 DEFAULT 0,
`PlatformNameID` UInt8 DEFAULT 0,
`MaterialId` UInt32 DEFAULT 0,
`DeviceID` Nullable(String) DEFAULT NULL,
`AppOs` UInt8 DEFAULT 1
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(Date)
ORDER BY Date
SETTINGS index_granularity = 8192
#Request data table
CREATE TABLE raw_request
(
`Date` Date,
`Time` DateTime,
`Hour` Int8,
`AdvertiserID` UInt32 DEFAULT 0,
`AdsID` UInt32 DEFAULT 0,
`DeveloperID` UInt32 DEFAULT 0,
`WebID` UInt32 DEFAULT 0,
`FeeTypeID` UInt32 DEFAULT 0,
`AdvType` UInt8 DEFAULT 0,
`GroupID` UInt32 DEFAULT 0,
`PlatformID` UInt32 DEFAULT 0,
`PlatformNameID` UInt8 DEFAULT 0,
`MaterialId` UInt32 DEFAULT 0,
`DeviceID` Nullable(String) DEFAULT NULL,
`AppOs` UInt8 DEFAULT 1
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(Date)
ORDER BY Date
SETTINGS index_granularity = 8192
#Show (impression) data table
CREATE TABLE raw_show
(
`Date` Date,
`Time` DateTime,
`Hour` Int8,
`AdvertiserID` UInt32 DEFAULT 0,
`AdsID` UInt32 DEFAULT 0,
`DeveloperID` UInt32 DEFAULT 0,
`WebID` UInt32 DEFAULT 0,
`FeeTypeID` UInt32 DEFAULT 0,
`AdvType` UInt8 DEFAULT 0,
`GroupID` UInt32 DEFAULT 0,
`PlatformID` UInt32 DEFAULT 0,
`PlatformNameID` UInt8 DEFAULT 0,
`MaterialId` UInt32 DEFAULT 0,
`DeviceID` Nullable(String) DEFAULT NULL,
`AppOs` UInt8 DEFAULT 1
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(Date)
ORDER BY Date
SETTINGS index_granularity = 8192
When creating the tables I hesitated over two things. First, did I need to split the tables by month? I asked ChatGPT about it.
Its answer boiled down to "your hardware's limit is ClickHouse's limit", so I stopped worrying and started loading data in.
The second hesitation was whether to create just one table and distinguish the click, show, and fill events with a type field. After thinking it over, I decided that one table per event type was the better choice.
Every field in these tables is a dimension our business reports will need to filter on, so the schema was settled.
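For comparison, the single-table alternative I considered would have looked roughly like this. This is only a sketch: the `raw_event` name and the `EventType` enum are hypothetical, and the remaining dimension columns are elided.

```sql
-- Hypothetical single-table design: one row per event, with a discriminator column
CREATE TABLE raw_event
(
    `Date` Date,
    `Time` DateTime,
    `EventType` Enum8('request' = 1, 'fill' = 2, 'show' = 3, 'click' = 4),
    -- ...the same dimension columns as the raw_* tables above...
    `DeviceID` Nullable(String) DEFAULT NULL
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(Date)
ORDER BY (Date, EventType);
```

Per-event tables keep each table smaller and let each event type be archived or dropped independently; the trade-off is that cross-event reports need a UNION ALL.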
Next came getting the data inserted. Here I'll just share my insert script:
#!/usr/local/php/bin/php -q
<?php
declare(ticks=1);
const _TOUCHER_NAME_ = "ch_stat"; # name of this sync worker
// Load the development-environment config if present
include("int/clickhouse1.3.10/Clickhouse.php");
include("int/config.php");
$mq_name = $argv[1] ?? '';
if (empty($mq_name)) {
    exit("Invalid invocation!");
}
$table_name_arr = [
    'raw_show_mq' => 'raw_show',
    'raw_click_mq' => 'raw_click',
    'raw_fill_mq' => 'raw_fill',
    'raw_request_mq' => 'raw_request'
];
$table_name = $table_name_arr[$mq_name] ?? '';
if (empty($table_name)) {
    exit("Invalid invocation: unknown queue name!");
}
# Listen for termination signals
$handle = true;
pcntl_signal(SIGTERM, 'handleSignal');
pcntl_signal(SIGINT, 'handleSignal');
pcntl_signal(SIGQUIT, 'handleSignal');
# Connect to Redis
$redisconn = redis_conn();
$redisconn->select(9);
$clickhouse = new Clickhouse($ch_config, 'database_name'); // placeholder: the target database
while (true) {
    # Daily restart at 05:00. Comparing down to the exact second would almost
    # never match because of the sleep() calls below, so match on the minute.
    if (date("Hi") == '0500') {
        exit(_TOUCHER_NAME_ . ": I am gone away");
    }
    $start_time = microtime_float(); // record the start time
    try {
        $queueLen = $redisconn->lLen($mq_name);
    } catch (\Exception $e) {
        # Guard against Redis going down
        exit(_TOUCHER_NAME_ . ": redis gone away");
    }
    # Insert up to 1000 rows per batch for now
    $queue_count = 1000;
    $data = [];
    if ($queueLen < $queue_count) {
        # Not enough data yet; just drain whatever is there
        $queue_count = $queueLen;
        // msg2log(_TOUCHER_NAME_ . ": not enough data, waiting!");
        // sleep(3);
        // continue;
    }
    for ($i = 0; $i < $queue_count; $i++) {
        # Pop one item off the queue
        $json_data = $redisconn->rPop($mq_name);
        if (empty($json_data)) {
            # Can this ever be empty?
            continue;
        }
        # Collect the row for the batch insert
        $data[] = json_decode($json_data, true);
    }
    if (empty($data)) {
        msg2log(_TOUCHER_NAME_ . ": no data in the queue to consume right now!");
        sleep(5);
        continue;
    }
    # Batch insert
    try {
        $clickhouse->insert($table_name, $data);
    } catch (Exception $exception) {
        # Batch insert failed: push everything back onto the queue
        msg2log(_TOUCHER_NAME_ . ": batch insert failed, pushing the data back");
        foreach ($data as $v) {
            # If the data itself is malformed, this push-back can be commented out temporarily
            $redisconn->lPush($mq_name, json_encode($v));
        }
        # Reset the batch
        $data = [];
        # Check whether ClickHouse itself went down
        if (!$clickhouse->alive()) {
            exit("clickhouse connection error, exiting to reconnect!");
        }
    }
    $end_time = microtime_float();
    if (!$handle) {
        msg2log(_TOUCHER_NAME_ . ": exiting on signal! Using Time " . ($end_time - $start_time) . " Sec, Total touched: " . count($data));
        break;
    }
    msg2log(_TOUCHER_NAME_ . ": Using Time " . ($end_time - $start_time) . " Sec, Total touched: " . count($data));
    sleep(3);
}
function handleSignal($signal)
{
    global $handle;
    switch ($signal) {
        case SIGTERM:
        case SIGINT:
        case SIGQUIT:
            $handle = false;
            # exit;
            // handle other signals...
    }
}
?>
The script simply pulls data from the queue and inserts it into ClickHouse. On top of that it checks whether Redis or ClickHouse has disconnected, pushes the batch back onto the queue when an insert fails so no data is lost, and traps termination signals so that when the script is stopped it finishes processing the current batch before exiting, keeping data loss to a minimum.
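As a side note, since the queue payloads are already JSON strings, ClickHouse's JSONEachRow input format could take them almost untouched; whether the wrapper library above uses it internally is an assumption I haven't verified. A sketch:

```sql
-- Hypothetical batch insert: each line after FORMAT JSONEachRow is one queue payload.
-- Columns missing from the JSON fall back to the table's DEFAULT values.
INSERT INTO raw_click FORMAT JSONEachRow
{"Date":"2024-03-07","Time":"2024-03-07 12:00:00","Hour":12,"AdsID":101,"DeviceID":"abc123"}
{"Date":"2024-03-07","Time":"2024-03-07 12:00:01","Hour":12,"AdsID":102,"DeviceID":"def456"}
```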
Once the insert script was working and data started flowing in, it grew really fast; what follows reflects a little over two months of data. Our traffic volume meant a lot of rows, and although queries still worked fine, I noticed that each SQL run took around four to five seconds (take the following query as an example):
SELECT Date,
SUM(dau) AS dau,
SUM(request) AS request,
SUM(fill) AS fill,
SUM(show) AS show,
SUM(click) AS click
FROM (
SELECT Date, count(distinct DeviceID) AS dau, count(*) AS request, 0 AS fill, 0 AS show, 0 AS click
FROM raw_request
WHERE PlatformNameID > 0 AND Date BETWEEN '2024-03-07' AND '2024-03-13'
GROUP BY Date
UNION ALL
SELECT Date, 0 AS dau, 0 AS request, count(*) AS fill, 0 AS show, 0 AS click
FROM raw_fill
WHERE PlatformNameID > 0 AND Date BETWEEN '2024-03-07' AND '2024-03-13'
GROUP BY Date
UNION ALL
SELECT Date, 0 AS dau, 0 AS request, 0 AS fill, count(*) AS show, 0 AS click
FROM raw_show
WHERE PlatformNameID > 0 AND Date BETWEEN '2024-03-07' AND '2024-03-13'
GROUP BY Date
UNION ALL
SELECT Date, 0 AS dau, 0 AS request, 0 AS fill, 0 AS show, count(*) AS click
FROM raw_click
WHERE PlatformNameID > 0 AND Date BETWEEN '2024-03-07' AND '2024-03-13'
GROUP BY Date
) AS subquery
GROUP BY Date
ORDER BY Date DESC;
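One cheap optimization for this kind of query: ClickHouse ships approximate distinct-count functions (`uniq`, `uniqCombined`) that are typically much faster and lighter on memory than `count(DISTINCT ...)`, at the cost of a small estimation error. For example, the first branch could become:

```sql
-- Approximate DAU: uniq() returns an estimate, not an exact distinct count
SELECT Date,
       uniq(DeviceID) AS dau,
       count(*)       AS request
FROM raw_request
WHERE PlatformNameID > 0 AND Date BETWEEN '2024-03-07' AND '2024-03-13'
GROUP BY Date;
```

Whether the accuracy trade-off is acceptable depends on how the DAU figure is used downstream.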
Later I realized that the historical data was almost never used, and keeping it all around (and backing it up) made me worry about running out of disk, so I decided to retain only the most recent month. Since the data is partitioned by day, deletion also has to happen by day. Be sure to archive a copy before deleting. The key statement is:
ALTER TABLE table DROP PARTITION date
The partition value has the form 20240303. After dropping the old partitions, queries did get noticeably faster; further optimization can come step by step.
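Two related ClickHouse features worth knowing here, both sketches rather than what I actually ran: a MergeTree TTL can expire old rows automatically instead of a cron-driven DROP PARTITION, and FREEZE can snapshot a partition (as hard links under the `shadow/` directory) for archiving before you drop it:

```sql
-- Auto-expire rows once they are 30 days old (mirrors the one-month retention above)
ALTER TABLE raw_click MODIFY TTL Date + INTERVAL 30 DAY;

-- Snapshot one day's partition for archiving before dropping it
ALTER TABLE raw_click FREEZE PARTITION ID '20240303';
```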
The point of using ClickHouse here is to store as much information as possible, covering every query condition we might ever need. When a dimension was missed, we add a column, and adding a column is fast: on a table with hundreds of millions of rows, the statement below finished in under a second. That is because in ClickHouse's columnar storage, ADD COLUMN is essentially a metadata-only change: existing data parts are not rewritten, and the default value is filled in at read time until background merges materialize it.
ALTER TABLE table ADD COLUMN column_name UInt8 DEFAULT default_value;
That wraps up this personal experience share. Comments and discussion are welcome, and I hope it helps you.