ClickHouse Database Deployment and Python3 Stress-Testing Practice
1. ClickHouse Database Deployment
- Version: yandex/clickhouse-server:latest
- Deployment method: Docker
- docker-compose.yml contents:
```yml
version: "3"
services:
  clickhouse:
    image: yandex/clickhouse-server:latest
    container_name: clickhouse
    ports:
      - "8123:8123"
      - "9000:9000"
      - "9009:9009"
      - "9004:9004"
    volumes:
      - ./data/config:/var/lib/clickhouse
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "localhost:8123/ping"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M
```
- Table creation statement (note that the test scripts below refer to this table as `ck_table`); a Python sketch for verifying the deployment and running the DDL follows the statement:
```sql
CREATE TABLE test_table (
    id Int32,
    feild1 String,  feild2 String,  feild3 String,  feild4 String,  feild5 String,
    feild6 String,  feild7 String,  feild8 String,  feild9 String,  feild10 String,
    feild11 String, feild12 String, feild13 String, feild14 String, feild15 String,
    feild16 String, feild17 String, feild18 String, feild19 String, feild20 String
) ENGINE = MergeTree
ORDER BY id;
```
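Before running the stress tests below, it can help to confirm that the container answers on its HTTP interface and that the target table exists. The following is a minimal sketch, assuming the server is reachable at 127.0.0.1 with the default user and no password; it hits the same `/ping` endpoint the compose healthcheck uses, then creates the 21-column table under the name `ck_table` (the name the test scripts use) with an assumed `ORDER BY id` sorting key.

```python
# Minimal verification sketch. Assumptions: host 127.0.0.1, default user, no
# password, table name ck_table (matching the stress-test scripts), ORDER BY id.
from urllib.request import urlopen
from clickhouse_driver import Client

# 1. Same check the docker-compose healthcheck performs: HTTP GET /ping on 8123.
with urlopen("http://127.0.0.1:8123/ping", timeout=5) as resp:
    assert resp.read().strip() == b"Ok.", "ClickHouse HTTP interface is not healthy"

# 2. Create the 21-column test table over the native protocol (port 9000).
columns = ",\n    ".join("feild{} String".format(n) for n in range(1, 21))
ddl = """
CREATE TABLE IF NOT EXISTS ck_table (
    id Int32,
    {}
) ENGINE = MergeTree
ORDER BY id
""".format(columns)

client = Client(host="127.0.0.1")
client.execute(ddl)
print(client.execute("SHOW TABLES"))
```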
2. Python3 Insert Stress Test
- Key libraries: clickhouse_driver, concurrent.futures
- Code:
```python
import random
import time
from clickhouse_driver import Client
from concurrent.futures import ThreadPoolExecutor, as_completed

# Use several connections so that no single connection gets overwhelmed
clients = [
    Client(host='ip'),
    Client(host='ip'),
    Client(host='ip'),
    Client(host='ip')
]

# Use batch inserts: testing showed single-row concurrent inserts perform poorly,
# managing only 2-5 INSERT statements per second
def task(i):
    sql = ("INSERT INTO ck_table (id, feild1, feild2, feild3, feild4, feild5, feild6, "
           "feild7, feild8, feild9, feild10, feild11, feild12, feild13, feild14, feild15, "
           "feild16, feild17, feild18, feild19, feild20) VALUES")
    values = []
    for n in range(1000):  # 1000 rows per batch
        row = [random.randint(1, 10000000),
               "feild1-" + str(random.randint(1, 10000000))]
        row += ["feild%d-%d" % (k, n) for k in range(2, 21)]
        values.append(tuple(row))
    # pick one of the connections at random
    clid = random.randint(0, len(clients) - 1)
    clients[clid].execute(sql, values)
    return "connection", clid, "batch", i, "inserted"

if __name__ == '__main__':
    print("starting insert stress test")
    executor = ThreadPoolExecutor(max_workers=2)
    # ress = []
    start_time = time.perf_counter()
    for j in range(4000000):  # total number of batches to submit
        res = executor.submit(task, j)
        # ress.append(res)
    # for f in as_completed(ress):
    #     print("result", f.result())
    print("elapsed", time.perf_counter() - start_time, "s")
```
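One caveat about the loop above: it only submits tasks and never waits on the returned futures, so the printed elapsed time measures how fast batches were queued, not how fast they were inserted. Below is a minimal sketch, assuming `task` from the script above is already defined, of driving the same pool for a small number of batches while waiting for completion and reporting an actual rows-per-second figure; the batch count of 100 is an arbitrary illustrative value.

```python
# Hedged sketch: wait for every submitted batch before stopping the clock,
# then derive a rows-per-second figure. Assumes `task` from the script above
# (1000 rows per call) and its `clients` list are already defined.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

ROWS_PER_BATCH = 1000
BATCHES = 100  # small illustrative run instead of 4,000,000 submissions

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(task, j) for j in range(BATCHES)]
    for f in as_completed(futures):
        f.result()  # re-raises any insert error instead of silently dropping it
elapsed = time.perf_counter() - start
print("inserted", BATCHES * ROWS_PER_BATCH, "rows in", round(elapsed, 2), "s",
      "->", round(BATCHES * ROWS_PER_BATCH / elapsed), "rows/s")
```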
3. Python3 Query Test
- Key libraries: clickhouse_driver, concurrent.futures
- Code:
```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from clickhouse_driver import Client

client = Client(host='10.10.16.110')

query_sql = "select * from ck_table where feild2 = 'feild2-1009'"

def new_task(i):
    count_sql = "select count(*) from ck_table"
    time.sleep(1)  # pace the queries at roughly one per second
    return "task", i, client.execute(count_sql)

if __name__ == '__main__':
    print("starting query test")
    executor = ThreadPoolExecutor(max_workers=1)
    ress = []
    start_time = time.perf_counter()
    for j in range(1000):
        ress.append(executor.submit(new_task, j))
    for f in as_completed(ress):
        print("result", f.result())
    print("elapsed", time.perf_counter() - start_time, "s")
```
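Note that `query_sql` above is defined but never executed; the loop only runs the `count(*)` query once per second on a single worker. The sketch below is a hedged variant that actually exercises the filtered query from several threads and reports average latency and achieved QPS, which is one way to reproduce the "keep QPS under 100" observation in the conclusions. The host, worker count, and query count are assumptions to adjust for your environment.

```python
# Hedged sketch: measure latency and achieved QPS for the filtered query.
# Assumptions: server at 10.10.16.110, ck_table already populated, 4 workers.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from clickhouse_driver import Client

HOST = '10.10.16.110'
WORKERS = 4       # concurrent query threads (assumption)
QUERIES = 200     # total number of queries to issue (assumption)
QUERY = "select * from ck_table where feild2 = 'feild2-1009'"

def timed_query(_):
    # One Client per call is the simplest way to avoid sharing a connection
    # across threads (clickhouse_driver connections are not thread-safe);
    # a per-thread client would be faster but needs more plumbing.
    c = Client(host=HOST)
    t0 = time.perf_counter()
    rows = c.execute(QUERY)
    return time.perf_counter() - t0, len(rows)

if __name__ == '__main__':
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=WORKERS) as executor:
        futures = [executor.submit(timed_query, i) for i in range(QUERIES)]
        latencies = [f.result()[0] for f in as_completed(futures)]
    elapsed = time.perf_counter() - start
    print("avg latency", round(sum(latencies) / len(latencies), 3), "s,",
          "achieved QPS", round(QUERIES / elapsed, 1))
```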
4. Test Conclusions
ClickHouse insert/query test on the 21-column table: with the data volume under 2 million rows, CPU usage stayed above 100%, peaking at 133.6% and averaging roughly 110%.
1. Frequent single-row inserts are not supported (roughly 1-2 per second at best); pushing harder causes disconnections and other errors, so batch inserts are the only practical option. (The script ran without errors using 2 worker threads inserting 1,000 rows per batch; more workers than that produced disconnections and similar errors.)
2. High-frequency queries are not supported either; the official recommendation is to keep QPS under 100, otherwise CPU usage spikes and server load climbs.
3. Query performance:
   - 1-condition WHERE query (Memory engine): 600K rows, 0.33 s
   - 5-condition WHERE query (Memory): 800K rows, 0.57 s
   - 5-condition WHERE query (Memory): 1M rows, 0.54 s
   - 5-condition WHERE query (Memory): 1.12M rows, 0.56 s
   - 5-condition WHERE query (Memory): 2M rows, 0.565 s
   - 5-condition WHERE query (Memory): 5M rows, 1.2 s (with inserts stopped)
   - 5-condition WHERE query (Memory): 5.6M rows, 1.97 s (with inserts stopped)
   - 5-condition WHERE query (TinyLog): 70M rows, 1 min 47 s
   - 2-condition WHERE query (TinyLog): 104.6M rows, 89 s
   - 5-condition WHERE query (TinyLog): 104.6M rows, 84 s
   - 10-condition WHERE query (TinyLog): 104.6M rows, 87 s
Note: beyond ~4.5 million rows, only one of the insert thread and the query thread can run at a time; slow queries consume a great deal of memory and 16 GB of RAM is no longer enough. A 5-condition WHERE query still completes, taking 1-2 s.
(1) Server state at ~5 million rows (CPU averaged around 320%; of the 16 GB of RAM only 500-800 MB stayed free; after stopping writes/queries, CPU returned to normal and free memory settled around 800 MB):

```
total   used   free   shared   buff/cache   available
15G     5.9G   519M   9.2M     9.1G         9.2G

%CPU    %MEM
429.5   26.0
```
(2) Server state at ~100 million rows (the 1 TB disk was 38% used in total, of which this dataset accounts for an estimated 6%):

```
total   used   free   shared   buff/cache   available
15G     2.7G   181M   9.2M     12G          12G

%CPU    %MEM
103.7   3.6
```
Summary:
- 1. Frequent concurrent single-row inserts are not supported; they cause errors and disconnections and can therefore lose data.
- 2. High-concurrency queries are not supported; the official recommendation is QPS <= 100, beyond which server load rises and CPU/memory consumption becomes excessive.
- 3. Hardware requirements are high: for data at the hundred-million-row scale, 16+ CPU cores and 64 GB+ of RAM are generally recommended.
- 4. Its strengths are fast queries and efficient batch inserts, so low-frequency, large-batch inserts are recommended (a sketch of this pattern follows below).
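Expanding on point 4, the sketch below shows one way to implement the recommended "low-frequency, large-batch" insert pattern: a single connection, one large batch per INSERT statement, and a pause between statements so the insert rate stays around one per second. This is a minimal sketch, assuming a placeholder host, a 10,000-row batch size, and a 1-second interval; none of these values were measured as optimal in the tests above.

```python
# Hedged sketch of the recommended low-frequency, large-batch insert pattern.
# Assumptions: host placeholder 'ip', batch size 10000, ~1 INSERT per second.
import random
import time
from clickhouse_driver import Client

HOST = 'ip'            # replace with the ClickHouse host
BATCH_ROWS = 10000     # large batches amortise per-INSERT overhead (assumption)
MIN_INTERVAL = 1.0     # keep the insert rate at roughly one statement per second

COLUMNS = "id, " + ", ".join("feild%d" % n for n in range(1, 21))
SQL = "INSERT INTO ck_table (%s) VALUES" % COLUMNS

def make_batch(size):
    # one random id plus the 20 string columns, mirroring the test schema
    return [tuple([random.randint(1, 10000000)] +
                  ["feild%d-%d" % (n, i) for n in range(1, 21)])
            for i in range(size)]

client = Client(host=HOST)
for batch_no in range(10):          # illustrative run of 10 batches
    t0 = time.perf_counter()
    client.execute(SQL, make_batch(BATCH_ROWS))
    spent = time.perf_counter() - t0
    print("batch", batch_no, "of", BATCH_ROWS, "rows took", round(spent, 2), "s")
    if spent < MIN_INTERVAL:        # throttle so inserts stay low-frequency
        time.sleep(MIN_INTERVAL - spent)
```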