In a previous test with some off-the-shelf samples, the compression ratio reached an astonishing 1000:60. My guess was that the log contents were highly similar and therefore compressed extremely well, so this time I prepared some highly random samples and ran the test again.

The test uses faker, a library built specifically for generating fake data; it can produce sentences made of random words, which serve as stand-in log content.
```python
import json, time, requests, random
import uuid, base64
from faker import Faker

fake = Faker()
VLOGS_URL = "http://192.168.1.11:9428/insert/jsonline"

def fast_uuid():
    # Short random id: a base64-encoded UUID truncated to 10 characters
    return base64.urlsafe_b64encode(uuid.uuid4().bytes)[:10].decode()

# Pre-generated id pools (the loop below draws fresh ids instead of sampling these)
users = [fast_uuid() for _ in range(5000)]
aitools = ["aitools_%s" % fast_uuid() for _ in range(1000)]

def build_logs(count: int):
    logs = []
    base_time = int(time.time() * 1000)
    for i in range(count):
        # A long sentence of random words simulates highly random log content
        msg = fake.sentence(nb_words=120, variable_nb_words=True)
        logs.append({
            "_msg": f"user login success {msg} user_id={i}",
            "_time": base_time + i,
            "host": "web-01",
            "app": "auth-service",
            "level": random.choice(["INFO", "WARN", "ERROR"]),
            "userid": fast_uuid(),
            "aitools": "aitools_%s" % fast_uuid()
        })
    contents = "\n".join(json.dumps(l, ensure_ascii=False) for l in logs)
    print("avg doc len = %s" % (len(contents) / count))
    return contents

def main():
    start = time.time()
    total = 0
    while True:
        batch = build_logs(1000)  # 1000 log lines per batch
        total += 1000
        params = {
            "_msg_field": "_msg",         # field holding the raw log message
            "_time_field": "_time",       # field holding the timestamp
            "_stream_fields": "host,app"  # fields used to compute the stream_id
        }
        headers = {"Content-Type": "application/json"}
        r = requests.post(VLOGS_URL, data=batch, params=params, headers=headers)
        print("HTTP", r.status_code, r.text)
        print("total = %s, insert speed=%s" % (total, total / (time.time() - start)))
        time.sleep(2)

if __name__ == '__main__':
    main()
```
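
To sanity-check that the batches actually landed, one option (not part of the script above) is to count the ingested entries through the LogsQL query endpoint. The snippet below is a minimal sketch assuming the default VictoriaLogs HTTP API on the same host:

```python
import requests

# Minimal sketch: count entries ingested over the last day via the LogsQL stats pipe
r = requests.post(
    "http://192.168.1.11:9428/select/logsql/query",
    data={"query": "_time:1d | stats count() as total"},
)
print(r.text)
```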
The test came out at a compression ratio of about 3:1, far worse than the result above, but I'm quite satisfied with that level given how extreme this scenario is.
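
For reference, a ratio like this can be estimated by summing the raw bytes of every batch the script posts and comparing that with the on-disk size of the VictoriaLogs data directory. The sketch below assumes the data lives under `/victoria-logs-data` (a hypothetical path; it must match the `-storageDataPath` VictoriaLogs was started with):

```python
import os

DATA_DIR = "/victoria-logs-data"  # assumption: adjust to your -storageDataPath

def dir_size(path: str) -> int:
    # Recursively sum file sizes under the data directory
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def compression_ratio(raw_bytes_sent: int) -> float:
    # Ratio of raw ingested bytes to bytes actually stored on disk
    return raw_bytes_sent / dir_size(DATA_DIR)
```

In the script above, `raw_bytes_sent` would simply be the running total of `len(batch.encode("utf-8"))` across all posted batches.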