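1. Background
1.1 Core requirements for IM message storage
Message storage in an IM system needs the following key properties:
- Massive data volume and highly concurrent writes: message counts can reach the trillions, so the store must sustain continuous high-throughput writes
- Low latency: P99 latency for sending/receiving messages and for querying history must stay in the millisecond range
- Message roaming: messages are organized by conversation (Timeline) and can be paged efficiently in time order
- High scalability and availability: linear scale-out and automatic failover, with no sharding logic in the business layer
- Controllable storage cost: high compression ratio, with TTL-based automatic cleanup of expired data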
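1.2 Comparison of mainstream storage options
| Option | Strengths | Weaknesses |
|---|---|---|
| MySQL | ACID, easy to use | Poor write performance on large B+Tree tables; sharding across databases and tables is complex |
| MongoDB | Document model, automatic sharding | Memory becomes the bottleneck at large data volumes and latency spikes |
| Cassandra | Linear scaling, decentralized | Severe read amplification; Java GC causes latency spikes |
| ScyllaDB | High-performance C++ implementation, no GC, Cassandra-compatible | Relatively new |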
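1.3 Why ScyllaDB
- Shard-per-Core architecture: each core owns an independent shard with no lock contention, so performance scales linearly with the number of cores
- No GC pauses: the C++ implementation avoids the latency spikes caused by Java GC (after migrating, Discord reported P99 dropping from 40-125ms to 15ms)
- Better read performance than Cassandra: the LSM-tree implementation is more efficient and read amplification is mitigated
- Ecosystem compatibility: compatible with Cassandra CQL and drivers, so migration cost is low
- Operations-friendly: compression is more efficient, and repair has little impact on business traffic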
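2. ScyllaDB usage notes
2.1 Server sizing & high availability
The officially recommended minimum configuration is 4C/16G.
High availability requires at least 3 nodes.
Recommended production setup: 3 servers with 4C/16G/500GB SSD each (2C/8G is acceptable for testing or small-scale production).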
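2.2 Keyspace
A keyspace is comparable to a database: Keyspace = a logical container for a set of "tables + replication strategy + distribution rules".
```sql
CREATE KEYSPACE IF NOT EXISTS im_offline
WITH replication = {
    'class': 'NetworkTopologyStrategy', -- SimpleStrategy: single node / test; NetworkTopologyStrategy: production, multi-DC / multi-node
    'datacenter1': 3                    -- number of replicas in this datacenter
};
```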
| Nodes | Recommended RF |
|---|---|
| 1 | RF=1 |
| 3 | RF=3 (minimum for production high availability) |
| 5+ | RF=3 or RF=5 (depending on cost) |
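The table does not state which consistency level the deployment uses. Assuming QUORUM reads and writes (an assumption, not something the design above specifies), the following minimal sketch shows why RF=3 is the smallest setting that survives a replica failure. `ReplicaMath` is a hypothetical helper name.

```java
/** Replica arithmetic for a replication factor (RF), assuming QUORUM consistency. */
final class ReplicaMath {
    private ReplicaMath() {}

    /** Replicas that must acknowledge a QUORUM request: floor(RF / 2) + 1. */
    static int quorum(int rf) {
        if (rf < 1) {
            throw new IllegalArgumentException("rf must be >= 1");
        }
        return rf / 2 + 1;
    }

    /** Replicas that may be down while QUORUM requests can still succeed. */
    static int tolerableFailures(int rf) {
        return rf - quorum(rf);
    }

    public static void main(String[] args) {
        for (int rf : new int[] {1, 3, 5}) {
            System.out.printf("RF=%d quorum=%d tolerable failures=%d%n",
                    rf, quorum(rf), tolerableFailures(rf));
        }
    }
}
```

Under this assumption RF=1 tolerates no failure, RF=3 tolerates one and RF=5 tolerates two, which is consistent with the recommendations in the table.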
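2.3 Key
ScyllaDB table design follows the Query-Driven Design principle: decide on the query patterns first, then design the data model (partition key / clustering key). Full-table scans are not efficient, so queries that do not filter on the partition key should be avoided on large data sets.
The hard constraint in Scylla/Cassandra is that a query must hit the partition key; otherwise it degrades into a scan or a secondary-index lookup (unreliable and hard to control).
Partition Key: determines which node and which partition the data is stored in.
Clustering Key: determines the sort order inside a partition.
PRIMARY KEY: a composite concept, PRIMARY KEY = Partition Key + Clustering Key. For example, in PRIMARY KEY ((app_id, user_id, day_bucket, box_type), data_time, msg_id) the partition key is (app_id, user_id, day_bucket, box_type). ScyllaDB computes hash(app_id + user_id + day_bucket + box_type) -> token, and then maps token -> node.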
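The hash -> token -> node chain can be illustrated with a simplified token ring. This is only a sketch: CRC32 stands in for the real partitioner (ScyllaDB's default is Murmur3-based), each node owns a single token, and the class name `TokenRingSketch` and the node names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Simplified token ring: one token per node, CRC32 as a stand-in hash. */
final class TokenRingSketch {
    // token -> node that owns the range ending at this token
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String node, long token) {
        ring.put(token, node);
    }

    /** Stand-in for the partitioner: hashes all four partition key columns together. */
    static long token(long appId, String userId, int dayBucket, byte boxType) {
        String key = appId + ":" + userId + ":" + dayBucket + ":" + boxType;
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue(); // value in [0, 2^32)
    }

    /** First node whose token is >= the key's token, wrapping around the ring. */
    String ownerOf(long token) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(token);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }
}
```

Because day_bucket and box_type are part of the hashed key, the same user's inbox for two different days will usually land on different tokens, which is what spreads one user's data across the cluster.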

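2.4 Compaction
The storage engine of ScyllaDB / Cassandra is conceptually very close to LevelDB (an LSM tree), so ScyllaDB can be thought of as a distributed LevelDB. NoSQL databases built on this design share one trait: performance can drop sharply during large-scale compaction, so the choice of compaction strategy matters a great deal.
```text
WAL
MemTable (in memory)
   ↓ flush
SSTable1
SSTable2
SSTable3
...

Compaction:
SSTable1
SSTable2
SSTable3
   ↓ merge
SSTable_new
```
As shown above, compaction in ScyllaDB is a background process that automatically merges SSTables (think of them as sorted KV files) while purging TTL-expired data, re-sorting data and reducing read amplification. One reason ScyllaDB outperforms Cassandra is that its compaction has been specifically optimized, for example: dynamic control of compaction IO to avoid saturating the disk, compaction tasks that avoid lock contention, and streaming compaction that merges while processing. The supported CompactionStrategy options are: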
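| CompactionStrategy | Abbreviation | Core idea | Pros | Cons | Typical use | Suitable for IM? |
|---|---|---|---|---|---|---|
| SizeTieredCompactionStrategy | STCS | Group SSTables by size, then merge | High write throughput; default strategy; general purpose | Higher read amplification; mediocre with TTL/tombstone workloads | General KV, ordinary business tables | Average |
| LeveledCompactionStrategy | LCS | Organize SSTables into levels (L0/L1/L2...) and bound the size of each level | Very strong read performance; stable query latency; low read amplification | High write amplification; heavy compaction CPU/IO load | User profiles, configuration tables, tables with random-read hotspots | Not recommended |
| TimeWindowCompactionStrategy | TWCS | Organize SSTables by time window and compact only within a window | Very TTL-friendly; strong time-series write performance; few tombstones (delete markers) | Not suited to frequent updates of old data; degrades with out-of-order timestamps | IM, logs, IoT, metrics, event streams | Strongly recommended |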
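Because IM message storage revolves around day_bucket, msgTime and TTL, it is a textbook time-series model, so choosing TWCS will:
- reduce tombstones
- reduce compaction IO
- reduce read amplification
- improve write stability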
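To make the time-window idea concrete, here is a minimal sketch of how a write timestamp maps to a one-day window and when a whole window has outlived a 7-day TTL. It is a simplification under stated assumptions: windows are aligned to UTC days, gc_grace_seconds is ignored, and `TimeWindowSketch` is a hypothetical name, not part of ScyllaDB.

```java
import java.util.concurrent.TimeUnit;

/** Simplified model of time-window bucketing with a fixed TTL. */
final class TimeWindowSketch {
    static final long WINDOW_MS = TimeUnit.DAYS.toMillis(1);      // window size: 1 day
    static final long TTL_MS = TimeUnit.SECONDS.toMillis(604800); // TTL: 7 days

    private TimeWindowSketch() {}

    /** Start (epoch ms, UTC-aligned) of the window containing the write timestamp. */
    static long windowStart(long writeTimeMs) {
        return Math.floorDiv(writeTimeMs, WINDOW_MS) * WINDOW_MS;
    }

    /** True once even the newest row that could be in the window has passed its TTL. */
    static boolean windowFullyExpired(long windowStartMs, long nowMs) {
        long newestPossibleWrite = windowStartMs + WINDOW_MS - 1;
        return nowMs >= newestPossibleWrite + TTL_MS;
    }
}
```

Since all rows of a window expire at roughly the same time, a fully expired window can be discarded as a unit rather than being rewritten row by row, which is where the tombstone and compaction IO savings listed above come from.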
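2.5 Materialized View
What
A materialized view (MV) can be understood as another table that is maintained automatically (an automatically synchronized index table), not a "runtime dynamic SQL view" as in MySQL. It really does write to disk, generate SSTables, take part in compaction and occupy space.
Why
Scylla/Cassandra does not support arbitrary ORDER BY, nor sorting across partitions. In IM, offline messages are read in forward order and historical messages in reverse order, so the history table usually gets an MV named im_history_messages_time_desc, which can be seen as a mirror table stored again in DESC order.
How
Because of concerns about data consistency in Cassandra's MVs, many IM systems built on Apache Cassandra / ScyllaDB end up abandoning Materialized Views and switching to "application-layer dual writes to two tables", because:
- MV repair is expensive
- MVs have a history of consistency pitfalls
- debugging is hard
- backfill is complex
- compaction pressure is higher on large clusters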
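The dual-write alternative can be sketched with two in-memory sorted maps standing in for the two tables. Everything here is illustrative: `DualWriteSketch` is a hypothetical name, the maps replace real CQL tables, and rows are keyed by data_time only (a real table would also use msg_id as a tie-breaker).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

/** In-memory stand-in for "application-layer dual writes to two tables". */
final class DualWriteSketch {
    // Same rows, opposite clustering order.
    private final TreeMap<Long, String> historyAsc = new TreeMap<>();
    private final TreeMap<Long, String> historyDesc = new TreeMap<>(Comparator.reverseOrder());

    /** The application writes both tables itself instead of relying on an MV. */
    void write(long dataTime, String msgId) {
        historyAsc.put(dataTime, msgId);  // would be an INSERT into the ASC table
        historyDesc.put(dataTime, msgId); // would be an INSERT into the DESC table
    }

    List<String> oldestFirst(int limit) {
        return firstN(historyAsc, limit);
    }

    List<String> newestFirst(int limit) {
        return firstN(historyDesc, limit);
    }

    private static List<String> firstN(TreeMap<Long, String> table, int limit) {
        List<String> out = new ArrayList<>();
        for (String msgId : table.values()) {
            if (out.size() >= limit) {
                break;
            }
            out.add(msgId);
        }
        return out;
    }
}
```

The trade-off is that the two writes are not atomic, so the application has to decide how to handle the case where one of them fails; that retry or reconciliation logic is outside the scope of this sketch.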
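2. Message storage design
Common principles:
1. WAL is enabled on every relevant keyspace (durable_writes = true). Disabling it makes writes faster, but data may be lost if a node crashes.
2. Message storage is split into offline message storage and historical message storage. Offline storage: write-heavy and read-heavy, write fan-out, default TTL of 7 days. Historical storage: light writes and reads, no write fan-out (a group chat stores only the single upstream message), default TTL of 180 days. The stress test targets offline message storage.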
2.1 Offline messages
2.1.1 im_offline_messages
Read/write profile: write-heavy and read-heavy, write fan-out, default TTL of 7 days.
```sql
CREATE KEYSPACE IF NOT EXISTS im_offline
WITH replication = {
    'class': 'SimpleStrategy',  -- production: NetworkTopologyStrategy with 'datacenter1': 3
    'replication_factor': 1     -- SimpleStrategy takes replication_factor, not a datacenter name
}
AND durable_writes = true;

USE im_offline;

CREATE TABLE IF NOT EXISTS im_offline_messages (
    -- partition key: user + day + box
    app_id            bigint,
    user_id           text,
    day_bucket        int,      -- yyyyMMdd
    box_type          tinyint,  -- 0 = InBox, 1 = SendBox (on delivery, direction = box_type)
    -- clustering key: time + message ID (avoids collisions within the same millisecond)
    data_time         bigint,   -- milliseconds
    msg_id            text,
    -- message profile columns
    from_id           text,     -- sender userId
    target_id         text,     -- conversation peer: single chat = peer userId; group = groupId
    state_flag        bigint,   -- state flags (bitmask, tested with bitwise AND)
    conversation_type tinyint,  -- conversation type: 1 = single chat, 2 = group chat
    msg_type          text,     -- maps to the class name, e.g. CallingMsg
    content           blob,     -- message body
    PRIMARY KEY ((app_id, user_id, day_bucket, box_type), data_time, msg_id)
) WITH CLUSTERING ORDER BY (data_time ASC, msg_id ASC)
  AND compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': 1
  }
  AND default_time_to_live = 604800; -- default 7 days; override per app with USING TTL on INSERT
```
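Since day_bucket is part of the partition key, writers and readers must derive it from data_time in exactly the same way. The sketch below shows one way to do that; the choice of time zone is an assumption (the design does not say which zone defines a "day"), and `DayBucket` is a hypothetical helper name.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;

/** Helpers for the yyyyMMdd day_bucket column. */
final class DayBucket {
    private DayBucket() {}

    /** Converts epoch milliseconds to a yyyyMMdd int in the given time zone. */
    static int of(long epochMillis, ZoneId zone) {
        return toInt(Instant.ofEpochMilli(epochMillis).atZone(zone).toLocalDate());
    }

    /** The bucket after the given one, for reads that walk forward across days. */
    static int next(int dayBucket) {
        LocalDate date = LocalDate.of(dayBucket / 10000, dayBucket / 100 % 100, dayBucket % 100);
        return toInt(date.plusDays(1));
    }

    private static int toInt(LocalDate date) {
        return date.getYear() * 10000 + date.getMonthValue() * 100 + date.getDayOfMonth();
    }

    public static void main(String[] args) {
        // 1716789012345 ms is 2024-05-27T05:50:12.345Z
        System.out.println(DayBucket.of(1716789012345L, ZoneOffset.UTC));
    }
}
```

With UTC as the zone, the timestamp 1716789012345 maps to bucket 20240527. Whatever zone is chosen, it should be fixed in one place so that a message is never written to one bucket and looked up in another.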
2.1.2 Writes
1. Recipient's inbox
```sql
INSERT INTO im_offline.im_offline_messages (
    app_id, user_id, day_bucket, box_type,
    data_time, msg_id,
    from_id, target_id, state_flag, conversation_type, msg_type, content
) VALUES (
    10001, 'user_B', 20240527, 0,
    1716789012345, 'msg-uuid-001',
    'user_A', 'user_A', 16, 1, 'TxtMsg', 0x48656c6c6f
)
USING TTL 604800;
```
2. Sender's outbox
```sql
INSERT INTO im_offline.im_offline_messages (
    app_id, user_id, day_bucket, box_type,
    data_time, msg_id,
    from_id, target_id, state_flag, conversation_type, msg_type, content
) VALUES (
    10001, 'user_A', 20240527, 1,
    1716789012345, 'msg-uuid-001',
    'user_A', 'user_B', 16, 1, 'TxtMsg', 0x48656c6c6f
)
USING TTL 604800;
```
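2.1.3 Reads
Handling principle:
A batch pull usually fetches inbox and outbox together: the N messages (for example 100) after a given point in time. A single CQL statement cannot "pull inbox + outbox, across days, with LIMIT 100", because box_type and day_bucket are part of the partition key. The usual approach is for the application layer to run two queries (InBox/SendBox), in parallel or serially, each iterating across days, then merge the results by data_time and keep the first 100.
```sql
-- Inbox: pull at most 100 messages after a given time (single day)
SELECT data_time, msg_id, from_id, target_id,
       state_flag, conversation_type, msg_type, content
FROM im_offline.im_offline_messages
WHERE app_id = ?
  AND user_id = ?
  AND day_bucket = ?
  AND box_type = 0
  AND data_time > ?
LIMIT ?;
```
```sql
-- Outbox: pull at most 100 messages after a given time (single day)
SELECT data_time, msg_id, from_id, target_id,
       state_flag, conversation_type, msg_type, content
FROM im_offline.im_offline_messages
WHERE app_id = ?
  AND user_id = ?
  AND day_bucket = ?
  AND box_type = 1
  AND data_time > ?
LIMIT ?;
```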
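The application-side merge described in this section can be sketched as follows. This is an illustration under assumptions, not the project's implementation: `DayQuery` is a hypothetical abstraction over the single-day, single-box SELECT, `Row` carries only the fields needed for ordering, and the caller supplies the list of day buckets to walk in ascending order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Pulls inbox and outbox across days and merges them by (data_time, msg_id). */
final class OfflinePullSketch {
    /** Minimal row view: only the fields needed for ordering. */
    static final class Row {
        final long dataTime;
        final String msgId;
        final int boxType;

        Row(long dataTime, String msgId, int boxType) {
            this.dataTime = dataTime;
            this.msgId = msgId;
            this.boxType = boxType;
        }
    }

    /** Hypothetical single-partition query: one box, one day, data_time > afterTime, ascending. */
    interface DayQuery {
        List<Row> fetch(int boxType, int dayBucket, long afterTime, int limit);
    }

    static final Comparator<Row> ORDER =
            Comparator.<Row>comparingLong(r -> r.dataTime).thenComparing(r -> r.msgId);

    private OfflinePullSketch() {}

    /** Walks the day buckets in order for one box until `limit` rows are collected. */
    static List<Row> pullBox(DayQuery query, int boxType, List<Integer> days, long afterTime, int limit) {
        List<Row> out = new ArrayList<>();
        for (int day : days) {
            if (out.size() >= limit) {
                break;
            }
            out.addAll(query.fetch(boxType, day, afterTime, limit - out.size()));
        }
        return out;
    }

    /** Pulls both boxes, merges by (data_time, msg_id) and keeps the first `limit` rows. */
    static List<Row> pull(DayQuery query, List<Integer> days, long afterTime, int limit) {
        List<Row> merged = new ArrayList<>(pullBox(query, 0, days, afterTime, limit));
        merged.addAll(pullBox(query, 1, days, afterTime, limit));
        merged.sort(ORDER);
        return new ArrayList<>(merged.subList(0, Math.min(limit, merged.size())));
    }
}
```

Fetching up to `limit` rows from each box before merging is enough for correctness: the first `limit` rows of the combined timeline can only come from the first `limit` rows of each box. The two `pullBox` calls are independent, so they could also run in parallel.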
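3. ScyllaDB stress test
3.1 Test method
1. Build 1,000,000 user IDs. For both reads and writes, the message FromId/ToId pair is derived with a ring rule, for example user1/user2, user2/user3 ... last user/first user. This lets concurrent reads and writes compute From/To cheaply: fromId = userIdList.get(currentSampleIndex % userIdList.size()), targetId = userIdList.get((currentSampleIndex + 1) % userIdList.size()).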
```java
// Project-internal and test-framework types (BaseAppSampler, AppSamplerSetting, SamplerType,
// SamplerUtils, SampleResult, PocException, UUIDUtil, CODE_* and SYCLLA_DB_SAMPLER) are
// imported from the surrounding project and omitted here.
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
import com.datastax.oss.driver.api.core.cql.AsyncResultSet;
import com.datastax.oss.driver.api.core.cql.BoundStatement;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.CountDownLatch;

@SamplerType(name = SYCLLA_DB_SAMPLER)
public class SycllaDBSampler extends BaseAppSampler {

    private static final Logger LOGGER = LogManager.getLogger(SycllaDBSampler.class);

    private List<String> userIdList = new ArrayList<>();
    private CqlSession cqlSession;
    private int cmdType;      // 0 = write test, 1 = read test
    private boolean isAsync;  // true = executeAsync, false = execute
    private PreparedStatement insertStmt;
    private PreparedStatement selectStmt;

    public SycllaDBSampler(AppSamplerSetting config) {
        super(config);
    }

    /** Opens the session and prepares both statements once, before sampling starts. */
    @Override
    public void sampleStarted() {
        try {
            this.init();
            this.insertStmt = this.cqlSession.prepare(
                    "INSERT INTO im_offline_messages (" +
                            "app_id, user_id, day_bucket, box_type, " +
                            "data_time, msg_id, from_id, target_id, " +
                            "state_flag, conversation_type, msg_type, content) " +
                            "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)");
            this.selectStmt = this.cqlSession.prepare(
                    "SELECT * FROM im_offline_messages " +
                            "WHERE app_id=? AND user_id=? AND day_bucket=? AND box_type=? LIMIT 10"
            );
            LOGGER.info("SycllaDBSampler testStarted() done.");
        } catch (Exception ex) {
            LOGGER.error("SycllaDBSampler.sampleStarted() error!", ex);
        }
    }

    /** Runs one write or read sample, depending on cmdType. */
    @Override
    public SampleResult sample() {
        SampleResult sampleResult = new SampleResult();
        try {
            int currentSampleIndex = this.sampleIndex.getAndIncrement();
            switch (this.cmdType) {
                case 0:
                    this.writeStressTest(sampleResult, currentSampleIndex);
                    break;
                case 1:
                    this.readStressTest(sampleResult, currentSampleIndex);
                    break;
                default:
                    throw new PocException("Invalid cmdType !");
            }
        } catch (Exception ex) {
            LOGGER.error("SycllaDBSampler.sample() error!", ex);
            if (ex instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            return SamplerUtils.buildSampleResult(this.config.getSampleLabel(),
                    String.valueOf(CODE_SAMPLE_ERROR),
                    ex.getMessage(),
                    false);
        }
        return sampleResult;
    }

    @Override
    public void sampleEnded() {
        try {
            if (Objects.nonNull(this.cqlSession)) {
                this.cqlSession.close();
            }
            LOGGER.info("SycllaDBSampler sampleEnded() done.");
        } catch (Exception ex) {
            LOGGER.error("SycllaDBSampler.sampleEnded() error!", ex);
        }
    }

    /**
     * Writes one message. From/To follow the ring rule; floorMod keeps the index
     * non-negative even if the sample counter overflows.
     */
    private void writeStressTest(SampleResult sampleResult, int currentIndex) throws InterruptedException {
        String fromId = this.userIdList.get(Math.floorMod(currentIndex, this.userIdList.size()));
        String targetId = this.userIdList.get(Math.floorMod(currentIndex + 1, this.userIdList.size()));
        String msgId = UUIDUtil.getUUID().toString();
        long timestamp = System.currentTimeMillis();
        BoundStatement insert = this.insertStmt.bind(APP_ID,
                fromId,
                DAY_BUCKET,
                BOX_TYPE,
                timestamp,
                msgId,
                fromId,
                targetId,
                STATE_FLAG,
                CONVERSATION_TYPE,
                MSG_TYPE,
                CONTENT_DATA);
        if (this.isAsync) {
            this.asyncWrite(sampleResult, insert, currentIndex);
        } else {
            this.syncWrite(sampleResult, insert, currentIndex);
        }
    }

    /** Reads up to 10 rows from one user's partition for the current day. */
    private void readStressTest(SampleResult sampleResult, int currentIndex) throws InterruptedException {
        String userId = this.userIdList.get(Math.floorMod(currentIndex, this.userIdList.size()));
        BoundStatement select = this.selectStmt.bind(APP_ID, userId, DAY_BUCKET, BOX_TYPE);
        if (this.isAsync) {
            this.asyncRead(sampleResult, select, currentIndex);
        } else {
            this.syncRead(sampleResult, select, currentIndex);
        }
    }

    /**
     * Issues the read with executeAsync, then blocks on a latch until the callback has
     * recorded the result, so the sample time covers the full round trip.
     */
    private void asyncRead(SampleResult sampleResult, BoundStatement select, int currentIndex) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        sampleResult.sampleStart();
        this.cqlSession.executeAsync(select).whenComplete((rs, ex) -> {
            try {
                if (ex != null) {
                    sampleResult.sampleEnd();
                    sampleResult.setSuccessful(false);
                    sampleResult.setResponseCode(String.valueOf(CODE_SAMPLE_ERROR));
                    sampleResult.setResponseMessage(ex.getMessage());
                } else {
                    // The query is LIMIT 10, so the result is expected to fit in one page
                    // and counting should not need to fetch further pages in this callback.
                    int rowCount = countRows(rs);
                    sampleResult.sampleEnd();
                    sampleResult.setSuccessful(true);
                    sampleResult.setResponseCode(String.valueOf(CODE_SUCCESS));
                    sampleResult.setResponseMessage("OK");
                    if (currentIndex % 1000 == 0) {
                        LOGGER.info("asyncRead done, currentIndex: {}, time: {}, rowCount: {} ", currentIndex, sampleResult.getTime(), rowCount);
                    }
                }
            } finally {
                latch.countDown();
            }
        });
        latch.await();
    }

    /** Blocking read; driver exceptions propagate to sample(), which reports the failure. */
    private void syncRead(SampleResult sampleResult, BoundStatement select, int currentIndex) {
        sampleResult.sampleStart();
        ResultSet rs = this.cqlSession.execute(select);
        int rowCount = countRows(rs);
        sampleResult.sampleEnd();
        sampleResult.setSuccessful(true);
        sampleResult.setResponseCode(String.valueOf(CODE_SUCCESS));
        sampleResult.setResponseMessage("OK");
        if (currentIndex % 1000 == 0) {
            LOGGER.info("syncRead done, currentIndex: {}, time: {}, rowCount: {} ", currentIndex, sampleResult.getTime(), rowCount);
        }
    }

    /** Blocking write; driver exceptions propagate to sample(), which reports the failure. */
    private void syncWrite(SampleResult sampleResult, BoundStatement insert, int currentIndex) {
        sampleResult.sampleStart();
        this.cqlSession.execute(insert);
        sampleResult.sampleEnd();
        sampleResult.setSuccessful(true);
        sampleResult.setResponseCode(String.valueOf(CODE_SUCCESS));
        sampleResult.setResponseMessage("OK");
        if (currentIndex % 1000 == 0) {
            LOGGER.info("syncWrite done, currentIndex: {}, time: {}", currentIndex, sampleResult.getTime());
        }
    }

    /**
     * Issues the write with executeAsync, then blocks on a latch until the callback has
     * recorded the result.
     */
    private void asyncWrite(SampleResult sampleResult, BoundStatement insert, int currentIndex) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        sampleResult.sampleStart();
        this.cqlSession.executeAsync(insert).whenComplete((rs, ex) -> {
            try {
                if (ex != null) {
                    sampleResult.sampleEnd();
                    sampleResult.setSuccessful(false);
                    sampleResult.setResponseCode(String.valueOf(CODE_SAMPLE_ERROR));
                    sampleResult.setResponseMessage(ex.getMessage());
                } else {
                    sampleResult.sampleEnd();
                    sampleResult.setSuccessful(true);
                    sampleResult.setResponseCode(String.valueOf(CODE_SUCCESS));
                    sampleResult.setResponseMessage("OK");
                    if (currentIndex % 1000 == 0) {
                        LOGGER.info("asyncWrite done, currentIndex: {}, time: {}", currentIndex, sampleResult.getTime());
                    }
                }
            } finally {
                latch.countDown();
            }
        });
        latch.await();
    }

    /** Builds the user ID list, parses the sampler variables and opens the CQL session. */
    private CqlSession init() throws PocException {
        // initUserIdList
        for (int i = 0; i < 1000000; i++) {
            this.userIdList.add("POC-User-" + i);
        }
        // initCqlSession
        LinkedHashMap<String, String> varSettingMap = SamplerUtils.getKvMap(this.config.getSampleVar());
        String host = varSettingMap.get("host");
        int port = Integer.parseInt(varSettingMap.get("port"));
        String keySpace = varSettingMap.get("keySpace");
        this.cmdType = Integer.parseInt(varSettingMap.get("cmdType"));
        this.isAsync = Boolean.parseBoolean(varSettingMap.get("isAsync"));
        DriverConfigLoader driverConfigLoader = DriverConfigLoader.programmaticBuilder()
                .withString(DefaultDriverOption.SESSION_NAME, "im-test-session")
                .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofSeconds(15))
                .withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 16)
                .withInt(DefaultDriverOption.CONNECTION_POOL_REMOTE_SIZE, 1)
                .withInt(DefaultDriverOption.CONNECTION_MAX_REQUESTS, 2048)
                .withString(DefaultDriverOption.REQUEST_THROTTLER_CLASS, "ConcurrencyLimitingRequestThrottler")
                .withInt(DefaultDriverOption.REQUEST_THROTTLER_MAX_CONCURRENT_REQUESTS, 20000)
                .withInt(DefaultDriverOption.REQUEST_THROTTLER_MAX_QUEUE_SIZE, 200000)
                .build();
        this.cqlSession = CqlSession.builder()
                .withConfigLoader(driverConfigLoader)
                .addContactPoint(new InetSocketAddress(host, port))
                .withLocalDatacenter("datacenter1")
                .withKeyspace(keySpace)
                .build();
        String releaseVersion = this.cqlSession.execute("SELECT release_version FROM system.local").one().getString("release_version");
        LOGGER.info("CqlSession init done: {}", releaseVersion);
        return this.cqlSession;
    }

    /** Counts rows of an async result, fetching further pages if there are any. */
    @SuppressWarnings({"squid:S1481"})
    private static int countRows(AsyncResultSet rs) {
        int count = 0;
        AsyncResultSet current = rs;
        while (true) {
            for (var row : current.currentPage()) {
                count++;
            }
            if (!current.hasMorePages()) {
                break;
            }
            current = current.fetchNextPage().toCompletableFuture().join();
        }
        return count;
    }

    /** Counts rows of a sync result; iteration fetches further pages transparently. */
    @SuppressWarnings({"squid:S1481"})
    private static int countRows(ResultSet rs) {
        int count = 0;
        for (var row : rs) {
            count++;
        }
        return count;
    }

    private static final long APP_ID = 100000L;
    // yyyyMMdd, the format of the day_bucket column (JVM default time zone)
    private static final int DAY_BUCKET = Integer.parseInt(LocalDate.now().format(DateTimeFormatter.BASIC_ISO_DATE));
    private static final byte BOX_TYPE = (byte) 0;
    private static final long STATE_FLAG = 0L;
    private static final byte CONVERSATION_TYPE = (byte) 1;
    private static final String MSG_TYPE = "Chat";
    // Fixed sample payload used as the message body for every write
    private static final String CONTENT = "{\n" +
            "  \"messageId\": \"6oNL9o7tX8\",\n" +
            "  \"fromUserId\": \"10020439\",\n" +
            "  \"fromUserName\": \"\",\n" +
            "  \"content\": {\n" +
            "    \"callId\": \"4cd5433f92e74c71\",\n" +
            "    \"roomId\": 0,\n" +
            "    \"roomType\": 0,\n" +
            "    \"groupCid\": \"\",\n" +
            "    \"kickedUid\": \"\",\n" +
            "    \"status\": \"OnApply\",\n" +
            "    \"video\": true,\n" +
            "    \"senderId\": \"10020439\",\n" +
            "    \"audio\": true,\n" +
            "    \"isAnswered\": false,\n" +
            "    \"ownerId\": \"10020439\",\n" +
            "    \"ownerName\": \"ownerName\",\n" +
            "    \"callerList\": [\"10020439\", \"10018215\"],\n" +
            "    \"deviceKey\": \"android\",\n" +
            "    \"sdkProvider\": \"agora\",\n" +
            "    \"toRtcTokenList\": \"[{\\\"userId\\\":\\\"10018215\\\",\\\"token\\\":\\\"007eJxTY...\\\"}]\",\n" +
            "    \"encryptionKey\": \"2e65abf873a63313f42a859e0dc6b92c\",\n" +
            "    \"encryptionSalt\": \"drdwNQF5lpI9Wa6guAFt7fgynMm6lVCZ4+qw8q9vy8U=\",\n" +
            "    \"serverUrl\": \"\",\n" +
            "    \"serverOptions\": null\n" +
            "  },\n" +
            "  \"isEdit\": 0,\n" +
            "  \"replyToMessageId\": \"\",\n" +
            "  \"replyTo\": \"\",\n" +
            "  \"replyName\": \"\",\n" +
            "  \"replyUserId\": \"\",\n" +
            "  \"replyToType\": \"\",\n" +
            "  \"timeSend\": 1775808825451,\n" +
            "  \"type\": \"Calling\",\n" +
            "  \"toUserId\": \"10018215\",\n" +
            "  \"to\": [\"Static\"],\n" +
            "  \"toJid\": [\"10018215\"],\n" +
            "  \"atUserIds\": [],\n" +
            "  \"extraData\": \"\"\n" +
            "}";
    private static final ByteBuffer CONTENT_DATA = ByteBuffer.wrap(CONTENT.getBytes(StandardCharsets.UTF_8));
}
```