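Background
This is a problem I ran into right after joining Shopee as a new graduate. The WAF on Shopee's private cloud lets internal users configure IP allowlist and blocklist rules, all of which are stored in MySQL. I inherited the system from a predecessor who had already left the company. I soon noticed that for about 5 minutes after any rule change, reads were unstable: the new rule sometimes showed up and sometimes did not, and users reported this regularly. The cause turned out to be an in-memory cache in the service code. The service ran as two instances, and writes were not synchronized between them. If a write and the read that followed it were routed to different instances, the read could not see the latest data until the cache entry expired, and the cache TTL was set to 5 minutes.
The service's ops records showed that it had been scaled out from one instance to two before I joined. While my predecessor maintained the WAF it always ran as a single instance, so the problem never surfaced. After he left, the colleague who did the scale-out probably did not realize it would cause inconsistency. So the problem landed on me.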
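Introducing Redis
My first idea was to replace the in-memory cache with Redis. During the canary rollout, however, Redis bandwidth was saturated. It turned out that some rules carry very long lists of blocked IPs, so the volume of data transferred on each read was huge.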
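A back-of-the-envelope estimate shows how quickly this adds up. Every number below is made up for illustration and is not a Shopee measurement; the only point is the ratio between shipping a whole rule on each read and shipping a short version string.

```go
package main

import "fmt"

// bytesPerSecond estimates Redis egress for a given read rate and payload size.
func bytesPerSecond(readsPerSec, payloadBytes int64) int64 {
	return readsPerSec * payloadBytes
}

func main() {
	const (
		readsPerSec  = 2000  // hypothetical read rate
		ipsPerRule   = 50000 // hypothetical length of one long block list
		bytesPerIP   = 16    // "255.255.255.255" plus a separator
		versionBytes = 16    // a microsecond timestamp written as decimal digits
	)
	full := bytesPerSecond(readsPerSec, ipsPerRule*bytesPerIP)
	ver := bytesPerSecond(readsPerSec, versionBytes)
	fmt.Printf("whole rule on every read: %d MB/s\n", full/1_000_000)
	fmt.Printf("version only on every read: %d KB/s\n", ver/1_000)
}
```

Under these assumed numbers the two differ by a factor of 50,000, which is simply the assumed list length.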
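Final design
WAF rules are read far more often than they are written, so most of the time the data read from Redis has not changed. An experienced colleague suggested keeping only a version number in Redis while the rule data itself stays in the in-memory cache. After several rounds of refinement, the final design is as follows.
Read and write logic:
- Writes are simple. Use the current microsecond timestamp as the new version number, then do four things: write the DB, update the version number in Redis, and update both the data and the version number in the local in-memory cache. The order of these four steps is interchangeable.
- Reads are slightly more involved. First read the version number from Redis. If the local version number is still current (the vast majority of cases), return the data straight from the local in-memory cache. Two cases need separate handling: the Redis and in-memory version numbers differ, or the key is missing from Redis (expired). The pseudocode shows how both are handled.
- If several writes arrive within the same microsecond, inconsistency is still possible. Shopee WAF rules are not updated anywhere near that often in practice, so I left this case unhandled. Note that the timestamp is only ever compared for equality, never ordered, so any distributed unique-ID scheme can replace it.
- The version number is not exposed to users. The same version number can in fact map to different rule data, but this does not break eventual consistency.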
func Set(key, data) {
    // Versions are only compared for equality, so a microsecond timestamp is enough
    newVer := time()
    localCacheVer.Store(newVer)
    localCacheData.Store(data)
    WriteMySQL(key, data)
    redis.Set(key, newVer, expire=5min)
}

func Read(key) Data {
    ver := redis.Get(key)
    if ver != nil {
        if localCacheVer.Load() == ver {
            // Local cache is up-to-date, just use it
            return localCacheData.Load()
        }
    } else { // This version has expired
        ver = time() // assign to the outer ver; := here would shadow it
        res := redis.SetNX(key, ver, expire=5min)
        if res == false {
            // Another instance has proceeded, use that version
            ver = redis.Get(key)
        }
    }
    // Version mismatch or expired key: reload from MySQL and refresh the local cache
    data := ReadFromMySQL(key)
    localCacheVer.Store(ver)
    localCacheData.Store(data)
    return data
}
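For readers who want to step through the logic, the pseudocode can be turned into a small runnable Go sketch. This is an illustration under stated assumptions, not the production code: `kv` is a plain map standing in for both Redis and MySQL, `instance`, `entry`, and `newVersion` are hypothetical names, TTLs are not modeled (expiry is simulated by deleting the key), and nothing here is goroutine-safe. The version is a random 128-bit ID rather than a timestamp, which works because versions are only compared for equality.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// kv is a plain map standing in for both Redis and MySQL (hypothetical).
// It is not goroutine-safe; the demo runs on a single goroutine.
type kv map[string]string

// SetNX stores v only if k is absent and reports whether it did.
func (s kv) SetNX(k, v string) bool {
	if _, ok := s[k]; ok {
		return false
	}
	s[k] = v
	return true
}

// newVersion returns a random 128-bit ID. Versions are only compared for
// equality, so any unique ID works in place of a timestamp.
func newVersion() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

type entry struct{ ver, data string }

// instance is one service replica with its own local cache.
type instance struct {
	local     map[string]entry
	redis, db kv
}

func newInstance(redis, db kv) *instance {
	return &instance{local: map[string]entry{}, redis: redis, db: db}
}

// Set follows the same step order as the pseudocode.
func (in *instance) Set(key, data string) {
	ver := newVersion()
	in.local[key] = entry{ver, data}
	in.db[key] = data
	in.redis[key] = ver
}

func (in *instance) Read(key string) string {
	ver, ok := in.redis[key]
	if ok {
		if e, hit := in.local[key]; hit && e.ver == ver {
			return e.data // local cache is current
		}
	} else {
		ver = newVersion()
		if !in.redis.SetNX(key, ver) {
			ver = in.redis[key] // another instance set it first
		}
	}
	// Version mismatch or missing key: reload and remember the version.
	data := in.db[key]
	in.local[key] = entry{ver, data}
	return data
}

func main() {
	redis, db := kv{}, kv{}
	a, b := newInstance(redis, db), newInstance(redis, db)

	a.Set("rule", "deny 10.0.0.1")
	fmt.Println(b.Read("rule")) // b has no local copy and loads from the DB
	a.Set("rule", "deny 10.0.0.2")
	fmt.Println(b.Read("rule")) // version changed, so b reloads
	delete(redis, "rule")       // simulate the version key expiring
	fmt.Println(a.Read("rule")) // a installs a fresh version via SetNX
	fmt.Println(b.Read("rule")) // b sees the new version and reloads
}
```

On the common path a read costs one small Redis lookup plus a local map access, and MySQL is touched only when the version changes or the key expires.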
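TLA+ formal verification
I happened to be teaching myself TLA+ at the time, so I wrote a TLA+ spec for this design, and it did pass the eventual-consistency check. Writing this retrospective, I suspect the design is actually linearizable, but I have not verified that.
The original problem, where the API returned inconsistent data for up to 5 minutes, was resolved.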
// ================ tla file ================
---- MODULE waf ----
EXTENDS Integers, TLC

VARIABLE redisVer, localVer, pc, threadVer, DBData, localData, threadData
CONSTANTS DataDomain, ProcSet, r1, r2, r3, t1, t2, t3

vars == << redisVer, localVer, pc, threadVer, localData, threadData, DBData >>

Init == /\ redisVer = -1 /\ localVer = -1 /\ localData = "" /\ DBData = ""
        /\ threadVer = [self \in ProcSet |-> -1]
        /\ pc = [self \in ProcSet |-> "A"]
        /\ threadData = [self \in ProcSet |-> ""]

RedisExpire == /\ threadData = [self \in ProcSet |-> DBData]
               /\ redisVer' = -1
               /\ DBData' \in DataDomain
               /\ UNCHANGED <<localVer, threadVer, localData, threadData, pc>>

ReadRedis(self) == /\ pc[self] = "A"
                   /\ threadVer' = [threadVer EXCEPT ![self] = redisVer]
                   /\ \/ /\ redisVer = -1
                         /\ pc' = [pc EXCEPT ![self] = "C"]
                      \/ /\ redisVer # -1
                         /\ pc' = [pc EXCEPT ![self] = "F"]
                   /\ UNCHANGED <<localVer, redisVer, localData, threadData, DBData>>

SetRedis(self) == /\ pc[self] = "C"
                  /\ \/ /\ redisVer # -1   \* SetNX failed => use existing redis
                        /\ redisVer' = redisVer
                        /\ threadVer' = [threadVer EXCEPT ![self] = redisVer]   \* Not strictly the same!
                     \/ /\ redisVer = -1   \* SetNX ok => change redis
                        /\ redisVer' \in 1600012345..1600012350
                        /\ threadVer' = [threadVer EXCEPT ![self] = redisVer']
                  /\ pc' = [pc EXCEPT ![self] = "I"]
                  /\ UNCHANGED <<localVer, localData, threadData, DBData>>

CheckLocal(self) == /\ pc[self] = "F"
                    /\ \/ /\ localVer = threadVer[self]   \* Normal case
                          /\ threadData' = [threadData EXCEPT ![self] = localData]
                          /\ pc' = [pc EXCEPT ![self] = "H"]
                       \/ /\ localVer # threadVer[self]
                          /\ pc' = [pc EXCEPT ![self] = "I"]
                          /\ threadData' = threadData
                    /\ UNCHANGED <<redisVer, localVer, localData, threadVer, DBData>>

SetLocal(self) == /\ pc[self] = "I"
                  /\ localVer' = threadVer[self]
                  /\ localData' = DBData
                  /\ threadData' = [threadData EXCEPT ![self] = DBData]
                  /\ pc' = [pc EXCEPT ![self] = "H"]
                  /\ UNCHANGED <<redisVer, threadVer, DBData>>

ReturnResult(self) == /\ pc[self] = "H"
                      /\ pc' = [pc EXCEPT ![self] = "Done"]
                      /\ UNCHANGED <<redisVer, localVer, threadVer, localData, threadData, DBData>>

Again(self) == /\ pc[self] = "Done"
               /\ pc' = [pc EXCEPT ![self] = "A"]
               /\ UNCHANGED <<redisVer, localVer, threadVer, localData, threadData, DBData>>

Terminating == /\ \A self \in ProcSet: pc[self] = "Done"
               /\ UNCHANGED vars

Proceed(t) == ReadRedis(t) \/ SetRedis(t) \/ CheckLocal(t) \/ SetLocal(t) \/ ReturnResult(t) \/ Again(t)

Next == \/ RedisExpire
        \/ \E t \in ProcSet: Proceed(t)

FairForEveryone == \A t \in ProcSet: SF_vars(Proceed(t))

Spec == /\ Init /\ [][Next]_vars /\ FairForEveryone

symm == Permutations({r1, r2, r3}) \union Permutations({t1, t2, t3})

EventualCons == \A v \in DataDomain: DBData = v ~> threadData = [t \in ProcSet |-> v]

ECSpec == Spec /\ EventualCons
====
// ======= cfg file ========
SPECIFICATION Spec
CONSTANTS
    DataDomain = {r1, r2}
    r1 = r1
    r2 = r2
    r3 = r3
    ProcSet = {t1, t2, t3}
    t1 = t1
    t2 = t2
    t3 = t3
SYMMETRY symm
PROPERTIES EventualCons