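Table of Contents
- I. Introduction
  - [1.1 Overview](#1.1 Overview)
  - [1.2 Memory-Ordering Semantics](#1.2 Memory-Ordering Semantics)
  - [1.3 Exclusive Monitor](#1.3 Exclusive Monitor)
  - [1.4 Exception (Data Abort) Handling](#1.4 Exception (Data Abort) Handling)
  - [1.5 Register Hint (performance hint)](#1.5 Register Hint (performance hint))
  - [1.6 Performance Practices](#1.6 Performance Practices)
  - [1.7 Semantic Details and Implementation Notes](#1.7 Semantic Details and Implementation Notes)
  - [1.8 Instruction Format](#1.8 Instruction Format)
- II. Code Examples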
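I. Introduction
1.1 Overview
CASP, CASPA, CASPL, and CASPAL perform an atomic Compare-and-Swap Pair on **a pair of 32-bit words or a pair of 64-bit doublewords** in memory. The instruction reads the pair from memory and compares it with the first register pair; if the two are equal, it writes the second register pair to memory. The read and the write happen as one atomic operation: no other processor can modify the location between them.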
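The read, compare, and conditional write can be modelled in plain C. The sketch below is a single-threaded reference model only: it is not atomic, and the names `pair64` and `casp_model` are hypothetical. It mirrors the register convention in which the compare pair is overwritten with the value read from memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: two 64-bit doublewords, as used by the X-register form. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} pair64;

/*
 * Non-atomic reference model of CASP (illustration only).
 *   cmp     : in  = expected value, out = value that was read from memory
 *   desired : value stored when memory equals the expected value
 * Returns true when the swap happened.
 */
bool casp_model(pair64 *mem, pair64 *cmp, pair64 desired)
{
    assert(mem != NULL && cmp != NULL);
    pair64 old = *mem;                                   /* read the pair        */
    bool equal = (old.lo == cmp->lo) && (old.hi == cmp->hi);
    if (equal) {
        *mem = desired;                                  /* conditional write    */
    }
    *cmp = old;                                          /* compare pair <- old  */
    return equal;
}
```

Whether the swap happened can be read either from the return value or, as with the real instruction, by comparing the returned pair against the value that was passed in.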
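1.2 Memory-Ordering Semantics
CASP: neither acquire nor release (relaxed)
CASPA: load-acquire (the read has acquire semantics)
CASPL: store-release (the write has release semantics)
CASPAL: load-acquire plus store-release (acq_rel)
Use CASPA when later accesses must be ordered after the read, CASPL when earlier accesses must be ordered before the write, and CASPAL, the common case, when both are needed.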
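In C11 terms, the four variants line up with the memory orders of a compare-exchange. The sketch below uses the pair-of-words case packed into one `_Atomic(uint64_t)` so that it stays portable. `pack_pair` and the `cas_pair_*` names are hypothetical, and which instruction a compiler actually emits for them depends on the target and flags (an assumption, not a guarantee).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) == 2 * sizeof(uint32_t),
              "a pair of 32-bit words fits in one 64-bit object");

/* Hypothetical helper: pack two 32-bit words into one 64-bit value. */
uint64_t pack_pair(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* No suffix: neither acquire nor release. */
bool cas_pair_relaxed(_Atomic(uint64_t) *p, uint64_t *expected, uint64_t desired)
{
    return atomic_compare_exchange_strong_explicit(
        p, expected, desired, memory_order_relaxed, memory_order_relaxed);
}

/* A suffix: the read is an acquire. */
bool cas_pair_acquire(_Atomic(uint64_t) *p, uint64_t *expected, uint64_t desired)
{
    return atomic_compare_exchange_strong_explicit(
        p, expected, desired, memory_order_acquire, memory_order_acquire);
}

/* L suffix: the write is a release. C11 does not allow a release failure order. */
bool cas_pair_release(_Atomic(uint64_t) *p, uint64_t *expected, uint64_t desired)
{
    return atomic_compare_exchange_strong_explicit(
        p, expected, desired, memory_order_release, memory_order_relaxed);
}

/* AL suffix: acquire on the read, release on the write. */
bool cas_pair_acq_rel(_Atomic(uint64_t) *p, uint64_t *expected, uint64_t desired)
{
    return atomic_compare_exchange_strong_explicit(
        p, expected, desired, memory_order_acq_rel, memory_order_acquire);
}
```

As with the instruction, a failed call overwrites `*expected` with the value that was observed in memory.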
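1.3 Exclusive Monitor
The instruction performs a read, a compare, and possibly a write; it does not rely on exclusive accesses itself. The architecture nevertheless permits the read to clear any exclusive monitor associated with that location, even if the compare subsequently fails.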
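1.4 Exception (Data Abort) Handling
If the instruction generates a synchronous Data Abort (for example, an invalid address or a permission fault), the registers that are compared and loaded (<Ws>/<W(s+1)> or <Xs>/<X(s+1)>) are restored to the values they held before the instruction executed.
This matters for exception safety: there is no risk of these registers being left in an arbitrary state after the abort.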
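1.5 Register Hint (performance hint)
When the compare-and-load registers (<Ws>/<Xs>) and the store registers (<Wt>/<Xt>) name the same registers, this acts as a hint to the memory system that another CAS-type access to the same address is likely to follow soon. The memory subsystem may use the hint to make the subsequent CAS more likely to succeed (for example, by prefetching or adjusting cache-directory state). It is a performance hint only and is not required for correctness.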
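1.6 Performance Practices
For a sequence that starts with CASP/CASPA and is closely followed by another CASP/CASPA/CASPL/CASPAL to the same address, the following rules give better performance:
(1) Keep system register writes, address translation instructions, cache/TLB maintenance, exception-generating instructions, exception returns, and ISB out of the sequence. (They defeat the optimizations the hardware sets up to speed up the subsequent CAS.)
(2) Keep the sequence to 32 instructions or fewer.
(3) For the first CASP/CASPA, prefer a value in <Ws>/<Xs> that is likely to fail the compare (for example, a deliberately wrong value). A failed compare performs no write, which is sometimes cheaper because the hardware can skip the actual store.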
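1.7 Semantic Details and Implementation Notes
(1) Return value: after execution, the compare pair (<Ws>/<W(s+1)> or <Xs>/<X(s+1)>) holds the value that was in memory (the observed old value). The caller uses it to tell success from failure: the swap succeeded if and only if the returned pair equals the expected value.
(2) If the compare fails, the architecture permits the value that was read to be written back to memory. Implementations usually do not do this, but the permission leaves hardware free to choose the most efficient implementation.
(3) The register pairs must satisfy the ISA constraints: each pair consists of two consecutive registers starting at an even-numbered register. The assembler checks this.
(4) For a 128-bit pair CAS, the LSE instructions (CASP*) are generally more efficient than an ldaxp/stlxp loop. Where LSE is unavailable, the fallback is ldaxp + stlxp in a retry loop, ideally with clrex on the compare-failure path.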
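Because a failed CAS hands back the value it observed, a retry loop can reuse that value as the next expected value instead of reloading. The sketch below shows the pattern with a (value, version) pair of 32-bit words packed into one `_Atomic(uint64_t)`. `make_tagged`, `tagged_store`, and the packing layout are hypothetical; the same pattern applies to the 128-bit form.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) == 2 * sizeof(uint32_t),
              "a pair of 32-bit words fits in one 64-bit object");

/* Hypothetical layout: low word = value, high word = version counter. */
uint64_t make_tagged(uint32_t value, uint32_t version)
{
    return ((uint64_t)version << 32) | value;
}

/*
 * Store new_value and bump the version in one atomic step.
 * Returns the number of CAS attempts that were needed.
 */
unsigned tagged_store(_Atomic(uint64_t) *slot, uint32_t new_value)
{
    unsigned attempts = 0;
    uint64_t observed = atomic_load_explicit(slot, memory_order_relaxed);
    for (;;) {
        uint32_t version = (uint32_t)(observed >> 32);
        uint64_t desired = make_tagged(new_value, version + 1u);
        attempts++;
        /* On failure, observed is overwritten with the current contents,
           so the next iteration needs no separate reload. */
        if (atomic_compare_exchange_strong_explicit(
                slot, &observed, desired,
                memory_order_acq_rel, memory_order_acquire)) {
            return attempts;
        }
    }
}
```

Without contention the first attempt succeeds; under contention each failed attempt already carries the fresh contents needed for the next one.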
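1.8 Instruction Format

The operand format is as follows (64-bit form shown):
CASP <Xs>, <X(s+1)>, <Xt>, <X(t+1)>, [<Xn|SP>]
Xs : first 64-bit register to be compared and loaded (must be an even-numbered register)
X(s+1): second 64-bit register to be compared and loaded (must be Xs+1)
Xt : first 64-bit register to be conditionally stored (must be an even-numbered register)
X(t+1): second 64-bit register to be conditionally stored (must be Xt+1)
Xn|SP : 64-bit base register or stack pointer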
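The pairing rules are simple enough to check mechanically. The helper below is a hypothetical sketch that encodes only the two rules listed above (even-numbered first register, second register equal to the first plus one); it assumes the register numbers themselves have already been validated.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical check of one CASP register pair, given register numbers. */
bool casp_pair_ok(unsigned first, unsigned second)
{
    return (first % 2u == 0u) && (second == first + 1u);
}

/* Both the compare pair (s, s+1) and the store pair (t, t+1) must be valid. */
bool casp_operands_ok(unsigned s, unsigned s1, unsigned t, unsigned t1)
{
    return casp_pair_ok(s, s1) && casp_pair_ok(t, t1);
}
```

For example, (x4, x5) with (x6, x7) passes, while (x1, x2) with (x3, x4) fails because x1 and x3 are odd-numbered.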
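A valid example (64-bit):
// addr, expected_* and desired_* are placeholders for real operands
mov x0, addr
mov x4, expected_lo
mov x5, expected_hi
mov x6, desired_lo
mov x7, desired_hi
caspal x4, x5, x6, x7, [x0]
This uses two valid register pairs, (x4, x5) and (x6, x7).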
II. Code Examples
caspal_cmpxchg128.S:
.text
.global caspal_cmpxchg128
.type caspal_cmpxchg128, %function
// int caspal_cmpxchg128(uint64_t *addr,
// uint64_t expected_lo, uint64_t expected_hi,
// uint64_t desired_lo, uint64_t desired_hi,
// uint64_t *out_lo, uint64_t *out_hi);
//
// return 0 = success (out = expected)
// return 1 = fail (out = observed)
caspal_cmpxchg128:
// x0 = addr
// x1 = expected_lo
// x2 = expected_hi
// x3 = desired_lo
// x4 = desired_hi
// x5 = out_lo
// x6 = out_hi
// Put expected in an even-start pair x8/x9 (caller-saved temporaries)
mov x8, x1
mov x9, x2
// Put desired in an even-start pair x10/x11.
// x18 (platform register) and x19 (callee-saved) must not be clobbered here.
mov x10, x3
mov x11, x4
// CASPAL (acquire + release). On return x8/x9 hold the observed old value.
caspal x8, x9, x10, x11, [x0]
// Compare observed == expected?
cmp x8, x1
ccmp x9, x2, #0, eq
// If equal, take the success path (Z flag set)
b.eq 1f
// Failure path: observed != expected
// Write observed to out and return 1
str x8, [x5]
str x9, [x6]
mov x0, #1
ret
1:
// Success path: memory was equal to expected and has been updated to desired.
// Per API, return out = expected, return 0.
str x1, [x5]
str x2, [x6]
mov x0, #0
ret
.size caspal_cmpxchg128, .-caspal_cmpxchg128
ldxp_cmpxchg128.S:
.text
.global ldxp_cmpxchg128
.type ldxp_cmpxchg128, %function
// int ldxp_cmpxchg128(uint64_t *addr,
// uint64_t expected_lo, uint64_t expected_hi,
// uint64_t desired_lo, uint64_t desired_hi,
// uint64_t *out_lo, uint64_t *out_hi);
//
// return 0 = success (out = expected)
// return 1 = fail (out = observed)
ldxp_cmpxchg128:
// x0 = addr
// x1 = expected_lo
// x2 = expected_hi
// x3 = desired_lo
// x4 = desired_hi
// x5 = out_lo
// x6 = out_hi
// Copy expected and desired to temporaries
mov x7, x1
mov x8, x2
mov x9, x3
mov x10, x4
LoopStart:
// Exclusive pair load (acquire)
ldaxp x11, x12, [x0] // observed_lo -> x11, observed_hi -> x12
// Compare observed with expected
cmp x11, x7
ccmp x12, x8, #0, eq
b.ne CasFail // if not equal -> fail path (must clrex)
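/*
Assembler requirement for the exclusive pair stores (STXP/STLXP):
STLXP is not a plain 128-bit store but a store-exclusive pair, so
the status register must be a 32-bit W register (not an X register),
while the data registers may be two 64-bit X registers.
The correct form is therefore:
stlxp w13, x9, x10, [x0]
The first operand is the W register that receives the status
(0 = success, 1 = failure); it is not an X register.
*/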
// Try exclusive pair store (release). w13 is status: 0 = success
stlxp w13, x9, x10, [x0]
cbnz w13, LoopStart // store failed, retry (monitor already cleared/handled by stlxp)
// Success: stored desired, per API out = expected, return 0
str x7, [x5]
str x8, [x6]
mov x0, #0
ret
CasFail:
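/*
On the CAS-failure (mismatch) path, ldaxp has already set the local
exclusive monitor, but no store-exclusive will follow. clrex clears that
leftover state so it cannot affect a later exclusive sequence. This is
good practice rather than a strict architectural requirement:
- any store-exclusive, whether it succeeds or fails, clears the local monitor
- a path that skips stlxp leaves the monitor set until clrex or a later
  store-exclusive clears it
Also note that without a successful stlxp, the two halves read by ldaxp are
not guaranteed to form a single atomic 128-bit snapshot.
*/
// Mismatch: we executed ldaxp but will not execute stlxp -> clear the monitor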
clrex
// Return observed values
str x11, [x5]
str x12, [x6]
mov x0, #1
ret
.size ldxp_cmpxchg128, .-ldxp_cmpxchg128
caspal_cas128.c:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>
// Declarations of assembly functions
extern int caspal_cmpxchg128(uint64_t *addr,
uint64_t expected_lo, uint64_t expected_hi,
uint64_t desired_lo, uint64_t desired_hi,
uint64_t *out_lo, uint64_t *out_hi);
extern int ldxp_cmpxchg128(uint64_t *addr,
uint64_t expected_lo, uint64_t expected_hi,
uint64_t desired_lo, uint64_t desired_hi,
uint64_t *out_lo, uint64_t *out_hi);
// aligned 128-bit pair
struct pair_t {
uint64_t lo;
uint64_t hi;
} __attribute__((aligned(16)));
static void print_pair(const char *label, const struct pair_t *p) {
printf("%s = {0x%016" PRIx64 ", 0x%016" PRIx64 "}\n", label, p->lo, p->hi);
}
int main(void) {
struct pair_t v;
uint64_t out_lo, out_hi;
int r;
printf("=== Test CASPAL (hardware LSE) ===\n");
v.lo = 0x1122334455667788ULL;
v.hi = 0x99AABBCCDDEEFF00ULL;
print_pair("Before", &v);
r = caspal_cmpxchg128(&v.lo,
0x1122334455667788ULL, 0x99AABBCCDDEEFF00ULL,
0x1111111111111111ULL, 0x2222222222222222ULL,
&out_lo, &out_hi);
printf("Return = %d (0=success)\n", r);
printf("Observed returned: {0x%016" PRIx64 ", 0x%016" PRIx64 "}\n", out_lo, out_hi);
print_pair("After", &v);
printf("\n");
printf("=== Test LDAXP/STLXP (software fallback) ===\n");
// reset memory
v.lo = 0x1122334455667788ULL;
v.hi = 0x99AABBCCDDEEFF00ULL;
print_pair("Before", &v);
r = ldxp_cmpxchg128(&v.lo,
0x1122334455667788ULL, 0x99AABBCCDDEEFF00ULL,
0xAAAAAAAAAAAAAAAAULL, 0xBBBBBBBBBBBBBBBBULL,
&out_lo, &out_hi);
printf("Return = %d (0=success)\n", r);
printf("Observed returned: {0x%016" PRIx64 ", 0x%016" PRIx64 "}\n", out_lo, out_hi);
print_pair("After", &v);
printf("\n");
// Test a failure case: expected does not match memory
printf("=== Test failure (mismatch expected) ===\n");
v.lo = 0xDEADBEEFDEADBEEFULL;
v.hi = 0xCAFEBABECAFEBABEULL;
print_pair("Before", &v);
r = ldxp_cmpxchg128(&v.lo,
0x1111111111111111ULL, 0x2222222222222222ULL, // wrong expected
0x3333333333333333ULL, 0x4444444444444444ULL,
&out_lo, &out_hi);
printf("Return = %d (0=success)\n", r);
printf("Observed returned: {0x%016" PRIx64 ", 0x%016" PRIx64 "}\n", out_lo, out_hi);
print_pair("After", &v);
printf("\n");
return 0;
}
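The `aligned(16)` attribute on `pair_t` matters because the 64-bit pair form operates on a 16-byte location. A small check like the following can guard call sites. `pair16` and `is_aligned16` are hypothetical names, and the C11 `_Alignas` spelling is used here as a portable alternative to the GCC attribute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same layout as pair_t above, spelled with C11 _Alignas instead of the attribute. */
struct pair16 {
    _Alignas(16) uint64_t lo;
    uint64_t hi;
};

static_assert(sizeof(struct pair16) == 16, "pair must be exactly 16 bytes");
static_assert(_Alignof(struct pair16) == 16, "pair must be 16-byte aligned");

/* Hypothetical helper: true when p is suitably aligned for a 128-bit pair CAS. */
bool is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Passing `&v.lo`, as `main` does, is fine because `lo` is the first member and therefore shares the struct's alignment; a pointer to `hi` would not pass the check.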
Makefile:
# Makefile for CASPAL + LDAXP/STLXP 128-bit CAS test
# Compiler
CC = gcc
# Optimization and architecture flags
CFLAGS = -O2
ASMFLAGS_CASPAL = -O2 -march=armv8.1-a
ASMFLAGS_LDXP = -O2
# Object files
OBJS = caspal_cas128.o caspal_cmpxchg128.o ldxp_cmpxchg128.o
# Final executable
TARGET = test
# Default target
all: $(TARGET)
# Link
$(TARGET): $(OBJS)
	$(CC) $(OBJS) -o $(TARGET)
# Compile the C file
caspal_cas128.o: caspal_cas128.c
	$(CC) $(CFLAGS) -c caspal_cas128.c
# Assemble the CASPAL file (requires ARMv8.1-A)
caspal_cmpxchg128.o: caspal_cmpxchg128.S
	$(CC) $(ASMFLAGS_CASPAL) -c caspal_cmpxchg128.S
# Assemble the LDAXP/STLXP file (ARMv8.1-A not required)
ldxp_cmpxchg128.o: ldxp_cmpxchg128.S
	$(CC) $(ASMFLAGS_LDXP) -c ldxp_cmpxchg128.S
# Remove generated files
clean:
	rm -f $(OBJS) $(TARGET)
.PHONY: all clean
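Building with `-march=armv8.1-a` only makes the assembler accept CASPAL; the CPU that runs the binary still has to implement LSE. On Linux/AArch64 this can be checked at run time through the auxiliary vector. The sketch below assumes glibc's `getauxval` and the kernel's `HWCAP_ATOMICS` bit. On other platforms it simply reports 0, and `cpu_has_lse` and `built_with_lse` are hypothetical names.

```c
#include <assert.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>              /* getauxval, AT_HWCAP */
#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1UL << 8)   /* LSE atomics bit in AT_HWCAP on arm64 */
#endif
#endif

/* Returns 1 if the running CPU reports LSE atomics, 0 otherwise or if unknown. */
int cpu_has_lse(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) ? 1 : 0;
#else
    return 0;   /* not Linux/AArch64: this sketch has no way to ask */
#endif
}

/* Returns 1 if the compiler was told LSE is available (e.g. -march=armv8.1-a). */
int built_with_lse(void)
{
#if defined(__ARM_FEATURE_ATOMICS)
    return 1;
#else
    return 0;
#endif
}
```

A test program could call `cpu_has_lse()` first and skip the CASPAL test when it returns 0, running only the LDAXP/STLXP version.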
Output:
$ make
gcc -O2 -c caspal_cas128.c
gcc -O2 -march=armv8.1-a -c caspal_cmpxchg128.S
gcc -O2 -c ldxp_cmpxchg128.S
gcc caspal_cas128.o caspal_cmpxchg128.o ldxp_cmpxchg128.o -o test
$ ./test
=== Test CASPAL (hardware LSE) ===
Before = {0x1122334455667788, 0x99aabbccddeeff00}
Return = 0 (0=success)
Observed returned: {0x1122334455667788, 0x99aabbccddeeff00}
After = {0x1111111111111111, 0x2222222222222222}
=== Test LDAXP/STLXP (software fallback) ===
Before = {0x2222222222222222, 0x99aabbccddeeff00}
Return = 0 (0=success)
Observed returned: {0x2222222222222222, 0x99aabbccddeeff00}
After = {0xaaaaaaaaaaaaaaaa, 0xbbbbbbbbbbbbbbbb}
=== Test failure (mismatch expected) ===
Before = {0xdeadbeefdeadbeef, 0xcafebabecafebabe}
Return = 1 (0=success)
Observed returned: {0xdeadbeefdeadbeef, 0xcafebabecafebabe}
After = {0xdeadbeefdeadbeef, 0xcafebabecafebabe}
Note: in the second test, Before shows lo = 0x2222222222222222 even though main resets it to 0x1122334455667788. That value equals desired_hi from the first call, which is what one would expect if x19 (callee-saved under AAPCS64) were overwritten inside caspal_cmpxchg128 while the compiler was keeping the constant there. A routine that uses only caller-saved temporaries should print {0x1122334455667788, 0x99aabbccddeeff00} on that line.
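Note: output convention of the functions above

| Case | Condition | Memory | out_lo/out_hi |
| --- | --- | --- | --- |
| CAS success | \[addr\] == expected | desired is written | expected_lo/expected_hi |
| CAS failure | \[addr\] != expected | unchanged | observed_lo/observed_hi |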