The cost of division
Performance summary:
Division: 545 ms
Multiplication: 93 ms
Division is about 5.9x slower (545 / 93).
Code:
#include <string.h>  // memset

#define CNT (1000)
#define factor (0.666667) // 0.666667 for multiplication; 1/1.5 for the division run

// Pack1 and Pack2 are defined elsewhere (definitions not shown).
void parse3AState(Pack1* in, Pack2* out) {
    for (int i = 0; i < CNT; i++) {
        out[i].a1.a = in[i].a1.a * factor;
        out[i].a1.b = in[i].a1.b * factor;
        out[i].a1.c = in[i].a1.c * factor;

        out[i].a2.a = in[i].a2.a * factor;
        out[i].a2.b = in[i].a2.b * factor;
        out[i].a2.c = in[i].a2.c * factor;

        out[i].a3.a = in[i].a3.a * factor;
        out[i].a3.b = in[i].a3.b * factor;
        out[i].a3.c = in[i].a3.c * factor;
    }
}

int parseTest() {
    int cnt = 1000 * 10;  // number of calls to parse3AState
    Pack1 in[CNT] = {0};
    Pack2 out[CNT] = {0};
    memset(in, 1, sizeof(in));    // fill every byte with 0x01 so the inputs are non-zero
    memset(out, 1, sizeof(out));
    for (int i = 0; i < cnt; i++) {
        parse3AState(in, out);
    }
    return cnt;
}
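The listing above shows only the multiplication variant. Below is a minimal sketch of the two variants side by side, assuming the fields are `float` and that the division run simply replaces `* factor` with `/ 1.5`. `Triple`, `scale_div`, `scale_mul`, `DIVISOR`, and `FACTOR` are hypothetical names, since the definitions of `Pack1` and `Pack2` are not given.

```c
#include <stddef.h>

/* Hypothetical stand-in for one a1/a2/a3 member; the field types are assumed. */
typedef struct { float a, b, c; } Triple;

#define DIVISOR 1.5f
#define FACTOR  0.666667f  /* rounded reciprocal of DIVISOR */

/* Division variant: one floating-point divide per field. */
void scale_div(const Triple *in, Triple *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i].a = in[i].a / DIVISOR;
        out[i].b = in[i].b / DIVISOR;
        out[i].c = in[i].c / DIVISOR;
    }
}

/* Multiplication variant: same work, with a multiply by the reciprocal. */
void scale_mul(const Triple *in, Triple *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i].a = in[i].a * FACTOR;
        out[i].b = in[i].b * FACTOR;
        out[i].c = in[i].c * FACTOR;
    }
}
```

The two variants agree only to within rounding, because 0.666667 is not exactly 1/1.5. For the same reason, compilers generally keep the divide unless flags such as `-ffast-math` allow them to substitute a rounded reciprocal.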
Test results:
Division run:
perfile_monitor_test_fun() E
cpu cycles: 1079996448 cycles per loop: 107999.645
inst cnt: 260163128 insts per loop: 26016.313
cache misses: 14571430
cache ipc: 0.240893
perfile_monitor_test_fun() X. perf:545.50ms
Multiplication run:
perfile_monitor_test_fun() E
cpu cycles: 179856277 cycles per loop: 17985.628
inst cnt: 270163809 insts per loop: 27016.381
cache misses: 14543847
cache ipc: 1.502109
perfile_monitor_test_fun() X. perf:93.20ms
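The value the log prints as `cache ipc` matches instruction count divided by cycle count, i.e. ordinary instructions per cycle. A small sketch that recomputes it from the counters above; `ipc` is a hypothetical helper, not part of the profiling tool.

```c
#include <stdint.h>

/* Instructions per cycle, computed from two raw counter values. */
double ipc(uint64_t instructions, uint64_t cycles) {
    return (double)instructions / (double)cycles;
}
```

With the logged counters, `ipc(260163128, 1079996448)` gives about 0.2409 for the division run and `ipc(270163809, 179856277)` gives about 1.5021 for the multiplication run. Both runs execute nearly the same number of instructions, so the roughly 6x gap in cycles comes from a lower IPC rather than from extra instructions, which is consistent with each divide taking far longer than a multiply.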
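Using __restrict__
void fun(Pack1* restrict in, Pack2* restrict out);
__restrict__ (spelled restrict in C99) declares that, within its scope, this pointer is the only one used to access the memory it points to, so the compiler may assume that in and out do not alias.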
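A minimal sketch of what the qualifier promises, assuming C99 `restrict`; `Vec3` and `scale_all` are hypothetical names used only for illustration. Because `in` and `out` are declared not to alias, the compiler does not have to assume that a store to `out[i]` might change a later load from `in`, which leaves it freer to reorder or vectorize the loop.

```c
#include <stddef.h>

/* Hypothetical element type; the real Pack1/Pack2 layouts are not shown. */
typedef struct { float a, b, c; } Vec3;

/* The caller promises that in[0..n) and out[0..n) do not overlap. */
void scale_all(const Vec3 *restrict in, Vec3 *restrict out, size_t n, float k) {
    for (size_t i = 0; i < n; i++) {
        out[i].a = in[i].a * k;
        out[i].b = in[i].b * k;
        out[i].c = in[i].c * k;
    }
}
```

Passing overlapping buffers breaks that promise and the behavior is undefined, so the qualifier is only safe when the caller can guarantee separate input and output storage. In C++, which has no `restrict` keyword, GCC and Clang accept `__restrict__` instead.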
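Optimization results:
- Run time dropped to about 83% of the original (1621.12 ms vs 1953.13 ms)
- Instruction count dropped to about 75% of the original
- Cache misses were essentially unchanged
Comparison
Before optimization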
cpu cycles: 3864307714 cycles per loop: 38643.077
inst cnt: 5633595795 insts per loop: 56335.958
cache misses: 295589314
ipc: 1.458
perfile_monitor_test_fun() X. perf:1953.13ms
After optimization
cpu cycles: 3204657277 cycles per loop: 32046.573
inst cnt: 4199995460 insts per loop: 41999.955
cache misses: 295200231
ipc: 1.311
perfile_monitor_test_fun() X. perf:1621.12ms
Test platform
MediaTek (MTK) Dimensity 8000 (ARM)