Why Should Skia Multithreaded Rendering Be Used with Caution?

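Android renders with multiple threads by separating the UI thread from the render thread: the UI thread updates the UI, and the render thread executes the rendering commands. Multithreading, however, brings thread-safety and deadlock concerns, so Android builds in a number of safeguards and restrictions. UI updates are strictly confined to the UI thread; updating the UI from a worker thread throws CalledFromWrongThreadException. Rendering has no equivalent guard: Android does not throw an exception when rendering work runs off the render thread, so developers have to be all the more careful with multithreading there. This article walks through an AOSP bug the author ran into in practice, uses it to introduce Skia's single-owner principle, and analyzes how Google protects the thread safety of rendering.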

1. Introducing the Problem

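At work, the author ran into the following bug: on an Android 15 project, the Toutiao app (今日头条, com.ss.android.article.news) crashed unexpectedly during use. The stack trace is as follows: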

Timestamp: 2024-07-26 12:36:27.940326822+0800
Process uptime: 42103s
Cmdline: com.ss.android.article.news
pid: 22671, tid: 7833, name: FinalizerDaemon  >>> com.ss.android.article.news <<<
uid: 10271
tagged_addr_ctrl: 0000000000000001 (PR_TAGGED_ADDR_ENABLE)
pac_enabled_keys: 000000000000000f (PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY, PR_PAC_APDBKEY)
signal 5 (SIGTRAP), code 1 (TRAP_BRKPT), fault addr 0x00000071b0e243d8
x0  0000000000000001  x1  00000070844b1db0  x2  0000000000000000  x3  0000000000000010
x4  0000000000000000  x5  000000700468f87c  x6  0000000000000000  x7  0000000000000015
x8  00000070c4157030  x9  0000000000000109  x10 0000000000000108  x11 0000000000000000
x12 0000000000000000  x13 0000000000000001  x14 00000071b1094140  x15 0000000000000000
x16 000000718c7f09e8  x17 00000071a7c856a0  x18 0000006e611cc000  x19 00000070844b1d90
x20 0000006fd40ec7e0  x21 00000070c4157030  x22 0000000000000005  x23 00000070844b1d90
x24 0000000000000080  x25 0000006eb819da80  x26 00000070844b1d90  x27 000000001be15128
x28 000000001be1fa48  x29 0000006eb819cef0
lr  00000071b0e241f0  sp  0000006eb819cef0  pc  00000071b0e243d8  pst 0000000080001000
backtrace:
#00 pc 00000000005a83d8  /system/lib64/libhwui.so (GrResourceCache::notifyARefCntReachedZero(GrGpuResource*, GrIORef<GrGpuResource>::LastRemovedRef)+600) (BuildId: 980129c5e4915e3ceb5d01b192babccc)
#01 pc 00000000006aa788  /system/lib64/libhwui.so (GrVkGpu::setBackendSurfaceState(GrVkImageInfo, sk_sp<skgpu::MutableTextureState>, SkISize, VkImageLayout, unsigned int, skgpu::MutableTextureState*, sk_sp<skgpu::RefCntedCallback>)+488) (BuildId: 980129c5e4915e3ceb5d01b192babccc)
#02 pc 00000000006aa914  /system/lib64/libhwui.so (GrVkGpu::setBackendTextureState(GrBackendTexture const&, skgpu::MutableTextureState const&, skgpu::MutableTextureState*, sk_sp<skgpu::RefCntedCallback>)+228) (BuildId: 980129c5e4915e3ceb5d01b192babccc)
#03 pc 0000000000579d1c  /system/lib64/libhwui.so (GrDirectContext::setBackendTextureState(GrBackendTexture const&, skgpu::MutableTextureState const&, skgpu::MutableTextureState*, void (*)(void*), void*)+172) (BuildId: 980129c5e4915e3ceb5d01b192babccc)
#04 pc 00000000002cf19c  /system/lib64/libhwui.so (android::uirenderer::AutoBackendTextureRelease::releaseQueueOwnership(GrDirectContext*)+140) (BuildId: 980129c5e4915e3ceb5d01b192babccc)
#05 pc 00000000002cf56c  /system/lib64/libhwui.so (android::uirenderer::DeferredLayerUpdater::destroyLayer()+172) (BuildId: 980129c5e4915e3ceb5d01b192babccc)
#06 pc 00000000002cf464  /system/lib64/libhwui.so (android::uirenderer::DeferredLayerUpdater::~DeferredLayerUpdater()+228) (BuildId: 980129c5e4915e3ceb5d01b192babccc)
#07 pc 00000000002cf620  /system/lib64/libhwui.so (android::uirenderer::DeferredLayerUpdater::~DeferredLayerUpdater()+16) (BuildId: 980129c5e4915e3ceb5d01b192babccc)
#08 pc 000000000021cc6c  /system/framework/arm64/boot-framework.oat (art_jni_trampoline+108) (BuildId: 0166d49e65fc85b28cdf29ede4f11c1e1a98801d)
#09 pc 000000000091a2d8  /system/framework/arm64/boot-framework.oat (com.android.internal.util.VirtualRefBasePtr.finalize+56) (BuildId: 0166d49e65fc85b28cdf29ede4f11c1e1a98801d)
#10 pc 0000000000046970  /system/framework/arm64/boot-core-libart.oat (java.lang.Daemons$FinalizerDaemon.doFinalize+256) (BuildId: 89643a79df2d3f43ea16ab23a7ae8124df1ed132)
#11 pc 0000000000046c6c  /system/framework/arm64/boot-core-libart.oat (java.lang.Daemons$FinalizerDaemon.processReference+476) (BuildId: 89643a79df2d3f43ea16ab23a7ae8124df1ed132)
#12 pc 0000000000046ddc  /system/framework/arm64/boot-core-libart.oat (java.lang.Daemons$FinalizerDaemon.runInternal+300) (BuildId: 89643a79df2d3f43ea16ab23a7ae8124df1ed132)
#13 pc 0000000000022cb4  /system/framework/arm64/boot-core-libart.oat (java.lang.Daemons$Daemon.run+116) (BuildId: 89643a79df2d3f43ea16ab23a7ae8124df1ed132)
#14 pc 0000000000156a10  /system/framework/arm64/boot.oat (java.lang.Thread.run+64) (BuildId: f7665d7512c7144313d5404175c53cdb217bc589)
#15 pc 0000000000210774  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+612) (BuildId: 36382edd0977b40d6f9b6d517638ae9c)
#16 pc 000000000048964c  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+156) (BuildId: 36382edd0977b40d6f9b6d517638ae9c)
#17 pc 00000000008b73e4  /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1348) (BuildId: 36382edd0977b40d6f9b6d517638ae9c)
#18 pc 00000000008b6e88  /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallbackWithUffdGc(void*)+8) (BuildId: 36382edd0977b40d6f9b6d517638ae9c)
#19 pc 00000000000707d8  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+200) (BuildId: c0fb87713601937b3b3993691d4e95dd)
#20 pc 0000000000061b50  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: c0fb87713601937b3b3993691d4e95dd)

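Judging from the crash report, the direct cause of the native exception (NE) is a SIGTRAP signal, which indicates a deliberate, anticipated abort rather than a random memory fault. The crashing thread is FinalizerDaemon, and the stack shows that it was releasing skgpu resources. GPU resources should generally not be touched from multiple threads, so skgpu resources are normally operated on only from the render thread. That made multithreaded access to skgpu resources the prime suspect. In local testing the problem was fairly likely to occur, reproducing reliably within a day of stress testing. The author and colleagues therefore added debug logging to hwui and Skia to confirm whether multiple threads were involved.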

08-13 13:53:36.220 17088 17094 D skia   : GRC(0x76a766ad30)::notify0 resource(0x761789cfe0) !wasDestoryed(true) NonPurgeValid(true)
08-13 13:53:36.220 17088 17094 D skia   : GRC(0x76a766ad30)::notify0 -> fPurgeableQueue.insert(0x761789cfe0)
08-13 13:53:36.220 17088 17094 D skia   : SkTDPQueue::insert Before.ArraySize 6 After.Append 7 setIndex 6
08-13 13:53:36.220 17088 17357 D skia   : GRC(0x76a766ad30)::notify0 resource(0x76a77a9138) !wasDestoryed(true) NonPurgeValid(true)
08-13 13:53:36.220 17088 17094 D skia   : GRC(0x76a766ad30)::removeResource -> fPurgeableQueue.remove(0x761789cfe0)
08-13 13:53:36.220 17088 17357 D skia   : GRC(0x76a766ad30)::notify0 -> fPurgeableQueue.insert(0x76a77a9138)
08-13 13:53:36.220 17088 17094 D skia   : GRC(0x76a766ad30)::notify0 resource(0x761789cfe0) !wasDestoryed(true) NonPurgeValid(true)
08-13 13:53:36.220 17088 17357 D skia   : SkTDPQueue::insert Before.ArraySize 6 After.Append 7 setIndex 6
08-13 13:53:36.220 17088 17094 D skia   : GRC(0x76a766ad30)::notify0 -> fPurgeableQueue.insert(0x761789cfe0)
08-13 13:53:36.220 17088 17357 D skia   : GRC(0x76a766ad30)::notify0 resource(0x7617886e10) !wasDestoryed(true) NonPurgeValid(true)
08-13 13:53:36.220 17088 17094 D skia   : SkTDPQueue::insert Before.ArraySize 7 After.Append 8 setIndex 7
08-13 13:53:36.220 17088 17094 D skia   : GRC(0x76a766ad30)::removeResource -> fPurgeableQueue.remove(0x761789cfe0)
08-13 13:53:36.220 17088 17357 D skia   : GRC(0x76a766ad30)::notify0 -> fPurgeableQueue.insert(0x7617886e10)
08-13 13:53:36.220 17088 17094 D skia   : GRC(0x76a766ad30)::notify0 resource(0x761789cfe0) !wasDestoryed(true) NonPurgeValid(true)
08-13 13:53:36.220 17088 17357 D skia   : SkTDPQueue::insert Before.ArraySize 7 After.Append 8 setIndex 7
08-13 13:53:36.220 17088 17094 D skia   : GRC(0x76a766ad30)::notify0 -> fPurgeableQueue.insert(0x761789cfe0)
08-13 13:53:36.220 17088 17357 D skia   : GRC(0x76a766ad30)::removeResource -> fPurgeableQueue.remove(0x7617886e10)
08-13 13:53:36.220 17088 17357 D skia   : SkTDPQueue::remove setIndex 7 curSize 8
08-13 13:53:36.221 17088 17094 D skia   : SkTDPQueue::insert Before.ArraySize 8 After.Append 9 setIndex 8
08-13 13:53:36.221 17088 17094 D skia   : Index(8) out of bounds for size 8
08-13 13:53:36.221 17088 17351 D skia   : GRC(0x76a766ad30)::removeResource -> fPurgeableQueue.remove(0x76a77a9138)

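The debug log captured above confirms the suspicion. It clearly shows thread 17357 (RenderThread) and thread 17094 (FinalizerDaemon) both operating on the GrResourceCache at address 0x76a766ad30, while resource 0x761789cfe0 is repeatedly inserted into and removed from its purgeable queue. Because the two threads modify that queue concurrently, removals included, its bookkeeping goes out of sync and an array access ends up out of bounds ("Index(8) out of bounds for size 8"). The root cause of the NE is therefore confirmed: FinalizerDaemon and RenderThread operate on the same skgpu resources at the same time.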

2. Analyzing the NE

In release builds, the only thing in Skia that raises SIGTRAP is the SkUNREACHABLE macro.

#if !defined(SkUNREACHABLE)
#  if defined(_MSC_VER) && !defined(__clang__)
#    include <intrin.h>
#    define FAST_FAIL_INVALID_ARG                 5
// See https://developercommunity.visualstudio.com/content/problem/1128631/code-flow-doesnt-see-noreturn-with-extern-c.html
// for why this is wrapped. Hopefully removable after msvc++ 19.27 is no longer supported.
[[noreturn]] static inline void sk_fast_fail() { __fastfail(FAST_FAIL_INVALID_ARG); }
#    define SkUNREACHABLE sk_fast_fail()
#  else
#    define SkUNREACHABLE __builtin_trap()
#  endif
#endif

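SkUNREACHABLE is used in many places, but combined with the debug log from Section 1 we can confirm that it was triggered by the bounds check performed when an array element is accessed.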

class SkTDPQueue {
    // ...
    /** Random access removal. This requires that the INDEX function is non-nullptr. */
    void remove(T entry) {
        SkASSERT(nullptr != INDEX);
        int index = *INDEX(entry);
        SkASSERT(index >= 0 && index < fArray.size());
        this->validate();
        SkDEBUGCODE(*INDEX(fArray[index]) = -1;)
        if (index == fArray.size() - 1) {
            fArray.pop_back();
            return;
        }
        fArray[index] = fArray[fArray.size() - 1];
        fArray.pop_back();
        this->setIndex(index);
        this->percolateUpOrDown(index);
        this->validate();
    }
    // ...
    SkTDArray<T> fArray;
};

template <typename T> class SkTDArray {
    // ...

    T& operator[](int index) {
        return this->data()[sk_collection_check_bounds(index, this->size())];
    }
    const T& operator[](int index) const {
        return this->data()[sk_collection_check_bounds(index, this->size())];
    }

    // ...
};

template <typename T> SK_API inline T sk_collection_check_bounds(T i, T size) {
    if (0 <= i && i < size) SK_LIKELY {
        return i;
    }

    SK_UNLIKELY {
        #if defined(SK_DEBUG)
            sk_print_index_out_of_bounds(static_cast<size_t>(i), static_cast<size_t>(size));
        #else
            SkUNREACHABLE;
        #endif
    }
}

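Following the call chain further up, GrResourceCache::removeResource removes the corresponding resource from fPurgeableQueue. If the render thread has already released that resource, a thread such as FinalizerDaemon, which runs finalizers on behalf of the GC, hits an out-of-bounds array access, and that produces the NE.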
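To make this failure mode concrete, the following is a minimal sketch that replays the unlucky ordering by hand on a single thread. `Resource`, `IndexedQueue`, and `replayDoubleRemove` are hypothetical stand-ins invented for illustration, not Skia types, and the bounds check returns `false` at the point where a release build of the real code would trap. The removed entry's index is deliberately left untouched, on the assumption (taken from the excerpt above) that the reset is wrapped in `SkDEBUGCODE` and so only happens in debug builds.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a GPU resource that remembers its queue slot.
struct Resource {
    int queueIndex = -1;
};

// Hypothetical index-tracking queue (illustration only, not SkTDPQueue).
class IndexedQueue {
public:
    void insert(Resource* r) {
        fArray.push_back(r);
        r->queueIndex = static_cast<int>(fArray.size()) - 1;
    }

    // Returns false where a release build of the real code would trap.
    bool remove(Resource* r) {
        const int index = r->queueIndex;
        if (index < 0 || index >= static_cast<int>(fArray.size())) {
            return false;  // stale index: out of bounds
        }
        // Swap-with-last removal, as in the excerpt above.
        fArray[index] = fArray.back();
        fArray[index]->queueIndex = index;
        fArray.pop_back();
        // r->queueIndex is not reset here, so a second removal of the same
        // resource sees a stale index.
        return true;
    }

private:
    std::vector<Resource*> fArray;
};

// Replays an ordering in which two threads both decide to remove `b`.
// Returns true if the first removal succeeds and the second one is rejected.
bool replayDoubleRemove() {
    Resource a, b;
    IndexedQueue queue;
    queue.insert(&a);
    queue.insert(&b);                       // b.queueIndex == 1, size == 2
    const bool first = queue.remove(&b);    // "render thread": size becomes 1
    const bool second = queue.remove(&b);   // "finalizer": index 1, size 1
    return first && !second;
}
```

In a real race the two removals overlap rather than run back to back, so the outcome can also be silent corruption instead of a clean bounds failure; the sketch shows only the simplest ordering.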

3. Skia's Single-Owner Assertion

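Unlike release builds, debug builds of Skia implement a single-owner assertion: if a guarded function is entered from a thread other than the current owner, the assertion fails. This protects the thread safety of core functions.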

3.1 Which operations are protected

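In practice, the vast majority of Skia functions need no single-owner protection; the ones that do are mostly those that operate on GPU resources directly. Taking Context as an example, only functions such as makeRecorder and insertRecording are protected. Each protected function invokes the ASSERT_SINGLE_OWNER macro to enforce the protection.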

std::unique_ptr<Recorder> Context::makeRecorder(const RecorderOptions& options) {
    ASSERT_SINGLE_OWNER

    auto recorder = std::unique_ptr<Recorder>(new Recorder(fSharedContext, options));
#if defined(GRAPHITE_TEST_UTILS)
    if (fStoreContextRefInRecorder) {
        recorder->priv().setContext(this);
    }
#endif
    return recorder;
}

bool Context::insertRecording(const InsertRecordingInfo& info) {
    ASSERT_SINGLE_OWNER

    return fQueueManager->addRecording(info, this);
}

The ASSERT_SINGLE_OWNER macro is defined as follows:

#define ASSERT_SINGLE_OWNER SKGPU_ASSERT_SINGLE_OWNER(this->singleOwner())

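ASSERT_SINGLE_OWNER invokes SKGPU_ASSERT_SINGLE_OWNER and passes it this->singleOwner(). The singleOwner function returns the address of Context#fSingleOwner, that is, a pointer of type SingleOwner*. As the class name suggests, SingleOwner is the class that implements the single-owner assertion in Skia.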

class SK_API Context final {
    // ...
    SingleOwner* singleOwner() const { return &fSingleOwner; }
    // ...
    // In debug builds we guard against improper thread handling. This guard is passed to the
    // ResourceCache for the Context.
    mutable SingleOwner fSingleOwner;
    // ...
};

3.2 How the single-owner assertion is implemented: the `SingleOwner` class

3.2.1 Definition of the `SingleOwner` class

// This is a debug tool to verify an object is only being used from one thread at a time.
class SingleOwner {
public:
     SingleOwner() : fOwner(kIllegalThreadID), fReentranceCount(0) {}

     // ...

private:
     void enter(const char* file, int line) {
         SkAutoMutexExclusive lock(fMutex);
         SkThreadID self = SkGetThreadID();
         SkASSERTF(fOwner == self || fOwner == kIllegalThreadID, "%s:%d Single owner failure.",
                   file, line);
         fReentranceCount++;
         fOwner = self;
     }

     void exit(const char* file, int line) {
         SkAutoMutexExclusive lock(fMutex);
         SkASSERTF(fOwner == SkGetThreadID(), "%s:%d Single owner failure.", file, line);
         fReentranceCount--;
         if (fReentranceCount == 0) {
             fOwner = kIllegalThreadID;
         }
     }

     SkMutex fMutex;
     SkThreadID fOwner    SK_GUARDED_BY(fMutex);
     int fReentranceCount SK_GUARDED_BY(fMutex);
};

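The definition of SingleOwner is fairly simple. It has only three member variables, fMutex, fOwner, and fReentranceCount, and, besides the constructor, two member functions: enter and exit.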

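fOwner identifies the thread that currently owns the object, and fReentranceCount counts how many times that thread has re-entered; both are initialized in the constructor.
enter and exit represent entering and leaving a guarded operation, and together they detect improper multithreaded use. In enter, the code asserts that the current thread ID equals fOwner or that fOwner is kIllegalThreadID (no owner yet); if neither holds, the assertion fails. It then increments fReentranceCount and sets fOwner to the current thread ID. In exit, the code asserts that the current thread ID equals fOwner and decrements fReentranceCount; if the count has reached zero, it resets fOwner to kIllegalThreadID, indicating that the current thread has finished using the GPU resource.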
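The enter/exit bookkeeping described above can be sketched with the standard library alone. This is only an approximation under stated assumptions: `OwnerTracker` and `demoOwnerTracker` are hypothetical names, a default-constructed `std::thread::id` plays the role of kIllegalThreadID, and the functions return `false` where the real class would fail an `SkASSERTF`.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical re-creation of the idea; not Skia's SingleOwner.
class OwnerTracker {
public:
    // Returns false where the real class would fail an assertion.
    bool enter() {
        std::lock_guard<std::mutex> lock(fMutex);
        const std::thread::id self = std::this_thread::get_id();
        if (fOwner != self && fOwner != kNoOwner) {
            return false;  // another thread currently owns the object
        }
        ++fReentranceCount;
        fOwner = self;
        return true;
    }

    bool exit() {
        std::lock_guard<std::mutex> lock(fMutex);
        if (fOwner != std::this_thread::get_id()) {
            return false;  // exiting from a thread that is not the owner
        }
        if (--fReentranceCount == 0) {
            fOwner = kNoOwner;  // the last exit releases ownership
        }
        return true;
    }

private:
    // A default-constructed std::thread::id identifies no thread at all.
    const std::thread::id kNoOwner{};
    std::mutex fMutex;
    std::thread::id fOwner{};
    int fReentranceCount = 0;
};

// Re-entry from the owner succeeds, entry from a second thread is rejected
// until the owner has fully exited.
bool demoOwnerTracker() {
    OwnerTracker tracker;
    const bool entered = tracker.enter();
    const bool reentered = tracker.enter();

    bool intruderEntered = true;
    std::thread intruder([&] { intruderEntered = tracker.enter(); });
    intruder.join();

    const bool exited = tracker.exit() && tracker.exit();

    bool lateEntered = false;
    std::thread late([&] { lateEntered = tracker.enter() && tracker.exit(); });
    late.join();

    return entered && reentered && !intruderEntered && exited && lateEntered;
}
```

Note that the mutex only protects the bookkeeping itself; it does not serialize the guarded work. The class detects concurrent use, it does not prevent it.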

3.2.2 Using `SingleOwner`

Taking Context as the example again: Section 3.1 showed that ASSERT_SINGLE_OWNER invokes SKGPU_ASSERT_SINGLE_OWNER with a pointer to Context#fSingleOwner. SKGPU_ASSERT_SINGLE_OWNER is also defined in SingleOwner.h, and it creates a skgpu::SingleOwner::AutoEnforce object.

#define SKGPU_ASSERT_SINGLE_OWNER(obj) \
    skgpu::SingleOwner::AutoEnforce debug_SingleOwner(obj, __FILE__, __LINE__);

skgpu::SingleOwner::AutoEnforce is a struct nested inside SingleOwner. Its constructor calls SingleOwner#enter, and its destructor calls SingleOwner#exit.

// This is a debug tool to verify an object is only being used from one thread at a time.
class SingleOwner {
     // ...
     struct AutoEnforce {
         AutoEnforce(SingleOwner* so, const char* file, int line)
                : fFile(file), fLine(line), fSO(so) {
             fSO->enter(file, line);
         }
         ~AutoEnforce() { fSO->exit(fFile, fLine); }

         const char* fFile;
         int fLine;
         SingleOwner* fSO;
     };
     // ...
};

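By creating and destroying an AutoEnforce object, SKGPU_ASSERT_SINGLE_OWNER calls SingleOwner's enter and exit automatically. This guarantees that there is only a single owner within the guarded scope, which protects the thread safety of the function.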
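The scope-based pairing of enter and exit is the usual RAII idiom. A minimal sketch follows, assuming a hypothetical `ScopeTracker` that only counts active scopes; `AutoScope`, `GUARD_SCOPE`, `guardedInner`, and `guardedOuter` are likewise illustrative names, not Skia's.

```cpp
#include <cassert>

// Hypothetical tracker that only counts how many guarded scopes are active.
struct ScopeTracker {
    int depth = 0;
    void enter() { ++depth; }
    void exit() { --depth; }
};

// RAII guard: enter() in the constructor, exit() in the destructor.
class AutoScope {
public:
    explicit AutoScope(ScopeTracker* tracker) : fTracker(tracker) {
        fTracker->enter();
    }
    ~AutoScope() { fTracker->exit(); }

    AutoScope(const AutoScope&) = delete;
    AutoScope& operator=(const AutoScope&) = delete;

private:
    ScopeTracker* fTracker;
};

// Hypothetical macro with the same shape as SKGPU_ASSERT_SINGLE_OWNER:
// it declares a guard whose lifetime is the enclosing scope.
#define GUARD_SCOPE(tracker) AutoScope guard_scope_instance(tracker)

// Returns the nesting depth observed inside the innermost guarded scope.
int guardedInner(ScopeTracker* tracker) {
    GUARD_SCOPE(tracker);
    return tracker->depth;  // the guard unwinds after the value is read
}

int guardedOuter(ScopeTracker* tracker) {
    GUARD_SCOPE(tracker);
    return guardedInner(tracker);  // nested guarded call on the same thread
}
```

Because the destructor runs on every exit path, including early returns and exceptions, a guarded function cannot forget to call exit. The nested call also shows why a reentrance count is needed rather than a simple flag.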

3.3 Why do release builds not implement the single-owner assertion?

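If release builds also implemented the single-owner assertion, the failure message printed by SingleOwner would have made the cause of the problem in Section 1 easy to pinpoint. The reason Google does not enable it in release builds is that GPU resources are operated on very frequently, and checking for a single owner on every call would add substantial overhead. For performance reasons that overhead is not acceptable, so it is left to developers to ensure thread safety themselves.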
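A common way to get this behaviour is to let the guard macro expand to nothing outside debug builds, so the hot path pays no cost at all. The sketch below illustrates that pattern under assumptions: `SKETCH_DEBUG`, `SKETCH_ASSERT_SINGLE_OWNER`, `sketchCheckOwner`, and `sumWithChecks` are hypothetical names, and the counter stands in for the real check, which per the excerpt in Section 3.2.1 takes a mutex and queries the thread ID on every call.

```cpp
#include <cassert>

// Counts how many times the (stand-in) ownership check actually ran.
int gOwnerChecksRun = 0;
inline void sketchCheckOwner() { ++gOwnerChecksRun; }

// Hypothetical build switch: define SKETCH_DEBUG to compile the checks in.
#if defined(SKETCH_DEBUG)
#  define SKETCH_ASSERT_SINGLE_OWNER() sketchCheckOwner()
#else
#  define SKETCH_ASSERT_SINGLE_OWNER() ((void)0)
#endif

// A stand-in for a frequently called GPU-resource operation.
int sumWithChecks(int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        SKETCH_ASSERT_SINGLE_OWNER();  // vanishes unless SKETCH_DEBUG is set
        sum += i;
    }
    return sum;
}
```

Compiled without `SKETCH_DEBUG`, the loop contains no trace of the check; compiled with it, the check runs once per iteration. The trade-off is the one the article describes: misuse goes undetected in release builds until it corrupts state.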

4. Fixing the Problem

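We reported the problem from Section 1 to Google. Google soon replied that they had confirmed the issue and were verifying a fix. The final change landed as commit 255ee52dd9254ba8bbe68bef0c1182aae91dbf41 in platform/frameworks/base. The approach is straightforward: it checks whether the current thread is the render thread, which ensures that the operation releasing skgpu resources is thread-safe.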
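The general shape of such a check can be sketched as follows. This is not the code from the commit: `RenderQueue`, `runOrPost`, `drain`, and `demoRelease` are hypothetical, and what to do when the caller is not on the render thread is a design choice. Here the work is deferred to the render thread, which is an assumption made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a render thread's task queue (not the hwui API).
class RenderQueue {
public:
    explicit RenderQueue(std::thread::id renderThread)
            : fRenderThread(renderThread) {}

    bool isCurrent() const {
        return std::this_thread::get_id() == fRenderThread;
    }

    // Runs the task right away on the render thread, otherwise defers it.
    void runOrPost(std::function<void()> task) {
        if (isCurrent()) {
            task();
            return;
        }
        std::lock_guard<std::mutex> lock(fMutex);
        fPending.push_back(std::move(task));
    }

    // Called on the render thread: runs deferred tasks, returns their count.
    std::size_t drain() {
        std::vector<std::function<void()>> tasks;
        {
            std::lock_guard<std::mutex> lock(fMutex);
            tasks.swap(fPending);
        }
        for (auto& task : tasks) {
            task();
        }
        return tasks.size();
    }

private:
    const std::thread::id fRenderThread;
    std::mutex fMutex;
    std::vector<std::function<void()>> fPending;
};

// The calling thread plays the render thread; a second thread plays the
// finalizer that wants to release a GPU resource.
bool demoRelease() {
    const std::thread::id renderThread = std::this_thread::get_id();
    RenderQueue queue(renderThread);
    std::thread::id releasedOn;  // default value: not released yet

    std::thread finalizer([&] {
        queue.runOrPost([&] { releasedOn = std::this_thread::get_id(); });
    });
    finalizer.join();
    const bool deferred = (releasedOn == std::thread::id());

    const std::size_t ran = queue.drain();
    return deferred && ran == 1 && releasedOn == renderThread;
}
```

Either way, the release itself only ever executes on the single thread that owns the GPU context, which is exactly what the single-owner principle requires.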

Summary

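With the AOSP bug above described and analyzed, readers should have a deeper understanding of how to use Skia in a multithreaded rendering setup. Multithreading is powerful and convenient, but GPU code places high demands on thread safety, so when rendering with Skia across threads we must strictly follow the single-owner principle and avoid improper multithreaded access that crashes the program.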
