Fiber, Coroutine and Thread

Fiber, a lesser-known concept compared with coroutine, is a pretty powerful addition to cooperative multitasking. As a graphics programmer in the game industry, I totally appreciate the great flexibility that fibers bring to the table. As a matter of fact, I feel the tech is a little bit underappreciated due to the lack of sufficient public materials talking about fibers.

In this blog post, I will put down some of my learnings about fiber basics. Everything mentioned in this post is specifically about C++, even though similar concepts exist in other languages as well. Readers who have zero knowledge about fibers will learn what they are and how we can take advantage of them in graphics rendering.

Multitasking

As we all know, modern games commonly require quite some processing power so that they react to gamers in a timely manner. Quite a while ago, performance improvements in single-core CPUs stopped keeping up with the sheer computational demand. To no one's surprise, the industry shifted from single-threaded game engines to multithreaded engines more than a decade ago. Multithreading has become an essential part of game development. It is also a very mature technology that is well supported and optimized by major operating systems.

With multithreading, it is necessary to split the computation of a frame into multiple subtasks depending on their characteristics so that each thread gets something to do. Some tasks process physics, some process rendering, and the list goes on. Introducing some dependencies between the tasks is inevitable, and respecting those dependencies requires careful synchronization. In order to correctly manage the tasks, job systems (sometimes called task systems) are introduced in game engines. They are used to harness the multi-core power provided by CPUs.

Limitation of Preemptive Multitasking

It is not uncommon to see more active threads 'running' than the number of physical cores. In order to give users an illusion of multitasking, operating systems commonly execute multiple tasks in an interleaved manner. Each thread gets a fraction of the physical cores' timeline. As long as the frequency of swapping threads is high enough, users will feel like they are all running at the same time. This model is what is commonly known as preemptive multitasking[1].

Even though this model works fairly well for the majority of applications, game development is one of the rare domains where developers try to squeeze every last bit of performance out of the target platform. There is some loss of flexibility in preemptive multitasking that can be annoying sometimes. Specifically, the following facts about preemptive multitasking are what bother game developers.

  • Context switches happen at a frequency that is not directly controlled by developers. This is exactly how the OS supports multitasking in the first place. However, it is not cheap, as it requires a trip into the kernel. And whenever it is not needed, it can be considered a waste of resources.
    In order to understand better where this waste comes from, imagine we have 10 tasks running in the first 2 ms of a frame on a machine with only 4 physical cores. Do we really need to create the illusion that the 10 tasks are all running at the same time? The truth is that we don't; all we care about is that within the time limit of a frame, all the tasks in that frame are finished in a correct order. However, this is not something the OS knows; having to preserve this illusion is the root cause of the waste.
  • The thread scheduler is highly operating system dependent. Whenever an active thread is about to be put on hold, it is the OS that decides which thread gets to take over the physical core for the next time window. Even though most OS interfaces offer some level of control, like thread priority, the scheduling algorithm is opaque to programmers. And this can be problematic from time to time.
    Again, the scheduler has no prior knowledge about the game engine. It has to treat our program as no different from a generic application. So the next thread to be run may not always match developers' expectations.

To some degree, we can think of preemptive multithreading as virtual threads fighting for physical resources. No thread is in full control in this game, since the scheduler can preemptively pause a thread at any time. And this clearly comes with some cost and uncertainty.

If these are within tolerance, what further pushed some game studios to move towards a more efficient system design is the problematic case where a task needs to wait for inputs that are not yet produced by another task.

  • An impractical solution would be to schedule more threads than the number of physical cores and yield a thread's control whenever it waits for something, so that the OS can schedule another thread on the core. This may sound fine. Unfortunately, it has flaws.

    • First of all, the OS has no idea when the inputs are ready; it will keep putting this thread back on a physical core to attempt to resume it from time to time. This is clearly very inefficient, since until the input is ready, all those attempts are useless effort that only wastes hardware resources.

    • If this doesn't sound scary enough, an even worse situation is that the system could end up in a deadlock state if all threads in the thread pool are waiting for something that is not ready. This will eventually hang the game, since pending tasks can't find a thread to run on and all threads are yielding due to the lack of inputs. To solve this problem:

      • One option is to create a new thread in such a case, to make sure pending tasks can always find a thread to run on. Putting aside the fact that creating a thread is not cost-free, this would not only increase the complexity of the system but also introduce more threads, meaning potentially more preemption in the future.
      • A different approach is the job nesting system, which, rather than putting the current job on hold when waiting for something to be ready, grabs another job and executes it on top of the existing call stack. A big problem with this solution is that the tasks' call stacks can pile on top of each other, and there is no way for the job at the bottom of the stack to finish before the one on top finishes first.

      There are other ways to tackle the problem. However, the ideal solution is to yield when a task needs to, without paying a significant cost. Unfortunately, with preemptive multitasking, this can't be easily achieved.

  • An alternative solution would be to split the task in two at the boundary of the wait. Even though this sounds like a more practical solution, it is more of a last resort that is very unlikely to scale: as such cases show up more frequently, they force us to create more and more tasks.

If there is anything to blame, it is that preemptive multitasking doesn't allow tasks to yield themselves during execution.

Cooperative Multitasking

Cooperative multitasking is different in that it allows the programmer to take over the scheduling, rather than handing it over to the OS. As its name implies, it allows different tasks to work with each other cooperatively. This means tasks only give up control when they yield, and other tasks choose to trust that the running task will hand over control at a reasonable point. With this trust established, there is no need to preemptively interrupt a running task without its permission, like the OS does with preemptive multitasking. Rather than tasks fighting for hardware resources, cooperative multitasking is more like tasks happily working together like a family.

With this design, each task carries more responsibility: if it doesn't give control to others, the others will not get any control at all. The need to yield control to other tasks requires a subroutine that can yield whenever it wants to. Such a subroutine is commonly viewed as a more generalized version of a regular subroutine; it is called a coroutine. It is the most commonly known construct that allows us to program a thread so that it works cooperatively.

Of course, even if a program is built entirely on cooperative multitasking, it doesn't mean that no preemption will happen, since the OS still needs to schedule other background applications, such as email, to run on the shared physical cores from time to time. But minimizing the context switches within our own application already offers a lot of value.

Before we move forward, for those who are not familiar with the terms subroutine and coroutine, a quick explanation is as below.

  • A subroutine is something that can be invoked by a caller and that can return control back to the caller that invoked it. I believe all programmers will quickly realize that the concept of a function is a realization of a subroutine.
  • A coroutine has all the properties a subroutine has. In addition, it can suspend itself and return control to the caller. It can also resume at a later point and pick everything up at the suspension point, even on a totally different thread.

In order to keep the post short, it is assumed that readers have some basic coroutine knowledge. For readers who are not very familiar with coroutines in C++, here is an awesome talk from CppCon.

Basics about Fibers

Apart from the coroutine, the fiber is an interesting addition as a solution to cooperative multitasking. A fiber is quite a lightweight thread of execution. Like a coroutine, a fiber allows yielding at any point inside it. To some degree, we can regard a fiber as a form of stackful coroutine, which is not available in the C++ language. By not available, I mean there is no native language support for it. There are certainly libraries like Boost that support this kind of coroutine, or even fibers.

Don't be intimidated by its fancy name; a fiber really is just a mechanism that allows programmers to jump between different stacks without the regular return instruction. Since it offers the ability to jump between different call stacks, we can allocate our own stack memory and use it as our fiber stack.

Below is a diagram that demonstrates how fibers fit in a program.

We can see a few things from this diagram

  • User logic code can run on either a regular thread or a fiber, which itself runs on a thread.
  • Unlike threads, fiber scheduling is application specific, meaning the software developer needs to take on the responsibility of scheduling fibers. The OS will no longer help with this.
  • As hinted in the diagram, users can commonly create a lot more fibers than threads, since things like stack memory allocation are a lot more controllable by developers.
  • Though not shown in the diagram, a system with fibers commonly uses thread affinity to pin threads to dedicated cores for better performance, and the number of background threads doesn't need to be high. Commonly only a few low-priority threads are needed for blocking operations like IO.

There is a lot more to explore. We will cover these topics in the following chapters in more detail.

Difference between Fiber and Thread

Compared with fibers, threads are a much better known concept. Assuming all readers already have a solid understanding of threads, below are some very obvious differences between a thread and a fiber.

  • A thread can be scheduled by the operating system and runs on a physical core of the CPU, while a fiber can only run on a thread. We can think of a fiber as a higher-level concept, as it sits on top of a thread.
  • Both of them support multitasking. Threads support multitasking through the OS's preemptive scheduling. Fibers support multitasking by asking the programmer to take on the responsibility of scheduling them in a well-behaved manner.
  • A thread switch is a lot more complicated than a fiber switch. A thread switch requires the program to take a trip into kernel mode and can be a lot more expensive. A fiber switch is nothing but swapping the registers with a previously stored fiber context.
  • The memory for a thread's call stack is controlled by the OS, while the memory for a fiber stack can be explicitly controlled by programmers. This offers great flexibility, as programmers commonly have knowledge of the system they are building and can simplify things with those assumptions.
  • Thread-local storage is safe for threads, but it may not be 100% safe with fibers, as some systems choose to schedule a fiber on different threads between executions of the same fiber.
  • Synchronization primitives offered by the native operating system, like mutexes, will not work on fibers if fibers can resume on threads other than the one they were suspended on. We have to be extremely careful when using synchronization primitives within a fiber.

Difference between Fiber and C++ Coroutine

It is a lot easier to compare threads and fibers than to compare fibers and coroutines. Please be mindful that the coroutine concept we are talking about here is merely the C++ language-supported coroutine. We are not comparing fibers with any custom implementation of coroutines.

  • C++ coroutines are a language concept. As a matter of fact, C++ only started to support language-level coroutines in C++20. Fiber, on the other hand, is an OS-level concept, commonly provided through an OS library interface. Fibers can also be implemented by programmers themselves in assembly language. Later we will see how this can be achieved.
  • Coroutine functions are generally a bit more complicated to implement. There are a few concepts involved in the configuration, like the coroutine handle and the promise. Programmers have to either implement their own types or use types offered by a third-party library to implement a coroutine function. There is really no such concept as a fiber function; a regular function can yield inside a fiber without any special treatment.
  • The compiler silently adds more supporting code to implement a C++ coroutine. Such hidden code is not only compiler dependent but can also vary across platforms. Programmers have little control over how this code is generated. On the other hand, since fiber is not a language-level concept, the compiler does nothing special about it. There is good and bad in this difference:
    • One advantage of coroutines is that all variables get properly destructed when the coroutine ends. A coroutine can end by exiting the coroutine function at a co_return statement. The other way to end a coroutine's lifetime is to destroy the coroutine handle, even if the coroutine has not finished executing yet. Of course, only variables that were actually constructed during execution get destructed; variables never reached in the coroutine will not be destructed, as they never got constructed in the first place.
      Unfortunately, fibers can't do this. There is no easy way to track all the local variables in a fiber and properly destruct them all if the fiber gets suspended and then killed. It is the programmer's responsibility to make sure that all the local variables that need destruction are destructed at a proper time before killing a fiber.
      An interesting example is smart pointers. Smart pointers in C++ work by binding a heap allocation to a stack allocation. Since all stacks are gone when the program ends, we can be sure that all the heap allocations bound to smart pointers will be freed as well. However, such a mechanism will fail to protect you from heap memory leaks in the context of fibers. We will come back to this when we talk about fiber implementation later, to avoid confusing readers here.
    • Since C++ coroutines are a language-level concept, compilers are in a good position to optimize as much as possible. For example, the compiler can sometimes choose to inline a coroutine function, even making it disappear into thin air[3]. Such an optimization is clearly not possible with fibers. Later we will see that we have to do something to prevent the compiler from optimizing the code so that the fiber logic behaves as expected.
  • A coroutine's memory management is a bit more opaque than a fiber's: the size of the memory allocation is highly dependent on the compiler. For a fiber, programmers are required to allocate a piece of memory as the stack. It is up to programmers to decide how many bytes they need for the fiber's execution. Of course, programmers need to make sure that whatever runs on the fiber will not cause a stack overflow, by allocating enough memory for the fiber stack.
  • A coroutine can return a value, while a fiber doesn't allow the programmer to do so via a traditional return value.
  • We can finish a coroutine function by running its code through to the end, while we can't run to the end of a fiber entry function, as there is no proper return address for a fiber entry function.
  • C++ coroutines are asymmetric coroutines, which allow the coroutine to return control to the caller, and only the caller. There is no way for a coroutine function to yield control to another coroutine that was suspended before. There is a concept named symmetric coroutines that allows one coroutine to yield control to another. Fibers are symmetric by default: a fiber never actually returns to the caller code, it only yields to another fiber.
  • C++ coroutines are stackless, meaning they are only allowed to yield within the coroutine function itself. If your coroutine function calls another regular function, that function cannot yield control back to the caller of the coroutine function. A fiber does allow yielding control at any depth of the call stack.

Above are some of the major differences between a fiber and a C++ language-level coroutine. Among all these differences, the last two are almost deal breakers for a flexible job system implementation. Of course, there are certainly examples of using coroutines to implement a job system[4][5]; it is technically possible. But the flexibility offered by fibers is a lot more powerful than what coroutines offer. Naughty Dog's game engine's job system is a successful example of using fibers to parallelize a game engine[6].

Fiber Implementation

Understanding the details of a fiber implementation can be rewarding. Even though fibers offer great flexibility, the implementation of a fiber is nothing but a few hundred lines of code.

In this section, we will go through a detailed fiber implementation on the x64 architecture; a similar version working on the arm64 architecture is also provided in the form of source code.

Unlike a high-level feature implementation, a fiber implementation is a bit unusual and hacky. It requires programmers to have a solid understanding of how the CPU handles call stacks during program execution. So before we move forward with a detailed fiber implementation, we need to take a look at some basics of how CPUs handle the call stack on x64 and arm64 respectively.

The implementation details of fibers in assembly language on the arm64 architecture are quite similar to what needs to be done on x64. The biggest difference is that the register sets differ. So we will not repeat a similar process. Readers who are interested in that implementation can take a look at mine via the above link.

A fiber implementation highly depends on the ABI (Application Binary Interface). In this blog post, the fiber implementation on x64 is built upon the System V ABI. A different ABI requires a different fiber implementation.

Target Platform Architecture

Rather than going through everything, which is clearly not possible, I will only briefly mention what is related to fibers. And we will only spend our time on 64-bit programs, though it should work in a similar way in 32-bit programs.

In the following two sections, we will uncover the mystery of how the CPU handles call stacks under the hood. Here is the high-level program that we will look at. I intentionally keep the program extremely simple and meaningless so that we can focus on the call stack rather than being distracted by something else.

```cpp
int Interface(int g) {
    int k = g * g;
    return k * k;
}

int main(int argc, char** argv) {
    int a = Interface(argc);
    return a;
}
```

x64 Architecture

There are in total 16 64-bit general purpose registers in the modern x64 CPU architecture: RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, and R8 to R15. Apart from the general purpose registers, there is also a special register named RIP, the instruction pointer, which tells the CPU what to execute next.

Besides these registers, there are certainly more. For example, all x86-64 compatible processors support SSE2 and include 16 128-bit SIMD registers, XMM0 to XMM15. It is fairly common to see AVX SIMD as well, which is provided through 16 256-bit registers, YMM0 to YMM15. Furthermore, there is AVX-512, an extension that allows the CPU to process 16 32-bit number operations at the same time. CPUs that support it have 32 registers (ZMM0 to ZMM31), each of which is 512 bits long.

It is very common that we store something in a register and fetch the value from this register later. However, it is not uncommon that we need to change the value of the register between the two instructions, especially if there are function calls in between. In order to make sure that by the time the register is read the value has not been overwritten, the value has to be stored somewhere (commonly on the stack) before it is changed. A few registers are callee-saved, which means that the callee needs to save them before touching them and restore their values before leaving the function, so that the caller does not even notice the registers were changed. Under the System V ABI, these registers are RBX, RBP, and R12 to R15. All other registers are caller-saved, which means the opposite: the callee can change their values any time it wants, and it is the caller's responsibility to save any values it needs to survive the call.

Next, let's take a look at the assembly code produced by the g++ compiler below. Please note that in order to understand how the CPU works with its registers to support the stack properly, I have to disable compiler optimization, as otherwise the compiler may choose to avoid using the rbp register so that it can be used as another general purpose register. Sometimes it even inlines the whole function without a jump at all. Here is the assembly code generated by g++ 7.5.0 on Ubuntu:

```asm
0x400667 <Interface(int)>                 push   %rbp
0x400668 <Interface(int)+1>               mov    %rsp,%rbp
0x40066b <Interface(int)+4>               mov    %edi,-0x14(%rbp)
0x40066e <Interface(int)+7>               mov    -0x14(%rbp),%eax
0x400671 <Interface(int)+10>              imul   -0x14(%rbp),%eax
0x400675 <Interface(int)+14>              mov    %eax,-0x4(%rbp)
0x400678 <Interface(int)+17>              mov    -0x4(%rbp),%eax
0x40067b <Interface(int)+20>              imul   -0x4(%rbp),%eax
0x40067f <Interface(int)+24>              pop    %rbp
0x400680 <Interface(int)+25>              retq
0x400681 <main(int, char**)>              push   %rbp
0x400682 <main(int, char**)+1>            mov    %rsp,%rbp
0x400685 <main(int, char**)+4>            sub    $0x20,%rsp
0x400689 <main(int, char**)+8>            mov    %edi,-0x14(%rbp)
0x40068c <main(int, char**)+11>           mov    %rsi,-0x20(%rbp)
0x400690 <main(int, char**)+15>           mov    -0x14(%rbp),%eax
0x400693 <main(int, char**)+18>           mov    %eax,%edi
0x400695 <main(int, char**)+20>           callq  0x400667 <Interface(int)>
0x40069a <main(int, char**)+25>           mov    %eax,-0x4(%rbp)
0x40069d <main(int, char**)+28>           mov    -0x4(%rbp),%eax
0x4006a0 <main(int, char**)+31>           leaveq
0x4006a1 <main(int, char**)+32>           retq
```

What is shown above is the assembly code for the two functions in the above C++ program. Please be mindful that the same program may produce very different assembly instructions with different compilers, but the general structure of the program should be similar. Rather than going through every single instruction, we will only focus on those relevant to our topic. Below is what happens when this program is executed in order, with some irrelevant instructions skipped.

  • At the very beginning, the RIP's value is 0x400681, which means that the next instruction is the push %rbp that is located at the begining of main function.
    一开始, RIP 的值为0x400681,这意味着下一条指令是位于main函数开头的 push %rbp
  • Starting from 0x400681, the first thing the CPU does is store the RBP register on the stack. This push operation also decreases the value of RSP by 8. RSP is the stack pointer, which points to the current top address of the stack. Note that on the x86/x64 architecture, the stack address goes downward as the stack grows. In this case, since RBP is a 64-bit register, the stack pointer (RSP) needs to decrease by 8 so that the stored RBP won't be overwritten by anything else. Of course, after this instruction is executed, the RIP is bumped as well so that the CPU knows what to execute next.
    从0x400681开始,CPU做的第一件事就是将 RBP 寄存器存储到堆栈内存中。这个入栈操作也会将 RSP 的值减8。 RSP 是指向当前栈顶地址的栈指针。请注意,在 x86/x64 架构上,堆栈地址随着堆栈增长而向下移动。在这种情况下,由于 RBP 是 64 位寄存器,因此堆栈指针 ( RSP ) 需要减 8,以避免存储的 RBP 被其他内容覆盖。当然,这条指令执行后, RIP 也会随之递增,以便CPU知道下一步要执行什么。
  • The next thing it does is to set the current stack pointer's value to the register of RBP. This will give us some insight of what RBP is. It is not difficult to see that the RBP register keeps the value of the bottom of the current function's stack frame.
    它所做的下一步是将当前堆栈指针的值设置为 RBP 的寄存器。这将使我们对 RBP 是什么有一些了解。不难看出, RBP 寄存器保存的是当前函数栈帧底部的值。
  • Skipping some instructions that are not relevant, let's take a look at instruction 0x400695. This is the instruction that invokes the Interface function. What the CPU does here is first push the next instruction's address, 0x40069a, onto the stack. Of course, since it is a push, RSP will be decreased again to keep the pushed value safe. This is what is commonly known as the return address. Specifically, it is the next instruction for the CPU to execute after finishing the callee function, which is Interface in our case.
    跳过一些不相关的指令,我们来看看指令0x400695。这是调用 Interface 函数的指令。 CPU 在这里所做的是,它首先将下一条指令的地址(0x40069a)压入堆栈。当然,既然是push,那么 RSP 会再次减少,以保证push的值安全。这就是通常所说的返回地址。具体来说,它是 CPU 在完成被调用函数后要执行的下一条指令,在我们的例子中是 Interface
    Of course, besides pushing the next instruction's address onto the stack, the CPU needs to execute the first instruction in the Interface function. This is achieved by changing the RIP register to 0x400667 so that the CPU knows what to execute next.
    当然,除了将下一条指令的地址压入堆栈外,CPU还需要执行 Interface 函数中的第一条指令。这是通过将 RIP 寄存器更改为 0x400667 来实现的,以便 CPU 知道下一步要执行什么。
  • Looking at the first instruction in the function, at the address 0x400667, this immediately reminds us of what is done at the beginning of the main function: pushing RBP's value onto the stack. At this point, we know what RBP's value is; it is the base address of main's call stack frame. Because we are currently in the Interface function, we need to make sure RBP's value is the bottom of the Interface function's stack frame, rather than main function's stack frame. To do so, we simply move the value of RSP into RBP. However, the bottom of main's stack frame would then be lost. We mentioned above that RBP is a callee-saved register, which means the callee (Interface) is responsible for making sure that, from the caller's perspective, RBP appears unchanged. To do so, we simply push RBP onto the stack before assigning a new value to it. And this is exactly what this instruction is all about.
    查看函数中的第一条指令,即地址 0x400667,这立即提醒我们在主函数的开头做了什么,将 RBP 的值压入堆栈。此时,我们知道 RBP 的值是什么,它是 main 的调用堆栈帧的基地址。因为我们当前位于 Interface 函数中,所以我们需要确保 RBP 的值是 Interface 函数堆栈帧的底部,而不是 main 函数的堆栈框架。为此,我们只需将 RSP 的值移动到 RBP 。然而, main 函数的栈帧底部信息将会丢失。我们上面提到 RBP 是被调用者保存的寄存器,这意味着被调用者( Interface )有责任确保 RBP 没有改变从调用者的角度来看。为此,我们只需在为 RBP 分配新值之前将 RBP 压入堆栈即可。这正是这条线的意义所在。
  • The next instruction 0x400668 is quite similar to the 0x400682 instruction we visited before. Its sole purpose is to make sure RBP keeps the value of the bottom of the current function's stack frame.
    下一条指令 0x400668 与我们之前访问过的 0x400682 指令非常相似。它的唯一目的是确保 RBP 保留当前函数堆栈帧底部的值。
  • Instructions between 0x40066b and 0x40067b (inclusive) are simply the implementation of the body of the function. It is actually quite self-explanatory even to someone who is not familiar with assembly languages. One thing to be mindful here is that the RSP is not changed throughout the instructions.
    0x40066b 到 0x40067b(含)之间的指令只是函数体的实现。即使对于不熟悉汇编语言的人来说,它实际上也是不言自明的。这里需要注意的一件事是 RSP 在整个指令中没有改变。
  • Looking at 0x40067f next, what the CPU does is read the value RSP currently points to into RBP and then increase RSP by 8. This is what is commonly known as popping the stack. Since RSP still points to the slot that holds the old RBP value, the bottom of main's call stack frame, this instruction overwrites what is currently stored in RBP and thereby restores the RBP value before leaving the function.
    接下来看 0x40067f,CPU 所做的是将 RSP 加 8 并获取 RSP 指向的当前值。这就是我们俗称的出栈。由于 RSP 仍然指向与具有旧 RBP 值的地址相邻的下一个地址,即主调用堆栈帧的底部,因此该指令将擦除当前的内容存储在 RBP 中,并在离开函数之前恢复 RBP 值。
  • Right before quitting the function, the CPU executes the retq instruction. What this does is take the value pointed to by RSP, assign it to RIP, and increase the value of RSP by 8. Careful readers may have already realized that this value is exactly 0x40069a, since we stored it there during the instruction at 0x400695.
    在退出函数之前,CPU会执行 retq 指令。其作用是将 RSP 的值增加 8,获取 RSP 指向的值并将其分配给 RIP 。细心的读者可能已经意识到,这正是0x40069a,因为我们将这个值存储在0x400695的指令中。

At this point, we have fully walked through how a function is called. As we can see, RBP and RSP play a critical role here in preserving the call stack information and making it visible to the CPU. Again, in reality, compilers will often try to optimize this away, so it is quite possible that we won't see some of these instructions in a release build. Note that this optimization of RBP is by no means the same thing as the compiler inlining the function. Execution still jumps, via RIP, to a different code fragment that belongs to the Interface function.

至此,我们已经完整的了解了函数是如何被调用的。正如我们所看到的, RBPRSP 在保存调用堆栈信息并使其对 CPU 可见方面发挥着关键作用。同样,实际上,很多时候编译器会尝试对其进行优化,因此我们很可能在发布版本中看不到其中的一些内容。请注意,这种对 RBP 的优化与编译器尝试通过内联来优化函数完全不同。它仍然将 RIP 跳转到属于 Interface 函数的另一个不同的代码片段。
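The downward growth of the stack described above can actually be observed from C++ itself, using the GCC/Clang-specific __builtin_frame_address intrinsic. This is only an illustrative sketch; the noinline attribute keeps the compiler from flattening the call so the two frames stay distinct:

```cpp
#include <cstdint>

// Returns the frame address of this function itself. noinline guarantees it
// gets a real stack frame of its own (GCC/Clang extension).
__attribute__((noinline)) uintptr_t CalleeFrame()
{
    return reinterpret_cast<uintptr_t>(__builtin_frame_address(0));
}

// Compares the caller's frame address with the callee's. On x86/x64 and Arm64
// the stack grows toward lower addresses, so the callee frame sits below ours.
__attribute__((noinline)) bool StackGrowsDown()
{
    uintptr_t caller = reinterpret_cast<uintptr_t>(__builtin_frame_address(0));
    uintptr_t callee = CalleeFrame();
    return callee < caller;  // callee frame is "below" the caller frame
}
```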

0x4004d0 <Interface(int)>             mov    %edi,%eax
0x4004d2 <Interface(int)+2>           imul   %edi,%eax
0x4004d5 <Interface(int)+5>           imul   %eax,%eax
0x4004d8 <Interface(int)+8>           retq

0x4003e0 <main(int, char**)>          jmpq   0x4004d0 <Interface(int)>

Above is the assembly code produced with (level 3) optimization by the same compiler. In order to prevent the compiler from inlining this function, I moved the Interface function to a separate compilation unit. As we can see, it does incur a jump instruction in the main function rather than expanding the body of Interface inside main.

上面是由同一编译器通过(第 3 级)优化生成的汇编代码。为了防止编译器内联这个函数,我将 Interface 函数移到了一个单独的编译单元。从它可以看出,它确实在 main 函数中产生了跳转指令,而不是扩展 mainInterface 内部的内容。

There is a compiler option, -flto, that performs link time optimization, which allows the compiler to optimize across different compilation units. A similar option is available in all major C++ compilers. With this option, the following assembly code is produced.

有一个选项 -flto 供编译器执行链接时优化,它允许编译器跨不同的编译单元进行优化。所有主要 C++ 编译器都提供类似的选项。使用此选项,将生成以下汇编代码。

0x4003e0 <main(int, char**)>          mov    %edi,%eax
0x4003e2 <main(int, char**)+2>        imul   %edi,%eax
0x4003e5 <main(int, char**)+5>        imul   %eax,%eax
0x4003e8 <main(int, char**)+8>        retq

As we can see from the above code, the jump instruction is fully removed, meaning the function is inlined under this optimization. Later we will see that we need to prevent this from happening in the context of fiber switching.

从上面的代码中我们可以看到,跳转指令被完全删除,这意味着在这种优化下该函数已被内联。稍后我们将看到,在纤程切换的场景下,我们需要防止这种情况发生。

Arm64 Architecture Arm64架构

Besides the x64 architecture, I would also like to briefly cover the Arm64 architecture in this post due to the growing popularity of the platform, especially after Apple's new Mac lineup with Apple Silicon. The introduction of fiber in this post mainly targets game development. It would be a blocker for commercial adoption of the tech if there were no solution on Arm64, since most mobile devices, along with Apple Silicon Macs, run on this platform.

除了 x64 架构之外,我还想在这篇文章中简要提及 Arm64 架构,因为该平台越来越受欢迎,特别是在 Apple 推出采用 Apple Silicon 的新 Mac 系列之后。本篇文章介绍Fiber的目的主要是针对游戏开发。如果 Arm64 上没有解决方案,这将阻碍该技术的商业采用,因为大多数移动设备以及 Apple Silicon Mac 都在该平台上运行。

Below is a quick summary of registers available on Arm64 CPUs.

以下是 Arm64 CPU 上可用寄存器的快速摘要。

  • X0-X29 : These 30 registers are mostly for general purpose usage. Programmers can use most of them for anything they want, though common practice assumes specific usage for a few of them; for example, X29 is commonly used as the frame pointer, similar to RBP on the x64 architecture.
    X0-X29:这30个寄存器主要用于通用用途。程序员可以将其中的大部分用于他们想要的任何用途。尽管通常的做法会假设一些寄存器的某些特定用法,例如 X29 通常用作帧指针,类似于 x64 架构上的 RBP
  • X30, LR : Different from x64, there is a dedicated register for keeping track of the return address when a function is invoked. This register is X30, sometimes also referred to as LR (link register).
    X30、LR:与 x64 不同,有一个专用寄存器用于在调用函数时跟踪返回地址。这个寄存器是 X30 ,有时也称为 LR
  • SP, XZR : SP is the stack pointer on the Arm64 architecture, the analog of RSP on x64. However, a minor difference here is that the same register encoding acts as the zero register (XZR) when used in non-stack-related instructions.
    SP、XZR:这是Arm64架构上的堆栈指针,类似于x64上的 RSP 。然而,这里的一个细微差别是,当在非堆栈相关指令中使用时,该寄存器也可以用作零寄存器。
  • PC : This is the Arm version RIP, the instruction pointer or program counter. It records what is to be executed next by the CPU.
    PC:这是Arm版本 RIP ,指令指针或程序计数器。它记录了CPU接下来要执行的内容。
  • V0-V31 : These are 32 registers used for floating point and Neon (SIMD) operations.
    V0-V31:这些是 32 个寄存器,用于浮点运算和 Neon、4 路 SIMD 运算。

Above is just part of the whole register set. There are more registers, such as D0-D31, S0-S31, etc. However, we are only interested in the registers above, as only these matter when we implement fibers on Arm64 CPUs.

以上只是整个寄存器组的一部分。还有更多寄存器,例如 D0 - D31S0 - S30 等。但是,我们只对了解上述寄存器感兴趣因为当我们在 Arm64 CPU 上实现纤程时,只有这些才重要。

Similar to x64, some of the above registers are callee-saved: X19-X28 (along with the frame pointer X29 and the link register X30, which have their own conventions) and the lower 64 bits of V8-V15. The rest of the available registers are caller-saved.

与 x64 类似,上面的一些寄存器是被调用者保存的:X19 - X28(以及有各自约定的帧指针 X29 和链接寄存器 X30),还有 V8 - V15 的低 64 位。其余可用寄存器均由调用者保存。

Again, let's start with the assembly code produced without optimization. In this case, I compiled the source code with Apple clang version 14.0.0 on macOS Ventura 13.1.

再次,让我们从未经优化的汇编代码开始。在本例中,我在 MacOS Ventura 13.1 上使用版本 14.0.0 的 Apple clang 编译了源代码。

First, here is the code for the main function

首先,这是main函数的代码

0x100003f7c <+0>:  sub    sp, sp, #0x30
0x100003f80 <+4>:  stp    x29, x30, [sp, #0x20]
0x100003f84 <+8>:  add    x29, sp, #0x20
0x100003f88 <+12>: stur   wzr, [x29, #-0x4]
0x100003f8c <+16>: stur   w0, [x29, #-0x8]
0x100003f90 <+20>: str    x1, [sp, #0x10]
0x100003f94 <+24>: ldur   w0, [x29, #-0x8]
0x100003f98 <+28>: bl     0x100003f50               ; Interface at main.cpp:5
0x100003f9c <+32>: str    w0, [sp, #0xc]
0x100003fa0 <+36>: ldr    w0, [sp, #0xc]
0x100003fa4 <+40>: ldp    x29, x30, [sp, #0x20]
0x100003fa8 <+44>: add    sp, sp, #0x30
0x100003fac <+48>: ret

Since we already have some experience reading assembly code, let's go through this one a bit more quickly.

由于我们已经有一些阅读汇编代码的经验,所以让我们更快地浏览一下这一点。

  • Starting from the beginning, the PC (program counter) register is 0x100003f7c, meaning the first instruction is sub sp, sp, #0x30
    从头开始, PC (程序计数器)寄存器是0x100003f7c,意味着第一条指令是 sub sp, sp, #0x30
    The first instruction does nothing but grow the call stack. Similar to x64, the call stack address goes downward as the stack grows. In this example, the call stack is grown by 48 bytes.
    第一条指令只不过是增加调用堆栈。与 x64 类似,调用堆栈地址随着堆栈的增长而下降。在此示例中,调用堆栈增长了 48 个字节。
  • As we mentioned before, X29 (FP, frame pointer) and X30 (LR) are all callee saved, we will have to save the values before moving forward. Instruction 0x100003f80 does exactly this. Later we will see that if we don't modify any of them in a function, there is no need to store them at the beginning of the function then.
    正如我们之前提到的, X29 (FP,帧指针)和 X30 (LR)都是被调用者保存的,我们必须在继续之前保存这些值。指令 0x100003f80 正是这样做的。稍后我们会看到,如果我们不在函数中修改它们中的任何一个,那么就不需要将它们存储在函数的开头。
  • Skipping to instruction 0x100003f98, what it does is first store 0x100003f9c into the X30 (LR) register and then set the PC to 0x100003f50, the first instruction of the function Interface.
    跳到指令 0x100003f98,它的作用是先将 0x100003f9c 存储到 X30 ( LR ) 寄存器中,然后将 PC 设置为 0x100003f50,即函数 Interface 的第一条指令。

Before we move forward with this program, let's quickly take a look inside the function Interface.

在我们继续这个程序之前,让我们快速看一下函数 Interface 的内部。

0x100003f50 <+0>:  sub    sp, sp, #0x10
0x100003f54 <+4>:  str    w0, [sp, #0xc]
0x100003f58 <+8>:  ldr    w8, [sp, #0xc]
0x100003f5c <+12>: ldr    w9, [sp, #0xc]
0x100003f60 <+16>: mul    w8, w8, w9
0x100003f64 <+20>: str    w8, [sp, #0x8]
0x100003f68 <+24>: ldr    w8, [sp, #0x8]
0x100003f6c <+28>: ldr    w9, [sp, #0x8]
0x100003f70 <+32>: mul    w0, w8, w9
0x100003f74 <+36>: add    sp, sp, #0x10
0x100003f78 <+40>: ret

  • The first instruction (0x100003f50) grows the call stack by 16 bytes.
    第一条指令 (0x100003f50) 将调用堆栈增加 16 个字节。
  • The instructions between 0x100003f54 and 0x100003f70 perform the calculation inside the Interface function.
    0x100003f54 和 0x100003f70 之间的指令执行 Interface 函数内部的计算。
  • Instruction 0x100003f74 pops the stack.
    指令 0x100003f74 弹出堆栈。
  • The last ret instruction simply asks the program to jump to the instruction that the LR register points to, which was set to 0x100003f9c in the main function by instruction 0x100003f98.
    最后一条ret指令只是要求程序跳转到 LR 寄存器指向的指令,并在主函数中通过指令0x100003f98将其设置为0x100003f9c。

One thing we can notice from this program is that the assembly code in the Interface function doesn't bother to save and restore X29 and X30; this is fine, as it never makes any changes to these registers within the function.

从这个程序中我们可以注意到的一件事是 Interface 函数中的汇编代码不会费心保存和恢复 X29X30 ,这很好,因为它永远不会在此函数中对这些参数进行任何更改。

After the Interface function finishes, the PC becomes 0x100003f9c, the instruction right after the call to Interface.
Interface 函数完成后, PC 变为0x100003f9c,即调用 Interface 函数之后的下一条指令。

  • Looking at 0x100003fa4, what this instruction does is restore the X29 and X30 registers. It is important to restore these two registers. Specifically in this program, it is LR that matters: once the return instruction at 0x100003fac is executed, the main function needs to return to where LR points.
    看一下0x100003fa4,这个程序所做的就是恢复 X29X30 寄存器。恢复这两个寄存器很重要。具体来说,在这个程序中, LR 非常重要,因为一旦在0x100003fac处调用返回指令,主函数就需要返回到 LR 指向的位置。
  • It is certainly the callee's responsibility to make sure the SP register is unchanged. Since we grew the stack at instruction 0x100003f7c, we have to pop the stack so that the SP register stays intact.
    确保 SP 寄存器不变是被调用者的责任。由于我们在指令 0x100003f7c 处增加堆栈,因此我们必须弹出堆栈以使 SP 寄存器完好无损。

Similarly, let's take a look at the asm code produced by the same compiler, but with optimization. Below is the code produced by the compiler with the two functions split into two different compilation units.

同样,让我们​​看一下由同一编译器生成但经过优化的 asm 代码。下面是编译器生成的代码,其中两个函数分为两个不同的编译单元。

Below is the asm code for main.

下面是main的asm代码。

0x100003fa8 <+0>: b      0x100003fac               ; Interface at test.cpp:3:15

And here is the asm code for Interface.

这是接口的 asm 代码。

0x100003fac <+0>: mul    w8, w0, w0
0x100003fb0 <+4>: mul    w0, w8, w8
0x100003fb4 <+8>: ret

This is very self-explanatory. I'd like to point out one interesting trick the compiler did in this case. Be mindful that the jump instruction is a b rather than a bl instruction like before. The b instruction will not store the return address in the LR register. This is fine: the compiler is being smart by taking advantage of the fact that there are no further instructions after invoking the Interface function, i.e. the call is in tail position. So after the Interface function is done, execution directly returns to whichever code called main.

这是非常不言自明的。我想指出编译器在这种情况下所做的一个有趣的技巧。请注意,跳转指令是 b 而不是之前的 bl 指令。该 b 指令不会将返回地址存储在LR寄存器中。这很好,因为编译器很聪明,利用了调用 Interface 函数后没有进一步指令的事实。所以 Interface 函数执行完成后,直接跳转到调用main的代码的下一条指令。
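The pattern the compiler exploited here is a tail call: when calling another function is the very last action of a function, the call/ret pair can collapse into a single jump (b on Arm64, jmp on x64). A minimal C++ shape eligible for this optimization, as a sketch:

```cpp
// noinline (GCC/Clang extension) keeps Square a real call/jump target for
// illustration purposes; hypothetical names, not from the original post.
__attribute__((noinline)) int Square(int x) { return x * x; }

int Forward(int x)
{
    // Last action of the function: an optimizing compiler may replace the
    // call + ret sequence with a single jump into Square.
    return Square(x);
}
```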

Last, let's take a look at the asm code produced with link time optimization.

最后,让我们看一下通过链接时间优化生成的 asm 代码。

0x100003fac <+0>: mul    w8, w0, w0
0x100003fb0 <+4>: mul    w0, w8, w8
0x100003fb4 <+8>: ret

Very simple code that does exactly what we need.

非常简单的代码,这正是我们所需要的。

Quick Summary Before Moving Forward

继续之前的快速总结

In this section, we briefly covered some of the basics of how the CPU handles the call stack on both the x64 and Arm64 architectures. We are also clear by now on which registers are callee-saved and which are caller-saved.

在本节中,我们简要提到了 CPU 如何在 x64 和 Arm64 架构上处理调用堆栈的一些基础知识。现在我们也清楚哪些寄存器是被调用者保存的,哪些是调用者保存的。

Even though what we touched on is merely the tip of the iceberg, it should serve as a good foundation for learning what a fiber is and how it can yield when needed.

尽管我们接触到的只是冰山一角,但这应该为我们继续了解纤程是什么以及它如何在需要时产生作用奠定良好的基础。

Existing Fiber Interface on Windows

Windows 上现有的纤程接口

Next, before we finally dive into the implementation details of fiber, let's take a quick look at the interface the Windows operating system offers for fibers. It is actually really easy to use.

接下来,在我们最终深入探讨纤程的实现细节之前,我们先快速了解一下Windows操作系统为纤程提供了什么样的接口。它实际上非常容易使用。

  • ConvertThreadToFiber : This function converts the current thread to a fiber. It is mandatory to convert a thread to a fiber before yielding control to another fiber.
    ConvertThreadToFiber :这是一个帮助将当前线程转换为纤程的函数。在将控制权交给另一个纤程之前,必须将线程转换为纤程。
  • ConvertFiberToThread : This function is the reverse of the previous one. It converts the current fiber back to the thread it was originally converted from.
    ConvertFiberToThread :此函数是上一个函数的反转版本。它将当前纤程转换为最初转换为它的原始线程。
  • CreateFiber : This is the interface for creating a fiber. Programmers can specify the size of the stack and the entry function pointer of the fiber so that when it first gains control, it would run from there.
    CreateFiber :这是创建纤程的接口。程序员可以指定堆栈的大小和纤程的入口函数指针,以便当它第一次获得控制权时,它将从那里运行。
  • DeleteFiber : As its name implies, this asks the OS to delete the fiber. Of course, it is the programmer's responsibility to make sure a running fiber is not deleted, which would quite possibly cause a crash.
    DeleteFiber :顾名思义,这是要求操作系统删除纤程。当然,程序员有责任确保正在运行的纤程不被删除,这很可能会导致崩溃。
  • SwitchToFiber : This is the juicy part. This is the interface that allows a fiber to yield to another fiber. This function's implementation is quite cheap; the performance cost is nowhere near that of a thread switch scheduled by the OS.
    SwitchToFiber :这是最有价值的部分。这是允许纤程yield于另一纤程的接口。而且这个函数的实现非常便宜,性能成本远不及操作系统调度的线程切换。

That is it. These are the essential fiber interfaces required to implement a job system that allows yielding in the middle of a task. As we can see, it is really not complicated at all.

这就对了。这是纤程接口的重要部分,需要实现允许在任务中间进行yield的作业系统。正如我们所看到的,它实际上一点也不复杂。

For readers who are still unsure how to use these, here is a short example that demonstrates the interfaces.

对于仍然对如何使用这些接口感到困惑的读者,这里有一个简短的示例,演示了如何使用这些接口。

 1  #include <iostream>
 2  #include <Windows.h>
 3
 4  #define FiberHandle  LPVOID
 5
 6  void RegularFunction(FiberHandle fiber)
 7  {
 8    // We are done executing this fiber, yield control back
 9    SwitchToFiber(fiber);
10
11    std::cout << "Hello Fiber Again" << std::endl;
12  }
13
14  void WINAPI FiberEntry(PVOID arg)
15  {
16    // this is the fiber that yielded control to the current fiber
17    FiberHandle fiber = reinterpret_cast<FiberHandle>(arg);
18
19    // do whatever you would like to do here.
20    std::cout << "Hello Fiber" << std::endl;
21
22    RegularFunction(fiber);
23
24    // We are done executing this fiber, yield control back
25    SwitchToFiber(fiber);
26  }
27
28  int main(int argc, char** argv) {
29    // convert the current thread to a fiber
30    FiberHandle fiber = ConvertThreadToFiber(nullptr);
31
32    // create a new fiber
33    FiberHandle new_fiber = CreateFiber(1024, FiberEntry, fiber);
34
35    // yield control to the new fiber
36    SwitchToFiber(new_fiber);
37    SwitchToFiber(new_fiber);
38
39    // convert the fiber back to thread
40    ConvertFiberToThread();
41
42    // delete the fiber
43    DeleteFiber(new_fiber);
44
45    return 0;
46  }

In case there is confusion, here is a quick explanation. The execution order is that main gets executed until line 36, where it gives control to the new fiber created on line 33. After yielding, the function main is no longer in control; the CPU then starts executing from line 14. Be mindful that on line 9, the program jumps directly from within RegularFunction, which as its name implies is just a regular C++ function, to the main function, which continues execution at line 37. There is no need to go back through FiberEntry for such a jump. It is also possible to jump from anywhere deep in the call stack of a fiber. Since line 37 immediately yields control back to the fiber, new_fiber gains control a second time, except that this time it resumes executing from where it was suspended before (line 9), rather than starting from scratch. Last, but not least, it is the programmer's responsibility to make sure a fiber always yields to the correct fiber. In this case, line 25 makes sure that control goes back to main so that the rest of the main function gets executed. Do not expect the compiler to help here; it doesn't have enough information to make such a decision.

如果有疑问,这里有一个快速解释。执行顺序是 main 函数执行到第 36 行,将控制权交给第 33 行创建的新纤程。屈服后,main 函数将不再受控制,CPU 将从第 14 行开始执行。请注意,在第 9 行,程序直接从 RegularFunction (顾名思义,它只是一个常规的 C++ 函数)内部跳转到 main 函数,以便它在第 37 行继续执行。没有这样的跳转需要经过 FiberEntry 。也可以跳转到纤程调用堆栈深处的任何位置。由于第 37 行立即将控制权交还给 Fiber,因此 new_fiber 第二次获得控制权,只不过这次它从之前暂停的位置(第 9 行)恢复执行,而不是再次从头开始。最后但并非最不重要的一点是,程序员有责任确保纤程始终屈服于正确的纤程以执行。在本例中,第 25 行确保控制返回到 main,以便 main 函数的其余部分得到执行。不要指望编译器在这种情况下提供帮助,它没有足够的信息来做出这样的决定。

Hopefully, through this simple example, readers can understand the power and flexibility of fiber. It offers greater power that is badly needed in a job system with tons of dependencies.

希望通过这个简单的例子,读者能够了解纤程的强大功能和灵活性。它提供了具有大量依赖项的作业系统急需的更强大的功能。

Implementing Fiber on x64

在 x64 上实施纤程

This blog post wouldn't exist if it weren't for this fun part. The real fun begins in this section, where we start to mess around with registers to fool the CPU so that we can switch fibers just like the OS-provided interface does. In order to make this blog post more educational, I made a tiny library that does just this. Here is the link to the Github gist repo I created. Readers are encouraged to read this blog post along with the source code to gain a deeper understanding of the tech.

如果不是这个有趣的部分,这篇博文就不会存在。真正的乐趣从本节开始,我们开始摆弄寄存器来欺骗 CPU,以便我们可以像操作系统提供的接口一样切换纤程。为了使这篇博文更具教育意义,我创建了一个小型库来完成此任务。这是我创建的 Github gist 存储库的链接。建议读者阅读这篇博文和源代码,以更深入地了解该技术。

As a matter of fact, such an implementation is needed on macOS, since the OS doesn't offer interfaces for fiber control at the time this post was written. There is indeed a ucontext interface on macOS; however, it has been deprecated, so relying on it would be risky in the future. On Linux, we can indeed use it to achieve the same thing.

事实上,MacOS 上需要这样的实现,因为在撰写本文时,该操作系统尚未提供纤程控制接口。 MacOS 上确实存在 ucontext 接口。然而,它已被弃用。将来使用这样的接口会有风险。在Linux上,我们确实可以使用它来实现同样的事情。
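To make the idea concrete before we hand-roll our own, here is a minimal sketch of a fiber-style switch using the POSIX ucontext API mentioned above (usable on Linux; deprecated on macOS). All the names here are mine for illustration, not from the gist:

```cpp
#include <ucontext.h>

// The fiber runs on its own stack; when its entry function returns, control
// resumes at uc_link (the main context) automatically.
static ucontext_t g_main_ctx, g_fiber_ctx;
static char g_stack[64 * 1024];  // the fiber's private stack memory
static int g_step = 0;

static void FiberEntry()
{
    g_step = 1;  // observable side effect proving the fiber ran
}

int RunFiberOnce()
{
    getcontext(&g_fiber_ctx);
    g_fiber_ctx.uc_stack.ss_sp = g_stack;
    g_fiber_ctx.uc_stack.ss_size = sizeof(g_stack);
    g_fiber_ctx.uc_link = &g_main_ctx;       // where to go when FiberEntry returns
    makecontext(&g_fiber_ctx, FiberEntry, 0);
    swapcontext(&g_main_ctx, &g_fiber_ctx);  // suspend here, run the fiber
    return g_step;                           // 1 if the fiber actually ran
}
```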

The process of implementing the fiber interface should be pretty rewarding. And the x64 fiber implementation that we will cover in this section will work on all platforms that support the System V ABI.

实现纤程接口的过程应该是非常有意义的。我们将在本节中提到的 x64 纤程实现将在所有支持 System V ABI 的平台上运行。

To implement fiber on x64 ourselves, all we need to do is implement the five interfaces we mentioned above. As a matter of fact, the good news is that very little needs to be done in ConvertThreadToFiber and ConvertFiberToThread; later we will explain why this is the case. This leaves us only three functions to implement: CreateFiber, DeleteFiber and SwitchToFiber.

要自己在 x64 上实现 Fiber,我们需要做的就是实现上面提到的 5 个接口。事实上,这里的一个好消息是 ConvertThreadToFiberConvertFiberToThread 中几乎不需要做任何事情。稍后我们会解释为什么会出现这种情况。这使得我们只需要实现三个函数: CreateFiberDeleteFiberSwitchToFiber

Fiber Structure Definition

纤程结构定义

To get started, we need to define the fiber structure first. Below is the definition of fiber in my implementation. Let's take a quick look at it first.

首先,我们需要首先定义纤程结构。以下是我的实现中纤程的定义。我们先快速浏览一下。

//! Abstraction for fiber struct.
struct Fiber {
    /**< Pointer to stack. */
    void*                   stack_ptr = nullptr;
    /**< fiber context, this is platform dependent. */
    FiberContexInternal     context;
};

As we can see from this data structure, there are only two members in it. stack_ptr, as its name implies, is simply the pointer to the stack memory that will be used by the fiber. Different from a regular subroutine or a language-supported coroutine, fiber requires programmers to allocate the stack memory themselves. With the Windows fiber interface, this is done under the hood of CreateFiber. However, with a low-level asm implementation like this, we need to take on the responsibility of creating the stack memory. In reality, this explicit control of memory allocation is commonly quite welcome among game developers, since they stay in charge of the allocation rather than handing it over to a third party library. Be mindful that there is no real requirement on where this memory has to live. It is commonly on the heap, but it is totally fine if this fiber stack memory is allocated on another stack, of either a fiber or a thread, as long as synchronization is done properly so that the fiber stack memory won't get destroyed before it is done being used. The only reason we keep track of this pointer is that we would like to deallocate this memory properly when the fiber gets destroyed. The assembly code will not use this member to track the stack at all. Instead, it will use a stack pointer, stored in FiberContexInternal, to keep track of the stack.

从这个数据结构中我们可以看到,它只有两个成员。 stack_ptr ,顾名思义,只是指向堆栈地址的指针,该堆栈地址将由纤程使用。与常规子例程或语言支持的协程不同,Fiber 需要程序员自己分配自己的堆栈内存。对于 Windows 纤程接口,它是在 CreateFiber 的底层完成的。然而,对于像这样的低级 asm 实现,我们需要承担创建堆栈内存的责任。实际上,这种对内存分配的显式控制通常受到游戏开发人员的欢迎,因为他们负责内存分配,而不是将其处理给第三方库。请注意,这个指针必须指向的内存没有真正的要求,它通常位于堆上,但如果将此纤程堆栈内存分配在纤程或线程的另一个堆栈上,只要同步正确完成,纤程堆栈内存在使用完毕之前不会被破坏。我们跟踪这一点的唯一原因是我们希望在纤程被破坏时正确地重新分配该内存。汇编代码根本不会使用此成员来跟踪堆栈。相反,它将使用存储在 FiberContextInternal 中的堆栈指针来跟踪堆栈。
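As a concrete illustration of the stack_ptr discussion, below is a sketch (not the gist's actual code) of how the fiber stack might be allocated. Two assumptions are baked in: the initial stack top is the *end* of the allocation because the stack grows downward, and the System V ABI expects 16-byte stack alignment:

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical helper, for illustration only.
struct FiberStack {
    void* base = nullptr;  // what we free later (the Fiber::stack_ptr role)
    void* top  = nullptr;  // initial stack pointer: highest 16-byte-aligned address
};

inline FiberStack AllocateFiberStack(std::size_t size)
{
    FiberStack s;
    s.base = std::malloc(size);
    // The stack grows downward, so the usable top is the end of the block,
    // rounded down to the 16-byte alignment the System V ABI expects.
    uintptr_t end = reinterpret_cast<uintptr_t>(s.base) + size;
    s.top = reinterpret_cast<void*>(end & ~uintptr_t(15));
    return s;
}
```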

context is the data structure that keeps track of the registers. This mysterious fiber context structure merely keeps track of a few registers, defined as below.
context 是跟踪寄存器的数据结构。神秘纤程上下文结构仅跟踪一些寄存器,具体定义如下。

struct FiberContexInternal {
    // callee-saved registers
    Register rbx;
    Register rbp;
    Register r12;
    Register r13;
    Register r14;
    Register r15;

    // stack and instruction register
    Register rsp;
    Register rip;
};

Some readers may have a question by now. What is the rationale behind the choices of the registers that need to be stored? This is a very important question for us to understand how it works. To answer the question, let's take a look at the registers in the data structure.

一些读者现在可能有一个疑问。选择需要存储的寄存器的理由是什么?对于我们了解它是如何工作的,这是一个非常重要的问题。为了回答这个问题,我们先来看看数据结构中的寄存器。

  • Why do we need to store RIP?
    为什么我们需要存储 RIP
    This is a truly simple question. As mentioned previously, RIP is the instruction pointer, which points to the next instruction to be executed by the CPU. The fiber context is the data placeholder between a fiber suspension and a fiber resume. Upon suspension, the fiber needs to record where it was suspended so that when it gets resumed, the CPU knows the next instruction to execute and resumes from exactly where it left off.
    这是一个非常简单的问题。前面提到, RIP 是指令指针,它指向CPU要执行的下一条指令。 FiberContextInternal 是纤程悬浮液和纤程恢复之间的数据占位符。暂停时,纤程需要知道它暂停在哪里,以便当它恢复时,它知道 CPU 执行的下一条指令是什么,以便它从暂停的确切位置恢复。
  • Why do we need to store RSP?
    为什么我们需要存储 RSP
    This is an easy question to answer as well. Since we allocate our own fiber stack memory, RSP has to point into that stack. The compiler doesn't know where the top of this stack is, so we need to keep track of it ourselves, and this stored RSP does exactly that.
    这也是一个很容易回答的问题。由于我们分配自己的纤程内存,因此 RSP 必须指向它自己的堆栈。由于编译器不知道堆栈顶部在哪里,因此我们需要确保知道它在哪里。而这个 RSP 正是这样做的。
  • Why do we need to store the callee saved registers?
    为什么我们需要存储被调用者保存的寄存器?
    Imagine a function A makes a fiber switch from fiber 0 to fiber 1, and assume the R12 register was written right before the switch. After the switch, function A is suspended and fiber 1 is either resumed or launched. If fiber 1 was suspended before and now gets resumed, its following instructions may read the R12 register as well. However, it is by no means interested in the value of R12 written by function A; all it needs to know is what R12's value was before fiber 1 was suspended. On the other hand, the value written to R12 by function A may very likely be read later inside A as well. To prevent this value from getting lost when A is resumed in the future, it needs to be cached somewhere. The same goes not only for R12, but for all callee-saved registers. And this is why we need to keep all the callee-saved registers.
    想象一下,我们有一个函数 A 将纤程从纤程 0 切换到纤程 1。假设 R12 寄存器就在切换之前写入。切换后,功能A将暂停,纤程1将恢复或启动。如果纤程 1 之前被挂起并恢复,则纤程 1 的后续指令也可以读取寄存器 R12 。但是,它对读取函数 A 写入的 R12 的值毫无兴趣,它只需要知道 R12 寄存器在挂起之前的值是什么。另一方面,由函数 A 写入 R12 寄存器的值也很可能稍后在其中被读取。为了防止该值在将来恢复后丢失,需要将其缓存在某个地方。同样的情况不仅适用于 R12 ,还适用于所有被调用者保存的寄存器。这就是为什么我们需要保留所有被调用者保存的寄存器。
  • Why don't we care about caller saved registers?
    为什么我们不关心调用者保存的寄存器?
    If we take a look at the same example as above, we should be mindful that the fiber switch is itself a function. Even though the fiber switch is just a regular subroutine, as long as it is not inlined, the compiler must assume the call may clobber the caller-saved registers, so it preserves any values it still needs across the call. In the above example, imagine that after the fiber switch some other fiber overwrites the caller-saved registers and switches control back to fiber 0; it is still the compiler's responsibility to make sure the caller-saved values are properly restored before reusing them in the caller code. Such restoring is commonly performed by caching the values on the stack. To some degree, we can regard the call stack itself as a partial cache of our fiber context, which frees us from the need to save these registers ourselves.
    如果我们看一下与上面相同的示例,我们应该注意纤程开关本身就是一个函数。即使纤程开关只是一个常规子例程,只要它不是内联的,编译器就需要确保它在调用后恢复调用者保存的寄存器的值。在上面的示例中,想象一下,在纤程切换之后,如果其他纤程覆盖了调用者保存的寄存器的值并将控制权切换回纤程 0,编译器仍然有责任确保调用者保存的寄存器在重用之前正确恢复在调用者代码中。这种恢复过程通常是通过将值缓存在堆栈上来执行的。在某种程度上,我们可以将调用堆栈本身视为 Fiber 上下文的部分缓存,这样我们就不需要这样做了。
    To emphasize it: it is quite important to make sure the compiler won't inline our fiber switch function. Otherwise, we would need to be responsible for storing the caller-saved registers in the fiber context as well. Depending on how aggressive compiler optimization is, simply putting this function definition into another compilation unit may not be good enough, especially when link time optimization is enabled. The safest way to be sure is to take a look at the assembly code produced by the compiler and confirm it does what we expect.
    强调一下,确保编译器不会将我们的 Fiber Switch 函数优化为内联版本对我们来说非常重要。否则,我们还需要负责将调用者保存的寄存器存储在 Fiber 上下文中。根据编译器优化的积极程度,简单地将此函数定义放入另一个编译单元中可能不够好,特别是在启用链接时优化时。最安全的方法是查看编译器生成的汇编代码,以确保它符合我们的预期。

At this point, I believe it should be clear why we define the fiber context structure the way we do. Thanks to the fact that all of the SIMD registers are caller-saved, we only need to store a few registers in our fiber context.

至此,我相信我们应该清楚为什么我们要这样定义 Fiber 上下文结构。由于所有 SIMD 寄存器都是调用者保存的,因此我们只需要在 Fiber 上下文中存储一些寄存器。

Switch Between Fibers 纤程之间切换

Rather than starting from CreateFiber, I choose to start with the SwitchFiber interface, as the former requires knowledge about how the latter works. We already learned that the CPU only uses its registers to talk to the rest of the system, unless something can be resolved at compile time, such as function addresses. Since static information is the same for all executing threads/fibers, we only care about registers during a fiber switch. Because we are working with registers, we have to touch assembly language to achieve this. Below is the fiber switch implementation I've done for the x64 architecture.

我选择从 SwitchFiber 接口而不是 CreateFiber 开始,因为 CreateFiber 需要先了解 SwitchFiber 如何工作。我们已经了解到,除非某些内容可以在编译期解析(例如函数地址),否则 CPU 只会使用寄存器与系统的其余部分通信。由于静态信息对所有正在执行的线程/纤程都是相同的,因此纤程切换时我们只需要关心寄存器。因为我们要操作寄存器,所以必须借助汇编语言来实现。下面是我在 x64 架构上完成的纤程切换实现。

```asm
.text
.align 4
_switch_fiber_internal:
    // Store callee-preserved registers
    movq        %rbx, 0x00(%rdi) /* FIBER_REG_RBX */
    movq        %rbp, 0x08(%rdi) /* FIBER_REG_RBP */
    movq        %r12, 0x10(%rdi) /* FIBER_REG_R12 */
    movq        %r13, 0x18(%rdi) /* FIBER_REG_R13 */
    movq        %r14, 0x20(%rdi) /* FIBER_REG_R14 */
    movq        %r15, 0x28(%rdi) /* FIBER_REG_R15 */

    /* call stores the return address on the stack before jumping */
    movq        (%rsp), %rcx
    movq        %rcx, 0x40(%rdi) /* FIBER_REG_RIP */

    /* skip the pushed return address */
    leaq        8(%rsp), %rcx
    movq        %rcx, 0x38(%rdi) /* FIBER_REG_RSP */

    // Load context 'to'
    movq        %rsi, %r8

    // Load callee-preserved registers
    movq        0x00(%r8), %rbx /* FIBER_REG_RBX */
    movq        0x08(%r8), %rbp /* FIBER_REG_RBP */
    movq        0x10(%r8), %r12 /* FIBER_REG_R12 */
    movq        0x18(%r8), %r13 /* FIBER_REG_R13 */
    movq        0x20(%r8), %r14 /* FIBER_REG_R14 */
    movq        0x28(%r8), %r15 /* FIBER_REG_R15 */

    // Load stack pointer
    movq        0x38(%r8), %rsp /* FIBER_REG_RSP */

    // Load instruction pointer, and jump
    movq        0x40(%r8), %rcx /* FIBER_REG_RIP */
    jmp         *%rcx
```

Below is the declaration of the function interface:

下面是函数接口的声明

```cpp
void _switch_fiber_internal(FiberContexInternal* src_fiber, const FiberContexInternal* dst_fiber);
```

There are two parameters in this function, src_fiber and dst_fiber. In the above assembly code, RDI holds the first parameter (src_fiber) and RSI holds the second parameter (dst_fiber). The assembly code is simple enough that it needs no detailed explanation. In a nutshell, it takes the contents of the relevant registers (RBX, RBP, R12 to R15, RSP, RIP) and stores them in the fiber context that src_fiber points to, after which it loads the contents of the fiber context pointed to by dst_fiber into those registers. After swapping the values in the registers, the CPU is fooled about its execution sequence. It forgets all the previous instruction context and pretends that this function was called from where dst_fiber left off last time, which also covers the initial state of dst_fiber.

该函数有两个参数:src_fiber 和 dst_fiber。在上面的汇编代码中,RDI 保存第一个参数(src_fiber),RSI 保存第二个参数(dst_fiber)。汇编代码很简单,不需要详细解释。简而言之,它获取相关寄存器(RBX、RBP、R12 至 R15、RSP、RIP)的内容并将它们存储到 src_fiber 指向的纤程上下文中,之后再将 dst_fiber 指向的纤程上下文中的内容加载到这些寄存器中。交换寄存器中的值后,CPU 的执行顺序就被欺骗了。它会忘记所有先前的指令上下文,并认为该函数是从 dst_fiber 上次停下的位置调用的,这也包括 dst_fiber 的初始状态。

The next obvious question is where the values in the fiber context pointed to by dst_fiber come from. There are two cases. If the fiber was suspended before, it must have gone through the same interface, which populated the fiber context with the correct values in the first half of the _switch_fiber_internal function. Of course, it is the programmer's responsibility to make sure the fiber switch is legit. An incorrect fiber switch will easily crash the program.

下一个明显的问题是 dst_fiber 指向的纤程上下文中的值从何而来。有两种情况。如果纤程之前被挂起过,它一定经过了同一个接口,该接口已经通过 _switch_fiber_internal 函数的前半部分用正确的值填充了纤程上下文。当然,程序员有责任确保纤程切换是合法的。不正确的纤程切换很容易导致程序崩溃。

However, if the fiber is newly created and never gets executed before, we also need to make sure it works as expected.

然而,如果纤程是新创建的并且之前从未执行过,我们还需要确保它按预期工作。

Creating a new Fiber

创造新的纤程

Now that we know how to switch between fibers, the question that remains is how we can create a fiber from scratch so that it can be used as the destination fiber in the above SwitchFiber call.

现在我们知道了如何在纤程之间进行切换,剩下需要回答的问题是如何从头开始创建一个纤程,以便它可以用作上述 SwitchFiber 调用中的目标纤程。

Let's first define a fiber main function, which serves as the beginning of a fiber's execution:

首先我们定义一个 Fiber 主函数,作为 Fiber 执行的开始

```cpp
void FiberMain(){
    // do whatever you want to do in this fiber
}
```

My fiber entry is defined as above. Though, it is totally possible to define it other ways. This is just one possibility. The next step is to hook this function with a fiber so that when it first gains control, it will start from this function.

我的纤程入口定义如上。不过,完全可以用其他方式来定义它。这只是一种可能性。下一步是将这个函数与一个纤程挂钩,这样当它第一次获得控制权时,它将从此函数开始。

```cpp
bool _create_fiber_internal(void* stack, uint32_t stack_size, FiberContexInternal* context) {
    // it is the user's responsibility to make sure the stack is 16-byte aligned,
    // which is required by the x64 ABI
    if((((uintptr_t)stack) & (FIBER_STACK_ALIGNMENT - 1)) != 0)
        return false;

    uintptr_t* stack_top = (uintptr_t*)((uint8_t*)(stack) + stack_size);
    context->rip = (uintptr_t)FiberMain;
    context->rsp = (uintptr_t)&stack_top[-3];
    stack_top[-2] = 0;

    return true;
}
```

Above is an implementation on the x64 architecture. It is actually quite simple; all we need to do is set up the stack pointer and the instruction pointer. Since the instruction pointer points to FiberMain, the fiber will launch from this function's entry point first, exactly meeting our expectation. For the stack, we can pass in any memory as long as we can be sure it won't get destroyed during the execution of the fiber. The stack memory has to be 16-byte aligned, which is required by the ABI. As mentioned before, the stack grows downward, meaning that every time we push something onto the stack, the stack top address decreases. Because of this, we have to set the stack pointer to the end of the memory, rather than the beginning.

以上是 x64 架构上的实现。其实很简单,我们只需要设置堆栈指针和指令指针即可。由于指令指针指向 FiberMain ,所以 Fiber 将首先从该函数入口点启动,完全符合我们的预期。对于堆栈来说,我们可以传入任何内存,只要我们能够确定在纤程执行过程中,该内存不会被破坏。堆栈内存必须是 16 字节对齐,这是 ABI 所要求的。如前所述,堆栈的地址向下增长,这意味着每次我们向堆栈中压入一些内容时,堆栈顶部地址都会减少。正因为如此,我们必须将堆栈指针设置到内存的末尾,而不是内存的开头。

If we think about the first time such a fiber gets executed, the second half of the _switch_fiber_internal function will simply load garbage values in the callee saved registers except rsp and rip, but this is fine as the compiler will make sure that the callee saved registers will not be read before they are written to.

如果我们考虑第一次执行这样的纤程, _switch_fiber_internal 函数的后半部分只会将垃圾值加载到除 rsp 和 rip 之外的被调用者保存的寄存器中,但这没有问题,因为编译器会确保被调用者保存的寄存器在被写入之前不会被读取。

There is one annoying thing in the above design. The FiberMain function doesn't have any connection with the code that created the fiber. Of course, it is possible to pass information through global data with careful synchronization. A better alternative is to allow programmers to pass a single pointer into FiberMain so that it can access basic information from its creation code. And if you can pass in a pointer, you can pass in anything.

上面的设计中有一件烦人的事情。 FiberMain 函数与创建它的代码没有任何联系。当然,可以通过仔细同步的全局数据传递信息。更好的替代方法是允许程序员向 FiberMain 传入一个指针,以便在 FiberMain 中可以访问来自创建代码的基本信息。如果可以传入一个指针,那就可以传入任何东西。

To make it happen, we need to add one more register to our fiber context: RDI, which holds the first argument passed to a function under the x64 calling convention.

为了实现这一点,我们需要在纤程上下文中再添加一个寄存器:RDI,在 x64 调用约定中它用于传递函数的第一个参数。

```cpp
struct FiberContexInternal {
    // callee-saved registers
    Register rbx;   // 0x00
    Register rbp;   // 0x08
    Register r12;   // 0x10
    Register r13;   // 0x18
    Register r14;   // 0x20
    Register r15;   // 0x28
    // the first parameter; placed before rsp/rip so the member
    // offsets line up with the constants used in the assembly
    Register rdi;   // 0x30
    // stack and instruction register
    Register rsp;   // 0x38
    Register rip;   // 0x40
};
```

With this one extra register, we can simply pass a pointer from our redefined interface this way

有了这个额外的寄存器,我们可以通过这种方式简单地从重新定义的接口传递一个指针

```cpp
bool _create_fiber_internal(void* stack, uint32_t stack_size, void* arg, FiberContexInternal* context) {
    // it is the user's responsibility to make sure the stack is 16-byte aligned,
    // which is required by the x64 ABI
    if((((uintptr_t)stack) & (FIBER_STACK_ALIGNMENT - 1)) != 0)
        return false;

    uintptr_t* stack_top = (uintptr_t*)((uint8_t*)(stack) + stack_size);
    context->rip = (uintptr_t)FiberMain;
    context->rdi = (uintptr_t)arg;
    context->rsp = (uintptr_t)&stack_top[-3];
    stack_top[-2] = 0;

    return true;
}
```

And of course, we need to make some adjustment in our assembly code as well.

当然,我们还需要对汇编代码进行一些调整。

```asm
 1  .text
 2  .align 4
 3  _switch_fiber_internal:
 4      // Store callee-preserved registers
 5      movq        %rbx, 0x00(%rdi) /* FIBER_REG_RBX */
 6      movq        %rbp, 0x08(%rdi) /* FIBER_REG_RBP */
 7      movq        %r12, 0x10(%rdi) /* FIBER_REG_R12 */
 8      movq        %r13, 0x18(%rdi) /* FIBER_REG_R13 */
 9      movq        %r14, 0x20(%rdi) /* FIBER_REG_R14 */
10      movq        %r15, 0x28(%rdi) /* FIBER_REG_R15 */
11
12      /* call stores the return address on the stack before jumping */
13      movq        (%rsp), %rcx
14      movq        %rcx, 0x40(%rdi) /* FIBER_REG_RIP */
15
16      /* skip the pushed return address */
17      leaq        8(%rsp), %rcx
18      movq        %rcx, 0x38(%rdi) /* FIBER_REG_RSP */
19
20      // Load context 'to'
21      movq        %rsi, %r8
22
23      // Load callee-preserved registers
24      movq        0x00(%r8), %rbx /* FIBER_REG_RBX */
25      movq        0x08(%r8), %rbp /* FIBER_REG_RBP */
26      movq        0x10(%r8), %r12 /* FIBER_REG_R12 */
27      movq        0x18(%r8), %r13 /* FIBER_REG_R13 */
28      movq        0x20(%r8), %r14 /* FIBER_REG_R14 */
29      movq        0x28(%r8), %r15 /* FIBER_REG_R15 */
30
31      // Load first parameter
32      movq        0x30(%r8), %rdi /* FIBER_REG_RDI */
33
34      // Load stack pointer
35      movq        0x38(%r8), %rsp /* FIBER_REG_RSP */
36
37      // Load instruction pointer, and jump
38      movq        0x40(%r8), %rcx /* FIBER_REG_RIP */
39      jmp         *%rcx
```

With all the above changes, we have introduced an argument into FiberMain. And that single argument allows us to access anything we want from within FiberMain.

通过上述所有更改,我们在 FiberMain 中引入一个参数。这个单一参数允许我们访问 FiberMain 中我们想要的任何内容。

```cpp
void FiberMain(void* arg){
    // do whatever you want to do in this fiber
}
```

Careful readers may already notice a small inefficiency here. Once a fiber has been executed at least once, line 32 is useless. In practice, I doubt this single cycle of instruction has any measurable performance impact at all. Similarly, when a brand-new fiber gains control through the fiber switch call, the instructions between lines 24 and 30 are all useless as well. A solution to this problem is to warm up the fiber upon its creation: switch to the newly created fiber right after creation through a simplified assembly function that omits the callee-saved register loads, and have the fiber immediately switch back to the creating fiber at the beginning of its FiberMain to give control back. The caller code won't even notice such a round trip. We could then remove the instruction loading the first parameter from a separate assembly function that is only used for subsequent fiber switches. For simplicity, my implementation doesn't include this optimization.

细心的读者可能已经注意到这里的一点低效。只要一个纤程已经被执行过,第 32 行就没用了。实际上,我怀疑这一个指令周期是否会有任何可测量的性能影响。与此类似,如果一个全新的纤程通过纤程切换调用获得控制权,那么第 24 到 30 行之间的指令也都是无用的。此问题的一个解决方案是在创建纤程时对其进行预热:创建后立即通过一个省略了被调用者保存寄存器加载的简化汇编函数切换到新创建的纤程,并让该纤程在其 FiberMain 的开头立即切换回创建它的纤程以交回控制权。调用者代码甚至不会注意到这样的往返。然后我们就可以在仅用于后续纤程切换的另一个汇编函数中删除加载第一个参数的指令。为简单起见,我的实现没有实现此优化。

Converting between a Thread and a Fiber

线程和纤程之间的转换

Now we are able to create a fiber and switch to it from a different fiber. There is one question left. Our program starts from a thread, not a fiber. Since we can't switch from a raw thread to a fiber, we need to be able to convert the thread into a fiber first so that the switch can happen.

现在我们可以创建纤程,并从另一个纤程切换到该纤程。还剩下一个问题:我们的程序是从线程而不是纤程开始的。由于无法直接从线程切换到纤程,我们需要先将线程转换为纤程,这样切换才能进行。

In order to do so, we need to implement two different functions, ConvertToFiberFromThread and ConvertToThreadFromFiber. Let's start with the first one.

为此,我们需要实现两个不同的函数 ConvertToFiberFromThreadConvertToThreadFromFiber 。让我们从第一个开始。

Different from fibers newly created by CreateFiber, which are by default in suspended mode, fibers created through ConvertToFiberFromThread are already running when they are 'created', or more precisely, converted. This newly converted fiber should be used as a source fiber that switches to some other fiber. Under no circumstances should we switch from a fiber to a freshly converted fiber produced by ConvertToFiberFromThread; that makes no sense.

与通过 CreateFiber 新创建的、默认处于挂起状态的纤程不同,通过 ConvertToFiberFromThread "创建"(确切地说是转换)的纤程在转换时就已经在运行了。该新转换的纤程应当用作源纤程,由它切换到其他纤程。在任何情况下,我们都不应该从一个纤程切换到由 ConvertToFiberFromThread 刚转换出来的纤程,这样做没有意义。

Taking advantage of this fact, we can see that the default stack pointer or instruction pointer serves no purpose in such a converted fiber. No code will ever read those two members (RIP, RSP) in the fiber context before they are first written by a fiber switch. The same goes for all the other registers in FiberContexInternal. This gives us the freedom to ignore these fields during a thread-to-fiber conversion.

利用这一事实,我们可以看到,默认的堆栈指针或指令指针在这种转换得到的纤程中没有任何作用。在纤程切换首次写入之前,任何代码都不会读取纤程上下文中的这两个成员(RIP、RSP)。对于 FiberContexInternal 中的其他所有寄存器也是如此。这使我们可以在线程到纤程的转换期间自由地忽略这些字段。

```cpp
inline FiberHandle CreateFiberFromThread() {
    Fiber* ptr = (Fiber*)TINY_FIBER_MALLOC(sizeof(Fiber));
    ptr->context = {};
    return ptr;
}
```

Above is the function to convert a thread to a fiber. Apart from allocating the fiber structure's memory (not even a fiber stack), nothing else is done. Again, this is totally fine as this fiber context will never be read before it is written.

上面是将线程转换为纤程的函数。除了分配纤程结构本身的内存(甚至不分配纤程堆栈)之外,什么也没做。同样,这完全没问题,因为这个纤程上下文绝不会在被写入之前被读取。

It is pretty straightforward to figure out that ConvertToThreadFromFiber is simply an empty implementation. An alternative is to delete the fiber in this function to be more consistent with the Windows interface. However, in my own implementation, I hid this interface from the library; the conversion is automatically undone once the fiber's lifetime ends.

很容易看出 ConvertToThreadFromFiber 只是一个空的实现。另一种方案是在该函数中删除纤程,以便与 Windows 的接口更加一致。然而,在我自己的实现中,我把这个接口对库的使用者隐藏了起来;一旦纤程的生命周期结束,转换就会自动撤销。

Deleting a Fiber 删除纤程

Deleting a fiber is the simplest of all the operations above. All we need to do is free the stack memory and the memory for the fiber structure itself.

与上述所有方法相比,删除纤程是最简单的方法。此时我们需要做的就是释放堆栈内存和纤程结构本身的内存。

To point it out again, deleting a fiber can result in a crash if the fiber is still running or will run in the future. It is the programmer's responsibility to make sure that when a fiber is deleted, nothing is using it.

再次强调,如果一个纤程正在运行或将来还会运行,删除它可能导致程序崩溃。程序员有责任确保纤程被删除时,没有任何代码在使用它。
