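Programming Language Performance in the Numerical Solution of Dynamic Macroeconomic Models: Speed Benchmarks and Hybrid Programming Strategies

Note: The figures and tables referenced in the text are missing from this copy. If anything else looks wrong, please consult the original paper.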


A Comparison of Programming Languages in Economics∗

S. Borağan Aruoba†

University of Maryland

Jesús Fernández-Villaverde‡

University of Pennsylvania

August 5, 2014

Abstract

We solve the stochastic neoclassical growth model, the workhorse of modern macroeconomics, using C++11, Fortran 2008, Java, Julia, Python, Matlab, Mathematica, and R. We implement the same algorithm, value function iteration with grid search, in each of the languages. We report the execution times of the codes on a Mac and on a Windows computer and briefly comment on the strengths and weaknesses of each language.

Keywords: Dynamic Equilibrium Economies, Computational Methods, Programming Languages.
JEL classifications: C63, C68, E37.

∗We thank Manuel Amador for his help with making our Python and Mathematica codes more idiomatic, Matthew MacKay and John Stachurski for their help with Numba, basthtage for moving our code to Cython, Matt Dziubinski and Santiago González for alternative implementations of our C++ code, Anastasios Stamulis for porting our code to C#, Luigi Bocola, Gustavo Camilo, and Pablo Cuba-Borda for research assistance, Thomas Jones, Thibaut Lamadon, Florian Oswald, Pau Rabanal, Pablo Winant, and the members of the Julia user group for comments, and the NSF for financial support.

†University of Maryland, aruoba@econ.umd.edu.

‡University of Pennsylvania, NBER, and CEPR, jesusfv@econ.upenn.edu.

1. Introduction

Computation has become a central tool in economics. From the solution of dynamic equilibrium models in macroeconomics or industrial organization, to the characterization of equilibria in game theory, or in estimation by simulation, economists spend a considerable amount of their time coding and running fairly sophisticated software.

And while some effort has been focused on comparing different algorithms for solving common problems in economics (see, for instance, Aruoba, Fernández-Villaverde, and Rubio-Ramírez, 2006), there has been little formal comparison of programming languages. This is surprising because the variety of programming languages keeps growing, and economists are often puzzled about which language best suits their needs.¹ Instead of a suite of benchmarks, researchers must rely on personal experimentation or on "folk wisdom." For example, it is still commonly believed that Fortran is the fastest available language or that C++ is too hard to learn.

In this paper, we take a first step at correcting this unfortunate situation. The target audience for our results is younger economists (graduate students, junior faculty) or researchers who have used the computer less often in the past for numerical analysis and who are searching for guideposts in their first incursions into computation.

We solve the stochastic neoclassical growth model, the workhorse of modern macroeconomics, using C++11, Fortran 2008, Java, Julia, Python, Matlab, Mathematica, and R.² We implement the same algorithm, value function iteration with grid search for optimal future capital, in each of the languages and measure the execution time of the codes on a Mac and on a Windows computer. The advantage of this algorithm is that it is "representative" of many economic computations: expensive loops, large matrices to store in memory, and so on. Thus, while our investigation does not entail a full suite of benchmarks, both our model and our solution method are among the best available choices for it. In addition, our two machines, a Mac and a Windows computer, are the two most popular software development environments among economists (on the Mac we compile and run some of the code from the command line, so the results should be very close to those from equivalent Unix/Linux machines).

In section 4, we report speed results for each language (including several implementations of the same language and different compilers), but here is a brief summary of some of our main findings:

  1. C++ and Fortran are still considerably faster than any other alternative, although one needs to be careful with the choice of compiler.

  2. C++ compilers have advanced enough that, contrary to the situation in the 1990s and to some folk wisdom, C++ code runs slightly faster (5-7 percent) than Fortran code.

  3. Julia delivers outstanding performance: its code is only 2.64 to 2.70 times slower than the executable from the best C++ compiler.

  4. Baseline Python was slow. Using the Pypy implementation, it runs around 44 times slower than C++. Using the default CPython interpreter, the code runs between 155 and 269 times slower than C++.

  5. Matlab is between 9 and 11 times slower than the best C++ executable.

  6. R runs between 475 and 491 times slower than C++. If the code is compiled, it is between 243 and 282 times slower.

  7. Hybrid programming and special approaches can deliver considerable speedups. For example, when combined with Mex files, Matlab is only 1.24 to 1.64 times slower than C++, and when combined with Rcpp, R is between 3.66 and 5.41 times slower. Similar numbers hold in the Python ecosystem for Numba (a just-in-time compiler for Python that uses decorators) and Cython (a static compiler for writing C extensions for Python).

  8. Mathematica is only about three times slower than C++, but only after a considerable rewriting of the code to take advantage of the peculiarities of the language. The baseline version of our algorithm in Mathematica is considerably slower.

¹ This also stands in contrast with work in other fields, such as Prechelt (2000) and Lubin and Dunning (2013), and with web projects such as The Computer Language Benchmarks Game (see http://benchmarksgame.alioth.debian.org/).

² From now on, we drop the year of the language standard unless it is needed.

Some could argue that our results are not surprising as they coincide with the guesses of an experienced programmer. But we regard this comment as a point of strength, not of weakness. It is a validation that our exercise was conducted under reasonably fair conditions. We do not seek to overturn the experience of knowledgeable programmers, but to formalize such experience under well-described and explicitly controlled conditions and to report the information to others.

We do not comment on the difficulty of implementation of the algorithm in each language, for a couple of reasons. First, such difficulty is subjective and depends on the familiarity of a researcher with a particular programming language or perhaps just with his predisposition toward a programming paradigm. Second, to make the comparison as unbiased as possible, we coded the same algorithm in each language without adapting it to the peculiarities of each language (which could reflect more about our knowledge of each language than of its objective virtues).³ Therefore, the final code looks remarkably similar among languages, with the exception of one version of the code in Mathematica. The codes are posted at our github repository https://github.com/jesusfv/Comparison-Programming-Languages-Economics and the reader is invited to gauge that difficulty for himself. The main point of this paper is to provide a measure of the "benefit" in a cost-benefit calculation for researchers who are considering learning a new language. The "cost" part will be subjective.

The rest of the paper is structured as follows. First, in section 2, we introduce our application and algorithm. In section 3 we motivate our selection of programming languages. In section 4, we report our results. Section 5 concludes.

³ It also means that proposals to improve the code should be made for all languages (unless there is an obvious problem with one of them). The game is not to write the best possible C++ code, but to write C++ code that is comparable in computational complexity to, for example, the Matlab code. We are not interested in speed itself, but in relative speed.

2. The Stochastic Neoclassical Growth Model

We pick, for our exercise, the stochastic neoclassical growth model, the foundation of much work in macroeconomics. We solve the model with value function iteration and a grid search for the optimal values of future capital. In that way, we compare programming languages for their ability to handle a task such as value function iteration that appears everywhere in economics and within a well-understood economic environment.

In this model, a social planner picks sequences of consumption $c_t$ and capital $k_t$ to solve

$$\max_{\{c_t, k_{t+1}\}} \mathbb{E}_0 \sum_{t=0}^{\infty} (1-\beta)\beta^{t} \log c_t$$

where $\mathbb{E}_0$ is the conditional expectation operator and $\beta$ is the discount factor. The resource constraint is given by

$$c_t + k_{t+1} = z_t k_t^{\alpha} + (1-\delta) k_t$$

where productivity $z_t$ takes values in a set of discrete points $\{z_1, \ldots, z_n\}$ that evolve according to a Markov transition matrix $\Pi$. The initial conditions, $k_0$ and $z_0$, are given. While, in the interest of space, we have written the model in terms of the problem of a social planner, this is not required and we could deal, instead, with a competitive equilibrium.

For our calibration, we pick $\delta = 1$, which implies that the model has a closed-form solution $k_{t+1} = \alpha\beta z_t k_t^{\alpha}$ and $c_t = (1-\alpha\beta) z_t k_t^{\alpha}$. This will allow us to assess the accuracy of the solution we compute. We are then left only with choosing values for $\beta$ and $\alpha$ and the process for $z_t$. But since $\delta = 1$ is unrealistic, instead of targeting explicit moments of the data, we just pick conventional values for these parameters and processes. For $\beta$ we pick 0.95, for $\alpha$ we pick $1/3$, and for $z_t$ we have a 5-point Markov chain:

$$z_t \in \{0.9792,\ 0.9896,\ 1.0000,\ 1.0106,\ 1.0212\}$$

with transition matrix:

$$
\Pi = \begin{pmatrix}
0.9727 & 0.0273 & 0 & 0 & 0 \\
0.0041 & 0.9806 & 0.0153 & 0 & 0 \\
0 & 0.0082 & 0.9836 & 0.0082 & 0 \\
0 & 0 & 0.0153 & 0.9806 & 0.0041 \\
0 & 0 & 0 & 0.0273 & 0.9727
\end{pmatrix}
$$

The transition matrix is similar to the one that would come from discretizing an AR(1) process for (log) productivity following Tauchen's (1986) procedure, except that we move mass from the diagonal to the upper and lower bands to induce more movements across states and to create a more challenging computation. The relative speed comparisons that we report below are robust to different calibrated values, including values of $\delta < 1$.
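The calibration can be typed in and sanity-checked in a few lines. The sketch below is ours, not code from the authors' repository, and uses only the Python standard library. The helper names (`exact_policy`, `exact_consumption`, `conditional_mean`, `stationary_distribution`) are hypothetical. The script checks that every row of $\Pi$ sums to one and that the closed-form policies exhaust resources when $\delta = 1$. It also computes the deterministic steady state $k_{ss} = (\alpha\beta)^{1/(1-\alpha)}$, which is our derivation from the closed form with $z = 1$ and is about 0.1782. Finally, it finds the ergodic distribution of the chain by power iteration.

```python
# Calibration from the text (delta = 1 is imposed throughout).
ALPHA = 1.0 / 3.0
BETA = 0.95
Z = [0.9792, 0.9896, 1.0000, 1.0106, 1.0212]
PI = [
    [0.9727, 0.0273, 0.0000, 0.0000, 0.0000],
    [0.0041, 0.9806, 0.0153, 0.0000, 0.0000],
    [0.0000, 0.0082, 0.9836, 0.0082, 0.0000],
    [0.0000, 0.0000, 0.0153, 0.9806, 0.0041],
    [0.0000, 0.0000, 0.0000, 0.0273, 0.9727],
]


def exact_policy(k, z):
    """Closed-form next-period capital k' = alpha*beta*z*k**alpha (delta = 1 only)."""
    return ALPHA * BETA * z * k ** ALPHA


def exact_consumption(k, z):
    """Closed-form consumption c = (1 - alpha*beta)*z*k**alpha (delta = 1 only)."""
    return (1.0 - ALPHA * BETA) * z * k ** ALPHA


# Deterministic steady state (z = 1): k solves k = alpha*beta*k**alpha.
K_SS = (ALPHA * BETA) ** (1.0 / (1.0 - ALPHA))


def conditional_mean(values, pi=PI):
    """E[x(z') | z_j] for every current state j, i.e. the product Pi x."""
    return [sum(p * x for p, x in zip(row, values)) for row in pi]


def stationary_distribution(pi=PI, tol=1e-14, max_iter=100_000):
    """Ergodic distribution of the chain by power iteration, mu <- mu Pi."""
    n = len(pi)
    mu = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(mu[i] * pi[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, mu)) < tol:
            return new
        mu = new
    raise RuntimeError("power iteration did not converge")


if __name__ == "__main__":
    print("row sums:", [round(sum(row), 12) for row in PI])
    print("steady-state capital:", K_SS)
    print("E[z' | z]:", conditional_mean(Z))
    print("ergodic distribution:", stationary_distribution())
```

`conditional_mean` performs the same matrix-vector product that value function iteration needs once per iteration, applied to $V(k', \cdot)$ at every grid point $k'$.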

The recursive formulation of this problem in terms of a value function $V(\cdot,\cdot)$ and a Bellman operator is:

$$V(k, z) = \max_{k'} \left\{ (1-\beta)\log\left(z k^{\alpha} - k'\right) + \beta\, \mathbb{E}\left[V(k', z') \mid z\right] \right\}$$

(where we have already imposed $\delta = 1$). We solve this Bellman equation using value function iteration and grid search on $k'$. We take advantage of monotonicity in the policy function and of an envelope condition to avoid unnecessary computations. We use a grid of 17,820 points for $k$, uniformly distributed over ±50 percent around the steady-state value of capital. We chose this grid size so that C++ or Fortran would solve the problem in about one second. Shorter run times would cause large relative measurement error, due to issues such as the state of the cache at any given time. We impose a tolerance of $1.0 \times 10^{-7}$ for convergence. The value function took 257 iterations to converge. We checked that all codes converged in the same number of iterations and that the computed value and policy functions were exactly the same.
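Below is a compact pure-Python sketch of the algorithm. It is an illustration under stated assumptions, not the authors' code. The function name `solve_vfi` and the initial guess $V = 0$ are ours. So is the small grid of 200 points, chosen so that CPython finishes in a second or two. The grid is assumed to run uniformly from 50 to 150 percent of the steady-state capital stock.

The sketch computes $\mathbb{E}[V(k', z') \mid z]$ once per iteration. It starts each search at the previous optimum, which exploits the monotonicity of the policy function. It stops a search as soon as the objective falls, which presumes concavity in $k'$. We use this early stop as our own stand-in and do not claim it is exactly the envelope-condition shortcut mentioned in the text.

```python
import math

ALPHA, BETA = 1.0 / 3.0, 0.95
Z = [0.9792, 0.9896, 1.0000, 1.0106, 1.0212]
PI = [
    [0.9727, 0.0273, 0.0000, 0.0000, 0.0000],
    [0.0041, 0.9806, 0.0153, 0.0000, 0.0000],
    [0.0000, 0.0082, 0.9836, 0.0082, 0.0000],
    [0.0000, 0.0000, 0.0153, 0.9806, 0.0041],
    [0.0000, 0.0000, 0.0000, 0.0273, 0.9727],
]


def solve_vfi(n_grid=200, tol=1e-7, max_iter=2000):
    """Value function iteration with monotone grid search on k'.

    Returns (grid, policy, V, iterations); policy[i][j] is the grid index
    of the optimal k' when k = grid[i] and z = Z[j].
    """
    nz = len(Z)
    k_ss = (ALPHA * BETA) ** (1.0 / (1.0 - ALPHA))
    # Assumed grid: uniform from 50 to 150 percent of steady-state capital.
    step = k_ss / (n_grid - 1)
    grid = [0.5 * k_ss + i * step for i in range(n_grid)]
    output = [[z * k ** ALPHA for z in Z] for k in grid]  # resources, delta = 1

    V = [[0.0] * nz for _ in range(n_grid)]
    policy = [[0] * nz for _ in range(n_grid)]
    for it in range(1, max_iter + 1):
        # E[V(k', z') | z] for every grid value of k' and every current z.
        EV = [[sum(PI[j][jn] * V[i][jn] for jn in range(nz)) for j in range(nz)]
              for i in range(n_grid)]
        V_new = [[0.0] * nz for _ in range(n_grid)]
        for j in range(nz):
            start = 0  # policy is nondecreasing in k: never search below it
            for i in range(n_grid):
                best_val, best_idx = -math.inf, start
                for ip in range(start, n_grid):
                    c = output[i][j] - grid[ip]
                    if c <= 0.0:
                        break  # infeasible: consumption must be positive
                    val = (1.0 - BETA) * math.log(c) + BETA * EV[ip][j]
                    if val > best_val:
                        best_val, best_idx = val, ip
                    else:
                        break  # concave objective: we are past the peak
                V_new[i][j] = best_val
                policy[i][j] = best_idx
                start = best_idx
        diff = max(abs(V_new[i][j] - V[i][j])
                   for i in range(n_grid) for j in range(nz))
        V = V_new
        if diff < tol:  # sup-norm convergence criterion
            return grid, policy, V, it
    raise RuntimeError("value function iteration did not converge")
```

With only 200 grid points, the computed policy should differ from the closed form $k' = \alpha\beta z k^{\alpha}$ by roughly one grid step, a fraction of a percent. That is much larger than the error reported for 17,820 points. Raising `n_grid` shrinks the error at the cost of run time.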

In Figure 1, we plot the value function (top panel) and the policy function for next-period capital (middle panel) along the capital dimension, with each color representing a different value of $z_t$. As expected, the value and policy functions are increasing and concave. In the bottom panel, we also plot the percentage difference between the exact and the approximated policy function for capital. The maximum error is only -0.0059 percent, which illustrates the high accuracy achieved with 17,820 points for $k$.
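A quick back-of-the-envelope check connects the grid size to this accuracy figure. It rests on two assumptions of ours. First, the grid runs uniformly from 50 to 150 percent of the steady-state capital stock. Second, grid search picks a point within one grid step of the true optimum, which holds for a concave objective with an exact value function. Under these assumptions, the percentage error can be at most one step divided by the smallest closed-form $k'$ on the grid, which is attained at the lowest $k$ and $z$. The names below are hypothetical, and the sign convention of `pct_error` (approximate minus exact) is our choice.

```python
ALPHA, BETA = 1.0 / 3.0, 0.95
Z_MIN = 0.9792
N_GRID = 17_820

K_SS = (ALPHA * BETA) ** (1.0 / (1.0 - ALPHA))
K_LO, K_HI = 0.5 * K_SS, 1.5 * K_SS   # assumed grid bounds
STEP = (K_HI - K_LO) / (N_GRID - 1)    # uniform spacing, about 1e-5


def exact_policy(k, z):
    """Closed-form k' when delta = 1."""
    return ALPHA * BETA * z * k ** ALPHA


def pct_error(k_next_approx, k, z):
    """Percentage error of an approximate k' relative to the closed form."""
    exact = exact_policy(k, z)
    return 100.0 * (k_next_approx - exact) / exact


# The closed-form k' is smallest at (K_LO, Z_MIN); one grid step relative to
# that value bounds the percentage error if the search lands within one step.
ONE_STEP_BOUND = 100.0 * STEP / exact_policy(K_LO, Z_MIN)

if __name__ == "__main__":
    print(f"grid step: {STEP:.3e}")
    print(f"one-step error bound: {ONE_STEP_BOUND:.4f} percent")
```

Under these assumptions the bound is about 0.0072 percent. That is the same order of magnitude as the maximum error of -0.0059 percent reported in the text.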

3. Selection of Programming Languages

Since Fortran came around in 1957, hundreds of programming languages have been created. Even limiting ourselves to languages that have acquired a solid user base circa 2014, we need to choose among dozens of them.

Fortunately, the task is simpler than it seems. There is little point to picking languages such as Perl or PHP, neither of which is particularly suited to, nor widely used for scientific computing. Also, many languages are close relatives of each other and one member of the family will suffice for our comparison. With our choices of languages, we cover a wide range of possibilities and, with the exception of the functional programming languages discussed below, we feel we have covered all the obvious choices for numerical computation.

3.1. Compiled Languages

Among compiled languages, we select C++, Fortran, and Java. C++ is, perhaps, the most powerful of the widely used languages. Together with C and Objective-C, it constitutes the backbone of much of the modern computing world. According to the well-cited TIOBE Index of programming language popularity (May 2014 edition), C is ranked number 1, Objective-C number 3, and C++ number 4, with a total rating of 34.70 percent.⁴ Our C++ code does not use any specific C++ features such as objects. Thus, the C and Objective-C codes (which can be found on the github page) are nearly equivalent. We also checked that the run times of the C and Objective-C codes were nearly identical to that of the C++ code. Thus, we only report the C++ running time.⁵

Two other relatives of C++ are C# and D. C# is widely used in industry (it is ranked 6th in the TIOBE Index with 3.75 percent). However, the design considerations that make C# attractive for commercial applications also make it slower for numerical computation. Thus, it is rarely employed for the tasks we are concerned with in this paper.⁶ D usually generates code of roughly the same speed as C++ but is less popular (ranked 26th with 0.60 percent). Including all five languages, the C family accumulates a popularity index of 39.04 percent. Swift, a proposed replacement for Objective-C, is not designed, at this moment, as a programming language for use outside OS X and iOS.

Fortran, the oldest language of all, still maintains a significant presence in high-performance scientific computing and among economists. Its latest incarnation, Fortran 2008, is updated with modern features and innovations such as coarrays. Reflecting this niche nature of Fortran, the TIOBE Index ranks it 32nd, with a 0.42 percent popularity rating.

Java is a common vehicle for undergraduate education and the availability of the Java Virtual Machine in practically all computer environments makes it an attractive choice. In the TIOBE Index, it is ranked 2nd with a popularity of 5.99 percent.

The performance of compiled languages also depends on the compiler used to generate the executable files.⁷ Thus, we select several compilers. For C++ on the Mac, we pick GCC, Intel C++, and Clang (which shares LLVM, the Low Level Virtual Machine, with Xcode and delivers nearly identical speed). For C++ on the Windows machine, we pick GCC, Intel C++, and Visual C++. For Fortran, on both machines, we select GCC and Intel Fortran.⁸ For Java, we select the standard Oracle JDK.

⁴ See the definition and interpretation of the index at http://www.tiobe.com/index.php/content/paperinfo/tpci/tpci_definition.htm.

⁵ For a comparison of syntaxes, see Hyperpolyglot at http://hyperpolyglot.org/cpp.

⁶ Anastasios Stamulis reported that a straightforward conversion of our C++ to C# is around 20 percent slower.

⁷ See, for example, the comparison at http://www.polyhedron.com/fortran-compiler-comparisons.

⁸ We could not find data on compilers' market share, but our picks include the most popular compilers in user forums. Our experience with other compilers, such as PGI, has been less satisfactory in terms of the speed of the generated executables.

3.2. Scripting Languages

We pick as our scripting languages Matlab, Mathematica, R, Python, and Julia. Matlab, Mathematica, and R are sufficiently known among economists that it is not necessary to elaborate on our choice.

Python is an elegant open-source language that has become popular in the scientific community (see Sargent and Stachurski, 2014), in particular the 2.7 version.⁹ Since there are different implementations of Python, we select CPython, the default Python interpreter that comes with Mac and Linux machines, and Pypy (http://pypy.org/), a speed-oriented replacement virtual machine that uses a just-in-time compiler. Our Python code for CPython and Pypy was exactly the same and it uses the Numpy library for matrix operations.

Julia (http://julialang.org/) is a new open-source high-performance programming language with a syntax very close to Matlab's, Lisp-style macros, and many other modern features. For speed, it uses a just-in-time compiler based on LLVM. Three particularly attractive features of Julia are:

  1. Julia's default typing system is dynamic (to facilitate fast coding), but it is possible to indicate the type of certain values to avoid type-instability problems that often decrease speed in dynamically typed languages.
  2. Julia can call C or Fortran functions without wrappers or APIs.
  3. Julia has a library that imports Python modules and provides wrappers for all of the functions in them.

We did not test Octave, an open-source clone of Matlab, because it is well known to be considerably slower than Matlab. In preliminary testing, we found that Gauss (http://www.aptech.com/) was roughly seven times slower than Matlab. Thus, we decided not to include it in our exercise.

⁹ A key advantage of Python is the existence of libraries such as Numpy, Scipy, SymPy, MatPlotLib, and pandas and of shells such as IPython. Some of these libraries have only been partially ported to Python 3+.

3.3. Functional Programming Languages

The big missing items in our list of languages are those that belong to the functional programming family that inherits the insights from Lisp. In a companion paper (Amador, Aruoba, Fernández-Villaverde, 2014), we elaborate on the advantages of functional programming for economics and explain how to extend our benchmark investigation to functional languages such as Ocaml or Haskell. Since this comparison involves a number of issues of its own, we prefer to avoid them here to keep the paper focused.¹⁰

3.4. Hybrid and Special Approaches

Most languages allow for mixed programming. This is particularly useful in Matlab and R, where one can send the computationally intensive parts of the code to C++ and keep the rest in an easier scripting language. Thus, in addition to "pure" Matlab and R, we also use Mex files, where part of the code is written in C++ and compiled before the Matlab code runs, and Rcpp, an R package that facilitates the integration of R and C++. In both cases, we send to C++ the Bellman operator that updates the value function, which consumes nearly all the computing time.

As with Matlab and R, in Python we compile the Bellman operator that updates the value function. We do so using two different approaches:

  1. Numba, a just-in-time compiler that uses decorators to compile Python to LLVM.
  2. Cython, a compiler that converts type-annotated Python into generated C code that can be imported as a module.¹¹
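The Numba route can be sketched as follows. The kernel below is our illustration, not the authors' code. It applies the Bellman operator once, using a monotone, early-stopping grid search, and is decorated with Numba's `njit` when Numba is installed. If Numba is missing, a no-op stand-in decorator (our addition) runs the identical function under plain CPython, which makes the speedup easy to measure on one's own machine. The Numba branch uses NumPy arrays because nopython mode handles nested Python lists poorly. The helpers `to_float_array` and `int_matrix` are hypothetical.

```python
import math

try:  # Use Numba when it is installed...
    import numpy as np
    from numba import njit

    def to_float_array(rows):
        return np.array(rows, dtype=np.float64)

    def int_matrix(n_rows, n_cols):
        return np.zeros((n_rows, n_cols), dtype=np.int64)

except ImportError:  # ...otherwise run the very same kernel under CPython.
    def njit(func):
        return func

    def to_float_array(rows):
        return [[float(x) for x in r] if isinstance(r, (list, tuple)) else float(r)
                for r in rows]

    def int_matrix(n_rows, n_cols):
        return [[0] * n_cols for _ in range(n_rows)]


@njit
def bellman_update(grid, output, ev, beta, v_new, policy):
    """Apply the Bellman operator once, writing into v_new and policy.

    grid[i]      : capital grid (increasing)
    output[i][j] : z_j * k_i**alpha, resources when delta = 1
    ev[i][j]     : E[V(k'_i, z') | z_j], precomputed by the caller
    """
    n_grid = len(grid)
    n_z = len(output[0])
    for j in range(n_z):
        start = 0  # monotone policy: search k' only from the last optimum up
        for i in range(n_grid):
            best_val = -math.inf
            best_idx = start
            for ip in range(start, n_grid):
                c = output[i][j] - grid[ip]
                if c <= 0.0:
                    break  # infeasible choice of k'
                val = (1.0 - beta) * math.log(c) + beta * ev[ip][j]
                if val > best_val:
                    best_val = val
                    best_idx = ip
                else:
                    break  # concave objective: past the maximum
            v_new[i][j] = best_val
            policy[i][j] = best_idx
            start = best_idx
```

The caller is expected to precompute `ev` and to loop until convergence. Only the hot loop is compiled, which mirrors the idea of sending just the Bellman operator to compiled code.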

Finally, we have Mathematica. Although Mathematica allows for multiparadigm programming (including our imperative algorithm), its kernel strongly prefers a more functionally oriented approach. Thus, we will also use its Compile function plus a rewriting of the code to take advantage of the peculiarities of the language. While this would make the results from this last computation hard to interpret, some readers may find them of interest.

¹⁰ We did, though, run an experiment with Scala, a "trendy" language that allows for multi-paradigm programming by integrating imperative, object-oriented, and functional features. Not surprisingly, our Scala code built with imperative features runs at roughly the same speed as the Java code, because Scala compiles to Java bytecode that runs on the Java Virtual Machine. Thus, we decided not to include it in our results.

¹¹ The Python ecosystem is incredibly rich and continuously expanding. Thus, it is well beyond our abilities to survey every single possibility. The interested reader can check a list of compilers at http://compilers.pydata.org/.

4. Results

We report our results in Table 1. It shows the average run time of each code and its performance relative to the best performer in each group (C++ with GCC on the Mac machine and C++ with Visual C++ on the Windows machine). For codes that run in less than 60 seconds, we average 10 runs (after a warm-up) to smooth out small differences caused by the operating system. For codes that run in more than 60 seconds, we report only one run, as any small difference does not have a material effect on relative performance. Also, we report elapsed (wall-clock) time rather than CPU time. The exception is R, where we report user time to avoid the overhead of the REPL shell.¹² At the bottom of the table, separated by a double line, we report the hybrid and special cases: Matlab with Mex files, R with Rcpp, Numba, Cython, and the rewrite of Mathematica.
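The timing protocol can be sketched as a small helper. This is an illustration under our own assumptions, not the script used for Table 1, and `benchmark` and `relative_performance` are hypothetical names. The first call doubles as the warm-up, absorbing imports, JIT compilation, and cache effects. If that first call exceeds 60 seconds, it is reported as the single run. Otherwise, the helper returns the mean of 10 further runs. `time.perf_counter()` measures elapsed wall-clock time. `time.process_time()` would instead measure the process's CPU time (user plus system), which is closer in spirit to the user time reported for R.

```python
import time
from statistics import mean


def benchmark(func, runs=10, long_run_threshold=60.0):
    """Average elapsed (wall-clock) time of func() in seconds.

    The first call serves as the warm-up. If it already takes longer than
    long_run_threshold seconds, that single run is reported; otherwise the
    mean of `runs` further calls is returned.
    """
    start = time.perf_counter()
    func()
    first = time.perf_counter() - start
    if first > long_run_threshold:
        return first  # slow codes: one run is enough for relative speed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return mean(samples)


def relative_performance(timings):
    """Express each timing as a multiple of the fastest one."""
    best = min(timings.values())
    return {name: t / best for name, t in timings.items()}
```

Passing a dictionary of averages from `benchmark` to `relative_performance` yields ratios like those in the "relative" columns. The fastest code maps to 1.0.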

Our first result is that C++ and Fortran still maintain a considerable speed advantage with respect to all other alternatives. For example, Java is between 2.10 and 2.69 times slower than C++, Matlab around 10 times slower, and the Pypy implementation of Python around 48 times slower.

Second, C++ compilers have advanced enough that, contrary to the situation in the 1990s, C++ code runs slightly faster (5-7 percent) than Fortran code. The many other strengths of C++ in terms of capabilities (full object orientation, template meta-programming, lambda functions, a large user base) make it an attractive language for graduate students to learn. On the other hand, Fortran is simple and compact, and thus relatively easy to learn, and it can take advantage of large amounts of legacy code.

Third, even for our very simple code, there are noticeable differences among compilers. We find speed improvements of more than 100 percent between different executables of the same underlying code (built with equivalent compilation flags). The open-source GCC compilers outperform the Intel compilers in a Mac/Unix/Linux environment, for which GCC has been explicitly developed, but they do less well on a Windows machine. The weaker performance of the Clang compiler was expected, given that the goal of the LLVM infrastructure behind it is to minimize compilation time and executable file size. Both are important goals when developing general-use applications but often (though not always!) less relevant for numerical computation.

第三个研究结果为:即便对于我们编写的简单代码,不同编译器的编译效果也存在显著差异。我们发现,同一代码在使用相同编译标志的情况下,不同编译器生成的可执行文件运行速度差异可达 100% 以上。开源的 GCC 编译器在为其量身打造的 Mac/Unix/Linux 系统中表现优于 Intel 编译器,但在 Windows 系统中的表现则较差。Clang 编译器的性能表现不佳符合预期,因为其底层的 LLVM 架构的设计目标是最小化编译时间和可执行文件体积,这两个目标在通用应用开发中至关重要,但在数值计算中往往(并非总是)无关紧要。

Fourth, Java is between 2.2 and 2.69 times slower than C++. This difference in speed, together with Java's issues with floating-point arithmetic in high-performance scientific computation, suggests that there is no obvious advantage in choosing Java over C++ unless portability across platforms or the wide availability of Java programmers is an important factor.

第四个研究结果为:Java 的运行速度比 C++ 慢 2.2 至 2.69 倍。除了速度差距,Java 在高性能科学计算的浮点运算中也存在问题,因此除非跨平台可移植性或 Java 程序员的易获取性是关键考量因素,否则选择 Java 相较于 C++ 并无明显优势。

Julia, with its just-in-time compiler, delivers outstanding performance. It runs only 2.64 to 2.70 times slower than C++. Julia is slightly faster than Java and close to 4 times faster than Matlab. Many researchers may want to learn more about it, given three features: its syntax is close to Matlab's, it is open source, and it has been designed from scratch for easy parallelization. However, Julia's standard is still evolving (which may cause backward incompatibilities in the future), and only a few libraries are available for it at the moment.

Julia 语言凭借其即时编译器表现出色,运行速度仅比 C++ 慢 2.64 至 2.70 倍,略快于 Java,且约为 Matlab 的 4 倍。Julia 的语法与 Matlab 高度相似,同时兼具开源特性,且从设计之初就考虑了并行化的便捷性,这些都使其成为众多研究者值得深入学习的语言。但目前 Julia 的语言标准仍在发展中(未来可能存在向后兼容性问题),且相关类库数量较少。

Matlab runs between 9 and 11 times slower than the best C++ executable. The difference in performance between compiled languages and this widely used scripting language seems to have stabilized over the last decade.

Matlab 的运行速度比最优 C++ 可执行文件慢 9 至 11 倍。在过去十年中,编译型语言与这款应用广泛的脚本语言之间的性能差距似乎已趋于稳定。

With the PyPy implementation, the Python code runs around 44 to 45 times slower than C++. With the "traditional" implementation of Python (often called CPython), the code runs between 155 and 269 times slower than C++. Other benchmarks have found similar results. For example, the Computer Language Benchmarks Game finds many examples where Python is over 100 times slower than C++.¹³

PyPy 版本 Python 代码的运行速度约为 C++ 的 1/44 至 1/45,而传统的 CPython 版本 Python 代码的运行速度仅为 C++ 的 1/155 至 1/269。其他基准测试也得出了类似的结果,例如 Computer Language Benchmarks Game(计算机语言基准测试游戏)中就有诸多 Python 运行速度比 C++ 慢 100 倍以上的案例 ¹³。

R runs between 475 and 491 times slower than C++. Its performance improves somewhat (to between 243 and 281 times slower) if the R code is compiled with R's compiler package. This poor performance is well understood in the R community. It is due, in part, to some choices made in the original design of R back in the 1990s, when nobody could have foreseen its future success. In fact, there are several initiatives to increase R's speed, such as pqR, Renjin, and Riposte.¹⁴ Of course, the strength of R lies in statistics and econometrics, where the overwhelming richness of existing packages (over 5,500 in the CRAN repository as of May 2014) makes it an outstanding alternative. Mathematica, in its imperative version, runs up to 809 times slower than C++.

R 语言的运行速度为 C++ 的 1/475 至 1/491;若使用 R 编译器包对代码进行编译,性能会略有提升,运行速度为 C++ 的 1/243 至 1/281。R 语言的性能短板在其社区内已是共识,部分原因在于上世纪 90 年代 R 语言的初始设计并未预见其未来的成功,导致部分设计选择影响了运行速度。事实上,目前已有多项提升 R 语言速度的研究计划,如 pqR、Renjin 和 Riposte¹⁴。当然,R 语言的核心优势在于统计学和计量经济学领域,其拥有极为丰富的第三方包(截至 2014 年 5 月,CRAN 仓库中的包数量已超过 5500 个),这使其成为该领域的绝佳选择。而命令式编程版本的 Mathematica 代码运行速度最慢,可达 C++ 的 1/809。

We now move to the hybrid and special cases. When we use a Mex file written in C++, Matlab runs between 1.29 and 1.64 times slower than C++. When we use Rcpp in R, the resulting code runs between 3.66 and 5.41 times slower than C++. While the Mex files were faster, we found Rcpp elegant and easy to use. These numbers suggest that a researcher can use the friendly environment of Matlab or R for everyday tasks (data handling, plots, etc.) and rely on Mex files or Rcpp for the heavy computations, especially those involving loops.

接下来分析混合编程和特殊优化的测试结果:Matlab 结合 C++ 编写的 Mex 文件使用时,运行速度仅比 C++ 慢 1.29 至 1.64 倍;R 结合 Rcpp 包使用时,运行速度比 C++ 慢 3.66 至 5.41 倍。尽管 Mex 文件的提速效果更显著,但 Rcpp 包的设计更简洁易用。这些结果表明,研究者可利用 Matlab 或 R 友好的操作环境完成日常工作(如数据处理、绘图等),而将计算密集型任务,尤其是涉及大量循环的任务,通过 Mex 文件或 Rcpp 包交由 C++ 处理。

In the Python world, Numba-decorated code runs between 1.57 and 1.62 times slower than the best C++ executable, and Cython code runs between 1.41 and 2.49 times slower than C++. Both approaches deliver excellent performance.¹⁵

在 Python 生态中,经 Numba 装饰器优化后的代码运行速度仅比最优 C++ 可执行文件慢 1.57 至 1.62 倍,经 Cython 优化后的代码运行速度比 C++ 慢 1.41 至 2.49 倍,两种优化方法均展现出优异的性能 ¹⁵。
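Footnote 15 reports that Numba was slower than PyPy unless the Bellman equation was compiled as a separate function. The sketch below shows that pattern. It assumes NumPy is installed and Numba optionally so: the function `bellman_step` is hypothetical and decorated with `numba.njit`, and a no-op fallback keeps it runnable as plain Python when Numba is missing. For simplicity it searches the whole k' grid naively. It is not the authors' code.

```python
import math

import numpy as np

try:
    from numba import njit
except ImportError:  # fallback: run the same function uncompiled
    def njit(func):
        return func


@njit
def bellman_step(v, transition, grid_k, z_vals, alpha, beta):
    """One application of the Bellman operator with a naive search over k'.

    v[m, j] holds V^{n-1}(k_m, z_j); transition[j, jp] = Pr(z' = z_jp | z = z_j).
    Returns the updated values and the grid index of the optimal k'.
    """
    n_k, n_z = v.shape
    # Expected continuation value E[V(k'_m, z') | z_j] for every (m, j).
    ev = np.zeros((n_k, n_z))
    for m in range(n_k):
        for j in range(n_z):
            for jp in range(n_z):
                ev[m, j] += transition[j, jp] * v[m, jp]

    tv = np.empty((n_k, n_z))
    policy = np.zeros((n_k, n_z), dtype=np.int64)
    for j in range(n_z):
        for i in range(n_k):
            output = z_vals[j] * grid_k[i] ** alpha
            best = -np.inf
            for m in range(n_k):
                consumption = output - grid_k[m]
                if consumption <= 0.0:
                    break  # grid is increasing: larger k' are infeasible too
                value = (1.0 - beta) * math.log(consumption) + beta * ev[m, j]
                if value > best:
                    best = value
                    policy[i, j] = m
            tv[i, j] = best
    return tv, policy


if __name__ == "__main__":
    # Illustrative (assumed) parameters, not the paper's calibration.
    grid_k = np.linspace(0.09, 0.27, 50)
    z_vals = np.array([0.98, 1.02])
    transition = np.array([[0.9, 0.1], [0.1, 0.9]])
    v = np.zeros((50, 2))
    for _ in range(20):
        v, policy = bellman_step(v, transition, grid_k, z_vals, 1 / 3, 0.95)
```

The outer `while` loop (convergence check) stays in ordinary Python. Only the hot, loop-heavy Bellman step is handed to the JIT compiler.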

Mathematica is a particular case. If we only use the function Compile to compile the Bellman operator, performance improves, but not dramatically. If, instead, we both rewrite the code to have a more functionally oriented structure and use the function Compile, Mathematica runs between 1.67 and 2.22 times slower than C++. As we mentioned before, we do not emphasize this performance, since the code was tuned to Mathematica's requirements, something we did not do for the other languages.

Mathematica 是一个特殊的案例:若仅使用 Compile 函数编译贝尔曼算子,其性能会有所提升,但提升效果并不显著;而若同时将代码重写为更贴合函数式编程的结构并使用 Compile 函数优化,Mathematica 的运行速度仅比 C++ 慢 1.67 至 2.22 倍。正如我们此前所述,我们并不强调这一优化后的性能表现,因为该代码是针对 Mathematica 的特性专门调整的,而我们并未对其他语言的代码进行此类定制化优化。

We close with three caveats about our exercise. First, our computational task (value function iteration with grid search, monotonicity in the decision rule, and an envelope condition) is not well suited to vectorization. The argument is explained in detail in the appendix. In particular, nesting a while loop with three for loops and an if control statement is far from a poor programming choice: it saves considerable execution time. We have vectorized versions of our code in Matlab and R (languages that often profit from vectorization) that run slower than the baseline codes. Furthermore, their performance deteriorates further as the number of grid points increases.

最后,我们对本次研究提出三点说明:第一,本次研究的计算任务(结合网格搜索的价值函数迭代,同时利用决策规则的单调性和包络条件)并不适合向量化优化,相关原因将在附录中详细说明。具体而言,将 while 循环、三层 for 循环和 if 条件判断嵌套使用,绝非拙劣的编程选择,反而能大幅节省运行时间。我们曾在 Matlab 和 R(两类通常能从向量化中显著获益的语言)中编写了向量化版本的代码,但其运行速度反而慢于基础版本;此外,随着网格点数量的增加,向量化版本的性能衰减会更加明显。

Second, we did not try to take advantage of the particular features of each programming language. For example, we do not change the order of iteration of the loops over rows and columns to suit the preference of each language, a choice that degrades, for instance, the performance of C++. In our personal assessment, such tuning would make the comparison extremely cumbersome. However, the reader is invited to look at our GitHub page and check how much her favorite language can improve when coded by an expert.

第二,我们并未尝试利用各编程语言的专属特性进行优化(例如,未根据各语言的内存访问偏好调整行列的循环迭代顺序,而这种调整本可以改善 C++ 等语言的运行性能)。在我们看来,若针对各语言进行定制化优化,会使本次对比研究变得极为复杂。但读者可前往我们的 GitHub 页面查看代码,并自行测试将代码由专业人员针对某一语言优化后,其性能能提升多少。

Third, and also beyond the scope of this paper, we do not compare how easy it is to parallelize code written in each language. This may be an important factor. Some languages, such as Julia, have been designed from scratch to be easy to parallelize. Others have more issues with it, for example Python, because of its Global Interpreter Lock, which synchronizes the execution of threads.

第三,本次研究也未对比各语言代码的并行化实现难度,而这一因素可能至关重要。部分语言如 Julia 从设计之初就考虑了并行化的便捷性,而其他语言的并行化则存在较多问题(例如 Python,其全局解释器锁会同步线程的执行,限制了并行化效果),这一问题超出了本文的研究范围。

¹² The details of each machine and the compilation instructions are reported in the appendix.

¹² 各测试计算机的详细参数和编译指令见附录。

¹³ See also Lubin and Dunning (2013), https://modelingguru.nasa.gov/docs/DOC-1762, or http://wiki.scipy.org/PerformancePython.

¹³ 相关研究还可参见 Lubin 和 Dunning(2013)、https://modelingguru.nasa.gov/docs/DOC-1762 或 http://wiki.scipy.org/PerformancePython 。

¹⁴ See the discussions and speed tests in http://www.pqr-project.org/, http://www.renjin.org/, and https://github.com/jtalbot/riposte.

¹⁴ 相关讨论和速度测试见 http://www.pqr-project.org/ 、http://www.renjin.org/ 和 https://github.com/jtalbot/riposte 。

¹⁵ Without compiling the Bellman equation as a separate function, Numba was slower than PyPy.

¹⁵ 若未将贝尔曼方程编译为独立函数,Numba 的优化效果会弱于 PyPy。

5. Concluding Remarks

5. 结论

In this short paper, we have taken a first step toward a comparison of programming languages in economics. Our focus on speed should not be taken to mean that speed is the only important metric for comparing languages. Other issues (ease of programming, availability of auxiliary tools, or vibrant communities of fellow programmers) should be considered as well. Also, a single researcher can use different programming languages to address different problems (for example, a complicated value function iteration in C++ and a statistical analysis of some data in R).

本文是经济学领域中编程语言对比研究的一次初步尝试。需要说明的是,我们将运行速度作为核心对比指标,并不意味着它是编程语言对比的唯一重要标准,其他因素如编程便捷性、配套工具的丰富度、活跃的开发者社区等也应纳入考量。此外,研究者可根据不同的研究任务选择不同的编程语言 ------ 例如,使用 C++ 求解复杂的价值函数迭代问题,使用 R 进行数据的统计分析。

Nevertheless, speed has three inherent advantages. First, it is easier to measure. Second, speed comparisons give us an indication of the potential benefits for researchers from mastering a new programming language. Third, many real-life applications in macro are considerably more computationally intensive than our simple exercise. As we increase, for example, the number of state variables or we nest a value function iteration in an estimation loop, speed considerations become central for many research projects where the code may take weeks to run.

尽管如此,运行速度仍具有三项内在优势:第一,运行速度的量化测量更为简便;第二,通过速度对比,研究者能清晰判断掌握一门新编程语言所能带来的潜在性能收益;第三,宏观经济学中许多实际研究任务的计算复杂度远高于本次研究的测试任务 ------ 例如,当增加状态变量的数量,或在估计循环中嵌套价值函数迭代时,代码的运行时间可能长达数周,此时运行速度就成为诸多研究项目的核心考量因素。

Our simple exercise leaves many questions unanswered. For example: How do our results extend to other problems, such as those in econometrics? Are there improvements to our algorithm that would benefit one programming language much more than the others? Can we rearrange loops in ways that change relative speeds? Nevertheless, our results in and of themselves should be of interest to a wide audience of researchers. We hope to see more comparisons of programming languages in economics in the future. We also welcome a discussion of our coding choices through our GitHub repository, where readers can fork their own versions of our programs.

本次简单的测试研究仍留下诸多待解答的问题:例如,本文的研究结果是否适用于计量经济学等其他领域的计算任务?对本文所使用算法的优化是否会使某一编程语言的性能提升远高于其他语言?调整循环的结构是否会改变各语言间的相对运行速度?尽管如此,本文的研究结果本身仍能为广大研究者提供参考。我们期待未来经济学领域能出现更多编程语言的对比研究,也欢迎读者通过我们的 GitHub 仓库对本次研究的代码选择展开讨论,读者也可在该仓库中复刻代码并进行个性化修改。

References

参考文献

[1] Amador, M., S.B. Aruoba, and J. Fernández-Villaverde (2014). "Functional Programming in Economics." Mimeo in preparation.
[1] Amador, M.、Aruoba, S.B.、Fernández-Villaverde, J.(2014),《经济学中的函数式编程》,工作论文(撰写中)。

[2] Aruoba, S.B., J. Fernández-Villaverde, and J. Rubio-Ramírez (2006). "Comparing Solution Methods for Dynamic Equilibrium Economies." Journal of Economic Dynamics and Control 30, 2477-2508.
[2] Aruoba, S.B.、Fernández-Villaverde, J.、Rubio-Ramírez, J.(2006),《动态均衡经济的求解方法对比》,《经济动态与控制期刊》,第 30 卷,2477-2508 页。

[3] Lubin, M. and I. Dunning (2013). "Computing in Operations Research using Julia." Mimeo, MIT.
[3] Lubin, M.、Dunning, I.(2013),《基于 Julia 的运筹学计算》,麻省理工学院工作论文。

[4] Prechelt, L. (2000). "An Empirical Comparison of Seven Programming Languages." IEEE Computer 33 (10), 23-29.
[4] Prechelt, L.(2000),《七种编程语言的实证对比》,《电气和电子工程师协会计算机期刊》,第 33 卷(第 10 期),23-29 页。

[5] Sargent, T. and J. Stachurski (2014). Quantitative Economics. Mimeo, http://quant-econ.net/_static/pdfs/quant-econ.pdf.
[5] Sargent, T.、Stachurski, J.(2014),《量化经济学》,工作论文,http://quant-econ.net/_static/pdfs/quant-econ.pdf 。

[6] Tauchen, G. (1986). "Finite State Markov-chain Approximations to Univariate and Vector Autoregressions." Economics Letters 20, 177-181.
[6] Tauchen, G.(1986),《单变量和向量自回归的有限状态马尔可夫链近似》,《经济学通讯》,第 20 卷,177-181 页。

### 6. Appendix

### 6. 附录

#### 6.1. Compilation Flags

#### 6.1. 编译标志

Our Mac machine had an Intel Core i7 processor @2.3 GHz with 4 physical cores and 16 GB of RAM, running OS X 10.9.2. Our Windows machine had an Intel Core i7-3770 CPU @3.40 GHz with 4 physical cores, hyperthreading, and 12 GB of RAM, running Windows 7 Ultimate SP1.

本次测试使用的 Mac 计算机配置为:英特尔酷睿 i7 处理器(主频 2.3GHz),4 个物理核心,16GB 运行内存,操作系统为 OS X 10.9.2。Windows 计算机配置为:英特尔酷睿 i7-3770 处理器(主频 3.40GHz),4 个物理核心,支持超线程技术,12GB 运行内存,操作系统为 Windows 7 旗舰版 SP1。

The compilation flags were:

所使用的编译标志如下:

1. GCC compiler (Mac) / GCC 编译器(Mac 系统):`g++ -o testc -O3 RBC_CPP.cpp`
2. GCC compiler (Windows) / GCC 编译器(Windows 系统):`g++ -Wl,--stack,4000000, -o testc -O3 RBC_CPP.cpp`
3. Clang compiler / Clang 编译器:`clang++ -o testclang -O3 RBC_CPP.cpp`
4. Intel compiler / Intel 编译器:`icpc -o testc -O3 RBC_CPP.cpp`
5. Visual C++ compiler / Visual C++ 编译器:`cl /F 4000000 /o testvcpp /O2 RBC_CPP.cpp`
6. GCC Fortran compiler / GCC Fortran 编译器:`gfortran -o testf -O3 RBC_F90.f90`
7. Intel Fortran compiler / Intel Fortran 编译器:`ifort -o testf -O3 RBC_F90.f90`
8. Java compiler / Java 编译器:`javac RBC_Java.java`, run as / 运行命令为 `java RBC_Java -XX:+AggressiveOpts`

#### 6.2. Vectorization and the Properties of the Solution

#### 6.2. 向量化与解的性质

In the main paper, we use value function iteration with grid search over capital as our solution method. In particular, we take a value function $V^{n-1}(k, z)$ and apply the Bellman operator:

在正文中,我们采用的求解方法是对资本存量进行网格搜索的价值函数迭代法。具体而言,给定价值函数 $V^{n-1}(k, z)$,应用贝尔曼算子:

$$
V^{n}(k, z)=\max_{k'}\,(1-\beta) \log \left(z k^{\alpha}-k'\right)+\beta\, \mathbb{E}\left[V^{n-1}\left(k', z'\right) \mid z\right]
$$

and we get a new value function $V^{n}(k, z)$. Using standard arguments, one can show that, for any initial $V^{0}(k, z)$, $V^{n}(k, z) \to V(k, z)$ in the sup norm as $n \to \infty$.

得到新的价值函数 $V^{n}(k, z)$。通过标准的理论推导可证明,对于任意初始价值函数 $V^{0}(k, z)$,当迭代次数 $n \to \infty$ 时,$V^{n}(k, z)$ 会在一致范数(上确界范数)下收敛到真实价值函数 $V(k, z)$。
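The "standard arguments" are the contraction mapping theorem. The following is a sketch of the textbook argument, added for completeness. Write $T$ for the Bellman operator, so that $V^{n} = TV^{n-1}$ and the true value function satisfies $V = TV$. For any two bounded functions $V$ and $W$, the $\log$ terms cancel when $TV$ and $TW$ are compared, using $|\max f - \max g| \le \max |f-g|$. The remaining gap shrinks by a factor $\beta$ per iteration:

$$
\begin{aligned}
\left|(TV)(k,z)-(TW)(k,z)\right|
&\le \max_{k'}\,\beta\left|\mathbb{E}\left[V(k',z')-W(k',z')\mid z\right]\right|
\le \beta\,\lVert V-W\rVert_{\infty},\\
\lVert V^{n}-V\rVert_{\infty}
&= \lVert TV^{n-1}-TV\rVert_{\infty}
\le \beta\,\lVert V^{n-1}-V\rVert_{\infty}
\le \beta^{n}\,\lVert V^{0}-V\rVert_{\infty}\;\longrightarrow\; 0 .
\end{aligned}
$$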

There are two computational costs in value function iteration. First, we need to evaluate the operator for every value of the state variables $k$ and $z$. In the main paper, we have 17,820 points in the grid of capital ($k_{i}$, for $i \in \{1, \dots, 17820\}$) and 5 points in the grid of productivity ($z_{j}$, for $j \in \{1, \dots, 5\}$), for a total of 89,100 points. Second, we need to solve the $\max_{k'}$ problem using grid search. That is, for each of the 89,100 points, we need to search among the 17,820 possible choices of $k_{m}'$, for $m \in \{1, \dots, 17820\}$.¹⁶

价值函数迭代法存在两项计算成本:第一,需要对状态变量资本 $k$ 和生产率 $z$ 的所有取值计算贝尔曼算子。在正文中,我们设定资本存量的网格点为 17,820 个($k_{i}$,$i \in \{1, \dots, 17820\}$),生产率的网格点为 5 个($z_{j}$,$j \in \{1, \dots, 5\}$),状态点总数为 89,100 个。第二,需要通过网格搜索求解 $\max_{k'}$ 的优化问题,即对 89,100 个状态点中的每一个,都需要在 17,820 个潜在的未来资本存量 $k_{m}'$($m \in \{1, \dots, 17820\}$)中搜索最优解 ¹⁶。
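Multiplying out the grid sizes makes the per-iteration cost of a naive grid search explicit. A naive search requires one Bellman evaluation per state point and candidate $k'$:

$$
\underbrace{17{,}820}_{k_i} \times \underbrace{5}_{z_j} = 89{,}100 \ \text{state points},
\qquad
89{,}100 \times 17{,}820 = 1{,}587{,}762{,}000 \approx 1.5878 \times 10^{9} \ \text{evaluations per iteration}.
$$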

To ease the computational burden of the maximization problem, we take advantage of two key properties of the solution. The first is the monotonicity of the decision rule. That is, if we know that, for state variables $k_{i}$ and $z_{j}$, the optimal choice is

为降低最大化问题的计算负担,我们利用了该模型解的两个核心性质:第一,决策规则的单调性。即若已知状态变量 $k_{i}$ 和 $z_{j}$ 对应的最优未来资本存量为:

$$
k_{m}'=g\left(k_{i}, z_{j}\right),
$$

then we also know that $k_{n}'=g(k_{i+1}, z_{j}) \geq k_{m}'=g(k_{i}, z_{j})$. The decision rule is also monotone along the productivity dimension. Our algorithm does not exploit monotonicity along this second dimension, however, because preliminary tests showed that the resulting speed gains were minimal.

则可推知状态变量 $k_{i+1}$ 和 $z_{j}$ 对应的最优未来资本存量满足 $k_{n}'=g(k_{i+1}, z_{j}) \geq k_{m}'=g(k_{i}, z_{j})$。决策规则在生产率维度也具有单调性(但我们在算法中并未利用这一性质,预测试结果显示,利用该性质带来的速度提升微乎其微)。
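One standard way to see why the decision rule is monotone in $k$ uses increasing differences. This is a textbook argument added for completeness, not one made in the text. The per-period return has strictly increasing differences in $(k, k')$, and the continuation term does not depend on $k$. By Topkis's monotonicity theorem, the maximizer is therefore nondecreasing in $k$ for any $V^{n-1}$, so the property holds at every iteration:

$$
\frac{\partial^{2}}{\partial k\,\partial k'}\Big[(1-\beta)\log\left(z k^{\alpha}-k'\right)\Big]
= \frac{(1-\beta)\,\alpha z k^{\alpha-1}}{\left(z k^{\alpha}-k'\right)^{2}} > 0,
\qquad
\frac{\partial}{\partial k}\,\beta\,\mathbb{E}\left[V^{n-1}\left(k', z'\right) \mid z\right]=0 .
$$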

Second, we know that an envelope condition applies. More concretely, if

第二,该模型的解满足包络条件。具体而言,若满足:

$$
\begin{gathered}
(1-\beta) \log \left(z_{j} k_{i}^{\alpha}-k_{n+1}'\right)+\beta\, \mathbb{E}\left[V^{n-1}\left(k_{n+1}', z'\right) \mid z\right] \\
<(1-\beta) \log \left(z_{j} k_{i}^{\alpha}-k_{n}'\right)+\beta\, \mathbb{E}\left[V^{n-1}\left(k_{n}', z'\right) \mid z\right]
\end{gathered}
$$

then we know that the optimal choice of $k'$ cannot be higher than $k_{n}'$, that is, $k_{n}'=g(k_{i}, z_{j}) < k_{n+1}'$. In other words, once we have reached the optimal choice of $k'$, higher future capital only lowers the value function: the gain in continuation utility is smaller than the cost of lower current consumption.

则可确定最优未来资本存量 $k'$ 不会大于 $k_{n}'$,即 $k_{n}'=g(k_{i}, z_{j}) < k_{n+1}'$。换言之,一旦找到使价值函数最大化的最优未来资本存量,更高的未来资本存量只会导致价值函数下降,因为未来效用的提升无法抵消当期消费减少带来的效用损失。

These two properties allow us to write a particularly efficient algorithm. We copy the code here from the Matlab version, but all the other codes are nearly identical:

利用这两个性质,我们可以编写一套效率极高的算法。以下是 Matlab 版本的核心代码,其他语言的代码与之几乎完全一致:

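```matlab
% Excerpt: search over k' for a given state (nCapital, nProductivity).
% gridCapitalNextPeriod holds the optimal grid index found for the previous
% capital point (monotonicity); valueHighSoFar must be initialized to a
% very low value before this loop.
for nCapitalNextPeriod = gridCapitalNextPeriod:nGridCapital
    consumption = mOutput(nCapital,nProductivity) - vGridCapital(nCapitalNextPeriod);
    valueProvisional = (1-bbeta)*log(consumption) ...
        + bbeta*expectedValueFunction(nCapitalNextPeriod,nProductivity);
    if (valueProvisional > valueHighSoFar)
        valueHighSoFar = valueProvisional;                 % new best value
        capitalChoice = vGridCapital(nCapitalNextPeriod);  % new best k'
        gridCapitalNextPeriod = nCapitalNextPeriod;        % starting point for the next k_i
    else
        break; % We break when we have achieved the max
    end
end
```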

¹⁶ Given our choice of capital grid, all the choices of $k'$ are feasible for any point in the state space.

¹⁶ 基于本文设定的资本存量网格,所有潜在的未来资本存量 $k'$ 在任意状态点下均满足可行性约束。

What are we doing here? The counter nCapitalNextPeriod searches over candidate values of $k'$ given $k_{i}$ and $z_{j}$. But we do not initialize the counter at 1 (except at the first point of the grid). Instead, we initialize it at gridCapitalNextPeriod, that is, the grid index of the optimal choice of capital at the previous grid point, $k_{i-1}$ and $z_{j}$. Then, we evaluate consumption and the value function for the choice of capital $k'$ given by the counter. If this choice improves on the best value found so far (valueProvisional>valueHighSoFar), we move to the next point of the grid. Otherwise, we break the search, since we have already found the optimal choice.

上述代码的核心逻辑是什么?循环变量 nCapitalNextPeriod 的作用是,针对给定的状态变量 $k_{i}$ 和 $z_{j}$,搜索最优的未来资本存量 $k'$。但我们并未将循环变量的初始值设为 1(网格第一个点除外),而是将其设为 gridCapitalNextPeriod,即前一个状态点 $k_{i-1}$ 和 $z_{j}$ 对应的最优未来资本存量在网格中的位置。随后,计算该循环变量对应的未来资本存量下的当期消费和价值函数临时值;若该临时值高于当前找到的最优价值函数值(valueProvisional>valueHighSoFar),则继续搜索下一个网格点;否则,直接跳出循环,因为此时已找到最优的未来资本存量。
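For readers who want to experiment outside Matlab, here is a Python port of the excerpt above. It adds the two outer loops and a counter of value-function evaluations. It is a sketch, not the authors' code, and several parts of it are assumptions:

- The names `search_k_prime` and `bellman_monotone` are hypothetical.
- `value_high_so_far` starts at `-math.inf`, because the excerpt does not show how `valueHighSoFar` is initialized.
- The parameters in the usage block are illustrative rather than the paper's calibration: α = 1/3, β = 0.95, two productivity states with an assumed transition matrix, and 200 capital points between 0.5 and 1.5 times the deterministic steady state.

Like the original, the port relies on every k' on the grid being feasible (footnote 16), so `math.log` never receives a non-positive argument.

```python
import math
from typing import List, Sequence, Tuple


def search_k_prime(output: float, grid_k: Sequence[float], ev_col: Sequence[float],
                   beta: float, start: int) -> Tuple[float, int, int]:
    """Port of the Matlab excerpt for one state (k_i, z_j).

    Starts at grid index `start` (monotonicity) and stops at the first k' that
    does not improve the value. Returns (value, index, number of evaluations).
    """
    value_high_so_far = -math.inf  # assumed initialization
    choice = start
    evaluations = 0
    for m in range(start, len(grid_k)):
        consumption = output - grid_k[m]
        value = (1.0 - beta) * math.log(consumption) + beta * ev_col[m]
        evaluations += 1
        if value > value_high_so_far:
            value_high_so_far, choice = value, m
        else:
            break  # value started falling: the maximum was the previous point
    return value_high_so_far, choice, evaluations


def bellman_monotone(v: List[List[float]], transition: List[List[float]],
                     grid_k: Sequence[float], z_vals: Sequence[float],
                     alpha: float, beta: float):
    """One Bellman update using the monotone, early-stopping search."""
    n_k, n_z = len(grid_k), len(z_vals)
    tv = [[0.0] * n_z for _ in range(n_k)]
    policy = [[0] * n_z for _ in range(n_k)]
    total_evaluations = 0
    for j in range(n_z):
        # E[V(k'_m, z') | z_j] for every grid point k'_m
        ev_col = [sum(transition[j][jp] * v[m][jp] for jp in range(n_z))
                  for m in range(n_k)]
        start = 0  # reset at the first capital point, as in the Matlab code
        for i in range(n_k):
            output = z_vals[j] * grid_k[i] ** alpha
            tv[i][j], policy[i][j], count = search_k_prime(
                output, grid_k, ev_col, beta, start)
            total_evaluations += count
            start = policy[i][j]  # monotonicity: g(k_{i+1}, z_j) >= g(k_i, z_j)
    return tv, policy, total_evaluations


if __name__ == "__main__":
    alpha, beta = 1 / 3, 0.95                      # illustrative values
    k_ss = (alpha * beta) ** (1 / (1 - alpha))     # deterministic steady state (z = 1)
    n_k = 200
    grid_k = [k_ss * (0.5 + m / (n_k - 1)) for m in range(n_k)]
    z_vals = [0.98, 1.02]
    transition = [[0.9, 0.1], [0.1, 0.9]]          # assumed Markov chain
    v = [[0.0, 0.0] for _ in range(n_k)]
    for _ in range(50):
        v, policy, evals = bellman_monotone(v, transition, grid_k, z_vals, alpha, beta)
    print(evals / (n_k * len(z_vals)))  # average k' evaluations per state point
```

Resetting `start` to 0 at the first capital point of each productivity level mirrors the text's remark that the counter starts at 1 only at the first grid point. Afterwards, each search begins where the previous one ended, so a state typically costs only a few evaluations.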

A good way to see the efficiency of this algorithm is to note that we only need to check a few points of the grid for $k'$. For example, in the last iteration of the value function (iteration 257), we search on average 2.65 points of $k'$ for each pair $k_{i}$ and $z_{j}$, instead of the 17,820 that a naive search would require. In other words, instead of evaluating the value function $1.5878 \times 10^{9}$ times in iteration 257, we evaluate it only 235,656 times. The average number of searches is very stable from iteration 1 onward and has already settled at 2.65 points by iteration 9.

该算法的高效性可通过一组数据直观体现:在价值函数的迭代过程中,我们仅需检查少量网格点即可找到最优解。例如,在价值函数的最后一次迭代(第 257 次)中,针对每个状态点 $k_{i}$ 和 $z_{j}$,我们平均仅需搜索 2.65 个未来资本存量网格点,而非朴素搜索所需的 17,820 个。换言之,在第 257 次迭代中,我们无需对价值函数进行 $1.5878 \times 10^{9}$ 次计算,仅需计算 235,656 次即可。从第一次迭代开始,平均搜索的网格点数量就非常稳定,到第 9 次迭代时,该数值已稳定在 2.65 个。
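The figures reported for iteration 257 are mutually consistent. They imply that the monotone, early-stopping search cuts the number of evaluations by more than three orders of magnitude:

$$
\frac{235{,}656}{89{,}100} \approx 2.645 \approx 2.65 \ \text{points per state},
\qquad
\frac{1{,}587{,}762{,}000}{235{,}656} \approx 6{,}738 .
$$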

A cursory inspection of our code, where we nest a while loop with three for loops and an if control statement, could suggest the possibility of improving performance by taking advantage of vectorization. Unfortunately, our algorithm is unlikely to benefit from vectorization. The reason is that vectorization cannot easily accommodate monotonicity and the envelope condition. We could, for example, evaluate the value function for

若粗略观察我们的代码,会发现其中嵌套了 while 循环、三层 for 循环和 if 条件判断,可能会认为通过向量化能够提升性能。但遗憾的是,我们的算法难以从向量化中获益,原因在于向量化难以兼容决策规则的单调性和包络条件。例如,我们可以通过向量化函数对所有潜在的未来资本存量 $k'$ 计算价值函数:

$$
(1-\beta) \log \left(z_{j} k_{i}^{\alpha}-k'\right)+\beta\, \mathbb{E}\left[V^{n-1}\left(k', z'\right) \mid z\right]
$$

for all possible values of $k'$ using vectorized functions and obtain a vector $\widehat{V}(k')$. But then we would need to search $\widehat{V}(k')$ for its maximum value, a costly task. Monotonicity and the envelope condition tell us that we do not need to compute the whole of $\widehat{V}(k')$, just 2.65 points on average. Furthermore, as the number of grid points for $k'$ increases, the performance of vectorization deteriorates more and more, because we need to search among more values of $k'$. Note that this is not directly linked to the efficiency with which we store matrices in memory. Therefore, the inner for loop and the if control statement are the consequence of understanding the mathematical structure of our problem.

得到价值函数向量 $\widehat{V}(k')$,但后续需要在该向量中搜索最大值,这一过程的计算成本很高。而单调性和包络条件表明,我们无需计算整个价值函数向量 $\widehat{V}(k')$,平均仅需计算 2.65 个点即可。此外,随着未来资本存量网格点数量的增加,向量化的性能会持续衰减,因为需要在更多的取值中搜索最大值。需要说明的是,这一问题与矩阵的内存存储效率并无直接关联。因此,代码中内层的 for 循环和 if 条件判断,是我们充分理解问题的数学结构后做出的编程选择。

The two outer for loops are just convenient ways to move along the state space without the need to store large matrices. Similarly, the while loop is unavoidable: the very core of value function iteration is that it is an iteration.

代码中的两层外层 for 循环,仅是遍历状态空间的便捷方式,无需存储大型矩阵;而 while 循环则是价值函数迭代法的核心,无法避免 ------ 因为价值函数迭代法的本质就是通过反复迭代实现收敛。

We can illustrate the drawbacks of vectorization by running Matlab and R code with and without vectorization for three different capital grids: 179, 1,782, and 17,820 grid points. Table A.1 reports the running time in seconds. Our original code is substantially faster than the vectorized version, and the gap widens as the number of grid points increases.

为直观展示向量化的弊端,我们在 Matlab 和 R 中分别编写了基础版(非向量化)和向量化版本的代码,并在三种不同的资本存量网格规模(179、1,782、17,820 个网格点)下进行测试,结果如表 A.1 所示(单位:秒)。可以看到,基础版代码的运行速度远快于向量化版本,且随着网格点数量的增加,向量化版本的性能衰减会更加明显。

Our vectorization was very simple: we eliminated the inner loop and the conditional statement and replaced them with vector operations on the whole grid of $k'$. A more carefully designed vectorization could narrow the gap with respect to our loop code. However, any vectorization would still need to beat a very efficient loop search that exploits monotonicity and an envelope condition. Therefore, we are not sanguine about the prospects of vectorization in this particular problem.

我们采用的向量化方式十分简单:删除内层循环和条件判断语句,替换为对整个未来资本存量网格的向量化运算。尽管更精细的向量化设计可能缩小其与基础版循环代码的性能差距,但向量化始终需要胜过一种结合了单调性和包络条件、效率极高的循环搜索。因此,我们对向量化在这一特定问题中的前景并不乐观。

Table A.1: Loop vs. Vectorization (running time in seconds)

表 A.1:循环版与向量化版代码性能对比(运行时间,单位:秒)

| $k'$ grid points / $k'$ 网格点 | 179 | 1,782 | 17,820 |
| --- | ---: | ---: | ---: |
| Matlab, loop / Matlab 循环版 | 0.09 | 0.80 | 7.91 |
| Matlab, vectorized / Matlab 向量化版 | 1.51 | 76.04 | 3,408.18 |
| R, loop / R 循环版 | 2.14 | 22.00 | 345.55 |
| R, vectorized / R 向量化版 | 2.29 | 108.72 | 10,785.20 |
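The "very simple" vectorization described above can be sketched in NumPy. In that scheme the outer loops stay, and the inner k' loop plus its if statement become one operation on the whole k' grid followed by a full maximum search. NumPy stands in here for the Matlab/R vector operations. The function name `vectorized_bellman_step` and the test parameters are hypothetical, and every k' is assumed feasible (footnote 16). Each state point now evaluates all n_k candidates, so the work per iteration grows with the square of the capital grid. The loop search, by contrast, needs about 2.65 points per state in the paper's exercise.

```python
import numpy as np


def vectorized_bellman_step(v, transition, grid_k, z_vals, alpha, beta):
    """Bellman update with the inner k' loop replaced by whole-grid vector ops.

    v[m, j] = V^{n-1}(k_m, z_j); transition[j, jp] = Pr(z' = z_jp | z = z_j).
    Assumes every k' on the grid is feasible, so the log is always defined.
    """
    ev = v @ transition.T  # ev[m, j] = E[V(k'_m, z') | z_j]
    n_k, n_z = v.shape
    tv = np.empty((n_k, n_z))
    policy = np.empty((n_k, n_z), dtype=np.int64)
    for j in range(n_z):
        for i in range(n_k):
            output = z_vals[j] * grid_k[i] ** alpha
            # Value of every candidate k' at once: no early stopping is possible.
            values = (1.0 - beta) * np.log(output - grid_k) + beta * ev[:, j]
            best = int(np.argmax(values))  # full search over all n_k candidates
            policy[i, j] = best
            tv[i, j] = values[best]
    return tv, policy


if __name__ == "__main__":
    grid_k = np.linspace(0.09, 0.27, 100)            # illustrative grid
    z_vals = np.array([0.98, 1.02])
    transition = np.array([[0.9, 0.1], [0.1, 0.9]])  # assumed Markov chain
    v = np.zeros((100, 2))
    for _ in range(20):
        v, policy = vectorized_bellman_step(v, transition, grid_k, z_vals, 1 / 3, 0.95)
```

The sketch gives the same maximizer as a full loop search, but it always pays for all n_k candidates. It therefore cannot exploit monotonicity or stop at the first decrease, which is the structural reason behind the widening gaps in Table A.1.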
