符号执行技术实践-求解程序密码

符号执行是什么

符号执行(Symbolic Execution)是一种程序分析技术，它使用符号值而不是具体的数值来执行程序。与传统的程序执行不同，符号执行将程序的输入用符号变量表示，然后沿着程序的执行路径收集这些符号变量必须满足的约束条件。

如果我们把普通的程序执行比作用具体的数字走迷宫，那么符号执行就像是用代数表达式走迷宫，记录下每一步的选择条件，最终我们可以通过求解这些条件来找到能够到达特定目标的具体数值。

在工作中，如果我们需要找到正确输入才能到达特定代码块的问题时，那么符号执行或许能够帮到你，本文我们将通过angr 二进制分析框架中的符号执行，对一个示例程序的密码进行求解。

基本概念

符号变量(Symbolic Variables) 程序输入被抽象为符号变量，如x、y、z等，而不是具体的数值。这些变量在执行过程中会形成复杂的符号表达式。

路径约束(Path Constraints) 程序执行过程中遇到的每个条件分支都会产生约束条件。例如，当程序执行if (x > 10)时，会产生约束条件x > 10或x ≤ 10。

符号状态(Symbolic State) 包含当前所有变量的符号表达式以及到达当前程序点的路径约束集合。符号状态完整描述了程序在某个执行点的符号化信息。

约束求解器(Constraint Solver) 负责求解路径约束的工具，通常使用SMT(Satisfiability Modulo Theories)求解器，如Z3、CVC4等。它们能够判断约束集合是否可满足，并给出满足约束的具体解。

工作原理

符号执行的核心思想在于用抽象的符号来代替程序运行时的具体数值。当程序启动时，符号执行引擎不会给变量赋予确定的值，而是将它们标记为未知的符号变量。随着程序的执行，这些符号变量参与各种运算和条件判断，逐渐构建出复杂的符号表达式。

在程序执行过程中，每当遇到分支语句时，符号执行引擎可以同时追踪多条可能的执行路径，并记录下沿途的所有条件判断，形成该路径的约束集合。

这些约束集合本质上是一组数学方程和不等式，精确描述了什么样的输入值能够让程序沿着特定路径执行。符号执行引擎通过调用约束求解器，运用数学算法计算出满足所有条件的具体数值解。

当求解器成功找到解时，我们就获得了能够精确引导程序按预期路径执行的具体输入值。从而得到我们想要的输入，这种方式为程序密码破解、漏洞发现和软件测试提供了强有力的自动化手段。

主要挑战

路径爆炸：随着分支数量增加，路径呈指数增长。解决方案包括启发式搜索和路径合并。
约束求解：复杂约束可能导致求解器效率低下。优化方法包括移除不相关约束和增量求解。

实践 - 求解程序密码

环境

系统: Ubuntu 22.04

python 3.12: apt install python3.12

angr: 通过 pip install angr 安装

示例程序

find.c.jinja

c++ 复制代码

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define USERDEF "{{ userdef }}"
#define LEN_USERDEF {{ len_userdef }}

char msg[] =
  "{{ description }}";

void print_msg() {
  printf("%s", msg);
}

int complex_function(int value, int i) {
#define LAMBDA 3
  if (!('A' <= value && value <= 'Z')) {
    printf("Try again.\n");
    exit(1);
  }
  return ((value - 'A' + (LAMBDA * i)) % ('Z' - 'A' + 1)) + 'A';
}

int main(int argc, char* argv[]) {
  char buffer[9];

  //print_msg();

  printf("Enter the password: ");
  scanf("%8s", buffer);

  for (int i=0; i<LEN_USERDEF; ++i) {
    buffer[i] = complex_function(buffer[i], i);
  }

  if (strcmp(buffer, USERDEF)) {
    printf("Try again.\n");
  } else {
    printf("Good Job.\n");
  }
}

generate.py

python 复制代码

#!/usr/bin/env python3
import sys, random, os, tempfile, jinja2

def generate(argv):
  if len(argv) != 3:
    print('Usage: ./generate.py [seed] [output_file]')
    sys.exit()

  seed = argv[1]
  output_file = argv[2]

  random.seed(seed)
  userdef_charset = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
  userdef = ''.join(random.choice(userdef_charset) for _ in range(8))

  template = open(os.path.join(os.path.dirname(os.path.realpath(__file__)), 'find.c.jinja'), 'r').read()
  t = jinja2.Template(template)
  c_code = t.render(userdef=userdef, len_userdef=len(userdef), description = '')

  with tempfile.NamedTemporaryFile(delete=False, suffix='.c', mode='w') as temp:
    temp.write(c_code)
    temp.seek(0)
    os.system('gcc -fno-pie -no-pie -fcf-protection=none -m32 -o ' + output_file + ' ' + temp.name)

if __name__ == '__main__':
    generate(sys.argv)

符号执行求解

scaffold.py

python 复制代码

import angr
import sys

def main(argv):
  # Create an Angr project.
  # If you want to be able to point to the binary from the command line, you can
  # use argv[1] as the parameter. Then, you can run the script from the command
  # line as follows:
  # python ./scaffold00.py [binary]
  path_to_binary = 'test'  # :string
  project = angr.Project(path_to_binary)

  # Tell Angr where to start executing (should it start from the main()
  # function or somewhere else?) For now, use the entry_state function
  # to instruct Angr to start from the main() function.
  initial_state = project.factory.entry_state(
    add_options = { angr.options.SYMBOL_FILL_UNCONSTRAINED_MEMORY,
                    angr.options.SYMBOL_FILL_UNCONSTRAINED_REGISTERS}
  )

  # Create a simulation manager initialized with the starting state. It provides
  # a number of useful tools to search and execute the binary.
  simulation = project.factory.simgr(initial_state)

  # Explore the binary to attempt to find the address that prints "Good Job."
  # You will have to find the address you want to find and insert it here. 
  # This function will keep executing until it either finds a solution or it 
  # has explored every possible path through the executable.
  print_good_address = 0x80492F0  # :integer (probably in hexadecimal)
  simulation.explore(find=print_good_address)

  # Check that we have found a solution. The simulation.explore() method will
  # set simulation.found to a list of the states that it could find that reach
  # the instruction we asked it to search for. Remember, in Python, if a list
  # is empty, it will be evaluated as false, otherwise true.
  if simulation.found:
    # The explore method stops after it finds a single state that arrives at the
    # target address.
    solution_state = simulation.found[0]

    # Print the string that Angr wrote to stdin to follow solution_state. This 
    # is our solution.
    print(solution_state.posix.dumps(sys.stdin.fileno()).decode())
  else:
    # If Angr could not find a path that reaches print_good_address, throw an
    # error. Perhaps you mistyped the print_good_address?
    raise Exception('Could not find the solution')

if __name__ == '__main__':
  main(sys.argv)

效果

bash 复制代码

python3 generate.py 999 test
python3 scaffold.py
UVXEXQJO

./test
Enter the password: UVXEXQJO
Good Job.

安全防范

针对符号执行等自动化分析技术的威胁，Virbox Protector 可以有效防范这类场景。代码虚拟化技术通过将原始指令转换为自定义虚拟机指令，从根本上改变了程序的执行模型，使符号执行工具难以正确理解和建模程序逻辑。代码混淆则通过插入大量冗余指令和复杂的控制流变换，人为制造路径爆炸问题，让符号执行引擎在面对指数级增长的可能路径时因资源耗尽而无法完成分析。此外，导入表加密隐藏了关键API调用信息，反调试机制主动检测并阻止分析工具运行，代码加密采用动态解密技术增加了静态分析难度，内存完整性校验防止运行时篡改。这些防护技术相互配合，特别是虚拟化和混淆的深度应用，能够有效扰乱符号执行的分析过程，显著提升软件的抗逆向分析能力。