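How can algorithm computation and training support low-overhead stream processing? smallpond, the framework behind DeepSeek, needs some new changes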

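With the current AI boom, more and more vendors are exploring how to bring down the cost of AI. Drawing on today's open-source AI frameworks, this article offers my own thoughts on building on-device (edge) AI at low cost.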

Background

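Smallpond provides two sets of APIs (described below): a high-level DataFrame API and a low-level LogicalPlan API. The former is simple and easy to understand, and feels much like Pandas or PySpark; the latter is more flexible and can express more complex data-processing logic.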

The framework was originally designed for offline (batch) computation, and it supports task-level resource isolation for CPU, memory, and GPU.

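Incoming data is first landed on disk (3FS). When a worker process computes on it, the data is loaded into DuckDB, an in-process OLAP database often likened to SQLite for analytics. Operations such as combine and merge are carried out on Arrow, a columnar in-memory format.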

Project links:

smallpond: github.com/deepseek-ai...

3fs: github.com/deepseek-ai...fs

The API that fits this project best is the Logical Plan API.


from typing import List

from smallpond.logical.dataset import ParquetDataSet
from smallpond.logical.node import Context, DataSourceNode, DataSetPartitionNode, SqlEngineNode, LogicalPlan
from smallpond.execution.driver import Driver

def my_pipeline(input_paths: List[str], npartitions: int):
   ctx = Context()
   dataset = ParquetDataSet(input_paths)
   node = DataSourceNode(ctx, dataset)
   node = DataSetPartitionNode(ctx, (node,), npartitions=npartitions)
   node = SqlEngineNode(ctx, (node,), "SELECT * FROM {0}")
   return LogicalPlan(ctx, node)

if __name__ == "__main__":
   driver = Driver()
   driver.add_argument("-i", "--input_paths", nargs="+")
   driver.add_argument("-n", "--npartitions", type=int, default=10)

   plan = my_pipeline(**driver.get_arguments())
   driver.run(plan)

python script.py -i "path/to/*.parquet" -n 10 Ray # Ray engine (tasks scheduled by a Ray cluster)
python script.py -i "path/to/*.parquet" -n 10 scheduler # built-in engine (tasks scheduled by smallpond's own cluster scheduler)
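
To make the role of `npartitions` in the pipeline above more concrete, here is a minimal sketch of splitting a list of input files into a fixed number of partitions, round-robin. It only illustrates the general idea of partitioning a dataset so that tasks can run in parallel; the function name `split_into_partitions` is hypothetical, and this is not how smallpond implements `DataSetPartitionNode`.

```python
from typing import List


def split_into_partitions(paths: List[str], npartitions: int) -> List[List[str]]:
    """Distribute file paths round-robin across `npartitions` buckets."""
    if npartitions <= 0:
        raise ValueError("npartitions must be positive")
    partitions: List[List[str]] = [[] for _ in range(npartitions)]
    # Sort first so the assignment is deterministic for a given set of files.
    for index, path in enumerate(sorted(paths)):
        partitions[index % npartitions].append(path)
    return partitions


if __name__ == "__main__":
    files = [f"part-{i}.parquet" for i in range(5)]
    for pid, bucket in enumerate(split_into_partitions(files, 2)):
        print(pid, bucket)
```

Each bucket would then become the input of one task, which is why a larger `npartitions` gives the scheduler more, smaller units of work.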

[Figure: architecture diagram]

Architecture walkthrough

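At the top, the client layer has two levels, DataFrame and LogicalPlan; a LogicalPlan represents the orchestration stage of a job.
The topology layer has two levels, logical and physical. The logical topology breaks the work into operators such as source, compute, shuffle, and hash partition, each invoking specific tasks; the matching plugins are loaded when these are converted into physical Tasks.
The scheduling layer offers two options: the in-house (built-in) task scheduler and Ray cluster scheduling.
In the storage layer, under Ray scheduling, data landed on 3FS is stored as Parquet (columnar) files. Once the scheduler receives a request in which a logical Node has been converted into a physical Task, the data is mounted from 3FS into the DuckDB instance inside the worker process.
During scheduling, the scheduler introduces the notion of a WorkQueue and queues submitted tasks. Each worker process has its own task queue; if the total resources requested in a queue exceed the limit, the monitoring dashboard shows the resource utilization, and users decide which machine to submit to based on the resources available.
With the built-in execution engine, this storage also holds the serialized tasks, which is how tasks are dispatched from the driver node to the executor nodes. Besides 3FS, Smallpond supports the fsspec interface, so it can connect to other storage systems.
The execution layer closely resembles Spark's: there is a logical plan, and with the DataFrame interface there is optimizer support as well. The final physical plan generates tasks, which the scheduler ships to remote workers for computation. A task can execute in one of two ways, DuckDB or Arrow (the latter is not covered in the official documentation).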
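
The per-worker queue described above can be pictured with a small sketch. The classes `Task` and `WorkQueue` below are hypothetical stand-ins written for illustration, not smallpond's own types; the sketch only shows how queued resource requests might be summed and compared against a worker's capacity to flag over-subscription.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class Task:
    name: str
    cpu: int          # CPU cores requested
    memory_gb: float  # memory requested, in GiB


@dataclass
class WorkQueue:
    """Hypothetical per-worker queue that tracks the resources of queued tasks."""
    cpu_capacity: int
    memory_capacity_gb: float
    tasks: Deque[Task] = field(default_factory=deque)

    def submit(self, task: Task) -> None:
        self.tasks.append(task)

    def usage(self) -> dict:
        """Return queued demand as a fraction of capacity (may exceed 1.0)."""
        cpu = sum(t.cpu for t in self.tasks)
        mem = sum(t.memory_gb for t in self.tasks)
        return {"cpu": cpu / self.cpu_capacity, "memory": mem / self.memory_capacity_gb}

    def is_oversubscribed(self) -> bool:
        """True when queued demand exceeds capacity on any resource."""
        return any(ratio > 1.0 for ratio in self.usage().values())
```

A dashboard could poll `usage()` for every worker, and a user (or a scheduler) could pick the worker with the lowest ratios when submitting the next task.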

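How to introduce stream processing

How does Kafka data get in?

Because incoming data is landed on disk first, the landed batches can be brought in through smallpond's SqlEngineNode: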


shuffled_urls = SqlEngineNode(
        ctx,
        (urls_partitions,),
        # project the two columns of interest, ordered by the partition's sort key
        r"select nodeId, device_time from {0} order by sort_key",
        cpu_limit=2,
    )

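How is second-level data landed on disk?

Among smallpond's Task types there is an Arrow-based compute task. Incoming stream data can be converted to a pandas DataFrame and then to an Arrow table (the columnar in-memory format), computed on at second-level granularity through DuckDB, and dumped to 3FS as Parquet files, where it can later be merged.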


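import json
import os

import pandas as pd
import pyarrow as pa

# Context, ArrowComputeTask, MB and the generate_* helpers come from smallpond
# and the author's test utilities; their imports are not shown in the original.


def test_sucheon_feature():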
    feature = {'pageBatchID': '',
               'uniq_array': [
                   'feature3:19.31,19.39,19.34,19.39,19.36,19.39,19.39,19.36,19.39,19.37,19.34,19.31,19.35,19.33,19.35,19.35,19.42,19.37,19.35,19.37,19.37,19.34,19.34,19.34',
                   'feature2:23.71,23.79,23.68,23.65,23.62,23.67,23.67,23.76,23.73,23.69,23.6,23.49,23.68,23.68,23.52,23.6,23.87,23.8,23.61,23.7,23.72,23.49,23.68,23.58',
                   'meanHf:1216900104915',
                   'mean:255675728',
                   'std:7667',
                   'customFeature:',
                   'meanLf:19047538128',
                   'temperature:-274',
                   'bandSpectrum:21.58,20.05,19.57,19.56,19.84,19.77,20.05,19.51,19.21,19.03,19.33,18.58,18.25,17.94,17.45,17.25,16.91,16.4,15.88,15.4',
                   'feature4:8.97,8.91,9.02,8.97,8.96,9.01,8.97,8.99,8.97,8.98,8.87,8.89,8.97,8.91,8.9,8.89,9.03,9.02,8.83,8.98,8.97,8.84,8.91,8.84',
                   'extend:{"SerialData":"","GpioData":-1}',
                   'feature1:27.75,27.81,27.77,27.81,27.82,27.86,27.85,27.82,27.82,27.84,27.84,27.81,27.83,27.81,27.86,27.84,27.83,27.84,27.82,27.84,27.87,27.86,27.82,27.83',
                   'peakFreqs:4.54,5.6,5.96,6.15,6.49,6.65,6.76,6.87,7.03,7.13,8.03,8.56,9.0,9.39',
                   'peakPowers:18.43,17.56,17.36,17.11,17.37,17.21,17.01,16.62,17.97,16.45,16.36,16.14,16.53,16.48'],
               'time': 0, 'nodeId': 852, 'uuid': '704A0ED3C5CA', 'device_time': '2023-03-24T10:00:00'}

    feature_str = json.dumps(feature)
    tmp_abspath = "/usr/local/share/tmp/"
    parquet_origin = feature_str
    columns = ["pageBatchID", "uniq_array", "time", "nodeId", "uuid", "device_time"]
    print("---------------------------------")
    os.makedirs(tmp_abspath, exist_ok=True)
    dump_path = os.path.join(tmp_abspath, f"xxxxx-test.pickle")

    data = []
    num_rows = 10000000
    for _ in range(num_rows):
        url, domain = generate_url_and_domain()
        date = generate_random_date()
        content = generate_content()

        data.append({"pageBatchID": url, "uniq_array": domain, "time": date, "nodeId": content, "uuid": url, "device_time": date})

    df = pd.DataFrame(data)

    table = pa.Table.from_pandas(df=df)
    ctx = Context()
    task = ArrowComputeTask(ctx, None, None, None,
                            1000, True, "ZSTD",
                            3, True, "test-duckdb-0",
                            "/usr/local/share/duckdb/tmp/", 2, 0, 100 * MB)

    task.dump_output(output_table=table)
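
Before a batch can be dumped as above, the stream has to be cut into per-second batches. The following is a minimal sketch of that step, assuming each record carries an ISO-8601 `device_time` string like the sample record in the test; the function name `bucket_by_second` is hypothetical and not part of smallpond.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List


def bucket_by_second(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group records by the whole second of their ISO-8601 `device_time`."""
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        ts = datetime.fromisoformat(record["device_time"])
        # Truncate to the second so every record in that second shares one key.
        key = ts.replace(microsecond=0).isoformat()
        buckets[key].append(record)
    return dict(buckets)
```

Each bucket could then be turned into a DataFrame and an Arrow table and written out as one Parquet file, so that one file corresponds to one second of data.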

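How is time ordering preserved?

Introducing streaming into smallpond means the data gets scattered across partitions, which breaks its continuity, so the algorithms have to work under eventual consistency. For this, smallpond provides UnionNode, which merges the computed results. Because results are appended in order as they are computed, the stored data is ordered as well. The logical plan can also sort, for example with an ORDER BY in a SqlEngineNode query. Notably, all of this runs on DuckDB inside the worker process.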


def test_nested_partition(self):
        ctx = Context()
        parquet_files = ParquetDataSet(["tests/data/mock_urls/*.parquet"])
        data_source = DataSourceNode(ctx, parquet_files)

        SqlEngineNode.default_cpu_limit = 1
        SqlEngineNode.default_memory_limit = 1 * GB
        initial_reduce = r"select host, count(*) as cnt from {0} group by host"
        combine_reduce_results = (
            r"select host, cast(sum(cnt) as bigint) as cnt from {0} group by host"
        )
        join_query = r"select host, cnt from {0} where (exists (select * from {1} where {1}.host = {0}.host)) and (exists (select * from {2} where {2}.host = {0}.host))"

        # ... (intermediate nodes elided in the original)
        # union operators merge the per-branch results
        union_url_count_by_hosts = UnionNode(
            ctx, (url_count_by_hosts1, url_count_by_hosts2)
        )
        union_url_count_by_hosts_x_urls = UnionNode(
            ctx,
            (
                url_count_by_hosts_x_urls1,
                url_count_by_hosts_x_urls2,
                join_count_by_hosts_x_urls1,
                join_count_by_hosts_x_urls2,
            ),
        )

        with tempfile.TemporaryDirectory(dir=self.output_root_abspath) as output_dir:
            data_sink = DataSinkNode(
                ctx,
                (
                    url_count_by_hosts_expected,
                    union_url_count_by_hosts,
                    union_url_count_by_hosts_x_urls,
                ),
                output_path=output_dir,
                manifest_only=True,
            )
            plan = LogicalPlan(ctx, data_sink)
            exec_plan = self.execute_plan(plan, remove_empty_parquet=True)
            # verify results
            self._compare_arrow_tables(
                exec_plan.get_output("url_count_by_hosts_x_urls1").to_arrow_table(),
                exec_plan.get_output("url_count_by_hosts_x_urls2").to_arrow_table(),
            )
            self._compare_arrow_tables(
                exec_plan.get_output("join_count_by_hosts_x_urls1").to_arrow_table(),
                exec_plan.get_output("join_count_by_hosts_x_urls2").to_arrow_table(),
            )
            self._compare_arrow_tables(
                exec_plan.get_output("url_count_by_hosts_x_urls1").to_arrow_table(),
                exec_plan.get_output("join_count_by_hosts_x_urls1").to_arrow_table(),
            )
            self._compare_arrow_tables(
                exec_plan.get_output("url_count_by_hosts1").to_arrow_table(),
                exec_plan.get_output("url_count_by_hosts2").to_arrow_table(),
            )
            self._compare_arrow_tables(
                exec_plan.get_output("url_count_by_hosts_expected").to_arrow_table(),
                exec_plan.get_output("url_count_by_hosts1").to_arrow_table(),
            )
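
The ordering concern in this section can also be sketched outside smallpond. Assuming each partition's results are already sorted by `device_time` and that timestamps share one ISO-8601 format (so string order equals time order), a k-way merge restores a single time-ordered stream without re-sorting everything. The helper name `merge_by_time` is hypothetical; this is a plain-Python illustration, not what UnionNode does internally.

```python
import heapq
from typing import Iterable, Iterator, List


def merge_by_time(partitions: Iterable[List[dict]]) -> Iterator[dict]:
    """Merge partitions, each sorted by `device_time`, into one sorted stream."""
    return heapq.merge(*partitions, key=lambda row: row["device_time"])


if __name__ == "__main__":
    p1 = [{"device_time": "2023-03-24T10:00:00"}, {"device_time": "2023-03-24T10:00:02"}]
    p2 = [{"device_time": "2023-03-24T10:00:01"}]
    for row in merge_by_time([p1, p2]):
        print(row["device_time"])
```

`heapq.merge` is lazy, so it holds only one row per partition in memory at a time, which fits the low-overhead goal of the article.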

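How to deal with high memory usage

smallpond chooses to keep state on disk, separating storage from compute.

Since storage is mounted from 3FS into DuckDB, how does the mounting work?

[Figure: mounting data from 3FS into DuckDB]

Some readers may notice that reads and writes go through a dedicated storage abstraction. DuckDB can compute on data purely in memory, or it can be backed by disk. In cluster mode a single node's memory is clearly not enough, so data usually has to be spilled to disk and datasets loaded back from disk, with computation done on Arrow's columnar in-memory format to keep overhead down.

When a single machine runs short of resources, how can a task be placed precisely on the resources that are free?

Modern servers commonly use an architecture called NUMA (Non-Uniform Memory Access). It divides one system's resources into nodes, each a group of CPU cores with its own local memory, somewhat like racks in a machine room. Binding a process to a node's cores can also bind its memory allocations to that node's memory, so the process avoids the slower path to memory attached to another node.

Usage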


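 numactl -N 1 -m 1 <command>

This runs the command on the CPUs of NUMA node 1 and allocates its memory from node 1 as well (`-m` takes node numbers, not a size). Under NUMA the CPUs are grouped into nodes, which can be pictured as different CPUs on one rack.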
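
As a small illustration of working with NUMA nodes from Python: on Linux, the cores belonging to node N are listed in `/sys/devices/system/node/node<N>/cpulist` as a string such as `0-3,8-11`. The sketch below parses that format and builds the `numactl` prefix used above. Both helper names are hypothetical, and the sketch does not read the file itself, so it stays runnable on machines without NUMA.

```python
from typing import List


def parse_cpulist(cpulist: str) -> List[int]:
    """Expand a Linux cpulist string such as "0-3,8-11" into core IDs."""
    cores: List[int] = []
    for part in cpulist.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cores.extend(range(int(start), int(end) + 1))
        else:
            cores.append(int(part))
    return cores


def numactl_prefix(node: int) -> List[str]:
    """Command prefix that binds both CPU and memory to one NUMA node."""
    return ["numactl", "-N", str(node), "-m", str(node)]


if __name__ == "__main__":
    print(parse_cpulist("0-3,8-11"))
    print(" ".join(numactl_prefix(1) + ["python", "worker.py"]))
```

Prepending the prefix to a worker's command line is one way to keep that worker's cores and memory on the same node.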

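However, memory inside a Python process is managed separately, and Python has a well-known weakness here: when objects in a data structure reference each other, their reference counts never drop to zero even after the program has dropped every reference to them. Such objects are not freed immediately and have to wait for the cyclic garbage collector to run.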
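
The behaviour is easy to observe in CPython with the standard `gc` and `weakref` modules. The sketch below builds a two-object cycle, drops every name that refers to it, and checks whether the objects are still alive before and after an explicit collection. The class `Node` and the function name are made up for the demonstration.

```python
import gc
import weakref


class Node:
    def __init__(self):
        self.other = None


def cycle_survives_until_collect() -> tuple:
    """Return (alive_after_del, alive_after_collect) for a two-object cycle."""
    gc.disable()  # keep the automatic collector from running mid-demo
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a  # reference cycle: a -> b -> a
        probe = weakref.ref(a)   # lets us observe a without keeping it alive
        del a, b  # the names are gone, but each object still holds the other
        alive_after_del = probe() is not None
        gc.collect()  # the cyclic collector finds and frees the pair
        alive_after_collect = probe() is not None
        return alive_after_del, alive_after_collect
    finally:
        gc.enable()
```

Until a collection happens, the memory held by the cycle stays allocated, which is the transient memory growth the article is concerned about.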

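For this reason the algorithm state is not cached directly in Python dicts or lists, since there is no guarantee that transient memory is reclaimed promptly. Instead, computation state is handed over to Ray (elastic resource management, multithreaded computation in C++ processes, and state management through its distributed object store).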


import argparse
import os
import socket
import subprocess

import psutil

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="smallpond worker")
    parser.add_argument(
        "--ray_address",
        required=True,
        help="The address of the Ray cluster to connect to",
    )
    parser.add_argument(
        "--log_dir", required=True, help="The directory where logs will be stored"
    )
    parser.add_argument(
        "--bind_numa_node",
        action="store_true",
        help="Bind executor processes to numa nodes",
    )

    args = parser.parse_args()
    log_path = os.path.join(args.log_dir, f"{socket.gethostname()}.log")

    # limit the number of CPUs to the number of physical cores
    cpu_count = psutil.cpu_count(logical=False)
    memory = psutil.virtual_memory().total

    if args.bind_numa_node:
        import numa
        # Start one Ray worker per NUMA node, bound to that node's CPUs and memory
        numa_node_count = numa.info.get_num_configured_nodes()
        cpu_count_per_socket = cpu_count // numa_node_count
        memory_per_socket = memory // numa_node_count
        for i in range(numa_node_count):
            subprocess.run(
                [
                    "numactl",
                    "-N",
                    str(i),
                    "-m",
                    str(i),
                    "ray",
                    "start",
                    "--address",
                    args.ray_address,
                    "--num-cpus",
                    str(cpu_count_per_socket),
                    "--memory",
                    str(memory_per_socket),
                ],
                check=True,
            )
    else:
        subprocess.run(
            [
                "ray",
                "start",
                "--address",
                args.ray_address,
                "--num-cpus",
                str(cpu_count),
            ],
            check=True,
        )

    # keep printing logs
    while True:
        try:
            subprocess.run(["tail", "-F", log_path], check=True)
        except subprocess.CalledProcessError as e:
            # XXX: sometimes it raises `No such file or directory`
            # don't know why. just ignore it
            print(e)

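If one node hosts two workers and both need to compute on the same data, how can memory copies be avoided?

smallpond brings in MPI. MPI is a widely used standard for parallel and distributed computing: it lets multiple processes run in parallel on different machines, or on different cores of one machine, and exchange data by message passing. Put simply, each process both performs computation and exchanges data with other processes. The usual first thought for that exchange is RPC or some other network call, but in large-scale scientific computing and machine learning, how efficiently an algorithm runs depends heavily on the communication pattern.

This brings up another concept: RDMA.

RDMA (Remote Direct Memory Access) was created to cut the data-processing latency on both the client and server sides of a network transfer. It moves data directly from the memory of one computer into that of another without involving either operating system, which allows high-throughput, low-latency networking and suits large parallel clusters especially well.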

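MPI lends itself to high-performance communication largely because mainstream implementations support RDMA transports well. The name RDMA itself summarizes how it differs from the traditional path:

Remote: data is transferred over the network to and from a remote machine.
Direct: the kernel is not involved; everything about the transfer is offloaded to the NIC.
Memory: data moves directly between user-space virtual memory and the RDMA NIC (RNIC), without going through the kernel and with no extra data movement or copying.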
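
The "no extra copy" goal also has a single-machine analogue that is easy to try: two processes on one node can map the same block of shared memory instead of each holding its own copy. The sketch below uses Python's standard `multiprocessing.shared_memory` module purely as an illustration of that idea; it is not RDMA, not MPI, and not smallpond's mechanism, and the function name is hypothetical. For brevity the second handle is opened in the same process, where a real second worker would attach by name.

```python
from multiprocessing import shared_memory


def roundtrip_via_shared_memory(payload: bytes) -> bytes:
    """Write payload into a shared block, then read it through a second handle."""
    writer = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        writer.buf[: len(payload)] = payload
        # A second worker would attach with the block's name.
        reader = shared_memory.SharedMemory(name=writer.name)
        try:
            # The memoryview slice is a window onto the same memory, not a copy.
            view = reader.buf[: len(payload)]
            result = bytes(view)  # copied out only to return a plain value
            view.release()        # views must be released before close()
        finally:
            reader.close()
        return result
    finally:
        writer.close()
        writer.unlink()  # free the block once no process needs it
```

In a real setup the readers would operate on the view directly (for example by wrapping it in an array type) rather than copying it out with `bytes`.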

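When worker processes read task data, they communicate through the MPI framework, whose inter-process communication is implemented in native code.

With MPI, users can define custom tasks: a user's own Python script can contain an MPI program and use MPI for efficient collective communication. Workers are also bound to NUMA nodes, which makes memory access more efficient.

smallpond uses MPI to manage the worker processes across nodes; the platform class below launches them through mpirun.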


class MPI(Platform):
    """
    MPI platform.
    """

    @staticmethod
    def is_available() -> bool:
        return shutil.which("mpirun") is not None

    def start_job(
        self,
        num_nodes: int,
        entrypoint: str,
        args: List[str],
        envs: dict = {},
        extra_opts: dict = {},
    ) -> List[str]:
        mpirun_cmd = ["mpirun", "-n", str(num_nodes)]
        for key, value in envs.items():
            mpirun_cmd += ["-x", f"{key}={value}"]
        mpirun_cmd += ["python", entrypoint] + args

        logger.debug(f"start job with command: {' '.join(mpirun_cmd)}")
        subprocess.Popen(
            mpirun_cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.STDOUT,
            text=True,
        )

        return []