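1. Introduction: From Compute Management to Developer Enablement
1.1 Technical Positioning of openFuyao
The openFuyao community is building an open software ecosystem for diverse compute clusters. It focuses on making cloud-native and AI-native technologies work together efficiently so that available compute is used to the fullest. As an open, cloud-native platform for heterogeneous compute, openFuyao offers the following core value:
- *Unified resource pooling*: brings heterogeneous hardware such as CPU, NPU, and KAE under a single management system
- *Intelligent scheduling engine*: matches workloads to hardware based on application characteristics and hardware capabilities to maximize resource utilization
- *Automated operations*: uses the Operator pattern for automatic hardware discovery, configuration, and lifecycle management
- *Open ecosystem*: provides standardized interfaces for hardware vendors, platform developers, and application developers
1.2 Goals and Intended Audience
This article is for developers who want to understand how openFuyao is implemented and who plan to extend it or integrate with it. It covers:
- *Core component internals*: how NPU Operator, KAE Operator, NFD, and related components work
- *Extension development guide*: how to develop a custom extension component and plug it into the openFuyao ecosystem
- *API and interface conventions*: the key interfaces and how to call them
- *Best practices*: integration approaches and code examples for typical scenarios
2. Core Architecture and Implementation Principles
2.1 Overall Architecture
2.1.1 Layered Architecture Model
openFuyao uses a "core platform + pluggable components" architecture, and its modular design makes the platform easy to extend: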

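2.1.2 Core Design Principles
- *Declarative configuration*: resource state is defined through CRDs (Custom Resource Definitions), and the Operator reconciles the actual state with the desired state
- *Controller pattern*: built on the Kubernetes Operator Framework for automated resource lifecycle management
- *Plugin-based extension*: new hardware types and scheduling policies can be added through standard interfaces
- *Observability first*: monitoring, logging, and tracing are built in
2.2 NPU Operator Implementation
2.2.1 How the Operator Works
NPU Operator is built on the Kubernetes Operator Framework. Its core workflow is as follows: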

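Key mechanisms:
- CR watching: the Operator watches CRs instantiated from the NPUClusterPolicy CRD and adjusts the state of the managed components when they change
- Node label discovery: building on the labels that NFD applies to nodes, the npu-feature-discovery component adds the node labels that Ascend components need for scheduling
- Component orchestration: Driver, Device Plugin, Runtime, and other components are deployed and managed automatically according to the CR configuration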
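The reconcile step behind these mechanisms can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the NPU Operator's actual code: the `ComponentSpec` type, the `Reconcile` function, the sample tag `v5.0.0`, and the assumption that `managed: false` means "leave the component alone" are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// ComponentSpec is a hypothetical, simplified view of one component entry in
// the CR (for example driver or devicePlugin).
type ComponentSpec struct {
	Managed bool   // whether the Operator should manage this component
	Tag     string // desired image tag
}

// Reconcile compares the desired state from the CR with the observed state
// (component name -> deployed image tag) and returns the actions needed to
// converge. Unmanaged components are skipped (an assumption of this sketch).
func Reconcile(desired map[string]ComponentSpec, deployed map[string]string) []string {
	var actions []string
	for name, spec := range desired {
		if !spec.Managed {
			continue
		}
		current, ok := deployed[name]
		switch {
		case !ok:
			actions = append(actions, fmt.Sprintf("deploy %s:%s", name, spec.Tag))
		case current != spec.Tag:
			actions = append(actions, fmt.Sprintf("update %s:%s->%s", name, current, spec.Tag))
		}
	}
	sort.Strings(actions) // map iteration order is random; sort for stable output
	return actions
}

func main() {
	desired := map[string]ComponentSpec{
		"driver":       {Managed: true, Tag: "latest"},
		"devicePlugin": {Managed: true, Tag: "v6.0.0"},
		"ociRuntime":   {Managed: false, Tag: "latest"},
	}
	deployed := map[string]string{"devicePlugin": "v5.0.0"}
	for _, a := range Reconcile(desired, deployed) {
		fmt.Println(a)
	}
}
```

A real Operator would run this comparison every time a watched object changes, and would apply the resulting actions through the Kubernetes API instead of returning strings.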
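2.2.2 Component Stack
NPU Operator manages the following core components:
Table 1 Components managed by NPU Operator
| Component | Deployment | Description |
|-----------------------|--------------|---------------------------------------------------|
| Ascend driver and firmware | Containerized | Bridges the hardware and the operating system so that the OS can recognize and communicate with the device |
| Ascend Device Plugin | Containerized | Built on the Kubernetes device plugin mechanism; adds device discovery, device allocation, and device health reporting for Ascend AI processors |
| Ascend Operator | Containerized | Companion component for Volcano; manages acjob-type jobs and injects the environment variables that AI framework training jobs need into containers |
| Ascend Docker Runtime | Containerized | Container engine plugin that gives all AI jobs containerized access to NPUs |
| NPU Exporter | Containerized | Monitors Ascend AI processor resource data in real time, including utilization, temperature, voltage, and memory usage |
| Volcano | Containerized | Obtains cluster resource information from the lower-level components and picks the best scheduling policy based on how the Ascend chips are interconnected |
2.2.3 CRD Specification
The NPUClusterPolicy CRD defines the desired state of NPU resources: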
apiVersion: npu.openfuyao.com/v1
kind: NPUClusterPolicy
metadata:
  name: cluster
spec:
  # Driver configuration
  driver:
    managed: true
    version: "24.1.RC3"
    imageSpec:
      registry: cr.openfuyao.cn
      repository: openfuyao/npu-driver-installer
      tag: latest
  # Device plugin configuration
  devicePlugin:
    managed: true
    imageSpec:
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/ascend-k8sdeviceplugin
      tag: v6.0.0
  # Container runtime configuration
  ociRuntime:
    managed: true
    imageSpec:
      registry: cr.openfuyao.cn
      repository: openfuyao/npu-container-toolkit
      tag: latest
  # Volcano scheduler configuration
  vcscheduler:
    managed: true
    imageSpec:
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/vc-scheduler
      tag: v1.9.0-v6.0.0
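2.3 KAE Operator Implementation
2.3.1 How KAE Hardware Acceleration Works
KAE (Kunpeng Accelerator Engine) is the hardware acceleration capability built into Kunpeng 920 series processors. The core job of KAE Operator is to automatically manage and configure all the software components that KAE requires.
Resource changes the Operator watches:
- *CR changes*: the Operator's custom resource changes
- *Node changes*: nodes in the cluster change (for example, a node is added or a node label changes)
- *DaemonSet changes*: DaemonSets created by the Operator change
Resource types the Operator manages:
- *Label management*: manages labels based on the system, kernel, and hardware devices discovered by the NFD service
- *Component management*: assembles resource manifests and installs services according to the CR configuration
2.3.2 KAE Device Resource Model
KAE Device Plugin reports the KAE devices on a node as Kubernetes extended resources: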
# View the KAE resources on a node
kubectl describe nodes <nodeName>
# Example output
Allocatable:
  cpu: 8
  memory: 16005200Ki
  openfuyao.com/kae.hpre: 2 # number of HPRE devices
  pods: 110
2.3.3 Example Workload Configuration
A workload that uses KAE acceleration can be configured as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-kae-accelerated
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-kae
  template:
    metadata:
      labels:
        app: nginx-kae
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          resources:
            # Kubernetes requires extended resources to be set in limits
            limits:
              openfuyao.com/kae.hpre: 1 # request one KAE HPRE device
          volumeMounts:
            - name: openssl-conf
              mountPath: /etc/ssl/openssl.cnf
              subPath: openssl.cnf
          env:
            - name: OPENSSL_CONF
              value: /etc/ssl/openssl.cnf
      volumes:
        - name: openssl-conf
          configMap:
            name: kae-openssl-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kae-openssl-config
data:
  openssl.cnf: |
    openssl_conf=openssl_def
    [openssl_def]
    engines=engine_section
    [engine_section]
    kae=kae_section
    [kae_section]
    engine_id=kae
    dynamic_path=/usr/local/lib/engines-1.1/kae.so
    KAE_CMD_ENABLE_ASYNC=1
    default_algorithms=ALL
    init=1
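2.4 NFD Node Feature Discovery
2.4.1 NFD Workflow
NFD (Node Feature Discovery) is a key component of the openFuyao architecture. It automatically identifies the hardware features of each node:
Hardware features → feature sources → NFD Agent → label processing → node labels → scheduler
Feature source types:
- *System feature sources*: CPU model, core count, frequency, and instruction sets; memory capacity, type, and speed
- *Hardware feature sources*: GPU/NPU device count and model; dedicated hardware such as KAE and FPGA
- *Custom feature sources*: user-defined feature discovery logic
2.4.2 Custom NodeFeatureRules
Developers can extend feature discovery with NodeFeatureRule resources: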
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: custom-npu-rule
spec:
  rules:
    - name: "detect-ascend-npu"
      labels:
        accelerator: "npu"
        npu-vendor: "ascend"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["19e5"]} # Huawei PCI vendor ID
            class: {op: In, value: ["1200"]} # accelerator device class
    - name: "detect-kae-hpre"
      labels:
        accelerator: "kae"
        kae-hpre: "true"
      matchFeatures:
        - feature: kernel.loadedmodule
          matchExpressions:
            hisi_hpre: {op: Exists}
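To make the matching semantics of such a rule concrete, here is a small Go sketch of how the `In` and `Exists` operators could be evaluated against discovered features. It is a simplified illustration, not NFD's implementation: the `MatchExpr` type and the `matchAttrs` and `anyMatch` helpers are hypothetical names, and only the two operators used above are handled.

```go
package main

import "fmt"

// MatchExpr mirrors the {op, value} pairs used in the rule above.
// Only the two operators that appear in this article are handled.
type MatchExpr struct {
	Op    string
	Value []string
}

// matchAttrs reports whether one feature instance (attribute name -> value)
// satisfies every expression; expressions are combined with logical AND.
func matchAttrs(attrs map[string]string, exprs map[string]MatchExpr) bool {
	for key, expr := range exprs {
		got, present := attrs[key]
		switch expr.Op {
		case "Exists":
			if !present {
				return false
			}
		case "In":
			if !present || !contains(expr.Value, got) {
				return false
			}
		default:
			return false // operator not supported by this sketch
		}
	}
	return true
}

// contains reports whether s is an element of list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// anyMatch reports whether at least one instance matches all expressions.
func anyMatch(instances []map[string]string, exprs map[string]MatchExpr) bool {
	for _, inst := range instances {
		if matchAttrs(inst, exprs) {
			return true
		}
	}
	return false
}

func main() {
	rule := map[string]MatchExpr{
		"vendor": {Op: "In", Value: []string{"19e5"}},
		"class":  {Op: "In", Value: []string{"1200"}},
	}
	pciDevices := []map[string]string{
		{"vendor": "8086", "class": "0200"},
		{"vendor": "19e5", "class": "1200"},
	}
	fmt.Println(anyMatch(pciDevices, rule)) // true: the second device matches
}
```

When a rule matches, its `labels` are applied to the node, which is what later allows a `nodeSelector` such as `accelerator: npu` to target it.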
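2.5 Many-Core Scheduling Implementation
2.5.1 Scheduling Algorithm
Many-core scheduling targets servers with 256 or more cores and uses a multi-dimensional weighted scoring algorithm. The dominant resource for each business type gets a weight of 0.6 and the other two get 0.2 each, so a node scores higher when that resource is less utilized.
Scoring formulas by business type:
I/O-intensive:
score = 0.6×(1 - DiskIOUsage) + 0.2×(1 - CPUUsage) + 0.2×(1 - MemUsage)
Memory-sensitive:
score = 0.6×(1 - MemUsage) + 0.2×(1 - CPUUsage) + 0.2×(1 - DiskIOUsage)
Compute-intensive:
score = 0.6×(1 - CPUUsage) + 0.2×(1 - MemUsage) + 0.2×(1 - DiskIOUsage)
2.5.2 Core Scheduling Code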
// calculateTypeScore computes the score for a specific business type.
// All usage arguments are utilization ratios in the range [0, 1].
func calculateTypeScore(businessType string, cpuUsage, memUsage, diskIOUsage float64) float64 {
	const (
		mainWeight    = 0.6
		minorWeight   = 0.2
		averageWeight = 0.33
	)
	switch businessType {
	case "io-intensive":
		return (1-diskIOUsage)*mainWeight + (1-cpuUsage)*minorWeight + (1-memUsage)*minorWeight
	case "memory-sensitive":
		return (1-memUsage)*mainWeight + (1-cpuUsage)*minorWeight + (1-diskIOUsage)*minorWeight
	case "compute-intensive":
		return (1-cpuUsage)*mainWeight + (1-memUsage)*minorWeight + (1-diskIOUsage)*minorWeight
	default:
		// Unknown types fall back to a balanced score
		return (1-cpuUsage)*averageWeight + (1-memUsage)*averageWeight + (1-diskIOUsage)*averageWeight
	}
}
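As a usage illustration, the sketch below applies the same weights to two made-up nodes and picks the higher-scoring one for each business type. The `NodeUsage` type, the `score` and `pickNode` helpers, and the utilization numbers are assumptions for this example only; how the actual scheduler combines this score with other plugins is not covered here.

```go
package main

import "fmt"

// NodeUsage holds utilization ratios in the range [0, 1] for one node.
type NodeUsage struct {
	Name             string
	CPU, Mem, DiskIO float64
}

// score applies the weighted formulas from section 2.5.1 to a node.
func score(businessType string, n NodeUsage) float64 {
	const mainW, minorW, avgW = 0.6, 0.2, 0.33
	cpu, mem, disk := 1-n.CPU, 1-n.Mem, 1-n.DiskIO // free share of each resource
	switch businessType {
	case "io-intensive":
		return disk*mainW + cpu*minorW + mem*minorW
	case "memory-sensitive":
		return mem*mainW + cpu*minorW + disk*minorW
	case "compute-intensive":
		return cpu*mainW + mem*minorW + disk*minorW
	default:
		return (cpu + mem + disk) * avgW
	}
}

// pickNode returns the highest-scoring node; ok is false if nodes is empty.
func pickNode(businessType string, nodes []NodeUsage) (best NodeUsage, ok bool) {
	bestScore := -1.0
	for _, n := range nodes {
		if s := score(businessType, n); s > bestScore {
			best, bestScore, ok = n, s, true
		}
	}
	return best, ok
}

func main() {
	nodes := []NodeUsage{
		{Name: "node-a", CPU: 0.8, Mem: 0.3, DiskIO: 0.2}, // busy CPU, idle disk
		{Name: "node-b", CPU: 0.2, Mem: 0.7, DiskIO: 0.6}, // idle CPU, busier disk
	}
	for _, t := range []string{"compute-intensive", "io-intensive"} {
		if n, ok := pickNode(t, nodes); ok {
			fmt.Printf("%s -> %s\n", t, n.Name)
		}
	}
}
```

With these numbers the compute-intensive workload lands on node-b (0.62 versus 0.42) and the I/O-intensive workload on node-a (0.66 versus 0.46), which shows how the 0.6 weight steers each type toward the node with the most headroom in its dominant resource.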
2.5.3 Webhook Validation
Many-core scheduling includes a webhook that rejects invalid requests:
import (
	"fmt"
	"strings"
)

// validateBusinessType validates the business type annotation.
// A missing annotation is treated as valid.
func validateBusinessType(annotations map[string]string) (bool, string) {
	businessType, exists := annotations["business.workload/type"]
	if !exists {
		return true, ""
	}
	allowedTypes := map[string]bool{
		"io-intensive":      true,
		"memory-sensitive":  true,
		"compute-intensive": true,
	}
	// The annotation may hold several comma-separated types
	types := strings.Split(businessType, ",")
	var invalidTypes []string
	for _, t := range types {
		t = strings.TrimSpace(t)
		if t != "" && !allowedTypes[t] {
			invalidTypes = append(invalidTypes, t)
		}
	}
	if len(invalidTypes) > 0 {
		return false, fmt.Sprintf("invalid business type(s): %s; valid types: io-intensive, memory-sensitive, compute-intensive",
			strings.Join(invalidTypes, ", "))
	}
	return true, ""
}
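3. Extension Component Development Guide
3.1 Extension Component Architecture
openFuyao is pluggable: developers can build their own components to extend both the platform's backend capabilities and its frontend UI.
3.1.1 Extension Component System Architecture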

3.2 Frontend Extension Development
3.2.1 Extension Configuration File
Create an extension.js configuration file:
// extension.js
const config = {
  menu: {
    pluginName: 'my-extension', // unique name of the extension component
  },
};
export default config;
3.2.2 Changing the Entry Point
Change the mount point used by the frontend entry point:
<!-- Before: index.html -->
<div id="root"></div>
<!-- After: index.html (the extension component name is used as a prefix) -->
<div id="my-extension_root"></div>
// Before: index.jsx
import React from 'react';
import ReactDOM from 'react-dom';
import App from './App';
ReactDOM.render(<App />, document.querySelector('#root'));
// After: index.jsx
import React from 'react';
import ReactDOM from 'react-dom';
import App from './App';
import extensionConfig from './extension.js';
ReactDOM.render(
  <App />,
  document.querySelector(`#${extensionConfig.menu.pluginName}_root`)
);
3.2.3 Vite Build Configuration
// vite.config.js
import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';
import cssInjectedByJsPlugin from 'vite-plugin-css-injected-by-js';
import postcssPrefixSelector from 'postcss-prefix-selector';
import path from 'path';
import extension from './src/extension.js';
const { pluginName } = extension.menu;
export default defineConfig({
  plugins: [
    react(),
    cssInjectedByJsPlugin(),
  ],
  build: {
    lib: {
      entry: path.resolve(__dirname, 'index.html'),
      name: pluginName,
      formats: ['es'],
    },
    rollupOptions: {
      output: {
        entryFileNames: `${pluginName}/${pluginName}.mjs`,
      },
    },
  },
  css: {
    postcss: {
      plugins: [
        postcssPrefixSelector({
          prefix: `#${pluginName}_root`,
          transform(prefix, selector, prefixedSelector) {
            if (selector.startsWith('body') || selector.startsWith('html')) {
              return selector;
            }
            return prefixedSelector;
          },
        }),
      ],
    },
  },
  define: {
    'process.env': 'new Object({ NODE_ENV: "production" })',
  },
});
3.3 ConsolePlugin CRD Configuration
3.3.1 CRD Specification
apiVersion: console.openfuyao.com/v1beta1
kind: ConsolePlugin
metadata:
  name: my-extension
spec:
  # Unique name of the extension component
  pluginName: my-extension
  # Name shown in the menu
  displayName: "My Extension"
  # Mount location: "/" (navigation bar) or "/container_platform" (left-hand menu)
  entrypoint: /container_platform
  # Menu order (optional)
  order: 100
  # Second-level menu configuration (optional)
  subPages:
    - pageName: dashboard
      displayName: "Dashboard"
    - pageName: settings
      displayName: "Settings"
  # Backend service configuration
  backend:
    type: Service
    service:
      name: my-extension-backend
      namespace: openfuyao-system
      port: 8080
      basePath: /api
  # Whether the plugin is enabled
  enabled: true
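Table 2 ConsolePlugin parameters
| Parameter | Type | Required | Description |
|---------------------------|------------|------------|------------------------------|
| pluginName | string | Yes | Unique name of the extension component; must match the frontend configuration |
| displayName | string | Yes | Name of the extension component shown in the menu |
| entrypoint | string | Yes | Mount location; / and /container_platform are supported |
| order | string | No | Position of the extension component in the menu |
| subPages | array | No | Second-level menu configuration |
| backend.type | string | Yes | Backend access method; only Service is currently supported |
| backend.service.name | string | Yes | Name of the backend Service |
| backend.service.namespace | string | Yes | Namespace of the backend Service |
| backend.service.port | int32 | No | Port of the backend Service; defaults to 80 |
| enabled | boolean | No | Whether the plugin is enabled; defaults to true |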
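3.4 Backend Service Development Conventions
3.4.1 API Path Convention
Backend endpoints of an extension component must start with /rest/{pluginName}:
// Go example using the Gin framework
package main

import (
	"log"
	"net/http"

	"github.com/gin-gonic/gin"
)

func main() {
	r := gin.Default()
	// Route group for the extension component's API
	api := r.Group("/rest/my-extension")
	{
		api.GET("/dashboard", getDashboard)
		api.GET("/metrics", getMetrics)
		api.POST("/config", updateConfig)
	}
	if err := r.Run(":8080"); err != nil {
		log.Fatal(err)
	}
}

func getDashboard(c *gin.Context) {
	c.JSON(http.StatusOK, gin.H{
		"status": "ok",
		"data": map[string]interface{}{
			"totalNodes": 10,
			"activePods": 50,
		},
	})
}

// getMetrics and updateConfig are placeholder handlers so that the example
// compiles; replace them with real logic.
func getMetrics(c *gin.Context) {
	c.JSON(http.StatusOK, gin.H{"status": "ok"})
}

func updateConfig(c *gin.Context) {
	c.JSON(http.StatusOK, gin.H{"status": "ok"})
}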
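One way to catch violations of the path convention early is a small self-check at startup or in a unit test. The sketch below uses only the standard library; `hasPluginPrefix` is a hypothetical helper name and is not part of any openFuyao SDK.

```go
package main

import (
	"fmt"
	"strings"
)

// hasPluginPrefix reports whether path is under /rest/{pluginName}.
// It requires a full path-segment match, so "/rest/my-extension-v2/x"
// does not count as belonging to "my-extension".
func hasPluginPrefix(pluginName, path string) bool {
	prefix := "/rest/" + pluginName
	return path == prefix || strings.HasPrefix(path, prefix+"/")
}

func main() {
	routes := []string{
		"/rest/my-extension/dashboard",
		"/rest/my-extension/metrics",
		"/api/config", // violates the convention
	}
	for _, r := range routes {
		fmt.Printf("%-32s %v\n", r, hasPluginPrefix("my-extension", r))
	}
}
```

Comparing against `prefix + "/"` rather than the bare prefix avoids accepting a different plugin whose name merely starts with the same characters.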
3.4.2 Kubernetes Service Configuration
apiVersion: v1
kind: Service
metadata:
  name: my-extension-backend
  namespace: openfuyao-system
  labels:
    app: my-extension
spec:
  type: ClusterIP
  ports:
    - port: 8080
      targetPort: 8080
      protocol: TCP
  selector:
    app: my-extension-backend
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-extension-backend
  namespace: openfuyao-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-extension-backend
  template:
    metadata:
      labels:
        app: my-extension-backend
    spec:
      containers:
        - name: backend
          image: my-registry/my-extension-backend:v1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
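4. Authentication and Authorization Integration
4.1 OAuth 2.0 Authentication
The openFuyao platform provides a unified user login and authentication service that supports the standard OAuth 2.0 login flow.
4.1.1 Authentication Architecture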

4.2 OAuth-Proxy Integration
4.2.1 Deployment Configuration
Add OAuth-Proxy to the extension component as a sidecar container:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-extension-frontend
  namespace: openfuyao-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-extension-frontend
  template:
    metadata:
      labels:
        app: my-extension-frontend
    spec:
      serviceAccountName: my-extension-oauth-proxy
      containers:
        # Frontend container of the extension component
        - name: frontend
          image: my-registry/my-extension-frontend:v1.0.0
          ports:
            - name: http
              containerPort: 80
              protocol: TCP
          volumeMounts:
            - name: nginx-config
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
        # OAuth-Proxy sidecar container
        - name: oauth-proxy
          image: cr.openfuyao.cn/openfuyao/oauth-proxy:v24.09
          ports:
            - containerPort: 9093
              name: proxy
          args:
            - --https-address=
            - --http-address=:9093
            - --email-domain=*
            - --provider=openfuyao
            - --client-id=oauth-proxy
            - --client-secret=SECRETTS
            - --upstream=http://localhost:80
            # Quoted so that YAML does not parse the embedded ": " as a mapping
            - '--openfuyao-delegate-urls={"/":{"resource": "services/proxy", "group": ""}}'
            - --redirect-url=/
            - --login-url=/oauth2/oauth/authorize
            - --redeem-url=/oauth2/oauth/token
            - --root-prefix=/my-extension
            - --cookie-httponly
      volumes:
        - name: nginx-config
          configMap:
            name: my-extension-nginx
4.2.2 ServiceAccount and RBAC Configuration
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-extension-oauth-proxy
  namespace: openfuyao-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-extension-oauth-proxy-webhook-auth
subjects:
  - kind: ServiceAccount
    name: my-extension-oauth-proxy
    namespace: openfuyao-system
roleRef:
  kind: ClusterRole
  name: webhook-auth
  apiGroup: rbac.authorization.k8s.io
4.2.3 Service Configuration
apiVersion: v1
kind: Service
metadata:
  name: my-extension-frontend
  namespace: openfuyao-system
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: proxy # points at the OAuth-Proxy port
  selector:
    app: my-extension-frontend
4.3 values.yaml Template
# values.yaml
service:
  containerPort: 80
backend: "http://my-extension-backend.openfuyao-system.svc"
enableOAuth: true
oauthProxy:
  containerPort: 9093
  image:
    repository: "cr.openfuyao.cn/openfuyao/oauth-proxy"
    pullPolicy: Always
    tag: "v24.09"
  client:
    id: "oauth-proxy"
    secret: "SECRETTS"
  urls:
    host: "/"
    loginURI: "/oauth2/oauth/authorize"
    redeemURI: "/oauth2/oauth/token"
  rootPrefix: "/my-extension"
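5. Monitoring Integration
5.1 Monitoring Architecture
The openFuyao monitoring system is built on the Prometheus ecosystem and supports custom Exporters and ServiceMonitor configuration.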

5.2 Developing a Custom Exporter
5.2.1 Exporter Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-exporter
  namespace: my-exporter-namespace
  labels:
    app: my-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-exporter
  template:
    metadata:
      labels:
        app: my-exporter
    spec:
      containers:
        - name: my-exporter
          image: my-registry/my-exporter:latest
          ports:
            - containerPort: 9100
              name: metrics
          resources:
            requests:
              cpu: "50m"
              memory: "64Mi"
            limits:
              cpu: "200m"
              memory: "256Mi"
5.2.2 Exporter Service
apiVersion: v1
kind: Service
metadata:
  name: my-exporter
  namespace: my-exporter-namespace
  labels:
    app: my-exporter
spec:
  ports:
    - port: 9100
      targetPort: 9100
      name: metrics
  selector:
    app: my-exporter
5.3 ServiceMonitor Configuration
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-exporter-servicemonitor
  namespace: my-exporter-namespace
  labels:
    app: my-exporter
spec:
  # Scrape endpoint configuration
  endpoints:
    - interval: 30s # scrape interval
      port: metrics # port name
      path: /metrics # metrics path
      scheme: http
  # Service selector
  selector:
    matchLabels:
      app: my-exporter
  # Namespace selector
  namespaceSelector:
    matchNames:
      - my-exporter-namespace
5.4 Example Exporter in Go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Metric definitions
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "my_app_requests_total",
			Help: "Total number of requests",
		},
		[]string{"method", "endpoint"},
	)
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "my_app_request_duration_seconds",
			Help:    "Request duration in seconds",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)
	activeConnections = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "my_app_active_connections",
			Help: "Number of active connections",
		},
	)
)

func init() {
	// Register the metrics with the default registry
	prometheus.MustRegister(requestsTotal)
	prometheus.MustRegister(requestDuration)
	prometheus.MustRegister(activeConnections)
}

func main() {
	// Expose the /metrics endpoint
	http.Handle("/metrics", promhttp.Handler())
	// Start the HTTP server; ListenAndServe only returns on error
	log.Fatal(http.ListenAndServe(":9100", nil))
}
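For orientation, the sketch below shows roughly what Prometheus receives when it scrapes `/metrics` for a counter like `my_app_requests_total`. It renders the Prometheus text exposition format by hand using only the standard library, purely as an illustration; in a real exporter `promhttp.Handler()` produces this output. The `Sample` type and `renderCounter` function are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// Sample is one labelled counter value (hypothetical helper type).
type Sample struct {
	Method, Endpoint string
	Value            float64
}

// renderCounter formats samples in the Prometheus text exposition format:
// a HELP line, a TYPE line, then one line per label combination.
func renderCounter(name, help string, samples []Sample) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s counter\n", name)
	for _, s := range samples {
		// %q quotes the label values; adequate for plain ASCII values
		fmt.Fprintf(&b, "%s{endpoint=%q,method=%q} %g\n", name, s.Endpoint, s.Method, s.Value)
	}
	return b.String()
}

func main() {
	fmt.Print(renderCounter(
		"my_app_requests_total",
		"Total number of requests",
		[]Sample{
			{Method: "GET", Endpoint: "/dashboard", Value: 3},
			{Method: "POST", Endpoint: "/config", Value: 1},
		},
	))
}
```

Each label combination becomes its own time series, which is why label values such as `endpoint` should come from a small, bounded set.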
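6. Ray Distributed Computing Integration
6.1 Ray Architecture
openFuyao Ray manages Ray clusters and jobs and supports several workload types, including RayCluster, RayJob, and RayService.
6.1.1 System Architecture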

6.2 RayCluster Configuration Examples
6.2.1 Basic RayCluster
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: my-ray-cluster
  namespace: default
spec:
  rayVersion: '2.41.0'
  # Head node configuration
  headGroupSpec:
    serviceType: NodePort
    rayStartParams:
      num-cpus: "0"
    template:
      spec:
        containers:
          - name: ray-head
            image: docker.io/rayproject/ray:2.41.0
            resources:
              requests:
                cpu: "500m"
                memory: "1Gi"
              limits:
                cpu: "2"
                memory: "4Gi"
            ports:
              - containerPort: 6379
                name: gcs-server
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
  # Worker node configuration
  workerGroupSpecs:
    - replicas: 2
      minReplicas: 1
      maxReplicas: 5
      groupName: default-worker
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: docker.io/rayproject/ray:2.41.0
              resources:
                requests:
                  cpu: "1"
                  memory: "2Gi"
                limits:
                  cpu: "4"
                  memory: "8Gi"
6.2.2 RayCluster with NPU Acceleration
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-npu-cluster
spec:
  rayVersion: '2.41.0'
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      num-cpus: "0"
    template:
      spec:
        containers:
          - name: ray-head
            image: cr.openfuyao.cn/openfuyao/ray-npu:2.41.0
            resources:
              requests:
                cpu: "1"
                memory: "2Gi"
              limits:
                cpu: "2"
                memory: "4Gi"
  workerGroupSpecs:
    - replicas: 2
      groupName: npu-worker
      rayStartParams:
        num-gpus: "1" # Ray treats the NPU as a GPU resource here
      template:
        spec:
          nodeSelector:
            accelerator: npu
          containers:
            - name: ray-worker
              image: cr.openfuyao.cn/openfuyao/ray-npu:2.41.0
              resources:
                requests:
                  cpu: "4"
                  memory: "16Gi"
                  huawei.com/Ascend910: "1" # request one NPU
                limits:
                  cpu: "8"
                  memory: "32Gi"
                  huawei.com/Ascend910: "1"
6.3 RayJob Configuration Example
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: my-ray-job
spec:
  # Job entrypoint
  entrypoint: python /app/train.py --epochs 100
  # Runtime environment
  runtimeEnvYAML: |
    pip:
      - torch
      - numpy
    working_dir: /app
  # Configuration of the associated RayCluster
  rayClusterSpec:
    rayVersion: '2.41.0'
    headGroupSpec:
      rayStartParams:
        num-cpus: "0"
      template:
        spec:
          containers:
            - name: ray-head
              image: docker.io/rayproject/ray:2.41.0
              resources:
                requests:
                  cpu: "500m"
                  memory: "1Gi"
    workerGroupSpecs:
      - replicas: 2
        groupName: worker
        template:
          spec:
            containers:
              - name: ray-worker
                image: docker.io/rayproject/ray:2.41.0
                resources:
                  requests:
                    cpu: "2"
                    memory: "4Gi"
  # Whether to delete the cluster after the job finishes
  shutdownAfterJobFinishes: true
  # Job timeout in seconds
  activeDeadlineSeconds: 3600
7. Helm Chart Packaging Conventions
7.1 Chart.yaml Configuration
When an extension component is packaged as a Helm chart, the openfuyao-extension keyword must be added:
# Chart.yaml
apiVersion: v2
name: my-extension
description: A Helm chart for openFuyao my-extension component
type: application
version: 1.0.0
appVersion: "1.0.0"
# Keywords that mark the chart as an openFuyao extension component
keywords:
  - openfuyao-extension
  - monitoring
  - custom
# Dependencies
dependencies:
  - name: common
    version: "1.x.x"
    repository: "https://charts.bitnami.com/bitnami"
# Maintainers
maintainers:
  - name: Your Name
    email: your.email@example.com
7.2 values.yaml Template
# values.yaml
# Global configuration
global:
  imageRegistry: cr.openfuyao.cn
  imagePullSecrets: []
# Frontend service configuration
frontend:
  enabled: true
  replicaCount: 1
  image:
    repository: openfuyao/my-extension-frontend
    tag: latest
    pullPolicy: IfNotPresent
  service:
    type: ClusterIP
    port: 80
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
# Backend service configuration
backend:
  enabled: true
  replicaCount: 2
  image:
    repository: openfuyao/my-extension-backend
    tag: latest
    pullPolicy: IfNotPresent
  service:
    type: ClusterIP
    port: 8080
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 1000m
      memory: 1Gi
# OAuth configuration
enableOAuth: true
oauthProxy:
  image:
    repository: openfuyao/oauth-proxy
    tag: v24.09
  client:
    id: oauth-proxy
    secret: SECRETTS
# openFuyao platform integration
openfuyao: true
7.3 Directory Structure
my-extension/
├── Chart.yaml
├── values.yaml
├── charts/
│ └── (subcharts)
├── templates/
│ ├── _helpers.tpl
│ ├── deployment-frontend.yaml
│ ├── deployment-backend.yaml
│ ├── service-frontend.yaml
│ ├── service-backend.yaml
│ ├── configmap.yaml
│ ├── consoleplugin.yaml
│ ├── serviceaccount.yaml
│ └── NOTES.txt
└── README.md
8. API Reference
See the official openFuyao documentation: https://docs.openfuyao.cn/docs/
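9. FAQ and Best Practices
9.1 NPU Issues
Q1: NPU devices are not discovered
- Check that the NPU driver is installed correctly: npu-smi info
- Check that NFD is running: kubectl get pods -n openfuyao-system | grep nfd
- Inspect the NFD logs to find out why discovery failed
Q2: An application cannot obtain NPU resources
- Check that the node has the NPU label: kubectl get nodes --show-labels | grep npu
- Check that the application's nodeSelector is configured correctly
- Check that the resource quota is sufficient
9.2 KAE Issues
Q1: KAE acceleration has little effect
- Check that the KAE driver is installed correctly
- Confirm that the application has the OpenSSL engine configured correctly
- Profile the application to confirm that its bottleneck is something KAE can accelerate
Q2: How do I prevent the KAE driver from being installed on certain nodes?
kubectl label nodes $NODE openfuyao.com/kae.deploy.driver=false
9.3 Extension Component Development Issues
Q1: The extension's frontend page does not load
- Check that the ConsolePlugin CR was created correctly
- Confirm that the frontend JS module path is /dist/{pluginName}.mjs
- Check the console-service logs to diagnose the problem
Q2: The backend API is unreachable
- Confirm that the backend endpoints start with /rest/{pluginName}
- Check that the Service and Deployment are running normally
- Verify that the backend settings in the ConsolePlugin are correct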
9.4 Best Practice Recommendations
9.4.1 Resource Configuration
# Suggested resource configuration for production
resources:
  requests:
    cpu: "500m" # adjust to the actual load
    memory: "512Mi"
  limits:
    cpu: "2000m"
    memory: "2Gi"
9.4.2 High-Availability Deployment
# Multi-replica deployment
spec:
  replicas: 3
# Pod anti-affinity (set under the Pod template's spec)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - my-extension
          topologyKey: kubernetes.io/hostname
9.4.3 Logging and Monitoring
# Log rotation configuration
logRotate:
  compress: true
  logFile: /var/log/my-extension/app.log
  logLevel: info
  maxAge: 7
  rotate: 30
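Summary
This article described how the openFuyao diverse-compute enablement platform is implemented and how developers can integrate with it. It covered the following:
- *Core component internals*: how NPU Operator, KAE Operator, NFD, and related components work and how they are implemented
- *Extension development guide*: the full extension component workflow, including frontend mounting, backend services, and ConsolePlugin configuration
- *Authentication and authorization integration*: integrating OAuth-Proxy in sidecar mode
- *Monitoring integration*: developing custom Exporters and ServiceMonitors
- *Ray distributed computing*: configuration examples for RayCluster and RayJob
With its "core platform + pluggable components" architecture, openFuyao gives developers flexible ways to extend the platform. Hardware vendors that want to integrate a new accelerator and application developers that need a custom management UI can both do so through the standard interfaces and development conventions described here.
After reading this article, developers should be able to:
- Understand the openFuyao architecture and how its core components work
- Follow the extension component development workflow and best practices
- Integrate custom functionality into the openFuyao ecosystem quickly
- Build efficient, reliable cloud-native solutions for managing heterogeneous compute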