Understanding the Ray Cluster

Head Node
Every Ray cluster has one node designated as the head node. The head node is identical to the other worker nodes, except that it also runs the singleton processes responsible for cluster management, such as the autoscaler, the GCS, and the Ray driver processes that run Ray jobs.
Worker Node
Worker nodes do not run any of the head node's management processes; they only run the user code in Ray tasks and actors. They also participate in distributed scheduling and in the storage and distribution of Ray objects in cluster memory.
Autoscaling
The Ray autoscaler is a process that runs on the head node (or, on Kubernetes, as a sidecar container in the head pod). When the resource demands of the Ray workload exceed the cluster's current capacity, the autoscaler tries to increase the number of worker nodes; when worker nodes sit idle, it removes them from the cluster.
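Once a Ray cluster is up (we deploy one on Kubernetes below), a simple way to observe the autoscaler is to run ray status inside the head pod. A minimal sketch, with a placeholder pod name:
bash
# Find the head pod; KubeRay labels head pods with ray.io/node-type=head.
kubectl get pods -l ray.io/node-type=head

# Print current cluster resources and recent autoscaler activity.
# Replace <head-pod-name> with the pod name printed above.
kubectl exec -it <head-pod-name> -- ray status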
Ray on K8s
The recommended way to run Ray on Kubernetes is the KubeRay operator, which provides a Kubernetes-native way to manage Ray clusters. Each Ray cluster consists of a head pod and a group of worker pods. Optional autoscaling support lets the KubeRay operator size a Ray cluster to the demands of its workload, adding and removing Ray pods as needed. KubeRay supports heterogeneous compute nodes (including GPUs) as well as running multiple Ray clusters with different Ray versions in the same Kubernetes cluster.

Deploying KubeRay
Install the operator
bash
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0
Verify the installation
bash
kubectl get pod
NAME                                READY   STATUS    RESTARTS   AGE
kuberay-operator-67c6897444-pswq9   1/1     Running   0          33m
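The operator also registers the Ray CRDs; a quick way to confirm they are in place:
bash
# Expect rayclusters.ray.io, rayjobs.ray.io and rayservices.ray.io in the output.
kubectl get crds | grep ray.io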
Create Karpenter node pools for CPU and GPU resources
yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: cpu-pool
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
        - key: NodeGroupType
          operator: In
          values: ["x86-cpu-karpenter"]
        - key: type
          operator: In
          values: ["karpenter"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: cpu-node-class
      taints: []
      expireAfter: 720h # 30 * 24h = 720h
  limits:
    cpu: 100
    memory: 200Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30m
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["2xlarge", "4xlarge", "8xlarge", "12xlarge", "16xlarge", "24xlarge"]
        - key: NodeGroupType
          operator: In
          values: ["g5-gpu-karpenter"]
        - key: type
          operator: In
          values: ["karpenter"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-node-class
      taints:
        - key: "nvidia.com/gpu"
          effect: "NoSchedule"
      expireAfter: 720h # 30 * 24h = 720h
  limits:
    cpu: 200
    memory: 400Gi
    nvidia.com/gpu: 10
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30m
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-node-class
spec:
  amiFamily: AL2023
  amiSelectorTerms:
    - tags:
        karpenter.sh/discovery: "ray-on-eks"
    - name: gpu-ami
    - id: ami-0d86ad7fcc6222c91
  role: "KarpenterNodeRole-ray-on-eks" # replace with your cluster name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "ray-on-eks" # replace with your cluster name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "ray-on-eks" # replace with your cluster name
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 900Gi
        volumeType: gp3
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: cpu-node-class
spec:
  amiFamily: AL2023
  amiSelectorTerms:
    - tags:
        karpenter.sh/discovery: "ray-on-eks"
    - name: cpu-ami
    - id: ami-0d395232050cbe374
  role: "KarpenterNodeRole-ray-on-eks" # replace with your cluster name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "ray-on-eks" # replace with your cluster name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "ray-on-eks" # replace with your cluster name
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 900Gi
        volumeType: gp3
As shown above, we create separate CPU and GPU node pools, and the GPU nodes carry a "nvidia.com/gpu" taint so that only workloads that explicitly tolerate it are scheduled onto them.
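Once Karpenter provisions nodes, you can confirm the taint is in place; the custom-columns expression below is just one way to surface the taint keys:
bash
# Shows each node together with the keys of its taints (GPU nodes should list nvidia.com/gpu).
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'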
Verify the node pools
bash
$ kubectl get nodepools
NAME       NODECLASS        NODES   READY   AGE
cpu-pool   cpu-node-class   0       True    5d23h
gpu-pool   gpu-node-class   1       True    5d23h
Install the NVIDIA device plugin in the cluster
bash
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
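Once the device plugin DaemonSet is running, GPU nodes should advertise nvidia.com/gpu as an allocatable resource. A quick sanity check:
bash
# Lists each node with its allocatable GPU count (<none> on CPU-only nodes).
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'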
Deploy the Qwen2.5 7B model with a RayService
Note: replace hf_token with your own Hugging Face token.
yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-serve-llm
spec:
  serveConfigV2: |
    applications:
      - name: llms
        import_path: ray.serve.llm:build_openai_app
        route_prefix: "/"
        args:
          llm_configs:
            - model_loading_config:
                model_id: qwen2.5-7b-instruct
                model_source: Qwen/Qwen2.5-7B-Instruct
              engine_kwargs:
                dtype: bfloat16
                max_model_len: 1024
                device: auto
                gpu_memory_utilization: 0.75
              deployment_config:
                autoscaling_config:
                  min_replicas: 1
                  max_replicas: 4
                  target_ongoing_requests: 64
                max_ongoing_requests: 128
  rayClusterConfig:
    rayVersion: "2.46.0"
    headGroupSpec:
      rayStartParams:
        num-cpus: "0"
        num-gpus: "0"
      template:
        spec:
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          containers:
            - name: ray-head
              image: rayproject/ray-llm:2.46.0-py311-cu124
              ports:
                - containerPort: 8000
                  name: serve
                  protocol: TCP
                - containerPort: 8080
                  name: metrics
                  protocol: TCP
                - containerPort: 6379
                  name: gcs
                  protocol: TCP
                - containerPort: 8265
                  name: dashboard
                  protocol: TCP
                - containerPort: 10001
                  name: client
                  protocol: TCP
              resources:
                limits:
                  cpu: 2
                  memory: 4Gi
                requests:
                  cpu: 2
                  memory: 4Gi
    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        groupName: gpu-group
        rayStartParams:
          num-gpus: "4"
        template:
          spec:
            tolerations:
              - key: "nvidia.com/gpu"
                operator: "Exists"
                effect: "NoSchedule"
            containers:
              - name: ray-worker
                image: rayproject/ray-llm:2.46.0-py311-cu124
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-token
                        key: hf_token
                resources:
                  limits:
                    cpu: 32
                    memory: 32Gi
                    nvidia.com/gpu: "4"
                  requests:
                    cpu: 32
                    memory: 32Gi
                    nvidia.com/gpu: "4"
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  hf_token: xxxxxxxxxxxxxxxxxxxx
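After applying the manifest (the filename ray-serve-llm.yaml below is just an example), it can take several minutes for Karpenter to launch a g5 instance and for the model weights to download. One way to watch progress:
bash
# Apply the RayService and the Secret (example filename).
kubectl apply -f ray-serve-llm.yaml

# Watch the RayService until its services become ready.
kubectl get rayservice ray-serve-llm -w

# Inspect the Ray pods the operator creates (head plus the gpu-group worker).
kubectl get pods -l ray.io/node-type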
This RayService creates the following Service resources (you can verify them with the check below):
- ray-serve-llm-head-svc
  - Points to the head pod of the RayCluster, which acts as the Ray cluster's head node and is responsible for cluster management, task scheduling, and resource allocation.
- ray-serve-llm-serve-svc
  - Exposes the Ray Serve HTTP endpoint, usually on port 8000. This Service receives requests for the deployed Serve application.
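A quick way to confirm both Services exist (the serve Service may only appear once the Serve application reports healthy):
bash
kubectl get svc | grep ray-serve-llm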
Test the deployment
bash
$ kubectl port-forward svc/ray-serve-llm-serve-svc 8000:8000
$ curl --location 'http://localhost:8000/v1/chat/completions' --header 'Content-Type: application/json' --data '{
    "model": "qwen2.5-7b-instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Provide steps to serve an LLM using Ray Serve."
        }
    ]
}'
{"id":"qwen2.5-7b-instruct-8526c24f-0656-4b82-bada-c52b6d5a9894","object":"chat.completion","created":1761200852,"model":"qwen2.5-7b-instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Sure! Ray Serve is a high-performance and easy-to-use framework for serving machine learning models in Python. Here are the steps to serve an LLM (Large Language Model) using Ray Serve:\n\n### Step 1: Install Ray and Ray Serve\nFirst, you need to install the Ray library and Ray Serve component. You can do this using pip:\n\n```bash\npip install ray[serve]\n```\n\n### Step 2: Import Necessary Libraries\nImport the required libraries in your Python script:\n\n```python\nimport ray\nfrom ray import serve\n```\n\n### Step 3: Initialize Ray and Ray Serve\nInitialize Ray and Ray Serve. This can be done at the beginning of your script:\n\n```python\n# Start Ray\nray.init()\n\n# Initialize Ray Serve\nserve.init()\n```\n\n### Step 4: Define Your LLM Model\nDefine your language model. This could be a pre-trained model from libraries like Hugging Face or any custom model. Here's an example using Hugging Face's `transformers` library:\n\n```python\nfrom transformers import pipeline\n\n# Load your language model\nllm = pipeline(\"text-generation\", model=\"your-model-name\")\n```\n\n### Step 5: Create a Ray Serve Model Class\nCreate a Ray Serve model class using the `ray.remote` decorator to make it a remote class. This class will define the inference method called by the Ray Serve HTTP endpoint.\n\n```python\n@serve.deployment\nclass LLMModel:\n def __init__(self, llm):\n self.llm = llm\n\n def __call__(self, input_text):\n # Generate text using the LLM\n return self.llm(input_text, max_length=200)[0]['generated_text']\n```\n\n### Step 6: Deploy the Model as a Ray Serve Endpoint\nDeploy the Ray Serve model as an HTTP endpoint:\n\n```python\n# Deploy the model\nmodel = LLMModel.bind(llm)\n\n# Create a HTTP endpoint\nllm_endpoint = serve.create_endpoint(\"llm-endpoint\", \"/llm\")\n\n# Deploy the model to the endpoint\nserve.create_backend(model, llm_endpoint)\nserve.create_routes(llm_endpoint)\n```\n\n### Step 7: Make Predictions\nYou can now make predictions by sending HTTP requests to the endpoint. Ray Serve can handle large traffic and can scale automatically.\n\nHere's a simple example of how you might use `requests` to send a text request to the endpoint:\n\n```python\nimport requests\n\n# Define the input text\ninput_text = \"Once upon a time\"\n\n# Make a request to the endpoint\nresponse = requests.post(\"http://localhost:8000/llm\", json={\"text\": input_text})\n\n# Print the response\nprint(response.json()[\"predictions\"])\n```\n\n### Step 8: Scale Horizontally (Optional)\nRay Serve can scale your models horizontally. You can add more replicas to your backend to increase throughput:\n\n```python\nserve.scale(llm_endpoint, 4) # Scale to 4 replicas\n```\n\n### Step 9: Monitor and Adjust (Optional)\nYou can monitor the performance and health of your services using Ray dashboard:\n\n```python\n# Start the Ray dashboard\nray.init(python superhero=True)\n\nimport ray.dashboard.modules.serve.sdk\nserve_dashboard_actor = ray.get_actor(\" ServeDashboardActor\")\nserve_dashboard_url = ray/dashboard/serve\nprint(f\"Browse to {serve_dashboard_url} for more info.\")\n```\n\n### Conclusion\nThis process will set up a Ray Serve endpoint to serve your language model. 
You can further customize the model, handle different inference tasks, and scale as needed.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":30,"total_tokens":775,"completion_tokens":745,"prompt_tokens_details":null},"prompt_logprobs":null}
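Since build_openai_app serves an OpenAI-compatible API, the standard model-listing route should work as well; a minimal check (the response shape follows the OpenAI spec and is not captured from this cluster):
bash
# Expect qwen2.5-7b-instruct, the model_id configured above, in the returned list.
curl http://localhost:8000/v1/models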