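1. The Most Chaotic Team I Have Ever Seen
In 2018, I joined a startup.
On my first day, my team lead asked me to "just deploy" a service. I asked where the documentation was, and he said: "Documentation? We never wrote any. The code is on Git, and I'll send you the server password."
Here is what I went through that afternoon and night:
- Logging in to more than 20 servers one by one to deploy
- Copying the update package around on a USB drive (yes, really, a USB drive)
- Rolling back at 3 a.m. because the new version had a bug
- Discovering after the rollback that the code and the database schema no longer matched
That project had 10 developers. A single deployment took 3 hours, and every release kept the whole team up all night.
Only later, when I came across DevOps, did I learn that deployment could be this simple.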
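2. DevOps: From Concept to Practice
2.1 What Is DevOps?
DevOps = Development + Operations
It is the merging of development (Dev) and operations (Ops).
Core ideas:
- You build it, you run it: developers deploy and operate their own services
- Automate everything: build, test, and deployment are all automated
- Continuous improvement: iterate quickly, in small steps
- Focus on business value: cut waste and deliver more efficiently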
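2.2 The Value of DevOps
Without DevOps:
- Finished code waits 2 weeks before it goes live
- Deployment is manual and error-prone
- Environments differ: "it works on my machine"
- Troubleshooting is guesswork and takes a long time
With DevOps:
- Every commit is built, tested, and deployed automatically
- Deployment time drops from 3 hours to 5 minutes
- Environments are standardized: "everyone runs the same environment"
- Problems are traceable and quick to pin down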
3. CI/CD Pipeline Design
3.1 Core Pipeline Stages
groovy
// Jenkinsfile (Jenkins declarative pipeline)
pipeline {
    agent any
    environment {
        REGISTRY = 'registry.example.com'
        APP_NAME = 'order-service'
        DOCKER_IMAGE = "${REGISTRY}/${APP_NAME}:${BUILD_NUMBER}"
    }
    stages {
        stage('Checkout') {
            steps {
                checkout scm
                script {
                    env.GIT_COMMIT_SHORT = sh(
                        script: "git rev-parse --short HEAD",
                        returnStdout: true
                    ).trim()
                }
            }
        }
        stage('Build') {
            steps {
                sh '''
                    mvn clean package -DskipTests
                    # ls copes with several matching JARs, unlike [ -f target/*.jar ]
                    if ! ls target/*.jar >/dev/null 2>&1; then
                        echo "Build failed: no JAR file found"
                        exit 1
                    fi
                '''
            }
        }
        stage('Unit Tests') {
            steps {
                sh 'mvn test'
            }
            post {
                always {
                    junit 'target/surefire-reports/*.xml'
                }
            }
        }
        stage('Code Quality') {
            steps {
                // SONAR_TOKEN must be supplied to the job, for example from Jenkins credentials
                sh '''
                    mvn sonar:sonar \
                        -Dsonar.projectKey=${APP_NAME} \
                        -Dsonar.host.url=http://sonar:9000 \
                        -Dsonar.login=${SONAR_TOKEN}
                '''
            }
        }
        stage('Security Scan') {
            steps {
                sh '''
                    # OWASP dependency check
                    mvn org.owasp:dependency-check-maven:check
                '''
            }
            post {
                always {
                    archiveArtifacts artifacts: 'target/dependency-check-report.html'
                }
            }
        }
        stage('Build Docker Image') {
            steps {
                // Push both tags: the deploy stages below reference DOCKER_IMAGE
                sh """
                    docker build -t ${DOCKER_IMAGE} .
                    docker tag ${DOCKER_IMAGE} ${REGISTRY}/${APP_NAME}:${GIT_COMMIT_SHORT}
                    docker push ${DOCKER_IMAGE}
                    docker push ${REGISTRY}/${APP_NAME}:${GIT_COMMIT_SHORT}
                """
            }
        }
        stage('Deploy to Test') {
            when {
                branch 'develop'
            }
            steps {
                sh """
                    kubectl set image deployment/${APP_NAME} \
                        ${APP_NAME}=${DOCKER_IMAGE} \
                        -n test
                    kubectl rollout status deployment/${APP_NAME} -n test
                """
            }
        }
        stage('Deploy to Staging') {
            when {
                branch 'main'
            }
            steps {
                sh """
                    kubectl set image deployment/${APP_NAME} \
                        ${APP_NAME}=${DOCKER_IMAGE} \
                        -n staging
                    kubectl rollout status deployment/${APP_NAME} -n staging
                """
            }
        }
        stage('Deploy to Production') {
            when {
                tag "*"
            }
            steps {
                // Manual approval gates the production deployment itself.
                // Production runs on tag builds, so a prompt in the staging stage would never guard it.
                input message: 'Approve production deployment?', ok: 'Deploy'
                sh """
                    kubectl set image deployment/${APP_NAME} \
                        ${APP_NAME}=${DOCKER_IMAGE} \
                        -n production
                    kubectl rollout status deployment/${APP_NAME} -n production
                """
            }
        }
    }
    post {
        always {
            echo "Cleaning up..."
        }
        success {
            echo "Pipeline succeeded!"
            // Send a notification. `dingtalk` stands for a notification step from a plugin
            // or shared library; adapt the call to the signature your step actually has.
            dingtalk "✅ ${APP_NAME} build succeeded\nVersion: ${DOCKER_IMAGE}"
        }
        failure {
            echo "Pipeline failed!"
            // Send an alert
            dingtalk "❌ ${APP_NAME} build failed\nVersion: ${DOCKER_IMAGE}\nLog: ${env.BUILD_URL}"
        }
    }
}
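The `when` conditions in the pipeline above amount to a small routing rule: `develop` deploys to test, `main` deploys to staging, and tags deploy to production. A minimal shell sketch of that rule follows. The function name `target_env` and its two-argument form are illustrative assumptions, not part of Jenkins.

```shell
# Map a Git ref to the namespace the pipeline would deploy to.
# Usage: target_env <ref_type> <ref_name>   (ref_type is "branch" or "tag")
target_env() {
  local ref_type=$1 ref_name=$2
  if [ "$ref_type" = "tag" ]; then
    echo "production"
    return 0
  fi
  case "$ref_name" in
    develop) echo "test" ;;
    main)    echo "staging" ;;
    *)       echo "none" ;;   # other branches build and test but do not deploy
  esac
}

target_env branch develop   # prints: test
target_env tag v1.2.0       # prints: production
```

Keeping the rule in one place like this makes it easy to check that every ref lands in exactly one environment.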
3.2 GitLab CI Configuration
yaml
# .gitlab-ci.yml
stages:
  - build
  - test
  - package
  - security   # runs after package so that the image to scan already exists
  - deploy
variables:
  DOCKER_DRIVER: overlay2
  MAVEN_OPTS: "-Dmaven.repo.local=.m2/repository"
cache:
  paths:
    - .m2/repository
    - target/
# Build
build:
  stage: build
  image: maven:3.8-openjdk-11
  script:
    - mvn clean package -DskipTests
  artifacts:
    paths:
      - target/*.jar
    expire_in: 1 hour
# Unit tests
test:
  stage: test
  image: maven:3.8-openjdk-11
  script:
    - mvn test
    - mvn jacoco:report
  coverage: '/Total.*?([0-9]{1,3})%/'
  artifacts:
    reports:
      junit: target/surefire-reports/*.xml
      coverage_report:
        coverage_format: jacoco
        path: target/site/jacoco/jacoco.xml
# Docker packaging
docker:
  stage: package
  image: docker:latest
  services:
    - docker:dind
  script:
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin $CI_REGISTRY
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  only:
    - develop
    - main
    - tags   # deploy-prod runs on tags and needs this image
# Security scan of the image that was just pushed
security:
  stage: security
  image:
    name: aquasec/trivy:latest
    entrypoint: [""]   # let GitLab run the script instead of the image's own entrypoint
  script:
    - trivy image --exit-code 0 --severity HIGH,CRITICAL $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  allow_failure: true  # report only; does not block the pipeline
  only:
    - develop
    - main
    - tags
# Test environment deployment
deploy-test:
  stage: deploy
  image:
    name: bitnami/kubectl:latest
    entrypoint: [""]
  script:
    - kubectl config set-cluster k8s-test --server=$K8S_TEST_SERVER
    - kubectl config set-credentials gitlab --token=$K8S_TEST_TOKEN
    - kubectl config set-context gitlab-test --cluster=k8s-test --user=gitlab
    - kubectl config use-context gitlab-test
    - kubectl set image deployment/order-service order-service=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA -n test
    - kubectl rollout status deployment/order-service -n test
  environment:
    name: test
    url: https://test.example.com
  only:
    - develop
# Production deployment
deploy-prod:
  stage: deploy
  image:
    name: bitnami/kubectl:latest
    entrypoint: [""]
  script:
    - kubectl config set-cluster k8s-prod --server=$K8S_PROD_SERVER
    - kubectl config set-credentials gitlab --token=$K8S_PROD_TOKEN
    - kubectl config set-context gitlab-prod --cluster=k8s-prod --user=gitlab
    - kubectl config use-context gitlab-prod
    - kubectl set image deployment/order-service order-service=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA -n production
    - kubectl rollout status deployment/order-service -n production
  environment:
    name: production
    url: https://prod.example.com
  when: manual  # triggered by hand
  only:
    - tags
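4. Docker Image Optimization
4.1 Multi-Stage Builds
dockerfile
# Dockerfile (multi-stage build)
# Stage 1: build
FROM maven:3.8-openjdk-11 AS builder
WORKDIR /build
# Copy the POM first so the dependency layer is cached between builds
COPY pom.xml .
RUN mvn dependency:go-offline -B
# Copy the sources and build
COPY src ./src
RUN mvn clean package -DskipTests
# Stage 2: runtime
FROM openjdk:11-jre-slim
WORKDIR /app
# The slim image does not ship wget, which the health check below needs
RUN apt-get update \
    && apt-get install -y --no-install-recommends wget \
    && rm -rf /var/lib/apt/lists/*
# Copy the JAR from the build stage
COPY --from=builder /build/target/*.jar app.jar
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s \
    CMD wget --quiet --tries=1 --spider http://localhost:8080/actuator/health || exit 1
# Create a non-root user (the slim image is Debian-based, so the Alpine-style -S flags do not apply)
RUN groupadd --system appgroup && useradd --system --gid appgroup appuser
USER appuser
ENTRYPOINT ["java", "-Xms256m", "-Xmx512m", "-jar", "app.jar"]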
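4.2 Image Size Comparison
bash
# Plain images vs. optimized images (approximate sizes; they vary by tag and architecture)
# openjdk:11                 → ~800MB
# openjdk:11-jre             → ~400MB
# openjdk:11-jre-slim        → ~200MB
# amazoncorretto:11-alpine   → ~180MB
# Optimization strategies:
# 1. Use a slim base image
# 2. Use multi-stage builds
# 3. Reduce the number of layers
# 4. Exclude unneeded files with .dockerignore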
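To put the comparison above into numbers, here is a small sketch that computes the percentage saved between two image sizes. It only does integer arithmetic on the approximate figures listed above; `size_reduction_pct` is an illustrative name.

```shell
# Percentage saved when moving from one image size to another (sizes in whole MB).
size_reduction_pct() {
  local before=$1 after=$2
  echo $(( (before - after) * 100 / before ))
}

size_reduction_pct 800 200   # prints: 75
size_reduction_pct 800 180   # prints: 77 (integer division drops the .5)
```

Going from the full JDK image to the slim JRE image removes about three quarters of the size before any other optimization is applied.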
bash
# .dockerignore
.git
.gitignore
*.md
target/*.jar.original
*.log
.java-version
.idea
.vscode
node_modules
5. Environment Management and Configuration
5.1 Multi-Environment Configuration
yaml
# config.yaml
environments:
  dev:
    api_url: http://dev-api.example.com
    db_host: dev-mysql.example.com
    redis_host: dev-redis.example.com
    log_level: DEBUG
    replicas: 1
  test:
    api_url: http://test-api.example.com
    db_host: test-mysql.example.com
    redis_host: test-redis.example.com
    log_level: INFO
    replicas: 2
  staging:
    api_url: http://staging-api.example.com
    db_host: staging-mysql.example.com
    redis_host: staging-redis.example.com
    log_level: INFO
    replicas: 3
  production:
    api_url: http://api.example.com
    db_host: prod-mysql.example.com
    redis_host: prod-redis.example.com
    log_level: WARN
    replicas: 10
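A deployment script usually needs to read these per-environment values at run time. The sketch below hard-codes two of the keys from `config.yaml` in a lookup function to show the idea; `env_setting` is a hypothetical helper, and a real script would more likely parse the YAML file than duplicate its values.

```shell
# Look up a per-environment setting. Values are copied from config.yaml above.
env_setting() {
  local env=$1 key=$2
  case "$env:$key" in
    dev:replicas)                      echo 1 ;;
    test:replicas)                     echo 2 ;;
    staging:replicas)                  echo 3 ;;
    production:replicas)               echo 10 ;;
    dev:log_level)                     echo DEBUG ;;
    test:log_level|staging:log_level)  echo INFO ;;
    production:log_level)              echo WARN ;;
    *) echo "unknown setting: $env/$key" >&2; return 1 ;;
  esac
}

echo "production runs $(env_setting production replicas) replicas at log level $(env_setting production log_level)"
```

Failing loudly on an unknown environment or key is deliberate: a typo should stop the deployment rather than fall back to a default silently.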
5.2 Environment Isolation in Kubernetes
yaml
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    env: production
    team: backend
---
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  labels:
    env: staging
    team: backend
---
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  replicas: 10   # matches the production entry in config.yaml
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
        version: v1
    spec:
      containers:
        - name: order-service
          # Placeholder tag: the pipeline replaces it with a versioned image via `kubectl set image`
          image: registry.example.com/order-service:latest
          ports:
            - containerPort: 8080
          env:
            - name: SPRING_PROFILES_ACTIVE
              value: "production"
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              # Headroom above the 512 MB heap (-Xmx512m) set in the Dockerfile;
              # the JVM also needs memory outside the heap
              memory: "768Mi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 15
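6. Blue-Green Deployment and Canary Releases
6.1 Blue-Green Deployment
bash
# Blue-green deployment walkthrough
# Now: the Blue version handles all traffic
# Release: once the Green version is deployed, traffic is switched over in one step
# 1. Current state: Blue handles 100% of the traffic
kubectl get svc order-service -o wide
# NAME            TYPE        CLUSTER-IP   PORT(S)   SELECTOR
# order-service   ClusterIP   10.0.1.100   8080      app=order-service,version=blue
# 2. Deploy the Green version
kubectl apply -f order-service-green.yaml
# Green is now running, but traffic still goes to Blue
# 3. Switch traffic (the blue-green cutover)
kubectl patch service order-service \
    -p '{"spec":{"selector":{"version":"green"}}}'
# 4. Verify the Green version
# If something is wrong, roll back quickly
kubectl patch service order-service \
    -p '{"spec":{"selector":{"version":"blue"}}}'
# 5. Once everything checks out, delete the Blue version
kubectl delete deployment order-service-blue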
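The cutover and the rollback above are the same operation with the colors swapped, which makes them easy to script. The sketch below derives the target color and the patch payload, and prints the `kubectl` command instead of running it. The helper names `next_color` and `selector_patch` are assumptions made for illustration.

```shell
# Return the color that is NOT currently live.
next_color() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *) echo "unknown color: $1" >&2; return 1 ;;
  esac
}

# Build the JSON patch that points the Service selector at the given color.
selector_patch() {
  printf '{"spec":{"selector":{"version":"%s"}}}' "$1"
}

live=blue
target=$(next_color "$live")
# Dry run: print the command rather than executing it
echo "kubectl patch service order-service -p '$(selector_patch "$target")'"
```

Because switching and rolling back share one code path, the rollback is exercised every time a release goes out.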
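6.2 Canary Releases
yaml
# Canary release: only a small share of users get the new version
# The idea: send a "canary" ahead to find out whether the new version has problems
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: production
data:
  default.conf: |
    upstream order_backend {
        server order-service-blue:8080;
    }
    # Canary: the Green version
    upstream order_backend_canary {
        server order-service-green:8080;
    }
    # Send about 10% of requests to the canary and the rest to the stable version.
    # (A URL match such as "^/test.*" inside /api/v1/ would never fire, so the split is done by percentage.)
    split_clients "${request_id}" $target_backend {
        10%  order_backend_canary;
        *    order_backend;
    }
    server {
        listen 80;
        location /api/v1/ {
            proxy_pass http://$target_backend;
        }
    }
---
# Service that selects only the Green (canary) Pods.
# A Service cannot weight traffic by itself; the weighting happens in nginx above.
apiVersion: v1
kind: Service
metadata:
  name: order-service-canary
spec:
  selector:
    app: order-service
    version: green
  ports:
    - port: 80
      targetPort: 8080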
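The comments above call for roughly 10% of traffic to reach the canary. A common way to get a stable percentage split is to hash a request key into 100 buckets and send the lowest buckets to the canary. The sketch below shows that idea with the CRC from `cksum` as a stand-in hash; the function name `route_for` and the `user-N` keys are hypothetical, and a real proxy uses its own hash function.

```shell
# Deterministically map a request key to "canary" or "stable".
# bucket = CRC(key) mod 100; keys whose bucket is below the percentage go to the canary.
route_for() {
  local key=$1 percent=$2
  local crc
  crc=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
  if [ $(( crc % 100 )) -lt "$percent" ]; then
    echo canary
  else
    echo stable
  fi
}

# Count how many of 200 sample keys land on the canary at a 10% weight
canary=0
for i in $(seq 1 200); do
  if [ "$(route_for "user-$i" 10)" = canary ]; then
    canary=$((canary + 1))
  fi
done
echo "canary share: $canary / 200"
```

Hashing a stable key, such as a user ID, keeps each user on the same version for the whole rollout, which avoids users flipping between old and new behavior.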
6.3 Argo Rollouts (a More Robust Approach to Canary Releases)
yaml
# Rollout configuration (selector and Pod template omitted for brevity)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-service
spec:
  replicas: 10
  strategy:
    canary:
      # Traffic routing needs a stable and a canary Service; these two names are assumed
      stableService: order-service
      canaryService: order-service-canary
      steps:
        - setWeight: 5             # start with 5% of traffic
        - pause: {duration: 10m}   # pause 10 minutes to observe
        - setWeight: 20            # 20%
        - pause: {}                # wait for manual confirmation
        - setWeight: 50            # 50%
        - pause: {duration: 5m}
        - setWeight: 100           # 100%
      canaryMetadata:
        labels:
          role: canary
      stableMetadata:
        labels:
          role: stable
      trafficRouting:
        nginx:
          stableIngress: order-stable
          additionalIngressAnnotations:
            canary-by-header: X-Canary
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: order-service-canary
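To get a feel for what each weight in the steps above means for a 10-replica service, the sketch below converts a traffic weight into a Pod count. It assumes the canary is scaled in proportion to its weight and rounded up; how the controller actually scales depends on its configuration, so treat the numbers as an illustration only.

```shell
# Pods needed to serve a given traffic weight, rounded up (integer ceiling).
canary_pods() {
  local replicas=$1 weight=$2
  echo $(( (replicas * weight + 99) / 100 ))
}

for w in 5 20 50 100; do
  echo "weight ${w}% -> $(canary_pods 10 "$w") canary pod(s)"
done
```

Rounding up matters at the first step: 5% of 10 replicas is half a Pod, and the canary still needs one running Pod to receive any traffic.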
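7. Test Automation
7.1 The Test Pyramid
┌──────────────────────────────────────────────────┐
│ E2E Tests (end-to-end)                           │  Few, critical paths only
│ Simulate real user actions on the core flows     │
├──────────────────────────────────────────────────┤
│ Integration Tests                                │  Moderate number
│ Verify that multiple components work together    │
├──────────────────────────────────────────────────┤
│ Unit Tests                                       │  Many, fast
│ Verify a single class or method                  │
└──────────────────────────────────────────────────┘
Suggested ratio: 70% unit tests, 20% integration tests, 10% E2E tests.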
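As a quick worked example of the 70/20/10 ratio, the sketch below splits a total test count into the three layers. The total of 200 is a made-up figure, and `pyramid_split` is an illustrative name.

```shell
# Split a total test count using the 70/20/10 ratio.
pyramid_split() {
  local total=$1
  local unit=$(( total * 70 / 100 ))
  local integration=$(( total * 20 / 100 ))
  local e2e=$(( total - unit - integration ))   # the remainder goes to E2E
  echo "unit=$unit integration=$integration e2e=$e2e"
}

pyramid_split 200   # prints: unit=140 integration=40 e2e=20
```

The ratio is a guideline rather than a target to hit exactly; the point is that the fast, cheap tests should far outnumber the slow ones.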
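7.2 Automated Test Setup
java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotNull;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.ArgumentMatchers.anyInt;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.post;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.jsonPath;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.InjectMocks;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.AutoConfigureMockMvc;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.http.MediaType;
import org.springframework.test.web.servlet.MockMvc;

// Unit test example
@ExtendWith(MockitoExtension.class)  // initializes the @Mock and @InjectMocks fields
class OrderServiceTest {
    @Mock
    private OrderRepository orderRepository;
    @Mock
    private InventoryClient inventoryClient;
    @InjectMocks
    private OrderService orderService;

    @Test
    void testCreateOrder_success() {
        // given
        Order order = new Order();
        order.setId("123");
        when(orderRepository.save(any())).thenReturn(order);
        when(inventoryClient.check(any(), anyInt())).thenReturn(
            Inventory.builder().available(true).build()
        );
        // when
        Order result = orderService.createOrder(
            CreateOrderRequest.builder()
                .userId("user1")
                .skuId("sku1")
                .quantity(1)
                .build()
        );
        // then
        assertNotNull(result);
        assertEquals("123", result.getId());
        verify(orderRepository, times(1)).save(any());
    }
}

// Integration test example
@SpringBootTest
@AutoConfigureMockMvc
class OrderControllerIntegrationTest {
    @Autowired
    private MockMvc mockMvc;
    @Autowired
    private ObjectMapper objectMapper;

    @Test
    void testCreateOrder_endpoint() throws Exception {
        CreateOrderRequest request = new CreateOrderRequest();
        request.setUserId("user1");
        request.setSkuId("sku1");
        request.setQuantity(1);
        mockMvc.perform(post("/api/orders")
                .contentType(MediaType.APPLICATION_JSON)
                .content(objectMapper.writeValueAsString(request)))
            .andExpect(status().isOk())
            .andExpect(jsonPath("$.code").value(0))
            .andExpect(jsonPath("$.data.orderId").exists());
    }
}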
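8. Pitfalls from the Field
Pitfall 1: The pipeline is too slow
Each build took 30 minutes, and developers could not afford to wait that long.
Fix: optimize build caching and split the work into parallel jobs.
yaml
# Before: 30 minutes
# After: 8 minutes
# Optimization strategies:
# 1. Cache Maven dependencies
cache:
  paths:
    - .m2/repository
# 2. Reuse the Docker layer cache
docker build:
  script:
    - docker build --cache-from $PREV_IMAGE ...
# 3. Run independent jobs in parallel.
#    GitLab runs jobs that share a stage concurrently (scripts omitted here).
test-unit:
  stage: test
test-integration:
  stage: test
security-scan:
  stage: test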
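Why parallel jobs help can be shown with a little arithmetic: jobs run one after another cost the sum of their durations, while jobs run side by side cost only as much as the slowest one. The durations below (6, 10, and 8 minutes) are hypothetical numbers chosen for illustration, not measurements from the project.

```shell
# Total time when jobs run one after another.
sum_minutes() {
  local total=0 m
  for m in "$@"; do
    total=$((total + m))
  done
  echo "$total"
}

# Total time when jobs run in parallel: the slowest job decides.
max_minutes() {
  local max=0 m
  for m in "$@"; do
    if [ "$m" -gt "$max" ]; then
      max=$m
    fi
  done
  echo "$max"
}

# Hypothetical durations: unit tests 6, integration tests 10, security scan 8
echo "sequential: $(sum_minutes 6 10 8) min"   # prints: sequential: 24 min
echo "parallel:   $(max_minutes 6 10 8) min"   # prints: parallel:   10 min
```

Once the jobs run in parallel, the slowest job becomes the only one worth optimizing further.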
Pitfall 2: Inconsistent environments
Everything worked in the development environment, then broke in the test environment.
Fix: containerize the environment and use Docker Compose to start a complete test environment.
yaml
# docker-compose.test.yml
version: '3.8'
services:
  app:
    build: .
    depends_on:
      mysql:
        condition: service_healthy
      redis:
        condition: service_started
    environment:
      SPRING_PROFILES_ACTIVE: test
      SPRING_DATASOURCE_URL: jdbc:mysql://mysql:3306/test
      SPRING_REDIS_HOST: redis
  mysql:
    image: mysql:8.0
    environment:
      MYSQL_DATABASE: test
      MYSQL_ROOT_PASSWORD: test
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 5s
      timeout: 3s
      retries: 10
  redis:
    image: redis:7-alpine
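The `healthcheck` above retries a probe at a fixed interval until it passes or the retries run out. The same pattern is useful in test scripts that must wait for a dependency before starting. Here is a minimal sketch of it; `wait_until` is a hypothetical helper, not a Docker or Compose command.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until <retries> <delay_seconds> <command...>
wait_until() {
  local retries=$1 delay=$2
  shift 2
  local attempt
  for (( attempt = 1; attempt <= retries; attempt++ )); do
    if "$@"; then
      echo "ready after $attempt attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "not ready after $retries attempts" >&2
  return 1
}

# Trivial demonstration with a command that always succeeds
wait_until 3 0 true   # prints: ready after 1 attempt(s)
```

In a test script this could be called as `wait_until 10 5 mysqladmin ping -h mysql`, which mirrors the 5-second interval and 10 retries from the Compose file.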
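Pitfall 3: Slow rollbacks
We found a problem after a release, and rolling back by hand took an hour.
Fix: prepare the rollback script in advance and automate the rollback.
bash
#!/bin/bash
# Rollback script
# Usage: <script> <deployment> [namespace]
set -euo pipefail
DEPLOYMENT=$1
NAMESPACE=${2:-production}
# Print the image the deployment currently runs
current_image() {
    kubectl get deployment "$DEPLOYMENT" -n "$NAMESPACE" \
        -o jsonpath='{.spec.template.spec.containers[0].image}'
}
# Record the image that is about to be rolled back
FAILED_IMAGE=$(current_image)
# Roll back to the previous revision
kubectl rollout undo "deployment/$DEPLOYMENT" -n "$NAMESPACE"
# Wait for the rollback to finish
kubectl rollout status "deployment/$DEPLOYMENT" -n "$NAMESPACE"
# `kubectl rollout history` does not list images, so read the restored image from the deployment
PREVIOUS_IMAGE=$(current_image)
echo "Rolled back from $FAILED_IMAGE to $PREVIOUS_IMAGE"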
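When a rollback has to name its target revision explicitly, the revision number can be read from the history listing. The sketch below picks the revision just before the newest one from text shaped like `kubectl rollout history` output. The sample listing is made up for illustration, and `previous_revision` is a hypothetical helper.

```shell
# Read a rollout history listing on stdin and print the revision
# just before the newest one. Fails if fewer than two revisions exist.
previous_revision() {
  local revs
  revs=$(awk '$1 ~ /^[0-9]+$/ { print $1 }' | sort -n)
  [ "$(printf '%s\n' "$revs" | grep -c .)" -ge 2 ] || return 1
  printf '%s\n' "$revs" | tail -n 2 | head -n 1
}

# Illustrative sample of a history listing
sample='deployment.apps/order-service
REVISION  CHANGE-CAUSE
3         <none>
4         <none>
5         <none>'

printf '%s\n' "$sample" | previous_revision   # prints: 4
```

The result could then be passed to `kubectl rollout undo deployment/order-service --to-revision=4 -n production`, which pins the rollback to a known revision instead of relying on the default.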
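9. Summary
DevOps makes delivery more efficient:
- Automated pipelines: commit → automatic build → automatic tests → automatic deployment
- Standardized environments: development, test, and production stay consistent
- Fast feedback: problems are found early and fixed early
- Built-in security: security scanning is part of the pipeline
- Traceability: every deployment is recorded and can be rolled back quickly
Best practices:
- Keep the pipeline fast: use caches and run jobs in parallel
- Automate everything: reduce manual steps
- Make rollbacks fast: prepare the rollback plan in advance
- Monitor thoroughly: make the effect of each deployment visible, before and after
- Change the culture: developers need to understand operations too
A hard-won lesson:
DevOps is not just a set of tools; it is a culture. If the team is unwilling to change its habits, even the best tools are useless. Rolling out DevOps should start with training, so that everyone understands its value.
Question to think about: how long does one deployment take your team today, and which steps could be automated?
This is my personal opinion, for reference only.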