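CI/CD failures usually take the form "it worked last time, but not this time": most are caused by a change in environment, dependencies, or credentials rather than by the code itself. The diagnostic approach is to find what differs from the last successful build. The longer the pipeline, the harder it is to locate the break point, so the most effective method is to verify stage by stage and log enough in every stage.
Any stage can fail, so first pin down which stage failed and only then dig deeper. A useful logging habit: print the key variables at the start of every stage.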
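
Printing variables pays off most when two runs can be compared mechanically. The following is a minimal sketch, assuming each build saves a sorted environment dump to a file; the helper names `snapshot_env` and `changed_vars` are hypothetical and not part of Jenkins or GitLab.

```shell
# snapshot_env FILE: write a sorted, diff-friendly dump of the environment.
# (hypothetical helper name, chosen for illustration)
snapshot_env() {
  env | sort > "$1"
}

# changed_vars OLD NEW: print the names of variables that were added,
# removed, or changed between two snapshots.
changed_vars() {
  # diff exits with 1 when the files differ, so its status is ignored here.
  { diff "$1" "$2" || true; } \
    | sed -n 's/^[<>] \([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' \
    | sort -u
}
```

A stage could call `snapshot_env env-$BUILD_NUMBER.txt` and archive the file, so that a failing build can be compared with the last green one using `changed_vars`. Dumps may contain secrets, so they should be stored with the same care as the credentials themselves.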
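
    # 1. Read the full Console Output
    #    Locate the first ERROR / FAILED / Exception keyword

    # 2. Reproduce: run the same commands by hand on the same agent
    ssh jenkins-agent
    su - jenkins
    cd /var/lib/jenkins/workspace/<job>   # adjust if the agent uses a different root directory
    # Run the commands from the Jenkinsfile

    # 3. Compare the environment with the last successful build
    #    - Code version (git log / git diff)
    #    - Dependency versions (npm / Maven)
    #    - Environment variables
    #    - Agent node
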
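
Step 1 can be done without scrolling through the log by hand. This is a small sketch, assuming the console output has already been saved to a local file; `first_error` is a hypothetical helper name.

```shell
# first_error LOGFILE: print "LINE_NUMBER:TEXT" for the first line that
# contains one of the keywords ERROR, FAILED, or Exception.
first_error() {
  # -n prints the line number, -m 1 stops after the first match.
  grep -n -m 1 -E 'ERROR|FAILED|Exception' "$1"
}
```

The first match matters more than the last one, because later errors are often only consequences of the first failure.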
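
| Error keyword | Cause | Fix |
|---|---|---|
| No space left on device | Workspace disk is full | Clean up old builds |
| Permission denied | File permissions / docker.sock access | chmod, or add the user to the docker group |
| command not found | Wrong PATH | Use an absolute path or add a tool configuration |
| npm/Maven package download fails | Repository unreachable / private registry authentication | Configure .npmrc / settings.xml |
| OutOfMemoryError | JVM heap too small | Raise the heap, e.g. JAVA_OPTS=-Xmx2g (MAVEN_OPTS for Maven builds) |
| Timeout | Build timed out | Raise the timeout or split the job |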
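
The keyword table can be turned into a first-pass triage script. The sketch below covers only a few of the rows, and the hint texts are paraphrases; `triage` is a hypothetical helper, and it assumes the build log is available as a local file.

```shell
# triage LOGFILE: print one hint for each known keyword found in the log.
triage() {
  local log=$1
  grep -qi 'No space left on device' "$log" && echo "disk full: clean up old builds"
  grep -q  'Permission denied'       "$log" && echo "permissions: check file modes and docker group"
  grep -q  'command not found'       "$log" && echo "PATH: use an absolute path or a tool configuration"
  grep -q  'OutOfMemoryError'        "$log" && echo "JVM heap: raise -Xmx"
  return 0   # finding nothing is not an error
}
```

Such a script only narrows the search; the surrounding log lines still have to be read to confirm the cause.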

    pipeline {
        agent any
        stages {
            stage('Diagnose') {
                steps {
                    // Print the environment at the start of each stage
                    sh '''
                        echo "=== Environment ==="
                        pwd
                        whoami
                        env | sort
                        df -h .
                        docker version 2>&1 || true
                    '''
                }
            }
            stage('Build') {
                steps {
                    // set -e: stop at the first failing command
                    // set -x: echo every command, which helps debugging
                    sh 'set -ex; mvn clean package -B'
                }
            }
        }
        post {
            failure {
                archiveArtifacts artifacts: '**/target/*.log', allowEmptyArchive: true
            }
        }
    }


    # 1. Jenkins UI → Manage Jenkins → Nodes: check the node status

    # 2. Check the process on the agent machine
    ps -ef | grep jenkins
    systemctl status jenkins-agent

    # 3. Agent log
    tail -f /var/log/jenkins/agent.log

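
| Type | How it works | Common problems |
|---|---|---|
| SSH | Controller (master) opens an SSH connection to the agent | Invalid SSH key / firewall |
| JNLP / Inbound | Agent connects to the controller | Port 50000 unreachable |
| K8s Pod | Dynamic Pod, started over JNLP | Image problems / insufficient resources |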

    # On the controller: test whether it can ssh to the agent
    su jenkins
    ssh -i /var/lib/jenkins/.ssh/id_rsa jenkins@agent-host

    # On the agent: check the jenkins user's groups and authorized keys
    groups jenkins
    cat ~jenkins/.ssh/authorized_keys


    # Default port is 50000 (configurable under Configure Global Security)
    telnet jenkins-master 50000

    # Standard command to start an inbound agent
    java -jar agent.jar \
      -jnlpUrl http://jenkins:8080/computer/agent1/jenkins-agent.jnlp \
      -secret <secret> \
      -workDir /home/jenkins/agent

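
Many agent images do not ship `telnet`. A port check can also be sketched with bash's built-in `/dev/tcp` pseudo-device and the coreutils `timeout` command; `port_open` is a hypothetical helper name, and the approach assumes bash is installed on the agent.

```shell
# port_open HOST PORT [SECONDS]: succeed if a TCP connection can be opened
# within the time limit (default 3 seconds).
port_open() {
  local host=$1 port=$2 secs=${3:-3}
  # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT.
  timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}
```

Usage, with the host name taken from the example above: `port_open jenkins-master 50000 && echo reachable || echo blocked`. A failure only says the port cannot be reached from this machine; whether a firewall, DNS, or the controller itself is responsible still has to be checked separately.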

    pipeline {
        agent {
            kubernetes {
                yaml '''
                    apiVersion: v1
                    kind: Pod
                    spec:
                      containers:
                      - name: maven
                        image: maven:3.8-jdk-11
                        command: ["cat"]
                        tty: true
                        resources:
                          requests:
                            memory: "2Gi"
                            cpu: "1"
                '''
            }
        }
        stages {
            stage('Build') {
                steps {
                    container('maven') {
                        sh 'mvn clean package'
                    }
                }
            }
        }
    }

When the agent Pod does not come up, run `kubectl describe pod <agent-pod>` and read the events.
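Typical causes of a hung build include an `input` step that nobody has confirmed and a `lock` resource that was never released.

    // 1. Find the step that is currently executing
    //    Pipeline UI → Steps: see which step is still running

    // 2. Force termination
    //    Click Stop first (interrupts the running step)
    //    If the build is still running after about 30 seconds, click Stop again
    //    or use the forcible terminate / kill links shown in the console log
    //    If that still fails, run this in the Script Console:
    Jenkins.instance.getItemByFullName("my-job")
        .getBuildByNumber(123).doKill()
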

    pipeline {
        agent any
        options {
            // Time out the whole pipeline after 30 minutes
            timeout(time: 30, unit: 'MINUTES')
            // Retry on failure
            retry(2)
        }
        stages {
            stage('Build') {
                options {
                    // Individual stages get a timeout too
                    timeout(time: 10, unit: 'MINUTES')
                }
                steps {
                    sh 'mvn clean package'
                }
            }
            stage('Manual Approve') {
                steps {
                    // input must have a timeout, otherwise it waits forever
                    timeout(time: 5, unit: 'MINUTES') {
                        input message: 'Deploy to prod?'
                    }
                }
            }
        }
    }

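Add `--max-time 30` to curl and `--timeout=30` to wget, and wrap other sh commands in the `timeout` command. Any operation that waits without a timeout will eventually hang the pipeline.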
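
Wrapping a command in `timeout` can be sketched as a small helper that also reports why the step stopped. `run_step` is a hypothetical name; the exit code 124 is what GNU coreutils `timeout` returns when it had to stop the command.

```shell
# run_step SECONDS CMD...: run CMD with a time limit and say so when it is hit.
run_step() {
  local limit=$1 rc=0
  shift
  timeout "$limit" "$@" || rc=$?
  # GNU timeout exits with 124 when the limit was reached.
  if [ "$rc" -eq 124 ]; then
    echo "step timed out after ${limit}s: $*" >&2
  fi
  return "$rc"
}
```

For example, `run_step 600 mvn clean package` fails the build after ten minutes instead of occupying the executor indefinitely.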
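
| Error | Cause | Fix |
|---|---|---|
| Cannot connect to docker daemon | docker.sock permissions / Docker service down | Start Docker / add the user to the docker group |
| pull access denied | Base image is private or does not exist | docker login, or use a different image |
| COPY failed: no such file | Wrong build context | Check .dockerignore and paths |
| returned a non-zero code | A RUN command failed | Read the error output just above it |
| no space left on device | Build host disk is full | docker system prune |
| network timeout | Slow package downloads in RUN steps | Switch to a mirror / add a proxy |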

    # Multi-stage build keeps the final image small
    FROM maven:3.8-jdk-11 AS builder
    WORKDIR /build
    COPY pom.xml .
    # Dependencies get their own layer so the cache is reused
    RUN mvn dependency:go-offline
    COPY src ./src
    RUN mvn package -DskipTests

    FROM openjdk:11-jre-slim
    WORKDIR /app
    COPY --from=builder /build/target/*.jar app.jar
    EXPOSE 8080
    ENTRYPOINT ["java", "-jar", "/app/app.jar"]

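
    # Wrong: every code change reinstalls all dependencies
    WORKDIR /app
    COPY . /app
    RUN npm install

    # Right: dependency manifests get their own layer
    WORKDIR /app
    COPY package*.json /app/
    RUN npm install
    # Code changes no longer trigger a reinstall
    COPY . /app
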

    # APT mirror (Debian base images; Ubuntu images use archive.ubuntu.com instead)
    # Newer Debian images keep their sources in /etc/apt/sources.list.d/debian.sources
    RUN sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list

    # pip mirror
    RUN pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

    # npm mirror
    RUN npm config set registry https://registry.npmmirror.com

    # Maven mirror (ship a settings.xml)
    COPY settings.xml /root/.m2/settings.xml

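Enable BuildKit (`DOCKER_BUILDKIT=1`) to get parallel multi-stage builds, better caching, and secret mounts. It is the default builder since Docker Engine 23.0; older versions need the variable set explicitly.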
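
| Error | Cause |
|---|---|
| unauthorized: authentication required | Not logged in, or the credential has expired |
| denied: requested access to the resource is denied | The account has no push permission on the project |
| name unknown | The project does not exist |
| tag does not exist | The tag does not exist locally |
| EOF / connection reset | Image too large / network interrupted |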

    # 1. Log in (prompts for the password; credentials are stored in ~/.docker/config.json)
    docker login harbor.example.com -u <user>

    # 2. Tag the image (the tag must include the full registry address)
    docker tag myapp:v1 harbor.example.com/myproject/myapp:v1

    # 3. Push
    docker push harbor.example.com/myproject/myapp:v1

    # 4. Verify
    docker manifest inspect harbor.example.com/myproject/myapp:v1

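
A push that lands on Docker Hub instead of the private registry usually means the tag was missing the registry host. The sketch below builds the full reference and checks for a registry prefix; both helper names are hypothetical. The check follows Docker's convention that the first path component counts as a registry host when it contains a dot or a colon, or is `localhost`.

```shell
# image_ref REGISTRY PROJECT NAME TAG: build a fully qualified image reference.
image_ref() {
  printf '%s/%s/%s:%s\n' "$1" "$2" "$3" "$4"
}

# has_registry REF: succeed if REF starts with a registry host.
has_registry() {
  case "${1%%/*}" in
    "$1")              return 1 ;;   # no slash at all, e.g. "myapp:v1"
    *.*|*:*|localhost) return 0 ;;   # looks like a host name or host:port
    *)                 return 1 ;;   # plain namespace, e.g. "myproject/myapp"
  esac
}
```

With the values from the example above, `image_ref harbor.example.com myproject myapp v1` prints `harbor.example.com/myproject/myapp:v1`, and `has_registry myapp:v1` fails, which is a useful guard before `docker push`.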

    stage('Push') {
        steps {
            withCredentials([usernamePassword(
                credentialsId: 'harbor-cred',
                usernameVariable: 'USER',
                passwordVariable: 'PASS'
            )]) {
                sh '''
                    echo "$PASS" | docker login harbor.example.com -u "$USER" --password-stdin
                    docker tag myapp:${BUILD_NUMBER} harbor.example.com/proj/myapp:${BUILD_NUMBER}
                    docker push harbor.example.com/proj/myapp:${BUILD_NUMBER}
                    docker logout harbor.example.com
                '''
            }
        }
    }

`docker login -p $PASS` exposes the password in the process list and in logs. Passing it on stdin is safer.

    # Error: x509: certificate signed by unknown authority

    # Option A: add the Harbor CA certificate to Docker's trust store
    mkdir -p /etc/docker/certs.d/harbor.example.com
    cp harbor-ca.crt /etc/docker/certs.d/harbor.example.com/ca.crt
    systemctl restart docker

    # Option B: add Harbor to insecure-registries (development only)
    # Note: this overwrites an existing daemon.json; merge by hand if one exists
    cat > /etc/docker/daemon.json <<EOF
    {
      "insecure-registries": ["harbor.example.com"]
    }
    EOF
    systemctl restart docker

    # containerd configuration (K8s nodes)
    vim /etc/containerd/config.toml
    # [plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.example.com".tls]
    #   ca_file = "/etc/containerd/certs.d/harbor-ca.crt"
    systemctl restart containerd


    # Check project quotas through the Harbor API
    curl -u admin:Harbor12345 \
      https://harbor.example.com/api/v2.0/quotas

    # Clean up old images (Harbor UI → Project → Repositories)
    # Configure a retention policy to clean up automatically

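
    # 1. Check Runner registration
    gitlab-runner list
    gitlab-runner status

    # 2. Check the Runner service
    systemctl status gitlab-runner
    journalctl -u gitlab-runner -f

    # 3. Verify the Runner can reach GitLab
    gitlab-runner verify

    # 4. Run a single job locally, outside any pipeline
    #    (`exec` was removed in GitLab Runner 16.0, so this works on older versions only)
    gitlab-runner exec docker test_job
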
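
| Problem | Symptom | Fix |
|---|---|---|
| Runner does not pick up jobs | UI shows the Runner online, but the pipeline stays pending | Check that tags match |
| Jobs start slowly | Long wait before the job starts | Raise `concurrent` / add Runners |
| Shell executor permissions | Jobs run as the gitlab-runner user, which lacks permissions | Grant the gitlab-runner user the permissions it needs |
| Docker executor cannot pull the image | Job reports an image pull failure | Configure the Docker pull policy |
| Cache misses | Dependencies are reinstalled on every run | Check the cache:key configuration |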

    stages:
      - build
      - test
      - deploy

    # Global cache
    cache:
      key: "$CI_COMMIT_REF_SLUG"   # separate cache per branch
      paths:
        - node_modules/
        - .m2/repository/

    variables:
      DOCKER_DRIVER: overlay2
      DOCKER_TLS_CERTDIR: ""
      # Keep the Maven repository inside the project so the cache path above applies
      MAVEN_OPTS: "-Dmaven.repo.local=$CI_PROJECT_DIR/.m2/repository"

    build_job:
      stage: build
      image: maven:3.8-jdk-11
      tags:
        - linux   # must match a Runner tag
      script:
        - mvn package -DskipTests
      artifacts:
        paths:
          - target/*.jar
        expire_in: 1 week
      rules:
        - if: '$CI_COMMIT_BRANCH == "main"'
        - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'

If a job specifies `tags: [docker]` but no Runner carries the `docker` tag, the job stays pending forever. Assign tags when registering the Runner, or edit them in the UI.
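
The matching rule is that a Runner picks up a job only when it carries every tag the job lists. A minimal sketch of that check, with tags passed as space-separated strings, is shown below; `runner_can_run` is a hypothetical helper, and the separate "run untagged jobs" Runner setting is not modeled.

```shell
# runner_can_run "RUNNER TAGS" "JOB TAGS": succeed only if every job tag
# appears in the runner's tag list.
runner_can_run() {
  local runner_tags=" $1 " tag
  for tag in $2; do
    case "$runner_tags" in
      *" $tag "*) ;;            # tag present, keep checking
      *)          return 1 ;;   # one missing tag leaves the job pending
    esac
  done
  return 0
}
```

For instance, `runner_can_run "linux docker" "docker"` succeeds, while `runner_can_run "linux" "docker"` fails, which corresponds to the permanently pending job described above.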

    // 1. Wrap critical steps in retry
    stage('Pull Deps') {
        steps {
            retry(3) {
                sh 'mvn dependency:resolve'
            }
        }
    }

    // 2. Poll with waitUntil instead of a fixed sleep
    stage('Wait for Service') {
        steps {
            timeout(time: 60, unit: 'SECONDS') {
                waitUntil {
                    script {
                        def r = sh(
                            script: 'curl -fs http://service/health',
                            returnStatus: true
                        )
                        return r == 0
                    }
                }
            }
        }
    }

    // 3. Archive results even when tests fail, to make analysis easier
    post {
        always {
            junit '**/target/surefire-reports/*.xml'
            archiveArtifacts artifacts: '**/target/screenshots/*', allowEmptyArchive: true
        }
    }

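
    # 1. Add retries and timeouts to downloads
    curl --max-time 30 --retry 3 --retry-delay 5 <url>
    wget --timeout=30 --tries=3 <url>

    # 2. Retry docker pull on failure
    for i in 1 2 3; do
      docker pull image:tag && break
      [ "$i" -eq 3 ] && exit 1   # give up after the last attempt
      sleep 10
    done

    # 3. Configure mirrors for npm / Maven as a fallback
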
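
The retry loop can be factored into a reusable function so that every flaky network call gets the same treatment. This is a sketch under simple assumptions (fixed delay, no backoff); `retry` here is a hypothetical shell helper, unrelated to the Jenkins `retry` step.

```shell
# retry ATTEMPTS DELAY_SECONDS CMD...: rerun CMD until it succeeds
# or the number of attempts is used up.
retry() {
  local attempts=$1 delay=$2 n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

Usage: `retry 3 10 docker pull image:tag`. Unlike a bare loop, the function returns a non-zero status when every attempt fails, so the pipeline stops instead of continuing without the image.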

    # Isolate unstable tests
    test:stable:
      script:
        - mvn test -Dgroups="!flaky"   # exclude the flaky group

    test:flaky:
      script:
        - mvn test -Dgroups="flaky"
      retry: 2               # retry automatically
      allow_failure: true    # a failure does not block the pipeline


| Type | Use |
|---|---|
| Username with password | Logging in to databases / registries |
| SSH Username with private key | Remote login / Git |
| Secret text | API tokens / webhooks |
| Secret file | kubeconfig / certificates |
| Certificate | SSL/TLS certificates |

    // Username and password
    withCredentials([usernamePassword(
        credentialsId: 'harbor-cred',
        usernameVariable: 'USER',
        passwordVariable: 'PASS'
    )]) {
        sh 'echo "$PASS" | docker login -u "$USER" --password-stdin'
    }

    // SSH key
    withCredentials([sshUserPrivateKey(
        credentialsId: 'git-key',
        keyFileVariable: 'SSH_KEY'
    )]) {
        sh 'GIT_SSH_COMMAND="ssh -i $SSH_KEY" git clone git@host:repo.git'
    }

    // kubeconfig file
    withCredentials([file(
        credentialsId: 'kubeconfig-prod',
        variable: 'KUBECONFIG'
    )]) {
        sh 'kubectl get pods'
    }

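
    # Configure under GitLab Project → Settings → CI/CD → Variables
    # Key options:
    #   - Protected: only available on protected branches
    #   - Masked: hidden in job logs
    #   - Expand variable reference: whether ${VAR} is expanded
    deploy:
      script:
        - echo "Deploying..."
        # Quoted as a whole, because an unquoted ": " would be parsed as a YAML mapping
        - 'curl -H "Authorization: Bearer $DEPLOY_TOKEN" ...'
      only:
        - main
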

    stage('Deploy') {
        steps {
            sh '''
                set -e
                # 1. Roll out the new version
                kubectl set image deployment/myapp \\
                  myapp=harbor.example.com/proj/myapp:${BUILD_NUMBER} \\
                  -n production

                # 2. Wait for the rollout to finish; fail on timeout
                kubectl rollout status deployment/myapp -n production --timeout=5m

                # 3. Health check
                curl -f http://service/health || exit 1
            '''
        }
        post {
            failure {
                sh '''
                    echo "Deploy failed, rolling back..."
                    kubectl rollout undo deployment/myapp -n production
                    kubectl rollout status deployment/myapp -n production
                '''
            }
        }
    }

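
The same deploy, verify, roll back pattern can be sketched outside Jenkins as a plain shell function, which makes it usable from any CI system. `deploy_or_rollback` is a hypothetical helper; its three arguments are command strings that stand in for the `kubectl set image`, `kubectl rollout status`, and `kubectl rollout undo` calls.

```shell
# deploy_or_rollback DEPLOY_CMD VERIFY_CMD ROLLBACK_CMD
# Runs the deploy and verify commands; if either fails, runs the rollback
# command and returns non-zero so the pipeline is still marked as failed.
deploy_or_rollback() {
  local deploy=$1 verify=$2 rollback=$3
  if sh -c "$deploy" && sh -c "$verify"; then
    echo "deploy ok"
    return 0
  fi
  echo "deploy failed, rolling back" >&2
  sh -c "$rollback"
  return 1
}
```

Returning a failure even after a successful rollback is deliberate: the environment is healthy again, but the release did not go out, and the pipeline status should say so.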

    # Show the rollout history
    kubectl rollout history deployment/myapp -n prod

    # Show the details of one revision
    kubectl rollout history deployment/myapp -n prod --revision=5

    # Roll back to the previous revision
    kubectl rollout undo deployment/myapp -n prod

    # Roll back to a specific revision
    kubectl rollout undo deployment/myapp -n prod --to-revision=3

    # Pause / resume a rollout
    kubectl rollout pause deployment/myapp -n prod
    kubectl rollout resume deployment/myapp -n prod

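
| Strategy | How it works | When to use |
|---|---|---|
| RollingUpdate | Replaces Pods gradually | Default; most workloads |
| Recreate | Deletes all old Pods before starting new ones | Stateful services, incompatible versions |
| Blue-Green | Switches traffic to the new version | When fast rollback is required |
| Canary | Validates with a small share of new-version Pods | Core services that need small-traffic validation |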
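
    # Decision tree

    # 1. Does rollout status report a timeout?
    kubectl get pod -l app=myapp -n prod
    kubectl describe pod <pod> -n prod | tail -20
    #    - ImagePullBackOff → image problem
    #    - CrashLoopBackOff → application fails to start
    #    - Pending          → insufficient resources

    # 2. Is the health check failing?
    kubectl logs <pod> -n prod --tail=100
    #    Read the application log to find the root cause

    # 3. Emergency rollback
    kubectl rollout undo deployment/myapp -n prod
    #    Typically returns to the previous stable version within about a minute
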
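
The first branch of the decision tree maps a Pod status to the next thing to check, which is easy to encode. The sketch below is illustrative only: `pod_hint` is a hypothetical helper, the hint texts are paraphrases, and `ErrImagePull` is added because it is the status a Pod shows before backing off to `ImagePullBackOff`.

```shell
# pod_hint STATUS: map a status from `kubectl get pod` to a next step.
pod_hint() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) echo "image problem: check image name, tag, and pull secret" ;;
    CrashLoopBackOff)              echo "app fails at startup: check kubectl logs" ;;
    Pending)                       echo "not scheduled: check resources and describe events" ;;
    *)                             echo "unknown status: run kubectl describe pod" ;;
  esac
}

# Example (assumes the default column order NAME READY STATUS RESTARTS AGE):
# kubectl get pod -l app=myapp -n prod --no-headers |
#   while read -r name ready status rest; do
#     echo "$name: $(pod_hint "$status")"
#   done
```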