After running the 蓝盾 (BK-CI) "Docker public build machines" for a while, we noticed that image builds occasionally timed out. The investigation traced the problem to full disks on the cluster's Node machines. This article describes how we clean up the build cache.
1. Background

We found that image builds occasionally timed out, and that the timeouts started after we adopted Docker-in-Docker for image builds; they were also becoming more frequent. Digging further, we found that each builder Pod mounts its workspace and log directories via hostPath, and with many concurrent build tasks these directories filled up the Node's disk.
2. Troubleshooting

2.1 Event analysis

The Pod events show that the Node's disk filled up, the Pod was evicted, and the build task failed:
```
Events:
  Type     Reason                 Age                 From     Message
  ----     ------                 ----                ----     -------
  Warning  Evicted                20m                 kubelet  The node was low on resource: ephemeral-storage. Container build1753761077695-ivcpmoxg was using 1580320Ki, which exceeds its request of 0.
  Normal   NodeHasNoDiskPressure  3m (x32 over 6d5h)  kubelet  Node 10.10.32.2 status is now: NodeHasNoDiskPressure
```
The Pod YAML shows that the workspace and log directories are mounted via hostPath. Mounting them from the host acts as a cache: when the same pipeline task runs again, it can reuse the previous workspace and start faster.
```yaml
volumes:
- hostPath:
    path: /data/landun/workspace/build1753761077695-ivcpmoxg
    type: ""
  name: data-volume
- hostPath:
    path: /data/landun/logs/build1753761077695-ivcpmoxg
    type: ""
  name: logs-volume
```
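To confirm which build directories are actually consuming the disk, it helps to rank them by size on an affected node. A minimal sketch in shell, assuming the hostPath root `/data/landun/workspace` from the Pod spec above; `top_build_dirs` is a hypothetical helper name, not part of BK-CI:

```shell
#!/usr/bin/env bash
# top_build_dirs: list build* directories under the given root,
# largest first (sizes in KiB), so the worst offenders surface on top.
top_build_dirs() {
  # du -sk: total size per build* directory; sort -rn: biggest first
  du -sk "$1"/build* 2>/dev/null | sort -rn | head -20
}

# Example against the assumed hostPath root on a BK-CI node:
top_build_dirs /data/landun/workspace
```

Running this on the evicted node quickly shows whether a few runaway builds or the sheer number of cached workspaces is filling the disk.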
The configuration file of the dispatch-k8s-manager module, dispatch-k8s-manager/resources/config.yaml:
```yaml
dispatch:
  volume:
    builderConfigMap:
      name: dispatch-kubernetes-builder
      items:
        - key: initsh.properties
          path: init.sh
        - key: sleepsh.properties
          path: sleep.sh
    hostPath:
      dataHostDir: /data/landun/workspace
      logsHostDir: /data/landun/logs
    cfs:
      path: /data/cfs
  volumeMount:
    dataPath: /data/landun/workspace
    logPath: /data/logs
    builderConfigMapPath: /data/landun/config
    cfs:
      path: /data/bkdevops/apps
      readOnly: true
```
2.2 Source code analysis

dispatch-k8s-manager/pkg/apiserver/service/builder_start.go:
```go
func getBuilderVolumeAndMount(
	workloadName string,
	nFSs []types.NFS,
) (volumes []corev1.Volume, volumeMounts []corev1.VolumeMount) {
	volumes = getBuilderPodVolume(workloadName)
	volumeMounts = getBuilderPodVolumeMount()
	...
	return volumes, volumeMounts
}

func getBuilderPodVolume(workloadName string) []corev1.Volume {
	dataHostPath := filepath.Join(config.Config.Dispatch.Volume.HostPath.DataHostDir, workloadName)
	logHostPath := filepath.Join(config.Config.Dispatch.Volume.HostPath.LogsHostDir, workloadName)

	var items []corev1.KeyToPath
	for _, v := range config.Config.Dispatch.Volume.BuilderConfigMap.Items {
		items = append(items, corev1.KeyToPath{
			Key:  v.Key,
			Path: v.Path,
		})
	}

	return ...
}
```
From the source we can see that the hostPath is built by joining the directory configured in dispatch-k8s-manager/resources/config.yaml with the workloadName, so there is no configuration option that lets us avoid hostPath altogether. We therefore clean up the cache with a scheduled job instead.
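The effective per-build paths can be sketched in shell (mirroring the `filepath.Join` calls in builder_start.go); the workload name is the example from the Pod events earlier, and the directories come from config.yaml:

```shell
#!/usr/bin/env bash
# Sketch: how dispatch-k8s-manager derives the per-build hostPath dirs.
data_host_dir="/data/landun/workspace"      # dataHostDir in config.yaml
logs_host_dir="/data/landun/logs"           # logsHostDir in config.yaml
workload_name="build1753761077695-ivcpmoxg" # example from the events above

# Equivalent of filepath.Join(dir, workloadName) in the Go source:
data_host_path="${data_host_dir}/${workload_name}"
log_host_path="${logs_host_dir}/${workload_name}"

echo "$data_host_path"
echo "$log_host_path"
```

Every build therefore leaves behind a uniquely named directory under each root, which is why the accumulated workspaces are what the cleanup job has to target.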
3. Solution

Following the log-cleanup approach of bk-applog-bkapp-filebeat, we run a DaemonSet that mounts the same host directories used by the builder Pods and cleans them up on a schedule.
```
NAME                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
bk-applog-bkapp-filebeat-ingress       18        18        18      18           18          <none>          424d
bk-applog-bkapp-filebeat-json          18        18        18      18           18          <none>          424d
bk-applog-bkapp-filebeat-log-cleaner   18        18        18      18           18          <none>          424d
bk-applog-bkapp-filebeat-stdout        18        18        18      18           18          <none>          424d
bk-ci-builder-cleaner                  18        18        18      18           18          <none>          13d
```
Write daemonSet.yaml:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: bk-ci-builder-cleaner
  namespace: blueking
  labels:
    app: bk-ci-builder
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: bk-ci-builder
  template:
    metadata:
      labels:
        app: bk-ci-builder
      name: bk-ci-builder-cleaner
    spec:
      hostPID: true
      restartPolicy: Always
      serviceAccountName: bk-applog-bkapp-filebeat
      containers:
      - name: batch-delete-files
        image: xxx.xxx.com/bk-ci-builder-cleaner:v1
        imagePullPolicy: IfNotPresent
        command:
        - bash
        args:
        - -c
        - while true; do ./delete_files.sh; sleep 21600; done;
        resources:
          requests:
            cpu: 25m
            memory: 32Mi
          limits:
            cpu: 2560m
            memory: 256Mi
        volumeMounts:
        - mountPath: /data/devops/workspace
          name: data-volume
        - mountPath: /data/devops/logs
          name: logs-volume
      volumes:
      - name: data-volume
        hostPath:
          path: /data/landun/workspace
          type: DirectoryOrCreate
      - name: logs-volume
        hostPath:
          path: /data/landun/logs
          type: DirectoryOrCreate
```
The cache-cleanup script delete_files.sh:
```bash
#!/usr/bin/env bash
set -euo pipefail

ROOT_DIRS=("/data/devops/workspace" "/data/devops/logs")
RETENTION_DAYS=7
LOG_FILE="/tmp/delete_build_dirs.log"

log() {
  printf '%s [%s] %s\n' "$(date '+%F %T')" "$1" "$2" | tee -a "$LOG_FILE"
}

cutoff_date=$(date -d "$RETENTION_DAYS days ago" +%F)

log INFO "==== Checking for build* directories not updated in $RETENTION_DAYS days ===="
for root in "${ROOT_DIRS[@]}"; do
  [[ -d $root ]] || { log WARN "Directory does not exist: $root"; continue; }
  for dir in "$root"/build*; do
    [[ -d $dir ]] || continue
    # Delete the directory only if it contains no file newer than the cutoff.
    if ! find "$dir" -type f -newermt "$cutoff_date" -print -quit | grep -q .; then
      log DELETE "$dir"
      rm -rf "$dir"
    else
      log SKIP "$dir"
    fi
  done
done
log INFO "==== Cleanup finished, log: $LOG_FILE ===="
```
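The key decision in the script is the staleness check: a build* directory is removed only when `find -newermt` finds no file inside it modified after the cutoff date. The behavior can be demonstrated in isolation against a temporary directory; `is_stale` is a hypothetical helper extracted here for clarity (GNU `find`/`date`/`touch` assumed, as in the script above):

```shell
#!/usr/bin/env bash
# Demo of the retention check used in delete_files.sh.
# is_stale returns success when the directory has NO file newer than the cutoff.
is_stale() {  # $1 = directory, $2 = cutoff date (YYYY-MM-DD)
  ! find "$1" -type f -newermt "$2" -print -quit | grep -q .
}

demo_root=$(mktemp -d)
mkdir -p "$demo_root/build-old" "$demo_root/build-new"
touch -d "2020-01-01" "$demo_root/build-old/artifact"  # long past the cutoff
touch "$demo_root/build-new/artifact"                  # modified just now

cutoff=$(date -d "7 days ago" +%F)
is_stale "$demo_root/build-old" "$cutoff" && echo "build-old would be deleted"
is_stale "$demo_root/build-new" "$cutoff" || echo "build-new would be kept"
rm -rf "$demo_root"
```

Using `-print -quit` means `find` stops at the first qualifying file, so the check stays cheap even for large workspaces.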
4. References