Cleaning Up the Cache of 蓝盾 (BK-CI) "Docker Public Build Machines"

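After running the 蓝盾 (BK-CI) "Docker public build machines" for a while, we noticed that image builds occasionally timed out. The cause turned out to be full disks on the cluster's Nodes. This article describes how to clean up the build cache.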

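1. Background

The occasional image build timeouts started after we switched to Docker-in-Docker image builds, and they became more frequent over time. Further investigation showed that each build Pod mounts its working directory and log directory through hostPath, and the growing number of build jobs eventually filled the Node's disk.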

2. Investigation

2.1 Event analysis

The Pod events show that the Node ran out of disk space, so the Pod was evicted and the build job failed.

Events:
  Type     Reason                 Age                 From     Message
  ----     ------                 ----                ----     -------
  Warning  Evicted                20m                 kubelet  The node was low on resource: ephemeral-storage. Container build1753761077695-ivcpmoxg was using 1580320Ki, which exceeds its request of 0.
  Normal   NodeHasNoDiskPressure  3m (x32 over 6d5h)  kubelet  Node 10.10.32.2 status is now: NodeHasNoDiskPressure
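
To confirm on the Node itself which build directories take up the space, a small helper along the following lines can be used. This is only a sketch: the function name `top_build_dirs` is made up for illustration, and it assumes the `build*` directory naming seen in the event above and a shell session on the affected Node.

```shell
#!/usr/bin/env bash
# Sketch: list the largest build* directories under a hostPath root.
# Usage: top_build_dirs <root> [count]
# top_build_dirs is a hypothetical helper, not part of 蓝盾.
top_build_dirs() {
  local root=$1 count=${2:-10}
  local dirs=()
  local d
  for d in "$root"/build*/; do
    # An unmatched glob stays literal and fails the -d test, so it is skipped.
    if [[ -d $d ]]; then
      dirs+=("${d%/}")
    fi
  done
  # Nothing to report if no build* directory exists.
  (( ${#dirs[@]} )) || return 0
  # du -sk prints "<size in KiB><TAB><path>"; sort numerically, largest first.
  du -sk -- "${dirs[@]}" | sort -rn | head -n "$count"
}
```

For example, `top_build_dirs /data/landun/workspace 5` prints the five largest build directories under the data disk, with their sizes in KiB.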

Pod YAML
The Pod mounts its working directory and log directory through hostPath. The hostPath mounts act as a cache, so that repeated runs of the same pipeline job are faster.

volumes:
- hostPath:
    path: /data/landun/workspace/build1753761077695-ivcpmoxg
    type: ""
  name: data-volume
- hostPath:
    path: /data/landun/logs/build1753761077695-ivcpmoxg
    type: ""
  name: logs-volume

Configuration file of the dispatch-k8s-manager module
dispatch-k8s-manager/resources/config.yaml

dispatch:
  volume:
    # Build machine scripts
    builderConfigMap:
      name: dispatch-kubernetes-builder
      items:
        # Initialization script
        - key: initsh.properties
          path: init.sh
        # sleep script needed for login debugging
        - key: sleepsh.properties
          path: sleep.sh
    hostPath:
      # Data disk
      dataHostDir: /data/landun/workspace
      # Log disk
      logsHostDir: /data/landun/logs
    # Application data uses CFS
    cfs:
      path: /data/cfs
  volumeMount:
    dataPath: /data/landun/workspace
    logPath: /data/logs
    builderConfigMapPath: /data/landun/config
    cfs:
      path: /data/bkdevops/apps
      readOnly: true

2.2 Source code analysis

dispatch-k8s-manager/pkg/apiserver/service/builder_start.go

// getBuilderVolumeAndMount returns the regular volumes and mounts that are attached to a build machine pod
func getBuilderVolumeAndMount(
	workloadName string,
	nFSs []types.NFS,
) (volumes []corev1.Volume, volumeMounts []corev1.VolumeMount) {
	volumes = getBuilderPodVolume(workloadName)
	volumeMounts = getBuilderPodVolumeMount()

	...

	return volumes, volumeMounts
}

// getBuilderPodVolume returns the regular volumes attached to a build machine pod,
// including the configuration configmap and the hostPath of the data directory
func getBuilderPodVolume(workloadName string) []corev1.Volume {
	// The per-workload host directories are the configured base directories plus workloadName.
	dataHostPath := filepath.Join(config.Config.Dispatch.Volume.HostPath.DataHostDir, workloadName)
	logHostPath := filepath.Join(config.Config.Dispatch.Volume.HostPath.LogsHostDir, workloadName)

	var items []corev1.KeyToPath
	for _, v := range config.Config.Dispatch.Volume.BuilderConfigMap.Items {
		items = append(items, corev1.KeyToPath{
			Key:  v.Key,
			Path: v.Path,
		})
	}

	return ...
}

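The source code shows that each hostPath is built by joining the directory configured in dispatch-k8s-manager/resources/config.yaml with workloadName, so hostPath cannot be turned off through the configuration file. We therefore clean up the cache with a scheduled job.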
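
To make the path composition concrete, the following sketch mimics it in shell. The helper name `builder_host_path` is hypothetical and is not part of dispatch-k8s-manager. It only approximates `filepath.Join` for simple inputs such as the ones used here.

```shell
#!/usr/bin/env bash
# Sketch: mimic how the per-workload hostPath is derived.
# Usage: builder_host_path <configured host dir> <workloadName>
# builder_host_path is a hypothetical helper used for illustration only.
builder_host_path() {
  local host_dir=$1 workload_name=$2
  # Drop one trailing slash so the result has a single separator,
  # roughly what filepath.Join does for these simple inputs.
  printf '%s/%s\n' "${host_dir%/}" "$workload_name"
}
```

Calling `builder_host_path /data/landun/logs build1753761077695-ivcpmoxg` prints `/data/landun/logs/build1753761077695-ivcpmoxg`, the same path that appears in the Pod's logs-volume.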

3. Solution

Following the log cleanup approach of bk-applog-bkapp-filebeat, we use a DaemonSet to periodically clean the working directories that 蓝盾 mounts on each Node.

NAME                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
bk-applog-bkapp-filebeat-ingress       18        18        18      18           18          <none>          424d
bk-applog-bkapp-filebeat-json          18        18        18      18           18          <none>          424d
bk-applog-bkapp-filebeat-log-cleaner   18        18        18      18           18          <none>          424d
bk-applog-bkapp-filebeat-stdout        18        18        18      18           18          <none>          424d
bk-ci-builder-cleaner                  18        18        18      18           18          <none>          13d

Write daemonSet.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: bk-ci-builder-cleaner
  namespace: blueking
  labels:
    app: bk-ci-builder
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: bk-ci-builder
  template:
    metadata:
      labels:
        app: bk-ci-builder
      name: bk-ci-builder-cleaner
    spec:
      hostPID: true
      restartPolicy: Always
      serviceAccountName: bk-applog-bkapp-filebeat
      containers:
        - name: batch-delete-files
          image: xxx.xxx.com/bk-ci-builder-cleaner:v1
          imagePullPolicy: IfNotPresent
          command:
            - bash
          args:
            - -c
            - while true; do ./delete_files.sh; sleep 21600; done;
          resources:
            requests:
              cpu: 25m
              memory: 32Mi
            limits:
              cpu: 2560m
              memory: 256Mi
          volumeMounts:
            - mountPath: /data/devops/workspace
              name: data-volume
            - mountPath: /data/devops/logs
              name: logs-volume
      volumes:
        - name: data-volume
          hostPath:
            path: /data/landun/workspace
            type: DirectoryOrCreate
        - name: logs-volume
          hostPath:
            path: /data/landun/logs
            type: DirectoryOrCreate
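
Rolling out the manifest could look roughly like the sketch below. The wrapper function `deploy_cleaner` and the 120-second timeout are assumptions for illustration. The namespace and DaemonSet name are taken from the manifest above, and the image is assumed to already contain delete_files.sh in the container's working directory.

```shell
#!/usr/bin/env bash
# Sketch: apply the DaemonSet manifest and wait for the rollout to finish.
# Usage: deploy_cleaner [manifest]
# deploy_cleaner is a hypothetical wrapper; the timeout value is an assumption.
deploy_cleaner() {
  local manifest=${1:-daemonSet.yaml}
  local namespace=blueking name=bk-ci-builder-cleaner
  kubectl apply -f "$manifest" &&
    kubectl -n "$namespace" rollout status "daemonset/$name" --timeout=120s &&
    kubectl -n "$namespace" get daemonset "$name"
}
```

Running `deploy_cleaner daemonSet.yaml` applies the manifest, waits until the cleaner Pods are ready on the Nodes, and then prints the DaemonSet status.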

Cache cleanup script delete_files.sh

#!/usr/bin/env bash
# delete_files.sh: production version (really deletes)
# Scans both /data/devops/workspace and /data/devops/logs
set -euo pipefail

# --------- Configurable parameters ---------
ROOT_DIRS=("/data/devops/workspace" "/data/devops/logs")
RETENTION_DAYS=7
LOG_FILE="/tmp/delete_build_dirs.log"
# -------------------------------------------

# log LEVEL MESSAGE: print a timestamped line and append it to the log file
log() {
  printf '%s [%s] %s\n' "$(date '+%F %T')" "$1" "$2" | tee -a "$LOG_FILE"
}

# Files modified before this date (GNU date syntax) count as stale
cutoff_date=$(date -d "$RETENTION_DAYS days ago" +%F)

log INFO "==== Start checking and deleting build* directories not updated for $RETENTION_DAYS days ===="

for root in "${ROOT_DIRS[@]}"; do
  [[ -d $root ]] || { log WARN "Directory does not exist: $root"; continue; }

  for dir in "$root"/build*; do
    [[ -d $dir ]] || continue

    # Delete only if the directory contains no file modified within the retention period
    if ! find "$dir" -type f -newermt "$cutoff_date" -print -quit | grep -q .; then
      log DELETE "$dir"
      rm -rf "$dir"
    else
      log SKIP "$dir"
    fi
  done
done

log INFO "==== Cleanup finished, log: $LOG_FILE ===="
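
Before enabling real deletion, it can be useful to see which directories the script would remove. The sketch below applies the same staleness test as delete_files.sh but only prints the candidates. The function name `list_stale_build_dirs` is made up for illustration, and GNU `date` and `find` are assumed, as in the script above.

```shell
#!/usr/bin/env bash
# Sketch: dry run that only prints the build* directories the cleanup would delete.
# Usage: list_stale_build_dirs <root> [retention_days]
# list_stale_build_dirs is a hypothetical helper, not part of delete_files.sh.
list_stale_build_dirs() {
  local root=$1 days=${2:-7}
  local cutoff dir
  # GNU date is assumed, as in delete_files.sh.
  cutoff=$(date -d "$days days ago" +%F)
  for dir in "$root"/build*; do
    if [[ ! -d $dir ]]; then
      continue
    fi
    # A directory is stale when no regular file in it is newer than the cutoff.
    if [[ -z $(find "$dir" -type f -newermt "$cutoff" -print -quit) ]]; then
      printf '%s\n' "$dir"
    fi
  done
}
```

For example, `list_stale_build_dirs /data/devops/workspace 7` run inside the cleaner container lists the workspace directories that contain no file modified within the last 7 days. Note that a `build*` directory without any regular files is also reported, which matches the behavior of the deletion script.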

4. References