Date: 2023-11-14
Based on real-world experience.
Two common mistakes of using NFS for dynamic volume provisioning on Kubernetes:
nfs-subdir-external-provisioner instead of csi-driver-nfs
The Container Storage Interface (CSI) is a standard for exposing arbitrary block and file storage systems to containerized workloads on Container Orchestration Systems (COs) like Kubernetes. Using CSI third-party storage providers can write and deploy plugins exposing new storage systems in Kubernetes without ever having to touch the core Kubernetes code.
Source: Kubernetes CSI Developer Documentation
Basically, CSI is a unified abstraction layer for managing and interacting with storage on Kubernetes. A conforming implementation is known as a CSI driver and exposes a set of common storage operations that covers both the running of stateful applications on Kubernetes and the primitives for enabling application backup and recovery, including volume snapshots (provisioning VolumeSnapshotContent objects to satisfy VolumeSnapshot requests).
csi-driver-nfs is a CSI driver for NFS. On the other hand, nfs-subdir-external-provisioner is just a StorageClass implementation and lacks volume snapshot capabilities, which is not acceptable for production use.
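As a quick illustration - assuming a cluster that already has a CSI driver and the snapshot CRDs installed, as we’ll set up below - both can be inspected at a glance:
# List the CSI drivers registered with the cluster
kubectl get csidrivers.storage.k8s.io
# Check whether the VolumeSnapshot CRDs exist at all - a plain external
# provisioner such as nfs-subdir-external-provisioner does not install these
kubectl api-resources | grep -i volumesnapshot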
In some cases, volume snapshots fail to be created even with the NFS CSI driver. This is due to fundamental limitations with NFS itself. Perhaps a hands-on demo is in order.
In the lab to follow, we’ll set up a 2-node kubeadm cluster (1 master, 1 worker) with the NFS CSI driver installed, deploy MinIO as a sample stateful application and attempt to create a VolumeSnapshot from the minio PVC. We’ll then investigate the root cause of the issue and conclude why NFS is not suitable for Kubernetes storage in a production context.
Familiarity with Kubernetes cluster administration is assumed. If not, consider enrolling in the comprehensive LFS258: Kubernetes Fundamentals online training course offered by The Linux Foundation which is also the official training course for the CKA certification exam offered by the CNCF.
It is assumed you already have a public cloud account (such as an AWS account), or a laptop / workstation capable of hosting at least 2 Linux nodes, each with 2 vCPUs, 8G of RAM and 40G of storage space; one node will become the master and the other the worker. You may follow the lab with a bare-metal setup as well if desired.
The reference distribution is Ubuntu 22.04 LTS, against which the instructions in this lab have been tested. If you’re looking for a challenge, feel free to follow the lab with a different distribution, but beware that some of the instructions may require non-trivial modification.
For the purposes of this lab, we’ll refer to our master node as master0 and our worker node as worker0.
Let’s set up a kubeadm cluster following the typical process.
The versions of Kubernetes and associated components to be installed:
- Kubernetes: v1.28.3
- containerd: v1.7.8
- runc: v1.1.10
- CNI plugins: v1.3.0
- Calico: v3.26.3
master0
Run the following commands on master0 to perform preliminary setup and avoid issues when installing and initializing Kubernetes. Make sure to replace x.x.x.x below with the private IP address of master0.
sudo hostnamectl set-hostname master0
echo "export PATH=\"/opt/cni/bin:/usr/local/sbin:/usr/local/bin:\$PATH\"" >> "$HOME/.bashrc" && \
source "$HOME/.bashrc"
sudo sed -i 's#Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin#Defaults secure_path = /opt/cni/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin#' /etc/sudoers
export K8S_CONTROL_PLANE="x.x.x.x"
echo "$K8S_CONTROL_PLANE k8s-control-plane" | sudo tee -a /etc/hosts
sudo modprobe br_netfilter
echo br_netfilter | sudo tee /etc/modules-load.d/kubernetes.conf
cat << EOF | sudo tee -a /etc/sysctl.conf
net.ipv4.ip_forward=1
EOF
sudo sysctl -p
sudo systemctl reboot
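Once the node comes back up, an optional sanity check confirms that the module and sysctl setting survived the reboot:
# br_netfilter should appear in the list of loaded modules
lsmod | grep br_netfilter
# IPv4 forwarding should be enabled (expected value: 1)
sysctl net.ipv4.ip_forward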
After the reboot, run the following commands to install the containerd CRI and associated components:
wget https://github.com/containerd/containerd/releases/download/v1.7.8/containerd-1.7.8-linux-amd64.tar.gz
sudo tar Cxzvf /usr/local containerd-1.7.8-linux-amd64.tar.gz
sudo mkdir -p /usr/local/lib/systemd/system/
sudo wget -qO /usr/local/lib/systemd/system/containerd.service https://raw.githubusercontent.com/containerd/containerd/main/containerd.service
sudo systemctl daemon-reload
sudo systemctl enable --now containerd.service
sudo mkdir -p /etc/containerd/
containerd config default | \
sed 's/SystemdCgroup = false/SystemdCgroup = true/' | \
sed 's/pause:3.8/pause:3.9/' | \
sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd.service
sudo mkdir -p /usr/local/sbin/
sudo wget -qO /usr/local/sbin/runc https://github.com/opencontainers/runc/releases/download/v1.1.10/runc.amd64
sudo chmod +x /usr/local/sbin/runc
sudo mkdir -p /opt/cni/bin/
wget https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-linux-amd64-v1.3.0.tgz
sudo tar Cxzvf /opt/cni/bin cni-plugins-linux-amd64-v1.3.0.tgz
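Before installing Kubernetes itself, you can optionally confirm that the runtime pieces are in place:
# containerd should report both client and server versions
sudo ctr version
# runc and the CNI plugins should be on their expected paths
runc --version
ls /opt/cni/bin/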
Now run the commands below to install Kubernetes and initialize our master node:
sudo apt update && sudo apt install -y apt-transport-https ca-certificates curl
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update && sudo apt install -y \
kubeadm=1.28.3-1.1 \
kubelet=1.28.3-1.1 \
kubectl=1.28.3-1.1
sudo apt-mark hold kubelet kubeadm kubectl
sudo systemctl enable --now kubelet.service
cat > kubeadm-config.yaml << EOF
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
kubernetesVersion: v1.28.3
controlPlaneEndpoint: "k8s-control-plane:6443"
networking:
  podSubnet: "192.168.0.0/16"
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd
EOF
sudo kubeadm init --config kubeadm-config.yaml
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
echo "source <(kubectl completion bash)" >> "$HOME/.bashrc" && \
source "$HOME/.bashrc"
wget -qO - https://raw.githubusercontent.com/projectcalico/calico/v3.26.3/manifests/calico.yaml | \
kubectl apply -f -
Now run the following command to wait for master0 to become ready - this should take no longer than 5 minutes:
kubectl wait --for=condition=Ready node master0 --timeout=300s
Sample output:
node/master0 condition met
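You can also eyeball the node list at this point; the exact IP addresses and component versions will differ in your environment:
# master0 should show a STATUS of Ready
kubectl get nodes -o wide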
worker0
Now set up our worker node worker0.
Again, the preliminary setup, which also reboots our node - replace x.x.x.x again with the private IP address of master0:
sudo hostnamectl set-hostname worker0
echo "export PATH=\"/opt/cni/bin:/usr/local/sbin:/usr/local/bin:\$PATH\"" >> "$HOME/.bashrc" && \
source "$HOME/.bashrc"
sudo sed -i 's#Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin#Defaults secure_path = /opt/cni/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin#' /etc/sudoers
export K8S_CONTROL_PLANE="x.x.x.x"
echo "$K8S_CONTROL_PLANE k8s-control-plane" | sudo tee -a /etc/hosts
sudo modprobe br_netfilter
echo br_netfilter | sudo tee /etc/modules-load.d/kubernetes.conf
cat << EOF | sudo tee -a /etc/sysctl.conf
net.ipv4.ip_forward=1
EOF
sudo sysctl -p
sudo systemctl reboot
Next, install containerd and associated components:
wget https://github.com/containerd/containerd/releases/download/v1.7.8/containerd-1.7.8-linux-amd64.tar.gz
sudo tar Cxzvf /usr/local containerd-1.7.8-linux-amd64.tar.gz
sudo mkdir -p /usr/local/lib/systemd/system/
sudo wget -qO /usr/local/lib/systemd/system/containerd.service https://raw.githubusercontent.com/containerd/containerd/main/containerd.service
sudo systemctl daemon-reload
sudo systemctl enable --now containerd.service
sudo mkdir -p /etc/containerd/
containerd config default | \
sed 's/SystemdCgroup = false/SystemdCgroup = true/' | \
sed 's/pause:3.8/pause:3.9/' | \
sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd.service
sudo mkdir -p /usr/local/sbin/
sudo wget -qO /usr/local/sbin/runc https://github.com/opencontainers/runc/releases/download/v1.1.10/runc.amd64
sudo chmod +x /usr/local/sbin/runc
sudo mkdir -p /opt/cni/bin/
wget https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-linux-amd64-v1.3.0.tgz
sudo tar Cxzvf /opt/cni/bin cni-plugins-linux-amd64-v1.3.0.tgz
Now, install Kubernetes and initialize our worker node - replace the x’s with your Kubernetes token and CA certificate hash as shown in the output of kubeadm init on our master node master0:
sudo apt update && sudo apt install -y apt-transport-https ca-certificates curl
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update && sudo apt install -y \
kubeadm=1.28.3-1.1 \
kubelet=1.28.3-1.1
sudo apt-mark hold kubelet kubeadm
sudo systemctl enable --now kubelet.service
export K8S_TOKEN="xxxxxx.xxxxxxxxxxxxxxxx"
export K8S_CA_CERT_HASH="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
sudo kubeadm join k8s-control-plane:6443 \
--discovery-token "${K8S_TOKEN}" \
--discovery-token-ca-cert-hash "sha256:${K8S_CA_CERT_HASH}"
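If you no longer have the kubeadm init output at hand (bootstrap tokens expire after 24 hours by default), a fresh join command can be generated on master0:
# Run on master0 - prints a complete kubeadm join command with a new token
sudo kubeadm token create --print-join-command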
Now run a similar command on the master node to wait for our worker node to become ready - again, this should take no longer than 5 minutes:
kubectl wait --for=condition=Ready node worker0 --timeout=300s
Sample output:
node/worker0 condition met
We’ll set up our NFS share on our worker node so our applications run close to our data. Run these commands on worker0:
sudo apt update && sudo apt install -y nfs-kernel-server
sudo systemctl enable --now nfs-kernel-server.service
sudo chown -R nobody:nogroup /srv
cat << EOF | sudo tee /etc/exports
/srv *(rw,sync)
EOF
sudo exportfs -a
Now check that the NFS share is available:
showmount -e
Sample output:
Export list for worker0:
/srv *
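Optionally, verify that the export is also reachable from master0 - the nfs-common package provides showmount and the NFS client utilities; replace x.x.x.x with worker0’s private IP:
sudo apt update && sudo apt install -y nfs-common
showmount -e x.x.x.x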
With our NFS share available, let’s install the NFS CSI driver onto our cluster. We’ll also create the following objects:
- A StorageClass utilizing the driver for dynamically provisioning NFS-backed storage, marked as default via the annotation storageclass.kubernetes.io/is-default-class=true
- A VolumeSnapshotClass utilizing the driver for taking snapshots of our NFS volumes
Run these commands on master0. Replace x.x.x.x below with the IP address of worker0 since that is where we installed our NFS share.
wget https://get.helm.sh/helm-v3.13.2-linux-amd64.tar.gz
tar xvf helm-v3.13.2-linux-amd64.tar.gz
chmod +x linux-amd64/helm
mkdir -p "$HOME/.local/bin/"
mv linux-amd64/helm "$HOME/.local/bin/"
echo "export PATH=\"\$HOME/.local/bin:\$PATH\"" >> "$HOME/.bashrc"
source "$HOME/.bashrc"
helm repo add csi-driver-nfs https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
helm repo update
helm -n kube-system install \
csi-driver-nfs \
csi-driver-nfs/csi-driver-nfs \
--version v4.5.0 \
--set externalSnapshotter.enabled=true
export K8S_WORKER_NODE="x.x.x.x"
kubectl apply -f - << EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: ${K8S_WORKER_NODE}
  share: /srv
reclaimPolicy: Delete
volumeBindingMode: Immediate
EOF
kubectl annotate storageclass nfs-csi storageclass.kubernetes.io/is-default-class=true
kubectl apply -f - << EOF
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-nfs-snapclass
driver: nfs.csi.k8s.io
deletionPolicy: Delete
EOF
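Before deploying a workload, it’s worth a quick check that the driver pods are running and that our classes are in place (pod names are whatever the chart generates; the grep below is just a convenient filter):
# The CSI controller and per-node pods should all be Running
kubectl -n kube-system get pods | grep csi-nfs
# Our default StorageClass and the VolumeSnapshotClass
kubectl get storageclass nfs-csi
kubectl get volumesnapshotclass csi-nfs-snapclass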
MinIO is an Amazon S3-compatible object storage solution that can be deployed on Kubernetes.
We’ll not dive deep into MinIO, however - our focus is on the fact that it is stateful and therefore requests provisioned storage via a PVC named minio.
Run the commands below on master0:
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm -n minio install \
minio \
bitnami/minio \
--version 12.9.4 \
--create-namespace
Now wait for the pods in the minio namespace to become ready - this should take no longer than 5 minutes:
kubectl -n minio wait --for=condition=Ready pods --all --timeout=300s
Sample output:
pod/minio-76c7dcbb5-84qmb condition met
Notice that a PVC named minio has been created and its status should be Bound:
kubectl -n minio get pvc
Sample output:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
minio Bound pvc-44ac176e-f1d5-4fe1-ac58-2153f0db6198 8Gi RWO nfs-csi 2m56s
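If you’re curious where the data actually lives, csi-driver-nfs provisions each volume as a subdirectory of the export, so on worker0 you should see a directory named after the PV (the exact name will differ in your environment):
# Run on worker0 - expect a directory such as pvc-44ac176e-f1d5-4fe1-ac58-2153f0db6198
ls -l /srv/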
Creating a VolumeSnapshot from the minio PVC
While creating a snapshot on its own does not constitute a backup, it is nonetheless a fundamental operation implemented as part of a complete workflow in comprehensive Kubernetes-native backup and recovery solutions such as Velero and Kasten K10.
Let’s try to create a VolumeSnapshot for our minio PVC - run these commands on master0:
kubectl -n minio apply -f - << EOF
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-minio-snapshot
spec:
  volumeSnapshotClassName: csi-nfs-snapclass
  source:
    persistentVolumeClaimName: minio
EOF
Now wait a few seconds and observe that the snapshot has failed:
kubectl -n minio get volumesnapshot test-minio-snapshot -o jsonpath='{.status.error.message}'
Sample output:
Failed to check and update snapshot content: failed to take snapshot of the volume 10.1.0.162#srv#pvc-44ac176e-f1d5-4fe1-ac58-2153f0db6198##: "rpc error: code = Internal desc = failed to create archive for snapshot: exit status 2: ./\n./.minio.sys/\n./.minio.sys/multipart/\n./.minio.sys/format.json\n./.minio.sys/pool.bin/\n./.minio.sys/pool.bin/xl.meta\n./.minio.sys/tmp/\n./.minio.sys/tmp/9bd8ec0e-cf8a-4c86-8dcf-0bbe1d35de80\n./.minio.sys/tmp/.trash/\n./.minio.sys/config/\n./.minio.sys/config/iam/\n./.minio.sys/config/iam/format.json/\n./.minio.sys/config/iam/format.json/xl.meta\n./.minio.sys/config/config.json/\n./.minio.sys/config/config.json/xl.meta\n./.minio.sys/buckets/\n./.minio.sys/buckets/.bloomcycle.bin/\n./.minio.sys/buckets/.bloomcycle.bin/xl.meta\n./.minio.sys/buckets/.usage.json/\n./.minio.sys/buckets/.usage.json/xl.meta\ntar: ./.root_password: Cannot open: Permission denied\ntar: ./.root_user: Cannot open: Permission denied\ntar: Exiting with failure status due to previous errors\n"
In particular, notice the keywords Permission denied.
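The failure should also be visible in the object’s status and events if you prefer describe output:
kubectl -n minio describe volumesnapshot test-minio-snapshot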
Viewing the file and directory ownership under /bitnami/minio/data/ in the MinIO pod reveals the root cause of the issue:
export MINIO_POD="$(kubectl -n minio get pod -l app.kubernetes.io/name=minio --no-headers | awk '{ print $1 }')"
kubectl -n minio exec "${MINIO_POD}" -- ls -al /bitnami/minio/data/
Sample output:
total 20
drwxrwsr-x 3 nobody nogroup 4096 Nov 14 14:02 .
drwxr-xr-x 3 root root 4096 Nov 11 18:58 ..
drwxr-sr-x 7 1001 nogroup 4096 Nov 14 14:02 .minio.sys
-rw------- 1 1001 nogroup 11 Nov 14 14:02 .root_password
-rw------- 1 1001 nogroup 6 Nov 14 14:02 .root_user
By default, NFS shares have root squash enabled, which maps the privileged root user on NFS clients to the unprivileged nobody user on the NFS server, preventing remote clients from unexpectedly gaining root privileges on the NFS host - which would otherwise be a security concern. All other users are mapped to their own UID.
The MinIO pod in our Helm chart runs as a non-root user with UID 1001 as a security measure, so the files and directories created by this pod also have an owner UID of 1001. However, neither .root_user nor .root_password is group- or world-readable, and the CSI snapshot operation for NFS, implemented via tar, presumably runs as root, which maps to nobody on the NFS share. As a result, the tar command is unable to read these two files and the snapshot operation fails.
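To see root squash in action outside of Kubernetes, you can mount the share directly from another machine - a hypothetical client with the export from worker0 mounted at /mnt, x.x.x.x being worker0’s private IP and the pvc-xxxxxxxx path a placeholder for the actual PV directory:
# Mount the export (requires the NFS client utilities, e.g. nfs-common)
sudo mount -t nfs x.x.x.x:/srv /mnt
# A file created by root on the client arrives owned by nobody:nogroup on the server
sudo touch /mnt/root-test && ls -l /mnt/root-test
# Reading a 0600 file owned by UID 1001 fails even as root, because root is
# squashed to nobody on the server
sudo cat /mnt/pvc-xxxxxxxx/.root_password    # Permission denied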
There are at least two ways to work around this issue (both sketched below for illustration only), but each involves reducing the overall security of our infrastructure and workloads and is therefore unacceptable in a production environment:
- Enabling no_root_squash on the NFS export, which opens up the NFS share host to privilege escalation attacks
- Running MinIO as root by specifying the appropriate Helm chart values, which opens up the cluster to potential container escape and privilege escalation attacks
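Assuming the lab layout above (the /srv export on worker0 and the Bitnami MinIO chart), the workarounds would look roughly like this - the chart value names are assumptions and may differ between chart versions:
# Workaround 1 (NOT recommended): disable root squash on the export - run on worker0
cat << EOF | sudo tee /etc/exports
/srv *(rw,sync,no_root_squash)
EOF
sudo exportfs -ra
# Workaround 2 (NOT recommended): run MinIO as root via the chart's security
# context values (value names are assumptions - check your chart version)
helm -n minio upgrade minio bitnami/minio \
  --version 12.9.4 \
  --reuse-values \
  --set containerSecurityContext.runAsUser=0 \
  --set containerSecurityContext.runAsNonRoot=false \
  --set podSecurityContext.fsGroup=0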
Don’t use NFS as a storage backend for your production-grade on-premises Kubernetes cluster!