Date: 2023-11-14
Based on real-world experience.
Two common mistakes of using NFS for dynamic volume provisioning on Kubernetes:
nfs-subdir-external-provisioner instead of csi-driver-nfs
The Container Storage Interface (CSI) is a standard for exposing arbitrary block and file storage systems to containerized workloads on Container Orchestration Systems (COs) like Kubernetes. Using CSI third-party storage providers can write and deploy plugins exposing new storage systems in Kubernetes without ever having to touch the core Kubernetes code.
Source: Kubernetes CSI Developer Documentation
Basically, CSI is a unified abstraction layer for managing and interacting with storage on Kubernetes. A conforming implementation is known as a CSI driver and exposes a set of common storage operations that covers both the running of stateful applications on Kubernetes and the primitives for enabling application backup and recovery, including volume snapshots (provisioning VolumeSnapshotContent objects to satisfy VolumeSnapshot requests).
csi-driver-nfs is a CSI driver for NFS. On the other hand, nfs-subdir-external-provisioner is just a StorageClass implementation and lacks volume snapshot capabilities, which is not acceptable for production use.
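As a quick illustration - assuming a cluster that already has a CSI driver and the snapshot CRDs installed, as we’ll set up below - both can be inspected at a glance:
# List the CSI drivers registered with the cluster
kubectl get csidrivers.storage.k8s.io
# Check whether the VolumeSnapshot CRDs exist at all - a plain external
# provisioner such as nfs-subdir-external-provisioner does not install these
kubectl api-resources | grep -i volumesnapshot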
In some cases, volume snapshots fail to be created even with the NFS CSI driver. This is due to fundamental limitations with NFS itself. Perhaps a hands-on demo is in order.
In the lab to follow, we’ll set up a 2-node kubeadm cluster (1 master, 1 worker) with the NFS CSI driver installed, deploy MinIO as a sample stateful application and attempt to create a VolumeSnapshot from the minio PVC. We’ll then investigate the root cause of the issue and conclude why NFS is not suitable for Kubernetes storage in a production context.
Familiarity with Kubernetes cluster administration is assumed. If not, consider enrolling in the comprehensive LFS258: Kubernetes Fundamentals online training course offered by The Linux Foundation which is also the official training course for the CKA certification exam offered by the CNCF.
It is assumed you already have a public cloud account (such as an AWS account), or a laptop / workstation capable of hosting at least 2 Linux nodes, each with 2 vCPUs, 8G of RAM and 40G of storage space; one node will become the master and the other the worker. You may follow the lab with a bare-metal setup as well if desired.
The reference distribution is Ubuntu 22.04 LTS, against which the instructions in this lab have been tested. If you’re looking for a challenge, feel free to follow the lab with a different distribution, but beware that some of the instructions may require non-trivial modification.
For the purposes of this lab, we’ll refer to our master node as master0 and our worker node as worker0.
Let’s set up a kubeadm cluster following the typical process.
The versions of Kubernetes and associated components to be installed:
- Kubernetes: v1.28.3
- containerd: v1.7.8
- runc: v1.1.10
- CNI plugins: v1.3.0
- Calico: v3.26.3
master0
Run the following commands on master0 to perform preliminary setup and avoid issues when installing and initializing Kubernetes. Make sure to replace x.x.x.x below with the private IP address of master0.
sudo hostnamectl set-hostname master0
echo "export PATH=\"/opt/cni/bin:/usr/local/sbin:/usr/local/bin:\$PATH\"" >> "$HOME/.bashrc" && \
source "$HOME/.bashrc"
sudo sed -i 's#Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin#Defaults secure_path = /opt/cni/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin#' /etc/sudoers
export K8S_CONTROL_PLANE="x.x.x.x"
echo "$K8S_CONTROL_PLANE k8s-control-plane" | sudo tee -a /etc/hosts
sudo modprobe br_netfilter
echo br_netfilter | sudo tee /etc/modules-load.d/kubernetes.conf
cat << EOF | sudo tee -a /etc/sysctl.conf
net.ipv4.ip_forward=1
EOF
sudo sysctl -p
sudo systemctl reboot
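Once the node comes back up, an optional sanity check confirms that the module and sysctl setting survived the reboot:
# br_netfilter should appear in the list of loaded modules
lsmod | grep br_netfilter
# IPv4 forwarding should be enabled (expected value: 1)
sysctl net.ipv4.ip_forward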
After the reboot, run the following commands to install the containerd CRI and associated components:
wget https://github.com/containerd/containerd/releases/download/v1.7.8/containerd-1.7.8-linux-amd64.tar.gz
sudo tar Cxzvf /usr/local containerd-1.7.8-linux-amd64.tar.gz
sudo mkdir -p /usr/local/lib/systemd/system/
sudo wget -qO /usr/local/lib/systemd/system/containerd.service https://raw.githubusercontent.com/containerd/containerd/main/containerd.service
sudo systemctl daemon-reload
sudo systemctl enable --now containerd.service
sudo mkdir -p /etc/containerd/
containerd config default | \
sed 's/SystemdCgroup = false/SystemdCgroup = true/' | \
sed 's/pause:3.8/pause:3.9/' | \
sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd.service
sudo mkdir -p /usr/local/sbin/
sudo wget -qO /usr/local/sbin/runc https://github.com/opencontainers/runc/releases/download/v1.1.10/runc.amd64
sudo chmod +x /usr/local/sbin/runc
sudo mkdir -p /opt/cni/bin/
wget https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-linux-amd64-v1.3.0.tgz
sudo tar Cxzvf /opt/cni/bin cni-plugins-linux-amd64-v1.3.0.tgz
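Before installing Kubernetes itself, you can optionally confirm that the runtime pieces are in place:
# containerd should report both client and server versions
sudo ctr version
# runc and the CNI plugins should be on their expected paths
runc --version
ls /opt/cni/bin/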
Now run the commands below to install Kubernetes and initialize our master node:
sudo apt update && sudo apt install -y apt-transport-https ca-certificates curl
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update && sudo apt install -y \
kubeadm=1.28.3-1.1 \
kubelet=1.28.3-1.1 \
kubectl=1.28.3-1.1
sudo apt-mark hold kubelet kubeadm kubectl
sudo systemctl enable --now kubelet.service
cat > kubeadm-config.yaml << EOF
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
kubernetesVersion: v1.28.3
controlPlaneEndpoint: "k8s-control-plane:6443"
networking:
  podSubnet: "192.168.0.0/16"
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd
EOF
sudo kubeadm init --config kubeadm-config.yaml
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
echo "source <(kubectl completion bash)" >> "$HOME/.bashrc" && \
source "$HOME/.bashrc"
wget -qO - https://raw.githubusercontent.com/projectcalico/calico/v3.26.3/manifests/calico.yaml | \
kubectl apply -f -
Now run the following command to wait for master0 to become ready - this should take no longer than 5 minutes:
kubectl wait --for=condition=Ready node master0 --timeout=300s
Sample output:
node/master0 condition met
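You can also eyeball the node list at this point; the exact IP addresses and component versions will differ in your environment:
# master0 should show a STATUS of Ready
kubectl get nodes -o wide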
worker0
Now set up our worker node worker0.
Again, the preliminary setup, which also reboots our node - replace x.x.x.x again with the private IP address of master0:
sudo hostnamectl set-hostname worker0
echo "export PATH=\"/opt/cni/bin:/usr/local/sbin:/usr/local/bin:\$PATH\"" >> "$HOME/.bashrc" && \
source "$HOME/.bashrc"
sudo sed -i 's#Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin#Defaults secure_path = /opt/cni/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin#' /etc/sudoers
export K8S_CONTROL_PLANE="x.x.x.x"
echo "$K8S_CONTROL_PLANE k8s-control-plane" | sudo tee -a /etc/hosts
sudo modprobe br_netfilter
echo br_netfilter | sudo tee /etc/modules-load.d/kubernetes.conf
cat << EOF | sudo tee -a /etc/sysctl.conf
net.ipv4.ip_forward=1
EOF
sudo sysctl -p
sudo systemctl reboot
Next, install containerd and associated components:
wget https://github.com/containerd/containerd/releases/download/v1.7.8/containerd-1.7.8-linux-amd64.tar.gz
sudo tar Cxzvf /usr/local containerd-1.7.8-linux-amd64.tar.gz
sudo mkdir -p /usr/local/lib/systemd/system/
sudo wget -qO /usr/local/lib/systemd/system/containerd.service https://raw.githubusercontent.com/containerd/containerd/main/containerd.service
sudo systemctl daemon-reload
sudo systemctl enable --now containerd.service
sudo mkdir -p /etc/containerd/
containerd config default | \
sed 's/SystemdCgroup = false/SystemdCgroup = true/' | \
sed 's/pause:3.8/pause:3.9/' | \
sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd.service
sudo mkdir -p /usr/local/sbin/
sudo wget -qO /usr/local/sbin/runc https://github.com/opencontainers/runc/releases/download/v1.1.10/runc.amd64
sudo chmod +x /usr/local/sbin/runc
sudo mkdir -p /opt/cni/bin/
wget https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-linux-amd64-v1.3.0.tgz
sudo tar Cxzvf /opt/cni/bin cni-plugins-linux-amd64-v1.3.0.tgz
Now, install Kubernetes and initialize our worker node - replace the x’s with your Kubernetes token and CA certificate hash as shown in the output of kubeadm init on our master node master0:
sudo apt update && sudo apt install -y apt-transport-https ca-certificates curl
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update && sudo apt install -y \
kubeadm=1.28.3-1.1 \
kubelet=1.28.3-1.1
sudo apt-mark hold kubelet kubeadm
sudo systemctl enable --now kubelet.service
export K8S_TOKEN="xxxxxx.xxxxxxxxxxxxxxxx"
export K8S_CA_CERT_HASH="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
sudo kubeadm join k8s-control-plane:6443 \
--discovery-token "${K8S_TOKEN}" \
--discovery-token-ca-cert-hash "sha256:${K8S_CA_CERT_HASH}"
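If you no longer have the kubeadm init output at hand (bootstrap tokens expire after 24 hours by default), a fresh join command can be generated on master0:
# Run on master0 - prints a complete kubeadm join command with a new token
sudo kubeadm token create --print-join-command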
Now run a similar command on the master node to wait for our worker node to become ready - again, this should take no longer than 5 minutes:
kubectl wait --for=condition=Ready node worker0 --timeout=300s
Sample output:
node/worker0 condition met
We’ll set up our NFS share on our worker node so our applications run close to our data. Run these commands on worker0:
sudo apt update && sudo apt install -y nfs-kernel-server
sudo systemctl enable --now nfs-kernel-server.service
sudo chown -R nobody:nogroup /srv
cat << EOF | sudo tee /etc/exports
/srv *(rw,sync)
EOF
sudo exportfs -a
Now check that the NFS share is available:
showmount -e
Sample output:
Export list for worker0:
/srv *
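Optionally, verify that the export is also reachable from master0 - the nfs-common package provides showmount and the NFS client utilities; replace x.x.x.x with worker0’s private IP:
sudo apt update && sudo apt install -y nfs-common
showmount -e x.x.x.x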
With our NFS share available, let’s install the NFS CSI driver onto our cluster. We’ll also create the following objects:
- A StorageClass utilizing the driver for dynamically provisioning NFS-backed storage, marked as default via the annotation storageclass.kubernetes.io/is-default-class=true
- A VolumeSnapshotClass utilizing the driver for taking snapshots of our NFS volumes
Run these commands on master0. Replace x.x.x.x below with the IP address of worker0 since that is where we installed our NFS share.
wget https://get.helm.sh/helm-v3.13.2-linux-amd64.tar.gz
tar xvf helm-v3.13.2-linux-amd64.tar.gz
chmod +x linux-amd64/helm
mkdir -p "$HOME/.local/bin/"
mv linux-amd64/helm "$HOME/.local/bin/"
echo "export PATH=\"\$HOME/.local/bin:\$PATH\"" >> "$HOME/.bashrc"
source "$HOME/.bashrc"
helm repo add csi-driver-nfs https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
helm repo update
helm -n kube-system install \
csi-driver-nfs \
csi-driver-nfs/csi-driver-nfs \
--version v4.5.0 \
--set externalSnapshotter.enabled=true
export K8S_WORKER_NODE="x.x.x.x"
kubectl apply -f - << EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: ${K8S_WORKER_NODE}
  share: /srv
reclaimPolicy: Delete
volumeBindingMode: Immediate
EOF
kubectl annotate storageclass nfs-csi storageclass.kubernetes.io/is-default-class=true
kubectl apply -f - << EOF
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-nfs-snapclass
driver: nfs.csi.k8s.io
deletionPolicy: Delete
EOF
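Before deploying a workload, it’s worth a quick check that the driver pods are running and that our classes are in place (pod names are whatever the chart generates; the grep below is just a convenient filter):
# The CSI controller and per-node pods should all be Running
kubectl -n kube-system get pods | grep csi-nfs
# Our default StorageClass and the VolumeSnapshotClass
kubectl get storageclass nfs-csi
kubectl get volumesnapshotclass csi-nfs-snapclass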
MinIO is an Amazon S3-compatible object storage solution that can be deployed on Kubernetes.
We’ll not dive deep into MinIO, however - our focus is on the fact that it is stateful and therefore requests provisioned storage via a PVC named minio.
Run the commands below on master0:
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm -n minio install \
minio \
bitnami/minio \
--version 12.9.4 \
--create-namespace
Now wait for the pods in the minio namespace to become ready - this should take no longer than 5 minutes:
kubectl -n minio wait --for=condition=Ready pods --all --timeout=300s
Sample output:
pod/minio-76c7dcbb5-84qmb condition met
Notice that a PVC named minio has been created and its status should be Bound:
kubectl -n minio get pvc
Sample output:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
minio Bound pvc-44ac176e-f1d5-4fe1-ac58-2153f0db6198 8Gi RWO nfs-csi 2m56s
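If you’re curious where the data actually lives, csi-driver-nfs provisions each volume as a subdirectory of the export, so on worker0 you should see a directory named after the PV (the exact name will differ in your environment):
# Run on worker0 - expect a directory such as pvc-44ac176e-f1d5-4fe1-ac58-2153f0db6198
ls -l /srv/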
Creating a VolumeSnapshot from the minio PVC
While creating a snapshot on its own does not constitute a backup, it is nonetheless a fundamental operation implemented as part of a complete workflow in comprehensive Kubernetes-native backup and recovery solutions such as Velero and Kasten K10.
Let’s try to create a VolumeSnapshot for our minio PVC - run these commands on master0:
kubectl -n minio apply -f - << EOF
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-minio-snapshot
spec:
  volumeSnapshotClassName: csi-nfs-snapclass
  source:
    persistentVolumeClaimName: minio
EOF
Now wait a few seconds and observe that the snapshot has failed:
kubectl -n minio get volumesnapshot test-minio-snapshot -o jsonpath='{.status.error.message}'
Sample output:
Failed to check and update snapshot content: failed to take snapshot of the volume 10.1.0.162#srv#pvc-44ac176e-f1d5-4fe1-ac58-2153f0db6198##: "rpc error: code = Internal desc = failed to create archive for snapshot: exit status 2: ./\n./.minio.sys/\n./.minio.sys/multipart/\n./.minio.sys/format.json\n./.minio.sys/pool.bin/\n./.minio.sys/pool.bin/xl.meta\n./.minio.sys/tmp/\n./.minio.sys/tmp/9bd8ec0e-cf8a-4c86-8dcf-0bbe1d35de80\n./.minio.sys/tmp/.trash/\n./.minio.sys/config/\n./.minio.sys/config/iam/\n./.minio.sys/config/iam/format.json/\n./.minio.sys/config/iam/format.json/xl.meta\n./.minio.sys/config/config.json/\n./.minio.sys/config/config.json/xl.meta\n./.minio.sys/buckets/\n./.minio.sys/buckets/.bloomcycle.bin/\n./.minio.sys/buckets/.bloomcycle.bin/xl.meta\n./.minio.sys/buckets/.usage.json/\n./.minio.sys/buckets/.usage.json/xl.meta\ntar: ./.root_password: Cannot open: Permission denied\ntar: ./.root_user: Cannot open: Permission denied\ntar: Exiting with failure status due to previous errors\n"
In particular, notice the keywords Permission denied.
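The failure should also be visible in the object’s status and events if you prefer describe output:
kubectl -n minio describe volumesnapshot test-minio-snapshot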
Viewing the file and directory ownership under /bitnami/minio/data/ in the MinIO pod reveals the root cause of the issue:
export MINIO_POD="$(kubectl -n minio get pod -l app.kubernetes.io/name=minio --no-headers | awk '{ print $1 }')"
kubectl -n minio exec "${MINIO_POD}" -- ls -al /bitnami/minio/data/
Sample output:
total 20
drwxrwsr-x 3 nobody nogroup 4096 Nov 14 14:02 .
drwxr-xr-x 3 root root 4096 Nov 11 18:58 ..
drwxr-sr-x 7 1001 nogroup 4096 Nov 14 14:02 .minio.sys
-rw------- 1 1001 nogroup 11 Nov 14 14:02 .root_password
-rw------- 1 1001 nogroup 6 Nov 14 14:02 .root_user
By default, NFS shares have root squash enabled, which maps the privileged root user on NFS clients to the unprivileged nobody user on the NFS server, preventing remote clients from unexpectedly gaining root privileges on the NFS host - which would otherwise be a security concern. All other users are mapped to their own UID.
The MinIO pod in our Helm chart runs as a non-root user with UID 1001 as a security measure, so the files and directories created by this pod also have an owner UID of 1001. However, neither .root_user nor .root_password is group- or world-readable, and the CSI snapshot operation for NFS, implemented via tar, presumably runs as root, which maps to nobody on the NFS share. As a result, the tar command is unable to read these two files and the snapshot operation fails.
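To see root squash in action outside of Kubernetes, you can mount the share directly from another machine - a hypothetical client with the export from worker0 mounted at /mnt, x.x.x.x being worker0’s private IP and the pvc-xxxxxxxx path a placeholder for the actual PV directory:
# Mount the export (requires the NFS client utilities, e.g. nfs-common)
sudo mount -t nfs x.x.x.x:/srv /mnt
# A file created by root on the client arrives owned by nobody:nogroup on the server
sudo touch /mnt/root-test && ls -l /mnt/root-test
# Reading a 0600 file owned by UID 1001 fails even as root, because root is
# squashed to nobody on the server
sudo cat /mnt/pvc-xxxxxxxx/.root_password    # Permission denied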
There are at least two ways to work around this issue (both sketched below for illustration only), but each involves reducing the overall security of our infrastructure and workloads and is therefore unacceptable in a production environment:
- Enabling no_root_squash on the NFS export, which opens up the NFS share host to privilege escalation attacks
- Running MinIO as root by specifying the appropriate Helm chart values, which opens up the cluster to potential container escape and privilege escalation attacks
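Assuming the lab layout above (the /srv export on worker0 and the Bitnami MinIO chart), the workarounds would look roughly like this - the chart value names are assumptions and may differ between chart versions:
# Workaround 1 (NOT recommended): disable root squash on the export - run on worker0
cat << EOF | sudo tee /etc/exports
/srv *(rw,sync,no_root_squash)
EOF
sudo exportfs -ra
# Workaround 2 (NOT recommended): run MinIO as root via the chart's security
# context values (value names are assumptions - check your chart version)
helm -n minio upgrade minio bitnami/minio \
  --version 12.9.4 \
  --reuse-values \
  --set containerSecurityContext.runAsUser=0 \
  --set containerSecurityContext.runAsNonRoot=false \
  --set podSecurityContext.fsGroup=0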
Don’t use NFS as a storage backend for your production-grade on-premises Kubernetes cluster!