Kubernetes Storage Options

Below is a summary of what is already in, what is queued, and what is still missing.

TYPE              FORMAT  DURATION    PROVIDER                NOTE
EmptyDir          File    Ephemeral   Local host              Available
HostDir           File    Persistent  Local host              Available
GitRepo           File    Persistent  Git repository          Available
GCE PD            Block   Persistent  GCE                     Available
AWS EBS           Block   Persistent  AWS                     Available
NFS               File    Persistent  NFS Server              Available
iSCSI             Block   Persistent  iSCSI target provider   Available
Glusterfs         File    Persistent  Glusterfs Servers       Available
Ceph RBD          Block   Persistent  Ceph Cluster            Available
Ceph FS           File    Persistent  Ceph Cluster            Pull Request
OpenStack Cinder  Block   Persistent  Cinder                  Pull Request
CIFS              File    Persistent  CIFS Server             Missing
LVM               Block   Persistent  Local                   Missing

Thoughts on Intel’s Clear Linux Containers

LWN’s recent post gained enormous interest. I like many of the technologies in this project, but I am still scratching my head over some (missing) details, even after peeking into Intel’s rkt patches in the SRPM.

I understand Intel’s position of bringing fast (reduced KVM overhead) and secure (using isolation) container technologies into rkt and Docker, but I don’t see any words on flexibility. With Docker/rkt, I can run a service or process just like I run a Unix shell command. With KVM, I have to start a VM, ssh into it, and then execute the command; there are more moving parts involved.

Intel used two performance metrics: startup time and memory usage. But from my prior (though likely obsolete) experience, the runtime overhead is not negligible either. For instance, a process running in KVM has its virtual memory remapped, which penalizes runtime performance. This overhead might be less significant with VT-x. A more comprehensive (though not up-to-date) KVM-vs-Docker performance study conducted by IBM still confirms my bias.

Now Available: Ceph RBD as Persistent Storage for Kubernetes

You can now use Ceph RBD as persistent storage for containers in Kubernetes. Examples and instructions can be found here.

Now you have the following storage options for Kubernetes.

Type       Format  Duration    Provider
EmptyDir   File    Ephemeral   Local host
HostDir    File    Persistent  Local host
GitRepo    File    Persistent  Git repository
GCE PD     Block   Persistent  GCE
AWS EBS    Block   Persistent  AWS
NFS        File    Persistent  NFS Server
iSCSI      Block   Persistent  iSCSI target provider
Glusterfs  File    Persistent  Glusterfs Servers
Ceph RBD   Block   Persistent  Ceph Cluster

How Docker Handles Mount Namespace

In my previous attempt to mount a Glusterfs volume on a Docker host, I was hacking around mount(2) to reset the namespace that the mountpoint resided in. But as Eric suggested, looking at how Docker handles the mount namespace could lead to a more elegant solution to this issue.

I thus spent some effort looking at how a container’s mount namespace is constructed. My crude notes on how Docker creates the container’s mount namespace are captured in the following (as usual, imperfect) diagram.

[Diagram: Life inside Docker (docker-ns)]

Long story short, as the diagram illustrates, when a Docker client issues a run command, it posts the command to the Docker daemon. The daemon then invokes execdriver to initialize the container, and the initialization is carried out by libcontainer. On Linux, libcontainer calls out to linuxStandardInit, which in turn runs setupRootfs to create the initial mount namespace for the container.
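Roughly, and only as a sketch rather than Docker's actual code, that rootfs setup boils down to unsharing a mount namespace and pivoting into a prepared root tree. The /tmp/rootfs path below is a made-up placeholder; the program must run as root on Linux.

// A minimal sketch, assuming a prepared root tree at the hypothetical path
// /tmp/rootfs, of what setting up a container's initial mount namespace
// roughly involves. Illustration only, not Docker's setupRootfs.
package main

import (
	"log"
	"os"
	"runtime"
	"syscall"
)

func main() {
	rootfs := "/tmp/rootfs" // hypothetical path to an already-populated rootfs

	// Mount namespaces are per-thread; keep this goroutine on one OS thread.
	runtime.LockOSThread()

	// Give this process its own mount namespace.
	if err := syscall.Unshare(syscall.CLONE_NEWNS); err != nil {
		log.Fatal(err)
	}
	// Make the whole tree private so nothing done here leaks back to the host.
	if err := syscall.Mount("", "/", "", syscall.MS_PRIVATE|syscall.MS_REC, ""); err != nil {
		log.Fatal(err)
	}
	// pivot_root needs the new root to be a mount point: bind it onto itself.
	if err := syscall.Mount(rootfs, rootfs, "", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		log.Fatal(err)
	}
	// Swap the old root out and detach it, leaving only the container rootfs.
	oldroot := rootfs + "/.old"
	if err := os.MkdirAll(oldroot, 0700); err != nil {
		log.Fatal(err)
	}
	if err := syscall.PivotRoot(rootfs, oldroot); err != nil {
		log.Fatal(err)
	}
	if err := syscall.Chdir("/"); err != nil {
		log.Fatal(err)
	}
	if err := syscall.Unmount("/.old", syscall.MNT_DETACH); err != nil {
		log.Fatal(err)
	}
	log.Println("now inside the new rootfs")
}

The real setupRootfs of course does much more (mounting /proc, /sys, device nodes, applying labels), but the namespace and pivot_root steps are the part relevant to the rest of this note.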

Currently (as of Docker v1.6.0), a container’s initial mount namespace is either a slave or a private replica of the host’s, so new mounts made on the host or inside the container are not propagated to the other side (with a slave namespace, host-side mounts can still propagate into the container, but container-side mounts never propagate back to the host). When a container mounts, for example, a Glusterfs volume, the newly mounted filesystem is only visible to that container; the host and other containers don’t see it at all. Vice versa, when the host mounts a new filesystem, the mount is not propagated to containers that were created earlier.
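For readers who want to map the slave/private/shared terms to actual system calls, here is a minimal sketch (my own illustration, not Docker source) of how a namespace ends up with a given propagation type; it needs root and is Linux-only.

// Sketch of mount propagation types: after unsharing a mount namespace,
// the propagation type is set with mount(2) using an empty source and a
// propagation flag.
package main

import (
	"log"
	"runtime"
	"syscall"
)

func main() {
	// Mount namespaces are per-thread, so pin this goroutine to one OS thread.
	runtime.LockOSThread()

	// Create a new mount namespace; it starts out as a copy of the host's.
	if err := syscall.Unshare(syscall.CLONE_NEWNS); err != nil {
		log.Fatal(err)
	}

	// MS_SLAVE|MS_REC: host mounts still propagate into this namespace, but
	// mounts made here never propagate back out.
	// MS_PRIVATE|MS_REC would cut propagation in both directions;
	// MS_SHARED|MS_REC (what the patch below aims for) propagates both ways,
	// so a glusterfs mounted here would also show up on the host.
	if err := syscall.Mount("", "/", "", syscall.MS_SLAVE|syscall.MS_REC, ""); err != nil {
		log.Fatal(err)
	}

	log.Println("running in a slave mount namespace")
}

The MountFlags= directive in the systemd unit patched below is essentially asking systemd to perform this kind of propagation change on the daemon’s mount namespace.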

Why is this a problem?

Because Docker hosts like Red Hat Atomic Host or CoreOS ship with a minimal OS and package set, not all filesystem client packages are installed. It is highly likely that, if administrators want to mount a Glusterfs filesystem on the Docker host, they will find no Glusterfs client package on the host, and mount(8) is thus doomed to fail.

This is where a container-based solution comes into the picture. A container that has the filesystem client package (let’s call it the utility container) is created first; administrators then run the mount commands that live inside the utility container, expecting the resulting mountpoint to be visible to both the Docker host and the other containers (let’s call them application containers) on that host.

Hands on Docker Hack

Now, a bit of hacking will change the container’s mount namespace from slave/private to shared, making container-initiated filesystem mount events propagate to the host.

diff --git a/contrib/init/systemd/docker.service b/contrib/init/systemd/docker.service
index bc73448..ee1b43b 100644
--- a/contrib/init/systemd/docker.service
+++ b/contrib/init/systemd/docker.service
@@ -6,7 +6,7 @@ Requires=docker.socket
 [Service]
 ExecStart=/usr/bin/docker -d -H fd://
-MountFlags=slave
+MountFlags=shared
 LimitNOFILE=1048576
 LimitNPROC=1048576
 LimitCORE=infinity
diff --git a/vendor/src/github.com/docker/libcontainer/standard_init_linux.go b/vendor/src/github.com/docker/libcontainer/standard_init_linux.go
index 282832b..652f278 100644
--- a/vendor/src/github.com/docker/libcontainer/standard_init_linux.go
+++ b/vendor/src/github.com/docker/libcontainer/standard_init_linux.go
@@ -48,7 +48,7 @@ func (l *linuxStandardInit) Init() error {
        }
        label.Init()
        // InitializeMountNamespace() can be executed only for a new mount namespace
-       if l.config.Config.Namespaces.Contains(configs.NEWNS) {
+       if !l.config.Config.Namespaces.Contains(configs.NEWNS) {
                if err := setupRootfs(l.config.Config, console); err != nil {
                        return err
                }

This patch is against Docker v1.6.0. The basic idea is to tell libcontainer and systemd not to make the initial rootfs mount namespace a slave/private replica.
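To check whether the propagation change actually took effect after restarting the daemon, the optional fields in /proc/self/mountinfo ("shared:N", "master:N", or nothing) can be inspected. The helper below is my own small sketch, not part of the patch; it assumes a Linux host with /proc mounted.

// Report whether the root mount of the calling process's namespace is
// shared, slave, or private, based on /proc/self/mountinfo.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Field 5 (index 4) is the mount point; optional fields follow
		// field 6 until the "-" separator.
		if len(fields) < 7 || fields[4] != "/" {
			continue
		}
		prop := "private"
		for _, fld := range fields[6:] {
			if fld == "-" {
				break
			}
			if strings.HasPrefix(fld, "shared:") {
				prop = "shared"
			} else if strings.HasPrefix(fld, "master:") {
				prop = "slave"
			}
		}
		fmt.Printf("/ is %s\n", prop)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}

Running it on the host and inside a container gives a quick read on which side of the propagation boundary you are on.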

A quick POC with this patch applied on a Docker host is illustrated in the following.


# run container
[root@docker-host ~]# docker run --privileged --net=host -i -v /home:/home -t gluster bash

# bad thing: no private mount namespace, container sees the host namespace
[root@docker-host rootfs]# pwd
/var/lib/docker/devicemapper/mnt/1f2b1bf5c0d3f75bf862b8172e8c725b7b1fcd5e132c1e72b825631ff2e5116e/rootfs

# good thing: able to mount a gluster fs in host's namespace
[root@docker-host rootfs]# mount -t glusterfs localhost:test_vol /home/con
[root@docker-host rootfs]# mount |grep gluster
localhost:test_vol on /home/con type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
[root@docker-host rootfs]# exit
exit


# exit container and check if host can see the mountpoint
[root@docker-host ~]# mount |grep gluster
localhost:test_vol on /home/con type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)