How Docker Handles Mount Namespace

In my previous attempt to mount a GlusterFS volume on a Docker host, I hacked mount(2) to reset the namespace that the mountpoint resided in. But as Eric suggested, looking at how Docker handles the mount namespace could lead to a more elegant solution to this issue.

I thus spent some effort looking at how a container’s mount namespace is set up. My crude notes on how Docker creates the container’s mount namespace are captured in the following (as usual, imperfect) diagram.

Life inside Docker

[Diagram: docker-ns]

Long story short, as the diagram illustrates, when a Docker client issues a run command, it posts the command to the Docker daemon. The daemon then invokes execdriver to initialize the container, and this initialization is executed by libcontainer. On Linux, libcontainer calls linuxStandardInit.Init, which in turn runs setupRootfs to create the initial mount namespace for the container.

Currently (as of Docker v1.6.0), a container’s initial mount namespace is either a slave or a private replica of the host’s, so new mounts made on the host or inside a container are not propagated to the other side. When a container mounts, for example, a GlusterFS volume, the newly mounted filesystem is visible only to that container; the host and other containers don’t see it at all. Conversely, when the host mounts a new filesystem, the mount is not propagated to containers that were created earlier.
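To see which propagation mode a mount actually has, the optional fields of /proc/self/mountinfo can be inspected: "shared:N" marks a shared mount, "master:N" marks a slave, and no such tag means private. The following is a minimal sketch of that lookup; the function name parse_mountinfo_line is made up for this illustration and is not part of Docker or libcontainer.

```python
def parse_mountinfo_line(line):
    """Return (mount_point, propagation, fstype) for one mountinfo line."""
    fields = line.split()
    # A lone "-" separates the optional fields from the filesystem type.
    sep = fields.index("-")
    mount_point = fields[4]
    # Optional fields start at index 6, e.g. "shared:5", "master:3", "unbindable".
    tags = {field.split(":")[0] for field in fields[6:sep]}
    modes = []
    if "shared" in tags:
        modes.append("shared")
    if "master" in tags:
        modes.append("slave")
    if "unbindable" in tags:
        modes.append("unbindable")
    # A mount without any propagation tag is private.
    propagation = ",".join(modes) or "private"
    return mount_point, propagation, fields[sep + 1]


if __name__ == "__main__":
    # Sample line in the format documented in proc(5), not taken from a real host.
    sample = "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue"
    print(parse_mountinfo_line(sample))
```

Feeding each line of /proc/self/mountinfo through this function, once on the host and once inside a container, shows how the two namespaces differ.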

Why is this a problem?

Because Docker hosts such as Red Hat Atomic Host or CoreOS ship with a minimal OS and package set, not all filesystem client packages are installed. It is highly likely that administrators who want to mount a GlusterFS filesystem on the Docker host will find no GlusterFS client package there, and thus mount(8) is doomed.
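One way to check for this up front: for filesystem types it does not handle itself, mount(8) hands off to a helper named mount.<type>, and the GlusterFS client package provides mount.glusterfs. The sketch below looks for such a helper. The function name find_mount_helper and the list of directories searched are assumptions made for illustration; the exact helper location depends on the distribution.

```python
import os


def find_mount_helper(fstype, search_dirs=("/sbin", "/usr/sbin", "/bin", "/usr/bin")):
    """Return the path of an executable mount.<fstype> helper, or None."""
    name = "mount." + fstype
    for directory in search_dirs:
        path = os.path.join(directory, name)
        # The helper must exist and be executable for mount(8) to run it.
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


if __name__ == "__main__":
    helper = find_mount_helper("glusterfs")
    print(helper or "no GlusterFS mount helper found on this host")
```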

This is where a container-based solution comes into the picture. A container that has the filesystem client package (let’s call it the utility container) is created first. Administrators then run the commands that live inside the utility container to mount the filesystem, and expect the mountpoint to be visible to both the Docker host and other containers (let’s call them application containers) on the host.

Hands-on Docker Hack

Now, a bit of hacking changes the container’s mount namespace from slave/private to shared, making filesystem mounts initiated in the container propagate to the host.

diff --git a/contrib/init/systemd/docker.service b/contrib/init/systemd/docker.service
index bc73448..ee1b43b 100644
--- a/contrib/init/systemd/docker.service
+++ b/contrib/init/systemd/docker.service
@@ -6,7 +6,7 @@ Requires=docker.socket
 [Service]
 ExecStart=/usr/bin/docker -d -H fd://
-MountFlags=slave
+MountFlags=shared
 LimitNOFILE=1048576
 LimitNPROC=1048576
 LimitCORE=infinity
diff --git a/vendor/src/github.com/docker/libcontainer/standard_init_linux.go b/vendor/src/github.com/docker/libcontainer/standard_init_linux.go
index 282832b..652f278 100644
--- a/vendor/src/github.com/docker/libcontainer/standard_init_linux.go
+++ b/vendor/src/github.com/docker/libcontainer/standard_init_linux.go
@@ -48,7 +48,7 @@ func (l *linuxStandardInit) Init() error {
        }
        label.Init()
        // InitializeMountNamespace() can be executed only for a new mount namespace
-       if l.config.Config.Namespaces.Contains(configs.NEWNS) {
+       if !l.config.Config.Namespaces.Contains(configs.NEWNS) {
                if err := setupRootfs(l.config.Config, console); err != nil {
                        return err
                }

This patch is against Docker v1.6.0. The basic idea is to tell libcontainer and systemd not to make the initial rootfs mount namespace a slave/private replica.
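For background, whether a mount is slave, private or shared comes down to a propagation flag passed to mount(2) on an existing mountpoint. The sketch below shows that call through ctypes; it is not libcontainer’s code and not what the patch itself does, and the names propagation_flags and set_propagation are invented for this example. The flag values are the ones defined in the Linux mount headers. Calling set_propagation("/", "shared") should have the same effect as running "mount --make-rshared /" and requires root.

```python
import ctypes
import ctypes.util
import os

# Mount flag values as defined in <sys/mount.h> on Linux.
MS_REC = 0x4000
MS_PRIVATE = 1 << 18
MS_SLAVE = 1 << 19
MS_SHARED = 1 << 20

PROPAGATION = {"private": MS_PRIVATE, "slave": MS_SLAVE, "shared": MS_SHARED}


def propagation_flags(mode, recursive=True):
    """Build the mount(2) flags for a propagation change."""
    flags = PROPAGATION[mode]
    if recursive:
        # Apply the change to every mount below the target as well.
        flags |= MS_REC
    return flags


def set_propagation(target, mode, recursive=True):
    """Change the propagation type of an existing mount (needs CAP_SYS_ADMIN)."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p)
    libc.mount.restype = ctypes.c_int
    # Source and filesystem type are ignored for propagation changes.
    ret = libc.mount(b"none", target.encode(), None,
                     propagation_flags(mode, recursive), None)
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), target)
```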

A quick proof of concept running this patch on a Docker host is shown below.


# run container
[root@docker-host ~]# docker run --privileged --net=host -i -v /home:/home -t gluster bash

# bad thing: no private mount namespace, container sees the host namespace
[root@docker-host rootfs]# pwd
/var/lib/docker/devicemapper/mnt/1f2b1bf5c0d3f75bf862b8172e8c725b7b1fcd5e132c1e72b825631ff2e5116e/rootfs

# good thing: able to mount a gluster fs in host's namespace
[root@docker-host rootfs]# mount -t glusterfs localhost:test_vol /home/con
[root@docker-host rootfs]# mount |grep gluster
localhost:test_vol on /home/con type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
[root@docker-host rootfs]# exit
exit


# exit container and check if host can see the mountpoint
[root@docker-host ~]# mount |grep gluster
localhost:test_vol on /home/con type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
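The grep check above can also be scripted. Below is a small sketch that parses mount(8) output in the "source on target type fstype (options)" form shown in the transcript and returns the mounts of a given type. The function name mounts_of_type is hypothetical, and the regular expression only assumes the output layout seen above.

```python
import re

# Matches lines such as:
#   localhost:test_vol on /home/con type fuse.glusterfs (rw,relatime,...)
MOUNT_LINE = re.compile(
    r"^(?P<source>\S+) on (?P<target>\S+) type (?P<fstype>\S+) \((?P<options>[^)]*)\)$"
)


def mounts_of_type(mount_output, fstype):
    """Return (source, target) pairs for every mount of the given type."""
    found = []
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if match and match.group("fstype") == fstype:
            found.append((match.group("source"), match.group("target")))
    return found
```

On the host, the input could come from subprocess.check_output(["mount"], text=True); a non-empty result for "fuse.glusterfs" after the container exits means the mount propagated.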
