Latest Status on Docker Mount Propagation Development

Some recent heated discussion on this topic prompted me to post an update on the latest development.

Docker has donated libcontainer to the Open Container Project, so my Docker pull request has moved to runc as well.

Here is the build and test walkthrough.


# get my code
git clone https://github.com/rootfs/docker
cd docker
git checkout rootfs_mount
make
# replace docker with my docker binary
ln -fs `pwd`/bundles/1.8.0-dev/binary/docker-1.8.0-dev /usr/bin/docker
# clean up docker images
mv /var/lib/docker /var/lib/docker.old
# !!!! edit /usr/lib/systemd/system/docker.service and change MountFlags=slave to MountFlags=shared (or use the systemd drop-in sketched after the restart step) !!!

# reload/restart docker service
systemctl stop docker
systemctl daemon-reload
systemctl restart docker
systemctl enable docker
systemctl status docker
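
If you prefer not to edit the unit file shipped by the package, a systemd drop-in achieves the same MountFlags change. This is only a sketch; the drop-in file name is arbitrary:

# alternative: override MountFlags with a drop-in instead of editing docker.service
mkdir -p /etc/systemd/system/docker.service.d
cat > /etc/systemd/system/docker.service.d/mount-propagation.conf <<EOF
[Service]
MountFlags=shared
EOF
systemctl daemon-reload
systemctl restart docker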

# Build a docker image that contains the nfs-utils rpm; below is my Dockerfile
# cat Dockerfile
# FROM centos
# RUN yum install -y nfs-utils

# build the image
docker build -t centos-nfs .

# run it in privileged mode with slave (non-shared) root propagation using --rootmount=slave, and mount an nfs filesystem

docker run -ti --net=host --rootmount=slave --privileged -v /test:/test centos-nfs mount -t nfs nfs-server:/home/git/dev /test/mnt -o nolock,vers=3
mount |grep /test/mnt

[Empty output: the NFS mount did not propagate back to the host]

# run it in privileged mode with shared root propagation using --rootmount=shared, and mount an nfs filesystem
docker run -ti --net=host --rootmount=shared --privileged -v /test:/test centos-nfs mount -t nfs nfs-server:/home/git/dev /test/mnt -o nolock,vers=3
mount |grep /test/mnt

nfs-server:/home/git/dev on /test/mnt type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.16.154.75,mountvers=3,mountport=20048,mountproto=udp,local_lock=all,addr=10.16.154.75)
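
To double-check how a mount is propagated on the host, findmnt can print the propagation flag directly (the target paths below are just the ones used in this test):

# show the propagation mode of the root and test mounts on the host
findmnt -o TARGET,PROPAGATION /
findmnt -o TARGET,PROPAGATION /test/mnt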

Ceph RBD Import/Export and More

The Ceph rbd tool can import a block image into a pool and export an rbd image back out to a block image.

Create an ext4 block image


# dd if=/dev/zero of=/tmp/disk.img bs=1M count=1

# mkfs.ext4 /tmp/disk.img

Import the block image into the rbd pool


# rbd import /tmp/disk.img disk --pool kube

# rbd ls --pool kube

disk

foo

Map the rbd image to /dev/rbdX


# rbd map disk --pool kube

# rbd showmapped
id pool image snap device
0 kube foo - /dev/rbd0
1 kube disk - /dev/rbd1

Mount /dev/rbdX and create a test file


# mount /dev/rbd1 /tmp/mount

# echo "test" > /tmp/mount/first
# ls /tmp/mount/
first lost+found

Export the rbd image into a block image


# rbd export disk /tmp/export.img --pool kube
Exporting image: 100% complete...done.
# file /tmp/export.img
/tmp/export.img: Linux rev 1.0 ext2 filesystem data (mounted or unclean), UUID=3b1f22b1-48d3-4bdf-819d-d62fd7063321 (extents) (huge files)
# mount -o loop /tmp/export.img /tmp/mount
# ls /tmp/mount/
first lost+found

There are more powerful use cases for import and export. One can simulate ZFS send/recv, a feature used by flocker, by importing and exporting image diffs; a rough sketch follows below. And the recent Ceph Giant release also supports parallelized import and export.
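
For the send/recv-style workflow, rbd has export-diff and import-diff, which ship only the changes between two snapshots. The snapshot names and the target image below are made up for illustration, and disk-copy is assumed to be an existing copy of the image that already carries the snap1 snapshot:

# rbd snap create kube/disk@snap1
(write some more data to the image)
# rbd snap create kube/disk@snap2
# rbd export-diff --from-snap snap1 kube/disk@snap2 /tmp/disk.diff
# rbd import-diff /tmp/disk.diff kube/disk-copy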

OpenStack Dashboard in JavaScript

OpenStack Horizon provides a web-based dashboard. It is implemented in Python, like the rest of the OpenStack projects.

horizon-js offers a JavaScript alternative, but it doesn't appear to be actively developed or maintained. So I forked it into my own repo and plan to actively fix and improve it.

Some of the early deliverables from the past weekend include:

  • Able to run against your own OpenStack, by making the Keystone URL an input.
  • Fixed many jstack bugs.
  • Fixed Model-View inconsistency.
  • Doc, doc, doc.

And you can find some screenshots.

Why JavaScript? The answer isn't hard to find ….

Tenant Name or Tenant Id in OpenStack Keystone

OpenStack Keystone is the first stop on the way to accessing the other services (Nova, Cinder, Glance, Neutron, etc.), so it is critical to understand the Keystone API well.

Applications, such as the Vagrant OpenStack provider, need to retrieve service endpoints from the Keystone service catalog so they can reach those services and, for example, create compute instances. Yet there appears to be no consistency on which tenant form to use, which causes confusion for application developers.

The service catalog is per tenant. A REST request to Keystone must contain the necessary tenant information to get the service catalog. The tenant can be specified either by name (string) or by Id (UUID), as described in the API doc. Names are the more convenient form.

In this example, a Keystone authentication request doesn’t have any tenant information:


curl -i -H "Content-Type: application/json" -d '{"auth":{"passwordCredentials":{"username":"user","password":"password"}}}' http://keystone:5000/v2.0/tokens

And as expected, no service catalog is returned:


{"access": {"token": {"issued_at": "2015-07-10T18:26:07.389768", "expires": "2015-07-10T19:26:07Z", "id": "....."}, "serviceCatalog": [], "user": {"username": "user", "roles_links": [], "id": "...", "roles": [], "name": "user"}, "metadata": {"is_admin": 0, "roles": []}}}

Then providing a tenantName in the request:


curl  -i -H "Content-Type: application/json" -d '{"auth":{"passwordCredentials":{"username":"user","password":"password"}, "tenantName":"Some tenant name"}}' http://keystone:5000/v2.0/tokens


You can then find the service catalog and endpoint information in the response.
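
The tenantId form works the same way; the UUID below is only a placeholder:

curl -i -H "Content-Type: application/json" -d '{"auth":{"passwordCredentials":{"username":"user","password":"password"}, "tenantId":"0123456789abcdef0123456789abcdef"}}' http://keystone:5000/v2.0/tokens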

RBD Volume Mount Fencing in Kubernetes

Kubernetes can use RBD images as persistent block storage for Linux containers. However, only one container can safely mount an RBD volume in read-write mode. If multiple containers write to the same RBD volume without higher-level coordination, data corruption will likely occur, as reported in a recent case.

One intuitive solution is for the persistent block storage provider to be restrictive about client mounts. For instance, Google Compute Engine's Persistent Disk allows only one read-write attachment at a time.

Another approach is fencing. An RBD image writer must hold an exclusive lock on the image while it is mounted. If the writer fails to acquire the lock, it is safe to assume the image is being used by someone else, and the writer shouldn't attempt to mount the RBD volume. As a result, only one writer can use the image at a time and data corruption is avoided.
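
At the rbd command line, the fencing idea looks roughly like the following; the lock id is made up here, and the pull request does the equivalent from inside the kubelet's RBD volume plugin. First, try to take an exclusive advisory lock before mounting:

# rbd lock add foo my-lock-id --pool kube

If the lock is already held by another writer, the command fails and the mount is refused. The current holder shows up in the lock list:

# rbd lock list foo --pool kube

After unmounting, release the lock; the locker (e.g. client.4494) comes from the lock list output:

# rbd lock remove foo my-lock-id client.4494 --pool kube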

This is what the RBD volume mount fencing pull request does for Kubernetes. I tried the following test and found it fixes the mount racing problem.

I have two Fedora 21 hosts. Each loads my fix and runs as a local cluster:


# ./hack/local-up-cluster.sh

# start the rest of the kubectl routines

Then each local cluster creates a Pod using an RBD volume:


# ./cluster/kubectl.sh create -f examples/rbd/rbd.json

Watch the RBD image lock:


# rbd lock list foo --pool kube
There is 1 exclusive lock on this image.
Locker       ID                       Address
client.4494  kubelet_lock_magic_host  10.16.154.78:0/1026846


On both clusters, get the Pod status. One cluster has a running Pod, while the other cluster's Pod stays pending.

Running Pod:


# ./cluster/kubectl.sh get pod
NAME      READY     REASON    RESTARTS   AGE
rbd       1/1       Running   0          5m


The other Pod:


# ./cluster/kubectl.sh get pod
NAME      READY     REASON                                                     RESTARTS   AGE
rbd3      0/1       Image: kubernetes/pause is ready, container is creating    0          4m

Then I delete the running Pod, and the second one immediately becomes Running.

So with this fix, Pods do get fenced off.