Jun 27, 2019
In one sense it hardly seems any time at all since Docker 1.0 was released in the summer of 2014, but in another sense it feels like a lifetime. In the five years that have ensued, the cloud native paradigm has exploded into a full-scale industry, with businesses large and small betting their existence on the purported benefits that cloud native brings.
The seed that bore this fruit was Docker’s popularisation of the container abstraction, which gave developers a mechanism for defining their applications as self-sufficient, immutable packages - the container image. Yet, despite the furious pace of development in the cloud native technology landscape, the way that we define container images, and the way that we build them into usable artifacts, has changed surprisingly little since that summer of 2014. As container technology has evolved and matured, deficiencies and limitations in the process of container image building have slowly been exposed, which have led to some frustration in the cloud native community.
During the last year or two, different projects have popped up to attempt to deal with some or all of these deficiencies. In this series of articles called ‘The State of the Art in Container Image Building’, we’ll focus on some of these projects and tools, to see how they circumvent the perceived problems.
Before we do that, however, we first have to understand the basis of container image building, and the deficiencies we’re trying to overcome.
Historically, container images have been defined in Dockerfiles, using a declarative instruction set for generating filesystem content, and for recording metadata that informs a container runtime engine how to run a derived container. Container images are then built using an API endpoint of the Docker Engine, which sequentially executes each Dockerfile instruction to create content, or to record metadata related to the image being built. The build endpoint is invariably exercised via the Docker client's docker build command.
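As a concrete illustration, a minimal Dockerfile for a hypothetical pre-built static binary might look like this (the image name and paths are ours, not taken from any real project):

```dockerfile
# Each instruction either generates filesystem content or records metadata.
FROM alpine:3.9

# COPY generates content: it adds a file from the build context to the image.
COPY app /usr/local/bin/app

# EXPOSE and CMD record metadata telling a runtime how to run a container.
EXPOSE 8080
CMD ["/usr/local/bin/app"]
```

The image would then be built with something like `docker build -t myorg/app .`, which ships the build context and Dockerfile off to the daemon's build endpoint.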
Authoring container images, then, requires us to define the image by using an appropriate set of Dockerfile instructions with their associated arguments, and then to build the image using the Docker Engine API. Simple, what could possibly be wrong with that?
One issue frequently expressed is that building container images using the Docker Engine API requires the use of the Docker daemon. The Docker daemon encompasses a lot of functionality, most of which doesn't relate to the task of building container images at all. Additionally, many of the other functions it performs require it to run with root privileges, and as we'll see later, this presents a security concern. If the task at hand is to craft container images, then deploying the Docker daemon just for this purpose is unwieldy, inefficient, and not optimal for CI/CD pipelines. The argument goes that the container image building experience should be daemonless.
The Docker Engine has had a caching function from very early on. If executing the Dockerfile instructions from one build to the next results in the same commands being run, or identical content being added to the image, then cached content is used instead of being recreated. This greatly speeds up image builds, and makes the process a whole lot more efficient.
Unfortunately, due to the sequential nature of the build API, as soon as the cache is invalidated due to a change in the content or Dockerfile instruction, every subsequent instruction in the Dockerfile is executed again. This happens irrespective of whether those subsequent instructions have changed or not. This means that great care needs to be taken with regard to the sequence of instructions, so as to maximize build cache effectiveness.
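A common illustration of such cache-aware ordering, sketched here for a hypothetical Node.js application: the dependency manifests are copied and installed before the application source, so that editing source files doesn't invalidate the expensive dependency-installation layer.

```dockerfile
FROM node:10
WORKDIR /app

# Copy only the dependency manifests first. As long as these two files
# are unchanged, the npm ci layer below is served from the build cache.
COPY package.json package-lock.json ./
RUN npm ci

# Source edits invalidate the cache only from this instruction onward.
COPY . .
RUN npm run build
```

Reversing the order — copying all the source before installing dependencies — would force a full dependency install on every source change.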
As we’ve said, inefficient build cache use is a side-effect of the sequential execution of Dockerfile instructions in the build process. Provided instruction ordering is carefully considered, the sequential execution of instructions is not problematic. However, when multi-stage builds were introduced in the API, the sequential nature of build execution inhibited a major potential benefit: parallel execution of instructions that don't depend on one another. This has a direct bearing on how long a build takes, which in some circumstances could be significantly reduced. When iterating over image builds during application development, the time lost to serial execution adds up considerably.
More often than not, the process of building a container image relies on, or could benefit from, content being made available on a temporary basis. In other words, we want the content during the build, but we don’t want it in the end product, as it will increase the size of the image unnecessarily. For example, we need source code to build a binary, but don’t need the source code in the final image. We could make use of a compiler build cache for faster builds, but we wouldn’t want the cache in the image itself.
Multi-stage builds can help us out with this problem, but a more elegant solution would be to temporarily mount the content at the point that it’s required in the build process. Despite numerous requests to add a variation of this feature, it has never made it into the Dockerfile instruction syntax, or as a command line argument to the docker build command.
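The multi-stage workaround mentioned above can be sketched as follows for a hypothetical Go program; the source code and toolchain exist only in the first stage, and are discarded when the final image is assembled:

```dockerfile
# Build stage: contains the Go toolchain and the source code.
FROM golang:1.12 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/app .

# Final stage: only the compiled binary is copied across; the source
# and compiler never appear in the finished image's layers.
FROM alpine:3.9
COPY --from=build /out/app /usr/local/bin/app
CMD ["/usr/local/bin/app"]
```

Note, though, that this discards the content entirely between builds — unlike a true build-time mount, it gives us no way to persist something like a compiler cache from one build to the next.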
Sometimes we need to use secrets in order to access content that requires a client to authenticate and be authorized. For example, we might need to provide an SSH key to clone a repository from GitHub or similar, so that we can make use of the content in the build. It would be possible to copy secrets from the host into the image during a build, but that poses the risk of the secret remaining embedded in the image, for subsequent scrutiny by anyone with access to the image. We might even be tempted to use environment variables, but this method is also vulnerable to exploitation.
There have been proposals to add a build feature to handle secrets required for image building, but none of the proposals made it through to a merge into the API. A promise of a best-practice guide has been the best offering for image builders to work with. This presents a big challenge to container image authors, and requires using some novel techniques as a workaround.
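To see why the naive approach is dangerous, consider this sketch (the key and repository names are hypothetical):

```dockerfile
FROM ubuntu:18.04
RUN apt-get update && apt-get install -y git openssh-client

# ANTI-PATTERN: COPY commits the key into its own image layer.
COPY id_rsa /root/.ssh/id_rsa
RUN git clone git@github.com:example/private-repo.git

# Deleting the key in a later instruction does NOT remove it from the
# earlier layer; anyone with the image can still extract it.
RUN rm /root/.ssh/id_rsa
```

One of the workaround techniques alluded to is to fetch, use, and delete the secret within a single RUN instruction, so that it never lands in a committed layer — at the cost of awkwardly chaining commands together.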
The Docker daemon requires clients to either have root access, or membership of the docker group, in order to interact with the API and its build endpoint. This is often seen as undesirable, or is even prohibited in some organizations that have strict policies on having access to privileged accounts. In practical terms, this makes the task of container image building very difficult.
This extends further when we want to build images within Kubernetes clusters that use Docker as the container runtime - maybe as part of a CI/CD workflow. To make use of the Docker Engine API running on a Kubernetes node, we’d have to mount the daemon’s socket into a ‘client’ container to make it accessible for builds. This is dangerous from a security perspective, because it effectively gives the container root access on the node, and completely bypasses the Kubernetes abstraction. Another solution might be to avoid using the Docker daemon on the node, and use Docker-in-Docker (DinD) to provide a self-contained image building environment. But this too has security implications, as the container running DinD is required to run as a privileged container.
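The socket-mounting approach looks innocuous on the command line, but hands the container full control of the host's daemon (the image tag here is illustrative):

```shell
# Mount the node's Docker socket into a 'client' container for builds.
# Any process in this container can now ask the host daemon to run
# arbitrary privileged containers - effectively root on the node.
docker run -v /var/run/docker.sock:/var/run/docker.sock \
    docker:18.09 docker build -t myorg/app .
```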
Getting to ‘rootless’ builds has been a quest in the community for some time now.
Despite these limitations, using Docker’s build API endpoint and Dockerfile syntax remains the predominant method for building container images. The many thousands of images that are hosted in public image registries such as the Docker Hub and Quay, along with their associated Dockerfiles, are a testament to this. But the limitations have also prompted new techniques and tools to emerge, which aim to address some of these problems, and to enhance the container image building experience. These tools are developed in the open with community support, and often overlap in terms of functionality. In order to get a better understanding of the state of the art in container image building, the next articles in this series will explore what’s on offer.
Giant Swarm’s managed microservices infrastructure enables enterprises to run agile, resilient, distributed systems at scale, while removing the tasks related to managing the complex underlying infrastructure.