Hung's Notebook

An Introduction to Docker

What's this?

This is an introduction to Docker, inspired by the first lesson of Data Engineering Zoomcamp 2024.

Docker is a set of platform as a service products that use OS-level virtualization to deliver software in packages called containers.

This year, Docker is the top-used "other tool" among all respondents (53%), rising from its second-place spot last year.

At the most basic level, the Docker platform consists of two parts:

  1. Dockerfile to define a programming pipeline, from creating an environment to executing a program inside that environment.
  2. Docker Engine to run the pipeline, guaranteeing that the output will be the same on any machine given the same inputs and program files.

This study note is about how to write a Dockerfile and run Docker Engine. It assumes that Docker Engine is already installed, a non-trivial task despite good documentation from Docker. The convenient GitHub Codespaces, with conda and Docker pre-installed, can be used instead.
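
A quick way to check that the installation works is the standard Docker smoke test:

docker --version
# pulls a tiny test image; prints a greeting message if the Engine works
docker run hello-world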

A Brief History of Docker

(Feel free to skip)

Docker was born out of the need for reproducibility and portability in consumer software. Software Products1 depend on Platform layers, which vary between user machines. As such, a program you wrote in COBOL on your TempleOS laptop might not work on my Ubuntu laptop the way it should.

To address this problem economically, developers use virtual machines, i.e., computer systems that run on another computer system. The first solution was adapted from hardware virtualization2, where a portion of the computer hardware is isolated and becomes a different machine altogether. This works, but it's slow.

To increase speed, OS-level virtualization is used instead. While a virtual machine uses a separate kernel, a container uses the same kernel as the host machine. This means less isolation, i.e., a container can now interact with the host machine's settings and files, which would raise eyebrows from cybersecurity folks. However, in a classic tale of "it's a feature, not a bug", Docker turns this into an advantage with features such as volumes, where data can be persisted on the host machine and shared across containers3.

How to use Docker

Docker is useful for:

  1. Running an application that would be cumbersome to install and uninstall. Examples for data engineers include a SQL database (the lesson uses the PostgreSQL image instead of installing it) and Spark (you need to install Java and Python and the code itself 😭).
  2. Packaging your pipeline for easier running and scheduling later.

For 2, it's essential that your code works first before jumping into using Docker. This could be seen in the lesson: Alexey made sure his ingestion Python script worked inside the development environment first before writing the Dockerfile for it.

A Dockerfile is a file containing line-by-line instructions for bundling everything needed to run an app - code, runtime, system tools, system libraries and settings - into a lightweight and standalone package called a Docker image. A Docker image is a read-only template. A runnable instance of an image is called a Docker container.
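
The distinction shows up directly in the CLI:

# list images (the read-only templates) available locally
docker images
# list containers (instances of those images); -a includes stopped ones
docker ps -a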

Firstly, let's go through the steps and syntax of writing a Dockerfile.

For every line, the syntax is INSTRUCTION args. The common instructions we will see are:

  1. ARG: Define variables that users can pass when running docker build. Similar to using argparse for a Python CLI program (see the sketch after this list).
  2. FROM: Reuse a Docker image as the base.
  3. RUN: Run a command in the build shell and create a new layer for the image i.e., results will be committed to the image, unlike CMD.
  5. VOLUME: Declare a mount point, i.e., a folder in the container that is read from and written to outside the container's own filesystem. Files in this folder persist on the host machine and can be shared across containers.
  5. WORKDIR: Specify the working folder for subsequent RUN, CMD, ENTRYPOINT, COPY and ADD commands.
  6. COPY: Copy local files and add them to the filesystem of the container (possibly also rename the file).
  7. ADD: Copy files (can be remote, such as a Git repo) and add them to the filesystem of the container (possibly also rename the file).
  8. CMD: Set the default command to be executed when running a container from an image; typically the last instruction in the file, and can be overridden at docker run.
  9. ENTRYPOINT: Configure a container that will run as an executable; also typically the last instruction in the file.
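
For instance, ARG can parameterize the base image version. A minimal sketch (the names here are illustrative, not from the lesson):

# choose the Python version at build time; 3.9.1 is the default
ARG PY_VERSION=3.9.1
FROM python:${PY_VERSION}

# override it with: docker build --build-arg PY_VERSION=3.10 -t demo .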

For options and finer differences between similar instructions (e.g., CMD vs. ENTRYPOINT), see the Documentation.
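
The gist of that particular difference: extra arguments given to docker run are appended after an ENTRYPOINT but replace a CMD. A minimal sketch, assuming an image built with the hypothetical name demo:

FROM python:3.9.1
ENTRYPOINT [ "echo", "hello" ]
# docker run demo world   ->  executes: echo hello world

# if the line were CMD [ "echo", "hello" ] instead:
# docker run demo world   ->  executes: world (the whole CMD is replaced)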

Secondly, let's go through the CLI commands docker build and docker run, which are used to build the image and run the containers. They are the main commands used in the lesson.

I have also prepared a full cheatsheet for Docker CLI.

docker build is the CLI command to trigger the build of a Docker image from the Dockerfile. The build can use a Dockerfile, a Git repository, or a tarball, locally or remotely. The most common arguments passed to it are -t | --tag to name the image and -f | --file to specify the file (for cases of unconventional naming, or many Dockerfiles in the same folder, since Dockerfile is used by default).
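
For example, with a hypothetical layout where the Dockerfile lives under a non-default name in a docker/ subfolder:

# build an image named my_app with tag dev, using the current folder (.) as the build context
docker build -t my_app:dev -f docker/Dockerfile.dev .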

docker run is the CLI command to create a container from a Docker image and start it.

Here's an example of using them from the lesson. First, Alexey tested the Python script ingest_data.py to make sure that it worked. Then he wrote the Dockerfile below for it.

# 1. Pull base image of Python 3.9.1 on Debian OS
FROM python:3.9.1

# 2. Run commands to install required packages
RUN apt-get update && apt-get install -y wget
RUN pip install pandas sqlalchemy psycopg2 --no-cache-dir

# 3. Specify the working directory and copy the file there
WORKDIR /app
COPY ingest_data.py ingest_data.py 

# 4. Run the Python file to ingest the data.
ENTRYPOINT [ "python", "ingest_data.py" ]

To create the image, run

docker build -t taxi_ingest:v001 .

It's the only Dockerfile in the working folder, so --file doesn't need specifying. After the image is built, it is identified as the image with tag/name taxi_ingest:v001. To run it with docker run:

# URL is a shell variable, expanded by the shell before docker run executes
URL="https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz"

docker run -it \
  --network=pg-network \
  taxi_ingest:v001 \
    --user=root \
    --password=root \
    --host=pg-database \
    --port=5432 \
    --db=ny_taxi \
    --table_name=yellow_taxi_trips \
    --url=${URL}

Here we see two new arguments: --network and -it. By default, each container is isolated from other containers on the same host machine. To connect them, we tell Docker to create a network; all containers placed in that network can talk to each other like computers on the same local network. The arguments after the image tag are passed to the Python script itself (see here).
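
The network itself is created once beforehand; the name pg-network matches the one referenced above:

docker network create pg-network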

The ENTRYPOINT instruction allows extra arguments to be appended to the command itself. So the actual command run when the container starts up is

python ingest_data.py \
  --user=root \
  --password=root \
  --host=pg-database \
  --port=5432 \
  --db=ny_taxi \
  --table_name=yellow_taxi_trips \
  --url=${URL}

which is the same as running the script inside a Python environment.

-it is actually two arguments: -i to run the Docker container in interactive mode, i.e., keeping access to the container's stdin, and -t to allocate a pseudo-TTY, a terminal-like environment for you to run commands in. One example is accessing the container shell with docker exec -it <container_name> bash.
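
For example, with the PostgreSQL container below (named pg-database via --name):

# open an interactive bash session inside the running container
docker exec -it pg-database bash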

Other examples of docker run for the PostgreSQL and pgAdmin images:

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v c:/Users/alexe/git/data-engineering-zoomcamp/week_1_basics_n_setup/2_docker_sql/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  --network=pg-network \
  --name pg-database \
  postgres:13

Here the argument -e is used to set environment variables, and -v to mount a volume, mapping a host path to a container path. They could be baked into a Dockerfile via the ENV and VOLUME instructions, but since we are using the prebuilt image from Docker Hub, we specify them in the CLI instead.
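
The pgAdmin counterpart would look like this (a sketch mirroring the Compose configuration below; it joins the same network so it can reach pg-database by name):

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  --network=pg-network \
  --name pgadmin \
  dpage/pgadmin4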

Simplify with Docker Compose

In the lesson, we need to orchestrate up to three containers, with two of them in the same network. Doing this by hand with docker run gets complicated.

Docker Compose was developed to simplify orchestrating multiple containers. To use the plugin (after it is installed), we define a docker-compose.yaml file like the example below

services:
  pgdatabase:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - "./ny_taxi_postgres_data:/var/lib/postgresql/data:rw"
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - "8080:80"

Each container is defined as a named service under services. The image and arguments are specified with a familiar syntax. Compose also puts all its services on a shared default network automatically, so the manual --network setup from before is no longer needed.

To run both containers from the shell

docker compose up
# detached mode
docker compose up -d

To stop everything

docker compose stop

To stop and remove containers entirely

docker compose down
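
Adding -v also removes named volumes declared in the file (a bind mount like ./ny_taxi_postgres_data lives on the host and is left untouched either way):

# stop and remove containers, networks, and named volumes
docker compose down -v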

Conclusion

You can refer to the official documentation for finer arguments, and to my cheatsheet for common Docker commands.

That's all for this lesson! I hope it was helpful 🤗.


  1. Refer here. Reproduced with some simplifications from swyx's The Coding Career Handbook.

  2. Hardware virtualization was developed to enable concurrent usage of a machine (multiple users at once). There is only one host machine, but each user who logs in is effectively using a separate one. To enable this, the virtual machine monitor, or hypervisor, allocates to each user a separate partition of hardware such as hard disk space, CPU cores, and RAM. The virtual machine is thus isolated at the hardware level.

  3. Docker Engine is no silver bullet, since it is (still) confined to running on Linux. Thus, Windows users have to install Windows Subsystem for Linux (WSL) to run Docker, which in turn virtualizes a Linux environment for the app...

#post #study