Hung's Notebook

DEZ Week 1 - Docker & Terraform & SQL

Table of Contents


This year I started the Data Engineering Zoomcamp from datatalks.club. I took one last year during my internship, but I joined late and just managed to finish the final project. Let's do it systematically this year 🤗. I will finish the homework and publish lesson notes every week.

For the first week, we learnt about Docker and Terraform, two DevOps tools to set up the environment for data engineering projects. We also had homework to refresh our SQL, using the database we set up with Docker.

Virtualization Software - Docker

Virtualization software is the solution to the consistency problem in software development. Software runs on top of the hardware, the OS, the libraries, the runtime, and any additional layers. A difference in any of these layers between machines can render software unusable elsewhere, even when "it works on MY machine!". Therefore, developers want to make sure that their software is developed and deployed in the same environment, the way experiments must be carried out in the same setting.

Virtualization software helps to create a software-based machine that includes everything (OS, libraries, etc.) of an actual computer with borrowed hardware from the physical host computer (image from Azure). In other words, it's a "computer within a computer".

Container software such as Docker is quite similar to a VM; the difference is the level of isolation. A VM is virtualized with full OS isolation, so each VM must boot its own OS and kernel1. This introduces significant overhead and start-up time. A container, however, shares the host kernel and relies on native Linux kernel features: namespaces to isolate itself from other processes and cgroups to allocate hardware resources. It trades isolation for lighter weight and better performance, and the trade has paid off: Docker now shines as the standard for cloud deployment and microservice architecture2.
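Namespaces and cgroups are plain kernel features, not Docker magic; every Linux process already lives inside a set of them. Here is a quick peek using standard Linux procfs paths, no container runtime required:

```shell
# Every Linux process belongs to a set of namespaces; a container simply gets
# fresh ones. Inspect the current shell's namespaces and cgroup membership:
ls /proc/self/ns        # one symlink per namespace: mnt, net, pid, uts, ...
cat /proc/self/cgroup   # the cgroup(s) that cap this process's resources
```

Run the same two commands inside a container and you will see different namespace IDs, which is the whole trick.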

I have created a Docker cheatsheet here: https://github.com/HangenYuu/docker-cheatsheet. This chapter offers more detail on the thought process of running a Docker image.
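As a taste of that thought process, here is a sketch of the kind of `docker run` invocation used to stand up the homework Postgres database. The image tag, credentials, and database name below are illustrative assumptions, not prescribed by the course; the command is echoed rather than executed so it can be inspected first.

```shell
# Sketch: compose (and print) a `docker run` command for a disposable
# Postgres container. All names below are illustrative assumptions.
start_pg() {
  # `echo` prints the command instead of running it; drop it to run for real.
  echo docker run -d \
    -e POSTGRES_USER=root \
    -e POSTGRES_PASSWORD=root \
    -e POSTGRES_DB=ny_taxi \
    -p 5432:5432 \
    postgres:13
}
start_pg
```

Once the container is up, any Postgres client on the host can reach the database at `localhost:5432`.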

Infrastructure as Code (IaC) - Terraform

In the early days, systems administrators ("sysadmins") were responsible for provisioning and configuring the hardware for processes, and they did so manually via the browser configuration page for each service (a simple app may require an API Gateway, a few Lambda functions, an S3 bucket, and a DynamoDB table, which is already a lot, and a team may have many apps, or even a full microservice architecture with hundreds of services). This was time-consuming, error-prone, and poorly documented, which made replication and debugging much harder. There was a need for a better-documented and more structured approach.

IaC tools such as Terraform or Pulumi are the answer to these challenges. Instead of a folder of 31-step .txt files about what to look out for on the web console, the resources and configurations are defined in a programming language (HCL for Terraform, Python/Go/etc. for Pulumi). Afterwards, the tool will communicate with the cloud providers on your behalf.

Instead of fiddling with the web console, now users can create resources with 5 steps:

  1. Write out the service configurations, using the documentation for each cloud provider service, to a main.tf file. Here's the course example to create a GCS bucket and a BigQuery dataset. Below is the equivalent to create an AWS S3 bucket and Athena database (more verbose, since AWS is more modular and fussier about security).

    Terraform AWS
    terraform {
      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "~> 5.0"
        }
      }
    }
    
    # Configure the AWS Provider
    provider "aws" {
      # Credentials should be set via the AWS_ACCESS_KEY_ID and
      # AWS_SECRET_ACCESS_KEY environment variables, not hard-coded here:
      #  access_key = "my-access-key"
      #  secret_key = "my-secret-key"
      region = "us-east-1"
    }
    
    resource "aws_s3_bucket" "data_lake_bucket" {
      bucket = "<Your-Unique-Bucket-Name>"
      force_destroy = true
    }
    
    # Enable versioning
    resource "aws_s3_bucket_versioning" "bucket_versioning" {
      bucket = aws_s3_bucket.data_lake_bucket.id
      versioning_configuration {
        status = "Enabled"
      }
    }
    
    # Configure lifecycle rule
    resource "aws_s3_bucket_lifecycle_configuration" "bucket_lifecycle" {
      bucket = aws_s3_bucket.data_lake_bucket.id
    
      rule {
        id     = "delete_old_objects"
        status = "Enabled"
    
        expiration {
          days = 30
        }
      }
    }
    
    # Create Athena database (closest equivalent to a BigQuery dataset).
    # Athena database names allow only lowercase letters, digits, and underscores.
    resource "aws_athena_database" "dataset" {
      name   = "<your_dataset_name>"
      bucket = aws_s3_bucket.data_lake_bucket.bucket
    }
    
    # Enable server-side encryption (recommended security practice)
    resource "aws_s3_bucket_server_side_encryption_configuration" "bucket_encryption" {
      bucket = aws_s3_bucket.data_lake_bucket.id
    
      rule {
        apply_server_side_encryption_by_default {
          sse_algorithm = "AES256"
        }
      }
    }
    
    # Block public access (recommended security practice)
    resource "aws_s3_bucket_public_access_block" "bucket_public_access_block" {
      bucket = aws_s3_bucket.data_lake_bucket.id
    
      block_public_acls       = true
      block_public_policy     = true
      ignore_public_acls      = true
      restrict_public_buckets = true
    }
    
  2. Initialize with terraform init in the shell. Provided the credentials and environment variables are set up, this will set up the backend we run Terraform from (local, or Terraform Cloud) by downloading the provider plugin code there and creating the lock file. Nothing too interesting happens yet.

    screenshot

    .terraform.lock.hcl
    # This file is maintained automatically by "terraform init".
    # Manual edits may be lost in future updates.
    
    provider "registry.terraform.io/hashicorp/google" {
      version     = "4.51.0"
      constraints = "4.51.0"
      hashes = [
        "h1:7JFdiV9bvV6R+AeWzvNbVeoega481sJY3PqtIbrwTsM=",
        "zh:001bf7478e495d497ffd4054453c97ab4dd3e6a24d46496d51d4c8094e95b2b1",
        "zh:19db72113552dd295854a99840e85678d421312708e8329a35787fff1baeed8b",
        "zh:42c3e629ace225a2cb6cf87b8fabeaf1c56ac8eca6a77b9e3fc489f3cc0a9db5",
        "zh:50b930755c4b1f8a01c430d8f688ea79de0b0198c87511baa3a783e360d7e624",
        "zh:5acd67f0aafff5ad59e179543cccd1ffd48d69b98af0228506403b8d8193b340",
        "zh:70128d57b4b4bf07df941172e6af15c4eda8396af5cc2b0128c906983c7b7fad",
        "zh:7905fac0ba2becf0e97edfcd4224e57466b04f960f36a3ec654a0a3c2ffececb",
        "zh:79b4cc760305cd77c1ff841f789184f808b8052e8f4faa5cb8d518e4c13beb22",
        "zh:c7aebd7d7dd2b29de28e382500d36fae8b4d8a192cf05e41ea29c66f1251acfc",
        "zh:d8b4494b13ef5af65d3afedf05bf7565918f1e31ad68ae0df81f5c3b12baf519",
        "zh:e6e68ef6881bc3312db50c9fd761f226f34d7834b64f90d96616b7ca6b1daf34",
        "zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
      ]
    }
    
  3. See what we are going to do with terraform plan. This is a validation step. Terraform will:

    1. Ping the cloud provider and read the current state of any already-existing remote objects.
    2. Compare the current configuration to the remote state and note any differences.
    3. Propose a set of change actions that should, if applied, make the remote objects match the configuration. For example, here are the actions when I try to create the course's GCP resources from scratch:
    Terraform's Plan
    Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
      + create
    
    Terraform will perform the following actions:
    
      # google_bigquery_dataset.dataset will be created
      + resource "google_bigquery_dataset" "dataset" {
          + creation_time              = (known after apply)
          + dataset_id                 = "<The Dataset Name You Want to Use>"
          + delete_contents_on_destroy = false
          + etag                       = (known after apply)
          + id                         = (known after apply)
          + labels                     = (known after apply)
          + last_modified_time         = (known after apply)
          + location                   = "US"
          + project                    = "<Your Project ID>"
          + self_link                  = (known after apply)
    
          + access (known after apply)
        }
    
      # google_storage_bucket.data-lake-bucket will be created
      + resource "google_storage_bucket" "data-lake-bucket" {
          + force_destroy               = true
          + id                          = (known after apply)
          + location                    = "US"
          + name                        = "<Your Unique Bucket Name>"
          + project                     = (known after apply)
          + public_access_prevention    = (known after apply)
          + self_link                   = (known after apply)
          + storage_class               = "STANDARD"
          + uniform_bucket_level_access = true
          + url                         = (known after apply)
    
          + lifecycle_rule {
              + action {
                  + type          = "Delete"
                    # (1 unchanged attribute hidden)
                }
              + condition {
                  + age                    = 30
                  + matches_prefix         = []
                  + matches_storage_class  = []
                  + matches_suffix         = []
                  + with_state             = (known after apply)
                    # (3 unchanged attributes hidden)
                }
            }
    
          + versioning {
              + enabled = true
            }
    
          + website (known after apply)
        }
    
    Plan: 2 to add, 0 to change, 0 to destroy.
    
  4. Run terraform apply to create the resources or apply the changes.

  5. Use terraform destroy to remove the resources3.
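The five steps above collapse into a short CLI sequence once main.tf is written. The sketch below only echoes the commands (it assumes Terraform is installed and provider credentials are configured); swap the echo for direct execution to run them for real.

```shell
# Dry-run sketch of the Terraform workflow from the steps above.
run() { echo "+ $*"; }    # replace `echo "+ $*"` with `"$@"` to execute
run terraform init        # download providers, write .terraform.lock.hcl
run terraform plan        # diff configuration against remote state
run terraform apply       # create/update the resources
run terraform destroy     # tear everything down afterwards
```

The `run` wrapper is just a safety habit: you can eyeball the full sequence before letting it touch a real cloud account.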

For advanced usage, the documentation actually feels more helpful than the books I could find. The cheat sheet can be accessed here.

Structured Query Language - SQL

Creating learning resources for SQL is like beating a dead horse: there's no data engineer who cannot work with SQL. To do it properly, I would need to write a whole blog post about it.

That's it for now. See you next week, when we move into orchestration frameworks with (another) new face, Kestra.


  1. The kernel is the lowest-level software running on a computer, managing all other processes (memory allocation, CPU scheduling, etc.).

  2. For development, a virtual environment is often enough on one's laptop, e.g., a conda env for Python devs. But if you use a VM from a cloud provider or GitHub Codespaces, you are developing in a container.

  3. I don't want to rain on the parade in the main article, but Terraform has its share of problems. Here's one good article to balance the opinion. The most problematic point is that apply may not create everything successfully, and destroy can likewise fail to remove something. However, if you are just using common services on common providers, no serious errors should happen. If you are interested in the Terraform vs OpenTofu situation, you can check out this video.

#dataeng #post #study