Using Apache Airflow’s Docker Operator with Amazon’s Container Repository

Last year, Lucid Software’s data science and analytics teams moved to Apache Airflow for scheduling tasks. Airflow was a major improvement over our previous solution (running Windows Task Scheduler on an analyst’s laptop and hoping it worked), but we’ve had to work through a few hurdles to get everything working.

One interesting hurdle has been getting Airflow’s provided DockerOperator to work with images on AWS’s hosted private Elastic Container Registry (ECR). In this post, I will take you through what we did to make Airflow and ECR work together. This is written under the assumption that you know the basics of Airflow and Docker, though not necessarily ECR.

While not all of the jobs we run with Airflow require Docker, there were a few jobs that needed the portability that Docker provides. Most of our analysts and data scientists work in OS X or Windows, while our Airflow cluster runs on Linux. If a job relied on system APIs, we couldn’t guarantee it would work the same on the Airflow cluster as it did on the developer’s laptop. For example, one analyst wrote a web scraper with the Selenium web driver, and while it worked on his laptop, some of the system calls Selenium used were failing in Linux.

Debugging each system call and finding a way to make each step of the scraper work in every environment we support would have required a significant up-front cost and left us with fragile code, requiring the same fixes the next time someone changed the code. Instead, we helped the analyst move his scraper to a Docker container, creating something we could easily maintain.

Set up permissions and push to ECR

Once we had the image, we needed to move it into ECR. First, the analyst needed to be able to push the container, so we granted access to ECR in IAM by adding a few policies. At the very least, someone pushing a container to ECR will need the permissions ecr:GetAuthorizationToken and ecr:PutImage; AWS’s documented push policies also include the layer-upload actions ecr:BatchCheckLayerAvailability, ecr:InitiateLayerUpload, ecr:UploadLayerPart, and ecr:CompleteLayerUpload. If you manage the repositories yourself, that’s all the person pushing needs. If you want them to manage the repository they are pushing to as well, you’ll also need to give them the ecr:CreateRepository permission. For more detailed information, AWS provides excellent tutorials: Creating a Repository and Pushing an Image.

Next, we needed to give Airflow permissions to pull the image of the job from ECR. The permissions Airflow needed were ecr:BatchCheckLayerAvailability, ecr:BatchGetImage, ecr:GetAuthorizationToken, and ecr:GetDownloadUrlForLayer. Our Airflow cluster runs on EC2 instances so we gave those specific permissions to the IAM roles associated with those instances. From there, we set up Airflow to be able to communicate with our account’s ECR.
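To make the two sets of permissions concrete, here is a minimal sketch that builds the corresponding IAM policy documents as Python dictionaries. The account ID, repository name, and the ecr_policy helper are placeholders of my own, not what we actually deployed, and the layer-upload actions in the push list come from AWS’s documented push policy rather than from our setup.

```python
import json

# Placeholder ARN: substitute your own account ID, region, and repository.
REPO_ARN = "arn:aws:ecr:us-east-1:123456789012:repository/web_scraper"

# Actions for someone pushing an image. ecr:PutImage is named in this post;
# the layer-upload actions follow AWS's documented push policy.
PUSH_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:InitiateLayerUpload",
    "ecr:UploadLayerPart",
    "ecr:CompleteLayerUpload",
    "ecr:PutImage",
]

# Actions the Airflow hosts need in order to pull an image.
PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def ecr_policy(repo_arn, repo_actions):
    """Build an IAM policy document granting repo_actions on one repository."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # GetAuthorizationToken is not scoped to a single repository,
                # so it is granted on all resources.
                "Effect": "Allow",
                "Action": "ecr:GetAuthorizationToken",
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": sorted(repo_actions),
                "Resource": repo_arn,
            },
        ],
    }


if __name__ == "__main__":
    # Print the pull policy as JSON, ready to paste into the IAM console.
    print(json.dumps(ecr_policy(REPO_ARN, PULL_ACTIONS), indent=2))
```

Scoping the repository actions to a single ARN keeps each job’s access narrow; widening the Resource to a wildcard is a trade-off you would make deliberately.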

Connect Airflow to ECR

Airflow communicates with the Docker registry by looking for connections with the type “docker” in its list of connections. We wrote a small script that retrieved login credentials from ECR, parsed them, and put them into Airflow’s connection list.

Here is an example script similar to what we used to retrieve and store credentials:


#!/usr/bin/env python
import base64
import subprocess

import boto3

ecr = boto3.client('ecr', region_name='us-east-1')
response = ecr.get_authorization_token()
auth_data = response['authorizationData'][0]

# The token is base64-encoded "username:password". Decode the bytes to str
# (required on Python 3) and split on the first colon only.
username, password = base64.b64decode(
    auth_data['authorizationToken']
).decode('utf-8').split(':', 1)
registry_url = auth_data['proxyEndpoint']

# Delete the existing docker connection (Airflow 1.x CLI syntax).
# The return code is not checked, so a missing connection is not fatal.
subprocess.call(
    ['airflow', 'connections', '-d', '--conn_id', 'docker_default'])

# Add the docker connection with updated credentials.
# Passing an argument list avoids re-splitting values on whitespace.
subprocess.check_call([
    'airflow', 'connections', '-a',
    '--conn_id', 'docker_default',
    '--conn_type', 'docker',
    '--conn_host', registry_url,
    '--conn_login', username,
    '--conn_password', password,
])

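The issue with the script above, though, is that the ECR credentials expire. We treated them as valid for only an hour (AWS documents a 12-hour lifetime for authorization tokens, and each response reports its exact expiry in the expiresAt field). To keep these credentials fresh, we set up a cron task on every host in our cluster that runs this script every half hour.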
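If you would rather refresh only when the credentials are close to expiring, the response from get_authorization_token includes an expiresAt timestamp you can check. The helper below is a sketch of that idea, not part of our script: needs_refresh and its 30-minute margin are hypothetical, and it assumes expiresAt arrives as a timezone-aware datetime, which is how boto3 normally returns timestamps.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(expires_at, now=None, margin=timedelta(minutes=30)):
    """Return True if the token expires within `margin` of `now`.

    expires_at is assumed to be a timezone-aware datetime, such as
    response['authorizationData'][0]['expiresAt'] from boto3.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return expires_at - now <= margin


if __name__ == "__main__":
    # Example with a made-up expiry ten minutes from now.
    soon = datetime.now(timezone.utc) + timedelta(minutes=10)
    print(needs_refresh(soon))
```

Running the refresh unconditionally from cron, as we did, is simpler; a check like this mainly helps if you want to avoid rewriting the connection on every run.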

Use your Docker image on Airflow

Then, to run the container image as a task, we set up a DAG with an operator like this:


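from datetime import timedelta
from airflow.operators.docker_operator import DockerOperator  # Airflow 1.x path

DockerOperator(
    task_id='web_scraper',
    image='XXXXXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/web_scraper:latest',
    command='python /home/ubuntu/web_scraper.py',
    docker_conn_id='docker_default',  # use the connection created by the script above
    execution_timeout=timedelta(minutes=30),
    dag=dag)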

 

The DockerOperator pulls the image you pushed using the Docker connection we set up in the last script and runs that image with the provided command.

We had one last issue with Docker on Airflow. The DockerOperator does not clean up after itself, and the leftover containers and images eventually led us to run out of disk space on our Airflow hosts. To fix that, we added another task to the same DAG that does some cleanup:


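BashOperator(
    task_id='clean_up_docker',
    # --force skips the confirmation prompt, which a non-interactive task cannot answer.
    # This removes stopped containers; 'docker image prune --force' removes dangling images.
    bash_command='docker container prune --force',
    dag=dag)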

With that last operator in place, we had a system for running Docker images stored in ECR as tasks in Airflow. We can now take a task, put it in a portable Docker image, push that image to our private hosted repository in ECR, and then run it on a schedule from our Airflow cluster.
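Putting the pieces together, a DAG file might look roughly like the sketch below. It assumes Airflow 1.x import paths, and the DAG name, start date, schedule, build_dag function, and ecr_image_uri helper are all illustrative choices of mine rather than our production code. The Airflow imports live inside build_dag so the URI helper can be used on a machine without Airflow installed.

```python
from datetime import datetime, timedelta


def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Format an ECR image reference:
    <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
    """
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(
        account_id, region, repository, tag)


def build_dag():
    # Airflow 1.x import paths (an assumption; newer releases moved these).
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.docker_operator import DockerOperator

    dag = DAG(
        dag_id="web_scraper",             # hypothetical DAG name
        start_date=datetime(2019, 1, 1),  # placeholder start date
        schedule_interval="@daily",       # placeholder schedule
    )

    scrape = DockerOperator(
        task_id="web_scraper",
        image=ecr_image_uri("XXXXXXXXXXXX", "us-east-1", "web_scraper"),
        command="python /home/ubuntu/web_scraper.py",
        docker_conn_id="docker_default",
        execution_timeout=timedelta(minutes=30),
        dag=dag,
    )

    clean_up = BashOperator(
        task_id="clean_up_docker",
        bash_command="docker container prune --force",
        trigger_rule="all_done",  # run the cleanup even if the scraper fails
        dag=dag,
    )

    # Run the cleanup after the scraper finishes.
    scrape >> clean_up
    return dag
```

In a real DAG file you would assign the result at module level (dag = build_dag()) so the scheduler can discover it. Setting trigger_rule to all_done is a design choice: it keeps a failed scraper run from also skipping the cleanup.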

Of course, this isn’t your only option for using Docker, or even ECR, with Airflow. Our site reliability team has started running some containerized tasks using the ECSOperator instead of the DockerOperator so they could run on an Elastic Container Service (ECS) cluster rather than directly on the Airflow worker. We’ve decided to use the DockerOperator since it made sense for our team, and I hope I’ve helped you get the most out of your Docker and Airflow infrastructure.
