Using Apache Airflow’s Docker Operator with Amazon’s Container Repository

Last year, Lucid Software’s data science and analytics teams moved to Apache Airflow for scheduling tasks. Airflow was a major improvement over our previous solution (running Windows Task Scheduler on an analyst’s laptop and hoping it worked), but we’ve had to work through a few hurdles to get everything working.

One interesting hurdle has been getting Airflow’s provided DockerOperator to work with images hosted in AWS’s private Elastic Container Registry (ECR). In this post, I will take you through what we did to make Airflow and ECR work together. This is written under the assumption that you know the basics of Airflow and Docker, though not necessarily ECR.

While not all of the jobs we run with Airflow require Docker, there were a few jobs that needed the portability that Docker provides. Most of our analysts and data scientists work in OS X or Windows, while our Airflow cluster runs on Linux. If a job relied on system APIs, we couldn’t guarantee it would work the same on the Airflow cluster as it did on the developer’s laptop. For example, one analyst wrote a web scraper with the Selenium web driver, and while it worked on his laptop, some of the system calls Selenium used were failing in Linux.

Debugging each system call and finding a way to make each step of the scraper work in every environment we support would have required a significant up-front cost and left us with fragile code, requiring the same fixes the next time someone changed the code. Instead, we helped the analyst move his scraper to a Docker container, creating something we could easily maintain.

Set up permissions and push to ECR

Once we had the image, we needed to move it into ECR. First, we needed to give the analyst access to ECR so he could push his container, which we did by attaching a few policies in IAM. At the very least, someone pushing a container to ECR will need the ecr:GetAuthorizationToken and ecr:PutImage permissions. If you want to manage repositories yourself, that’s all you need. If you want the analyst to manage the repository they are pushing to as well, you’ll also need to give them the ecr:CreateRepository permission. For more detailed information, AWS provides excellent tutorials: Creating a Repository and Pushing an Image.
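
For illustration, here is a rough sketch of such a policy attached to an analyst’s IAM user with boto3. A complete image push also needs the layer-upload actions, so those are included as well; the user name, policy name, account ID, and repository name below are all placeholders.

import json
import boto3

iam = boto3.client('iam')

# Minimal sketch of a push policy; account ID and repository name are placeholders.
push_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ecr:GetAuthorizationToken"],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:PutImage",
                "ecr:BatchCheckLayerAvailability",
                "ecr:InitiateLayerUpload",
                "ecr:UploadLayerPart",
                "ecr:CompleteLayerUpload"
            ],
            "Resource": "arn:aws:ecr:us-east-1:XXXXXXXXXXXX:repository/web_scraper"
        }
    ]
}

# Attach the policy inline to the analyst's IAM user (placeholder names).
iam.put_user_policy(
    UserName='analyst',
    PolicyName='ecr-push-web-scraper',
    PolicyDocument=json.dumps(push_policy),
)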

Next, we needed to give Airflow permission to pull the job’s image from ECR. The permissions Airflow needed were ecr:BatchCheckLayerAvailability, ecr:BatchGetImage, ecr:GetAuthorizationToken, and ecr:GetDownloadUrlForLayer. Our Airflow cluster runs on EC2 instances, so we gave those specific permissions to the IAM roles associated with those instances. From there, we set up Airflow to be able to communicate with our account’s ECR.
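
Here is a sketch of that pull-side policy, attached inline to the workers’ EC2 instance role with boto3; the role name and policy name are placeholders.

import json
import boto3

iam = boto3.client('iam')

# Everything Airflow needs to authenticate against ECR and pull image layers.
pull_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer"
            ],
            "Resource": "*"
        }
    ]
}

# Attach the policy to the IAM role used by the Airflow EC2 instances (placeholder names).
iam.put_role_policy(
    RoleName='airflow-worker',
    PolicyName='ecr-pull-access',
    PolicyDocument=json.dumps(pull_policy),
)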

Connect Airflow to ECR

Airflow communicates with the Docker registry by looking for connections with the type “docker” in its list of connections. We wrote a small script that retrieved login credentials from ECR, parsed them, and stored them in Airflow’s connection list under that type.

Here is an example script similar to what we used to retrieve and store credentials:


#!/usr/bin/env python
import subprocess
import boto3
import base64

# Request a temporary login token for our account's ECR registry
ecr = boto3.client('ecr', region_name='us-east-1')
response = ecr.get_authorization_token()

# The token is base64-encoded "<username>:<password>" (the username is always "AWS")
username, password = base64.b64decode(
  response['authorizationData'][0]['authorizationToken']
).decode('utf-8').split(':')
registry_url = response['authorizationData'][0]['proxyEndpoint']

# Delete existing docker connection
airflow_del_cmd = 'airflow connections -d --conn_id docker_default'
process = subprocess.Popen(airflow_del_cmd.split(), stdout=subprocess.PIPE)
output, error = process.communicate()

# Add docker connection with updated credentials
airflow_add_cmd = 'airflow connections -a --conn_id docker_default --conn_type docker --conn_host {} --conn_login {} --conn_password {}'.format(registry_url, username, password)
process = subprocess.Popen(airflow_add_cmd.split(), stdout=subprocess.PIPE)
output, error = process.communicate()

The issue with the script above, though, is that the ECR credentials eventually expire (the tokens are only valid for 12 hours). To keep these credentials fresh, we set up a cron task on every host in our cluster that runs this script every half hour.
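
For example, a crontab entry along these lines would do it; the script path and log location here are placeholders.

# Refresh the docker_default connection every 30 minutes
*/30 * * * * /usr/local/bin/update_ecr_connection.py >> /var/log/update_ecr_connection.log 2>&1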

Use your Docker image on Airflow

Then, in order to run the container image as a task, we set up a DAG with an operator like this:


DockerOperator(
    task_id='web_scraper',
    image='XXXXXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/web_scraper:latest',
    command='python /home/ubuntu/web_scraper.py',
    execution_timeout=timedelta(minutes=30),
    dag=dag)


The DockerOperator pulls the image you pushed using the Docker connection we set up in the last script and runs that image with the provided command.
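
For context, here is a minimal sketch of what the surrounding DAG file might look like. The DAG id, schedule, and start date are placeholders, and the DockerOperator import path shown is the Airflow 1.x one (in Airflow 2.x the operator lives in the Docker provider package instead).

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.docker_operator import DockerOperator

# Placeholder DAG definition; adjust the schedule and start date to your job.
dag = DAG(
    dag_id='web_scraper',
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
    catchup=False,
)

scrape = DockerOperator(
    task_id='web_scraper',
    image='XXXXXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/web_scraper:latest',
    command='python /home/ubuntu/web_scraper.py',
    docker_conn_id='docker_default',  # the connection maintained by the refresh script
    execution_timeout=timedelta(minutes=30),
    dag=dag,
)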

We had one last issue working with Docker on Airflow. The DockerOperator does not clean up the stopped containers and old images it leaves behind, which eventually caused us to run out of disk space on our EC2 instances. To fix that, we added another task to the same DAG that does some cleanup:


BashOperator(
    task_id='clean_up_docker',
    bash_command='docker container prune -f',  # -f skips the interactive confirmation prompt
    dag=dag)

With that last operator in place, we had a system for running Docker images stored in ECR as tasks in Airflow. We can now take a task, put it in a portable Docker image, push that image to our private hosted repository in ECR, and then run it on a schedule from our Airflow cluster.

Of course, this isn’t your only option for using Docker, or even ECR, with Airflow. Our site reliability team has started running some containerized tasks using the ECSOperator instead of the DockerOperator (sketched briefly below) so they could run on an Elastic Container Service (ECS) cluster rather than directly on the Airflow worker. We’ve decided to use the DockerOperator since it made sense for our team, and I hope I’ve helped you get the most out of your Docker and Airflow infrastructure.
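
If you want to go that route, a task might look roughly like the sketch below. The task definition and cluster names are placeholders, and the import path shown is the Airflow 1.10 contrib one (in Airflow 2.x the operator moved to the Amazon provider package).

from airflow.contrib.operators.ecs_operator import ECSOperator

# Placeholder task that runs an already-registered ECS task definition.
scrape_on_ecs = ECSOperator(
    task_id='web_scraper_on_ecs',
    task_definition='web-scraper',           # placeholder ECS task definition
    cluster='analytics-jobs',                # placeholder ECS cluster name
    overrides={'containerOverrides': []},    # no per-run overrides in this sketch
    region_name='us-east-1',
    dag=dag)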

5 Comments

  1. I have been looking for a meaningful post to understand the DockerOperator for a month, and yours is amazing. Thanks bud.

  2. Great post Brian.

    I am currently integrating Airflow in my organisation and faced a similar problem, where images were hosted on ECR and the token needs to be refreshed every 12 hours. I was looking for solutions, stumbled upon this post, and found it really helpful.

    This also inspired me to implement a custom Airflow operator that can refresh the token automatically. The plugin is available on PyPI (link below) in case other people are facing a similar problem.
    https://pypi.org/project/airflow-ecr-plugin/

  3. Nice post.

    Any reason you need to run your token refresh script on every host? It seems like running it on a single host (like in an Airflow task) would be enough to update the Airflow DB and make the new connection available to all the hosts.

    Am I missing something?

  4. Hey Brian! Loved the post, it helped me get started. I had to update a few things to get it working with my version of Airflow, 2.1.2:
    1. The ECR Authentication lasts for 12 hours, meaning you only have to run this job 2 or 3 times a day
    2. I separated out the two airflow commands into their own separate bash operator steps, for more transparency.
    3. I added the Airflow connection this way:
    'airflow connections add --conn-type docker --conn-host "{}" --conn-login "{}" --conn-password "{}" docker_default'.format(login, "AWS", token)
    Where AWS is the username, docker_default is a required parameter, and login is "https://${AWS_ACCOUNT_NUM}.dkr.ecr.us-east-1.amazonaws.com"
    4. I deleted the Airflow connection this way:
    'airflow connections delete docker_default'
    5. From the initial Python request, I only used the token received as follows:
    token = response['authorizationData'][0]['authorizationToken']
    That is later retrieved using XCOMs in the DAG
    token = "{{ task_instance.xcom_pull(task_ids='authenticate_ecr') }}"
    (authenticate_ecr is my method)

  5. Here is a simple DAG which I wrote based on this post, in pure Python, for pulling the token: http://www.gudasoft.com/uncategorized/10/22/1876/using-apache-airflows-docker-operator-with-amazons-container-repository/2021

    It is good for running Docker containers inside Airflow.
