Push Pipeline and Model To AWS S3

Create Buckets

AWS S3 is a managed service that lets us upload files and access them later. We’ll leverage this by uploading our pickled files so our API can fetch them later. First we’ll create two “buckets” (think of a bucket as a top-level folder): one for our pipelines and one for our models.

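import boto3

# Use the credentials and region of the "personal" profile in ~/.aws
session = boto3.Session(profile_name="personal")
s3 = session.client("s3")

# One bucket per artifact type; bucket names must be globally unique
s3.create_bucket(Bucket="data-science-from-scratch-pipeline")
s3.create_bucket(Bucket="data-science-from-scratch-model")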


If you get any errors like:

botocore.exceptions.ClientError: An error occurred (IllegalLocationConstraintException) when calling the CreateBucket operation: The unspecified location constraint is incompatible for the region specific endpoint this request was sent to.

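This error appears when the client’s region is not us-east-1 and the request does not say which region the bucket belongs in. Either set the region in ~/.aws/config to us-east-1, or pass CreateBucketConfiguration={"LocationConstraint": "<your-region>"} to create_bucket, using the same region your client is configured for.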

Additionally, bucket names must be globally unique, along with some other naming requirements. So blindly copying and pasting the bucket names used in this post will not work; pick your own.
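Both points can be sketched in code. This is a minimal sketch, not the code used for this post: is_valid_bucket_name, create_bucket_kwargs, and create_bucket are hypothetical helpers, and the name check covers only a subset of S3’s naming rules.

```python
import re

# A subset of the S3 naming rules: 3-63 characters, lowercase letters,
# digits, dots and hyphens, starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")
_IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def is_valid_bucket_name(name):
    """Return True if `name` passes the subset of rules checked here."""
    if not _BUCKET_RE.match(name):
        return False
    if ".." in name:  # adjacent periods are not allowed
        return False
    if _IPV4_RE.match(name):  # names must not be formatted like an IP address
        return False
    return True


def create_bucket_kwargs(name, region):
    """Build keyword arguments for `s3.create_bucket`.

    us-east-1 is the default location and must not be passed as a
    LocationConstraint; every other region has to be named explicitly.
    """
    kwargs = {"Bucket": name}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(name, region, profile_name="personal"):
    """Create the bucket in `region`. This makes a real AWS call."""
    import boto3  # imported here so the helpers above work without boto3

    if not is_valid_bucket_name(name):
        raise ValueError(f"invalid bucket name: {name!r}")
    session = boto3.Session(profile_name=profile_name, region_name=region)
    s3 = session.client("s3")
    return s3.create_bucket(**create_bucket_kwargs(name, region))
```

Checking the name locally only catches formatting problems. Whether the name is already taken by someone else is something only the create_bucket call itself can tell you.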

Upload Dill’ed Files

import boto3

session = boto3.Session(profile_name="personal")
s3 = session.client("s3")

s3.upload_file("artifact/pipe.dill", "data-science-from-scratch-pipeline", "pipe.dill", ExtraArgs={"ACL": "public-read"})
s3.upload_file("artifact/model.dill", "data-science-from-scratch-model", "model.dill", ExtraArgs={"ACL": "public-read"})

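Now our files are accessible from S3 and our API can fetch them. Please note that I made these files public only because this is a tutorial. Also note that buckets created with current S3 defaults have ACLs disabled and public access blocked, so the public-read ACL is rejected unless you change those bucket settings first.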

Modifying API To Use S3 Files

We’ll have to make a slight modification to our API to leverage our files on S3.

import dill
import requests
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)

# Download the dill'ed pipeline from S3 at import time, then load it.
r = requests.get("https://data-science-from-scratch-pipeline.s3.amazonaws.com/pipe.dill")
r.raise_for_status()  # fail fast if the download did not succeed
with open("pipe.dill", "wb") as f:
    f.write(r.content)

with open("pipe.dill", "rb") as f:
    pipe = dill.load(f)

# Do the same for the model.
r = requests.get("https://data-science-from-scratch-model.s3.amazonaws.com/model.dill")
r.raise_for_status()
with open("model.dill", "wb") as f:
    f.write(r.content)

with open("model.dill", "rb") as f:
    model = dill.load(f)

@app.route("/predict", methods=["GET"])
def predict():
    # force=True parses the body as JSON even without a JSON Content-Type
    raw_json = request.get_json(force=True)
    flat_table_df = pd.json_normalize(raw_json)
    processed = pipe.transform(flat_table_df)
    return str(model.predict(processed)[0])
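The download-then-load steps are repeated for the pipeline and the model, so they could be factored into one helper. The following is only a sketch of that idea: fetch_bytes and load_remote_artifact are hypothetical names, and this variant uses urllib from the standard library and deserializes in memory instead of writing to disk, which differs from the code above.

```python
from urllib.request import urlopen


def fetch_bytes(url, timeout=10):
    """Download `url` and return the raw response body."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def load_remote_artifact(url, fetch=fetch_bytes, loads=None):
    """Fetch a serialized object from `url` and deserialize it.

    `fetch` and `loads` can be swapped out, which keeps the helper easy
    to exercise without network access. By default the bytes are
    deserialized with dill, matching how the artifacts were saved.
    """
    if loads is None:
        import dill  # imported lazily so the helper can be used without dill

        loads = dill.loads
    return loads(fetch(url))


# Usage (performs real downloads):
# pipe = load_remote_artifact(
#     "https://data-science-from-scratch-pipeline.s3.amazonaws.com/pipe.dill")
# model = load_remote_artifact(
#     "https://data-science-from-scratch-model.s3.amazonaws.com/model.dill")
```

Keep in mind that loading a dill or pickle file executes whatever code it contains, so only load artifacts from buckets you control.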

Push API To AWS Lambda

Now that we have a pipeline file, model file, and API code written and our files uploaded, we’re ready to deploy our API to AWS. For this tutorial, we’re going to host our API on AWS Lambda, AWS’s serverless compute service.

We’ll also leverage Zappa to make deploying to Lambda much easier. We’ll first run zappa init, which walks us through generating a zappa_settings.json.

zappa init

███████╗ █████╗ ██████╗ ██████╗  █████╗
  ███╔╝ ███████║██████╔╝██████╔╝███████║
 ███╔╝  ██╔══██║██╔═══╝ ██╔═══╝ ██╔══██║
███████╗██║  ██║██║     ██║     ██║  ██║
╚══════╝╚═╝  ╚═╝╚═╝     ╚═╝     ╚═╝  ╚═╝

Welcome to Zappa!

Zappa is a system for running server-less Python web applications on AWS Lambda and AWS API Gateway.
This `init` command will help you create and configure your new Zappa deployment.
Let's get started!

Your Zappa configuration can support multiple production stages, like 'dev', 'staging', and 'production'.
What do you want to call this environment (default 'dev'):

AWS Lambda and API Gateway are only available in certain regions. Let's check to make sure you have a profile set up in one that will work.
We found the following profiles: default, and personal. Which would you like us to use? (default 'default'): personal

Your Zappa deployments will need to be uploaded to a private S3 bucket.
If you don't have a bucket yet, we'll create one for you too.
What do you want to call your bucket? (default 'zappa-zwkn4ys6w'):

It looks like this is a Flask application.
What's the modular path to your app's function?
This will likely be something like 'your_module.app'.
We discovered: api.api.app, api..ipynb_checkpoints.api-checkpoint.app
Where is your app's function? (default 'api.api.app'): api.api.app

You can optionally deploy to all available regions in order to provide fast global service.
If you are using Zappa for the first time, you probably don't want to do this!
Would you like to deploy this application globally? (default 'n') [y/n/(p)rimary]:

Okay, here's your zappa_settings.json:

{
    "dev": {
        "aws_region": "us-west-1",
        "app_function": "api.api.app",
        "profile_name": "personal",
        "project_name": "ds-production",
        "runtime": "python3.7",
        "s3_bucket": "grehg-zappa-west1",
        "slim_handler": true
    }
}

From there, we’re ready to deploy to AWS Lambda:

zappa deploy dev

This will give you output showing your code and dependencies being packaged up and uploaded.

If the above command gives you a

No such file or directory

error, then save your dependencies into a requirements.txt, deactivate your current venv, delete it, create a new one, and reinstall all your dependencies from your requirements.txt. This issue is described here.

Otherwise you should get some output that looks like:

Additionally, it’s good to know that AWS Lambda limits a deployment package to 250 MB unzipped. While our code is negligible in size, the packages we depend on are quite heavy. If you run into this limit, Zappa has a slim_handler setting you can add to your zappa_settings.json that raises the limit to about 500 MB. It works by uploading only a small handler to Lambda, storing the full project archive in S3, and downloading it on boot.
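To get a feel for whether you are near the limit before deploying, you can measure how much disk space your installed packages take. This is a rough sketch: directory_size_bytes and needs_slim_handler are hypothetical helpers, the site-packages path in the usage comment is an assumption about your venv layout, and the result only approximates what Zappa actually packages.

```python
import os

# Lambda's limit for an unzipped deployment package (250 MB).
LAMBDA_UNZIPPED_LIMIT_BYTES = 250 * 1024 * 1024


def directory_size_bytes(path):
    """Sum the sizes of all regular files below `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def needs_slim_handler(path, limit=LAMBDA_UNZIPPED_LIMIT_BYTES):
    """Return True if the files below `path` exceed `limit` bytes."""
    return directory_size_bytes(path) > limit


# Usage (the path is an example; adjust it to your environment):
# print(needs_slim_handler("venv/lib/python3.7/site-packages"))
```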

Also, since our API has no route on /, ignore the following error if you get it:

Error: Warning! Status check on the deployed lambda failed. A GET request to ‘/’ yielded a 502 response code.

Getting Your Deployed API URL

Described here with photos

  1. Go to https://console.aws.amazon.com/apigateway
  2. Select the API you deployed to AWS Lambda.
  3. Select Stages in the left side panel to see the Invoke URL.
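The console steps above can also be scripted. Here is a sketch using boto3’s API Gateway client. invoke_url and find_invoke_urls are hypothetical helpers, and the lookup assumes the API is named after the project and stage (for example ds-production-dev), which you should verify in the console.

```python
def invoke_url(rest_api_id, region, stage):
    """Build the invoke URL API Gateway serves for a REST API stage."""
    return f"https://{rest_api_id}.execute-api.{region}.amazonaws.com/{stage}"


def find_invoke_urls(api_name, region, stage, profile_name="personal"):
    """Look up REST APIs called `api_name` and return their invoke URLs.

    This makes a real AWS call. Only the first page of results (up to
    500 APIs) is inspected.
    """
    import boto3  # imported here so invoke_url works without boto3

    session = boto3.Session(profile_name=profile_name, region_name=region)
    client = session.client("apigateway")
    apis = client.get_rest_apis(limit=500)["items"]
    return [
        invoke_url(api["id"], region, stage)
        for api in apis
        if api["name"] == api_name
    ]


# Usage (the API name is an assumption, see above):
# print(find_invoke_urls("ds-production-dev", "us-east-1", "dev"))
```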

Hitting Our Production API

Just like we demonstrated in the Data Science From Scratch To Production MVP Style: API blog post, we’ll use a simple curl command to make an HTTP GET request to our live API.

curl --request GET -H "Content-Type: application/json" --data '{"temperature_celsius": 5.004}' "https://dl8rikuiqc.execute-api.us-east-1.amazonaws.com/dev/predict"
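If you would rather call the API from Python, the same request can be sketched with the standard library. build_predict_request and predict are hypothetical helper names; the URL and payload are the ones from the curl command above.

```python
import json
from urllib.request import Request, urlopen

API_URL = "https://dl8rikuiqc.execute-api.us-east-1.amazonaws.com/dev/predict"


def build_predict_request(url, payload):
    """Build a GET request carrying `payload` as a JSON body, like the curl call."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="GET",  # urllib would default to POST when data is given
    )


def predict(url, payload, timeout=30):
    """Send the request and return the response body as text (real HTTP call)."""
    with urlopen(build_predict_request(url, payload), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (hits the live API):
# print(predict(API_URL, {"temperature_celsius": 5.004}))
```

Sending a body with a GET request mirrors the curl command, but not every HTTP client or proxy passes such a body along, so a POST route may be a more robust choice in a real service.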

Viewing API Logs

The Easy Way

$ zappa tail will show you the latest logs for your Lambda function

The Hard Way

Zappa makes viewing our AWS Lambda logs easy by automatically configuring them to be viewable in AWS CloudWatch. In the CloudWatch console, just click Logs and select the Log Group matching the name of the Lambda function you deployed.
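The same logs can be read programmatically through the CloudWatch Logs API. This is a sketch under a few assumptions: lambda_log_group, start_time_ms, and log_messages_since are hypothetical helpers, and the function name is assumed to be the project name plus the stage (ds-production-dev here), which matches the names in the Zappa output but is worth checking in the console.

```python
import time


def lambda_log_group(project_name, stage):
    """Lambda writes to a log group named /aws/lambda/<function name>."""
    return f"/aws/lambda/{project_name}-{stage}"


def start_time_ms(minutes_back, now=None):
    """Return the epoch timestamp in milliseconds `minutes_back` minutes ago."""
    now = time.time() if now is None else now
    return int((now - minutes_back * 60) * 1000)


def log_messages_since(project_name, stage, region, minutes_back=10,
                       profile_name="personal", limit=50):
    """Fetch up to `limit` log messages from the last `minutes_back` minutes.

    This makes a real AWS call and reads only the first page of results.
    """
    import boto3  # imported here so the helpers above work without boto3

    session = boto3.Session(profile_name=profile_name, region_name=region)
    logs = session.client("logs")
    response = logs.filter_log_events(
        logGroupName=lambda_log_group(project_name, stage),
        startTime=start_time_ms(minutes_back),
        limit=limit,
    )
    return [event["message"] for event in response["events"]]


# Usage:
# for line in log_messages_since("ds-production", "dev", "us-east-1"):
#     print(line, end="")
```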

Deploying An Update

Zappa makes it easy to deploy an update to AWS Lambda. All we have to do is run:

zappa update dev
Calling update for stage dev..
Downloading and installing dependencies..
- scipy==1.4.1: Using locally cached manylinux wheel
- pandas==1.0.1: Using locally cached manylinux wheel
- numpy==1.18.1: Using locally cached manylinux wheel
- markupsafe==1.1.1: Downloading
Packaging project as gzipped tarball.
You are using pip version 19.0.3, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Downloading and installing dependencies..
Packaging project as zip.
Uploading ds-production-dev-1586805569.tar.gz (272.4MiB)..
Updating Lambda function code..
Updating Lambda function configuration..
Uploading ds-production-dev-template-1586806171.json (1.6KiB)..
Deploying API Gateway..
Unscheduled ds-production-dev-zappa-keep-warm-handler.keep_warm_callback.
Scheduled ds-production-dev-zappa-keep-warm-handler.keep_warm_callback with expression rate(4 minutes)!

Moving Forward

As stated at the start of this series, we focused purely on an MVP style. Regardless, this is a non-exhaustive list of improvements we could make:

Thank You

First off, thank you to any reader who made it to the end of this series.

Additionally, I’d like to thank Max Lei who originally opened my eyes to the power of scikit-learn’s pipelines and has been a wonderful mentor for years.