Comprehensive Guide to Download Files From S3 with Python

Using the AWS SDK for Python can be confusing. First of all, there seem to be two different SDKs (Boto and Boto3). Even after you choose one, each offers multiple ways to authenticate and connect to AWS services. Googling for solutions can quickly become confusing because you will find many variations of code examples. If you are just getting into AWS, this can be intimidating. In this post, I will explain the differences and give you code examples that work, using the example of downloading files from S3.

Boto is the older version of the Python AWS SDK; Boto3 is the newer version. The name Boto3 does not mean it is only for Python 3: it also works with Python 2.6 and 2.7, and all the code provided in this post works for both Python versions. Because Boto3 was rewritten from the ground up, the code you write with Boto3 is different from Boto. You can read further about the changes made in Boto3 here.

Generally speaking, you should use Boto3 if you are writing new programs. It is the new and improved version of Boto, and I personally find it more efficient and easier to use. You may still need to know Boto to maintain legacy code.

Before you start, you need to install boto and boto3. All the code works for both Python 2.7 and 3, and you can use either SDK.

pip install boto
pip install boto3
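
To make sure both packages installed correctly, a quick sanity check like the one below prints the installed versions (both packages expose __version__):

import boto
import boto3

# Print the installed version of each SDK
print('boto version: {}'.format(boto.__version__))
print('boto3 version: {}'.format(boto3.__version__))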

(1) Downloading S3 Files With Boto3

Boto3 provides a super-easy way to configure credentials and access AWS resources. To connect to S3, you can create either an S3 resource or an S3 client. Resource is an object-oriented interface to AWS that provides a higher-level abstraction, while Client is a low-level interface whose methods map close to 1:1 with the service APIs.

1-1. Using Resource

You can create an S3 resource with boto3.resource('s3'). It will use the AWS credentials configured in your AWS CLI (see here). This is perfect when you want to share the code between different environments, as the credentials come from the environment and you do not need to hardcode them. Once you have the resource, create the bucket object and call its download_file method. This page shows the different S3 methods that you can use with the resource.

import boto3

def download_file_with_resource(bucket_name, key, local_path):
    # Create the S3 resource using the default credential chain
    s3 = boto3.resource('s3')
    # Download the object identified by key into local_path
    s3.Bucket(bucket_name).download_file(key, local_path)
    print('Downloaded file with boto3 resource')

bucket_name = '<your bucket name>'
key = '<folder…/filename>'
local_path = '<e.g. ./log.txt>'

download_file_with_resource(bucket_name, key, local_path)

Here is a troubleshooting tip for the 404 error: An error occurred (404) when calling the HeadObject operation: Not Found.

You first need to make sure you have the correct bucket and key names. If you still get this error after triple-checking them, make sure your key does not start with ‘/’. For example, key = '/2018-02-27/log.txt' should be key = '2018-02-27/log.txt' in the code above.
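
If you want to handle this error in code rather than just eyeballing the key, botocore surfaces it as a ClientError. Here is a minimal sketch, reusing the bucket_name, key, and local_path placeholders from above:

import boto3
import botocore

s3 = boto3.resource('s3')
try:
    s3.Bucket(bucket_name).download_file(key, local_path)
except botocore.exceptions.ClientError as e:
    # download_file raises ClientError with code 404 when the object is missing
    if e.response['Error']['Code'] == '404':
        print('Object not found: check the bucket, the key, and any leading /')
    else:
        raise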

1-2. Using Client

With Client, you can specify the credentials explicitly. As opposed to Boto, you do not need to specify the region; Boto3 handles it for you. You can also configure multiple credential profiles in the AWS CLI and choose one to connect to a non-default S3 account with the client (see the sketch after the example below). See this documentation for further details on configuring credentials. This page shows the list of methods you can use with the client. Note that the client’s download_file takes different arguments from the resource version: with the client, the bucket name is passed as the first argument.

import boto3

def download_file_with_client(access_key, secret_key, bucket_name, key, local_path):
    # Create the S3 client with explicit credentials
    client = boto3.client('s3',
                          aws_access_key_id=access_key,
                          aws_secret_access_key=secret_key)
    # With the client, the bucket name is the first argument
    client.download_file(bucket_name, key, local_path)
    print('Downloaded file with boto3 client')

access_key = '<Access Key>'
secret_key = '<Secret Key>'
bucket_name = '<your bucket name>'
key = '<folder…/filename>'
local_path = '<e.g. ./log.txt>'

download_file_with_client(access_key, secret_key, bucket_name, key, local_path)
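
If you have multiple credential profiles configured in the AWS CLI, you can also pick one through a Session instead of passing keys around. A minimal sketch, assuming a profile named 'my-profile' exists in your AWS configuration (the profile name here is just a placeholder):

import boto3

# 'my-profile' is a placeholder; use a profile from your ~/.aws/credentials
session = boto3.Session(profile_name='my-profile')
client = session.client('s3')
client.download_file(bucket_name, key, local_path)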

(2) Downloading S3 Files With Boto

Boto requires more parameters than Boto3, and you need to know a little more about the package to make it work.

Here are some common errors and how to handle them.

Error 1:

ssl.CertificateError: hostname 'your.bucket.name.s3.amazonaws.com' doesn't match either of '*.s3.amazonaws.com', 's3.amazonaws.com'

You will get this error when your bucket name contains ‘.’ and you do not specify OrdinaryCallingFormat() in connect_s3(). With the default calling format, the bucket name becomes part of the hostname, and the extra dots no longer match the wildcard SSL certificate *.s3.amazonaws.com.

Error 2:

boto.exception.S3ResponseError: S3ResponseError: 301 Moved Permanently

You will get this error when you do not specify the host with the region and your bucket is not in the default US region: Boto then uses s3.amazonaws.com as the host name, which assumes the bucket was created in the default US region. Alternatively, you can use boto.s3.connect_to_region() to specify the region, as shown in 2-3 below.

Here is a further reference on using S3 with Boto.

2-1. Using boto.connect_s3() and default credentials

Let’s start with the default credentials configured with the AWS CLI. This is the equivalent of the first Boto3 example; comparing the two, you can see how much simpler Boto3 is.

import boto
import boto.s3.connection

def download_data(region, bucket_name, key, local_path):
    # Build the region-specific host; OrdinaryCallingFormat keeps the
    # bucket name out of the hostname (needed for bucket names with dots)
    conn = boto.connect_s3(
        host='s3-{}.amazonaws.com'.format(region),
        calling_format=boto.s3.connection.OrdinaryCallingFormat())

    bucket = conn.get_bucket(bucket_name)
    key_obj = bucket.get_key(key)
    key_obj.get_contents_to_filename(local_path)
    print('Downloaded file {} to {}'.format(key, local_path))

region = '<Region e.g. ap-southeast-2>'
bucket_name = '<your bucket name>'
key = '<folder…/filename>'
local_path = '<e.g. ./log.txt>'

download_data(region, bucket_name, key, local_path)

2-2. Using boto.connect_s3() with custom credentials

Of course, you can also configure the credentials within the code.

import boto
import boto.s3.connection

def download_data_connect_s3(access_key, secret_key, region, bucket_name, key, local_path):
    # Pass the credentials explicitly instead of using the AWS CLI configuration
    conn = boto.connect_s3(
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        host='s3-{}.amazonaws.com'.format(region),
        calling_format=boto.s3.connection.OrdinaryCallingFormat())

    bucket = conn.get_bucket(bucket_name)
    key_obj = bucket.get_key(key)
    key_obj.get_contents_to_filename(local_path)
    print('Downloaded file {} to {}'.format(key, local_path))

region = '<Region e.g. ap-southeast-2>'
access_key = '<Access Key>'
secret_key = '<Secret Key>'
bucket_name = '<your bucket name>'
key = '<folder…/filename>'
local_path = '<e.g. ./log.txt>'

download_data_connect_s3(access_key, secret_key, region, bucket_name, key, local_path)

2-3. Using boto.s3.connect_to_region()

The main difference from connect_s3() is that this function takes the region as an argument, whereas with connect_s3() the region had to be embedded in the host URL. This makes it a slightly simpler way to specify the region.

import boto
import boto.s3.connection

def download_data_connect_to_region(region, access_key, secret_key, bucket_name, key, local_path):
    '''This uses the connect_to_region() function in boto'''
    # connect_to_region takes the region name directly instead of a host URL
    conn = boto.s3.connect_to_region(
        region_name=region,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        calling_format=boto.s3.connection.OrdinaryCallingFormat())

    bucket = conn.get_bucket(bucket_name)
    key_obj = bucket.get_key(key)
    key_obj.get_contents_to_filename(local_path)
    print('Downloaded file {} to {}'.format(key, local_path))

region = '<Region e.g. ap-southeast-2>'
access_key = '<Access Key>'
secret_key = '<Secret Key>'
bucket_name = '<your bucket name>'
key = '<folder…/filename>'
local_path = '<e.g. ./log.txt>'

download_data_connect_to_region(region, access_key, secret_key, bucket_name, key, local_path)

Hope this clarifies a few things!

If you are interested in moving data to S3 and Redshift, check out this post: Data Engineering in S3 and Redshift with Python.
