This tutorial explains how you can use Amazon S3 storage for easy access to datasets.
We'll load data from Amazon S3 directly into a pandas DataFrame.
For this tutorial we'll use a public dataset.
See the AWS Open Data Registry for more information on public datasets.
All you'll need is the name of the bucket:
publicBucket = "covid19-lake" # the bucket reference
Boto3 is AWS's own SDK for programmatic access to S3.
We'll first use Boto3 to connect to S3.
import boto3
from botocore import UNSIGNED # You'll need this to connect as anonymous. You could also pass your access key and secret
from botocore.client import Config
import pandas as pd
We need to instantiate a boto3 client and either pass it our credentials or indicate that we want anonymous access.
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))
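If you'd rather use your own AWS credentials instead of anonymous access, you can pass them to the client directly. A minimal sketch, where the key values are placeholders (in practice, prefer environment variables or a shared credentials file over hard-coding secrets):
# Sketch: connect with explicit credentials instead of anonymous access
# (placeholder values; never hard-code real secrets)
authenticated_client = boto3.client(
    's3',
    aws_access_key_id='YOUR_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
)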
Let's verify the access control list for this bucket
acl = s3_client.get_bucket_acl(Bucket=publicBucket)
owner = acl["Owner"]
grants = acl["Grants"]
print("Bucket owned by ", owner)
print("Bucket grants:")
for grant in grants:
    grantee = grant["Grantee"]
    permission = grant["Permission"]
    print("Grantee=", grantee, ", Permission=", permission)
Helper function to list the objects in a bucket:
We'll use this function to browse through the bucket contents
def list_bucket_objects(**kwargs):
    response = s3_client.list_objects_v2(**kwargs)
    # the token to pass back in to fetch the next page of results
    continuation_token = response.get("NextContinuationToken")
    # "Contents" is absent when nothing matches, so default to an empty list
    for obj in response.get("Contents", []):
        key = obj.get("Key")
        size = obj.get("Size")
        storageclass = obj.get("StorageClass")
        print("Object found with key=", key, ", size=", size, ", S3 storage class=", storageclass)
    return continuation_token
Check what's inside the bucket:
args = dict(Bucket=publicBucket, MaxKeys=10)
continuation = list_bucket_objects(**args)
The number of objects is huge, so let's browse 10 objects at a time.
If you keep re-running the next code block, the listing will advance one page of 10 objects at a time.
args["ContinuationToken"] = continuation
continuation = list_bucket_objects(**args)
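Passing continuation tokens around by hand works, but boto3 also ships a built-in paginator that handles this loop for you. A minimal sketch of the same listing using get_paginator (the PaginationConfig values are just illustrative):
# Sketch: let boto3 manage continuation tokens via its built-in paginator
paginator = s3_client.get_paginator("list_objects_v2")
pages = paginator.paginate(Bucket=publicBucket, PaginationConfig={"MaxItems": 20, "PageSize": 10})
for page in pages:
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])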
Of course, it makes no sense to page through tens of thousands of files 10 at a time. We can do this a bit more efficiently.
Let's add a prefix.
Prefixes give us a way of pre-filtering the objects in the bucket.
For more information on this public dataset, see the covid19-lake datasets.
args = dict(Bucket=publicBucket, MaxKeys=50, Prefix='static-datasets')
list_bucket_objects(**args)
args = dict(Bucket=publicBucket, MaxKeys=50, Prefix='rearc-covid-19')
list_bucket_objects(**args)
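Combining a Prefix with a Delimiter lets you browse the bucket like a directory tree: S3 groups everything below the delimiter into CommonPrefixes. A minimal sketch listing the top-level "folders" of the bucket:
# Sketch: list the top-level "folders" in the bucket using a delimiter
response = s3_client.list_objects_v2(Bucket=publicBucket, Delimiter="/")
for prefix in response.get("CommonPrefixes", []):
    print(prefix["Prefix"])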
Let's load one of these files into a pandas DataFrame.
obj = s3_client.get_object(Bucket=publicBucket, Key="rearc-covid-19-testing-data/csv/us_daily/us_daily.csv")
df = pd.read_csv(obj.get("Body"))
display(df)
OK, we can clearly see that this file contains daily updated counts of positive, negative, and pending tests, as well as hospitalized, recovered, and deceased patients.
A simpler way to load data from S3 is the s3fs library.
Internally, pandas replaced boto3 with s3fs for loading data from S3.
# import libraries
import os
from s3fs.core import S3FileSystem
Just connect to S3 and treat it as a file system
s3 = S3FileSystem(anon=True)
List files:
s3.ls(path=publicBucket, detail=False)
path = '/covid19-lake/rearc-covid-19-testing-data'
s3.ls(path=path, detail=False)
Disk usage:
s3.disk_usage(path=path, total=False)
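Because s3fs exposes S3 as a file system, familiar operations like glob also work for finding files by pattern. A minimal sketch (the pattern is just an example):
# Sketch: find all CSV files under the testing-data prefix
csv_files = s3.glob('covid19-lake/rearc-covid-19-testing-data/csv/**/*.csv')
print(csv_files)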
Read a comma-delimited file directly into a pandas DataFrame:
file = 'covid19-lake/rearc-covid-19-testing-data/csv/us_daily/us_daily.csv'
df = pd.read_csv(s3.open(file, mode='rb'))
display(df)
Ok, same result. Looks good.
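Because pandas uses s3fs under the hood, recent pandas versions (1.2 and up) can even read the S3 URL directly, without opening the file yourself; storage_options is passed straight through to s3fs:
# Sketch: let pandas open the S3 object itself (assumes pandas >= 1.2)
df = pd.read_csv('s3://' + file, storage_options={'anon': True})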
We'll use Altair to display the data
import altair as alt
# by default altair will only render a visualization if the number of records <=5000
# if you want to disable that behaviour uncomment next line
# alt.data_transformers.disable_max_rows()
First, check the datatypes of the DataFrame:
df.dtypes
We'll convert the int64 date column to an actual datetime:
df["timestamp"] = pd.to_datetime(df["date"], format="%Y%m%d")
Set the index of the DataFrame to a datetime index:
df.set_index("timestamp", drop=False, inplace=True)
display(df)
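With a datetime index in place, time-based operations become straightforward. As a minimal sketch, here is how you could resample the daily counts to weekly averages (the column selection is just illustrative):
# Sketch: resample daily counts to weekly averages via the datetime index
weekly = df[['positive', 'negative', 'death', 'hospitalized']].resample('W').mean()
display(weekly)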
So let's plot the data on a timeline.
Let's look at the trends in positive/negative tests and in recovered, deceased, and hospitalized patients for COVID-19 in the USA.
source = df
base = alt.Chart(source).properties(width=1200, height=400).transform_fold(
    ['recovered', 'death', 'hospitalized', 'positive'], as_=['category', 'counts'])
area = base.mark_area(opacity=0.5).encode(
    alt.X('timestamp:T'),
    alt.Y('counts:Q', stack=None),
    alt.Color('category:N')
)
mark = base.mark_point(color='red', shape='circle').encode(x='timestamp:T', y='positive', tooltip=['timestamp', 'positive', 'negative', 'recovered', 'death','hospitalized'])
display(area+mark)
Let's also plot the daily increase numbers.
source = df
base = alt.Chart(source).properties(width=1200, height=400).transform_fold(
    ['deathIncrease', 'hospitalizedIncrease', 'positiveIncrease'], as_=['category', 'growth'])
area = base.mark_area(opacity=0.5).encode(
    alt.X('timestamp:T'),
    alt.Y('growth:Q', stack=None),
    alt.Color('category:N')
)
mark = base.mark_point(color='red', shape='circle').encode(x='timestamp:T', y='positiveIncrease', tooltip=['timestamp', 'deathIncrease', 'hospitalizedIncrease', 'positiveIncrease'])
display(area+mark)
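If you want to keep a visualization outside the notebook, an Altair chart can be saved as a standalone HTML file (the filename is just an example):
# Sketch: save the combined chart as a self-contained HTML file
(area + mark).save('covid_increase_chart.html')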