Working with S3 objects

This tutorial explains how you can use Amazon S3 storage for easy access to datasets.
We'll load data from Amazon S3 directly into a pandas DataFrame.
For this tutorial we'll use a public dataset.

Use a publicly available dataset

See the AWS Open Data Registry for more information on public datasets.
What you'll need:

  • The bucket reference
In [1]:
publicBucket = "covid19-lake" # the bucket reference

Python - Boto3

Boto3 is AWS's own SDK for programmatic access to S3.
We'll first use Boto3 to connect to S3.

In [2]:
import boto3
from botocore import UNSIGNED # needed for anonymous access; alternatively you could pass your access key and secret
from botocore.client import Config
import pandas as pd

We need to instantiate a boto3 client and either pass it our credentials or indicate that we want anonymous access.

In [3]:
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))
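If you do have AWS credentials, a signed client would look like the sketch below. The key values are placeholders, not real credentials:

# Hedged alternative: signed client with explicit credentials (placeholders shown)
s3_signed = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",          # placeholder
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",  # placeholder
)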

Let's check the access control list (ACL) for this bucket; the AllUsers READ grant in the output below is what makes anonymous access possible.

In [4]:
acl = s3_client.get_bucket_acl(Bucket=publicBucket)
owner = acl["Owner"]
grants = acl["Grants"]
print("Bucket owned by ", owner)
print("Bucket grants:")
for grant in grants:
  grantee = grant["Grantee"]
  permission = grant["Permission"]
  print("Grantee=", grantee, ", Permission=", permission)
Bucket owned by  {'ID': '052611c7635e5a88f6e2b6e8b9ebeb8feabfec78a155b5499897dcaa90e731b5'}
Bucket grants:
Grantee= {'Type': 'Group', 'URI': 'http://acs.amazonaws.com/groups/s3/LogDelivery'} , Permission= WRITE
Grantee= {'Type': 'Group', 'URI': 'http://acs.amazonaws.com/groups/s3/LogDelivery'} , Permission= READ_ACP
Grantee= {'Type': 'Group', 'URI': 'http://acs.amazonaws.com/groups/global/AllUsers'} , Permission= READ
Grantee= {'Type': 'Group', 'URI': 'http://acs.amazonaws.com/groups/global/AllUsers'} , Permission= READ_ACP
Grantee= {'ID': '68862915a0b1ecf44a6ecb256b2eb4df3c4b169bbf5d73bf638bdd717ba1dcf0', 'Type': 'CanonicalUser'} , Permission= FULL_CONTROL
Grantee= {'ID': '052611c7635e5a88f6e2b6e8b9ebeb8feabfec78a155b5499897dcaa90e731b5', 'Type': 'CanonicalUser'} , Permission= READ
Grantee= {'ID': '052611c7635e5a88f6e2b6e8b9ebeb8feabfec78a155b5499897dcaa90e731b5', 'Type': 'CanonicalUser'} , Permission= READ_ACP

A helper function to list the objects in a bucket:
We'll use this function to browse through the bucket contents.

In [5]:
def list_bucket_objects(**kwargs):
  # list_objects_v2 returns at most MaxKeys objects per call;
  # NextContinuationToken (if present) lets us fetch the next page
  response = s3_client.list_objects_v2(**kwargs)
  continuation_token = response.get("NextContinuationToken")
  for obj in response.get("Contents", []):  # default to [] in case the listing is empty
    key = obj.get("Key")
    size = obj.get("Size")
    storageclass = obj.get("StorageClass")
    print("Object found with key=", key, ", size=", size, ", S3 storage class=", storageclass)
  return continuation_token

Boto3 - browse the contents of a bucket

Check what's inside the bucket:

In [6]:
args = dict(Bucket=publicBucket, MaxKeys=10)
continuation = list_bucket_objects(**args)
Object found with key= alleninstitute/CORD19/comprehendmedical/ , size= 0 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/comprehendmedical/comprehend_medical.json , size= 14136397 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/json/metadata/part-00000-9e786a1f-46af-4351-be48-d1a0129c76be-c000.json , size= 78485736 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/json/metadata/part-00001-9e786a1f-46af-4351-be48-d1a0129c76be-c000.json , size= 78427247 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/json/metadata/part-00002-9e786a1f-46af-4351-be48-d1a0129c76be-c000.json , size= 78473799 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/json/metadata/part-00003-9e786a1f-46af-4351-be48-d1a0129c76be-c000.json , size= 36557651 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/raw/ , size= 0 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/raw/2020_04_28/biorxiv_medrxiv/0015023cc06b5362d332b3baf348d11567ca2fbb.json , size= 72983 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/raw/2020_04_28/biorxiv_medrxiv/00340eea543336d54adda18236424de6a5e91c9d.json , size= 66399 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/raw/2020_04_28/biorxiv_medrxiv/004f0f8bb66cf446678dc13cf2701feec4f36d76.json , size= 12712 , S3 storage class= STANDARD

The number of objects is huge, so let's browse them 10 at a time.
If you keep re-running the next code block, you'll page through the objects 10 per call.

In [9]:
args["ContinuationToken"] = continuation
continuation = list_bucket_objects(**args)
Object found with key= alleninstitute/CORD19/raw/2020_04_28/biorxiv_medrxiv/06d12dc5ac32d82387c65370d0a600e13059122d.json , size= 56034 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/raw/2020_04_28/biorxiv_medrxiv/07e833d0917cace550853f72923856d0fe1a7120.json , size= 58585 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/raw/2020_04_28/biorxiv_medrxiv/080660f20f078c10524f6186bca263327094acbb.json , size= 65686 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/raw/2020_04_28/biorxiv_medrxiv/08660499ee722a74043f8417faee3e1eeb9d0f5f.json , size= 237587 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/raw/2020_04_28/biorxiv_medrxiv/08a22278486e12768ce186677a6a89663d24586f.json , size= 55368 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/raw/2020_04_28/biorxiv_medrxiv/090b6c8b3df30bc248221869f673a2d970caa1b9.json , size= 40371 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/raw/2020_04_28/biorxiv_medrxiv/091a8e9a61e19e88caeb039f0e3888d111b20439.json , size= 80963 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/raw/2020_04_28/biorxiv_medrxiv/09b6706748f0c1ae0da436ac2dfac9052b84e4ea.json , size= 85339 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/raw/2020_04_28/biorxiv_medrxiv/09c9fcabc66a106e01ef42247cbd86b6d85bd67f.json , size= 127138 , S3 storage class= STANDARD
Object found with key= alleninstitute/CORD19/raw/2020_04_28/biorxiv_medrxiv/09ec8daa8e32168d92d05b86de1784c639685fb4.json , size= 156719 , S3 storage class= STANDARD
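As an aside, boto3 also provides built-in paginators that manage the continuation token for you. A minimal sketch; the PageSize and MaxItems values here are arbitrary choices for illustration:

# Hedged sketch: let boto3's paginator handle continuation tokens
paginator = s3_client.get_paginator("list_objects_v2")
pages = paginator.paginate(
    Bucket=publicBucket,
    PaginationConfig={"PageSize": 10, "MaxItems": 20},  # illustrative limits
)
for page in pages:
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])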

Of course, it makes no sense to browse through tens of thousands of files 10 at a time. We can do this a bit more efficiently.

Let's add a prefix.
Prefixes give us a way of pre-filtering the objects in the bucket.
For more information on this public dataset, see the covid19-lake datasets.

In [10]:
args = dict(Bucket=publicBucket, MaxKeys=50, Prefix='static-datasets')
list_bucket_objects(**args)
Object found with key= static-datasets/csv/CountyPopulation/County_Population.csv , size= 146906 , S3 storage class= STANDARD
Object found with key= static-datasets/csv/countrycode/CountryCodeQS.csv , size= 8622 , S3 storage class= STANDARD
Object found with key= static-datasets/csv/state-abv/states_abv.csv , size= 665 , S3 storage class= STANDARD
Object found with key= static-datasets/json/CountyPopulation/part-00000-efc1e925-701b-4432-98be-7d36b9d3ec7e-c000.json , size= 359380 , S3 storage class= STANDARD
Object found with key= static-datasets/json/countrycode/part-00000-d80c811d-343b-4f60-ad30-624239b02074-c000.json , size= 31820 , S3 storage class= STANDARD
Object found with key= static-datasets/json/state-abv/part-00000-0faa317c-1e4c-43d2-ad87-fd5f5fc610f3-c000.json , size= 2125 , S3 storage class= STANDARD
In [11]:
args = dict(Bucket=publicBucket, MaxKeys=50, Prefix='rearc-covid-19')
list_bucket_objects(**args)
Object found with key= rearc-covid-19-nyt-data-in-usa/csv/us-counties/us-counties.csv , size= 11043235 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-nyt-data-in-usa/csv/us-states/us-states.csv , size= 209197 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-nyt-data-in-usa/json/us-counties/part-00000-b286de62-7a95-4f2f-a7ea-594a94cea9d2-c000.json , size= 28369100 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-nyt-data-in-usa/json/us-states/part-00000-54f699bd-38d1-490f-9e47-af853a3810df-c000.json , size= 540873 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-prediction-models/csv/county-predictions/county-predictions.csv , size= 770751 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-prediction-models/csv/severity-index/severity-index.csv , size= 794287 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-prediction-models/json/county-predictions/part-00000-1007846c-ba6d-4bc2-85df-0a938983f507-c000.json , size= 3129921 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-prediction-models/json/severity-index/part-00000-a47732d5-04ae-4d35-a080-4e662ec4ce6b-c000.json , size= 2592695 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-testing-data/csv/states_daily/states_daily.csv , size= 1358016 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-testing-data/csv/us-total-latest/us.csv , size= 471 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-testing-data/csv/us_daily/us_daily.csv , size= 30536 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-testing-data/json/states_daily/part-00000-6e997909-89c5-4f54-b8ce-d30a3f1f4546-c000.json , size= 2359835 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-testing-data/json/us-total-latest/part-00000-1fa3af7b-f025-4571-a7bc-c9ca2eefb46f-c000.json , size= 544 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-testing-data/json/us_daily/part-00000-37a1a406-3909-4b0c-b309-a314187a2e1b-c000.json , size= 63776 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-world-cases-deaths-testing/csv/covid-19-world-cases-deaths-testing.csv , size= 4794390 , S3 storage class= STANDARD
Object found with key= rearc-covid-19-world-cases-deaths-testing/json/part-00000-c43699b6-0612-4add-8d93-a33e0cd3f00c-c000.json , size= 8317970 , S3 storage class= STANDARD
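As a related tip, combining Prefix with a Delimiter of '/' makes list_objects_v2 behave more like a directory listing: the matching "subfolders" come back under CommonPrefixes. A minimal sketch listing the bucket's top-level prefixes:

# Hedged sketch: Delimiter turns shared key prefixes into "folders"
response = s3_client.list_objects_v2(Bucket=publicBucket, Delimiter="/")
for cp in response.get("CommonPrefixes", []):
    print(cp["Prefix"])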

Pandas through Boto3

Let's load one of these files into a pandas DataFrame.

In [12]:
obj = s3_client.get_object(Bucket=publicBucket, Key="rearc-covid-19-testing-data/csv/us_daily/us_daily.csv")
df = pd.read_csv(obj.get("Body"))
display(df)
date states positive negative pending hospitalizedCurrently hospitalizedCumulative inIcuCurrently inIcuCumulative onVentilatorCurrently ... lastModified total totalTestResults posNeg deathIncrease hospitalizedIncrease negativeIncrease positiveIncrease totalTestResultsIncrease hash
0 20200628 56 2540983 28447030 2198.0 32117.0 240156.0 5230.0 10473.0 2077.0 ... 2020-06-28T00:00:00Z 30990211 30988013 30988013 273 580 544208 42161 586369 dc9b104a6101a2b1d147dd004970493f3faef554
1 20200627 56 2498822 27902822 2186.0 32220.0 239576.0 5296.0 10415.0 2159.0 ... 2020-06-27T00:00:00Z 30403830 30401644 30401644 506 1057 547406 43471 590877 c76401840e79b9f3870ca039962ca0289d948dcf
2 20200626 56 2455351 27355416 2201.0 31423.0 238519.0 5263.0 10334.0 2075.0 ... 2020-06-26T00:00:00Z 29812968 29810767 29810767 619 1526 558574 44373 602947 d28d2902aab75c2b63f7584fd72a5e02f160fd0a
3 20200625 56 2410978 26796842 2133.0 31532.0 236993.0 5305.0 10257.0 2214.0 ... 2020-06-25T00:00:00Z 29209953 29207820 29207820 2500 1257 598526 39061 637587 f29e38890a88c4b4d5770436f86bcd1c326ca7ac
4 20200624 56 2371917 26198316 2049.0 30826.0 235736.0 5279.0 10173.0 2248.0 ... 2020-06-24T00:00:00Z 28572282 28570233 28570233 722 1310 473722 38706 512428 9fb40b6267ac764e6e112724ce5419555da235c0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
154 20200126 1 2 0 NaN NaN NaN NaN NaN NaN ... 2020-01-26T00:00:00Z 2 2 2 0 0 0 0 0 e1cf59ab48e1cf367c4a6798a508a23d9d36bd18
155 20200125 1 2 0 NaN NaN NaN NaN NaN NaN ... 2020-01-25T00:00:00Z 2 2 2 0 0 0 0 0 bef2a1d5f2a13491e0e0369bbd46c10cdd12973b
156 20200124 1 2 0 NaN NaN NaN NaN NaN NaN ... 2020-01-24T00:00:00Z 2 2 2 0 0 0 0 0 bfffe76fc0b7cf11efe8aecd3cc7b22598d77d61
157 20200123 1 2 0 NaN NaN NaN NaN NaN NaN ... 2020-01-23T00:00:00Z 2 2 2 0 0 0 0 0 cee36ebf3174bf1df0daa36e1e8088a157406fad
158 20200122 1 2 0 NaN NaN NaN NaN NaN NaN ... 2020-01-22T00:00:00Z 2 2 2 0 0 0 0 0 d538c99729d1fee626212d1878a100c1e1204a5f

159 rows × 25 columns

OK, so we can clearly see that this file contains daily updated counts of positive, negative, and pending tests, as well as hospitalized, recovered, and deceased patients.

Python - s3fs

An even simpler way to load data from S3 is the s3fs library.
Pandas itself uses s3fs internally (rather than boto3) when loading data from S3.

In [13]:
# import libraries
from s3fs.core import S3FileSystem

Just connect to S3 and treat it as a file system

In [14]:
s3 = S3FileSystem(anon=True)
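Here too, if you have credentials you could pass them instead of anon=True; a minimal sketch with placeholder values:

# Hedged alternative: signed access (placeholder values, not real credentials)
s3_signed = S3FileSystem(key="YOUR_ACCESS_KEY_ID", secret="YOUR_SECRET_ACCESS_KEY")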

list files

In [15]:
s3.ls(path=publicBucket, detail=False)
Out[15]:
['covid19-lake/alleninstitute',
 'covid19-lake/archived',
 'covid19-lake/cfn',
 'covid19-lake/covid_knowledge_graph',
 'covid19-lake/covidcast',
 'covid19-lake/dashboard.html',
 'covid19-lake/databrowser.html',
 'covid19-lake/enigma-aggregation',
 'covid19-lake/enigma-jhu',
 'covid19-lake/enigma-jhu-timeseries',
 'covid19-lake/enigma-nytimes-data-in-usa',
 'covid19-lake/index.html',
 'covid19-lake/rearc-covid-19-nyt-data-in-usa',
 'covid19-lake/rearc-covid-19-prediction-models',
 'covid19-lake/rearc-covid-19-testing-data',
 'covid19-lake/rearc-covid-19-world-cases-deaths-testing',
 'covid19-lake/rearc-usa-hospital-beds',
 'covid19-lake/safegraph-open-census-data',
 'covid19-lake/static-datasets',
 'covid19-lake/tableau-covid-datahub',
 'covid19-lake/tableau-jhu']
In [16]:
path = '/covid19-lake/rearc-covid-19-testing-data'
s3.ls(path=path, detail=False)
Out[16]:
['covid19-lake/rearc-covid-19-testing-data/csv',
 'covid19-lake/rearc-covid-19-testing-data/json']
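For richer output, s3.ls also accepts detail=True, which (per fsspec conventions) returns metadata dicts instead of bare paths; a minimal sketch:

# Hedged sketch: detail=True yields dicts with fsspec-standard fields
for entry in s3.ls(path=path, detail=True):
    print(entry["name"], entry["type"], entry["size"])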

disk usage

In [17]:
s3.disk_usage(path=path, total=False)
Out[17]:
{'covid19-lake/rearc-covid-19-testing-data/csv/states_daily/states_daily.csv': 1358016,
 'covid19-lake/rearc-covid-19-testing-data/csv/us-total-latest/us.csv': 471,
 'covid19-lake/rearc-covid-19-testing-data/csv/us_daily/us_daily.csv': 30536,
 'covid19-lake/rearc-covid-19-testing-data/json/states_daily/part-00000-6e997909-89c5-4f54-b8ce-d30a3f1f4546-c000.json': 2359835,
 'covid19-lake/rearc-covid-19-testing-data/json/us-total-latest/part-00000-1fa3af7b-f025-4571-a7bc-c9ca2eefb46f-c000.json': 544,
 'covid19-lake/rearc-covid-19-testing-data/json/us_daily/part-00000-37a1a406-3909-4b0c-b309-a314187a2e1b-c000.json': 63776}

Read a comma-delimited file directly into a pandas DataFrame

In [18]:
file = 'covid19-lake/rearc-covid-19-testing-data/csv/us_daily/us_daily.csv'
df = pd.read_csv(s3.open(file, mode='rb'))
display(df)
date states positive negative pending hospitalizedCurrently hospitalizedCumulative inIcuCurrently inIcuCumulative onVentilatorCurrently ... lastModified total totalTestResults posNeg deathIncrease hospitalizedIncrease negativeIncrease positiveIncrease totalTestResultsIncrease hash
0 20200628 56 2540983 28447030 2198.0 32117.0 240156.0 5230.0 10473.0 2077.0 ... 2020-06-28T00:00:00Z 30990211 30988013 30988013 273 580 544208 42161 586369 dc9b104a6101a2b1d147dd004970493f3faef554
1 20200627 56 2498822 27902822 2186.0 32220.0 239576.0 5296.0 10415.0 2159.0 ... 2020-06-27T00:00:00Z 30403830 30401644 30401644 506 1057 547406 43471 590877 c76401840e79b9f3870ca039962ca0289d948dcf
2 20200626 56 2455351 27355416 2201.0 31423.0 238519.0 5263.0 10334.0 2075.0 ... 2020-06-26T00:00:00Z 29812968 29810767 29810767 619 1526 558574 44373 602947 d28d2902aab75c2b63f7584fd72a5e02f160fd0a
3 20200625 56 2410978 26796842 2133.0 31532.0 236993.0 5305.0 10257.0 2214.0 ... 2020-06-25T00:00:00Z 29209953 29207820 29207820 2500 1257 598526 39061 637587 f29e38890a88c4b4d5770436f86bcd1c326ca7ac
4 20200624 56 2371917 26198316 2049.0 30826.0 235736.0 5279.0 10173.0 2248.0 ... 2020-06-24T00:00:00Z 28572282 28570233 28570233 722 1310 473722 38706 512428 9fb40b6267ac764e6e112724ce5419555da235c0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
154 20200126 1 2 0 NaN NaN NaN NaN NaN NaN ... 2020-01-26T00:00:00Z 2 2 2 0 0 0 0 0 e1cf59ab48e1cf367c4a6798a508a23d9d36bd18
155 20200125 1 2 0 NaN NaN NaN NaN NaN NaN ... 2020-01-25T00:00:00Z 2 2 2 0 0 0 0 0 bef2a1d5f2a13491e0e0369bbd46c10cdd12973b
156 20200124 1 2 0 NaN NaN NaN NaN NaN NaN ... 2020-01-24T00:00:00Z 2 2 2 0 0 0 0 0 bfffe76fc0b7cf11efe8aecd3cc7b22598d77d61
157 20200123 1 2 0 NaN NaN NaN NaN NaN NaN ... 2020-01-23T00:00:00Z 2 2 2 0 0 0 0 0 cee36ebf3174bf1df0daa36e1e8088a157406fad
158 20200122 1 2 0 NaN NaN NaN NaN NaN NaN ... 2020-01-22T00:00:00Z 2 2 2 0 0 0 0 0 d538c99729d1fee626212d1878a100c1e1204a5f

159 rows × 25 columns

Ok, same result. Looks good.
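As an aside, recent pandas versions (1.2 and later) can read s3:// URLs directly, using s3fs under the hood via the storage_options parameter; a minimal sketch:

# Hedged sketch: requires pandas >= 1.2 with s3fs installed
df = pd.read_csv(
    "s3://covid19-lake/rearc-covid-19-testing-data/csv/us_daily/us_daily.csv",
    storage_options={"anon": True},  # anonymous access, as above
)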

Optional - do some visualization to let the data 'speak' a bit more

Python - Altair

For more information on Altair see Altair

We'll use Altair to display the data

In [19]:
import altair as alt
# by default altair will only render a visualization if the number of records <=5000
# if you want to disable that behaviour uncomment next line
# alt.data_transformers.disable_max_rows()

First, check the datatypes of the DataFrame

In [20]:
df.dtypes
Out[20]:
date                          int64
states                        int64
positive                      int64
negative                      int64
pending                     float64
hospitalizedCurrently       float64
hospitalizedCumulative      float64
inIcuCurrently              float64
inIcuCumulative             float64
onVentilatorCurrently       float64
onVentilatorCumulative      float64
recovered                   float64
dateChecked                  object
death                       float64
hospitalized                float64
lastModified                 object
total                         int64
totalTestResults              int64
posNeg                        int64
deathIncrease                 int64
hospitalizedIncrease          int64
negativeIncrease              int64
positiveIncrease              int64
totalTestResultsIncrease      int64
hash                         object
dtype: object

We'll convert the int64 date column to an actual datetime

In [21]:
df["timestamp"] = pd.to_datetime(df["date"], format="%Y%m%d")

Set the index of the DataFrame to a datetime index

In [22]:
df.set_index("timestamp", drop=False, inplace=True)
display(df)
date states positive negative pending hospitalizedCurrently hospitalizedCumulative inIcuCurrently inIcuCumulative onVentilatorCurrently ... total totalTestResults posNeg deathIncrease hospitalizedIncrease negativeIncrease positiveIncrease totalTestResultsIncrease hash timestamp
timestamp
2020-06-28 20200628 56 2540983 28447030 2198.0 32117.0 240156.0 5230.0 10473.0 2077.0 ... 30990211 30988013 30988013 273 580 544208 42161 586369 dc9b104a6101a2b1d147dd004970493f3faef554 2020-06-28
2020-06-27 20200627 56 2498822 27902822 2186.0 32220.0 239576.0 5296.0 10415.0 2159.0 ... 30403830 30401644 30401644 506 1057 547406 43471 590877 c76401840e79b9f3870ca039962ca0289d948dcf 2020-06-27
2020-06-26 20200626 56 2455351 27355416 2201.0 31423.0 238519.0 5263.0 10334.0 2075.0 ... 29812968 29810767 29810767 619 1526 558574 44373 602947 d28d2902aab75c2b63f7584fd72a5e02f160fd0a 2020-06-26
2020-06-25 20200625 56 2410978 26796842 2133.0 31532.0 236993.0 5305.0 10257.0 2214.0 ... 29209953 29207820 29207820 2500 1257 598526 39061 637587 f29e38890a88c4b4d5770436f86bcd1c326ca7ac 2020-06-25
2020-06-24 20200624 56 2371917 26198316 2049.0 30826.0 235736.0 5279.0 10173.0 2248.0 ... 28572282 28570233 28570233 722 1310 473722 38706 512428 9fb40b6267ac764e6e112724ce5419555da235c0 2020-06-24
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-01-26 20200126 1 2 0 NaN NaN NaN NaN NaN NaN ... 2 2 2 0 0 0 0 0 e1cf59ab48e1cf367c4a6798a508a23d9d36bd18 2020-01-26
2020-01-25 20200125 1 2 0 NaN NaN NaN NaN NaN NaN ... 2 2 2 0 0 0 0 0 bef2a1d5f2a13491e0e0369bbd46c10cdd12973b 2020-01-25
2020-01-24 20200124 1 2 0 NaN NaN NaN NaN NaN NaN ... 2 2 2 0 0 0 0 0 bfffe76fc0b7cf11efe8aecd3cc7b22598d77d61 2020-01-24
2020-01-23 20200123 1 2 0 NaN NaN NaN NaN NaN NaN ... 2 2 2 0 0 0 0 0 cee36ebf3174bf1df0daa36e1e8088a157406fad 2020-01-23
2020-01-22 20200122 1 2 0 NaN NaN NaN NaN NaN NaN ... 2 2 2 0 0 0 0 0 d538c99729d1fee626212d1878a100c1e1204a5f 2020-01-22

159 rows × 26 columns

Now let's plot the data on a timeline.
We'll look at the trend in positive/negative tests, recovered, deceased, and hospitalized patients for COVID-19 in the USA.

In [23]:
source = df
# fold the selected columns into long format: 'category' holds the column name, 'counts' the value
base = alt.Chart(source).properties(width=1200, height=400).transform_fold(
    ['recovered', 'death', 'hospitalized', 'positive'], as_=['category', 'counts'])
area = base.mark_area(opacity=0.5).encode(
    alt.X('timestamp:T'),
    alt.Y('counts:Q', stack=None),
    alt.Color('category:N')
    )
mark = base.mark_point(color='red', shape='circle').encode(
    x='timestamp:T', y='positive',
    tooltip=['timestamp', 'positive', 'negative', 'recovered', 'death', 'hospitalized'])
display(area + mark)

Let's also plot the daily increase numbers.

In [24]:
source = df
# fold the daily-increase columns: 'category' holds the column name, 'growth' the value
base = alt.Chart(source).properties(width=1200, height=400).transform_fold(
    ['deathIncrease', 'hospitalizedIncrease', 'positiveIncrease'], as_=['category', 'growth'])
area = base.mark_area(opacity=0.5).encode(
    alt.X('timestamp:T'),
    alt.Y('growth:Q', stack=None),
    alt.Color('category:N')
    )
mark = base.mark_point(color='red', shape='circle').encode(
    x='timestamp:T', y='positiveIncrease',
    tooltip=['timestamp', 'deathIncrease', 'hospitalizedIncrease', 'positiveIncrease'])
display(area + mark)
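If you want to keep the result, Altair charts can be saved as standalone HTML; a small sketch, where the filename is just an example:

# Hedged sketch: save the combined chart to a self-contained HTML file
chart = area + mark
chart.save("covid_trends.html")  # example filename; opens in any browser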