Python – Using S3 Public Data

Dirk Brys | Big Data, Tutorial

In this tutorial we’ll dig into Amazon’s public datasets. These datasets are available for everyone to explore and are stored in S3 buckets. We’ll learn how to connect to these buckets, browse them and load their data into Pandas dataframes. In future tutorials we’ll see how to load the same data into Spark dataframes and transform it for heavier lifting and ML, but for now we’ll stick to Pandas.

Pandas is probably the most popular Python library for data scientists working with datasets. It’s a very powerful library, and even when working with Spark dataframes (required if you want to run distributed code) you’ll probably still convert to Pandas at some point. Luckily that’s pretty straightforward: you can use either Koalas or simply Spark’s toPandas() function.
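To give you an idea, here’s a minimal sketch of that conversion, assuming a local Spark session and some made-up sample data:

```python
from pyspark.sql import SparkSession

# Local Spark session for demonstration purposes
spark = SparkSession.builder.appName("to-pandas-example").getOrCreate()

# A tiny Spark dataframe with made-up sample rows
spark_df = spark.createDataFrame(
    [("BE", 100), ("NL", 250)], ["country", "cases"]
)

# Collect the distributed dataframe into a local Pandas dataframe
pandas_df = spark_df.toPandas()
print(pandas_df.head())
```

Keep in mind that toPandas() pulls all the data onto the driver, so only use it on dataframes that fit into memory.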

For this tutorial we’ll have a look at the public ‘Covid-19 lake’ bucket on AWS S3. You can find all the code in this Git repository: python notebook.

It’s also available in Google Colab (Colab Notebook), or you can download the code and run it in Jupyter. For your convenience we’ve also embedded it here in HTML format so you can read through the notebook, although you won’t be able to run it here. At the end you should see the same output as shown below (potentially with more data, since the datasets keep growing).

So let’s dive in.
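To give you a first taste of what the notebook does, here’s a minimal sketch of browsing the Covid-19 lake with Boto3 and loading one of its CSV files into a Pandas dataframe. The bucket name, prefix and object key below are assumptions about the public data lake’s layout, so adjust them to whatever you find when listing the bucket:

```python
import io

import boto3
import pandas as pd
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client: the Covid-19 lake is a public bucket,
# so no AWS credentials are needed for read access
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

bucket = "covid19-lake"  # assumed name of the public Covid-19 data lake bucket

# Browse: list a handful of objects under an assumed prefix
response = s3.list_objects_v2(
    Bucket=bucket, Prefix="rearc-covid-19-testing-data/csv/", MaxKeys=10
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Load: read one CSV object straight into a Pandas dataframe
key = "rearc-covid-19-testing-data/csv/states_daily/states_daily.csv"  # assumed key
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
df = pd.read_csv(io.BytesIO(body))
print(df.head())
```

The same pattern (list_objects_v2 to browse, get_object plus pd.read_csv to load) works for any public bucket and prefix you point it at.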

So that’s it. We covered the following topics in this tutorial:

  • Using Boto3 to connect to S3, browse a bucket and load files into a Pandas dataframe
  • Using s3fs to connect, browse and load files from S3 into Pandas (see the sketch after this list)
  • (Optional) visualizing one of the datasets with a Python visualization library such as Altair (also sketched below)
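Here’s a minimal sketch of the s3fs approach, again assuming the same public bucket and object key as above (pandas 1.2+ can read s3:// URLs directly when s3fs is installed):

```python
import pandas as pd
import s3fs

# Anonymous filesystem handle, since the bucket is public
fs = s3fs.S3FileSystem(anon=True)

# Browse the bucket like a regular filesystem (bucket/prefix names are assumptions)
print(fs.ls("covid19-lake/rearc-covid-19-testing-data/csv/states_daily"))

# Load a CSV straight into Pandas via an s3:// URL
df = pd.read_csv(
    "s3://covid19-lake/rearc-covid-19-testing-data/csv/states_daily/states_daily.csv",
    storage_options={"anon": True},
)
print(df.head())
```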
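And a short sketch of the optional visualization step with Altair; the column names used here ('state', 'date', 'positive') are assumptions about the dataset’s schema, so check df.columns first:

```python
import altair as alt
import pandas as pd

# Re-read the assumed public dataset from the previous sketch
df = pd.read_csv(
    "s3://covid19-lake/rearc-covid-19-testing-data/csv/states_daily/states_daily.csv",
    storage_options={"anon": True},
)

# Filter one state and parse the date column (schema is an assumption)
ny = df[df["state"] == "NY"].copy()
ny["date"] = pd.to_datetime(ny["date"], format="%Y%m%d")

# Simple line chart of positive tests over time
chart = (
    alt.Chart(ny)
    .mark_line()
    .encode(x="date:T", y="positive:Q")
    .properties(title="Positive Covid-19 tests over time (NY)")
)
chart.save("positive_tests_ny.html")  # or just display `chart` in a notebook
```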

There’s of course more to say about S3. In a future tutorial we’ll show you how to create your own bucket, provide secure access to it, and write data to and read data from secure private buckets.

Related posts: Amazon S3

Want to know more?

Get in Touch