Python – Using S3 Public Data

Dirk Brys | Big Data, Tutorial

In this tutorial we’ll dig into Amazon’s public datasets. These datasets are available for everyone to explore and are stored in S3 buckets. We’ll learn how to connect to these buckets, browse them and load their data into Pandas dataframes. In future tutorials we’ll see how to load the same data into Spark dataframes and transform it for heavier lifting and ML, but for now we’ll stick to Pandas.

Pandas is probably the most popular Python library for data scientists working with datasets. It’s a very powerful library, and even when working with Spark dataframes (required if you want to run distributed code) you’ll probably still convert to Pandas at some point. Luckily that’s pretty straightforward: you can use either Koalas or simply Spark’s toPandas() function.
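To give you an idea, here’s a minimal sketch of that conversion, assuming a local Spark session and some made-up sample data:

```python
from pyspark.sql import SparkSession

# Local Spark session for demonstration purposes
spark = SparkSession.builder.appName("to-pandas-example").getOrCreate()

# A tiny Spark dataframe with made-up sample rows
spark_df = spark.createDataFrame(
    [("BE", 100), ("NL", 250)], ["country", "cases"]
)

# Collect the distributed dataframe into a local Pandas dataframe
pandas_df = spark_df.toPandas()
print(pandas_df.head())
```

Keep in mind that toPandas() pulls all the data onto the driver, so only use it on dataframes that fit into memory.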

For this tutorial we’ll have a look at the public ‘Covid-19 lake’ bucket on AWS S3. You can find all the code in this Git repository: python notebook.

It’s also available in Google Colab (Colab Notebook), or you can download the code and run it in Jupyter. For your convenience we’ve also embedded it here in HTML format so you can read through the notebook, although you won’t be able to run it here. At the end you should see the same output as shown below (potentially with more data, since the datasets keep growing).

So let’s dive in.
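To give you a first taste of what the notebook does, here’s a minimal sketch of browsing the Covid-19 lake with Boto3 and loading one of its CSV files into a Pandas dataframe. The bucket name, prefix and object key below are assumptions about the public data lake’s layout, so adjust them to whatever you find when listing the bucket:

```python
import io

import boto3
import pandas as pd
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client: the Covid-19 lake is a public bucket,
# so no AWS credentials are needed for read access
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

bucket = "covid19-lake"  # assumed name of the public Covid-19 data lake bucket

# Browse: list a handful of objects under an assumed prefix
response = s3.list_objects_v2(
    Bucket=bucket, Prefix="rearc-covid-19-testing-data/csv/", MaxKeys=10
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Load: read one CSV object straight into a Pandas dataframe
key = "rearc-covid-19-testing-data/csv/states_daily/states_daily.csv"  # assumed key
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
df = pd.read_csv(io.BytesIO(body))
print(df.head())
```

The same pattern (list_objects_v2 to browse, get_object plus pd.read_csv to load) works for any public bucket and prefix you point it at.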

So that’s it. We covered the following topics in this tutorial:

  • Using Boto3 to connect to S3, browse a bucket and load files into a Pandas dataframe
  • Using s3fs to connect, browse and load files from S3 into Pandas (see the sketch after this list)
  • (Optional) visualizing one of the datasets with a Python visualization library such as Altair (also sketched below)
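Here’s a minimal sketch of the s3fs approach, again assuming the same public bucket and object key as above (pandas 1.2+ can read s3:// URLs directly when s3fs is installed):

```python
import pandas as pd
import s3fs

# Anonymous filesystem handle, since the bucket is public
fs = s3fs.S3FileSystem(anon=True)

# Browse the bucket like a regular filesystem (bucket/prefix names are assumptions)
print(fs.ls("covid19-lake/rearc-covid-19-testing-data/csv/states_daily"))

# Load a CSV straight into Pandas via an s3:// URL
df = pd.read_csv(
    "s3://covid19-lake/rearc-covid-19-testing-data/csv/states_daily/states_daily.csv",
    storage_options={"anon": True},
)
print(df.head())
```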
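And a short sketch of the optional visualization step with Altair; the column names used here ('state', 'date', 'positive') are assumptions about the dataset’s schema, so check df.columns first:

```python
import altair as alt
import pandas as pd

# Re-read the assumed public dataset from the previous sketch
df = pd.read_csv(
    "s3://covid19-lake/rearc-covid-19-testing-data/csv/states_daily/states_daily.csv",
    storage_options={"anon": True},
)

# Filter one state and parse the date column (schema is an assumption)
ny = df[df["state"] == "NY"].copy()
ny["date"] = pd.to_datetime(ny["date"], format="%Y%m%d")

# Simple line chart of positive tests over time
chart = (
    alt.Chart(ny)
    .mark_line()
    .encode(x="date:T", y="positive:Q")
    .properties(title="Positive Covid-19 tests over time (NY)")
)
chart.save("positive_tests_ny.html")  # or just display `chart` in a notebook
```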

There’s of course more to say about S3. In a future tutorial we’ll show you how to create your own bucket, provide secure access to it, and write data to and read data from secure private buckets.

Related posts: Amazon S3

Want to know more?

Get in Touch