Data Lakes – The Apache Spark Ecosystem

Dirk Brys | Big Data

Once you have stored data, you need to process it. Enter Apache Spark, a distributed processing system. Spark reads and processes data on a cluster of machines and, once the data is processed, writes it back to either a distributed file system, … Read More

Data Lakes – Apache Parquet

Dirk Brys | Big Data

We love open source and believe strongly in the advantages of open-source technologies. In most cases there’s a strong community behind a project, and the quality of the software is generally very high. An example of such … Read More

Python – Using S3 Public Data

Dirk Brys | Big Data, Tutorial

In this tutorial we’ll dig into Amazon’s Public Datasets. These datasets are available for everyone to explore and are stored in S3 buckets. We’ll learn how to connect to these buckets, explore them, and load the data in … Read More

Data Lakes – Amazon S3

Dirk Brys | Big Data

The first requirement for any data lake is that it can store the raw data we want to use for analytics, in any format, whether structured or unstructured. We potentially want to store flat … Read More