Data Lakes – Apache Spark in detail

Dirk Brys | Big Data

We already discussed the Spark Ecosystem. In this blog we’ll delve a bit deeper into the main reasons why you would use Spark. In short: its distributed processing engine, which runs on Spark clusters, and its query optimization engine. Spark Clustering … Read More

Neural Networks – Crash course

Dirk Brys | Machine Learning

Today’s world can no longer be imagined without the application of neural networks in our day-to-day lives. Think of handwriting recognition, image recognition, voice recognition, recommendation engines, etc. When we use our smartphones, our image gallery is automatically categorized and … Read More

Data Lakes – The Apache Spark Ecosystem

Dirk Brys | Big Data

Once you have stored data, you need to process it. Enter the distributed processing system called Apache Spark. Spark reads and processes data on a cluster of machines and, once processed, writes it back to either a distributed file system, … Read More

Data Lakes – Apache Parquet

Dirk Brys | Big Data

We love open source and we believe strongly in the advantages of using open-source technologies. In most cases there’s a strong community behind them and the quality of the product or software is generally very high. An example of such … Read More

Python – Using S3 Public Data

Dirk Brys | Big Data, Tutorial

In this tutorial we’ll dig into Amazon’s Public Datasets. These datasets are available for everyone to explore and are stored in S3 buckets. We’ll learn how to connect to these buckets, how to explore them, and how to load the data in … Read More