Data Lakes – Apache Spark in detail

Dirk Brys | Big Data

We already discussed the Spark ecosystem. In this blog we’ll delve a bit deeper into the main reasons why you would use Spark. In short: for its distributed processing engine, which runs on Spark clusters, and its query optimization engine. Spark Clustering … Read More

Data Lakes – The Apache Spark Ecosystem

Dirk Brys | Big Data

Once you have stored data, you need to process it. Enter the distributed processing system called Apache Spark. Spark reads and processes data on a cluster of machines and, once processed, writes it back to either a distributed file system, … Read More

Data Lakes – Apache Parquet

Dirk Brys | Big Data

We love open source and believe strongly in the advantages of using open-source technologies. In most cases there is a strong community behind them, and the quality of the software is generally very high. An example of such … Read More

Data Lakes – Amazon S3

Dirk Brys | Big Data

The first requirement for any data lake is that it can store the raw data we want to use for data analytics, in any format, whether structured or unstructured. We potentially want to store flat … Read More