VDAB, the public employment service of Flanders, Belgium, plays a pivotal role in job matching, vocational training, and career guidance for job seekers. It also supports employers in finding qualified candidates and offers integration programs for disadvantaged groups, including the long-term unemployed, people with disabilities, and immigrants. Although sAInce.io operates primarily in the manufacturing industry, we were eager to help VDAB meet the challenges it faces.
To fulfill its mission as the career director of Flanders, VDAB needed to transition to a new big data technology stack capable of managing the vast amounts of data it processes daily. Efficient data management and rapid adaptation to new reporting, analytics and machine learning requirements were crucial for VDAB to enhance its services.
sAInce.io played a key role in selecting and testing a new data platform (SingleStore) for VDAB’s on-premise enterprise data warehouse. After the initial PoC, this responsibility was transferred to a new team that builds the new DWH following a data vault approach. sAInce.io then joined a development team to help build a generic ingestion framework that loads any kind of data into a data lakehouse.
"Any kind of data" means table sources from traditional RDBMS systems like Oracle, MySQL and SQL Server, Kafka event data, and flat files. The goal of the data lakehouse is to provide a full historical layer for all these datasets so that it can be used:
by data scientists to time travel through historical data and build ML models;
by business analysts or reporters to quickly explore historical data;
by business users, via the data vault and data marts in the DWH, for full-blown analytical reporting needs.
In addition, this ingestion framework had to fit into the existing architectural landscape of VDAB. Apache NiFi is used to offload all table and file datasets and generates Avro delta files on an on-premise data lake backed by a Cloudera HDFS cluster.
Finally, the team developed a robust data ingestion framework based on Python, Apache Airflow, Apache Spark, Apache Iceberg, and HDFS, facilitating efficient data handling and integration.
Versatile Data Ingestion: The framework loads data from various sources, including databases, events and files into a data lake (actually a lakehouse) and serves the enterprise data warehouse landing area through simple configuration.
Automation and Efficiency: By requiring only a few configuration items, the system automatically manages data flow through all layers, significantly improving the time-to-market capability and eliminating the need for custom-built ETL/ELT pipelines.
Comprehensive Data Handling: The framework supports loading database tables, flat files and Kafka events. It offers several historization methods: transaction date, content hash value, audit table mappings, and pure append-only.
Advanced Data Storage: The system creates SCD2 (Slowly Changing Dimension type 2) Iceberg tables in Parquet format on an HDFS data lake and runs on an on-premise Spark cluster (Cloudera).
Parallel Processing and Optimization: Data is loaded in parallel and optimized for loading and transformation based on the dataset's structure, size, and historization method.
Integration and Automation: The framework integrates with SingleStore, allowing automatic ingestion into a landing area and further automation through Vaultspeed for creating data vault layers. The framework supports schema changes, automatic schema evolution, replays, and subprocess retries.
Monitoring and Orchestration: The system integrates with Prometheus/Grafana for monitoring. Ingestion, replay, and maintenance are orchestrated using Apache Airflow, optimizing resource usage on the Cloudera on-premise appliance.
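To make the content hash historization and the SCD2 tables mentioned above more concrete, here is a minimal sketch in plain Python. It is not the VDAB framework, which runs on Spark and Iceberg; it only illustrates the general technique. All names (VersionRow, content_hash, apply_snapshot) and the sample data are hypothetical, and the sketch assumes each load is a full snapshot, so that a key missing from the snapshot is treated as deleted.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class VersionRow:
    """One SCD2 version of a record (hypothetical layout for illustration)."""
    key: str
    attrs: Dict[str, str]
    content_hash: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None marks the currently valid version


def content_hash(attrs: Dict[str, str]) -> str:
    # Serialize attributes in sorted key order so the hash is order-independent.
    payload = "|".join(f"{k}={attrs[k]}" for k in sorted(attrs))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def apply_snapshot(
    history: List[VersionRow],
    snapshot: Dict[str, Dict[str, str]],
    load_ts: datetime,
) -> List[VersionRow]:
    """Merge a full snapshot into the SCD2 history (mutates and returns it)."""
    current = {row.key: row for row in history if row.valid_to is None}
    for key, attrs in snapshot.items():
        new_hash = content_hash(attrs)
        row = current.get(key)
        if row is not None and row.content_hash == new_hash:
            continue  # same content hash: nothing changed, keep the open version
        if row is not None:
            row.valid_to = load_ts  # close the previous version
        history.append(VersionRow(key, dict(attrs), new_hash, load_ts))
    # Assumption: keys absent from a full snapshot are treated as deleted.
    for key, row in current.items():
        if key not in snapshot:
            row.valid_to = load_ts
    return history


# Two consecutive loads: record "42" changes, record "7" stays the same.
t1 = datetime(2024, 1, 1)
t2 = datetime(2024, 2, 1)
history: List[VersionRow] = []
apply_snapshot(history, {"42": {"status": "searching"}, "7": {"status": "employed"}}, t1)
apply_snapshot(history, {"42": {"status": "employed"}, "7": {"status": "employed"}}, t2)
```

After the second load, the history holds three rows: a closed and an open version for key "42", and a single open version for key "7". Comparing hashes instead of individual columns keeps the change detection generic, which fits a framework that is driven by configuration rather than per-table code.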
Implementing the new data ingestion framework at VDAB has led to several significant benefits:
Improved Time-to-Market: Faster delivery of new datasets and reporting/analytics requirements.
Cost Reduction: Decreased cost of additional ETL development.
Enhanced Efficiency: Speedier delivery of new requests, enabling VDAB to respond promptly to new demands.
Although we played a pivotal role in the development of this new framework, building a system is always a team effort and not the merit of one single person. So congrats to all our fellow colleagues at VDAB and keep up the good work!
sAInce.io's capabilities also extend to data engineering and big data challenges. We bring 20+ years of experience in data warehousing, business intelligence and analytics, profound knowledge of distributed computing frameworks like Apache Spark and Dask, and a deep understanding of data architecture needs for any use case. With that, we can assist you with the challenges of digital transformation, focusing not only on how to store data and use it in applications but also on how to deliver it to end users for analytical needs. By taking the end-to-end pipeline into consideration when designing systems, we are capable of delivering systems that really make a difference and enable a true transformation to the smart factory.