Analytics Aura

Unlock Real-Time Insights: Building a PostgreSQL Data Lake with Qlik Replicate

Introduction

In today's data-driven world, businesses thrive on instant access to diverse, real-time data. We embarked on a project to create a robust, PostgreSQL-based data lake, leveraging Qlik Replicate to handle high-velocity change data with minimal latency and maximum throughput. This post details our journey and the significant impact it had on our organization.

Project Overview: Real-Time Data Pipeline

Our primary goal was to establish a centralized data repository that could ingest and process real-time data from various sources using Qlik Replicate. We implemented a two-stage pipeline:

Data Ingestion: From Source to Kafka

Diverse Sources: We configured Qlik Replicate to capture change data from Oracle, MSSQL, and MySQL databases.
Real-Time Streaming: The captured changes were streamed in real time to Apache Kafka, a distributed event streaming platform.
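To make the ingestion step concrete, here is a minimal sketch of how a downstream consumer might decode one change message taken from a Kafka topic. The envelope layout (`table`, `operation`, `timestamp`, `data`) and the `ChangeEvent` and `parse_change_event` names are hypothetical, chosen for illustration. The actual message format produced by Qlik Replicate depends on how the Kafka endpoint is configured and is not described in this post.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ChangeEvent:
    table: str
    operation: str            # INSERT, UPDATE, or DELETE
    timestamp: str            # ISO-8601 string carried in the message
    data: Dict[str, Any]      # column name -> value


VALID_OPERATIONS = {"INSERT", "UPDATE", "DELETE"}


def parse_change_event(raw: bytes) -> ChangeEvent:
    """Decode one Kafka message value, assuming the hypothetical envelope above."""
    envelope = json.loads(raw)
    operation = envelope["operation"].upper()
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    return ChangeEvent(
        table=envelope["table"],
        operation=operation,
        timestamp=envelope["timestamp"],
        data=envelope["data"],
    )
```

Normalizing the operation name and rejecting unknown values early keeps malformed messages from reaching the database layer.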

Data Processing and Storage: Kafka to PostgreSQL

Kafka's Role: Kafka acted as a buffer between Qlik Replicate and PostgreSQL, providing low-latency, reliable message queuing for high-volume data streams.
Unified Data Lake: The processed data was then ingested into PostgreSQL, creating a unified and easily accessible data lake for our business teams.
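One way to apply a change event to PostgreSQL is to turn it into an upsert or a delete. The sketch below only builds the SQL text and parameter list. The `build_statement` function, the table and column names, and the `%s` placeholder style (as used by psycopg) are assumptions for illustration, not a description of the pipeline we ran. It also assumes the key columns carry a unique constraint, which `ON CONFLICT` requires.

```python
import re
from typing import Any, Dict, List, Tuple

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(name: str) -> str:
    """Allow only plain identifiers, since they are interpolated into SQL text."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def build_statement(
    table: str, operation: str, row: Dict[str, Any], key_columns: List[str]
) -> Tuple[str, List[Any]]:
    """Return (sql, params) for one change; values are always passed as parameters."""
    table = _check(table)
    keys = [_check(k) for k in key_columns]
    if operation == "DELETE":
        where = " AND ".join(f"{k} = %s" for k in keys)
        return f"DELETE FROM {table} WHERE {where}", [row[k] for k in keys]

    # INSERT and UPDATE are both applied as an upsert on the key columns.
    columns = [_check(c) for c in row]
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c not in keys)
    action = f"DO UPDATE SET {updates}" if updates else "DO NOTHING"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(keys)}) {action}"
    )
    return sql, [row[c] for c in columns]
```

Treating inserts and updates the same way makes the apply step idempotent, so a message that is delivered twice leaves the table in the same state.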

Issue Identified: Tracking Change Data Processing

Issue: Because the Qlik Replicate task captured and applied changes continuously, we had no way to get a daily count of the changes captured from the source and the changes applied to the target.
Fix: We developed a framework that tracks both the changes arriving from the source and the changes applied to the target. It lets us audit the two sides daily, which supports data integrity and transparency.
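The core idea of such an audit can be sketched as a daily tally on each side followed by a comparison. This is not our framework's actual code. The `daily_counts` and `reconcile` functions, and the `(timestamp, table, operation)` record shape they consume, are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str, str]  # (day, table, operation)


def daily_counts(events: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count changes per (day, table, operation) from (iso_timestamp, table, operation) records."""
    counts: Counter = Counter()
    for timestamp, table, operation in events:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        counts[(day, table, operation)] += 1
    return counts


def reconcile(source: Counter, target: Counter) -> Dict[Key, Tuple[int, int]]:
    """Return only the keys where the source and target counts differ."""
    keys = sorted(set(source) | set(target))
    return {k: (source[k], target[k]) for k in keys if source[k] != target[k]}
```

An empty result from `reconcile` means every change counted on the source side for a given day, table, and operation was also counted on the target side. Changes captured just before midnight and applied just after will show up as a mismatch on both days, so a real audit would need a rule for that boundary.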

Key Achievements

Near-Zero Latency: We achieved near real-time data processing through optimized hardware and network configurations.
High Throughput: Our efficient infrastructure design allowed us to handle large volumes of data seamlessly.
Versatile Data Support: The solution supports both transactional and non-transactional data, catering to a wide range of business needs.
High Availability: Running Qlik Replicate in an active/passive setup with shared storage kept the pipeline operating during outages and planned maintenance.