ABC Framework for Data Quality Management
Introduction
Data management tasks are often highly repetitive, so building a framework around the common tasks becomes imperative when we want to scale a project without compromising the guardrails of how we want to manage it. This need for reusable development patterns makes a strong case for a framework-based approach.
The ABC framework is one such solution: a metadata-driven approach to developing and supporting (low-code/no-code) data pipelines. It also has a strong value proposition in Data Quality Management (DQM), because in any data project we measure data quality across:
- Accuracy
- Consistency
- Data Dictionary
- Data Lineage
- Completeness
- Integrity
- Timeliness

| Data Quality Dimension | Role of ABC Framework |
|---|---|
| Accuracy | Validates against source-of-truth values |
| Consistency | Ensures uniform rules across pipelines |
| Data Dictionary | Enforces schema and definitions via metadata |
| Data Lineage | Tracks transformations and flow automatically |
| Completeness & Comprehensiveness | Flags missing fields or records |
| Integrity | Referential checks based on metadata constraints |
| Timeliness | Monitors data arrival times and freshness |
What is ABC Framework?
The Audit, Balance and Control framework enforces these data quality dimensions systematically. Audit is the process of identifying what happened during an ETL operation. Balance is the process of confirming whether what happened was correct. Control is the process of identifying and resolving errors that may have occurred during the ETL process.
Audit
Audit is the process of identifying what happened during an ETL operation. The audit table records how the ETL job moved the data, not how the ETL job executed: what data was moved, from which source, to which target, and by which pipeline. It is not a substitute for the logs of the ETL job's tasks, and it will not by itself tell you how to optimize the pipeline.
That said, it does capture enough metadata to show which pipelines are running long or failing often, so we can then deep-dive into how a given modular pipeline can be optimized.
We can capture data like execution time, input parameters, log location, etc., with the intent of informing downstream processes about your data quality. For example, capturing execution time is helpful when your ETL job relies on asynchronous API calls to third-party services: a high execution time suggests your pipeline is being held up, and there may be a way to improve its execution speed (recall that Timeliness is a DQM dimension). Additionally, if you capture the log location in your Audit process, you can conduct forensic analysis on your ETL job at a later date. This comes in handy for the Balance and Control processes in particular.
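As a minimal sketch of this idea, the helper below wraps an ETL step and collects the metadata mentioned above (execution time, input parameters, log location, status). The function name, the log path, and the record fields are illustrative assumptions, not a fixed schema.

```python
import time
import uuid
from datetime import datetime, timezone

def run_with_audit(job_name, params, job_fn):
    """Run an ETL step and capture an audit record for it.

    job_fn is assumed to return the number of records it processed.
    In practice the returned record would be inserted into the Audit table.
    """
    record = {
        "run_id": str(uuid.uuid4()),
        "job_name": job_name,
        "input_params": params,  # informs downstream DQM processes
        "started_at": datetime.now(timezone.utc).isoformat(),
        "log_location": f"s3://etl-logs/{job_name}/",  # hypothetical path
    }
    start = time.monotonic()
    try:
        record["records_processed"] = job_fn(**params)
        record["status"] = "SUCCEEDED"
    except Exception as exc:
        # Capture errors that would otherwise be lost outside the job logs
        record["status"] = "FAILED"
        record["error"] = str(exc)
    record["execution_seconds"] = round(time.monotonic() - start, 3)
    return record

# Example: a trivial job that "processes" 1000 records
audit = run_with_audit("load_orders", {"batch_date": "2024-01-01"},
                       lambda batch_date: 1000)
```

Capturing the record even on failure is deliberate: the Balance and Control processes described below depend on knowing about failed runs, not just successful ones.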
Some questions that the Audit table can answer are:
- What ETL job ran? What tasks were associated with that job? Did they all run successfully?
- Were there errors that did not halt execution?
- What were the input parameters for the process? Are these parameters general or specific to the process itself? Do these parameters change based on time of day or the affected systems of record?
- How many records were extracted? From where were they extracted? How many records were transformed? What transformations were conducted? How many records were loaded? Where were those records loaded?
- When did it all start? What/who started this process? When did it all end?
- What was started as a result of this pipeline finishing? Did this particular job's execution improve or hurt your environment's overall data quality rating?
If your organization suffers from data quality issues, start by auditing your ETL pipelines and capturing the right pieces of information.
NOTE: when Data Quality Management is done through Audit, Balance and Control tables, the upstream ETL must be working; if something goes wrong with the ETL job, the DQM process suffers from the same issue. This architectural pattern is therefore tightly coupled.
In most cases that is by design, and stakeholders are fine with these processes being coupled, but not always. For scenarios where the audit should work independently of the ETL job, we use events and a decoupled approach.

The diagram above shows a sample architecture that decouples ETL pipelines and DQM pipelines using events with AWS services. Events allow us to send information asynchronously from one application stack to another. The ETL pipeline is constructed using AWS Data Pipeline and EMR; it connects an RDS database to an S3 data lake, fully equipped with Redshift Spectrum for data warehousing queries and a Glue Data Catalog. Where does DQM fit into this ETL pipeline? That's where the Lambda function comes in. The Lambda publishes a custom event to SNS/SQS. This event triggers an execution of AWS Step Functions, which contains the logic for your DQM pipeline. One of the functions in that Step Functions stack contains the logic needed to trigger an audit job.
There are many benefits of using this pattern:
- The Events decouple the disparate application components
- This architecture pattern implies that everything your audit process needs to start is contained within the body of the event.
- The inputs to your audit process could be explicitly defined as key-value pairs in your JSON body.
- Because the process is event-driven, this architecture becomes dependent on the message bus you adopt. If you use Kafka as the event bus, you can craft your messages so that streaming analytics can describe your system's overall data quality.
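To make the second and third points concrete, here is a sketch of an event body that carries everything the audit process needs as explicit key-value pairs. The event name, field names, and topic are assumptions chosen for illustration; with boto3 the resulting JSON string would be handed to a call such as `sns.publish`.

```python
import json

def build_audit_event(job_name, run_id, records_read, records_written,
                      log_location):
    """Build the event body that triggers the downstream DQM pipeline.

    Everything the audit process needs to start is contained in the
    event body, so the consumer never has to query the producer.
    """
    return json.dumps({
        "event_type": "etl.job.completed",  # hypothetical event name
        "job_name": job_name,
        "run_id": run_id,
        "records_read": records_read,
        "records_written": records_written,
        "log_location": log_location,
    })

event = build_audit_event("load_orders", "run-42", 1000, 998,
                          "s3://etl-logs/load_orders/run-42/")
# With boto3 this could then be published, e.g.:
# boto3.client("sns").publish(TopicArn=TOPIC_ARN, Message=event)
```

Keeping the payload self-describing is what makes the decoupling work: the DQM Step Functions workflow can start from the event alone, without reaching back into the ETL stack.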
Balance
Balance is how you confirm that your ETL processes are operating on correct data. We can extend the audit data model with columns that help us balance the data during an ETL process, such as records read and records written.

With this model, we know how each job ran, and where each job's records came from and were stored. This audit information forms the basis for our Balance process.
Balancing in practice
To balance your data, you need to ensure, at a minimum, the following:
- That your data in your source table is as expected.
- That the data in your target table is as expected.
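A minimal balance check can be built directly on the records-read and records-written columns described above. The function below is a sketch: the field names and the optional tolerance (to allow for expected filtering or deduplication) are assumptions, not part of a fixed ABC schema.

```python
def balance_check(source_count, target_count, tolerance=0):
    """Compare record counts captured by the Audit process for one job run.

    Returns a summary dict the Control process can act on. A non-zero
    tolerance allows for records a pipeline legitimately drops
    (e.g. dedupes or filters).
    """
    diff = source_count - target_count
    return {
        "source_count": source_count,
        "target_count": target_count,
        "difference": diff,
        "balanced": abs(diff) <= tolerance,
    }

# 1000 records read from the source, but only 998 written to the target
result = balance_check(source_count=1000, target_count=998)
```

A failed check like this one does not fix anything by itself; it flags the run so the Control process can investigate and replay it.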
Control: Bringing it all together
The Control process in a DQM pipeline is the "fixer" of errors. An effective Control process looks at the inputs of an ETL job and the corresponding outputs, and can ultimately determine what went wrong and, ideally, how to fix it. Where we just want to rerun the failed records identified through the Balance table, we extend the earlier data model, which only captured job information and records read/written, to capture more information about individual job runs so that we can "replay" an ETL job.

This schema paints a bigger picture of the kind of information we can collect to "replay" an ETL job. Notice the relationship among the ETL_JOB, ETL_JOB_RUN, and ETL_JOB_PARAM tables. These tables allow us to store all of the parameters associated with an individual run of an ETL job. By querying them appropriately, we can reverse-engineer the job that created the erroneous records flagged by our Audit and Balance processes. From there, our Control process only has to resubmit a job with the same (or a modified) set of parameters as the original.
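The replay query can be sketched with an in-memory SQLite database standing in for the real metadata store. The table and column names below mirror the ETL_JOB_RUN and ETL_JOB_PARAM tables mentioned above but are assumptions about their shape, not a prescribed schema.

```python
import sqlite3

# Stand-in metadata store with one failed run and its parameters
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etl_job_run (
        run_id TEXT PRIMARY KEY, job_name TEXT, status TEXT);
    CREATE TABLE etl_job_param (
        run_id TEXT, param_key TEXT, param_value TEXT);
    INSERT INTO etl_job_run VALUES ('run-42', 'load_orders', 'FAILED');
    INSERT INTO etl_job_param VALUES ('run-42', 'batch_date', '2024-01-01');
    INSERT INTO etl_job_param VALUES ('run-42', 'source_table', 'orders');
""")

def params_for_replay(run_id):
    """Reconstruct the parameters of a run so Control can resubmit it."""
    rows = conn.execute(
        "SELECT param_key, param_value FROM etl_job_param WHERE run_id = ?",
        (run_id,),
    ).fetchall()
    return dict(rows)

replay_params = params_for_replay("run-42")
# Control would now resubmit the job with these (or modified) parameters
```

The key design point is that parameters live in their own key-value table: the Control process can fetch them generically for any job, without knowing each job's parameter list in advance.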
The ABC framework is quite vast and can differ based on the use case. However, it gives engineering teams wider control to deliver high-quality data products and more granular control over the entire ETL process, thereby highlighting the areas where efficiencies can be achieved.