Mastering Inceptor Data Audit: Principles, Syntax, and Practical Examples
This article explains the concept and workflow of data audit in Inceptor, detailing dirty data types, configuration steps, relevant SQL syntax for Log Error Tables, example table creation, and control switches that manage error handling and NOT NULL checks to ensure clean data processing.
This article is divided into two parts that introduce the principles and methods of data audit, covering its concept, processing workflow, and a brief overview of related syntax switches.
What Is Data Audit When importing data into business tables, insufficient data cleaning can introduce dirty data, which affects the accuracy of analysis and query results. Data audit functions are needed to inspect and control dirty data, improving analysis precision and preventing misjudgments.
In Inceptor, data audit writes dirty records to a user‑specified Log Error Table, annotating each record with the reason for invalidity. After import, Inceptor returns total record counts, imported record counts, and a data quality report, enabling monitoring programs to handle or display error information.
Dirty Data Types Recorded When data audit is enabled, Inceptor reports and logs the following problematic data to the Log Error Table:
Strings containing delimiter symbols that cause a single line to be misread as multiple lines.
Type mismatches where source data cannot be converted to the target column type.
Errors generated by UDF‑based filtering when type conversion fails.
Records that violate NOT NULL constraints.
Data Audit Processing Flow The complete data audit functionality follows this workflow; configure parameters according to these steps:
Create an external table and specify a Log Error Table (Inceptor data audit currently works only with external tables).
When querying the external table, each parsed record is checked; invalid records are written to the Log Error Table.
Optionally set a REJECT policy to stop reading when the error count or ratio exceeds a threshold.
Inceptor Data Audit Syntax Specifying a Log Error Table You must specify a Log Error Table when creating an external table. <code>CREATE EXTERNAL TABLE table_name (column1 datatype1, column2 datatype2, ...) LOG ERRORS INTO error_table_name [OVERWRITE] [SEGMENT REJECT LIMIT n [ ROWS | PERCENT ] ]</code> error_table_name is the name of the Log Error Table; if it does not exist, Inceptor creates it automatically. The keyword LOG ERRORS INTO must be used at table creation. Use OVERWRITE to replace the Log Error Table on each run; otherwise, new error records are appended. Include SEGMENT REJECT LIMIT n to define a reject threshold (rows or percent) for stopping reads. Tables without a specified Log Error Table are still audited, but errors are not recorded. Specifying a Log Error Table does not automatically enable data audit; the audit switch must be turned on. Example: Create External Table with Log Error Table The employee table imports records from employee.txt (shown below). We set the Log Error Table to employee_error_table , enable OVERWRITE , and apply a REJECT limit of 2 rows per task. <code>CREATE EXTERNAL TABLE employee (</code> <code> id int NOT NULL,</code> <code> name string,</code> <code> age tinyint,</code> <code> degree string,</code> <code> onboard int</code> <code>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/DataAudit/employee' LOG ERRORS INTO employee_error_table SEGMENT REJECT LIMIT 2 ROWS;</code> Running SELECT * FROM employee yields the result shown below. Three rows contain dirty data: rows 1 and 10 have age values that cannot be cast from string to integer (displayed as NULL), and row 11 violates the NOT NULL constraint on id . Control Switches Three switches control data audit behavior: SET inceptor.data.audit = true/false; – Master switch; when enabled, dirty data and NOT NULL checks are performed (default off). SET inceptor.strict.evaluate = true/false; – When true, encountering dirty data throws an exception (default off). SET inceptor.notnull.audit = true/false; – Enables NOT NULL constraint checking (default off). Enabling inceptor.data.audit forces the other two switches to true; they cannot be changed independently while the master switch is on. When the master switch is off, the sub‑switches can be set independently, affecting whether errors are logged to the Log Error Table or printed to the console. Example 1: Enable Master Switch After turning on the master switch, SELECT * FROM employee returns only valid rows; subsequent queries access only clean data. The audit considers only fields involved in the current query, and three dirty rows are recorded in the Log Error Table. Example 2: Disable Master Switch, Enable Strict Evaluation With the master switch off and inceptor.strict.evaluate on, Inceptor throws exceptions on dirty data without recording them in the Log Error Table. Summary Data audit acts as a dirty‑data filter, ensuring queries access only clean data and providing clues for correcting dirty records, thereby isolating bad data from healthy data. Mastering its usage and control mechanisms significantly enhances data reliability and analysis accuracy. Future articles will cover how to modify data audit properties, interpret Log Error Table contents, and manage its access permissions.
StarRing Big Data Open Lab
Focused on big data technology research, exploring the Big Data era | [email protected]
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
