In my projects i need to design a transactional database and since the beginning i know that it will be the source of an ETL, feeding a Data Warehouse.
I add Primary Key (single column or composite) and standardized metadata columns in all the tables. The date-tracking columns:
- date_created
- date_last_modified
Option 1
- date_created = now()
- date_last_modified = NULL
In MariaDB my CREATE TABLE statement includes:
… , date_created DATETIME NOT NULL DEFAULT current_timestamp() , date_last_modified DATETIME DEFAULT NULL ON UPDATE current_timestamp()
Later, when i need to query these columns, the code is:
SELECT … FROM … ORDER BY COALESCE(date_last_modified, date_created) DESC;
Option 2
- date_created = now()
- date_last_modified = now()
In this case, i can use column date_last_modified independently and when both column equal, i know that this row was never updated.
Option 3
- date_last_modified = now()
In this case i use only column date_last_modified and i lose track of when this row was created.
Data Warehouse
On the Data Warehouse side, when ingesting into Bronze, i use column date_last_modified to select only the UPSERTed rows since the last execution:
SELECT <columns list> FROM <table name> WHERE date_last_modified >= <datetime of the last execution> AND date_last_modified < <datetime now>;
Side note: CDC option in SQL Server is Change Tracking. I wrote some posts on this topic.
When i create the silver layer, i determine the last version of the row:
SELECT T.* FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY '<pk columns list>' ORDER BY date_last_modified DESC ) AS R , * FROM <schema name>.<table name> ) AS T WHERE T.R = 1;
To generate the list of the Primary Key columns:
SELECT
STRING_AGG(KU.COLUMN_NAME, CONCAT(CHAR(39), ',', CHAR(39))) AS columns_list
FROM
INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS TC
JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE AS KU ON TC.CONSTRAINT_TYPE = 'PRIMARY KEY'
AND TC.CONSTRAINT_NAME = KU.CONSTRAINT_NAME
WHERE KU.TABLE_NAME = @table_name;
Keep it simple :-)
