Best practice
This is the standard pattern for importing files to a pipeline.
Stage 1: Import, Map and Cleanse
Inputs
- Raw data files from a Feed (or a direct upload to the pipeline)
- Master schema reference file from the Data Repo
Set the inputs from a feed to auto-update, so that each time new data is received it is imported into the pipeline automatically.
1.1 Transform to a single-header table
Usually this requires two steps, in this order (a sketch of the equivalent logic follows this list):
- Remove rows with fewer than [3] columns: if there is data above or below the main table, this operation strips it out. The column filter function in Remove Rows can also be used to strip out subtotals (e.g. remove rows where column ‘unnamed’ contains ‘total’).
- Detect headers: define the number of rows to be used as headers, combining double or multiple headers into a single header row.
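For illustration only, here is a minimal pandas sketch of the same two steps. The file name, the [3]-column threshold, the two-row header and the ‘total’ filter are assumptions, not product configuration:

```python
import pandas as pd

raw = pd.read_excel("submission.xlsx", header=None)  # hypothetical input file

# Remove rows: keep only rows with at least 3 populated cells, stripping
# titles, notes and blank padding above/below the main table.
table = raw[raw.notna().sum(axis=1) >= 3]

# Column filter: drop subtotal rows, e.g. where the first column mentions 'total'.
first_col = table.iloc[:, 0].astype(str)
table = table[~first_col.str.contains("total", case=False)]

# Detect headers: combine the first two remaining rows into a single header row.
header = (table.iloc[0].fillna("").astype(str) + " " +
          table.iloc[1].fillna("").astype(str)).str.strip()
table = table.iloc[2:]
table.columns = header
```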
Occasionally, there may be files that require additional steps.
1.2 Map headers to an interim schema
Use Map Column Headers to map the incoming data to a master schema. Consider mapping to an interim schema rather than jumping straight to the final output, so that later changes to the output schema do not force every source to be re-mapped.
Store master headers in the Data Repo
In the Data Repo, store your master schema as a file containing only headers, then bring this into Map Column Headers as a source. Define the master schema from this file and map the other source files to it.
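As a hypothetical sketch of the same idea outside the product: the interim schema can be read from the headers-only file, and each source renamed to match it. The file paths and the mapping dictionary below are assumptions for illustration:

```python
import pandas as pd

# Read only the header row of the master schema file held in the Data Repo.
master = pd.read_csv("data_repo/master_schema_headers.csv", nrows=0)
master_columns = list(master.columns)

# Per-source mapping from incoming header -> master header (illustrative).
header_map = {"Claim Ref": "claim_reference", "Incept Dt": "inception_date"}

source = pd.read_csv("stage1_source.csv").rename(columns=header_map)

# Align to the master schema: add any missing columns (as blanks),
# drop columns that were not mapped.
source = source.reindex(columns=master_columns)
```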
1.3 Append metadata, basic cleansing and validations
- Date cleanse: converts dates to QT format.
- Append metadata to add columns such as the input filename, pipeline name, ID, run time and run number. These can be used to trace the source of a row after the data has been unioned. Submission metadata such as the submission period may also be useful.
- Validation operation to apply data quality checks (columns are not blank, claim values are not negative, the inception date is before the expiry date, etc). A sketch of equivalent checks follows this list.
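A rough pandas equivalent of these three operations, assuming illustrative column names, date format and run details:

```python
import pandas as pd

df = pd.read_csv("mapped_source.csv")  # hypothetical Stage 1 output

# Date cleanse: parse date columns into a consistent datetime format.
for col in ("inception_date", "expiry_date"):
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors="coerce")

# Append metadata so each row can be traced back after the union.
df["source_filename"] = "mapped_source.csv"
df["pipeline_run"] = 42              # hypothetical run number
df["submission_period"] = "2024-Q1"  # hypothetical submission metadata

# Validations: flag rows that fail basic data quality checks.
problems = (
    df["claim_reference"].isna()
    | (df["claim_value"] < 0)
    | (df["inception_date"] >= df["expiry_date"])
)
failed = df[problems]
```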
Stage 2: Union
Inputs
- All Stage 1 outputs

Set Linked Stages to ON on the Stage 1 outputs. When new files are brought into Stage 1 they will flow through to Stage 2 automatically, and then through to the rest of the pipeline.
The Union stage can be followed by any combination of stages required to achieve the target outputs.
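Conceptually, the Union simply stacks all Stage 1 outputs into one table. A minimal sketch, assuming the outputs are CSV files in a stage1/ folder:

```python
import glob
import pandas as pd

# Read every Stage 1 output and stack them into a single table.
stage1_outputs = [pd.read_csv(path) for path in glob.glob("stage1/*.csv")]
unioned = pd.concat(stage1_outputs, ignore_index=True, sort=False)
```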
Why we recommend using a Union as Stage 2
The output ID from a Union Stage is immutable, so stages referencing it are not affected by changes to the Union's input files. Using a Union early makes the pipeline faster to Trace and more resilient to configuration changes.