Best practice
This is the standard pattern for importing files to a pipeline.
Stage 1: Import, Map and Cleanse
Inputs
- Raw data files from a Feed (or a direct upload to the pipeline)
- Master schema reference file from the Data Repo
Set the inputs from a feed to auto-update, so that each time new data is received it is imported into the pipeline automatically.
1.1 Transform to a single-header table
Usually this requires two steps, in this order (a sketch of the equivalent logic follows this list):
- Remove rows with fewer than [3] columns: if there is data above or below the main table, this operation strips it out. The column filter function in Remove Rows can also be used to strip out subtotals (e.g. remove rows where column ‘unnamed’ contains ‘total’).
- Detect headers: define the number of rows to be used as headers, combining double or multiple headers into a single header row.
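For illustration only, here is a minimal pandas sketch of the same two steps. The file name, the [3]-column threshold, the two-row header and the ‘total’ filter are assumptions, not product configuration:

```python
import pandas as pd

raw = pd.read_excel("submission.xlsx", header=None)  # hypothetical input file

# Remove rows: keep only rows with at least 3 populated cells, stripping
# titles, notes and blank padding above/below the main table.
table = raw[raw.notna().sum(axis=1) >= 3]

# Column filter: drop subtotal rows, e.g. where the first column mentions 'total'.
first_col = table.iloc[:, 0].astype(str)
table = table[~first_col.str.contains("total", case=False)]

# Detect headers: combine the first two remaining rows into a single header row.
header = (table.iloc[0].fillna("").astype(str) + " " +
          table.iloc[1].fillna("").astype(str)).str.strip()
table = table.iloc[2:]
table.columns = header
```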
Occasionally, there may be files that require additional steps.
1.2 Map headers to an interim schema
Use Map Column Headers to map the incoming data to a master schema. Consider mapping to an interim schema rather than jumping straight to the final output, so that later changes to the output schema do not force every source to be re-mapped.
Store master headers in the Data Repo
In the Data Repo, store your master schema as a file containing only headers, then bring this into Map Column Headers as a source. Define the master schema from this file and map the other source files to it.
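As a hypothetical sketch of the same idea outside the product: the interim schema can be read from the headers-only file, and each source renamed to match it. The file paths and the mapping dictionary below are assumptions for illustration:

```python
import pandas as pd

# Read only the header row of the master schema file held in the Data Repo.
master = pd.read_csv("data_repo/master_schema_headers.csv", nrows=0)
master_columns = list(master.columns)

# Per-source mapping from incoming header -> master header (illustrative).
header_map = {"Claim Ref": "claim_reference", "Incept Dt": "inception_date"}

source = pd.read_csv("stage1_source.csv").rename(columns=header_map)

# Align to the master schema: add any missing columns (as blanks),
# drop columns that were not mapped.
source = source.reindex(columns=master_columns)
```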
1.3 Append metadata, basic cleansing and validations
- Date cleanse: converts dates to QT format.
- Append metadata to add columns such as the input filename, pipeline name, ID, run time and run number. These can be used to trace the source of a row after the data has been unioned. Submission metadata such as the submission period may also be useful.
- Validation operation to apply data quality checks (columns are not blank, claim values are not negative, the inception date is before the expiry date, etc). A sketch of equivalent checks follows this list.
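A rough pandas equivalent of these three operations, assuming illustrative column names, date format and run details:

```python
import pandas as pd

df = pd.read_csv("mapped_source.csv")  # hypothetical Stage 1 output

# Date cleanse: parse date columns into a consistent datetime format.
for col in ("inception_date", "expiry_date"):
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors="coerce")

# Append metadata so each row can be traced back after the union.
df["source_filename"] = "mapped_source.csv"
df["pipeline_run"] = 42              # hypothetical run number
df["submission_period"] = "2024-Q1"  # hypothetical submission metadata

# Validations: flag rows that fail basic data quality checks.
problems = (
    df["claim_reference"].isna()
    | (df["claim_value"] < 0)
    | (df["inception_date"] >= df["expiry_date"])
)
failed = df[problems]
```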
Stage 2: Union
Inputs
- All Stage 1 outputs

Set Linked Stages to ON on the Stage 1 outputs. When new files are brought into Stage 1 they will flow through to Stage 2 automatically, and then through to the rest of the pipeline.
The Union stage can be followed by any combination of stages required to achieve the target outputs.
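Conceptually, the Union simply stacks all Stage 1 outputs into one table. A minimal sketch, assuming the outputs are CSV files in a stage1/ folder:

```python
import glob
import pandas as pd

# Read every Stage 1 output and stack them into a single table.
stage1_outputs = [pd.read_csv(path) for path in glob.glob("stage1/*.csv")]
unioned = pd.concat(stage1_outputs, ignore_index=True, sort=False)
```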
Why we recommend using a Union as Stage 2
The output ID from a Union Stage is immutable, so stages referencing it are not affected by changes to the Union's input files. Using a Union early makes the pipeline faster to Trace and more resilient to configuration changes.