Best practice
Data import pattern

This is the standard pattern for importing files into a pipeline.

Stage 1: Import, Map and Cleanse

Inputs

Set inputs that come from a feed to auto-update, so that each time new data is received it is imported into the pipeline automatically.

1.1 Transform to a single-header table

Usually this requires two steps, in this order:

  1. Remove rows with fewer than [3] columns: if there is data above or below the main table, this operation strips it out. The column filter function in Remove Rows can also be used to strip out subtotals (e.g. remove rows where column ‘unnamed’ contains ‘total’).
  2. Detect headers: define the number of rows to be used as headers, combining double or multiple headers into a single header row (see the sketch after this list).
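
As a concrete illustration of these two steps outside the product, the sketch below shows the equivalent logic in Python with pandas. The file name, the [3]-column threshold and the two-row header are assumptions for the example, not product settings.

    # Illustrative sketch only - equivalent pandas logic, not the product's own functions.
    import pandas as pd

    # Read the raw sheet with no header row, so stray rows above or below the main table
    # arrive as ordinary data and can be filtered out.
    raw = pd.read_csv("incoming_file.csv", header=None, dtype=str)  # hypothetical file

    # Step 1: remove rows with fewer than 3 populated columns (titles, notes, footers).
    table = raw[raw.notna().sum(axis=1) >= 3]

    # Optional column filter: strip subtotal rows, e.g. where the first column contains "total".
    table = table[~table[0].str.contains("total", case=False, na=False)]

    # Step 2: detect headers - here the first two remaining rows form a double header,
    # so combine them into a single header row and drop them from the data.
    header = (table.iloc[0].fillna("") + " " + table.iloc[1].fillna("")).str.strip()
    table = table.iloc[2:].reset_index(drop=True)
    table.columns = header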

Occasionally, there may be files which require additional steps:

1.2 Map Headers to interim schema

Use Map Column Headers to map the incoming data to a master schema. Consider using an interim schema rather than jumping straight to the final output.

Store master headers in the Data Repo
In the Data Repo, store your master schema as a file containing only headers, then bring this into Map Column Headers as a source. Define the master schema from this file and map the other source files to it.
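
The sketch below shows the same idea expressed in pandas, purely as an illustration: a headers-only file defines the master (interim) schema, each incoming file is renamed to those headers, and the result is reindexed so every output shares identical columns. The file paths and header names are assumed examples.

    # Illustrative sketch only - not the Map Column Headers feature itself.
    import pandas as pd

    # Headers-only file held in the Data Repo, defining the master schema.
    master_columns = pd.read_csv("data_repo/master_headers.csv", nrows=0).columns.tolist()

    # Per-source mapping from incoming header names to master header names (assumed values).
    header_map = {"Cust Ref": "customer_id", "Txn Date": "transaction_date", "Amt": "amount"}

    source = pd.read_csv("incoming_file.csv").rename(columns=header_map)

    # Conform to the master schema: missing columns are added empty, extra columns are dropped.
    source = source.reindex(columns=master_columns)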

1.3 Append metadata, basic cleansing and validations
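
As an illustration of what this step typically covers, the pandas sketch below appends source metadata, applies light cleansing and runs a simple mandatory-field check. The column and file names are assumptions for the example, not part of the product.

    # Illustrative sketch only - typical metadata, cleansing and validation steps.
    import pandas as pd

    df = pd.read_csv("mapped_output.csv")  # hypothetical output of step 1.2

    # Append metadata so every row can be traced back to its source and load time.
    df["source_file"] = "mapped_output.csv"
    df["loaded_at"] = pd.Timestamp.now(tz="UTC")

    # Basic cleansing: trim whitespace and treat empty strings as missing values.
    text_cols = df.select_dtypes(include="object").columns
    df[text_cols] = df[text_cols].apply(lambda s: s.str.strip()).replace("", pd.NA)

    # Simple validation: flag rows missing a mandatory field (column name assumed).
    invalid = df[df["customer_id"].isna()]
    if not invalid.empty:
        print(f"{len(invalid)} rows failed validation: missing customer_id")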

Stage 2: Union

Inputs

All Stage 1 outputs

Set Linked Stages to ON for the Stage 1 outputs. When new files are brought into Stage 1 they will flow through to Stage 2 automatically, and then on through the rest of the pipeline.

The Union stage can be followed by any combination of stages required to achieve the target outputs.
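
Conceptually, the Union stage is a row-wise append of the Stage 1 outputs, which by this point all share the master schema. The pandas sketch below shows that equivalent operation; the folder path is an assumption for the example.

    # Illustrative sketch only - a union is equivalent to a row-wise concatenation.
    import glob
    import pandas as pd

    stage1_files = glob.glob("stage1_outputs/*.csv")  # hypothetical location

    # Because every file shares the master schema, the tables stack directly.
    unioned = pd.concat([pd.read_csv(path) for path in stage1_files], ignore_index=True)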

Why we recommend using a Union as Stage 2
The output ID from a Union Stage is immutable, so stages referencing it are not affected by changes to the Union's input files. Using a Union early makes the pipeline faster to Trace and more resilient to configuration changes.