Interim schemas

What is an interim schema and why use one?

An interim schema is a common schema (set of headers) that uses easy-to-understand terminology and is independent of any specific input or output header names.

Being independent of your target output schema means that if the target output changes, only the final mapping from the interim to the target needs to change – not every instance of a column name in the whole pipeline.

Use it as early as possible in the pipeline

Incoming files should be mapped to the interim schema as early as possible, before any calculations, validations or cleansing steps are carried out. This means changes to the incoming header names need only be dealt with once, in the mapping stage, rather than in all downstream operations relying on that column name.

Likewise, at the end of the pipeline, the data can be mapped from the interim schema to the output format(s).
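
The idea can be sketched outside the product in a few lines of Python. This is only an illustration of the principle, not how the pipeline is implemented. All header names, the commission calculation and the helper functions are hypothetical. The point is that the downstream step refers only to interim names, so a change to the input or output headers touches one mapping and nothing else.

```python
# Hypothetical header names, for illustration only.
INPUT_TO_INTERIM = {
    "Pol No.": "policy_id",
    "GWP (USD)": "gross_premium",
}
INTERIM_TO_OUTPUT = {
    "policy_id": "PolicyReference",
    "gross_premium": "GrossWrittenPremium",
    "net_premium": "NetPremium",
}


def rename_headers(rows, mapping):
    """Return new rows with keys renamed via mapping; unmapped keys are kept."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


def add_net_premium(rows, commission_rate):
    """Example downstream step: it only ever refers to interim header names."""
    return [
        {**row, "net_premium": row["gross_premium"] * (1 - commission_rate)}
        for row in rows
    ]


raw = [{"Pol No.": "A1", "GWP (USD)": 100.0}]

interim = rename_headers(raw, INPUT_TO_INTERIM)        # map as early as possible
processed = add_net_premium(interim, 0.25)             # work in interim names
output = rename_headers(processed, INTERIM_TO_OUTPUT)  # map to the target last
```

If the source file later renames "GWP (USD)", only INPUT_TO_INTERIM changes. If the target format renames "NetPremium", only INTERIM_TO_OUTPUT changes.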

Deploying a schema in a pipeline

To deploy a schema in a pipeline, store the headers as a dataset in the Data Repo. The dataset should have zero rows, to avoid additional rows being injected into the input data.
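
As a rough sketch, a zero-row headers file could be produced with Python's standard csv module before uploading it. The header names and the helper function below are hypothetical, and CSV is assumed as the upload format.

```python
import csv
import io

# Hypothetical interim headers, for illustration only.
INTERIM_HEADERS = ["policy_id", "gross_premium", "inception_date"]


def headers_only_csv(headers):
    """Return CSV text containing the header row and no data rows."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(headers)
    return buffer.getvalue()


schema_csv = headers_only_csv(INTERIM_HEADERS)

# Sanity check: reading the text back yields the headers and zero data rows.
reader = csv.DictReader(io.StringIO(schema_csv))
data_rows = list(reader)
```

Writing schema_csv to a file gives a dataset that carries the schema but adds no rows to the input data.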

Add the reference headers dataset as a stage input and it will appear as a schema in the Map Column Headers operation, where you can define a master from it.

Using existing mappings

If your organisation has a set of existing approved mappings, these can be used to seed the Map Column Headers Automap function, so that approved mappings always have a 100% rating. Contact support if you'd like this done.

Bringing in a new input file

When a new input file is brought in, if the schema is identical to a previous input, the mappings will be remembered. Note that it may take a few seconds for the new input to trace through the pipeline and for the mappings to be applied.

If there are variations in the schema, all mappings will need to be reapplied. Automap will quickly reapply previously made mappings.
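
Before bringing in a new file, it can help to check how its headers differ from the previous input. The sketch below is a hypothetical pre-check written in plain Python. It is not part of the product, and it assumes that "identical" means the same header names in the same order.

```python
def schema_diff(previous_headers, new_headers):
    """Compare two header lists and report what changed."""
    previous, new = set(previous_headers), set(new_headers)
    return {
        # Assumption: identical means same names in the same order.
        "identical": list(previous_headers) == list(new_headers),
        "added": sorted(new - previous),
        "removed": sorted(previous - new),
    }


# Hypothetical header names, for illustration only.
previous_input = ["Pol No.", "GWP (USD)"]
new_input = ["Pol No.", "Gross Premium USD"]

diff = schema_diff(previous_input, new_input)
```

An empty "added" and "removed" with "identical" set to True suggests the remembered mappings should carry over. Anything else indicates the mappings will need to be reapplied.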

Updating a master schema

Updating the interim schema dataset will make the new schema available to all pipelines it is used in. However, it won’t automatically update the Master Schema in the Map Column Headers operation.

The best way to update the Master Schema depends on how big the update is:

If it’s just one column name changing, that can be updated manually in the master schema.
If multiple columns are being added, add them to the master from the Master Headers input, one by one.
If many column names are changing, it may be best to use ‘Define a master from this dataset’. This will copy over all items but may undo mappings that had previously been made.

Changing a Master Schema will require changing column names in downstream operations.

Copying a master schema

To copy a master schema from Map Column Headers:

View the output data from that stage.
Copy the data by clicking in the top left of the data grid to select all items, then pressing Ctrl+C (Cmd+C on Mac).
Paste the items into a spreadsheet.
Remove all rows other than the headers.
Save the spreadsheet, then upload it to the Data Repo.
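
For larger grids, the row-removal step could also be done with a short script instead of a spreadsheet. This is a sketch under two assumptions: that the copied grid pastes as tab-separated text, and that its first row holds the headers. The function name and sample data are hypothetical.

```python
import csv
import io


def keep_header_row(pasted_text, delimiter="\t"):
    """Return CSV text containing only the first (header) row of pasted grid data."""
    reader = csv.reader(io.StringIO(pasted_text), delimiter=delimiter)
    headers = next(reader, [])  # an empty paste yields an empty header list
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(headers)
    return buffer.getvalue()


# Hypothetical pasted data, for illustration only.
pasted = "policy_id\tgross_premium\nA1\t100\nA2\t250\n"
headers_csv = keep_header_row(pasted)
```

The resulting text can be saved as a .csv file and uploaded in the same way as the spreadsheet.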
