An interim schema is a common schema (set of headers) that uses easy-to-understand terminology and is independent of any specific input or output header names.
Being independent of your target output schema means that if the target output changes, only the final mapping from the interim to the target needs to change – not every instance of a column name in the whole pipeline.
Incoming files should be mapped to the interim schema as early as possible, before any calculations, validations or cleansing steps are carried out. This means changes to the incoming header names need only be dealt with once, in the mapping stage, rather than in all downstream operations relying on that column name.
Likewise, at the end of the pipeline, the data can be mapped from the interim schema to the output format(s).
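The idea can be sketched outside the product in plain Python. This is only an illustration of the pattern, not how the pipeline is implemented: all header names, the rename_columns helper and the add_net_premium step are hypothetical. The point is that the calculation step refers only to interim names, so a change to input or output headers touches just one of the two mapping dictionaries.

```python
# Hypothetical header names, for illustration only.
INPUT_TO_INTERIM = {
    "Pol No.": "policy_id",
    "GWP (USD)": "gross_premium",
}
INTERIM_TO_OUTPUT = {
    "policy_id": "PolicyReference",
    "gross_premium": "GrossWrittenPremium",
    "net_premium": "NetPremium",
}


def rename_columns(rows, mapping):
    """Return new rows with keys renamed via mapping; unmapped keys are kept as-is."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


def add_net_premium(rows, commission_rate=0.25):
    """A downstream step written purely against interim column names."""
    return [
        {**row, "net_premium": row["gross_premium"] * (1 - commission_rate)}
        for row in rows
    ]


raw = [{"Pol No.": "A1", "GWP (USD)": 1000.0}]

# 1. Map to the interim schema first, 2. do the work, 3. map to the output last.
interim = add_net_premium(rename_columns(raw, INPUT_TO_INTERIM))
output = rename_columns(interim, INTERIM_TO_OUTPUT)
```

If the source file later renames "Pol No." to something else, only INPUT_TO_INTERIM changes; add_net_premium and INTERIM_TO_OUTPUT are untouched.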
To deploy a schema in a pipeline, store the headers as a dataset in the Data Repo. The dataset should have zero rows, to avoid additional rows being injected into the input data.
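As a rough sketch of what a zero-row headers dataset looks like, the snippet below builds a CSV containing only a header row. It assumes the schema is uploaded as a CSV file, which may not match your setup, and the header names and the headers_only_csv helper are hypothetical.

```python
import csv
import io

# Hypothetical interim headers, for illustration only.
INTERIM_HEADERS = ["policy_id", "inception_date", "gross_premium"]


def headers_only_csv(headers):
    """Return CSV text consisting of a single header row and no data rows."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(headers)
    return buffer.getvalue()


schema_csv = headers_only_csv(INTERIM_HEADERS)

# Reading the text back confirms the headers are present and there are zero data rows.
reader = csv.DictReader(io.StringIO(schema_csv))
data_rows = list(reader)
```

Because data_rows is empty, adding this dataset as an input contributes column names only and no records.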
Add the reference headers dataset as a stage input and it will appear as a schema in the Map Column Headers operation, where you can define a Master Schema from it.
If your organisation has a set of existing approved mappings, these can be used to seed the Map Column Headers Automap function, so that approved mappings always have a 100% rating. Contact support if you'd like this done.
When a new input file is brought in, if the schema is identical to a previous input, the mappings will be remembered. Note that it may take a few seconds for the new input to trace through the pipeline and for the mappings to be applied.
If there are variations in the schema, all mappings will need to be reapplied. Automap will quickly reapply previously made mappings.
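The effect of reapplying previously made mappings to a changed schema can be pictured with the simplified sketch below. It is not the Automap implementation, which is not described here; it assumes exact matching on header names, and the reapply_mappings helper and all header names are hypothetical.

```python
def reapply_mappings(input_headers, previous_mappings):
    """Split headers into those with a previously made mapping and those still to map."""
    mapped = {h: previous_mappings[h] for h in input_headers if h in previous_mappings}
    unmapped = [h for h in input_headers if h not in previous_mappings]
    return mapped, unmapped


# Mappings made for an earlier input file (hypothetical names).
previous_mappings = {"Pol No.": "policy_id", "GWP (USD)": "gross_premium"}

# The new file has one extra column, so its schema is no longer identical.
new_headers = ["Pol No.", "GWP (USD)", "Broker"]
mapped, unmapped = reapply_mappings(new_headers, previous_mappings)
```

Here the two known headers are mapped again without manual work, and only "Broker" is left for review.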
Updating the interim schema dataset will make the new schema available to all pipelines it is used in. However, it won’t automatically update the Master Schema in the Map Column Headers operation.
The best way to update the Master Schema depends on how big the update is:
Changing a Master Schema will require changing column names in downstream operations.