Validating against a Reference Dataset
Scenario: Confirm that every value in a column is using a value from a table of reference values, e.g. check that a ‘policy_transaction_type’ column contains only valid transaction types, according to a reference table.
-
Set the source data as your left dataset, your ref data as your right dataset.
-
Use the 'policy_transaction_type' column in both the input data and the reference data as the join point
-
Configure two outputs:
-
Output 1: Matched rows + Unmatched from left (this will return all the values in the source data)
-
Output 2: Unmatched from left (this will return only the invalid rows from the input data)
You can additionally create an entry in the
Validation Report
to visualise the proportion of failed rows:
-
In a subsequent Transform stage, take Output 1 from the Join as input.
-
Add a
Validate operation
and enter a validation condition to return a warning or failure when 'policy_transaction_type’ column from the source data does not match the ‘policy_transaction_type’ from the joined-in reference data.
-
The failed rows will be reported on in the Validation Report. You can view and download them directly from there – this means you can dispense with Output 2 if desired.
Correct invalid values by fuzzy matching to a reference table
Invalid values identified by the process above can be corrected using fuzzy matching. This is accomplished using
Automap Values
.
In this case you would:
-
Create a Transform stage taking Output 2 from the Join (unmatched rows) as its input.
-
Insert a Map Values operation:
Select ’policy_transaction_type’ column as the ‘map from' value
-
In the ’Settings' column on the right:
Select whether to write the values to a new column (recommended) or to overwrite the existing column.
Select a value to return when the value is unmapped, or leave it unchanged.
-
Select ‘Automap’:
Choose the reference dataset and column with the approved values.
We recommend leaving the match strength at zero for your first run and leaving
waypointing
deactivated. At this point you can try running the pipeline and seeing what kind of results you are getting. A fuzzy match will be performed against the ref dataset with the results reported on in the
Mapping Report
. You can review these results, adjust match strength accordingly to filter out weak matches, and perform remappings.
-
If desired, you can now join the cleansed values back into the main data.
Note that the Automap Values could equally be performed on Output 1 from the join (returning a 100% match for the correct values). The advantage of this is that it simplifies the pipeline by not requiring an additional step to join cleansed values back in. The disadvantage is that Automap values may take longer to run if it is analysing every row, rather than just the rows we know to be invalid.