Schema definition for validation

Overview

This reference explains how to define validation schemas that can be used to validate extracted data.

A validation schema is an object that encodes the rules by which the Docheart Validation API decides whether an extracted value is valid.

Validation schemas are sent to the Docheart Validation API together with a list of extracted values. The API evaluates each extracted value in the list against the provided validation schema. The result of evaluating one extracted value against a schema is a boolean: True if the value conforms to the validation schema (passes the rules defined in the schema), and False otherwise.

Validation schemas can optionally be included inside extraction models. When a validation schema is included inside an extraction model, the Docheart Extraction Pipeline will automatically evaluate the values extracted using the model against the included validation schema. The result of this evaluation will be added to the extracted data object.

Validation schema data structure

Validation schemas are represented as boolean trees encoding logical expressions. The nodes in the tree are of the following types:

leaf nodes - nodes that don’t have any children. These nodes always encode logical expressions about extracted values. You can think of a leaf node as a function that takes the extracted value as input and returns either true or false as output.
logical nodes - non-terminal nodes that have one or more children. These nodes always encode logical operations on their children. Logical nodes encode either an or operation or an and operation. A logical node encoding an or operation will evaluate to true if at least one of its children evaluates to true. In contrast, a logical node encoding an and operation will evaluate to true if and only if all of its children evaluate to true.
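
The two node types can be modeled in a few lines of Python. This is an illustrative sketch only, not Docheart's implementation: the names `Leaf`, `LogicalNode` and `evaluate` are hypothetical, and leaf expressions are stood in for by plain Python predicates.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Leaf:
    # A leaf is a function from the extracted value to True/False.
    predicate: Callable[[str], bool]


@dataclass
class LogicalNode:
    operation: str  # "and" or "or"
    children: List["Node"]


Node = Union[Leaf, LogicalNode]


def evaluate(node: Node, value: str) -> bool:
    """Recursively evaluate a boolean tree against one extracted value."""
    if isinstance(node, Leaf):
        return node.predicate(value)
    results = [evaluate(child, value) for child in node.children]
    if node.operation == "and":
        return all(results)  # true only if every child is true
    if node.operation == "or":
        return any(results)  # true if at least one child is true
    raise ValueError(f"unknown operation: {node.operation!r}")


# Hypothetical tree: (shorter than 5 characters AND lowercase) OR digits only
tree = LogicalNode("or", [
    LogicalNode("and", [Leaf(lambda v: len(v) < 5), Leaf(str.islower)]),
    Leaf(str.isdigit),
])

print(evaluate(tree, "cat"))
print(evaluate(tree, "Rotterdamsweg 100"))
```

The recursion mirrors the tree: leaves are evaluated first, and each logical node combines the results of its children.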

The image below provides a conceptual view of a validation schema:

Conceptual model validation schema

The grey nodes in the conceptual model represent the leaf nodes encoding logical expressions. The yellow nodes are the logical nodes connecting other nodes through logical operations such as ors or ands.

The conceptual model in the image contains a very simple schema. The Docheart Validation API supports validation schemas of arbitrary size and complexity.

Validation schema evaluation

Now that we have seen how a validation schema is conceptually defined, let us look at the mechanism through which it is evaluated.

Let’s pretend that the value Rotterdamsweg 100 was extracted from a document and we want to check whether it is valid using the Docheart Validation API.

The first step Docheart performs is to substitute the Value token in each of the leaf nodes of the schema with the actual extracted value and evaluate the associated expressions to either true or false, as presented in the image below:

Evaluation step 1

As we can see, the character length of Rotterdamsweg 100 is larger than 5, so the length check fails, and the semantics of Rotterdamsweg 100 are not similar to animal. However, Rotterdamsweg 100 is a valid address. Thus, only one of the 3 leaf nodes will evaluate to true.

The next step is to evaluate the logical nodes, as seen in the image below:

Evaluation step 2

The children of the and node are true and false, so the node will evaluate to false. The children of the or node are both false, so the entire validation schema will return false.
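
The two steps can be replayed in plain Python. This is only a sketch: the address and animal checks run inside Docheart, so their outcomes are hard-coded from the walk-through above, and the node layout is assumed to match the JSON schema at the end of this reference.

```python
value = "Rotterdamsweg 100"

# Step 1: leaf outcomes. Only the length check is computed locally;
# the other two outcomes are copied from the walk-through.
leaf_results = {
    "length < 5 chars": len(value) < 5,   # 17 characters, so False
    "is an address": True,
    "similar to animal": False,
}

# Step 2: combine the leaf outcomes through the logical nodes.
and_node = leaf_results["length < 5 chars"] and leaf_results["is an address"]
or_node = and_node or leaf_results["similar to animal"]

print(and_node)  # False
print(or_node)   # False
```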

Below you can find another example in which the validation schema evaluates to true:

Positive alternative

Leaf node expression types

The Docheart Validation API supports a large number of expression types that can be used as leaf nodes within a validation schema. This gives users considerable flexibility in the validation operations they can perform.

Below you can find the list of supported leaf node expressions:

exact_match - verifies whether or not the extracted value exactly matches a pre-defined value. The rule takes one parameter defining the expected value to match against.

Example:

{
    "type": "leaf",
    "match_rule": "exact_match",
    "match_params": ["Dog"]
}

Explanation:

Verifies that the extracted value is exactly ‘Dog’.

regex - verifies whether or not the extracted value matches a pre-defined regular expression. The rule takes one parameter defining the regular expression to be used in verification.

Example:

{
    "type": "leaf",
    "match_rule": "regex",
    "match_params": ["^c.*l$"]
}

Explanation:

Verifies that the extracted value matches the regex ‘^c.*l$’, that is, it starts with c and ends with l.

language_model - verifies whether or not the extracted value matches a user-defined natural language description by making use of a large language model (LLM). The rule takes 1 parameter defining the natural language description used in the verification.

Example:

{
    "type": "leaf",
    "match_rule": "language_model",
    "match_params": ["animal"]
}

Explanation:

Verifies that the extracted value is an animal (like ‘cat’ or ‘dog’).

category - verifies whether or not the extracted value is part of one of the pre-defined semantic categories. At the moment, Docheart supports the following semantic categories: currency, number.

Example:

{
    "type": "leaf",
    "match_rule": "category",
    "match_params": ["currency"]
}

Explanation:

Verifies that the extracted value is a valid currency (like ‘$300’ or ‘20€’).

length - compares the length of the extracted value with a specified number. The rule takes three parameters: the comparison operator (for example >), the number to compare against, and the unit in which the length is measured, either characters (parameter char) or words (parameter word).

Example:

{
    "type": "leaf",
    "match_rule": "length",
    "match_params": [">", 30, "char"]
}

Explanation:

Verifies that the extracted value has more than 30 characters.

length_between - verifies whether or not the length of the extracted value falls into the open interval defined by 2 fixed numerical values. The comparison is of the form low < length < high. The length of the extracted value can be measured either in characters (parameter char) or in the number of words (parameter word).

Example:

{
    "type": "leaf",
    "match_rule": "length_between",
    "match_params": [10, 50, "word"]
}

Explanation:

Verifies that the number of words in the extracted value is larger than 10 and smaller than 50.

length_between_equal - same as length_between, but the interval is closed, i.e. the comparison is of the form low ≤ length ≤ high.

Example:

{
    "type": "leaf",
    "match_rule": "length_between_equal",
    "match_params": [10, 50, "word"]
}

Explanation:

Verifies that the number of words in the extracted value is greater than or equal to 10 and less than or equal to 50.

anything - returns true regardless of the value of the extracted data.

Example:

{
    "type": "leaf",
    "match_rule": "anything",
    "match_params": []
}
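
To make the semantics of the deterministic rules concrete, here is a minimal local sketch in Python. It is not Docheart's implementation, and several details are assumptions: the function names are hypothetical, words are counted by splitting on whitespace, the regex must match the whole value, and only the > and < operators appear in this reference (the other comparison symbols are guesses). The language_model and category rules depend on Docheart's models and are not covered.

```python
import operator
import re

# Comparison symbols accepted by this sketch. Only ">" and "<" appear in
# the reference; the remaining entries are assumptions.
COMPARATORS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
               "<=": operator.le, "==": operator.eq}


def measure(value: str, unit: str) -> int:
    """Length in characters ("char") or in words ("word")."""
    if unit == "char":
        return len(value)
    if unit == "word":
        return len(value.split())  # assumption: words are split on whitespace
    raise ValueError(f"unknown unit: {unit!r}")


def evaluate_leaf(leaf: dict, value: str) -> bool:
    rule, params = leaf["match_rule"], leaf["match_params"]
    if rule == "exact_match":
        return value == params[0]
    if rule == "regex":
        # assumption: the whole value must match the pattern
        return re.fullmatch(params[0], value) is not None
    if rule == "length":
        symbol, number, unit = params
        return COMPARATORS[symbol](measure(value, unit), number)
    if rule == "length_between":
        low, high, unit = params
        return low < measure(value, unit) < high      # open interval
    if rule == "length_between_equal":
        low, high, unit = params
        return low <= measure(value, unit) <= high    # closed interval
    if rule == "anything":
        return True
    raise NotImplementedError(f"rule {rule!r} cannot be evaluated locally")


ten_words = " ".join(["word"] * 10)
open_rule = {"type": "leaf", "match_rule": "length_between",
             "match_params": [10, 50, "word"]}
closed_rule = {"type": "leaf", "match_rule": "length_between_equal",
               "match_params": [10, 50, "word"]}
print(evaluate_leaf(open_rule, ten_words))    # False: 10 is not inside (10, 50)
print(evaluate_leaf(closed_rule, ten_words))  # True: 10 is inside [10, 50]
```

The last two lines show the only difference between length_between and length_between_equal: whether the interval bounds themselves are accepted.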

Logical node definition

As presented in the previous sections, there are 2 types of logical nodes, corresponding to or and and operations. These 2 types of nodes are encoded in JSON as follows:

And node:

{
    "type": "logical_node",
    "operation": "and",
    "children": [...]
}

Or node:

{
    "type": "logical_node",
    "operation": "or",
    "children": [...]
}
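
When schemas are assembled in code rather than written by hand, small helper functions can produce these node objects. The sketch below is one possible approach; the helper names `leaf` and `logical` are hypothetical and not part of any Docheart library.

```python
import json


def leaf(match_rule: str, *match_params) -> dict:
    """Build a leaf node dictionary."""
    return {"type": "leaf", "match_rule": match_rule,
            "match_params": list(match_params)}


def logical(operation: str, *children: dict) -> dict:
    """Build an "and" or "or" node; it needs at least one child."""
    if operation not in ("and", "or"):
        raise ValueError(f"unsupported operation: {operation!r}")
    if not children:
        raise ValueError("a logical node needs at least one child")
    return {"type": "logical_node", "operation": operation,
            "children": list(children)}


node = logical("and", leaf("length", ">", 30, "char"), leaf("anything"))
print(json.dumps(node, indent=4))
```

Because the helpers return plain dictionaries, the result can be serialized with `json.dumps` and nested to any depth.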

Building a JSON validation schema

The leaf and logical nodes presented above are the building blocks that, when combined, form a validation schema.

Since an example is worth 1000 words, we will illustrate building a JSON validation schema by converting the following conceptual model to a schema:

Conceptual model

Conceptual model validation schema

Corresponding JSON schema

{
    "type": "logical_node",
    "operation": "or",
    "children": [
        {
            "type": "logical_node",
            "operation": "and",
            "children": [
                {
                    "type": "leaf",
                    "match_rule": "length",
                    "match_params": ["<", 5, "char"]
                },
                {
                    "type": "leaf",
                    "match_rule": "language_model",
                    "match_params": ["address"]
                }
            ]
        },
        {
            "type": "leaf",
            "match_rule": "semantic",
            "match_params": ["animal", 0.7]
        }
    ]
}
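
Before sending a schema, it can be useful to check that it is structurally well formed. The following Python function is a hedged sketch of such a check, based only on the node shapes shown in this reference; the name `check_schema` is hypothetical, and the check does not verify rule names or parameters.

```python
def check_schema(node, path: str = "root") -> list:
    """Return a list of structural problems found in a schema (empty if none)."""
    if not isinstance(node, dict):
        return [f"{path}: node must be an object"]
    problems = []
    node_type = node.get("type")
    if node_type == "leaf":
        if not isinstance(node.get("match_rule"), str):
            problems.append(f"{path}: leaf needs a string 'match_rule'")
        if not isinstance(node.get("match_params"), list):
            problems.append(f"{path}: leaf needs a list 'match_params'")
    elif node_type == "logical_node":
        if node.get("operation") not in ("and", "or"):
            problems.append(f"{path}: operation must be 'and' or 'or'")
        children = node.get("children")
        if not isinstance(children, list) or not children:
            problems.append(f"{path}: logical node needs a non-empty 'children' list")
        else:
            # Recurse into every child, recording where in the tree it sits.
            for index, child in enumerate(children):
                problems.extend(check_schema(child, f"{path}.children[{index}]"))
    else:
        problems.append(f"{path}: unknown node type {node_type!r}")
    return problems


good = {"type": "logical_node", "operation": "or", "children": [
    {"type": "leaf", "match_rule": "anything", "match_params": []},
]}
bad = {"type": "logical_node", "operation": "xor", "children": []}

print(check_schema(good))  # []
print(check_schema(bad))
```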