Output schema configuration

Overview

This reference page explains how to add an OSDL output schema to an extraction configuration, how to work with the built-in OSDL editor, and how to output results using the configured schema. Understanding the OSDL language is a prerequisite for this guide. If you haven’t looked into OSDL yet, please read the OSDL documentation page first.

Adding an OSDL schema to an extraction configuration

An OSDL schema is added to an extraction configuration in the 4th step of DocHeart’s configuration wizard, as illustrated below:

  • osdl_conf_nums

The schema can be defined using our built-in code editor. The editor displays a tag that shows whether the current schema is valid, along with the format of the schema. If the schema is invalid, the tag also shows the error, as indicated below:

  • Schema Valid
  • Schema Invalid

When a schema is invalid, multiple types of errors are possible:

  • Invalid schema - the provided schema is not a valid JSON object. A schema that produces this error cannot be saved.
  • Parsing error - the provided schema is a valid JSON object, but some of its expressions could not be parsed due to a syntax error.
  • Binding error - the provided schema is syntactically correct, but it doesn’t match the current extraction configuration. The most common cause of a binding error is a referencer that references fields or tables that don’t exist in the configuration.
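
The first error category can be checked locally before a JSON-format schema is pasted into the editor. The snippet below is a minimal sketch of such a pre-check. The helper name `isJsonObject` is hypothetical and not part of DocHeart. It only tests whether the text parses as a JSON object, so it cannot detect parsing or binding errors, which depend on OSDL itself and on the extraction configuration.

```javascript
// Hypothetical helper (not part of DocHeart): returns true when `text`
// parses as a JSON object, i.e. not an array, string, number, boolean or null.
function isJsonObject(text) {
  try {
    const parsed = JSON.parse(text);
    return typeof parsed === "object" && parsed !== null && !Array.isArray(parsed);
  } catch (err) {
    // JSON.parse throws a SyntaxError on malformed input
    return false;
  }
}

console.log(isJsonObject('{"key": "value"}'));  // true
console.log(isJsonObject('{"key": "value",}')); // false: trailing comma is not valid JSON
console.log(isJsonObject('["a", "b"]'));        // false: an array, not an object
```

A text that passes this check can still be rejected by the editor with a parsing or binding error.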

Upload OSDL schema

You can upload an existing OSDL (Output Schema Definition Language) schema in JSON or XML format. To upload the schema, click on the Upload Schema button, as shown below:

Schema Button

Select the file that you want to upload.

Schema Upload

Once uploaded, you will be able to see the schema. You can then edit the schema, and after making your changes, click the Save button to save it.

Save Schema

Outputting the extracted data using the configured OSDL schema

All DocHeart endpoints that return extracted data can be configured to output the data in the configured OSDL format by adding the ?show_schema_output=True argument. Below, we will illustrate the OSDL schema output for a synchronous extraction. However, the behaviour for the extraction results queries is analogous.

As mentioned in the previous paragraph, to enable the schema output for a synchronous extraction, the ?show_schema_output=True argument needs to be added to the request URL:

  • const response = await fetch("https://api.docheart.ai/docheart/api/extraction/trigger_sync?show_schema_output=True", {
        method: "POST",
        headers: {
            "X-Api": "<api_token>"
        },
        // fetch does not serialize plain objects, so the body is sent as a JSON string
        body: JSON.stringify({
            "configuration_name": "<the name of the configuration>",
            "extraction_name": "<the name you give to the current extraction>",
            "content_type": "<application/pdf | image/png | image/jpeg>",
            "save_extraction": "<true | false>", // only for synchronous interactions
            "input_type": "raw", // (or "hosted" / "docheart")
            "document_raw_data": "JVBERi0xLjU..." // (or set "document_host_url" / "document_storage_id")
        })
    })
    
  • # "save_extraction" applies only to synchronous interactions.
    # "input_type" can be "raw", "hosted" or "docheart"; for the latter two, set
    # "document_host_url" or "document_storage_id" instead of "document_raw_data".
    curl -X POST "https://api.docheart.ai/docheart/api/extraction/trigger_sync?show_schema_output=True" \
    -H "X-api: <api_token>" \
    -d '{
        "configuration_name": "<the name of the configuration>",
        "extraction_name": "<the name you give to the current extraction>",
        "content_type": "<application/pdf | image/png | image/jpeg>",
        "save_extraction": <true | false>,
        "input_type": "raw",
        "document_raw_data": "JVBERi0xLjU..."
    }'
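
Since the same argument applies to every endpoint that returns extracted data, it can be convenient to append it programmatically instead of editing URLs by hand. The following is a minimal sketch using the standard `URL` class. The helper name `withSchemaOutput` is an assumption made for illustration and is not part of the DocHeart API.

```javascript
// Hypothetical helper: returns a copy of `endpoint` with show_schema_output=True set,
// keeping any query arguments that are already present on the URL.
function withSchemaOutput(endpoint) {
  const url = new URL(endpoint);
  url.searchParams.set("show_schema_output", "True");
  return url.toString();
}

const syncUrl = withSchemaOutput("https://api.docheart.ai/docheart/api/extraction/trigger_sync");
console.log(syncUrl);
// https://api.docheart.ai/docheart/api/extraction/trigger_sync?show_schema_output=True
```

Using `searchParams.set` rather than string concatenation avoids producing a second `?` when the endpoint already carries query arguments.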
    

The response will have the following format:

  • {
        "extracted_data": {...}, // the original extracted data object
        "schema_output": {...}, // the extracted data in the format defined by the OSDL schema
        "validation": {...} // an object mirroring the schema output, which provides validation information
    }
    
  • {
        "extracted_data": {
            "_id": "660184c1463ccc52a960798b",
            "creation_unix_timestamp": 1711375553.5388951,
            "extracted_document_storage_id": "8931dbbe-ee7c-4bc3-a7e6-ec750fea73a2",
            ...
        },
        "schema_output": {
            "All numerical values in Invoice": ["431","326", "16","2",...],
            "Biggest 2 unique counts": ["40 * 30", "16"],
            "Indexing": {
                "by entries": [
                    "431",
                    "2",
                    "215",
                    "851"
                ],
                "by range": [
                    "326",
                    "16",
                    "2"
                ]
            },
            "Invoice Data": {
                "Contact email address": "emma@fieldlab.nl",
                "Name of contact person": "Emma",
                "Phone number": "(123) 451-7897"
            },
            "Invoice Data Joined": "Emma|emma@fieldlab.nl|(123) 451-7897",
            "Tables": {
                "All tables": [
                    [
                        [
                            "Product",
                            "Count",
                            "Price"
                        ],
                        [
                            "Chair",
                            "30",
                            "$20"
                        ],
                        [
                            "Table",
                            "16",
                            "$87"
                        ],
                        [
                            "Monitor",
                            "40",
                            "$150"
                        ],
                        [
                            "Desk",
                            "21",
                            "$100"
                        ]
                    ]
                ],
                "All tables first row first column": [
                    [
                        [
                            "Chair"
                        ]
                    ]
                ],
                "All tables header and first row": [
                    [
                        [
                            "Product",
                            "Count",
                            "Price"
                        ],
                        [
                            "Chair",
                            "30",
                            "$20"
                        ]
                    ]
                ],
                "First table": [
                    [
                        [
                            "Product",
                            "Count",
                            "Price"
                        ],
                        [
                            "Chair",
                            "30",
                            "$20"
                        ],
                        [
                            "Table",
                            "16",
                            "$87"
                        ],
                        [
                            "Monitor",
                            "40",
                            "$150"
                        ],
                        [
                            "Desk",
                            "21",
                            "$100"
                        ]
                    ]
                ],
                "First table product and price": [
                    [
                        [
                            "Product",
                            "Price"
                        ],
                        [
                            "Chair",
                            "$20"
                        ],
                        [
                            "Table",
                            "$87"
                        ],
                        [
                            "Monitor",
                            "$150"
                        ],
                        [
                            "Desk",
                            "$100"
                        ]
                    ]
                ],
                "Second table": [],
                "Second table count": [],
                "Second table first 2 counts": []
            }
        },
        "validation": {
            "All numerical values in Invoice": {
                "confidence": 0.9971388697343714,
                "validation_messages": [],
                "validation_passed": true
            },
            "Biggest 2 unique counts": {
                "confidence": 0.9847305655479431,
                "validation_messages": [],
                "validation_passed": true
            },
            "Indexing": {
                "by entries": {
                    "confidence": 0.9971388697343714,
                    "validation_messages": [],
                    "validation_passed": true
                },
                "by range": {
                    "confidence": 0.9971388697343714,
                    "validation_messages": [],
                    "validation_passed": true
                }
            },
            "Invoice Data": {
                "Contact email address": {
                    "confidence": 0.9879816055297852,
                    "validation_messages": [],
                    "validation_passed": true
                },
                "Name of contact person": {
                    "confidence": 0.9879816055297852,
                    "validation_messages": [],
                    "validation_passed": true
                },
                "Phone number": {
                    "confidence": 0.9879816055297852,
                    "validation_messages": [],
                    "validation_passed": true
                }
            },
            "Invoice Data Joined": {
                "confidence": 0.9879816055297853,
                "validation_messages": [],
                "validation_passed": true
            },
            "Tables": {
                "All tables": {
                    "confidence": 0.9847305655479431,
                    "validation_messages": [],
                    "validation_passed": true
                },
                "All tables first row first column": {
                    "confidence": 0.9847305655479431,
                    "validation_messages": [],
                    "validation_passed": true
                },
                "All tables header and first row": {
                    "confidence": 0.9847305655479431,
                    "validation_messages": [],
                    "validation_passed": true
                },
                "First table": {
                    "confidence": 0.9847305655479431,
                    "validation_messages": [],
                    "validation_passed": true
                },
                "First table product and price": {
                    "confidence": 0.9847305655479431,
                    "validation_messages": [],
                    "validation_passed": true
                },
                "Second table": {
                    "confidence": 0.9847305655479431,
                    "validation_messages": [],
                    "validation_passed": true
                },
                "Second table count": {
                    "confidence": 0.9847305655479431,
                    "validation_messages": [],
                    "validation_passed": true
                },
                "Second table first 2 counts": {
                    "confidence": 0.9847305655479431,
                    "validation_messages": [],
                    "validation_passed": true
                }
            }
        }
    }
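
Because the validation object mirrors the structure of the schema output, it can be walked recursively to find entries that deserve a manual check. The snippet below is a rough sketch of that idea, not an official client. The function names, the " > " path separator and the confidence threshold are assumptions chosen for illustration, and a node is treated as a leaf whenever it carries a "validation_passed" property.

```javascript
// Flattens a "validation" object into one entry per schema key.
// Assumption: a node is a leaf when it has a "validation_passed" property.
function flattenValidation(node, path = []) {
  if (node === null || typeof node !== "object") return [];
  if ("validation_passed" in node) {
    return [{
      path: path.join(" > "),
      confidence: node.confidence,
      passed: node.validation_passed,
      messages: node.validation_messages,
    }];
  }
  // Otherwise descend into nested groups such as "Invoice Data" or "Tables"
  return Object.entries(node).flatMap(([key, child]) =>
    flattenValidation(child, [...path, key])
  );
}

// Hypothetical review policy: flag entries that failed validation
// or whose confidence is below `minConfidence`.
function needsReview(validation, minConfidence) {
  return flattenValidation(validation).filter(
    (entry) => !entry.passed || entry.confidence < minConfidence
  );
}

// Trimmed copy of the "validation" object from the example response above
const validation = {
  "Invoice Data Joined": { confidence: 0.9879816055297853, validation_messages: [], validation_passed: true },
  "Invoice Data": {
    "Phone number": { confidence: 0.9879816055297852, validation_messages: [], validation_passed: true },
  },
  "Tables": {
    "First table": { confidence: 0.9847305655479431, validation_messages: [], validation_passed: true },
  },
};

console.log(needsReview(validation, 0.985).map((entry) => entry.path));
// [ 'Tables > First table' ]
```

The same path can then be used to look up the corresponding value in "schema_output", since both objects share their keys.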