Output schema configuration
Overview
This reference page explains how to add an OSDL output schema to an extraction configuration, how to work with the built-in OSDL editor, and how to output results using the configured schema. Understanding the OSDL language is a prerequisite for this guide. If you haven’t looked into OSDL yet, please check the following documentation page.
Adding an OSDL schema to an extraction configuration
Adding an OSDL schema to an extraction configuration is done through the fourth step of DocHeart’s configuration wizard, as illustrated below:
The schema can be defined using the built-in code editor. The editor displays a tag indicating whether the current schema is valid, the error if the schema is invalid, and the format of the schema, as indicated below:
When a schema is invalid, one of the following types of error is reported:
- Invalid schema: the provided schema is not a valid JSON object. A schema that produces this error cannot be saved.
- Parsing error: the provided schema is a valid JSON object, but some of the provided expressions could not be parsed due to a syntax error.
- Binding error: the provided schema is syntactically correct, but it doesn’t match the current extraction configuration. The most common reason behind a binding error is a reference to fields or tables that don’t exist in the configuration.
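The first error type can be caught before a schema is even pasted into the editor. The following is a minimal sketch of such a local pre-check, assuming the schema is written in JSON. The function name `isJsonObject` is hypothetical and not part of DocHeart; the editor performs its own validation, and parsing and binding errors can only be detected by DocHeart itself.

```javascript
// Hypothetical local pre-check, not DocHeart's validator.
// Returns true only when the text parses as JSON and the result is a plain object.
function isJsonObject(schemaText) {
  try {
    const parsed = JSON.parse(schemaText);
    // JSON.parse also accepts arrays, strings, numbers and null, so rule those out.
    return typeof parsed === "object" && parsed !== null && !Array.isArray(parsed);
  } catch (error) {
    return false; // the text is not valid JSON at all
  }
}

console.log(isJsonObject('{"Invoice Data": {}}')); // true
console.log(isJsonObject('["a", "b"]'));           // false
console.log(isJsonObject('{"unterminated": '));    // false
```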
Upload OSDL schema
You can upload an existing OSDL (Output Schema Definition Language) schema in JSON or XML format. To upload the schema, click on the Upload Schema button, as shown below:
Select the file that you want to upload.
Once the file is uploaded, the schema is displayed. You can then edit it and click the Save button to save your changes.
Outputting the extracted data using the configured OSDL schema
All DocHeart endpoints that return extracted data can output the data in the configured OSDL format when the ?show_schema_output=True argument is added to the request. Below, we illustrate the OSDL schema output for a synchronous extraction; the behaviour for the extraction results queries is analogous.
As mentioned in the previous paragraph, to enable the schema output for a synchronous extraction, the ?show_schema_output=True argument needs to be added to the request URL:
-
const response = await fetch("https://api.docheart.ai/docheart/api/extraction/trigger_sync?show_schema_output=True", {
  method: "POST",
  headers: {
    "X-Api": "<api_token>"
  },
  // fetch expects a string body, so the payload is serialized to JSON
  body: JSON.stringify({
    "configuration_name": "<the name of the configuration>",
    "extraction_name": "<the name you give to the current extraction>",
    "content_type": "<application/pdf | image/png | image/jpeg>",
    "save_extraction": "<true | false>", // only for synchronous interactions
    "input_type": "raw", // (or "hosted" / "docheart")
    "document_raw_data": "JVBERi0xLjU..." // (or set "document_host_url" / "document_storage_id")
  })
})
-
# "save_extraction" applies only to synchronous interactions.
# "input_type" may also be "hosted" or "docheart"; instead of "document_raw_data"
# you can set "document_host_url" or "document_storage_id".
curl -X POST "https://api.docheart.ai/docheart/api/extraction/trigger_sync?show_schema_output=True" \
  -H "X-api: <api_token>" \
  -d '{
    "configuration_name": "<the name of the configuration>",
    "extraction_name": "<the name you give to the current extraction>",
    "content_type": "<application/pdf | image/png | image/jpeg>",
    "save_extraction": <true | false>,
    "input_type": "raw",
    "document_raw_data": "JVBERi0xLjU..."
  }'
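Because the same argument applies to every endpoint that returns extracted data, it can be convenient to add it programmatically rather than by string concatenation. The sketch below uses the standard `URL` API; the helper name `withSchemaOutput` is hypothetical, and the second URL in the usage example is a placeholder rather than a real DocHeart endpoint.

```javascript
// Hypothetical helper: adds show_schema_output=True to an endpoint URL
// while preserving any query parameters that are already present.
function withSchemaOutput(endpointUrl) {
  const url = new URL(endpointUrl);
  url.searchParams.set("show_schema_output", "True");
  return url.toString();
}

console.log(withSchemaOutput("https://api.docheart.ai/docheart/api/extraction/trigger_sync"));
// https://api.docheart.ai/docheart/api/extraction/trigger_sync?show_schema_output=True

// Placeholder URL, used only to show that existing parameters are kept.
console.log(withSchemaOutput("https://example.com/results?page=2"));
// https://example.com/results?page=2&show_schema_output=True
```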
The response will have the following format:
-
{
  "extracted_data": {...}, // the original extracted data object
  "schema_output": {...}, // the extracted data in the format defined by the OSDL schema
  "validation": {...} // an object mirroring the schema output, which provides validation information
}
-
{
  "extracted_data": {
    "_id": "660184c1463ccc52a960798b",
    "creation_unix_timestamp": 1711375553.5388951,
    "extracted_document_storage_id": "8931dbbe-ee7c-4bc3-a7e6-ec750fea73a2",
    ...
  },
  "schema_output": {
    "All numerical values in Invoice": ["431", "326", "16", "2", ...],
    "Biggest 2 unique counts": ["40 * 30", "16"],
    "Indexing": {
      "by entries": ["431", "2", "215", "851"],
      "by range": ["326", "16", "2"]
    },
    "Invoice Data": {
      "Contact email address": "emma@fieldlab.nl",
      "Name of contact person": "Emma",
      "Phone number": "(123) 451-7897"
    },
    "Invoice Data Joined": "Emma|emma@fieldlab.nl|(123) 451-7897",
    "Tables": {
      "All tables": [
        [
          ["Product", "Count", "Price"],
          ["Chair", "30", "$20"],
          ["Table", "16", "$87"],
          ["Monitor", "40", "$150"],
          ["Desk", "21", "$100"]
        ]
      ],
      "All tables first row first column": [[["Chair"]]],
      "All tables header and first row": [
        [
          ["Product", "Count", "Price"],
          ["Chair", "30", "$20"]
        ]
      ],
      "First table": [
        [
          ["Product", "Count", "Price"],
          ["Chair", "30", "$20"],
          ["Table", "16", "$87"],
          ["Monitor", "40", "$150"],
          ["Desk", "21", "$100"]
        ]
      ],
      "First table product and price": [
        [
          ["Product", "Price"],
          ["Chair", "$20"],
          ["Table", "$87"],
          ["Monitor", "$150"],
          ["Desk", "$100"]
        ]
      ],
      "Second table": [],
      "Second table count": [],
      "Second table first 2 counts": []
    }
  },
  "validation": {
    "All numerical values in Invoice": { "confidence": 0.9971388697343714, "validation_messages": [], "validation_passed": true },
    "Biggest 2 unique counts": { "confidence": 0.9847305655479431, "validation_messages": [], "validation_passed": true },
    "Indexing": {
      "by entries": { "confidence": 0.9971388697343714, "validation_messages": [], "validation_passed": true },
      "by range": { "confidence": 0.9971388697343714, "validation_messages": [], "validation_passed": true }
    },
    "Invoice Data": {
      "Contact email address": { "confidence": 0.9879816055297852, "validation_messages": [], "validation_passed": true },
      "Name of contact person": { "confidence": 0.9879816055297852, "validation_messages": [], "validation_passed": true },
      "Phone number": { "confidence": 0.9879816055297852, "validation_messages": [], "validation_passed": true }
    },
    "Invoice Data Joined": { "confidence": 0.9879816055297853, "validation_messages": [], "validation_passed": true },
    "Tables": {
      "All tables": { "confidence": 0.9847305655479431, "validation_messages": [], "validation_passed": true },
      "All tables first row first column": { "confidence": 0.9847305655479431, "validation_messages": [], "validation_passed": true },
      "All tables header and first row": { "confidence": 0.9847305655479431, "validation_messages": [], "validation_passed": true },
      "First table": { "confidence": 0.9847305655479431, "validation_messages": [], "validation_passed": true },
      "First table product and price": { "confidence": 0.9847305655479431, "validation_messages": [], "validation_passed": true },
      "Second table": { "confidence": 0.9847305655479431, "validation_messages": [], "validation_passed": true },
      "Second table count": { "confidence": 0.9847305655479431, "validation_messages": [], "validation_passed": true },
      "Second table first 2 counts": { "confidence": 0.9847305655479431, "validation_messages": [], "validation_passed": true }
    }
  }
}
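Since the validation object mirrors the schema output, it can be walked recursively to find entries that need attention. The following is a minimal sketch, assuming that every leaf carries the confidence, validation_messages and validation_passed keys seen in the example response and that no schema group is itself named validation_passed. The function name `collectValidationIssues` and the 0.9 default threshold are illustrative choices, not DocHeart values.

```javascript
// Hypothetical helper: walks the "validation" object and returns the entries
// that failed validation or whose confidence is below minConfidence.
function collectValidationIssues(validation, minConfidence = 0.9, path = []) {
  const issues = [];
  for (const [key, node] of Object.entries(validation)) {
    if (node === null || typeof node !== "object") continue; // skip unexpected values
    const currentPath = [...path, key];
    if ("validation_passed" in node) {
      // Leaf entry: it holds the validation result for one schema output key.
      if (!node.validation_passed || node.confidence < minConfidence) {
        issues.push({
          path: currentPath.join(" > "),
          confidence: node.confidence,
          messages: node.validation_messages,
        });
      }
    } else {
      // Nested group (for example "Invoice Data"): recurse into its children.
      issues.push(...collectValidationIssues(node, minConfidence, currentPath));
    }
  }
  return issues;
}

// A fragment of the example response above.
const validation = {
  "Invoice Data": {
    "Phone number": { confidence: 0.9879816055297852, validation_messages: [], validation_passed: true },
  },
  "Invoice Data Joined": { confidence: 0.9879816055297853, validation_messages: [], validation_passed: true },
};

console.log(collectValidationIssues(validation)); // []
console.log(collectValidationIssues(validation, 0.99).map((issue) => issue.path));
// [ 'Invoice Data > Phone number', 'Invoice Data Joined' ]
```

Returning a flat list with a readable path keeps the check independent of how deeply the OSDL schema nests its keys.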