Extraction results queries

Overview

This reference page gives an overview of the API endpoints you can use to query the extraction results that DocHeart saves internally once a triggered extraction finishes.

The next section explains how to interpret the extracted data object, which is the object that contains the results of an extraction. The sections after that briefly cover the endpoints you can use to obtain the extracted data.

The Extracted Data Object

As the name suggests, the extracted data object is an object that contains the results of a DocHeart extraction. It is made up of various fields and components that contain useful information about the extraction and validation processes, as well as the extracted data itself. Before we can go over the extracted data object, however, it is important to understand what extraction targets and extracted targets are.

Extraction Target

An extraction target is an object that is part of a DocHeart configuration. Its role is to describe a unit of extraction within the DocHeart system: it tells the DocHeart pipeline what you want to extract from the document. The DocHeart extraction pipeline can be seen as an engine that takes in a set of extraction targets and a document and outputs a set of extracted targets, which contain the data extracted from the document according to the specification defined by the extraction targets. Since DocHeart can perform two types of extraction (field and table), there are two types of extraction targets (field and table) and two corresponding types of extracted targets. An example of each type of extraction target is provided below:

  • {
        "id": "c0e671db-1de2-4e68-ba07-f42f971bd3dc",
        "search_area": "24b896bc-f4a8-40fb-ab1c-c8ae4071825a",
        "search_area_type": "box",
        "extraction_group_id": "f42b849a-c338-4408-87e4-5829385b1908",
        "creation_unix_timestamp": 1707201713.2052312,
        "type": "field",
        "name": "Company name",
        "description": "name of company",
        "example": "Fieldlab",
        "validation_tree": {
            "name": "Validation Tree",
            "rule_groups": []
        },
        "is_multiple": false
    }
    
  • {
        "id": "59e9b850-fbbf-4463-b38a-001472a5e411",
        "search_area": "all",
        "search_area_type": "all",
        "extraction_group_id": "83be75ab-953e-4766-8565-359c9861961b",
        "creation_unix_timestamp": 1707215974.5210805,
        "type": "table",
        "header_schema": [
            {
                "id": "1403ef26-8dab-4907-969a-ff48737fe966",
                "name": "Product",
                "header_matching_validation_tree": {
                    "name": "matching_tree",
                    "rule_groups": []
                },
                "extracted_value_validation_tree": {
                    "name": "validation_tree",
                    "rule_groups": []
                },
                "position": 0
            },
            {
                "id": "18ac5be2-ec8c-4244-bb66-1c245d076f84",
                "name": "Count",
                "header_matching_validation_tree": {
                    "name": "matching_tree",
                    "rule_groups": []
                },
                "extracted_value_validation_tree": {
                    "name": "validation_tree",
                    "rule_groups": []
                },
                "position": 1
            },
            {
                "id": "db45b26d-3016-43a7-8c8a-9d400a4dc98f",
                "name": "Price",
                "header_matching_validation_tree": {
                    "name": "matching_tree",
                    "rule_groups": []
                },
                "extracted_value_validation_tree": {
                    "name": "validation_tree",
                    "rule_groups": []
                },
                "position": 2
            }
        ]
    }
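
To make the difference between the two target types concrete, here is a minimal sketch that reads the column names of a table extraction target. It assumes that `position` gives the zero-based column order, which the example above suggests but does not state; `headerNames` is a hypothetical helper, not part of the DocHeart API.

```javascript
// Hypothetical helper: returns the column names of a table extraction target,
// ordered by "position". Field extraction targets have no header_schema, so an
// empty list is returned for them.
function headerNames(extractionTarget) {
    if (extractionTarget.type !== "table") {
        return [];
    }
    // Copy before sorting so the original configuration object is not mutated.
    return [...extractionTarget.header_schema]
        .sort((a, b) => a.position - b.position)
        .map((header) => header.name);
}

// Trimmed-down version of the table extraction target shown above.
const tableTarget = {
    type: "table",
    header_schema: [
        { name: "Count", position: 1 },
        { name: "Product", position: 0 },
        { name: "Price", position: 2 },
    ],
};

console.log(headerNames(tableTarget)); // [ 'Product', 'Count', 'Price' ]
```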
    

Extracted Target

As described in the previous section, extracted targets are the objects that the extraction pipeline produces from the document and the set of extraction targets, as illustrated in the figure below.

[Figure: Pipeline process]

The mapping between extraction targets and extracted targets is one-to-one. This means that, within one extraction, every extraction target has exactly one child extracted target and every extracted target has exactly one parent extraction target. The id of the parent extraction target is contained inside the child extracted target within the used_extraction_target_id field.
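
The one-to-one mapping can be followed in code by looking up each extracted target's `used_extraction_target_id`. The sketch below assumes you already have the list of extraction targets at hand (for example, from your configuration); `pairWithParents` and the sample variables are hypothetical names used only for illustration.

```javascript
// Hypothetical helper: pairs every extracted target with its parent
// extraction target, using used_extraction_target_id as the lookup key.
function pairWithParents(extractionTargets, extractedTargets) {
    const parentsById = new Map(extractionTargets.map((target) => [target.id, target]));
    return extractedTargets.map((extracted) => ({
        extracted,
        // null signals that no matching parent was found in the given list.
        parent: parentsById.get(extracted.used_extraction_target_id) ?? null,
    }));
}

// Trimmed-down samples that reuse the ids from the examples on this page.
const sampleExtractionTargets = [
    { id: "c0e671db-1de2-4e68-ba07-f42f971bd3dc", type: "field", name: "Company name" },
];
const sampleExtractedTargets = [
    {
        id: "906fd471-973a-4973-993f-06e1c498af0a",
        used_extraction_target_id: "c0e671db-1de2-4e68-ba07-f42f971bd3dc",
        type: "field",
    },
];

const pairs = pairWithParents(sampleExtractionTargets, sampleExtractedTargets);
console.log(pairs[0].parent.name); // Company name
```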

Since DocHeart can perform both field and table extractions, an extracted target can be of two types: field and table. A field extracted target contains one or more values extracted from the document that match the description of that field, as defined by its corresponding extraction target. Similarly, a table extracted target contains the cells of one or more tables extracted from the document. Examples of the child extracted targets that correspond to the two parent extraction targets defined above can be seen below:

  • {
        "id": "906fd471-973a-4973-993f-06e1c498af0a",
    
        // the extraction group this target belongs to
        "used_extraction_group_id": "f42b849a-c338-4408-87e4-5829385b1908",
    
        // this indicates the parent extraction target
        "used_extraction_target_id": "c0e671db-1de2-4e68-ba07-f42f971bd3dc",
    
        // this indicates that the object is a field extracted target
        "type": "field",
    
        // this contains the list of extracted values for the field.
        "extracted_values": [
            {
                "id": "fe8d3e39-1b97-4a95-a96f-6d3a1f2ab145",
    
                // the value extracted for the field
                "extracted_field": "Fountain Fresh Imports Ltd",
    
                // true if validation succeeded, false otherwise
                "validation_passed": true,
    
                // list of messages that indicate why validation has failed, in case it did
                "validation_messages": []
            }
        ],
    
        // the confidence of the extraction between 0 (minimum) and 1 (maximum)
        "confidence": 0.9979146003723145
    }
    
  • {
        "id": "7099e694-1378-40ff-9169-1001d0e71555",
    
        // the extraction group this target belongs to
        "used_extraction_group_id": "83be75ab-953e-4766-8565-359c9861961b",
    
        // this indicates the parent extraction target
        "used_extraction_target_id": "59e9b850-fbbf-4463-b38a-001472a5e411",
    
        // this indicates that the object is a table extracted target
        "type": "table",
    
        // this contains the list of cells for the table(s).
        "extracted_values": [
            // the first extracted cell
            {
                "id": "b8defa1d-f05d-4095-96c8-d9845ebcaae6",
    
                // value extracted for the cell
                "extracted_cell": "Total Amount",
    
                // the id of the table this cell belongs to
                "table_id": "4395d1e6-da7a-4f98-bf99-9b873a02f388",
    
                // the row position of the cell within the table
                "row": 0,
    
                // the column position of the cell within the table
                "column": 0,
    
                // true, if validation passed, false if validation failed
                "validation_passed": true,
    
                // list of messages that indicate why validation has failed, in case it did
                "validation_messages": []
            },
    
            // the second extracted cell
            {
                "id": "fc8346e3-5b16-45ad-968d-ab7b77ff8781",
                "extracted_cell": "\n44",
                "table_id": "4395d1e6-da7a-4f98-bf99-9b873a02f388",
                "row": 0,
                "column": 1,
                "validation_passed": true,
                "validation_messages": []
            },
            ...
        ],
        // the confidence of the extraction between 0 (minimum) and 1 (maximum)
        "confidence": 0.9890895883242289
    }
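
Because a table extracted target stores its cells as a flat list, you will usually want to regroup them by `table_id`, `row` and `column`. The following is a minimal sketch of that regrouping, assuming `row` and `column` are zero-based indices as the example above suggests; `buildTables` and the sample cell values are hypothetical.

```javascript
// Hypothetical helper: groups the cells of a table extracted target into one
// two-dimensional array (rows of cell values) per table_id.
function buildTables(tableExtractedTarget) {
    const tables = new Map();
    for (const cell of tableExtractedTarget.extracted_values) {
        if (!tables.has(cell.table_id)) {
            tables.set(cell.table_id, []);
        }
        const rows = tables.get(cell.table_id);
        if (!rows[cell.row]) {
            rows[cell.row] = [];
        }
        // Positions with no extracted cell simply stay empty.
        rows[cell.row][cell.column] = cell.extracted_cell;
    }
    return tables;
}

// Made-up cells for a single 2x2 table.
const sampleTableTarget = {
    type: "table",
    extracted_values: [
        { extracted_cell: "Product", table_id: "t1", row: 0, column: 0 },
        { extracted_cell: "Count", table_id: "t1", row: 0, column: 1 },
        { extracted_cell: "Apples", table_id: "t1", row: 1, column: 0 },
        { extracted_cell: "3", table_id: "t1", row: 1, column: 1 },
    ],
};

console.log(buildTables(sampleTableTarget).get("t1"));
// [ [ 'Product', 'Count' ], [ 'Apples', '3' ] ]
```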
    

Extracted Data Object

Now that we understand extraction targets, extracted targets, and the relationship between them, we can move on to the extracted data object. The extracted data object simply aggregates all of the extracted targets resulting from an extraction. On top of that, it contains information about the extraction, as well as a copy of the extraction configuration used in the extraction (which contains all of the extraction targets that generated the extracted targets). An annotated example of an extracted data object is provided below:

{
    "id": "65c20d3d84e5d82f3a8a5d0e",
    "user_id": "7ZawPyIwfZYt3LTlmdRCJhV0vlG3",
    "extraction_info": {
        // the id of the extraction info object
        "id": "65c20d3684e5d82f3a8a5d06",

        // id unique for the API key used to trigger the extraction
        "api_key_hash": "c487c3eed215c73541aa86e82a563c895275c1a7936555ecbef873e16440dace",

        // the id of the extraction
        "extraction_id": "be18f53b-5cf7-4787-b3d1-cd87810384a7",

        // the name of the extraction
        "extraction_name": "API Sync 3 Demo",

        // the id of the user who triggered the extraction (your id)
        "trigger_user_id": "7ZawPyIwfZYt3LTlmdRCJhV0vlG3",

        // the unix timestamp at which the extraction was triggered
        "trigger_unix_timestamp": 1707216182.5737329,

        // whether the extraction was synchronous or asynchronous
        "trigger_type": "sync",

        // how the document was inputted ("raw", "hosted", or "docheart"), as explained in the section about extraction triggering
        "input_type": "raw",

        // whether or not the extracted data was saved within DocHeart's database
        "extracted_data_saved": true,

        // the id of the configuration used for the extraction
        "used_extraction_configuration_id": "65c1d47df09cb9a0e925972e",

        // the name of the configuration used for the extraction
        "used_extraction_configuration_name": "Demo Invoice Mock",
        "preview": false
    },

    // the list of extracted targets collected in this extraction
    "extracted_targets": [
        {
            "id": "906fd471-973a-4973-993f-06e1c498af0a",
            "used_extraction_group_id": "f42b849a-c338-4408-87e4-5829385b1908",
            "used_extraction_target_id": "c0e671db-1de2-4e68-ba07-f42f971bd3dc",
            "type": "field",
            "extracted_values": [
                {
                    "id": "fe8d3e39-1b97-4a95-a96f-6d3a1f2ab145",
                    "extracted_field": "Fountain Fresh Imports Ltd",
                    "validation_passed": true,
                    "validation_messages": []
                }
            ],
            "confidence": 0.9979146003723145
        },
        {
            "id": "7099e694-1378-40ff-9169-1001d0e71555",
            "used_extraction_group_id": "83be75ab-953e-4766-8565-359c9861961b",
            "used_extraction_target_id": "59e9b850-fbbf-4463-b38a-001472a5e411",
            "type": "table",
            "extracted_values": [
                {
                    "id": "b8defa1d-f05d-4095-96c8-d9845ebcaae6",
                    "extracted_cell": "Total Amount",
                    "table_id": "4395d1e6-da7a-4f98-bf99-9b873a02f388",
                    "row": 0,
                    "column": 0,
                    "validation_passed": true,
                    "validation_messages": []
                },
                {
                    "id": "fc8346e3-5b16-45ad-968d-ab7b77ff8781",
                    "extracted_cell": "\n44",
                    "table_id": "4395d1e6-da7a-4f98-bf99-9b873a02f388",
                    "row": 0,
                    "column": 1,
                    "validation_passed": true,
                    "validation_messages": []
                },
                ...
            ],
            "confidence": 0.9890895883242289
        }
    ],

    // the unix timestamp at which the extracted data object was created
    "creation_unix_timestamp": 1707216189.0296614,

    // the DocHeart Vault storage id of the extracted document
    "extracted_document_storage_id": "ae378e24-f925-4dbe-9c55-e222059b16fe",

    // a copy of the configuration object used in the extraction (understanding the details of this object is not necessary)
    "used_extraction_configuration": {...}
}
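
A common first step after receiving an extracted data object is to check which values failed validation. The sketch below walks the `extracted_targets` list and collects those values together with their messages; `failedValidations` is a hypothetical helper and the sample object is a made-up, trimmed-down extracted data object.

```javascript
// Hypothetical helper: lists every extracted value whose validation failed,
// for both field and table extracted targets.
function failedValidations(extractedData) {
    const failures = [];
    for (const target of extractedData.extracted_targets) {
        for (const value of target.extracted_values) {
            if (value.validation_passed) {
                continue;
            }
            failures.push({
                extractionTargetId: target.used_extraction_target_id,
                // Field values live in extracted_field, table cells in extracted_cell.
                value: target.type === "field" ? value.extracted_field : value.extracted_cell,
                messages: value.validation_messages,
            });
        }
    }
    return failures;
}

// Made-up sample with one passing field value and one failing table cell.
const sampleExtractedData = {
    extracted_targets: [
        {
            used_extraction_target_id: "field-target",
            type: "field",
            extracted_values: [
                { extracted_field: "Fieldlab", validation_passed: true, validation_messages: [] },
            ],
        },
        {
            used_extraction_target_id: "table-target",
            type: "table",
            extracted_values: [
                { extracted_cell: "n/a", validation_passed: false, validation_messages: ["not a number"] },
            ],
        },
    ],
};

console.log(failedValidations(sampleExtractedData).length); // 1
```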

The following sections will provide the endpoints that can be used to get the extracted data.

Get extracted data endpoint

This is the main endpoint for fetching extracted data. You provide the id of the extraction whose results you want as part of the request URL, and the endpoint returns the extracted data object described in the previous section.

An example of how to perform this request is provided below:

  • const response = await fetch("https://api.docheart.ai/docheart/api/extraction/result/get/{extraction_id}", {
        method: "GET",
        headers: {
            "X-Api": "<api_token>"
        }
    })
    
    
  • curl -X GET https://api.docheart.ai/docheart/api/extraction/result/get/{extraction_id} \
    -H "X-api: <api_token>" 
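
In practice you will probably want to wrap the request in a small function that also checks the HTTP status and parses the body. The sketch below does that under two assumptions that this page does not confirm: that the endpoint returns the extracted data object as JSON, and that failures are signalled with a non-2xx status code. `getExtractedData` is a hypothetical name, and the `fetchImpl` parameter exists only so the function can be tried out without network access.

```javascript
const RESULT_BASE_URL = "https://api.docheart.ai/docheart/api/extraction/result";

// Hypothetical wrapper around the "get extracted data" request.
// fetchImpl defaults to the global fetch and can be replaced by a stub.
async function getExtractedData(extractionId, apiToken, fetchImpl = fetch) {
    const url = `${RESULT_BASE_URL}/get/${encodeURIComponent(extractionId)}`;
    const response = await fetchImpl(url, {
        method: "GET",
        headers: { "X-Api": apiToken },
    });
    // Assumption: errors are reported through a non-2xx HTTP status.
    if (!response.ok) {
        throw new Error(`Request failed with HTTP status ${response.status}`);
    }
    // Assumption: the body is the JSON-encoded extracted data object.
    return response.json();
}
```

A call then looks like `const extractedData = await getExtractedData("<extraction_id>", "<api_token>")`.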
    
    

List extracted targets endpoint

The extracted data object is large and complex. When you only need the extracted data, without any of the additional information about the extraction, you can call this endpoint instead. It returns only the list of extracted targets from the extracted data object and drops everything else.

An example of how to perform this request, followed by an example response, is provided below:

  • const response = await fetch("https://api.docheart.ai/docheart/api/extraction/result/list/{extraction_id}", {
        method: "GET",
        headers: {
            "X-Api": "<api_token>"
        }
    })
    
  • curl -X GET https://api.docheart.ai/docheart/api/extraction/result/list/{extraction_id} \
    -H "X-api: <api_token>" 
    
    
  • {
        "extracted_targets": [
            {
                "id": "691af3d8-0810-45e1-b466-d0c84c8ccf05",
                "used_extraction_group_id": "d1febb78-420d-4794-915a-de8013b51136",
                "used_extraction_target_id": "86c5f369-e8c8-4728-8da5-c26b274bf809",
                "type": "field",
                "extracted_values": [
                    {
                        "id": "14edd937-2990-4cc8-b9c6-ea833570889b",
                        "extracted_field": "Emma",
                        "validation_passed": true,
                        "validation_messages": []
                    }
                ],
                "confidence": 1.0
            },
            {
                "id": "7d705d08-c71a-4b11-8b7f-0b2752b33c1c",
                "used_extraction_group_id": "d1febb78-420d-4794-915a-de8013b51136",
                "used_extraction_target_id": "c717358c-0ade-477a-8ad7-515bea12dcf3",
                "type": "field",
                "extracted_values": [
                    {
                        "id": "566c7b1e-376c-4bac-b98f-76648a884724",
                        "extracted_field": "emma@fieldlab.nl",
                        "validation_passed": false,
                        "validation_messages": [
                            "`emma@fieldlab.nl` does not match description `phone number`"
                        ]
                    }
                ],
                "confidence": 1.0
            },
            {
                "id": "4959db49-1478-4623-9f88-8981c73bea2e",
                "used_extraction_group_id": "ebf97cf7-e797-4f19-8939-391b59780f18",
                "used_extraction_target_id": "bea66abc-47de-4677-a0b6-23d72ef96078",
                "type": "table",
                "extracted_values": [
                    {
                        "id": "dd38d1ae-c04c-4c1a-8804-1380df070187",
                        "extracted_cell": "Product",
                        "table_id": "e813dbe5-31eb-497b-a013-9f062cb3dce5",
                        "row": 0,
                        "column": 0,
                        "validation_passed": true,
                        "validation_messages": []
                    },
                    ...
                ],
                "confidence": 0.9804787668916914
            },
            ...
        ]
    }
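
Once you have a response like the one above, the field values can be gathered with a few lines of code. The following sketch collects the extracted values of every field target, keyed by the id of its parent extraction target; `fieldValuesByTarget` is a hypothetical helper, and the sample is a trimmed-down copy of the response above.

```javascript
// Hypothetical helper: maps each parent extraction target id to the list of
// values extracted for it. Table extracted targets are skipped.
function fieldValuesByTarget(listResponse) {
    const valuesByTarget = {};
    for (const target of listResponse.extracted_targets) {
        if (target.type !== "field") {
            continue;
        }
        valuesByTarget[target.used_extraction_target_id] = target.extracted_values.map(
            (value) => value.extracted_field
        );
    }
    return valuesByTarget;
}

// Trimmed-down copy of the example response.
const sampleListResponse = {
    extracted_targets: [
        {
            used_extraction_target_id: "86c5f369-e8c8-4728-8da5-c26b274bf809",
            type: "field",
            extracted_values: [{ extracted_field: "Emma", validation_passed: true }],
        },
        {
            used_extraction_target_id: "bea66abc-47de-4677-a0b6-23d72ef96078",
            type: "table",
            extracted_values: [{ extracted_cell: "Product", row: 0, column: 0 }],
        },
    ],
};

console.log(fieldValuesByTarget(sampleListResponse));
// { '86c5f369-e8c8-4728-8da5-c26b274bf809': [ 'Emma' ] }
```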