Extraction triggering

Overview

This reference page gives an overview of the API endpoints that can be used to trigger document data extraction via the DocHeart API.

DocHeart supports several ways of triggering extractions. From the client's point of view, an extraction can be performed either synchronously or asynchronously.

When you perform a synchronous extraction, DocHeart keeps your HTTP request pending until the extraction has finished and then returns the extraction results in the HTTP response. Depending on the configuration of the extraction, this process can take up to several seconds.

In contrast, when you trigger an asynchronous extraction, DocHeart returns an HTTP response immediately to let you know that the extraction procedure has started. The results of the extraction are saved internally, and you can query them once they have been saved.

The document the data will be extracted from can be provided in 3 different ways:

  • directly - the document is included in the HTTP request as a base64 encoded string
  • via an external URL - the URL at which the document can be downloaded is included in the HTTP request
  • via DocHeart Vault - the document is uploaded to DocHeart’s internal vault, and a reference to the document is included in the HTTP request.
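
For the direct option, the document bytes have to be base64 encoded before they go into the request. The following is a minimal Node.js sketch of that step; the helper name `toBase64` is illustrative and not part of the DocHeart API.

```javascript
// Hypothetical helper (not part of the DocHeart API): base64-encode raw
// document bytes so they can be included directly in the request body.
function toBase64(bytes) {
    // Buffer is a Node.js global; Buffer.from accepts a Uint8Array or a string.
    return Buffer.from(bytes).toString("base64");
}

// A PDF file starts with an ASCII header such as "%PDF-1.5".
const pdfHeader = new TextEncoder().encode("%PDF-1.5");
console.log(toBase64(pdfHeader)); // prints "JVBERi0xLjU="
```

In practice the bytes would come from the file you want to process, for example from `fs.readFileSync(path)` in Node.js.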

Configuring the extraction trigger

The body of the HTTP request you send to DocHeart to trigger an extraction must contain a JSON object whose properties give DocHeart the information it needs to perform the extraction. These properties are listed below.

  • configuration_name / configuration_id - the name (or id) of the configuration that you want to use to process the document. You need to provide either the configuration_name or the configuration_id, but not both.

  • extraction_name - the name that you want to give to the extraction that you are triggering. The extraction name can be any string value and it doesn’t have to be unique.

  • content_type - the MIME type describing the file format of the document. The 3 accepted MIME types are application/pdf, image/png and image/jpeg, corresponding to a PDF file, a PNG file or a JPG file respectively.

  • save_extraction - a boolean value (default: true) that tells DocHeart whether it should save the extracted data. Saving the extracted data allows you to query and process it later. This property only applies to synchronous extractions; when performing asynchronous extractions, the extracted data is always saved.

  • delete_after_first_fetch - a boolean value (default: false) that tells DocHeart whether to delete the extracted data after the results have been fetched for the first time. This is especially useful if you want to use asynchronous extractions and still have the data deleted afterwards.

  • input_type - defines how the document to extract data from is provided to DocHeart. This property can take 3 values:
    • raw - the document is provided directly as a base64 encoded string.
    • hosted - the document is provided via an external URL from which it can be downloaded.
    • docheart - the document is provided via DocHeart Vault.
  • document_raw_data / document_host_url / document_storage_id - These fields are used to specify the document to extract the data from. You only need to provide one of these 3 fields. The field you provide depends on how you set the input_type:
    • If input_type is set to raw, you need to provide the document as a base64 encoded string through the document_raw_data field.
    • If input_type is set to hosted, you need to provide the external URL from which the document can be downloaded through the document_host_url field.
    • If input_type is set to docheart, you need to provide the DocHeart vault storage id of the document through the document_storage_id field.

Below you can find an example of a trigger configuration object for each of the document input types:

  • {
        "configuration_name": "Invoice Config",
        "extraction_name": "Invoice Extraction 2",
        "content_type": "application/pdf",
        "save_extraction": true,
        "input_type": "raw",
        "document_raw_data": "JVBERi0xLjU..."
    }
    
  • {
        "configuration_name": "Invoice Config",
        "extraction_name": "Invoice Extraction 2",
        "content_type": "application/pdf",
        "save_extraction": true,
        "input_type": "hosted",
        "document_host_url": "https://example-document-host/document1"
    }
    
  • {
        "configuration_name": "Invoice Config",
        "extraction_name": "Invoice Extraction 2",
        "content_type": "application/pdf",
        "save_extraction": true,
        "input_type": "docheart",
        "document_storage_id": "d8d83a55-b2e5-425b-a648-9a22db37bc38"
    }
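
The rules above can be checked on the client before a request is sent. The sketch below is one possible way to do that; `checkTriggerBody` and the two lookup structures are hypothetical names, not part of the DocHeart API, and the server may apply additional checks that are not covered here.

```javascript
// Hypothetical client-side check (not part of the DocHeart API) that mirrors
// the property rules described above. It returns a list of problems; an
// empty list means the object is consistent with those rules.
const ACCEPTED_CONTENT_TYPES = ["application/pdf", "image/png", "image/jpeg"];

// The document field that has to accompany each input_type.
const DOCUMENT_FIELD_BY_INPUT_TYPE = new Map([
    ["raw", "document_raw_data"],
    ["hosted", "document_host_url"],
    ["docheart", "document_storage_id"],
]);

function checkTriggerBody(body) {
    const problems = [];

    // Exactly one of configuration_name / configuration_id must be present.
    const hasName = body.configuration_name !== undefined;
    const hasId = body.configuration_id !== undefined;
    if (hasName === hasId) {
        problems.push("provide either configuration_name or configuration_id, but not both");
    }

    if (!ACCEPTED_CONTENT_TYPES.includes(body.content_type)) {
        problems.push("content_type must be application/pdf, image/png or image/jpeg");
    }

    const requiredField = DOCUMENT_FIELD_BY_INPUT_TYPE.get(body.input_type);
    if (requiredField === undefined) {
        problems.push("input_type must be raw, hosted or docheart");
    } else if (body[requiredField] === undefined) {
        problems.push(`${requiredField} is required when input_type is ${body.input_type}`);
    }

    return problems;
}

// The "hosted" example from above passes the check.
console.log(checkTriggerBody({
    configuration_name: "Invoice Config",
    extraction_name: "Invoice Extraction 2",
    content_type: "application/pdf",
    save_extraction: true,
    input_type: "hosted",
    document_host_url: "https://example-document-host/document1",
})); // prints []
```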
    

Synchronous trigger endpoint

This endpoint is used to trigger a synchronous extraction. An example of how to make a request to this endpoint is provided below. In this example, the document is provided directly, but all input types described above are supported:

  • const response = await fetch("https://api.docheart.ai/docheart/api/extraction/trigger_sync", {
        method: "POST",
        headers: {
            "X-Api": "<api_token>"
        },
        body: JSON.stringify({
            "configuration_name": "<the name of the configuration>", 
            "extraction_name": "<the name you give to the current extraction>",
            "content_type": "<application/pdf | image/png | image/jpeg>", 
            "save_extraction": true, // or false; only for synchronous extractions
            "input_type": "raw", // (or "hosted" / "docheart")
            "document_raw_data": "JVBERi0xLjU..." // (or set "document_host_url" / "document_storage_id")
        })
    })
    
  • # save_extraction applies only to synchronous extractions.
    # input_type can also be "hosted" or "docheart"; in that case set
    # "document_host_url" or "document_storage_id" instead of "document_raw_data".
    curl -X POST https://api.docheart.ai/docheart/api/extraction/trigger_sync \
    -H "X-Api: <api_token>" \
    -d '{
        "configuration_name": "<the name of the configuration>",
        "extraction_name": "<the name you give to the current extraction>",
        "content_type": "<application/pdf | image/png | image/jpeg>",
        "save_extraction": <true | false>,
        "input_type": "raw",
        "document_raw_data": "JVBERi0xLjU..."
    }'
    
  • {
        "configuration_name": "Demo Invoice Mock",
        "extraction_name": "API Sync 3 Demo",
        "content_type": "application/pdf",
        "save_extraction": true,
        "input_type": "raw",
        "document_raw_data": "JVBERi0xLjQKJeLjz...."
    }
    
  • {
        "id": "65c1d57761b015cb9884b62d", // id of the response object
        "user_id": "7ZawPyIwfZYt3LTlmdRCJhV0vlG3", // the id of the user who made the request (your id)
        "extraction_info": { // information about the extraction
            "id": "65c1d57461b015cb9884b625", // the id of the extraction info object
            "api_key_hash": "c487c3eed215c73541aa86e82a563c895275c1a7936555ecbef873e16440dace", // id unique for the API key used to trigger the extraction
            "extraction_id": "5f93ebd5-cb89-48ff-9a7b-d45943b1e9c9", // the id of the extraction
            "extraction_name": "API Sync 3 Demo", // the name of the extraction
            "trigger_user_id": "7ZawPyIwfZYt3LTlmdRCJhV0vlG3",
            "trigger_unix_timestamp": 1707201908.3176403, // the unix timestamp of the extraction
            "trigger_type": "sync", // whether the extraction was synchronous or asynchronous
            "input_type": "raw", // how the document was inputted ("raw", "hosted", or "docheart"), as explained above
            "extracted_data_saved": true, // whether or not the extracted data was saved within DocHeart's database.
            "used_extraction_configuration_id": "65c1d47df09cb9a0e925972e", // the id of the configuration used for the extraction
            "used_extraction_configuration_name": "Demo Invoice Mock", // the name of the configuration used for the extraction
            "preview": false // whether or not this was a preview extraction (for all user triggered extractions, this field will be false)
        },
        "extracted_targets": [ // the data extracted from the document
            {
                "id": "38311d8f-95d3-4d9f-9455-20472c41198e", // the id of the extracted target
                "used_extraction_group_id": "f42b849a-c338-4408-87e4-5829385b1908", // the id of the extraction group this target belongs to
                "used_extraction_target_id": "c0e671db-1de2-4e68-ba07-f42f971bd3dc", // the id of the extraction target used during extraction
                "type": "field", // whether this extracted target represents a field or a table
                "extracted_values": [ // the list of values extracted for the field
                    {
                        "id": "e4440346-9d78-4ea1-908f-128b8a77467e", // the id of the extracted value
                        "extracted_field": "Fountain Fresh Imports Ltd", // the actual data that was extracted from the document
                        "validation_passed": true, // whether or not the extracted value passed the validation
                        "validation_messages": [] // error messages explaining why the validation failed, empty if the validation passed
                    }
                ],
                "confidence": 0.9979146003723145 // how confident DocHeart is about the extraction
            }
        ],
        "creation_unix_timestamp": 1707201911.2523508, // the unix timestamp at which the extraction finished
        "extracted_document_storage_id": "6656f22b-10f6-4b72-9642-8f8f322386ed", // the DocHeart Vault storage id where the document was saved.
        "used_extraction_configuration": {...} // the object encoding the configuration used for this extraction
    }
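
Once a synchronous response has been parsed, the extracted values can be read from `extracted_targets`. The sketch below flattens them into a simple list. The helper name `listExtractedFields` is hypothetical, the field names are taken from the example response above, and targets whose type is not "field" are skipped because their structure is not shown on this page.

```javascript
// Hypothetical helper (not part of the DocHeart API): collect the values of
// all extracted fields from a parsed synchronous response.
function listExtractedFields(response) {
    const rows = [];
    for (const target of response.extracted_targets ?? []) {
        // Only "field" targets are handled in this sketch.
        if (target.type !== "field") continue;
        for (const value of target.extracted_values ?? []) {
            rows.push({
                targetId: target.used_extraction_target_id,
                value: value.extracted_field,
                valid: value.validation_passed,
                messages: value.validation_messages,
                confidence: target.confidence,
            });
        }
    }
    return rows;
}

// Trimmed copy of the example response above.
const exampleSyncResponse = {
    extracted_targets: [
        {
            used_extraction_target_id: "c0e671db-1de2-4e68-ba07-f42f971bd3dc",
            type: "field",
            extracted_values: [
                {
                    extracted_field: "Fountain Fresh Imports Ltd",
                    validation_passed: true,
                    validation_messages: [],
                },
            ],
            confidence: 0.9979146003723145,
        },
    ],
};

for (const row of listExtractedFields(exampleSyncResponse)) {
    console.log(`${row.value} (valid: ${row.valid})`);
}
```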
    

Asynchronous trigger endpoint

This endpoint is used to trigger an asynchronous extraction. An example of how to make a request to this endpoint is provided below. In this example, the document is provided via DocHeart Vault, but all input types are supported.

  • const response = await fetch("https://api.docheart.ai/docheart/api/extraction/trigger_async", {
        method: "POST",
        headers: {
            "X-Api": "<api_token>"
        },
        body: JSON.stringify({
            "configuration_name": "<the name of the configuration>", 
            "extraction_name": "<the name you give to the current extraction>",
            "content_type": "<application/pdf | image/png | image/jpeg>", 
            "input_type": "docheart", // (or "hosted" / "raw")
            "document_storage_id": "6656f22b-10f6-4b72-9642-8f8f322386ed" // (or set "document_host_url" / "document_raw_data")
        })
    })
    
  • # input_type can also be "hosted" or "raw"; in that case set
    # "document_host_url" or "document_raw_data" instead of "document_storage_id".
    curl -X POST https://api.docheart.ai/docheart/api/extraction/trigger_async \
    -H "X-Api: <api_token>" \
    -d '{
        "configuration_name": "<the name of the configuration>",
        "extraction_name": "<the name you give to the current extraction>",
        "content_type": "<application/pdf | image/png | image/jpeg>",
        "input_type": "docheart",
        "document_storage_id": "6656f22b-10f6-4b72-9642-8f8f322386ed"
    }'
    
  • {
        "configuration_name": "Demo Invoice Mock",
        "extraction_name": "API Async Demo",
        "content_type": "application/pdf",
        "input_type": "docheart",
        "document_storage_id": "6656f22b-10f6-4b72-9642-8f8f322386ed"
    }
    
  • {
        "id": "65c1de0084e5d82f3a8a5cfc", // the id of the extraction info object
        "api_key_hash": "c487c3eed215c73541aa86e82a563c895275c1a7936555ecbef873e16440dace", // id unique for the API key used to trigger the extraction
        "extraction_id": "2225d9c3-c6e7-494c-b3c5-20d8e1dda484", // the id of the extraction
        "extraction_name": "API Async Demo", // the name of the extraction
        "trigger_user_id": "7ZawPyIwfZYt3LTlmdRCJhV0vlG3", // the id of the user who triggered the extraction (your id)
        "trigger_unix_timestamp": 1707204096.6902654, // the unix timestamp at which the extraction was triggered
        "trigger_type": "async", // whether the extraction was synchronous or asynchronous
        "input_type": "docheart", // how the document was inputted ("raw", "hosted", or "docheart"), as explained above
        "extracted_data_saved": true, // whether or not the extracted data was saved within DocHeart's database (always true for asynchronous extractions)
        "used_extraction_configuration_id": "65c1d47df09cb9a0e925972e", // the id of the configuration used for the extraction
        "used_extraction_configuration_name": "Demo Invoice Mock", // the name of the configuration used for the extraction
        "preview": false // whether or not this was a preview extraction (for all user triggered extractions, this field will be false)
    }
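
The two trigger endpoints take the same kind of request and differ only in the last path segment, so one helper can prepare the call for either of them. The sketch below is an illustration under stated assumptions: `buildTriggerRequest` is a hypothetical name, and dropping save_extraction for asynchronous requests is a choice made here (the property only applies to synchronous extractions), not documented DocHeart behaviour.

```javascript
const TRIGGER_BASE_URL = "https://api.docheart.ai/docheart/api/extraction";

// Hypothetical helper (not part of the DocHeart API): build the URL and the
// fetch options for a synchronous ("sync") or asynchronous ("async") trigger.
function buildTriggerRequest(mode, apiToken, body) {
    if (mode !== "sync" && mode !== "async") {
        throw new Error(`unknown mode: ${mode}`);
    }
    const payload = { ...body };
    if (mode === "async") {
        // Assumption of this sketch: save_extraction is left out because
        // asynchronous extractions always save the extracted data.
        delete payload.save_extraction;
    }
    return {
        url: `${TRIGGER_BASE_URL}/trigger_${mode}`,
        options: {
            method: "POST",
            headers: { "X-Api": apiToken },
            body: JSON.stringify(payload),
        },
    };
}

// Usage (inside an async function):
//   const { url, options } = buildTriggerRequest("async", "<api_token>", body);
//   const response = await fetch(url, options);
//   const extractionInfo = await response.json();
```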
    

The Extraction Id

As you might have noticed, whether you trigger a synchronous or an asynchronous extraction, the response body contains an extraction_id field (nested under extraction_info in synchronous responses). The extraction id is important because it is the unique handle for the extraction you have just triggered. Whenever you want to query the status of an extraction or get the extraction results, you have to provide the extraction id.
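
Because the extraction id sits at a different level in the two example responses above, a small helper can read it from either shape. This is a sketch only; `getExtractionId` is a hypothetical name and not part of the DocHeart API.

```javascript
// Hypothetical helper (not part of the DocHeart API): read the extraction id
// from a parsed trigger response.
function getExtractionId(response) {
    // In the synchronous example the id is nested under extraction_info;
    // in the asynchronous example it is a top-level field.
    const id = response.extraction_info?.extraction_id ?? response.extraction_id;
    if (typeof id !== "string") {
        throw new Error("response does not contain an extraction_id");
    }
    return id;
}

// Ids taken from the example responses above.
const syncId = getExtractionId({
    extraction_info: { extraction_id: "5f93ebd5-cb89-48ff-9a7b-d45943b1e9c9" },
});
const asyncId = getExtractionId({
    extraction_id: "2225d9c3-c6e7-494c-b3c5-20d8e1dda484",
});
console.log(syncId, asyncId);
```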