Extraction status queries

Overview

This page describes how to query the status of an extraction process. As data is extracted, the document passes through a series of processing stages in the extraction pipeline. Querying the latest status of an extraction tells you which stage the pipeline has reached and whether the document has finished processing.

When you query statuses through the DocHeart API, it returns status objects in the following format, shown first as a field-by-field description and then as an example:

  • {
        "id": "<the id of the status object>",
        "status_type": "<encodes the status the extraction pipeline is currently in>",
        "message": "<any message associated to the status>",
        "creation_unix_timestamp": "<the unix timestamp of when the extraction pipeline reached this status>",
        "extraction_id": "<the id of the extraction for which the status was queried>",
        "trigger_user_id": "<the id of the user who triggered the extraction>"
    }
    
  • {
        "id": "65c1e72ce50339f3b09ae8cb",
        "status_type": "not_started",
        "message": "",
        "creation_unix_timestamp": 1707206444.0057235,
        "extraction_id": "590ee831-bad4-461d-bf3d-946fffa66ad8",
        "trigger_user_id": "7ZawPyIwfZYt3LTlmdRCJhV0vlG3"
    }
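
The creation_unix_timestamp in the example above is a number of seconds with a fractional part, whereas a JavaScript Date is built from milliseconds. The sketch below shows one way to convert it. statusCreatedAt is a hypothetical helper name, not part of the DocHeart API, and the sketch assumes the timestamp is always expressed in seconds, as in the example.

```javascript
// Hypothetical helper (not part of the DocHeart API).
// Assumption: creation_unix_timestamp is expressed in seconds, as in the example above.
function statusCreatedAt(statusObject) {
    // Date expects milliseconds since the Unix epoch
    return new Date(statusObject.creation_unix_timestamp * 1000);
}

// the example status object from above
const exampleStatus = {
    id: "65c1e72ce50339f3b09ae8cb",
    status_type: "not_started",
    message: "",
    creation_unix_timestamp: 1707206444.0057235,
    extraction_id: "590ee831-bad4-461d-bf3d-946fffa66ad8",
    trigger_user_id: "7ZawPyIwfZYt3LTlmdRCJhV0vlG3"
};

// print the creation time as an ISO 8601 string (UTC)
console.log(statusCreatedAt(exampleStatus).toISOString());
```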
    

The field that describes the status of the extraction is status_type. The list below gives every value the status_type field can take, along with its meaning:

  • not_started - the extraction process has not started yet
  • started - the extraction process has just started
  • doc_converted_to_png - the document has just been converted to PNG
  • target_groups_formed - the extraction targets have been grouped by the search area in preparation for the extraction (don’t worry if you don’t fully understand this, it’s related to the inner workings of the pipeline)
  • extraction_started - the pipeline has just started extracting data from the document (all the statuses before were about preprocessing the document)
  • extraction_ended - the pipeline has just finished extracting data from the document
  • usage_statistics_aggregated - the statistics about the amount of resources used during the extraction process have been collected and saved
  • ended - the extraction process has just ended
  • error - an error has occurred during the extraction process (in which case the error message can be found in the message field)

While there are many status types, most use cases of DocHeart can ignore most of them. Typically you only need to know whether the extraction has finished successfully, in which case the only two status types you have to check for are ended and error. The JavaScript snippet below illustrates this:

    // some method that one might implement to get a status object from the DocHeart API
    const status_object = await getStatusFromDocHeartAPI() 

    if (status_object.status_type === "ended") {
        // handle the case when the extraction ended successfully
    }
    else if (status_object.status_type === "error") {
        console.error(status_object.message) // print the error
        // handle the case when the pipeline ended in an error
    }
    else {
        // handle the case when the pipeline is still running
    }
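
If you want finer-grained feedback than the three-way check above, for example to drive a progress bar, the status types can be mapped to a rough progress value. The sketch below is only an illustration: it assumes the pipeline reaches the statuses in the order in which they are listed on this page, which is not explicitly guaranteed, and STAGE_ORDER and approximateProgress are hypothetical names.

```javascript
// Status types in the order in which they are listed on this page.
// Assumption: the pipeline reaches them in this order; "error" can occur at any point.
const STAGE_ORDER = [
    "not_started",
    "started",
    "doc_converted_to_png",
    "target_groups_formed",
    "extraction_started",
    "extraction_ended",
    "usage_statistics_aggregated",
    "ended"
];

// Hypothetical helper: returns a fraction between 0 and 1, or null for "error"
// and for any status type that is not in STAGE_ORDER.
function approximateProgress(statusType) {
    const index = STAGE_ORDER.indexOf(statusType);
    if (index === -1) {
        return null;
    }
    return index / (STAGE_ORDER.length - 1);
}

console.log(approximateProgress("not_started")); // 0
console.log(approximateProgress("ended"));       // 1
console.log(approximateProgress("error"));       // null
```

Returning null for error keeps a failed extraction from being displayed as partially complete.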

In the following section we will go over the endpoint that you can use to obtain the status of one or multiple extractions.

Get statuses endpoint

This endpoint returns the statuses of one or more extractions. The extractions are specified by their extraction ids in the body of the request. The response is an object that maps each provided extraction id to a status object, as described in the previous section. Example requests (using JavaScript fetch and curl), an example request body, and an example response are provided below:

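  • const response = await fetch("https://api.docheart.ai/docheart/api/extraction/status/list", {
        method: "POST",
        headers: {
            "X-Api": "<api_token>",
            // assumption: the API expects the body to be declared as JSON
            "Content-Type": "application/json"
        },
        // fetch does not serialize plain objects, so the body is passed through JSON.stringify
        body: JSON.stringify({
            // the list of ids corresponding to the extractions for which you want to obtain the statuses.
            "extraction_id_list": ["<id1>", "<id2>" /* , ... */]
        })
    })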
    
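  • # "extraction_id_list" is the list of ids corresponding to the extractions for which you want to obtain the statuses.
    # The Content-Type header is an assumption: without it, curl -d sends the body as form data.
    curl -X POST https://api.docheart.ai/docheart/api/extraction/status/list \
    -H "X-Api: <api_token>" \
    -H "Content-Type: application/json" \
    -d '{
            "extraction_id_list": ["<id1>", "<id2>", ...]
    }'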
    
  • {
        "extraction_id_list": [
            "590ee831-bad4-461d-bf3d-946fffa66ad8",
            "5f93ebd5-cb89-48ff-9a7b-d45943b1e9c9"
        ]
    }
    
  • {
        // object that maps each of the provided extraction ids to a status object
        "status_map": {
            // the status object for the first extraction id
            "590ee831-bad4-461d-bf3d-946fffa66ad8": {
                "id": "65c1e72ce50339f3b09ae8cb",
                "creation_unix_timestamp": 1707206444.0057235,
                "extraction_id": "590ee831-bad4-461d-bf3d-946fffa66ad8",
                "trigger_user_id": "7ZawPyIwfZYt3LTlmdRCJhV0vlG3",
                "status_type": "not_started", // the first extraction has not started yet
                "message": ""
            },
            // the status object for the second extraction id
            "5f93ebd5-cb89-48ff-9a7b-d45943b1e9c9": {
                "id": "65c1d57761b015cb9884b62e",
                "creation_unix_timestamp": 1707201911.2627761,
                "extraction_id": "5f93ebd5-cb89-48ff-9a7b-d45943b1e9c9",
                "trigger_user_id": "7ZawPyIwfZYt3LTlmdRCJhV0vlG3",
                "status_type": "ended", // the second extraction is done, so its results are ready to be queried
                "message": ""
            }
        },
    
        // the timestamp at which the status query was performed
        "query_unix_timestamp": 1707206444.0109003
    }
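
Because the response covers several extractions at once, a common pattern is to poll this endpoint until every extraction has reached either ended or error. The sketch below shows one possible way to do that. groupByOutcome and waitForExtractions are hypothetical helpers, the polling interval and attempt limit are arbitrary defaults, and the Content-Type header is an assumption. Only the URL, the X-Api header, extraction_id_list and status_map are taken from this page.

```javascript
const STATUS_LIST_URL = "https://api.docheart.ai/docheart/api/extraction/status/list";

// Splits a status_map into finished, failed and still-running extraction ids.
function groupByOutcome(statusMap) {
    const groups = { ended: [], error: [], pending: [] };
    for (const [extractionId, status] of Object.entries(statusMap)) {
        if (status.status_type === "ended") {
            groups.ended.push(extractionId);
        } else if (status.status_type === "error") {
            groups.error.push(extractionId);
        } else {
            groups.pending.push(extractionId);
        }
    }
    return groups;
}

// Hypothetical helper: queries the statuses repeatedly until no extraction
// is pending or maxAttempts is reached, then returns the last grouping.
// fetchFn can be replaced (e.g. in tests); it defaults to the global fetch.
async function waitForExtractions(apiToken, extractionIds, options = {}) {
    const { intervalMs = 2000, maxAttempts = 30, fetchFn = fetch } = options;
    let groups = { ended: [], error: [], pending: [...extractionIds] };

    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const response = await fetchFn(STATUS_LIST_URL, {
            method: "POST",
            // the Content-Type header is an assumption
            headers: { "X-Api": apiToken, "Content-Type": "application/json" },
            body: JSON.stringify({ extraction_id_list: extractionIds })
        });
        if (!response.ok) {
            throw new Error(`Status query failed with HTTP ${response.status}`);
        }

        const { status_map } = await response.json();
        groups = groupByOutcome(status_map);

        if (groups.pending.length === 0 || attempt === maxAttempts) {
            break;
        }
        // wait before the next query
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
    return groups;
}
```

A caller could then run `const { ended, error, pending } = await waitForExtractions("<api_token>", ["<id1>", "<id2>"])` and fetch results only for the ids in ended. Any ids still in pending after the last attempt have not finished within the chosen limits.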