docling-serve/docs/usage.md
Tiago Santana 9672f310b1 docs: Generate usage.md automatically (#340)
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-21 14:27:01 +02:00

Usage

The API provides two endpoints: one for URLs, one for files. The separate file endpoint makes it possible to send files directly in binary format instead of as base64-encoded strings.

Common parameters

In addition to the source of the file (see below), both endpoints support the same parameters.

| Field Name | Type | Description |
|---|---|---|
| from_formats | List[InputFormat] | Input format(s) to convert from. String or list of strings. Allowed values: docx, pptx, html, image, pdf, asciidoc, md, csv, xlsx, xml_uspto, xml_jats, mets_gbs, json_docling, audio, vtt. Optional, defaults to all formats. |
| to_formats | List[OutputFormat] | Output format(s) to convert to. String or list of strings. Allowed values: md, json, html, html_split_page, text, doctags. Optional, defaults to Markdown. |
| image_export_mode | ImageRefMode | Image export mode for the document (in case of JSON, Markdown or HTML). Allowed values: placeholder, embedded, referenced. Optional, defaults to embedded. |
| do_ocr | bool | If enabled, the bitmap content will be processed using OCR. Boolean. Optional, defaults to true. |
| force_ocr | bool | If enabled, replace existing text with OCR-generated text over content. Boolean. Optional, defaults to false. |
| ocr_engine | ocr_engines_enum | The OCR engine to use. String. Allowed values: auto, easyocr, ocrmac, rapidocr, tesserocr, tesseract. Optional, defaults to easyocr. |
| ocr_lang | List[str] or NoneType | List of languages used by the OCR engine. Note that each OCR engine has different values for the language names. String or list of strings. Optional, defaults to empty. |
| pdf_backend | PdfBackend | The PDF backend to use. String. Allowed values: pypdfium2, dlparse_v1, dlparse_v2, dlparse_v4. Optional, defaults to dlparse_v4. |
| table_mode | TableFormerMode | Mode to use for table structure. String. Allowed values: fast, accurate. Optional, defaults to accurate. |
| table_cell_matching | bool | If true, matches table cells predictions back to PDF cells. Can break table output if PDF cells are merged across table columns. If false, let table structure model define the text cells, ignore PDF cells. |
| pipeline | ProcessingPipeline | Choose the pipeline to process PDF or image files. |
| page_range | Tuple | Only convert a range of pages. The page number starts at 1. |
| document_timeout | float | The timeout for processing each document, in seconds. |
| abort_on_error | bool | Abort on error if enabled. Boolean. Optional, defaults to false. |
| do_table_structure | bool | If enabled, the table structure will be extracted. Boolean. Optional, defaults to true. |
| include_images | bool | If enabled, images will be extracted from the document. Boolean. Optional, defaults to true. |
| images_scale | float | Scale factor for images. Float. Optional, defaults to 2.0. |
| md_page_break_placeholder | str | Add this placeholder between pages in the markdown output. |
| do_code_enrichment | bool | If enabled, perform OCR code enrichment. Boolean. Optional, defaults to false. |
| do_formula_enrichment | bool | If enabled, perform formula OCR, return LaTeX code. Boolean. Optional, defaults to false. |
| do_picture_classification | bool | If enabled, classify pictures in documents. Boolean. Optional, defaults to false. |
| do_picture_description | bool | If enabled, describe pictures in documents. Boolean. Optional, defaults to false. |
| picture_description_area_threshold | float | Minimum percentage of the area for a picture to be processed with the models. |
| picture_description_local | PictureDescriptionLocal or NoneType | Options for running a local vision-language model in the picture description. The parameters refer to a model hosted on Hugging Face. This parameter is mutually exclusive with picture_description_api. |
| picture_description_api | PictureDescriptionApi or NoneType | API details for using a vision-language model in the picture description. This parameter is mutually exclusive with picture_description_local. |
| vlm_pipeline_model | VlmModelType or NoneType | Preset of local and API models for the vlm pipeline. This parameter is mutually exclusive with vlm_pipeline_model_local and vlm_pipeline_model_api. Use the other options for more parameters. |
| vlm_pipeline_model_local | VlmModelLocal or NoneType | Options for running a local vision-language model for the vlm pipeline. The parameters refer to a model hosted on Hugging Face. This parameter is mutually exclusive with vlm_pipeline_model_api and vlm_pipeline_model. |
| vlm_pipeline_model_api | VlmModelApi or NoneType | API details for using a vision-language model for the vlm pipeline. This parameter is mutually exclusive with vlm_pipeline_model_local and vlm_pipeline_model. |
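
Since several parameters only accept a fixed set of values, a client may want to catch typos before sending a request. The following is a minimal sketch of such a check; check_options and ALLOWED_VALUES are hypothetical names that are not part of docling-serve, and the allowed values are simply copied from the parameter list above, so they may lag behind the server.

```python
# Allowed values copied from the parameter list above (may lag behind the server).
ALLOWED_VALUES = {
    "to_formats": {"md", "json", "html", "html_split_page", "text", "doctags"},
    "image_export_mode": {"placeholder", "embedded", "referenced"},
    "ocr_engine": {"auto", "easyocr", "ocrmac", "rapidocr", "tesserocr", "tesseract"},
    "pdf_backend": {"pypdfium2", "dlparse_v1", "dlparse_v2", "dlparse_v4"},
    "table_mode": {"fast", "accurate"},
}


def check_options(options: dict) -> dict:
    """Hypothetical client-side helper: reject values the list above does not mention."""
    for name, allowed in ALLOWED_VALUES.items():
        if name not in options:
            continue  # omitted options fall back to the server defaults
        value = options[name]
        # Parameters such as to_formats accept a string or a list of strings.
        values = value if isinstance(value, list) else [value]
        unknown = sorted(set(values) - allowed)
        if unknown:
            raise ValueError(f"{name}: unsupported value(s) {unknown}")
    return options


options = check_options({"to_formats": ["md", "json"], "table_mode": "fast"})
```

The server remains the authority on what is accepted; a check like this only shortens the feedback loop on the client side.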

Authentication

When authentication is activated (see the parameter DOCLING_SERVE_API_KEY in configuration.md), all the API requests must provide the header X-Api-Key with the correct secret key.
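
As an illustration, the sketch below builds a request for the source endpoint with the X-Api-Key header attached, using only the Python standard library. The helper name build_convert_request, the base URL and the key value "my-secret" are placeholders chosen for this example.

```python
import json
import urllib.request


def build_convert_request(base_url, payload, api_key=None):
    """Hypothetical helper: prepare a POST to /v1/convert/source."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the server was started with DOCLING_SERVE_API_KEY.
        headers["X-Api-Key"] = api_key
    return urllib.request.Request(
        f"{base_url}/v1/convert/source",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_convert_request(
    "http://localhost:5001",  # placeholder address
    {"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]},
    api_key="my-secret",  # placeholder secret
)
# urllib.request.urlopen(request) would send it to a running server.
```

With httpx, the same header can be passed through the headers argument of the client or of the individual request.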

Convert endpoints

Source endpoint

The endpoint is /v1/convert/source, listening for POST requests of JSON payloads.

In addition to the parameters above, you must provide the document(s) you want to process with either the http_sources or the file_sources field. The first fetches documents from URL(s), optionally with extra headers; the second lets you provide documents as base64-encoded strings. No option is required: the options can be partially or completely omitted.

Simple payload example:

{
  "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
}
Complete payload example:
{
  "options": {
    "from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
    "to_formats": ["md", "json", "html", "text", "doctags"],
    "image_export_mode": "placeholder",
    "do_ocr": true,
    "force_ocr": false,
    "ocr_engine": "easyocr",
    "ocr_lang": ["en"],
    "pdf_backend": "dlparse_v2",
    "table_mode": "fast",
    "abort_on_error": false
  },
  "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
}
CURL example:
curl -X 'POST' \
  'http://localhost:5001/v1/convert/source' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "options": {
    "from_formats": [
      "docx",
      "pptx",
      "html",
      "image",
      "pdf",
      "asciidoc",
      "md",
      "xlsx"
    ],
    "to_formats": ["md", "json", "html", "text", "doctags"],
    "image_export_mode": "placeholder",
    "do_ocr": true,
    "force_ocr": false,
    "ocr_engine": "easyocr",
    "ocr_lang": [
      "fr",
      "de",
      "es",
      "en"
    ],
    "pdf_backend": "dlparse_v2",
    "table_mode": "fast",
    "abort_on_error": false,
    "do_table_structure": true,
    "include_images": true,
    "images_scale": 2
  },
  "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
}'
Python example:
import httpx

async_client = httpx.AsyncClient(timeout=60.0)
url = "http://localhost:5001/v1/convert/source"
payload = {
  "options": {
    "from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
    "to_formats": ["md", "json", "html", "text", "doctags"],
    "image_export_mode": "placeholder",
    "do_ocr": True,
    "force_ocr": False,
    "ocr_engine": "easyocr",
    "ocr_lang": "en",
    "pdf_backend": "dlparse_v2",
    "table_mode": "fast",
    "abort_on_error": False,
  },
  "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
}

response = await async_client.post(url, json=payload)

data = response.json()

File as base64

The file_sources argument of the endpoint lets you send files as base64-encoded strings. When your PDF or other file is too large, encoding it and passing it inline to curl can lead to an “Argument list too long” error on some systems. To avoid this, write the JSON request body to a file and have curl read from that file.

CURL steps:
# 1. Base64-encode the file
B64_DATA=$(base64 -w 0 /path/to/file/pdf-to-convert.pdf)

# 2. Build the JSON with your options
cat <<EOF > /tmp/request_body.json
{
  "options": {
  },
  "file_sources": [{
    "base64_string": "${B64_DATA}",
    "filename": "pdf-to-convert.pdf"
  }]
}
EOF

# 3. POST the request to the docling service
curl -X POST "localhost:5001/v1/convert/source" \
     -H "Content-Type: application/json" \
     -d @/tmp/request_body.json
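
The same request body can be assembled in Python, which avoids the shell argument limit altogether. This is a minimal sketch: build_file_source_payload is a hypothetical helper, and the temporary file only stands in for a real PDF.

```python
import base64
import json
import tempfile
from pathlib import Path


def build_file_source_payload(path, options=None):
    """Hypothetical helper: wrap one file as a base64 entry in file_sources."""
    file_path = Path(path)
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return {
        "options": options or {},
        "file_sources": [{"base64_string": encoded, "filename": file_path.name}],
    }


# Demo with a throwaway file standing in for a real PDF.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "pdf-to-convert.pdf"
    sample.write_bytes(b"%PDF-1.4 sample bytes")
    payload = build_file_source_payload(sample)

body = json.dumps(payload)  # request body for POST /v1/convert/source
```

The resulting payload can be sent as JSON to /v1/convert/source, for example with the json argument of an httpx request.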

File endpoint

The endpoint is /v1/convert/file, listening for POST requests with form payloads (necessary because the files are sent as multipart/form-data). You can send one or multiple files.

CURL example:
curl -X 'POST' \
  'http://127.0.0.1:5001/v1/convert/file' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'ocr_engine=easyocr' \
  -F 'pdf_backend=dlparse_v2' \
  -F 'from_formats=pdf' \
  -F 'from_formats=docx' \
  -F 'force_ocr=false' \
  -F 'image_export_mode=embedded' \
  -F 'ocr_lang=en' \
  -F 'ocr_lang=pl' \
  -F 'table_mode=fast' \
  -F 'files=@2206.01062v1.pdf;type=application/pdf' \
  -F 'abort_on_error=false' \
  -F 'to_formats=md' \
  -F 'to_formats=text' \
  -F 'do_ocr=true'
Python example:
import os

import httpx

async_client = httpx.AsyncClient(timeout=60.0)
url = "http://localhost:5001/v1/convert/file"
parameters = {
    "from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
    "to_formats": ["md", "json", "html", "text", "doctags"],
    "image_export_mode": "placeholder",
    "do_ocr": True,
    "force_ocr": False,
    "ocr_engine": "easyocr",
    "ocr_lang": ["en"],
    "pdf_backend": "dlparse_v2",
    "table_mode": "fast",
    "abort_on_error": False,
}

current_dir = os.path.dirname(__file__)
file_path = os.path.join(current_dir, '2206.01062v1.pdf')

files = {
    'files': ('2206.01062v1.pdf', open(file_path, 'rb'), 'application/pdf'),
}

response = await async_client.post(url, files=files, data=parameters)
assert response.status_code == 200, "Response should be 200 OK"

data = response.json()
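
The curl example repeats -F once per value of a list parameter and writes booleans as true or false. The sketch below shows that flattening explicitly; to_form_fields is a hypothetical helper, not part of docling-serve, and HTTP clients such as httpx may already perform an equivalent conversion when given a dictionary.

```python
def to_form_fields(parameters: dict) -> list[tuple[str, str]]:
    """Hypothetical helper: flatten parameters into repeated form fields."""
    fields = []
    for name, value in parameters.items():
        # A list parameter becomes one field per value, like repeated -F flags.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            if isinstance(item, bool):
                item = "true" if item else "false"
            fields.append((name, str(item)))
    return fields


fields = to_form_fields(
    {"to_formats": ["md", "text"], "do_ocr": True, "ocr_engine": "easyocr"}
)
# fields == [("to_formats", "md"), ("to_formats", "text"),
#            ("do_ocr", "true"), ("ocr_engine", "easyocr")]
```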

Picture description options

When the picture description enrichment is activated, users may specify which model and which execution mode to use for this task. There are two choices for the execution mode: local will run the vision-language model directly, api will invoke an external API endpoint.

The local option is specified with:

{
  "picture_description_local": {
    "repo_id": "",  // Repository id from the Hugging Face Hub.
    "generation_config": {"max_new_tokens": 200, "do_sample": false},  // HF generation config.
    "prompt": "Describe this image in a few sentences. ",  // Prompt used when calling the vision-language model.
  }
}

The possible values for generation_config are documented in the Hugging Face text generation docs.

The api option is specified with:

{
  "picture_description_api": {
    "url": "",  // Endpoint which accepts openai-api compatible requests.
    "headers": {},  // Headers used for calling the API endpoint. For example, it could include authentication headers.
    "params": {},  // Model parameters.
    "timeout": 20,  // Timeout for the API request.
    "prompt": "Describe this image in a few sentences. ",  // Prompt used when calling the vision-language model.
  }
}

Example URLs are:

  • http://localhost:8000/v1/chat/completions for the local vllm api, with example picture_description_api:

    • the HuggingFaceTB/SmolVLM-256M-Instruct model

      {
        "url": "http://localhost:8000/v1/chat/completions",
        "params": {
          "model": "HuggingFaceTB/SmolVLM-256M-Instruct",
          "max_completion_tokens": 200
        }
      }
      
    • the ibm-granite/granite-vision-3.2-2b model

      {
        "url": "http://localhost:8000/v1/chat/completions",
        "params": {
          "model": "ibm-granite/granite-vision-3.2-2b",
          "max_completion_tokens": 200
        }
      }
      
  • http://localhost:11434/v1/chat/completions for the local Ollama api, with example picture_description_api:

    • the granite3.2-vision:2b model

      {
        "url": "http://localhost:11434/v1/chat/completions",
        "params": {
          "model": "granite3.2-vision:2b"
        }
      }
      

Note that when using picture_description_api, the server must be launched with DOCLING_SERVE_ENABLE_REMOTE_SERVICES=true.
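
To show how these options fit into a full request, the sketch below assembles a payload for the source endpoint with picture description enabled through the Ollama example above. The helper picture_description_options is hypothetical; it assumes the nested objects are passed inside options like any other parameter and only enforces the mutual exclusivity described in the parameter list.

```python
import json


def picture_description_options(local=None, api=None):
    """Hypothetical helper: enable picture description with one execution mode."""
    if local is not None and api is not None:
        raise ValueError(
            "picture_description_local and picture_description_api are mutually exclusive"
        )
    options = {"do_picture_description": True}
    if local is not None:
        options["picture_description_local"] = local
    if api is not None:
        # Requires DOCLING_SERVE_ENABLE_REMOTE_SERVICES=true on the server.
        options["picture_description_api"] = api
    return options


payload = {
    "options": picture_description_options(
        api={
            "url": "http://localhost:11434/v1/chat/completions",
            "params": {"model": "granite3.2-vision:2b"},
        }
    ),
    "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}],
}
print(json.dumps(payload, indent=2))
```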

Response format

The response can be a JSON Document or a File.

  • If you process only one file, the response will be a JSON document with the following format:

    {
      "document": {
        "md_content": "",
        "json_content": {},
        "html_content": "",
        "text_content": "",
        "doctags_content": ""
      },
      "status": "<success|partial_success|skipped|failure>",
      "processing_time": 0.0,
      "timings": {},
      "errors": []
    }
    

    Depending on the value you set in to_formats, the corresponding items will be populated with their results; the others are left empty.

    processing_time is the Docling processing time in seconds, and timings (when enabled in the backend) provides the detailed timing of all the internal Docling components.

  • If you set the parameter target to the zip mode, the response will be a zip file.

  • If multiple files are generated (multiple inputs, or one input but multiple outputs with the zip target mode), the response will be a zip file.
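
Because the response may be either a JSON document or a zip file, client code has to branch on what it received. The sketch below is one way to do so; it assumes the JSON case is announced with an application/json content type and treats anything else as a zip archive, which is not stated in this document. summarize_response and the file name example.md are hypothetical.

```python
import io
import json
import zipfile


def summarize_response(content_type: str, body: bytes) -> dict:
    """Hypothetical helper: describe a conversion response without saving it."""
    if content_type.startswith("application/json"):
        result = json.loads(body)
        # Only the formats requested in to_formats are expected to be non-empty.
        populated = [
            key
            for key, value in result["document"].items()
            if key.endswith("_content") and value
        ]
        return {"kind": "json", "status": result["status"], "populated": populated}
    # Assumption: any other response is a zip archive.
    with zipfile.ZipFile(io.BytesIO(body)) as archive:
        return {"kind": "zip", "files": archive.namelist()}


# Demo with hand-made bodies instead of a live server.
json_body = json.dumps({
    "document": {"md_content": "# Title", "json_content": {}, "html_content": "",
                 "text_content": "", "doctags_content": ""},
    "status": "success", "processing_time": 0.0, "timings": {}, "errors": [],
}).encode("utf-8")

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.md", "# Title")
zip_body = buffer.getvalue()

json_summary = summarize_response("application/json", json_body)
zip_summary = summarize_response("application/zip", zip_body)
```

With httpx, the two arguments would come from response.headers["content-type"] and response.content.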

Asynchronous API

Both the /v1/convert/source and /v1/convert/file endpoints are available as asynchronous variants. The advantage of the asynchronous endpoints is the possibility to interrupt the connection, check for progress updates, and fetch the result later. This approach is more resilient against network instabilities and allows the client application logic to easily interleave conversion with other tasks.

Launch an asynchronous conversion with:

  • POST /v1/convert/source/async when providing the input as sources.
  • POST /v1/convert/file/async when providing the input as multipart-form files.

The response format is a task detail:

{
  "task_id": "<task_id>",  // the task_id which can be used for the next operations
  "task_status": "pending|started|success|failure",  // the task status
  "task_position": 1,  // the position in the queue
  "task_meta": null,  // metadata e.g. how many documents are in the total job and how many have been converted
}

Polling status

To check the progress of the conversion task and wait for its completion, use the endpoint:

  • GET /v1/status/poll/{task_id}
Example waiting loop:
import time
import httpx

# base_url is assumed to include the version prefix, e.g. "http://localhost:5001/v1"
# ...
# response from the async task submission
task = response.json()

while task["task_status"] not in ("success", "failure"):
    time.sleep(5)  # wait before polling again
    response = httpx.get(f"{base_url}/status/poll/{task['task_id']}")
    task = response.json()
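
A loop like this waits forever if the task never reaches a final state. The sketch below adds an overall timeout and keeps the HTTP call separate so that the logic can be tried without a server. wait_for_task, its default interval and timeout, and the canned responses are all hypothetical.

```python
import time


def wait_for_task(fetch_task, task, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Hypothetical helper: poll until the task succeeds, fails, or times out."""
    deadline = time.monotonic() + timeout
    while task["task_status"] not in ("success", "failure"):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task['task_id']} still {task['task_status']}")
        sleep(interval)
        # fetch_task is expected to call GET /v1/status/poll/{task_id}.
        task = fetch_task(task["task_id"])
    return task


# Demo with canned responses instead of a live server.
canned = iter([
    {"task_id": "abc", "task_status": "started"},
    {"task_id": "abc", "task_status": "success"},
])
final = wait_for_task(
    lambda task_id: next(canned),
    {"task_id": "abc", "task_status": "pending"},
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
```

In real use, fetch_task could wrap the httpx.get call from the loop above and return response.json().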

Subscribe with websockets

Using a websocket, the client application can be notified about updates of the conversion task. To start the websocket connection, use the endpoint:

  • /v1/status/ws/{task_id}

Websocket messages are JSON objects with the following structure:

{
  "message": "connection|update|error",  // type of message being sent
  "task": {},  // the same content of the task description
  "error": "",  // description of the error
}
Example websocket usage:
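import json

from websockets.sync.client import connect

# host and port of the docling-serve instance, without scheme (adjust as needed)
host = "localhost:5001"
uri = f"ws://{host}/v1/status/ws/{task['task_id']}"
with connect(uri) as websocket:
    for message in websocket:
        try:
            payload = json.loads(message)
            if payload["message"] == "error":
                break
            if payload["message"] == "update" and payload["task"]["task_status"] in ("success", "failure"):
                break
        except (json.JSONDecodeError, KeyError):
            # stop listening on malformed messages
            break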

Fetch results

When the task is completed, the result can be fetched with the endpoint:

  • GET /v1/result/{task_id}
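
A minimal sketch of this last step follows. fetch_result is a hypothetical helper that refuses to fetch until the task has reached a final state; the HTTP call is passed in as get_json so the example runs without a server, and the returned dictionary is a placeholder rather than the actual result format.

```python
def fetch_result(get_json, base_url, task):
    """Hypothetical helper: fetch the result of a completed task."""
    if task["task_status"] not in ("success", "failure"):
        raise RuntimeError(f"task {task['task_id']} is not completed yet")
    return get_json(f"{base_url}/v1/result/{task['task_id']}")


requested = []


def fake_get_json(url):
    # Stand-in for a real HTTP GET; records the URL it was asked for.
    requested.append(url)
    return {"status": "success", "errors": []}  # placeholder body


result = fetch_result(
    fake_get_json,
    "http://localhost:5001",  # placeholder address
    {"task_id": "abc", "task_status": "success"},
)
```

With httpx, get_json could be as simple as lambda url: httpx.get(url).json() when the result is a JSON document.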