docs: simplify README and move details to docs (#102)

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
This commit is contained in:
Michele Dolfi
2025-03-17 13:40:12 +01:00
committed by GitHub
parent 422c402bab
commit fd8e40a008
8 changed files with 439 additions and 396 deletions

View File

@@ -3,7 +3,7 @@ config:
no-emphasis-as-header: false
first-line-heading: false
MD033:
allowed_elements: ["details", "summary"]
allowed_elements: ["details", "summary", "br"]
MD024:
siblings_only: true
globs:

View File

@@ -66,7 +66,7 @@ action-lint: .action-lint ## Lint GitHub Action workflows
md-lint: .md-lint ## Lint markdown files
.md-lint: $(wildcard */**/*.md) | md-lint-file
$(ECHO_PREFIX) printf " %-12s ./...\n" "[MD LINT]"
$(CMD_PREFIX) docker run --rm -v $$(pwd):/workdir davidanson/markdownlint-cli2:v0.14.0 "**/*.md"
$(CMD_PREFIX) docker run --rm -v $$(pwd):/workdir davidanson/markdownlint-cli2:v0.16.0 "**/*.md" "#.venv"
$(CMD_PREFIX) touch $@
.PHONY: py-Lint

435
README.md
View File

@@ -1,423 +1,70 @@
# Docling Serve
Running [Docling](https://github.com/docling-project/docling) as an API service.
Running [Docling](https://github.com/docling-project/docling) as an API service.
## Usage
## Getting started
The API provides two endpoints: one for urls, one for files. This is necessary to send files directly in binary format instead of base64-encoded strings.
Install the `docling-serve` package and run the server.
### Common parameters
```bash
# Using the python package
pip install "docling-serve"
docling-serve run
On top of the source of file (see below), both endpoints support the same parameters, which are almost the same as the Docling CLI.
- `from_format` (List[str]): Input format(s) to convert from. Allowed values: `docx`, `pptx`, `html`, `image`, `pdf`, `asciidoc`, `md`. Defaults to all formats.
- `to_formats` (List[str]): Output format(s) to convert to. Allowed values: `md`, `json`, `html`, `text`, `doctags`. Defaults to `md`.
- `do_ocr` (bool): If enabled, the bitmap content will be processed using OCR. Defaults to `True`.
- `image_export_mode`: Image export mode for the document (only in case of JSON, Markdown or HTML). Allowed values: embedded, placeholder, referenced. Optional, defaults to `embedded`.
- `force_ocr` (bool): If enabled, replace any existing text with OCR-generated text over the full content. Defaults to `False`.
- `ocr_engine` (str): OCR engine to use. Allowed values: `easyocr`, `tesseract_cli`, `tesseract`, `rapidocr`, `ocrmac`. Defaults to `easyocr`.
- `ocr_lang` (List[str]): List of languages used by the OCR engine. Note that each OCR engine has different values for the language names. Defaults to empty.
- `pdf_backend` (str): PDF backend to use. Allowed values: `pypdfium2`, `dlparse_v1`, `dlparse_v2`. Defaults to `dlparse_v2`.
- `table_mode` (str): Table mode to use. Allowed values: `fast`, `accurate`. Defaults to `fast`.
- `abort_on_error` (bool): If enabled, abort on error. Defaults to false.
- `return_as_file` (boo): If enabled, return the output as a file. Defaults to false.
- `do_table_structure` (bool): If enabled, the table structure will be extracted. Defaults to true.
- `include_images` (bool): If enabled, images will be extracted from the document. Defaults to true.
- `images_scale` (float): Scale factor for images. Defaults to 2.0.
### URL endpoint
The endpoint is `/v1alpha/convert/source`, listening for POST requests of JSON payloads.
On top of the above parameters, you must send the URL(s) of the document you want process with either the `http_sources` or `file_sources` fields.
The first is fetching URL(s) (optionally using with extra headers), the second allows to provide documents as base64-encoded strings.
No `options` is required, they can be partially or completely omitted.
Simple payload example:
```json
{
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
}
# Using container images, e.g. with Podman
podman run -p 5001:5001 quay.io/docling-project/docling-serve
```
<details>
The server is available at
<summary>Complete payload example:</summary>
- API <http://127.0.0.1:5001>
- API documentation <http://127.0.0.1:5001/docs>
![swagger.png](img/swagger.png)
```json
{
"options": {
"from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
"to_formats": ["md", "json", "html", "text", "doctags"],
"image_export_mode": "placeholder",
"do_ocr": true,
"force_ocr": false,
"ocr_engine": "easyocr",
"ocr_lang": ["en"],
"pdf_backend": "dlparse_v2",
"table_mode": "fast",
"abort_on_error": false,
"return_as_file": false,
},
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
}
```
Try it out with a simple conversion:
</details>
<details>
<summary>CURL example:</summary>
```sh
```bash
curl -X 'POST' \
'http://localhost:5001/v1alpha/convert/source' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"options": {
"from_formats": [
"docx",
"pptx",
"html",
"image",
"pdf",
"asciidoc",
"md",
"xlsx"
],
"to_formats": ["md", "json", "html", "text", "doctags"],
"image_export_mode": "placeholder",
"do_ocr": true,
"force_ocr": false,
"ocr_engine": "easyocr",
"ocr_lang": [
"fr",
"de",
"es",
"en"
],
"pdf_backend": "dlparse_v2",
"table_mode": "fast",
"abort_on_error": false,
"return_as_file": false,
"do_table_structure": true,
"include_images": true,
"images_scale": 2
},
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
}'
"http_sources": [{"url": "https://arxiv.org/pdf/2501.17887"}]
}'
```
</details>
### Container images
<details>
<summary>Python example:</summary>
Available container images:
```python
import httpx
| Name | Description | Arch | Size |
| -----|-------------|------|------|
| [`ghcr.io/docling-project/docling-serve`](https://github.com/docling-project/docling-serve/pkgs/container/docling-serve) <br /> [`quay.io/docling-project/docling-serve`](https://quay.io/repository/docling-project/docling-serve) | Simple image for Docling Serve, installing all packages from the official pypi.org index. | `linux/amd64`, `linux/arm64` | 3.6 GB |
| [`ghcr.io/docling-project/docling-serve-cpu`](https://github.com/docling-project/docling-serve/pkgs/container/docling-serve-cpu) <br /> [`quay.io/docling-project/docling-serve-cpu`](https://quay.io/repository/docling-project/docling-serve-cpu) | Cpu-only image which installs `torch` from the pytorch cpu index. | `linux/amd64`, `linux/arm64` | 3.6 GB |
| [`ghcr.io/docling-project/docling-serve-cu124`](https://github.com/docling-project/docling-serve/pkgs/container/docling-serve-cu124) <br /> [`quay.io/docling-project/docling-serve-cu124`](https://quay.io/repository/docling-project/docling-serve-cu124) | Cuda 12.4 image which installs `torch` from the pytorch cu124 index. | `linux/amd64` | 8.7 GB |
async_client = httpx.AsyncClient(timeout=60.0)
url = "http://localhost:5001/v1alpha/convert/source"
payload = {
"options": {
"from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
"to_formats": ["md", "json", "html", "text", "doctags"],
"image_export_mode": "placeholder",
"do_ocr": True,
"force_ocr": False,
"ocr_engine": "easyocr",
"ocr_lang": "en",
"pdf_backend": "dlparse_v2",
"table_mode": "fast",
"abort_on_error": False,
"return_as_file": False,
},
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
}
Coming soon: `docling-serve-slim` images will reduce the size by skipping the model weights download.
response = await async_client_client.post(url, json=payload)
data = response.json()
```
</details>
#### File as base64
The `file_sources` argument in the endpoint allows to send files as base64-encoded strings.
When your PDF or other file type is too large, encoding it and passing it inline to curl
can lead to an “Argument list too long” error on some systems. To avoid this, we write
the JSON request body to a file and have curl read from that file.
<details>
<summary>CURL steps:</summary>
```sh
# 1. Base64-encode the file
B64_DATA=$(base64 -w 0 /path/to/file/pdf-to-convert.pdf)
# 2. Build the JSON with your options
cat <<EOF > /tmp/request_body.json
{
"options": {
},
"file_sources": [{
"base64_string": "${B64_DATA}",
"filename": "pdf-to-convert.pdf"
}]
}
EOF
# 3. POST the request to the docling service
curl -X POST "localhost:5001/v1alpha/convert/source" \
-H "Content-Type: application/json" \
-d @/tmp/request_body.json
```
</details>
### File endpoint
The endpoint is: `/v1alpha/convert/file`, listening for POST requests of Form payloads (necessary as the files are sent as multipart/form data). You can send one or multiple files.
<details>
<summary>CURL example:</summary>
```sh
curl -X 'POST' \
'http://127.0.0.1:5001/v1alpha/convert/file' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'ocr_engine=easyocr' \
-F 'pdf_backend=dlparse_v2' \
-F 'from_formats=pdf' \
-F 'from_formats=docx' \
-F 'force_ocr=false' \
-F 'image_export_mode=embedded' \
-F 'ocr_lang=en' \
-F 'ocr_lang=pl' \
-F 'table_mode=fast' \
-F 'files=@2206.01062v1.pdf;type=application/pdf' \
-F 'abort_on_error=false' \
-F 'to_formats=md' \
-F 'to_formats=text' \
-F 'return_as_file=false' \
-F 'do_ocr=true'
```
</details>
<details>
<summary>Python example:</summary>
```python
import httpx
async_client = httpx.AsyncClient(timeout=60.0)
url = "http://localhost:5001/v1alpha/convert/file"
parameters = {
"from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
"to_formats": ["md", "json", "html", "text", "doctags"],
"image_export_mode": "placeholder",
"do_ocr": True,
"force_ocr": False,
"ocr_engine": "easyocr",
"ocr_lang": ["en"],
"pdf_backend": "dlparse_v2",
"table_mode": "fast",
"abort_on_error": False,
"return_as_file": False
}
current_dir = os.path.dirname(__file__)
file_path = os.path.join(current_dir, '2206.01062v1.pdf')
files = {
'files': ('2206.01062v1.pdf', open(file_path, 'rb'), 'application/pdf'),
}
response = await async_client.post(url, files=files, data={"parameters": json.dumps(parameters)})
assert response.status_code == 200, "Response should be 200 OK"
data = response.json()
```
</details>
### Response format
The response can be a JSON Document or a File.
- If you process only one file, the response will be a JSON document with the following format:
```jsonc
{
"document": {
"md_content": "",
"json_content": {},
"html_content": "",
"text_content": "",
"doctags_content": ""
},
"status": "<success|partial_success|skipped|failure>",
"processing_time": 0.0,
"timings": {},
"errors": []
}
```
Depending on the value you set in `output_formats`, the different items will be populated with their respective results or empty.
`processing_time` is the Docling processing time in seconds, and `timings` (when enabled in the backend) provides the detailed
timing of all the internal Docling components.
- If you set the parameter `return_as_file` to True, the response will be a zip file.
- If multiple files are generated (multiple inputs, or one input but multiple outputs with `return_as_file` True), the response will be a zip file.
## Run docling-serve
Clone the repository and run the following from within the cloned directory root.
### Demonstration UI
```bash
python -m venv venv
source venv/bin/activate
# Install the Python package with the extra dependencies
pip install "docling-serve[ui]"
docling-serve run --enable-ui
# Run the container image with the extra env parameters
podman run -p 5001:5001 -e DOCLING_SERVE_ENABLE_UI=true quay.io/docling-project/docling-serve
```
## Helpers
- A full Swagger UI is available at the `/docs` endpoint.
![swagger.png](img/swagger.png)
- An easy to use UI is available at the `/ui` endpoint.
An easy to use UI is available at the `/ui` endpoint.
![ui-input.png](img/ui-input.png)
![ui-output.png](img/ui-output.png)
## Development
## Documentation and advance usages
### CPU only
```sh
# Install uv if not already available
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv sync --extra cpu
```
### Cuda GPU
For GPU support use the following command:
```sh
# Install dependencies
uv sync
```
### Gradio UI and different OCR backends
`/ui` endpoint using `gradio` and different OCR backends can be enabled via package extras:
```sh
# Enable ui and rapidocr
uv sync --extra ui --extra rapidocr
```
```sh
# Enable tesserocr
uv sync --extra tesserocr
```
See `[project.optional-dependencies]` section in `pyproject.toml` for full list of options and runtime options with `uv run docling-serve --help`.
### Run the server
The `docling-serve` executable is a convenient script for launching the webserver both in
development and production mode.
```sh
# Run the server in development mode
# - reload is enabled by default
# - listening on the 127.0.0.1 address
# - ui is enabled by default
docling-serve dev
# Run the server in production mode
# - reload is disabled by default
# - listening on the 0.0.0.0 address
# - ui is disabled by default
docling-serve run
```
### Options
The `docling-serve` executable allows is controlled with both command line
options and environment variables.
<details>
<summary>`docling-serve` help message</summary>
```sh
$ docling-serve dev --help
Usage: docling-serve dev [OPTIONS]
Run a Docling Serve app in development mode. 🧪
This is equivalent to docling-serve run but with reload
enabled and listening on the 127.0.0.1 address.
Options can be set also with the corresponding ENV variable, with the exception
of --enable-ui, --host and --reload.
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --host TEXT The host to serve on. For local development in localhost │
│ use 127.0.0.1. To enable public access, e.g. in a │
│ container, use all the IP addresses available with │
│ 0.0.0.0. │
│ [default: 127.0.0.1] │
│ --port INTEGER The port to serve on. [default: 5001] │
│ --reload --no-reload Enable auto-reload of the server when (code) files │
│ change. This is resource intensive, use it only during │
│ development. │
│ [default: reload] │
│ --root-path TEXT The root path is used to tell your app that it is being │
│ served to the outside world with some path prefix set up │
│ in some termination proxy or similar. │
│ --proxy-headers --no-proxy-headers Enable/Disable X-Forwarded-Proto, X-Forwarded-For, │
│ X-Forwarded-Port to populate remote address info. │
│ [default: proxy-headers] │
│ --artifacts-path PATH If set to a valid directory, the model weights will be │
│ loaded from this path. │
│ [default: None] │
│ --enable-ui --no-enable-ui Enable the development UI. [default: enable-ui] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
</details>
#### Environment variables
The environment variables controlling the `uvicorn` execution can be specified with the `UVICORN_` prefix:
- `UVICORN_WORKERS`: Number of workers to use.
- `UVICORN_RELOAD`: If `True`, this will enable auto-reload when you modify files, useful for development.
The environment variables controlling specifics of the Docling Serve app can be specified with the
`DOCLING_SERVE_` prefix:
- `DOCLING_SERVE_ARTIFACTS_PATH`: if set Docling will use only the local weights of models, for example `/opt/app-root/src/.cache/docling/models`.
- `DOCLING_SERVE_ENABLE_UI`: If `True`, The Gradio UI will be available at `/ui`.
Others:
- `TESSDATA_PREFIX`: Tesseract data location, example `/usr/share/tesseract/tessdata/`.
Visit the [Docling Serve documentation](./docs/README.md) for learning how to [configure the webserver](./docs/configuration.md), use all the [runtime options](./docs/usage.md) of the API and [deployment examples](./docs/deployment.md).
## Get help and support
@@ -433,14 +80,14 @@ If you use Docling in your projects, please consider citing the following:
```bib
@techreport{Docling,
author = {Deep Search Team},
month = {8},
title = {Docling Technical Report},
url = {https://arxiv.org/abs/2408.09869},
eprint = {2408.09869},
doi = {10.48550/arXiv.2408.09869},
version = {1.0.0},
year = {2024}
author = {Docling Contributors},
month = {1},
title = {Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion},
url = {https://arxiv.org/abs/2501.17887},
eprint = {2501.17887},
doi = {10.48550/arXiv.2501.17887},
version = {2.0.0},
year = {2025}
}
```

8
docs/README.md Normal file
View File

@@ -0,0 +1,8 @@
# Dolcing Serve documentation
This documentation pages explore the webserver configurations, runtime options, deployment examples as well as development best practices.
- [Configuration](./configuration.md)
- [Advance usage](./usage.md)
- [Deployment](./deployment.md)
- [Development](./development.md)

40
docs/configuration.md Normal file
View File

@@ -0,0 +1,40 @@
# Configuration
The `docling-serve` executable allows to configure the server via command line
options as well as environment variables.
Configurations are divided between the settings used for the `uvicorn` asgi
server and the actual app-specific configurations.
> [!WARNING]
> When the server is running with `reload` or with multiple `workers`, uvicorn
> will spawn multiple subprocessed. This invalides all the values configured
> via the CLI command line options. Please use environment variables in this
> type of deployments.
## Webserver configuration
The following table shows the options which are propagated directly to the
`uvicorn` webserver runtime.
| CLI option | ENV | Default | Description |
| -----------|-----|---------|-------------|
| `--host` | `UVICORN_HOST` | `0.0.0.0` for `run`, `localhost` for `dev` | THe host to serve on. |
| `--port` | `UVICORN_PORT` | `5001` | The port to serve on. |
| `--reload` | `UVICORN_RELOAD` | `false` for `run`, `true` for `dev` | Enable auto-reload of the server when (code) files change. |
| `--workers` | `UVICORN_WORKERS` | `1` | Use multiple worker processes. |
| `--root-path` | `UVICORN_ROOT_PATH` | `""` | The root path is used to tell your app that it is being served to the outside world with some |
| `--proxy-headers` | `UVICORN_PROXY_HEADERS` | `true` | Enable/Disable X-Forwarded-Proto, X-Forwarded-For, X-Forwarded-Port to populate remote address info. |
| `--timeout-keep-alive` | `UVICORN_TIMEOUT_KEEP_ALIVE` | `60` | Timeout for the server response. |
## Docling Serve configuration
THe following table describes the options to configure the Docling Serve app.
| CLI option | ENV | Default | Description |
| -----------|-----|---------|-------------|
| `--artifacts-path` | `DOCLING_SERVE_ARTIFACTS_PATH` | unset | If set to a valid directory, the model weights will be loaded from this path |
| `--enable-ui` | `DOCLING_SERVE_ENABLE_UI` | `false` | Enable the demonstrator UI. |
| | `DOCLING_SERVE_OPTIONS_CACHE_SIZE` | `2` | How many DocumentConveter objects (including their loaded models) to keep in the cache. |
| | `DOCLING_SERVE_CORS_ORIGINS` | `["*"]` | A list of origins that should be permitted to make cross-origin requests. |
| | `DOCLING_SERVE_CORS_METHODS` | `["*"]` | A list of HTTP methods that should be allowed for cross-origin requests. |
| | `DOCLING_SERVE_CORS_HEADERS` | `["*"]` | A list of HTTP request headers that should be supported for cross-origin requests. |

12
docs/deployment.md Normal file
View File

@@ -0,0 +1,12 @@
# Deployment
## Kubernetes and OpenShift
### Knative
The following manifest will launch Docling Serve using Knative to expose the application
with an external ingress endpoint.
```yaml
# TODO
```

57
docs/development.md Normal file
View File

@@ -0,0 +1,57 @@
# Development
## Install dependencies
### CPU only
```sh
# Install uv if not already available
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv sync --extra cpu
```
### Cuda GPU
For GPU support use the following command:
```sh
# Install dependencies
uv sync
```
### Gradio UI and different OCR backends
`/ui` endpoint using `gradio` and different OCR backends can be enabled via package extras:
```sh
# Enable ui and rapidocr
uv sync --extra ui --extra rapidocr
```
```sh
# Enable tesserocr
uv sync --extra tesserocr
```
See `[project.optional-dependencies]` section in `pyproject.toml` for full list of options and runtime options with `uv run docling-serve --help`.
### Run the server
The `docling-serve` executable is a convenient script for launching the webserver both in
development and production mode.
```sh
# Run the server in development mode
# - reload is enabled by default
# - listening on the 127.0.0.1 address
# - ui is enabled by default
docling-serve dev
# Run the server in production mode
# - reload is disabled by default
# - listening on the 0.0.0.0 address
# - ui is disabled by default
docling-serve run
```

279
docs/usage.md Normal file
View File

@@ -0,0 +1,279 @@
# Usage
The API provides two endpoints: one for urls, one for files. This is necessary to send files directly in binary format instead of base64-encoded strings.
## Common parameters
On top of the source of file (see below), both endpoints support the same parameters, which are almost the same as the Docling CLI.
- `from_format` (List[str]): Input format(s) to convert from. Allowed values: `docx`, `pptx`, `html`, `image`, `pdf`, `asciidoc`, `md`. Defaults to all formats.
- `to_formats` (List[str]): Output format(s) to convert to. Allowed values: `md`, `json`, `html`, `text`, `doctags`. Defaults to `md`.
- `do_ocr` (bool): If enabled, the bitmap content will be processed using OCR. Defaults to `True`.
- `image_export_mode`: Image export mode for the document (only in case of JSON, Markdown or HTML). Allowed values: embedded, placeholder, referenced. Optional, defaults to `embedded`.
- `force_ocr` (bool): If enabled, replace any existing text with OCR-generated text over the full content. Defaults to `False`.
- `ocr_engine` (str): OCR engine to use. Allowed values: `easyocr`, `tesseract_cli`, `tesseract`, `rapidocr`, `ocrmac`. Defaults to `easyocr`.
- `ocr_lang` (List[str]): List of languages used by the OCR engine. Note that each OCR engine has different values for the language names. Defaults to empty.
- `pdf_backend` (str): PDF backend to use. Allowed values: `pypdfium2`, `dlparse_v1`, `dlparse_v2`. Defaults to `dlparse_v2`.
- `table_mode` (str): Table mode to use. Allowed values: `fast`, `accurate`. Defaults to `fast`.
- `abort_on_error` (bool): If enabled, abort on error. Defaults to false.
- `return_as_file` (boo): If enabled, return the output as a file. Defaults to false.
- `do_table_structure` (bool): If enabled, the table structure will be extracted. Defaults to true.
- `include_images` (bool): If enabled, images will be extracted from the document. Defaults to true.
- `images_scale` (float): Scale factor for images. Defaults to 2.0.
## Convert endpoints
### Source endpoint
The endpoint is `/v1alpha/convert/source`, listening for POST requests of JSON payloads.
On top of the above parameters, you must send the URL(s) of the document you want process with either the `http_sources` or `file_sources` fields.
The first is fetching URL(s) (optionally using with extra headers), the second allows to provide documents as base64-encoded strings.
No `options` is required, they can be partially or completely omitted.
Simple payload example:
```json
{
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
}
```
<details>
<summary>Complete payload example:</summary>
```json
{
"options": {
"from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
"to_formats": ["md", "json", "html", "text", "doctags"],
"image_export_mode": "placeholder",
"do_ocr": true,
"force_ocr": false,
"ocr_engine": "easyocr",
"ocr_lang": ["en"],
"pdf_backend": "dlparse_v2",
"table_mode": "fast",
"abort_on_error": false,
"return_as_file": false,
},
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
}
```
</details>
<details>
<summary>CURL example:</summary>
```sh
curl -X 'POST' \
'http://localhost:5001/v1alpha/convert/source' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"options": {
"from_formats": [
"docx",
"pptx",
"html",
"image",
"pdf",
"asciidoc",
"md",
"xlsx"
],
"to_formats": ["md", "json", "html", "text", "doctags"],
"image_export_mode": "placeholder",
"do_ocr": true,
"force_ocr": false,
"ocr_engine": "easyocr",
"ocr_lang": [
"fr",
"de",
"es",
"en"
],
"pdf_backend": "dlparse_v2",
"table_mode": "fast",
"abort_on_error": false,
"return_as_file": false,
"do_table_structure": true,
"include_images": true,
"images_scale": 2
},
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
}'
```
</details>
<details>
<summary>Python example:</summary>
```python
import httpx
async_client = httpx.AsyncClient(timeout=60.0)
url = "http://localhost:5001/v1alpha/convert/source"
payload = {
"options": {
"from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
"to_formats": ["md", "json", "html", "text", "doctags"],
"image_export_mode": "placeholder",
"do_ocr": True,
"force_ocr": False,
"ocr_engine": "easyocr",
"ocr_lang": "en",
"pdf_backend": "dlparse_v2",
"table_mode": "fast",
"abort_on_error": False,
"return_as_file": False,
},
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
}
response = await async_client_client.post(url, json=payload)
data = response.json()
```
</details>
#### File as base64
The `file_sources` argument in the endpoint allows to send files as base64-encoded strings.
When your PDF or other file type is too large, encoding it and passing it inline to curl
can lead to an “Argument list too long” error on some systems. To avoid this, we write
the JSON request body to a file and have curl read from that file.
<details>
<summary>CURL steps:</summary>
```sh
# 1. Base64-encode the file
B64_DATA=$(base64 -w 0 /path/to/file/pdf-to-convert.pdf)
# 2. Build the JSON with your options
cat <<EOF > /tmp/request_body.json
{
"options": {
},
"file_sources": [{
"base64_string": "${B64_DATA}",
"filename": "pdf-to-convert.pdf"
}]
}
EOF
# 3. POST the request to the docling service
curl -X POST "localhost:5001/v1alpha/convert/source" \
-H "Content-Type: application/json" \
-d @/tmp/request_body.json
```
</details>
### File endpoint
The endpoint is: `/v1alpha/convert/file`, listening for POST requests of Form payloads (necessary as the files are sent as multipart/form data). You can send one or multiple files.
<details>
<summary>CURL example:</summary>
```sh
curl -X 'POST' \
'http://127.0.0.1:5001/v1alpha/convert/file' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'ocr_engine=easyocr' \
-F 'pdf_backend=dlparse_v2' \
-F 'from_formats=pdf' \
-F 'from_formats=docx' \
-F 'force_ocr=false' \
-F 'image_export_mode=embedded' \
-F 'ocr_lang=en' \
-F 'ocr_lang=pl' \
-F 'table_mode=fast' \
-F 'files=@2206.01062v1.pdf;type=application/pdf' \
-F 'abort_on_error=false' \
-F 'to_formats=md' \
-F 'to_formats=text' \
-F 'return_as_file=false' \
-F 'do_ocr=true'
```
</details>
<details>
<summary>Python example:</summary>
```python
import httpx
async_client = httpx.AsyncClient(timeout=60.0)
url = "http://localhost:5001/v1alpha/convert/file"
parameters = {
"from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
"to_formats": ["md", "json", "html", "text", "doctags"],
"image_export_mode": "placeholder",
"do_ocr": True,
"force_ocr": False,
"ocr_engine": "easyocr",
"ocr_lang": ["en"],
"pdf_backend": "dlparse_v2",
"table_mode": "fast",
"abort_on_error": False,
"return_as_file": False
}
current_dir = os.path.dirname(__file__)
file_path = os.path.join(current_dir, '2206.01062v1.pdf')
files = {
'files': ('2206.01062v1.pdf', open(file_path, 'rb'), 'application/pdf'),
}
response = await async_client.post(url, files=files, data={"parameters": json.dumps(parameters)})
assert response.status_code == 200, "Response should be 200 OK"
data = response.json()
```
</details>
## Response format
The response can be a JSON Document or a File.
- If you process only one file, the response will be a JSON document with the following format:
```jsonc
{
"document": {
"md_content": "",
"json_content": {},
"html_content": "",
"text_content": "",
"doctags_content": ""
},
"status": "<success|partial_success|skipped|failure>",
"processing_time": 0.0,
"timings": {},
"errors": []
}
```
Depending on the value you set in `output_formats`, the different items will be populated with their respective results or empty.
`processing_time` is the Docling processing time in seconds, and `timings` (when enabled in the backend) provides the detailed
timing of all the internal Docling components.
- If you set the parameter `return_as_file` to True, the response will be a zip file.
- If multiple files are generated (multiple inputs, or one input but multiple outputs with `return_as_file` True), the response will be a zip file.
## Asynchronous API
TBA