mirror of
https://github.com/docling-project/docling-serve.git
synced 2025-11-29 08:33:50 +00:00
docs: simplify README and move details to docs (#102)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
@@ -3,7 +3,7 @@ config:
|
|||||||
no-emphasis-as-header: false
|
no-emphasis-as-header: false
|
||||||
first-line-heading: false
|
first-line-heading: false
|
||||||
MD033:
|
MD033:
|
||||||
allowed_elements: ["details", "summary"]
|
allowed_elements: ["details", "summary", "br"]
|
||||||
MD024:
|
MD024:
|
||||||
siblings_only: true
|
siblings_only: true
|
||||||
globs:
|
globs:
|
||||||
|
|||||||
2
Makefile
2
Makefile
@@ -66,7 +66,7 @@ action-lint: .action-lint ## Lint GitHub Action workflows
|
|||||||
md-lint: .md-lint ## Lint markdown files
|
md-lint: .md-lint ## Lint markdown files
|
||||||
.md-lint: $(wildcard */**/*.md) | md-lint-file
|
.md-lint: $(wildcard */**/*.md) | md-lint-file
|
||||||
$(ECHO_PREFIX) printf " %-12s ./...\n" "[MD LINT]"
|
$(ECHO_PREFIX) printf " %-12s ./...\n" "[MD LINT]"
|
||||||
$(CMD_PREFIX) docker run --rm -v $$(pwd):/workdir davidanson/markdownlint-cli2:v0.14.0 "**/*.md"
|
$(CMD_PREFIX) docker run --rm -v $$(pwd):/workdir davidanson/markdownlint-cli2:v0.16.0 "**/*.md" "#.venv"
|
||||||
$(CMD_PREFIX) touch $@
|
$(CMD_PREFIX) touch $@
|
||||||
|
|
||||||
.PHONY: py-Lint
|
.PHONY: py-Lint
|
||||||
|
|||||||
431
README.md
431
README.md
@@ -2,422 +2,69 @@
|
|||||||
|
|
||||||
Running [Docling](https://github.com/docling-project/docling) as an API service.
|
Running [Docling](https://github.com/docling-project/docling) as an API service.
|
||||||
|
|
||||||
## Usage
|
## Getting started
|
||||||
|
|
||||||
The API provides two endpoints: one for URLs, one for files. Having a separate file endpoint makes it possible to send files directly in binary format instead of as base64-encoded strings.
|
Install the `docling-serve` package and run the server.
|
||||||
|
|
||||||
### Common parameters
|
```bash
|
||||||
|
# Using the python package
|
||||||
|
pip install "docling-serve"
|
||||||
|
docling-serve run
|
||||||
|
|
||||||
In addition to the file source (see below), both endpoints support the same parameters, which are almost the same as those of the Docling CLI.
|
# Using container images, e.g. with Podman
|
||||||
|
podman run -p 5001:5001 quay.io/docling-project/docling-serve
|
||||||
- `from_formats` (List[str]): Input format(s) to convert from. Allowed values: `docx`, `pptx`, `html`, `image`, `pdf`, `asciidoc`, `md`. Defaults to all formats.
|
|
||||||
- `to_formats` (List[str]): Output format(s) to convert to. Allowed values: `md`, `json`, `html`, `text`, `doctags`. Defaults to `md`.
|
|
||||||
- `do_ocr` (bool): If enabled, the bitmap content will be processed using OCR. Defaults to `True`.
|
|
||||||
- `image_export_mode`: Image export mode for the document (only in case of JSON, Markdown or HTML). Allowed values: embedded, placeholder, referenced. Optional, defaults to `embedded`.
|
|
||||||
- `force_ocr` (bool): If enabled, replace any existing text with OCR-generated text over the full content. Defaults to `False`.
|
|
||||||
- `ocr_engine` (str): OCR engine to use. Allowed values: `easyocr`, `tesseract_cli`, `tesseract`, `rapidocr`, `ocrmac`. Defaults to `easyocr`.
|
|
||||||
- `ocr_lang` (List[str]): List of languages used by the OCR engine. Note that each OCR engine has different values for the language names. Defaults to empty.
|
|
||||||
- `pdf_backend` (str): PDF backend to use. Allowed values: `pypdfium2`, `dlparse_v1`, `dlparse_v2`. Defaults to `dlparse_v2`.
|
|
||||||
- `table_mode` (str): Table mode to use. Allowed values: `fast`, `accurate`. Defaults to `fast`.
|
|
||||||
- `abort_on_error` (bool): If enabled, abort on error. Defaults to false.
|
|
||||||
- `return_as_file` (bool): If enabled, return the output as a file. Defaults to false.
|
|
||||||
- `do_table_structure` (bool): If enabled, the table structure will be extracted. Defaults to true.
|
|
||||||
- `include_images` (bool): If enabled, images will be extracted from the document. Defaults to true.
|
|
||||||
- `images_scale` (float): Scale factor for images. Defaults to 2.0.
|
|
||||||
|
|
||||||
### URL endpoint
|
|
||||||
|
|
||||||
The endpoint is `/v1alpha/convert/source`, listening for POST requests of JSON payloads.
|
|
||||||
|
|
||||||
In addition to the above parameters, you must provide the document(s) you want to process using either the `http_sources` or the `file_sources` field.
|
|
||||||
The first fetches the document(s) from URL(s), optionally with extra headers; the second lets you provide documents as base64-encoded strings.
|
|
||||||
The `options` field is not required; it can be partially or completely omitted.
|
|
||||||
|
|
||||||
Simple payload example:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
|
|
||||||
}
|
|
||||||
```
|
```
|
||||||
|
|
||||||
<details>
|
The server is available at
|
||||||
|
|
||||||
<summary>Complete payload example:</summary>
|
- API <http://127.0.0.1:5001>
|
||||||
|
- API documentation <http://127.0.0.1:5001/docs>
|
||||||
|

|
||||||
|
|
||||||
```json
|
Try it out with a simple conversion:
|
||||||
{
|
|
||||||
"options": {
|
|
||||||
"from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
|
|
||||||
"to_formats": ["md", "json", "html", "text", "doctags"],
|
|
||||||
"image_export_mode": "placeholder",
|
|
||||||
"do_ocr": true,
|
|
||||||
"force_ocr": false,
|
|
||||||
"ocr_engine": "easyocr",
|
|
||||||
"ocr_lang": ["en"],
|
|
||||||
"pdf_backend": "dlparse_v2",
|
|
||||||
"table_mode": "fast",
|
|
||||||
"abort_on_error": false,
|
|
||||||
"return_as_file": false
|
|
||||||
},
|
|
||||||
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
</details>
|
```bash
|
||||||
|
|
||||||
<details>
|
|
||||||
|
|
||||||
<summary>CURL example:</summary>
|
|
||||||
|
|
||||||
```sh
|
|
||||||
curl -X 'POST' \
|
curl -X 'POST' \
|
||||||
'http://localhost:5001/v1alpha/convert/source' \
|
'http://localhost:5001/v1alpha/convert/source' \
|
||||||
-H 'accept: application/json' \
|
-H 'accept: application/json' \
|
||||||
-H 'Content-Type: application/json' \
|
-H 'Content-Type: application/json' \
|
||||||
-d '{
|
-d '{
|
||||||
"options": {
|
"http_sources": [{"url": "https://arxiv.org/pdf/2501.17887"}]
|
||||||
"from_formats": [
|
|
||||||
"docx",
|
|
||||||
"pptx",
|
|
||||||
"html",
|
|
||||||
"image",
|
|
||||||
"pdf",
|
|
||||||
"asciidoc",
|
|
||||||
"md",
|
|
||||||
"xlsx"
|
|
||||||
],
|
|
||||||
"to_formats": ["md", "json", "html", "text", "doctags"],
|
|
||||||
"image_export_mode": "placeholder",
|
|
||||||
"do_ocr": true,
|
|
||||||
"force_ocr": false,
|
|
||||||
"ocr_engine": "easyocr",
|
|
||||||
"ocr_lang": [
|
|
||||||
"fr",
|
|
||||||
"de",
|
|
||||||
"es",
|
|
||||||
"en"
|
|
||||||
],
|
|
||||||
"pdf_backend": "dlparse_v2",
|
|
||||||
"table_mode": "fast",
|
|
||||||
"abort_on_error": false,
|
|
||||||
"return_as_file": false,
|
|
||||||
"do_table_structure": true,
|
|
||||||
"include_images": true,
|
|
||||||
"images_scale": 2
|
|
||||||
},
|
|
||||||
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
|
|
||||||
}'
|
}'
|
||||||
```
|
```
|
||||||
|
|
||||||
</details>
|
### Container images
|
||||||
|
|
||||||
<details>
|
Available container images:
|
||||||
<summary>Python example:</summary>
|
|
||||||
|
|
||||||
```python
|
| Name | Description | Arch | Size |
|
||||||
import httpx
|
| -----|-------------|------|------|
|
||||||
|
| [`ghcr.io/docling-project/docling-serve`](https://github.com/docling-project/docling-serve/pkgs/container/docling-serve) <br /> [`quay.io/docling-project/docling-serve`](https://quay.io/repository/docling-project/docling-serve) | Simple image for Docling Serve, installing all packages from the official pypi.org index. | `linux/amd64`, `linux/arm64` | 3.6 GB |
|
||||||
|
| [`ghcr.io/docling-project/docling-serve-cpu`](https://github.com/docling-project/docling-serve/pkgs/container/docling-serve-cpu) <br /> [`quay.io/docling-project/docling-serve-cpu`](https://quay.io/repository/docling-project/docling-serve-cpu) | CPU-only image which installs `torch` from the PyTorch CPU index. | `linux/amd64`, `linux/arm64` | 3.6 GB |
|
||||||
|
| [`ghcr.io/docling-project/docling-serve-cu124`](https://github.com/docling-project/docling-serve/pkgs/container/docling-serve-cu124) <br /> [`quay.io/docling-project/docling-serve-cu124`](https://quay.io/repository/docling-project/docling-serve-cu124) | CUDA 12.4 image which installs `torch` from the PyTorch cu124 index. | `linux/amd64` | 8.7 GB |
|
||||||
|
|
||||||
async_client = httpx.AsyncClient(timeout=60.0)
|
Coming soon: `docling-serve-slim` images will reduce the size by skipping the model weights download.
|
||||||
url = "http://localhost:5001/v1alpha/convert/source"
|
|
||||||
payload = {
|
|
||||||
"options": {
|
|
||||||
"from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
|
|
||||||
"to_formats": ["md", "json", "html", "text", "doctags"],
|
|
||||||
"image_export_mode": "placeholder",
|
|
||||||
"do_ocr": True,
|
|
||||||
"force_ocr": False,
|
|
||||||
"ocr_engine": "easyocr",
|
|
||||||
"ocr_lang": ["en"],
|
|
||||||
"pdf_backend": "dlparse_v2",
|
|
||||||
"table_mode": "fast",
|
|
||||||
"abort_on_error": False,
|
|
||||||
"return_as_file": False,
|
|
||||||
},
|
|
||||||
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
|
|
||||||
}
|
|
||||||
|
|
||||||
response = await async_client.post(url, json=payload)
|
### Demonstration UI
|
||||||
|
|
||||||
data = response.json()
|
|
||||||
```
|
|
||||||
|
|
||||||
</details>
|
|
||||||
|
|
||||||
#### File as base64
|
|
||||||
|
|
||||||
The `file_sources` argument of the endpoint lets you send files as base64-encoded strings.
|
|
||||||
When your PDF or other file type is too large, encoding it and passing it inline to curl
|
|
||||||
can lead to an “Argument list too long” error on some systems. To avoid this, we write
|
|
||||||
the JSON request body to a file and have curl read from that file.
|
|
||||||
|
|
||||||
<details>
|
|
||||||
<summary>CURL steps:</summary>
|
|
||||||
|
|
||||||
```sh
|
|
||||||
# 1. Base64-encode the file
|
|
||||||
B64_DATA=$(base64 -w 0 /path/to/file/pdf-to-convert.pdf)
|
|
||||||
|
|
||||||
# 2. Build the JSON with your options
|
|
||||||
cat <<EOF > /tmp/request_body.json
|
|
||||||
{
|
|
||||||
"options": {
|
|
||||||
},
|
|
||||||
"file_sources": [{
|
|
||||||
"base64_string": "${B64_DATA}",
|
|
||||||
"filename": "pdf-to-convert.pdf"
|
|
||||||
}]
|
|
||||||
}
|
|
||||||
EOF
|
|
||||||
|
|
||||||
# 3. POST the request to the docling service
|
|
||||||
curl -X POST "localhost:5001/v1alpha/convert/source" \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d @/tmp/request_body.json
|
|
||||||
```
|
|
||||||
|
|
||||||
</details>
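
The curl steps above can also be sketched in Python. This is a minimal illustration, not part of the project: the helper name `build_file_source_payload` is hypothetical, and only the `file_sources`, `base64_string`, `filename` and `options` field names are taken from the request body shown above.

```python
import base64
import json
from pathlib import Path
from typing import Optional


def build_file_source_payload(path: str, options: Optional[dict] = None) -> str:
    """Hypothetical helper: build the JSON body for one base64-encoded file."""
    file_path = Path(path)
    # Base64-encode the raw bytes, like `base64 -w 0` in the curl steps.
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    payload = {
        "options": options or {},
        "file_sources": [
            {"base64_string": encoded, "filename": file_path.name},
        ],
    }
    return json.dumps(payload)
```

The returned string can be written to a file and posted with `curl -d @file`, or sent as the request body by any HTTP client with the `Content-Type: application/json` header.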
|
|
||||||
|
|
||||||
### File endpoint
|
|
||||||
|
|
||||||
The endpoint is: `/v1alpha/convert/file`, listening for POST requests of Form payloads (necessary as the files are sent as multipart/form data). You can send one or multiple files.
|
|
||||||
|
|
||||||
<details>
|
|
||||||
<summary>CURL example:</summary>
|
|
||||||
|
|
||||||
```sh
|
|
||||||
curl -X 'POST' \
|
|
||||||
'http://127.0.0.1:5001/v1alpha/convert/file' \
|
|
||||||
-H 'accept: application/json' \
|
|
||||||
-H 'Content-Type: multipart/form-data' \
|
|
||||||
-F 'ocr_engine=easyocr' \
|
|
||||||
-F 'pdf_backend=dlparse_v2' \
|
|
||||||
-F 'from_formats=pdf' \
|
|
||||||
-F 'from_formats=docx' \
|
|
||||||
-F 'force_ocr=false' \
|
|
||||||
-F 'image_export_mode=embedded' \
|
|
||||||
-F 'ocr_lang=en' \
|
|
||||||
-F 'ocr_lang=pl' \
|
|
||||||
-F 'table_mode=fast' \
|
|
||||||
-F 'files=@2206.01062v1.pdf;type=application/pdf' \
|
|
||||||
-F 'abort_on_error=false' \
|
|
||||||
-F 'to_formats=md' \
|
|
||||||
-F 'to_formats=text' \
|
|
||||||
-F 'return_as_file=false' \
|
|
||||||
-F 'do_ocr=true'
|
|
||||||
```
|
|
||||||
|
|
||||||
</details>
|
|
||||||
|
|
||||||
<details>
|
|
||||||
<summary>Python example:</summary>
|
|
||||||
|
|
||||||
```python
|
|
||||||
import json
import os

import httpx
|
|
||||||
|
|
||||||
async_client = httpx.AsyncClient(timeout=60.0)
|
|
||||||
url = "http://localhost:5001/v1alpha/convert/file"
|
|
||||||
parameters = {
|
|
||||||
"from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
|
|
||||||
"to_formats": ["md", "json", "html", "text", "doctags"],
|
|
||||||
"image_export_mode": "placeholder",
|
|
||||||
"do_ocr": True,
|
|
||||||
"force_ocr": False,
|
|
||||||
"ocr_engine": "easyocr",
|
|
||||||
"ocr_lang": ["en"],
|
|
||||||
"pdf_backend": "dlparse_v2",
|
|
||||||
"table_mode": "fast",
|
|
||||||
"abort_on_error": False,
|
|
||||||
"return_as_file": False
|
|
||||||
}
|
|
||||||
|
|
||||||
current_dir = os.path.dirname(__file__)
|
|
||||||
file_path = os.path.join(current_dir, '2206.01062v1.pdf')
|
|
||||||
|
|
||||||
files = {
|
|
||||||
'files': ('2206.01062v1.pdf', open(file_path, 'rb'), 'application/pdf'),
|
|
||||||
}
|
|
||||||
|
|
||||||
response = await async_client.post(url, files=files, data={"parameters": json.dumps(parameters)})
|
|
||||||
assert response.status_code == 200, "Response should be 200 OK"
|
|
||||||
|
|
||||||
data = response.json()
|
|
||||||
```
|
|
||||||
|
|
||||||
</details>
|
|
||||||
|
|
||||||
### Response format
|
|
||||||
|
|
||||||
The response can be a JSON Document or a File.
|
|
||||||
|
|
||||||
- If you process only one file, the response will be a JSON document with the following format:
|
|
||||||
|
|
||||||
```jsonc
|
|
||||||
{
|
|
||||||
"document": {
|
|
||||||
"md_content": "",
|
|
||||||
"json_content": {},
|
|
||||||
"html_content": "",
|
|
||||||
"text_content": "",
|
|
||||||
"doctags_content": ""
|
|
||||||
},
|
|
||||||
"status": "<success|partial_success|skipped|failure>",
|
|
||||||
"processing_time": 0.0,
|
|
||||||
"timings": {},
|
|
||||||
"errors": []
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Depending on the value you set in `to_formats`, the different items will be populated with their respective results or left empty.
|
|
||||||
|
|
||||||
`processing_time` is the Docling processing time in seconds, and `timings` (when enabled in the backend) provides the detailed
|
|
||||||
timing of all the internal Docling components.
|
|
||||||
|
|
||||||
- If you set the parameter `return_as_file` to True, the response will be a zip file.
|
|
||||||
- If multiple files are generated (multiple inputs, or one input but multiple outputs with `return_as_file` True), the response will be a zip file.
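
Since the response can be either a JSON document or a zip file, a client has to tell the two apart. The following is a minimal sketch under stated assumptions: the function name `unpack_response` is hypothetical, and the zip case is detected by inspecting the body itself rather than relying on any particular response header. The `document`, `md_content` and `status` keys are the ones listed in the format above.

```python
import io
import json
import zipfile


def unpack_response(body: bytes) -> dict:
    """Hypothetical helper: summarize a conversion response body."""
    buffer = io.BytesIO(body)
    # A zip archive is returned for multiple outputs or return_as_file.
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            return {"kind": "zip", "files": archive.namelist()}
    # Otherwise the body is the JSON document described above.
    result = json.loads(body)
    return {
        "kind": "json",
        "status": result.get("status"),
        "markdown": result.get("document", {}).get("md_content"),
    }
```

With an HTTP client such as `httpx`, the raw body is available as `response.content` and can be passed directly to this function.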
|
|
||||||
|
|
||||||
## Run docling-serve
|
|
||||||
|
|
||||||
Clone the repository and run the following from within the cloned directory root.
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m venv venv
|
# Install the Python package with the extra dependencies
|
||||||
source venv/bin/activate
|
|
||||||
pip install "docling-serve[ui]"
|
pip install "docling-serve[ui]"
|
||||||
docling-serve run --enable-ui
|
docling-serve run --enable-ui
|
||||||
|
|
||||||
|
# Run the container image with the extra env parameters
|
||||||
|
podman run -p 5001:5001 -e DOCLING_SERVE_ENABLE_UI=true quay.io/docling-project/docling-serve
|
||||||
```
|
```
|
||||||
|
|
||||||
## Helpers
|
An easy-to-use UI is available at the `/ui` endpoint.
|
||||||
|
|
||||||
- A full Swagger UI is available at the `/docs` endpoint.
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
- An easy to use UI is available at the `/ui` endpoint.
|
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
## Development
|
## Documentation and advanced usage
|
||||||
|
|
||||||
### CPU only
|
Visit the [Docling Serve documentation](./docs/README.md) to learn how to [configure the webserver](./docs/configuration.md), use all the [runtime options](./docs/usage.md) of the API, and follow the [deployment examples](./docs/deployment.md).
|
||||||
|
|
||||||
```sh
|
|
||||||
# Install uv if not already available
|
|
||||||
curl -LsSf https://astral.sh/uv/install.sh | sh
|
|
||||||
|
|
||||||
# Install dependencies
|
|
||||||
uv sync --extra cpu
|
|
||||||
```
|
|
||||||
|
|
||||||
### Cuda GPU
|
|
||||||
|
|
||||||
For GPU support use the following command:
|
|
||||||
|
|
||||||
```sh
|
|
||||||
# Install dependencies
|
|
||||||
uv sync
|
|
||||||
```
|
|
||||||
|
|
||||||
### Gradio UI and different OCR backends
|
|
||||||
|
|
||||||
The `/ui` endpoint (built with `gradio`) and different OCR backends can be enabled via package extras:
|
|
||||||
|
|
||||||
```sh
|
|
||||||
# Enable ui and rapidocr
|
|
||||||
uv sync --extra ui --extra rapidocr
|
|
||||||
```
|
|
||||||
|
|
||||||
```sh
|
|
||||||
# Enable tesserocr
|
|
||||||
uv sync --extra tesserocr
|
|
||||||
```
|
|
||||||
|
|
||||||
See the `[project.optional-dependencies]` section in `pyproject.toml` for the full list of extras, and run `uv run docling-serve --help` for the runtime options.
|
|
||||||
|
|
||||||
### Run the server
|
|
||||||
|
|
||||||
The `docling-serve` executable is a convenient script for launching the webserver both in
|
|
||||||
development and production mode.
|
|
||||||
|
|
||||||
```sh
|
|
||||||
# Run the server in development mode
|
|
||||||
# - reload is enabled by default
|
|
||||||
# - listening on the 127.0.0.1 address
|
|
||||||
# - ui is enabled by default
|
|
||||||
docling-serve dev
|
|
||||||
|
|
||||||
# Run the server in production mode
|
|
||||||
# - reload is disabled by default
|
|
||||||
# - listening on the 0.0.0.0 address
|
|
||||||
# - ui is disabled by default
|
|
||||||
docling-serve run
|
|
||||||
```
|
|
||||||
|
|
||||||
### Options
|
|
||||||
|
|
||||||
The `docling-serve` executable is controlled with both command line
|
|
||||||
options and environment variables.
|
|
||||||
|
|
||||||
<details>
|
|
||||||
<summary>`docling-serve` help message</summary>
|
|
||||||
|
|
||||||
```sh
|
|
||||||
$ docling-serve dev --help
|
|
||||||
|
|
||||||
Usage: docling-serve dev [OPTIONS]
|
|
||||||
|
|
||||||
Run a Docling Serve app in development mode. 🧪
|
|
||||||
This is equivalent to docling-serve run but with reload
|
|
||||||
enabled and listening on the 127.0.0.1 address.
|
|
||||||
|
|
||||||
Options can be set also with the corresponding ENV variable, with the exception
|
|
||||||
of --enable-ui, --host and --reload.
|
|
||||||
|
|
||||||
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────╮
|
|
||||||
│ --host TEXT The host to serve on. For local development in localhost │
|
|
||||||
│ use 127.0.0.1. To enable public access, e.g. in a │
|
|
||||||
│ container, use all the IP addresses available with │
|
|
||||||
│ 0.0.0.0. │
|
|
||||||
│ [default: 127.0.0.1] │
|
|
||||||
│ --port INTEGER The port to serve on. [default: 5001] │
|
|
||||||
│ --reload --no-reload Enable auto-reload of the server when (code) files │
|
|
||||||
│ change. This is resource intensive, use it only during │
|
|
||||||
│ development. │
|
|
||||||
│ [default: reload] │
|
|
||||||
│ --root-path TEXT The root path is used to tell your app that it is being │
|
|
||||||
│ served to the outside world with some path prefix set up │
|
|
||||||
│ in some termination proxy or similar. │
|
|
||||||
│ --proxy-headers --no-proxy-headers Enable/Disable X-Forwarded-Proto, X-Forwarded-For, │
|
|
||||||
│ X-Forwarded-Port to populate remote address info. │
|
|
||||||
│ [default: proxy-headers] │
|
|
||||||
│ --artifacts-path PATH If set to a valid directory, the model weights will be │
|
|
||||||
│ loaded from this path. │
|
|
||||||
│ [default: None] │
|
|
||||||
│ --enable-ui --no-enable-ui Enable the development UI. [default: enable-ui] │
|
|
||||||
│ --help Show this message and exit. │
|
|
||||||
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
|
|
||||||
```
|
|
||||||
|
|
||||||
</details>
|
|
||||||
|
|
||||||
#### Environment variables
|
|
||||||
|
|
||||||
The environment variables controlling the `uvicorn` execution can be specified with the `UVICORN_` prefix:
|
|
||||||
|
|
||||||
- `UVICORN_WORKERS`: Number of workers to use.
|
|
||||||
- `UVICORN_RELOAD`: If `True`, this will enable auto-reload when you modify files, useful for development.
|
|
||||||
|
|
||||||
The environment variables controlling specifics of the Docling Serve app can be specified with the
|
|
||||||
`DOCLING_SERVE_` prefix:
|
|
||||||
|
|
||||||
- `DOCLING_SERVE_ARTIFACTS_PATH`: If set, Docling will use only the local model weights found at this path, for example `/opt/app-root/src/.cache/docling/models`.
|
|
||||||
- `DOCLING_SERVE_ENABLE_UI`: If `True`, the Gradio UI will be available at `/ui`.
|
|
||||||
|
|
||||||
Others:
|
|
||||||
|
|
||||||
- `TESSDATA_PREFIX`: Tesseract data location, example `/usr/share/tesseract/tessdata/`.
|
|
||||||
|
|
||||||
## Get help and support
|
## Get help and support
|
||||||
|
|
||||||
@@ -433,14 +80,14 @@ If you use Docling in your projects, please consider citing the following:
|
|||||||
|
|
||||||
```bib
|
```bib
|
||||||
@techreport{Docling,
|
@techreport{Docling,
|
||||||
author = {Deep Search Team},
|
author = {Docling Contributors},
|
||||||
month = {8},
|
month = {1},
|
||||||
title = {Docling Technical Report},
|
title = {Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion},
|
||||||
url = {https://arxiv.org/abs/2408.09869},
|
url = {https://arxiv.org/abs/2501.17887},
|
||||||
eprint = {2408.09869},
|
eprint = {2501.17887},
|
||||||
doi = {10.48550/arXiv.2408.09869},
|
doi = {10.48550/arXiv.2501.17887},
|
||||||
version = {1.0.0},
|
version = {2.0.0},
|
||||||
year = {2024}
|
year = {2025}
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|||||||
8
docs/README.md
Normal file
8
docs/README.md
Normal file
@@ -0,0 +1,8 @@
|
|||||||
|
# Docling Serve documentation
|
||||||
|
|
||||||
|
These documentation pages cover the webserver configuration, runtime options, deployment examples, and development best practices.
|
||||||
|
|
||||||
|
- [Configuration](./configuration.md)
|
||||||
|
- [Advanced usage](./usage.md)
|
||||||
|
- [Deployment](./deployment.md)
|
||||||
|
- [Development](./development.md)
|
||||||
40
docs/configuration.md
Normal file
40
docs/configuration.md
Normal file
@@ -0,0 +1,40 @@
|
|||||||
|
# Configuration
|
||||||
|
|
||||||
|
The `docling-serve` executable lets you configure the server via command line
|
||||||
|
options as well as environment variables.
|
||||||
|
Configurations are divided between the settings used for the `uvicorn` ASGI
|
||||||
|
server and the actual app-specific configurations.
|
||||||
|
|
||||||
|
> [!WARNING]
|
||||||
|
> When the server is running with `reload` or with multiple `workers`, uvicorn
|
||||||
|
> will spawn multiple subprocesses. This invalidates all the values configured
|
||||||
|
> via the command line options. Please use environment variables in this
|
||||||
|
> type of deployment.
|
||||||
|
|
||||||
|
## Webserver configuration
|
||||||
|
|
||||||
|
The following table shows the options which are propagated directly to the
|
||||||
|
`uvicorn` webserver runtime.
|
||||||
|
|
||||||
|
| CLI option | ENV | Default | Description |
|
||||||
|
| -----------|-----|---------|-------------|
|
||||||
|
| `--host` | `UVICORN_HOST` | `0.0.0.0` for `run`, `localhost` for `dev` | The host to serve on. |
|
||||||
|
| `--port` | `UVICORN_PORT` | `5001` | The port to serve on. |
|
||||||
|
| `--reload` | `UVICORN_RELOAD` | `false` for `run`, `true` for `dev` | Enable auto-reload of the server when (code) files change. |
|
||||||
|
| `--workers` | `UVICORN_WORKERS` | `1` | Use multiple worker processes. |
|
||||||
|
| `--root-path` | `UVICORN_ROOT_PATH` | `""` | The root path is used to tell your app that it is being served to the outside world with some path prefix set up in some termination proxy or similar. |
|
||||||
|
| `--proxy-headers` | `UVICORN_PROXY_HEADERS` | `true` | Enable/Disable X-Forwarded-Proto, X-Forwarded-For, X-Forwarded-Port to populate remote address info. |
|
||||||
|
| `--timeout-keep-alive` | `UVICORN_TIMEOUT_KEEP_ALIVE` | `60` | Timeout for the server response. |
|
||||||
|
|
||||||
|
## Docling Serve configuration
|
||||||
|
|
||||||
|
The following table describes the options to configure the Docling Serve app.
|
||||||
|
|
||||||
|
| CLI option | ENV | Default | Description |
|
||||||
|
| -----------|-----|---------|-------------|
|
||||||
|
| `--artifacts-path` | `DOCLING_SERVE_ARTIFACTS_PATH` | unset | If set to a valid directory, the model weights will be loaded from this path. |
|
||||||
|
| `--enable-ui` | `DOCLING_SERVE_ENABLE_UI` | `false` | Enable the demonstrator UI. |
|
||||||
|
| | `DOCLING_SERVE_OPTIONS_CACHE_SIZE` | `2` | How many DocumentConverter objects (including their loaded models) to keep in the cache. |
|
||||||
|
| | `DOCLING_SERVE_CORS_ORIGINS` | `["*"]` | A list of origins that should be permitted to make cross-origin requests. |
|
||||||
|
| | `DOCLING_SERVE_CORS_METHODS` | `["*"]` | A list of HTTP methods that should be allowed for cross-origin requests. |
|
||||||
|
| | `DOCLING_SERVE_CORS_HEADERS` | `["*"]` | A list of HTTP request headers that should be supported for cross-origin requests. |
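
Because command line options are not reliable with `reload` or multiple workers, it can help to assemble the configuration as environment variables before launching the server. The sketch below is only an illustration: the helper name `serve_environment` is hypothetical, and it assumes that list-valued settings are passed as JSON strings (matching how the defaults are displayed in the table) and that booleans are passed as `true`/`false`.

```python
import json
import os


def serve_environment(
    workers: int, cors_origins: list[str], enable_ui: bool = False
) -> dict[str, str]:
    """Hypothetical helper: environment for a multi-worker deployment."""
    env = dict(os.environ)  # keep PATH and other inherited variables
    env["UVICORN_WORKERS"] = str(workers)
    env["DOCLING_SERVE_ENABLE_UI"] = "true" if enable_ui else "false"
    # Assumption: list settings are encoded as JSON, e.g. ["*"].
    env["DOCLING_SERVE_CORS_ORIGINS"] = json.dumps(cors_origins)
    return env
```

The resulting mapping can be handed to `subprocess.run(["docling-serve", "run"], env=env)`, so that every worker process sees the same values.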
|
||||||
12
docs/deployment.md
Normal file
12
docs/deployment.md
Normal file
@@ -0,0 +1,12 @@
|
|||||||
|
# Deployment
|
||||||
|
|
||||||
|
## Kubernetes and OpenShift
|
||||||
|
|
||||||
|
### Knative
|
||||||
|
|
||||||
|
The following manifest will launch Docling Serve using Knative to expose the application
|
||||||
|
with an external ingress endpoint.
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
# TODO
|
||||||
|
```
|
||||||
57
docs/development.md
Normal file
57
docs/development.md
Normal file
@@ -0,0 +1,57 @@
|
|||||||
|
# Development
|
||||||
|
|
||||||
|
## Install dependencies
|
||||||
|
|
||||||
|
### CPU only
|
||||||
|
|
||||||
|
```sh
|
||||||
|
# Install uv if not already available
|
||||||
|
curl -LsSf https://astral.sh/uv/install.sh | sh
|
||||||
|
|
||||||
|
# Install dependencies
|
||||||
|
uv sync --extra cpu
|
||||||
|
```
|
||||||
|
|
||||||
|
### Cuda GPU
|
||||||
|
|
||||||
|
For GPU support use the following command:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
# Install dependencies
|
||||||
|
uv sync
|
||||||
|
```
|
||||||
|
|
||||||
|
### Gradio UI and different OCR backends
|
||||||
|
|
||||||
|
The `/ui` endpoint (built with `gradio`) and different OCR backends can be enabled via package extras:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
# Enable ui and rapidocr
|
||||||
|
uv sync --extra ui --extra rapidocr
|
||||||
|
```
|
||||||
|
|
||||||
|
```sh
|
||||||
|
# Enable tesserocr
|
||||||
|
uv sync --extra tesserocr
|
||||||
|
```
|
||||||
|
|
||||||
|
See the `[project.optional-dependencies]` section in `pyproject.toml` for the full list of extras, and run `uv run docling-serve --help` for the runtime options.
|
||||||
|
|
||||||
|
### Run the server
|
||||||
|
|
||||||
|
The `docling-serve` executable is a convenient script for launching the webserver both in
|
||||||
|
development and production mode.
|
||||||
|
|
||||||
|
```sh
|
||||||
|
# Run the server in development mode
|
||||||
|
# - reload is enabled by default
|
||||||
|
# - listening on the 127.0.0.1 address
|
||||||
|
# - ui is enabled by default
|
||||||
|
docling-serve dev
|
||||||
|
|
||||||
|
# Run the server in production mode
|
||||||
|
# - reload is disabled by default
|
||||||
|
# - listening on the 0.0.0.0 address
|
||||||
|
# - ui is disabled by default
|
||||||
|
docling-serve run
|
||||||
|
```
|
||||||
279
docs/usage.md
Normal file
279
docs/usage.md
Normal file
@@ -0,0 +1,279 @@
|
|||||||
|
# Usage
|
||||||
|
|
||||||
|
The API provides two endpoints: one for URLs, one for files. Having a separate file endpoint makes it possible to send files directly in binary format instead of as base64-encoded strings.
|
||||||
|
|
||||||
|
## Common parameters
|
||||||
|
|
||||||
|
In addition to the file source (see below), both endpoints support the same parameters, which are almost the same as those of the Docling CLI.
|
||||||
|
|
||||||
|
- `from_formats` (List[str]): Input format(s) to convert from. Allowed values: `docx`, `pptx`, `html`, `image`, `pdf`, `asciidoc`, `md`. Defaults to all formats.
|
||||||
|
- `to_formats` (List[str]): Output format(s) to convert to. Allowed values: `md`, `json`, `html`, `text`, `doctags`. Defaults to `md`.
|
||||||
|
- `do_ocr` (bool): If enabled, the bitmap content will be processed using OCR. Defaults to `True`.
|
||||||
|
- `image_export_mode`: Image export mode for the document (only in case of JSON, Markdown or HTML). Allowed values: embedded, placeholder, referenced. Optional, defaults to `embedded`.
|
||||||
|
- `force_ocr` (bool): If enabled, replace any existing text with OCR-generated text over the full content. Defaults to `False`.
|
||||||
|
- `ocr_engine` (str): OCR engine to use. Allowed values: `easyocr`, `tesseract_cli`, `tesseract`, `rapidocr`, `ocrmac`. Defaults to `easyocr`.
|
||||||
|
- `ocr_lang` (List[str]): List of languages used by the OCR engine. Note that each OCR engine has different values for the language names. Defaults to empty.
|
||||||
|
- `pdf_backend` (str): PDF backend to use. Allowed values: `pypdfium2`, `dlparse_v1`, `dlparse_v2`. Defaults to `dlparse_v2`.
|
||||||
|
- `table_mode` (str): Table mode to use. Allowed values: `fast`, `accurate`. Defaults to `fast`.
|
||||||
|
- `abort_on_error` (bool): If enabled, abort on error. Defaults to false.
|
||||||
|
- `return_as_file` (bool): If enabled, return the output as a file. Defaults to false.
|
||||||
|
- `do_table_structure` (bool): If enabled, the table structure will be extracted. Defaults to true.
|
||||||
|
- `include_images` (bool): If enabled, images will be extracted from the document. Defaults to true.
|
||||||
|
- `images_scale` (float): Scale factor for images. Defaults to 2.0.
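
The allowed values listed above can be checked on the client before a request is sent. This is a minimal sketch, not part of Docling Serve: the helper `build_options` and the `ALLOWED` table are hypothetical, and the table only copies values from the list above. `from_formats` is deliberately left unchecked, because the payload examples also use `xlsx`, which the list does not mention.

```python
from typing import Any

# Allowed values copied from the parameter list above.
ALLOWED = {
    "to_formats": {"md", "json", "html", "text", "doctags"},
    "image_export_mode": {"embedded", "placeholder", "referenced"},
    "ocr_engine": {"easyocr", "tesseract_cli", "tesseract", "rapidocr", "ocrmac"},
    "pdf_backend": {"pypdfium2", "dlparse_v1", "dlparse_v2"},
    "table_mode": {"fast", "accurate"},
}


def build_options(**overrides: Any) -> dict[str, Any]:
    """Hypothetical helper: validate overrides and return the options dict."""
    for name, value in overrides.items():
        allowed = ALLOWED.get(name)
        if allowed is None:
            continue  # booleans, floats and free-form lists pass through
        values = value if isinstance(value, list) else [value]
        unknown = [v for v in values if v not in allowed]
        if unknown:
            raise ValueError(f"{name}: unsupported value(s) {unknown}")
    return dict(overrides)
```

For example, `build_options(to_formats=["md", "json"], table_mode="accurate")` returns a dict that can be used as the `options` field of a request, while an unknown `ocr_engine` raises a `ValueError` before any network call is made.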
|
||||||
|
|
||||||
|
## Convert endpoints
|
||||||
|
|
||||||
|
### Source endpoint
|
||||||
|
|
||||||
|
The endpoint is `/v1alpha/convert/source`, listening for POST requests of JSON payloads.
|
||||||
|
|
||||||
|
In addition to the above parameters, you must provide the document(s) you want to process using either the `http_sources` or the `file_sources` field.
|
||||||
|
The first fetches the document(s) from URL(s), optionally with extra headers; the second lets you provide documents as base64-encoded strings.
|
||||||
|
The `options` field is not required; it can be partially or completely omitted.
|
||||||
|
|
||||||
|
Simple payload example:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
<details>
|
||||||
|
|
||||||
|
<summary>Complete payload example:</summary>
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"options": {
|
||||||
|
"from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
|
||||||
|
"to_formats": ["md", "json", "html", "text", "doctags"],
|
||||||
|
"image_export_mode": "placeholder",
|
||||||
|
"do_ocr": true,
|
||||||
|
"force_ocr": false,
|
||||||
|
"ocr_engine": "easyocr",
|
||||||
|
"ocr_lang": ["en"],
|
||||||
|
"pdf_backend": "dlparse_v2",
|
||||||
|
"table_mode": "fast",
|
||||||
|
"abort_on_error": false,
|
||||||
|
"return_as_file": false
|
||||||
|
},
|
||||||
|
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
</details>
|
||||||
|
|
||||||
|
<details>
|
||||||
|
|
||||||
|
<summary>CURL example:</summary>
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -X 'POST' \
|
||||||
|
'http://localhost:5001/v1alpha/convert/source' \
|
||||||
|
-H 'accept: application/json' \
|
||||||
|
-H 'Content-Type: application/json' \
|
||||||
|
-d '{
|
||||||
|
"options": {
|
||||||
|
"from_formats": [
|
||||||
|
"docx",
|
||||||
|
"pptx",
|
||||||
|
"html",
|
||||||
|
"image",
|
||||||
|
"pdf",
|
||||||
|
"asciidoc",
|
||||||
|
"md",
|
||||||
|
"xlsx"
|
||||||
|
],
|
||||||
|
"to_formats": ["md", "json", "html", "text", "doctags"],
|
||||||
|
"image_export_mode": "placeholder",
|
||||||
|
"do_ocr": true,
|
||||||
|
"force_ocr": false,
|
||||||
|
"ocr_engine": "easyocr",
|
||||||
|
"ocr_lang": [
|
||||||
|
"fr",
|
||||||
|
"de",
|
||||||
|
"es",
|
||||||
|
"en"
|
||||||
|
],
|
||||||
|
"pdf_backend": "dlparse_v2",
|
||||||
|
"table_mode": "fast",
|
||||||
|
"abort_on_error": false,
|
||||||
|
"return_as_file": false,
|
||||||
|
"do_table_structure": true,
|
||||||
|
"include_images": true,
|
||||||
|
"images_scale": 2
|
||||||
|
},
|
||||||
|
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
|
||||||
|
}'
|
||||||
|
```
|
||||||
|
|
||||||
|
</details>
|
||||||
|
|
||||||
|
<details>
|
||||||
|
<summary>Python example:</summary>
|
||||||
|
|
||||||
|
```python
|
||||||
|
import httpx
|
||||||
|
|
||||||
|
async_client = httpx.AsyncClient(timeout=60.0)
|
||||||
|
url = "http://localhost:5001/v1alpha/convert/source"
|
||||||
|
payload = {
|
||||||
|
"options": {
|
||||||
|
"from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
|
||||||
|
"to_formats": ["md", "json", "html", "text", "doctags"],
|
||||||
|
"image_export_mode": "placeholder",
|
||||||
|
"do_ocr": True,
|
||||||
|
"force_ocr": False,
|
||||||
|
"ocr_engine": "easyocr",
|
||||||
|
"ocr_lang": ["en"],
|
||||||
|
"pdf_backend": "dlparse_v2",
|
||||||
|
"table_mode": "fast",
|
||||||
|
"abort_on_error": False,
|
||||||
|
"return_as_file": False,
|
||||||
|
},
|
||||||
|
"http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
|
||||||
|
}
|
||||||
|
|
||||||
|
response = await async_client.post(url, json=payload)
|
||||||
|
|
||||||
|
data = response.json()
|
||||||
|
```
|
||||||
|
|
||||||
|
</details>
|
||||||
|
|
||||||
|
#### File as base64
|
||||||
|
|
||||||
|
The `file_sources` argument of the endpoint lets you send files as base64-encoded strings.
|
||||||
|
When your PDF or other file type is too large, encoding it and passing it inline to curl
|
||||||
|
can lead to an “Argument list too long” error on some systems. To avoid this, we write
|
||||||
|
the JSON request body to a file and have curl read from that file.
|
||||||
|
|
||||||
|
<details>
|
||||||
|
<summary>CURL steps:</summary>
|
||||||
|
|
||||||
|
```sh
|
||||||
|
# 1. Base64-encode the file
|
||||||
|
B64_DATA=$(base64 -w 0 /path/to/file/pdf-to-convert.pdf)
|
||||||
|
|
||||||
|
# 2. Build the JSON with your options
|
||||||
|
cat <<EOF > /tmp/request_body.json
|
||||||
|
{
|
||||||
|
"options": {
|
||||||
|
},
|
||||||
|
"file_sources": [{
|
||||||
|
"base64_string": "${B64_DATA}",
|
||||||
|
"filename": "pdf-to-convert.pdf"
|
||||||
|
}]
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
# 3. POST the request to the docling service
|
||||||
|
curl -X POST "localhost:5001/v1alpha/convert/source" \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d @/tmp/request_body.json
|
||||||
|
```
|
||||||
|
|
||||||
|
</details>
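
The same request body can also be built in Python. The following is a minimal sketch, not part of the service itself: the helper name `build_file_source_payload` is hypothetical, and the field names `base64_string` and `filename` are taken from the CURL steps above.

```python
import base64
from pathlib import Path


def build_file_source_payload(path, options=None):
    """Build a request body for /v1alpha/convert/source from one local file.

    Hypothetical helper: it mirrors the JSON written by the CURL steps above.
    """
    file_path = Path(path)
    # Base64-encode the raw bytes and decode to str so the value is JSON-serializable.
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return {
        "options": options or {},
        "file_sources": [
            {"base64_string": encoded, "filename": file_path.name},
        ],
    }
```

The resulting dictionary can be passed as `json=payload` to an `httpx` client, in the same way as the Python example for `http_sources`.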
|
||||||
|
|
||||||
|
### File endpoint
|
||||||
|
|
||||||
|
The endpoint is `/v1alpha/convert/file`, listening for POST requests with form payloads (necessary because the files are sent as `multipart/form-data`). You can send one or multiple files.
|
||||||
|
|
||||||
|
<details>
|
||||||
|
<summary>CURL example:</summary>
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -X 'POST' \
|
||||||
|
'http://127.0.0.1:5001/v1alpha/convert/file' \
|
||||||
|
-H 'accept: application/json' \
|
||||||
|
-H 'Content-Type: multipart/form-data' \
|
||||||
|
-F 'ocr_engine=easyocr' \
|
||||||
|
-F 'pdf_backend=dlparse_v2' \
|
||||||
|
-F 'from_formats=pdf' \
|
||||||
|
-F 'from_formats=docx' \
|
||||||
|
-F 'force_ocr=false' \
|
||||||
|
-F 'image_export_mode=embedded' \
|
||||||
|
-F 'ocr_lang=en' \
|
||||||
|
-F 'ocr_lang=pl' \
|
||||||
|
-F 'table_mode=fast' \
|
||||||
|
-F 'files=@2206.01062v1.pdf;type=application/pdf' \
|
||||||
|
-F 'abort_on_error=false' \
|
||||||
|
-F 'to_formats=md' \
|
||||||
|
-F 'to_formats=text' \
|
||||||
|
-F 'return_as_file=false' \
|
||||||
|
-F 'do_ocr=true'
|
||||||
|
```
|
||||||
|
|
||||||
|
</details>
|
||||||
|
|
||||||
|
<details>
|
||||||
|
<summary>Python example:</summary>
|
||||||
|
|
||||||
|
```python
|
||||||
|
import json
import os

import httpx
|
||||||
|
|
||||||
|
async_client = httpx.AsyncClient(timeout=60.0)
|
||||||
|
url = "http://localhost:5001/v1alpha/convert/file"
|
||||||
|
parameters = {
|
||||||
|
"from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
|
||||||
|
"to_formats": ["md", "json", "html", "text", "doctags"],
|
||||||
|
"image_export_mode": "placeholder",
|
||||||
|
"do_ocr": True,
|
||||||
|
"force_ocr": False,
|
||||||
|
"ocr_engine": "easyocr",
|
||||||
|
"ocr_lang": ["en"],
|
||||||
|
"pdf_backend": "dlparse_v2",
|
||||||
|
"table_mode": "fast",
|
||||||
|
"abort_on_error": False,
|
||||||
|
"return_as_file": False
|
||||||
|
}
|
||||||
|
|
||||||
|
current_dir = os.path.dirname(__file__)
|
||||||
|
file_path = os.path.join(current_dir, '2206.01062v1.pdf')
|
||||||
|
|
||||||
|
files = {
|
||||||
|
'files': ('2206.01062v1.pdf', open(file_path, 'rb'), 'application/pdf'),
|
||||||
|
}
|
||||||
|
|
||||||
|
response = await async_client.post(url, files=files, data={"parameters": json.dumps(parameters)})
|
||||||
|
assert response.status_code == 200, "Response should be 200 OK"
|
||||||
|
|
||||||
|
data = response.json()
|
||||||
|
```
|
||||||
|
|
||||||
|
</details>
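
To send several files in one request, `httpx` accepts a list of `(field_name, file_tuple)` pairs in place of a dictionary. The sketch below is an illustration under assumptions: the helper name `build_files_field` is hypothetical, and the repeated form field name `files` is taken from the CURL example above.

```python
import mimetypes
from pathlib import Path


def build_files_field(paths):
    """Return a list of ("files", (name, content, mime)) tuples for httpx.

    Hypothetical helper: every file is attached under the same "files" field.
    """
    fields = []
    for path in map(Path, paths):
        # Guess the MIME type from the extension; fall back to a generic type.
        mime, _ = mimetypes.guess_type(path.name)
        fields.append(
            ("files", (path.name, path.read_bytes(), mime or "application/octet-stream"))
        )
    return fields
```

The returned list can be passed as the `files=` argument of `async_client.post(...)`. Reading each file into memory keeps the sketch short; for large documents, passing open file handles instead may be preferable.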
|
||||||
|
|
||||||
|
## Response format
|
||||||
|
|
||||||
|
The response can be a JSON Document or a File.
|
||||||
|
|
||||||
|
- If you process only one file, the response will be a JSON document with the following format:
|
||||||
|
|
||||||
|
```jsonc
|
||||||
|
{
|
||||||
|
"document": {
|
||||||
|
"md_content": "",
|
||||||
|
"json_content": {},
|
||||||
|
"html_content": "",
|
||||||
|
"text_content": "",
|
||||||
|
"doctags_content": ""
|
||||||
|
},
|
||||||
|
"status": "<success|partial_success|skipped|failure>",
|
||||||
|
"processing_time": 0.0,
|
||||||
|
"timings": {},
|
||||||
|
"errors": []
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Depending on the value you set in `to_formats`, the corresponding items will be populated with their respective results; the others will be empty.
|
||||||
|
|
||||||
|
`processing_time` is the Docling processing time in seconds, and `timings` (when enabled in the backend) provides the detailed
|
||||||
|
timing of all the internal Docling components.
|
||||||
|
|
||||||
|
- If you set the parameter `return_as_file` to True, the response will be a zip file.
|
||||||
|
- If multiple files are generated (multiple inputs, or one input but multiple outputs with `return_as_file` True), the response will be a zip file.
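
For the single-file JSON case, a client might inspect the response as in the sketch below. This is not part of the service: the helper names `populated_outputs` and `is_usable` are hypothetical, and treating `partial_success` as usable is an assumption that may not suit every workflow.

```python
def populated_outputs(result):
    """Return the sorted names of the `document` fields that carry content."""
    document = result.get("document") or {}
    # Empty strings and empty dicts mean the format was not requested or produced.
    return sorted(name for name, content in document.items() if content)


def is_usable(result):
    """Assumption: both success and partial_success yield usable output."""
    return result.get("status") in {"success", "partial_success"}
```

For example, a response converted with `to_formats` set to `["md", "text"]` would be expected to report only `md_content` and `text_content` as populated.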
|
||||||
|
|
||||||
|
## Asynchronous API
|
||||||
|
|
||||||
|
TBA
|
||||||