diff --git a/.markdownlint-cli2.yaml b/.markdownlint-cli2.yaml index 95d2b82..5dff393 100644 --- a/.markdownlint-cli2.yaml +++ b/.markdownlint-cli2.yaml @@ -3,7 +3,7 @@ config: no-emphasis-as-header: false first-line-heading: false MD033: - allowed_elements: ["details", "summary"] + allowed_elements: ["details", "summary", "br"] MD024: siblings_only: true globs: diff --git a/Makefile b/Makefile index a5f10fe..a080f92 100644 --- a/Makefile +++ b/Makefile @@ -66,7 +66,7 @@ action-lint: .action-lint ## Lint GitHub Action workflows md-lint: .md-lint ## Lint markdown files .md-lint: $(wildcard */**/*.md) | md-lint-file $(ECHO_PREFIX) printf " %-12s ./...\n" "[MD LINT]" - $(CMD_PREFIX) docker run --rm -v $$(pwd):/workdir davidanson/markdownlint-cli2:v0.14.0 "**/*.md" + $(CMD_PREFIX) docker run --rm -v $$(pwd):/workdir davidanson/markdownlint-cli2:v0.16.0 "**/*.md" "#.venv" $(CMD_PREFIX) touch $@ .PHONY: py-Lint diff --git a/README.md b/README.md index b4e9dae..ff54f25 100644 --- a/README.md +++ b/README.md @@ -1,423 +1,70 @@ # Docling Serve - Running [Docling](https://github.com/docling-project/docling) as an API service. +Running [Docling](https://github.com/docling-project/docling) as an API service. -## Usage +## Getting started -The API provides two endpoints: one for urls, one for files. This is necessary to send files directly in binary format instead of base64-encoded strings. +Install the `docling-serve` package and run the server. -### Common parameters +```bash +# Using the python package +pip install "docling-serve" +docling-serve run -On top of the source of file (see below), both endpoints support the same parameters, which are almost the same as the Docling CLI. - -- `from_format` (List[str]): Input format(s) to convert from. Allowed values: `docx`, `pptx`, `html`, `image`, `pdf`, `asciidoc`, `md`. Defaults to all formats. -- `to_formats` (List[str]): Output format(s) to convert to. Allowed values: `md`, `json`, `html`, `text`, `doctags`. Defaults to `md`. 
-- `do_ocr` (bool): If enabled, the bitmap content will be processed using OCR. Defaults to `True`. -- `image_export_mode`: Image export mode for the document (only in case of JSON, Markdown or HTML). Allowed values: embedded, placeholder, referenced. Optional, defaults to `embedded`. -- `force_ocr` (bool): If enabled, replace any existing text with OCR-generated text over the full content. Defaults to `False`. -- `ocr_engine` (str): OCR engine to use. Allowed values: `easyocr`, `tesseract_cli`, `tesseract`, `rapidocr`, `ocrmac`. Defaults to `easyocr`. -- `ocr_lang` (List[str]): List of languages used by the OCR engine. Note that each OCR engine has different values for the language names. Defaults to empty. -- `pdf_backend` (str): PDF backend to use. Allowed values: `pypdfium2`, `dlparse_v1`, `dlparse_v2`. Defaults to `dlparse_v2`. -- `table_mode` (str): Table mode to use. Allowed values: `fast`, `accurate`. Defaults to `fast`. -- `abort_on_error` (bool): If enabled, abort on error. Defaults to false. -- `return_as_file` (boo): If enabled, return the output as a file. Defaults to false. -- `do_table_structure` (bool): If enabled, the table structure will be extracted. Defaults to true. -- `include_images` (bool): If enabled, images will be extracted from the document. Defaults to true. -- `images_scale` (float): Scale factor for images. Defaults to 2.0. - -### URL endpoint - -The endpoint is `/v1alpha/convert/source`, listening for POST requests of JSON payloads. - -On top of the above parameters, you must send the URL(s) of the document you want process with either the `http_sources` or `file_sources` fields. -The first is fetching URL(s) (optionally using with extra headers), the second allows to provide documents as base64-encoded strings. -No `options` is required, they can be partially or completely omitted. - -Simple payload example: - -```json -{ - "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}] -} +# Using container images, e.g. 
with Podman +podman run -p 5001:5001 quay.io/docling-project/docling-serve ``` -
+The server is available at -Complete payload example: +- API +- API documentation + ![swagger.png](img/swagger.png) -```json -{ - "options": { - "from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"], - "to_formats": ["md", "json", "html", "text", "doctags"], - "image_export_mode": "placeholder", - "do_ocr": true, - "force_ocr": false, - "ocr_engine": "easyocr", - "ocr_lang": ["en"], - "pdf_backend": "dlparse_v2", - "table_mode": "fast", - "abort_on_error": false, - "return_as_file": false, - }, - "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}] -} -``` +Try it out with a simple conversion: -
- -
- -CURL example: - -```sh +```bash curl -X 'POST' \ 'http://localhost:5001/v1alpha/convert/source' \ -H 'accept: application/json' \ -H 'Content-Type: application/json' \ -d '{ - "options": { - "from_formats": [ - "docx", - "pptx", - "html", - "image", - "pdf", - "asciidoc", - "md", - "xlsx" - ], - "to_formats": ["md", "json", "html", "text", "doctags"], - "image_export_mode": "placeholder", - "do_ocr": true, - "force_ocr": false, - "ocr_engine": "easyocr", - "ocr_lang": [ - "fr", - "de", - "es", - "en" - ], - "pdf_backend": "dlparse_v2", - "table_mode": "fast", - "abort_on_error": false, - "return_as_file": false, - "do_table_structure": true, - "include_images": true, - "images_scale": 2 - }, - "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}] -}' + "http_sources": [{"url": "https://arxiv.org/pdf/2501.17887"}] + }' ``` -
+### Container images -
-Python example: +Available container images: -```python -import httpx +| Name | Description | Arch | Size | +| -----|-------------|------|------| +| [`ghcr.io/docling-project/docling-serve`](https://github.com/docling-project/docling-serve/pkgs/container/docling-serve)
[`quay.io/docling-project/docling-serve`](https://quay.io/repository/docling-project/docling-serve) | Simple image for Docling Serve, installing all packages from the official pypi.org index. | `linux/amd64`, `linux/arm64` | 3.6 GB | +| [`ghcr.io/docling-project/docling-serve-cpu`](https://github.com/docling-project/docling-serve/pkgs/container/docling-serve-cpu)
[`quay.io/docling-project/docling-serve-cpu`](https://quay.io/repository/docling-project/docling-serve-cpu) | Cpu-only image which installs `torch` from the pytorch cpu index. | `linux/amd64`, `linux/arm64` | 3.6 GB | +| [`ghcr.io/docling-project/docling-serve-cu124`](https://github.com/docling-project/docling-serve/pkgs/container/docling-serve-cu124)
[`quay.io/docling-project/docling-serve-cu124`](https://quay.io/repository/docling-project/docling-serve-cu124) | Cuda 12.4 image which installs `torch` from the pytorch cu124 index. | `linux/amd64` | 8.7 GB | -async_client = httpx.AsyncClient(timeout=60.0) -url = "http://localhost:5001/v1alpha/convert/source" -payload = { - "options": { - "from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"], - "to_formats": ["md", "json", "html", "text", "doctags"], - "image_export_mode": "placeholder", - "do_ocr": True, - "force_ocr": False, - "ocr_engine": "easyocr", - "ocr_lang": "en", - "pdf_backend": "dlparse_v2", - "table_mode": "fast", - "abort_on_error": False, - "return_as_file": False, - }, - "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}] -} +Coming soon: `docling-serve-slim` images will reduce the size by skipping the model weights download. -response = await async_client_client.post(url, json=payload) - -data = response.json() -``` - -
- -#### File as base64 - -The `file_sources` argument in the endpoint allows to send files as base64-encoded strings. -When your PDF or other file type is too large, encoding it and passing it inline to curl -can lead to an “Argument list too long” error on some systems. To avoid this, we write -the JSON request body to a file and have curl read from that file. - -
-CURL steps: - -```sh -# 1. Base64-encode the file -B64_DATA=$(base64 -w 0 /path/to/file/pdf-to-convert.pdf) - -# 2. Build the JSON with your options -cat <<EOF > /tmp/request_body.json -{ - "options": { - }, - "file_sources": [{ - "base64_string": "${B64_DATA}", - "filename": "pdf-to-convert.pdf" - }] -} -EOF - -# 3. POST the request to the docling service -curl -X POST "localhost:5001/v1alpha/convert/source" \ - -H "Content-Type: application/json" \ - -d @/tmp/request_body.json -``` - -
- -### File endpoint - -The endpoint is: `/v1alpha/convert/file`, listening for POST requests of Form payloads (necessary as the files are sent as multipart/form data). You can send one or multiple files. - -
-CURL example: - -```sh -curl -X 'POST' \ - 'http://127.0.0.1:5001/v1alpha/convert/file' \ - -H 'accept: application/json' \ - -H 'Content-Type: multipart/form-data' \ - -F 'ocr_engine=easyocr' \ - -F 'pdf_backend=dlparse_v2' \ - -F 'from_formats=pdf' \ - -F 'from_formats=docx' \ - -F 'force_ocr=false' \ - -F 'image_export_mode=embedded' \ - -F 'ocr_lang=en' \ - -F 'ocr_lang=pl' \ - -F 'table_mode=fast' \ - -F 'files=@2206.01062v1.pdf;type=application/pdf' \ - -F 'abort_on_error=false' \ - -F 'to_formats=md' \ - -F 'to_formats=text' \ - -F 'return_as_file=false' \ - -F 'do_ocr=true' -``` - -
- -
-Python example: - -```python -import httpx - -async_client = httpx.AsyncClient(timeout=60.0) -url = "http://localhost:5001/v1alpha/convert/file" -parameters = { -"from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"], -"to_formats": ["md", "json", "html", "text", "doctags"], -"image_export_mode": "placeholder", -"do_ocr": True, -"force_ocr": False, -"ocr_engine": "easyocr", -"ocr_lang": ["en"], -"pdf_backend": "dlparse_v2", -"table_mode": "fast", -"abort_on_error": False, -"return_as_file": False -} - -current_dir = os.path.dirname(__file__) -file_path = os.path.join(current_dir, '2206.01062v1.pdf') - -files = { - 'files': ('2206.01062v1.pdf', open(file_path, 'rb'), 'application/pdf'), -} - -response = await async_client.post(url, files=files, data={"parameters": json.dumps(parameters)}) -assert response.status_code == 200, "Response should be 200 OK" - -data = response.json() -``` - -
- -### Response format - -The response can be a JSON Document or a File. - -- If you process only one file, the response will be a JSON document with the following format: - - ```jsonc - { - "document": { - "md_content": "", - "json_content": {}, - "html_content": "", - "text_content": "", - "doctags_content": "" - }, - "status": "", - "processing_time": 0.0, - "timings": {}, - "errors": [] - } - ``` - - Depending on the value you set in `output_formats`, the different items will be populated with their respective results or empty. - - `processing_time` is the Docling processing time in seconds, and `timings` (when enabled in the backend) provides the detailed - timing of all the internal Docling components. - -- If you set the parameter `return_as_file` to True, the response will be a zip file. -- If multiple files are generated (multiple inputs, or one input but multiple outputs with `return_as_file` True), the response will be a zip file. - -## Run docling-serve - -Clone the repository and run the following from within the cloned directory root. +### Demonstration UI ```bash -python -m venv venv -source venv/bin/activate +# Install the Python package with the extra dependencies pip install "docling-serve[ui]" docling-serve run --enable-ui + +# Run the container image with the extra env parameters +podman run -p 5001:5001 -e DOCLING_SERVE_ENABLE_UI=true quay.io/docling-project/docling-serve ``` -## Helpers - -- A full Swagger UI is available at the `/docs` endpoint. - -![swagger.png](img/swagger.png) - -- An easy to use UI is available at the `/ui` endpoint. +An easy to use UI is available at the `/ui` endpoint. 
![ui-input.png](img/ui-input.png) ![ui-output.png](img/ui-output.png) -## Development +## Documentation and advanced usage -### CPU only - -```sh -# Install uv if not already available -curl -LsSf https://astral.sh/uv/install.sh | sh - -# Install dependencies -uv sync --extra cpu -``` - -### Cuda GPU - -For GPU support use the following command: - -```sh -# Install dependencies -uv sync -``` - -### Gradio UI and different OCR backends - -`/ui` endpoint using `gradio` and different OCR backends can be enabled via package extras: - -```sh -# Enable ui and rapidocr -uv sync --extra ui --extra rapidocr -``` - -```sh -# Enable tesserocr -uv sync --extra tesserocr -``` - -See `[project.optional-dependencies]` section in `pyproject.toml` for full list of options and runtime options with `uv run docling-serve --help`. - -### Run the server - -The `docling-serve` executable is a convenient script for launching the webserver both in -development and production mode. - -```sh -# Run the server in development mode -# - reload is enabled by default -# - listening on the 127.0.0.1 address -# - ui is enabled by default -docling-serve dev - -# Run the server in production mode -# - reload is disabled by default -# - listening on the 0.0.0.0 address -# - ui is disabled by default -docling-serve run -``` - -### Options - -The `docling-serve` executable allows is controlled with both command line -options and environment variables. - -
-`docling-serve` help message - -```sh -$ docling-serve dev --help - - Usage: docling-serve dev [OPTIONS] - - Run a Docling Serve app in development mode. 🧪 - This is equivalent to docling-serve run but with reload - enabled and listening on the 127.0.0.1 address. - - Options can be set also with the corresponding ENV variable, with the exception - of --enable-ui, --host and --reload. - -╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────╮ -│ --host TEXT The host to serve on. For local development in localhost │ -│ use 127.0.0.1. To enable public access, e.g. in a │ -│ container, use all the IP addresses available with │ -│ 0.0.0.0. │ -│ [default: 127.0.0.1] │ -│ --port INTEGER The port to serve on. [default: 5001] │ -│ --reload --no-reload Enable auto-reload of the server when (code) files │ -│ change. This is resource intensive, use it only during │ -│ development. │ -│ [default: reload] │ -│ --root-path TEXT The root path is used to tell your app that it is being │ -│ served to the outside world with some path prefix set up │ -│ in some termination proxy or similar. │ -│ --proxy-headers --no-proxy-headers Enable/Disable X-Forwarded-Proto, X-Forwarded-For, │ -│ X-Forwarded-Port to populate remote address info. │ -│ [default: proxy-headers] │ -│ --artifacts-path PATH If set to a valid directory, the model weights will be │ -│ loaded from this path. │ -│ [default: None] │ -│ --enable-ui --no-enable-ui Enable the development UI. [default: enable-ui] │ -│ --help Show this message and exit. 
│ -╰────────────────────────────────────────────────────────────────────────────────────────────────────╯ -``` - -
- -#### Environment variables - -The environment variables controlling the `uvicorn` execution can be specified with the `UVICORN_` prefix: - -- `UVICORN_WORKERS`: Number of workers to use. -- `UVICORN_RELOAD`: If `True`, this will enable auto-reload when you modify files, useful for development. - -The environment variables controlling specifics of the Docling Serve app can be specified with the -`DOCLING_SERVE_` prefix: - -- `DOCLING_SERVE_ARTIFACTS_PATH`: if set Docling will use only the local weights of models, for example `/opt/app-root/src/.cache/docling/models`. -- `DOCLING_SERVE_ENABLE_UI`: If `True`, The Gradio UI will be available at `/ui`. - -Others: - -- `TESSDATA_PREFIX`: Tesseract data location, example `/usr/share/tesseract/tessdata/`. +Visit the [Docling Serve documentation](./docs/README.md) to learn how to [configure the webserver](./docs/configuration.md), use all the [runtime options](./docs/usage.md) of the API, and find [deployment examples](./docs/deployment.md). ## Get help and support @@ -433,14 +80,14 @@ If you use Docling in your projects, please consider citing the following: ```bib @techreport{Docling, - author = {Deep Search Team}, - month = {8}, - title = {Docling Technical Report}, - url = {https://arxiv.org/abs/2408.09869}, - eprint = {2408.09869}, - doi = {10.48550/arXiv.2408.09869}, - version = {1.0.0}, - year = {2024} + author = {Docling Contributors}, + month = {1}, + title = {Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion}, + url = {https://arxiv.org/abs/2501.17887}, + eprint = {2501.17887}, + doi = {10.48550/arXiv.2501.17887}, + version = {2.0.0}, + year = {2025} } ``` diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..50600f3 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,8 @@ +# Docling Serve documentation + +These documentation pages cover the webserver configuration, runtime options, deployment examples, and development best practices. 
+ +- [Configuration](./configuration.md) +- [Advanced usage](./usage.md) +- [Deployment](./deployment.md) +- [Development](./development.md) diff --git a/docs/configuration.md b/docs/configuration.md new file mode 100644 index 0000000..e375794 --- /dev/null +++ b/docs/configuration.md @@ -0,0 +1,40 @@ +# Configuration + +The `docling-serve` executable lets you configure the server via command line +options as well as environment variables. +Configurations are divided between the settings used for the `uvicorn` ASGI +server and the actual app-specific configurations. + + > [!WARNING] +> When the server is running with `reload` or with multiple `workers`, uvicorn +> will spawn multiple subprocesses. This invalidates all the values configured +> via the CLI command line options. Please use environment variables in this +> type of deployment. + +## Webserver configuration + +The following table shows the options which are propagated directly to the +`uvicorn` webserver runtime. + +| CLI option | ENV | Default | Description | +| -----------|-----|---------|-------------| +| `--host` | `UVICORN_HOST` | `0.0.0.0` for `run`, `localhost` for `dev` | The host to serve on. | +| `--port` | `UVICORN_PORT` | `5001` | The port to serve on. | +| `--reload` | `UVICORN_RELOAD` | `false` for `run`, `true` for `dev` | Enable auto-reload of the server when (code) files change. | +| `--workers` | `UVICORN_WORKERS` | `1` | Use multiple worker processes. | +| `--root-path` | `UVICORN_ROOT_PATH` | `""` | The root path is used to tell your app that it is being served to the outside world with some | +| `--proxy-headers` | `UVICORN_PROXY_HEADERS` | `true` | Enable/Disable X-Forwarded-Proto, X-Forwarded-For, X-Forwarded-Port to populate remote address info. | +| `--timeout-keep-alive` | `UVICORN_TIMEOUT_KEEP_ALIVE` | `60` | Timeout for the server response. | + +## Docling Serve configuration + +The following table describes the options to configure the Docling Serve app. 
+ +| CLI option | ENV | Default | Description | +| -----------|-----|---------|-------------| +| `--artifacts-path` | `DOCLING_SERVE_ARTIFACTS_PATH` | unset | If set to a valid directory, the model weights will be loaded from this path. | +| `--enable-ui` | `DOCLING_SERVE_ENABLE_UI` | `false` | Enable the demonstrator UI. | +| | `DOCLING_SERVE_OPTIONS_CACHE_SIZE` | `2` | How many DocumentConverter objects (including their loaded models) to keep in the cache. | +| | `DOCLING_SERVE_CORS_ORIGINS` | `["*"]` | A list of origins that should be permitted to make cross-origin requests. | +| | `DOCLING_SERVE_CORS_METHODS` | `["*"]` | A list of HTTP methods that should be allowed for cross-origin requests. | +| | `DOCLING_SERVE_CORS_HEADERS` | `["*"]` | A list of HTTP request headers that should be supported for cross-origin requests. | diff --git a/docs/deployment.md b/docs/deployment.md new file mode 100644 index 0000000..98a8c4e --- /dev/null +++ b/docs/deployment.md @@ -0,0 +1,12 @@ +# Deployment + +## Kubernetes and OpenShift + +### Knative + +The following manifest will launch Docling Serve using Knative to expose the application +with an external ingress endpoint. 
+ +```yaml +# TODO +``` diff --git a/docs/development.md b/docs/development.md new file mode 100644 index 0000000..572fc09 --- /dev/null +++ b/docs/development.md @@ -0,0 +1,57 @@ +# Development + +## Install dependencies + +### CPU only + +```sh +# Install uv if not already available +curl -LsSf https://astral.sh/uv/install.sh | sh + +# Install dependencies +uv sync --extra cpu +``` + +### Cuda GPU + +For GPU support use the following command: + +```sh +# Install dependencies +uv sync +``` + +### Gradio UI and different OCR backends + +The `/ui` endpoint (using `gradio`) and different OCR backends can be enabled via package extras: + +```sh +# Enable ui and rapidocr +uv sync --extra ui --extra rapidocr +``` + +```sh +# Enable tesserocr +uv sync --extra tesserocr +``` + +See the `[project.optional-dependencies]` section in `pyproject.toml` for the full list of extras, and run `uv run docling-serve --help` for the runtime options. + +### Run the server + +The `docling-serve` executable is a convenient script for launching the webserver both in +development and production mode. + +```sh +# Run the server in development mode +# - reload is enabled by default +# - listening on the 127.0.0.1 address +# - ui is enabled by default +docling-serve dev + +# Run the server in production mode +# - reload is disabled by default +# - listening on the 0.0.0.0 address +# - ui is disabled by default +docling-serve run +``` diff --git a/docs/usage.md b/docs/usage.md new file mode 100644 index 0000000..0740539 --- /dev/null +++ b/docs/usage.md @@ -0,0 +1,279 @@ +# Usage + +The API provides two endpoints: one for URLs, one for files. This is necessary to send files directly in binary format instead of base64-encoded strings. + +## Common parameters + +In addition to the document source (see below), both endpoints support the same parameters, which are almost the same as the Docling CLI. + +- `from_formats` (List[str]): Input format(s) to convert from. 
Allowed values: `docx`, `pptx`, `html`, `image`, `pdf`, `asciidoc`, `md`. Defaults to all formats. +- `to_formats` (List[str]): Output format(s) to convert to. Allowed values: `md`, `json`, `html`, `text`, `doctags`. Defaults to `md`. +- `do_ocr` (bool): If enabled, the bitmap content will be processed using OCR. Defaults to `True`. +- `image_export_mode`: Image export mode for the document (only in case of JSON, Markdown or HTML). Allowed values: embedded, placeholder, referenced. Optional, defaults to `embedded`. +- `force_ocr` (bool): If enabled, replace any existing text with OCR-generated text over the full content. Defaults to `False`. +- `ocr_engine` (str): OCR engine to use. Allowed values: `easyocr`, `tesseract_cli`, `tesseract`, `rapidocr`, `ocrmac`. Defaults to `easyocr`. +- `ocr_lang` (List[str]): List of languages used by the OCR engine. Note that each OCR engine has different values for the language names. Defaults to empty. +- `pdf_backend` (str): PDF backend to use. Allowed values: `pypdfium2`, `dlparse_v1`, `dlparse_v2`. Defaults to `dlparse_v2`. +- `table_mode` (str): Table mode to use. Allowed values: `fast`, `accurate`. Defaults to `fast`. +- `abort_on_error` (bool): If enabled, abort on error. Defaults to false. +- `return_as_file` (bool): If enabled, return the output as a file. Defaults to false. +- `do_table_structure` (bool): If enabled, the table structure will be extracted. Defaults to true. +- `include_images` (bool): If enabled, images will be extracted from the document. Defaults to true. +- `images_scale` (float): Scale factor for images. Defaults to 2.0. + +## Convert endpoints + +### Source endpoint + +The endpoint is `/v1alpha/convert/source`, listening for POST requests of JSON payloads. + +On top of the above parameters, you must send the URL(s) of the document you want to process with either the `http_sources` or `file_sources` fields. 
+The first fetches the given URL(s) (optionally with extra headers); the second lets you provide documents as base64-encoded strings. +The `options` field is not required; its entries can be partially or completely omitted. + +Simple payload example: + +```json +{ + "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}] +} +``` + +
+ +Complete payload example: + +```json +{ + "options": { + "from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"], + "to_formats": ["md", "json", "html", "text", "doctags"], + "image_export_mode": "placeholder", + "do_ocr": true, + "force_ocr": false, + "ocr_engine": "easyocr", + "ocr_lang": ["en"], + "pdf_backend": "dlparse_v2", + "table_mode": "fast", + "abort_on_error": false, + "return_as_file": false + }, + "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}] +} +``` + +
+ +
+ +CURL example: + +```sh +curl -X 'POST' \ + 'http://localhost:5001/v1alpha/convert/source' \ + -H 'accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{ + "options": { + "from_formats": [ + "docx", + "pptx", + "html", + "image", + "pdf", + "asciidoc", + "md", + "xlsx" + ], + "to_formats": ["md", "json", "html", "text", "doctags"], + "image_export_mode": "placeholder", + "do_ocr": true, + "force_ocr": false, + "ocr_engine": "easyocr", + "ocr_lang": [ + "fr", + "de", + "es", + "en" + ], + "pdf_backend": "dlparse_v2", + "table_mode": "fast", + "abort_on_error": false, + "return_as_file": false, + "do_table_structure": true, + "include_images": true, + "images_scale": 2 + }, + "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}] +}' +``` + +
+ +
+Python example: + +```python +import httpx + +async_client = httpx.AsyncClient(timeout=60.0) +url = "http://localhost:5001/v1alpha/convert/source" +payload = { + "options": { + "from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"], + "to_formats": ["md", "json", "html", "text", "doctags"], + "image_export_mode": "placeholder", + "do_ocr": True, + "force_ocr": False, + "ocr_engine": "easyocr", + "ocr_lang": ["en"], + "pdf_backend": "dlparse_v2", + "table_mode": "fast", + "abort_on_error": False, + "return_as_file": False, + }, + "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}] +} + +response = await async_client.post(url, json=payload) + +data = response.json() +``` + +
+ +#### File as base64 + +The `file_sources` argument in the endpoint allows to send files as base64-encoded strings. +When your PDF or other file type is too large, encoding it and passing it inline to curl +can lead to an “Argument list too long” error on some systems. To avoid this, we write +the JSON request body to a file and have curl read from that file. + +
+CURL steps: + +```sh +# 1. Base64-encode the file +B64_DATA=$(base64 -w 0 /path/to/file/pdf-to-convert.pdf) + +# 2. Build the JSON with your options +cat <<EOF > /tmp/request_body.json +{ + "options": { + }, + "file_sources": [{ + "base64_string": "${B64_DATA}", + "filename": "pdf-to-convert.pdf" + }] +} +EOF + +# 3. POST the request to the docling service +curl -X POST "localhost:5001/v1alpha/convert/source" \ + -H "Content-Type: application/json" \ + -d @/tmp/request_body.json +``` + +
+ +### File endpoint + +The endpoint is: `/v1alpha/convert/file`, listening for POST requests of Form payloads (necessary as the files are sent as multipart/form data). You can send one or multiple files. + +
+CURL example: + +```sh +curl -X 'POST' \ + 'http://127.0.0.1:5001/v1alpha/convert/file' \ + -H 'accept: application/json' \ + -H 'Content-Type: multipart/form-data' \ + -F 'ocr_engine=easyocr' \ + -F 'pdf_backend=dlparse_v2' \ + -F 'from_formats=pdf' \ + -F 'from_formats=docx' \ + -F 'force_ocr=false' \ + -F 'image_export_mode=embedded' \ + -F 'ocr_lang=en' \ + -F 'ocr_lang=pl' \ + -F 'table_mode=fast' \ + -F 'files=@2206.01062v1.pdf;type=application/pdf' \ + -F 'abort_on_error=false' \ + -F 'to_formats=md' \ + -F 'to_formats=text' \ + -F 'return_as_file=false' \ + -F 'do_ocr=true' +``` + +
+ +
+Python example: + +```python +import httpx + +async_client = httpx.AsyncClient(timeout=60.0) +url = "http://localhost:5001/v1alpha/convert/file" +parameters = { +"from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"], +"to_formats": ["md", "json", "html", "text", "doctags"], +"image_export_mode": "placeholder", +"do_ocr": True, +"force_ocr": False, +"ocr_engine": "easyocr", +"ocr_lang": ["en"], +"pdf_backend": "dlparse_v2", +"table_mode": "fast", +"abort_on_error": False, +"return_as_file": False +} + +current_dir = os.path.dirname(__file__) +file_path = os.path.join(current_dir, '2206.01062v1.pdf') + +files = { + 'files': ('2206.01062v1.pdf', open(file_path, 'rb'), 'application/pdf'), +} + +response = await async_client.post(url, files=files, data={"parameters": json.dumps(parameters)}) +assert response.status_code == 200, "Response should be 200 OK" + +data = response.json() +``` + +
+ +## Response format + +The response can be a JSON Document or a File. + +- If you process only one file, the response will be a JSON document with the following format: + + ```jsonc + { + "document": { + "md_content": "", + "json_content": {}, + "html_content": "", + "text_content": "", + "doctags_content": "" + }, + "status": "", + "processing_time": 0.0, + "timings": {}, + "errors": [] + } + ``` + + Depending on the value you set in `to_formats`, the corresponding items will be populated with their respective results; the others are left empty. + + `processing_time` is the Docling processing time in seconds, and `timings` (when enabled in the backend) provides the detailed + timing of all the internal Docling components. + +- If you set the parameter `return_as_file` to True, the response will be a zip file. +- If multiple files are generated (multiple inputs, or one input but multiple outputs with `return_as_file` True), the response will be a zip file. + +## Asynchronous API + +TBA
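
Reviewer's sketch (not part of the patch): the request and response contract that `docs/usage.md` documents for `/v1alpha/convert/source` could be exercised roughly as below. It only uses the field names shown in the diff (`options`, `to_formats`, `http_sources`, and the `*_content` keys of `document`); the helper names `build_source_request` and `populated_outputs` are hypothetical, and the standard-library `urllib` is an assumption made to keep the sketch dependency-free, where the docs themselves use `httpx`.

```python
import json
import urllib.request

# Output format name -> key inside the "document" object of the JSON response,
# as listed in the "Response format" section of docs/usage.md.
CONTENT_KEYS = {
    "md": "md_content",
    "json": "json_content",
    "html": "html_content",
    "text": "text_content",
    "doctags": "doctags_content",
}


def build_source_request(base_url, urls, to_formats=("md",)):
    """Build (but do not send) a POST request for /v1alpha/convert/source."""
    payload = {
        "options": {"to_formats": list(to_formats)},
        "http_sources": [{"url": u} for u in urls],
    }
    return urllib.request.Request(
        f"{base_url}/v1alpha/convert/source",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "accept": "application/json"},
        method="POST",
    )


def populated_outputs(response_body):
    """Return {format: content} for the document entries that are non-empty."""
    document = response_body.get("document", {})
    return {
        fmt: document[key]
        for fmt, key in CONTENT_KEYS.items()
        if document.get(key)
    }
```

Against a running server, the request would be sent with `urllib.request.urlopen(req, timeout=60)` and the body parsed with `json.load`. `populated_outputs` only applies to JSON responses; when `return_as_file` is set or several files are produced, the response is a zip file and needs different handling.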