In the age of Artificial Intelligence (AI) and Machine Learning (ML), data is the new gold. However, much of this valuable data is trapped in diverse document formats, which makes it difficult to access and analyze. This is where a robust Python document parser becomes indispensable. For developers, researchers, and AI enthusiasts who want to use Large Language Models (LLMs) and generative AI, extracting clean, structured data from documents is essential. This article introduces Docling, a versatile Python document parser that simplifies this process and integrates with leading AI frameworks. It walks through setup, a short tutorial, and Docling's main capabilities.
Why a Robust Document Parser is Essential for Gen AI
Large Language Models (LLMs), such as GPT, Claude, and Gemini, have revolutionized many fields. However, these models work best with clean, structured input. Unstructured data, such as that found in PDFs, Word documents, and presentations, presents a challenge. A reliable Python-based document parser is the key to unlocking these documents for Gen AI applications. Without an efficient way to extract and organize information, the full power of LLMs remains untapped. Docling addresses these challenges with advanced features for understanding various document formats and transforming them into structured JSON. Structured JSON is generally much easier for an LLM pipeline to consume than a raw file format such as an Excel workbook, which in turn makes it easier for developers to build applications on top of these documents.
Introducing Docling: The Versatile “Python Document Parser”

- Multi-Format Support: Docling supports a wide array of popular document formats, including PDF, DOCX, PPTX, XLSX, images, HTML, AsciiDoc, and Markdown.
- Versatile Export Options: It allows you to export the parsed content into JSON, HTML, and Markdown. JSON export retains detailed metadata, making it ideal for further processing by AI models.
- Advanced PDF Understanding: Docling employs AI to understand page layout, reading order, and table structures within PDF documents accurately, including OCR for scanned PDFs.
- Seamless Integration: Docling integrates with LlamaIndex and LangChain, allowing you to build Retrieval-Augmented Generation (RAG) and question-answering (QA) applications.
- Simple Interface: Docling offers a simple command-line interface for those who prefer working in the terminal, alongside the Python API.
- Free and Open Source: Docling is released as an open-source library, so it is available to everyone at no cost.
Parse PDF and export into JSON
Docling accurately parses PDF documents and exports the content into structured JSON format. This includes not only text but also metadata about layouts, tables, and figures.
Parse XLSX and export into JSON
The tool extracts data from XLSX files, preserving the table structure, and exports the extracted information into a JSON format, making it easy to process.
Parse PPTX and export into JSON
Docling converts the text and metadata from PPTX files and exports them to JSON, capturing presentation slides and related content in the desired output format.
Parse PDF and export into HTML and Markdown
Users can export PDFs to well-formatted HTML and Markdown documents. While JSON retains more metadata, these options are useful for display purposes.
Parse XLSX and export into HTML and Markdown
Similarly, XLSX files can be converted to HTML and Markdown, with tables and text preserved for web display.
Parse PPTX and export into HTML and Markdown
Docling enables you to convert PPTX files to HTML and Markdown. It retains the content and layout of the presentation slides, which makes it useful for content display.
Getting Started with Docling: A Quick Tutorial
Here is a quick guide on how to get started with Docling:
Installation
To begin, you must install Docling using pip:
pip install docling
Docling is compatible with macOS, Linux, and Windows environments, supporting both x86_64 and arm64 architectures.
Basic Usage
Here’s an example of using Docling’s Python API:
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # Document URL or local path
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown()) # Output: "## Docling Technical Report[...]"
This code snippet downloads the Docling technical report from arXiv, converts it, and prints the result in Markdown format.
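The JSON, HTML, and Markdown exports described earlier can all be produced from a single conversion. The following is a minimal sketch, not an official recipe: `export_all` and `convert_and_export` are hypothetical helper names introduced for illustration, and the sketch assumes the converted document offers `export_to_dict()` and `export_to_html()` alongside the `export_to_markdown()` call shown above.

```python
import json
from pathlib import Path


def export_all(document, out_dir, stem):
    """Write JSON, HTML, and Markdown exports of a converted document.

    `document` is assumed to expose export_to_dict(), export_to_html(),
    and export_to_markdown(), as Docling's `result.document` is expected to.
    Returns a dict mapping format name to the written file path.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {
        "json": out / f"{stem}.json",
        "html": out / f"{stem}.html",
        "md": out / f"{stem}.md",
    }
    # JSON keeps the detailed structure and metadata; HTML/Markdown are for display.
    paths["json"].write_text(
        json.dumps(document.export_to_dict(), indent=2), encoding="utf-8"
    )
    paths["html"].write_text(document.export_to_html(), encoding="utf-8")
    paths["md"].write_text(document.export_to_markdown(), encoding="utf-8")
    return paths


def convert_and_export(source, out_dir, stem="document"):
    """Convert `source` (a URL or local path) and write all three exports."""
    # Imported lazily so export_all can be reused without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return export_all(result.document, out_dir, stem)
```

Calling `convert_and_export("input.pdf", "out")` would be expected to leave `document.json`, `document.html`, and `document.md` in the `out` directory; the same call should apply to DOCX, PPTX, or XLSX inputs.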
CLI Usage
For command-line users, Docling offers a straightforward interface:
docling input.pdf --to json --output ./output
The command converts the input file and writes the result, in the format selected with --to, into the directory given by --output.
For more advanced options such as document chunking, custom pipelines and more, please refer to the official documentation.
Document parser with 🦙 LlamaIndex & 🦜🔗 LangChain
Here is an example of how to use Docling with LlamaIndex. It assumes the llama-index and llama-index-readers-docling packages are installed and that an embedding model and LLM are configured for LlamaIndex.
from llama_index.core import VectorStoreIndex
from llama_index.readers.docling import DoclingReader
# Configure the Docling reader; it creates a default DocumentConverter internally.
reader = DoclingReader()
# Load the document from a local path or URL.
documents = reader.load_data("https://arxiv.org/pdf/2408.09869")
# Build the LlamaIndex vector index
index = VectorStoreIndex.from_documents(documents)
# Query
query_engine = index.as_query_engine()
response = query_engine.query("What is Docling?")
print(response)
Advanced Features and Use Cases of Docling
Docling offers more than simple document conversion. It supports document chunking, which is crucial for RAG applications: splitting a document into smaller, context-aware chunks lets an LLM retrieve and query the relevant passages more accurately and efficiently. Docling can also be used in both local and cloud environments, making it suitable for a wide array of applications.
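As a rough illustration of chunking, the sketch below previews the first few chunks of a converted document. It assumes that `docling.chunking` exposes a `HierarchicalChunker` whose `chunk(dl_doc=...)` method yields objects with a `.text` attribute; `preview_chunks` and `chunk_source` are hypothetical helper names, not part of Docling.

```python
from itertools import islice


def preview_chunks(chunker, document, limit=3, width=80):
    """Return the first `limit` chunk texts, whitespace-normalized and
    truncated to `width` characters each."""
    previews = []
    for chunk in islice(chunker.chunk(dl_doc=document), limit):
        text = " ".join(chunk.text.split())  # collapse newlines and repeated spaces
        previews.append(text[:width])
    return previews


def chunk_source(source, limit=3):
    """Convert `source` with Docling and preview its first chunks."""
    # Lazy imports: these module paths are assumptions about Docling's layout.
    from docling.chunking import HierarchicalChunker
    from docling.document_converter import DocumentConverter

    document = DocumentConverter().convert(source).document
    return preview_chunks(HierarchicalChunker(), document, limit=limit)
```

In a RAG pipeline, each chunk's text would typically be embedded and stored in a vector index rather than printed; the preview is only meant to show what the chunker emits.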
Docling runs on both x86 and ARM architectures, making it flexible for various hardware configurations.
It also supports customizable document processing pipelines and different model configurations. This helps users to tailor its performance to their own specific use cases and requirements.
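A customized pipeline might look like the following sketch. It assumes Docling's `PdfPipelineOptions` (with `do_ocr` and `do_table_structure` switches), `PdfFormatOption`, and `InputFormat` are importable from the module paths shown; `build_pdf_converter` is a hypothetical helper name used only for illustration.

```python
def build_pdf_converter(*, do_ocr=True, do_table_structure=True):
    """Return a DocumentConverter whose PDF pipeline uses the given switches."""
    # Lazy imports: the module paths below are assumptions about Docling's layout.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = do_ocr                            # OCR for scanned pages
    options.do_table_structure = do_table_structure    # table structure recognition

    # Apply the custom options to PDF inputs only; other formats keep their defaults.
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
```

For example, `build_pdf_converter(do_ocr=False).convert("input.pdf")` would skip OCR, which may be a reasonable trade-off for born-digital PDFs that already contain a text layer.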
Performance and Benchmarking of the Docling Document Parser
Docling’s performance has been tested on several systems. On an x86 CPU, Docling processes a page in about 3.1 seconds, while on an M3 Max SoC it takes around 1.27 seconds per page. With an Nvidia L4 GPU, Docling processes a page in about 0.49 seconds. The major contributor to the overall processing time is the OCR functionality. These results show that enabling the GPU significantly accelerates processing.
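To put the per-page timings in perspective, the short calculation below converts them into throughput, speedup relative to the x86 CPU, and an estimated time for a 1,000-page batch. It uses only the figures quoted above; the 1,000-page workload is an illustrative assumption, and real timings will vary with document content and OCR usage.

```python
# Seconds per page, as quoted in the benchmark figures above.
SECONDS_PER_PAGE = {
    "x86 CPU": 3.1,
    "M3 Max SoC": 1.27,
    "Nvidia L4 GPU": 0.49,
}


def pages_per_minute(seconds_per_page):
    """Throughput implied by a per-page processing time."""
    return 60.0 / seconds_per_page


def speedup(baseline_seconds, other_seconds):
    """How many times faster `other` is than `baseline`."""
    return baseline_seconds / other_seconds


def estimate_minutes(pages, seconds_per_page):
    """Estimated wall-clock minutes to process `pages` pages sequentially."""
    return pages * seconds_per_page / 60.0


baseline = SECONDS_PER_PAGE["x86 CPU"]
for name, spp in SECONDS_PER_PAGE.items():
    print(
        f"{name}: {pages_per_minute(spp):.1f} pages/min, "
        f"{speedup(baseline, spp):.2f}x vs x86 CPU, "
        f"{estimate_minutes(1000, spp):.1f} min per 1,000 pages"
    )
```

On these numbers, the L4 GPU works out to roughly 6.3 times the x86 CPU throughput, and the M3 Max to roughly 2.4 times.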
Docling stands out as a significantly faster solution compared to many of its open-source alternatives, making it a strong choice for document processing tasks.
How Docling Differs from Other Python Document Parsers
Docling differentiates itself from other document parsers through its combination of accuracy, speed, and accessibility. Its permissive MIT license allows organizations to integrate Docling freely into their solutions without incurring licensing fees. Its models are designed to be accurate, resource-efficient, and fast. As a cost-effective, open-source library, Docling offers reliable document conversion that does not introduce hallucinated content.
Ecosystem and Integrations
Docling is quickly becoming a mainstream package for document conversion, with several integrations provided by the Docling team and a growing community. Its native integration with LlamaIndex and LangChain lets it support a range of generative AI applications. Docling is integrated with the open IBM data-prep-kit, which enables large-scale, multi-modal data processing. It is also integrated with InstructLab, where it supports the enhancement of the knowledge taxonomy and the fine-tuning of LLMs.
Docling is a system package in Red Hat® Enterprise Linux® AI distribution that helps develop, test, and run the Granite family of large language models.
Future Development
The Docling team is continuously working on improving the project. Future updates will include:
- Equation and Code Extraction: Further enhancing Docling’s capabilities to extract equations and code snippets accurately.
- Metadata Extraction: Improving metadata extraction, including titles, authors, references, and language.
- Native LangChain Extension: Enhancing its native LangChain extension for better workflow.
Conclusion
Docling stands out as an exceptional Python document parser that addresses the challenges of extracting data from various document types. With its strong feature set, high performance, and integration capabilities, Docling is a valuable tool for developers, researchers, and companies working with Gen AI and other data processing tasks. Its ease of setup and comprehensive features make it well worth adding to your toolkit.
Now is the perfect time to explore Docling and contribute to its growth. I highly encourage you to leverage its features, refer to its documentation, and share your feedback within the community.