Python Document Parser: Guide with Docling Tutorial and Setup

In the age of Artificial Intelligence (AI) and Machine Learning (ML), data is the new gold. However, much of this valuable data lies trapped within diverse document formats, making it difficult to access and analyze. This is where a robust Python document parser becomes indispensable. For developers, researchers, and AI enthusiasts who want to use Large Language Models (LLMs) and generative AI, the ability to extract clean, structured data from documents is paramount. This article introduces Docling, a versatile Python document parser that simplifies this process and integrates with leading AI frameworks. We will walk through setup and a short tutorial, and showcase what Docling can do.

Why a Robust Document Parser is Essential for Gen AI

Large Language Models (LLMs), such as GPT, Claude, and Gemini, have revolutionized many fields. However, these models work best with clean, structured input. Unstructured data, such as that found in PDFs, Word documents, and presentations, presents a challenge. A reliable Python-based document parser is the key to unlocking these documents for Gen AI applications. Without an efficient way to extract and organize information, the full power of LLMs remains untapped. Docling addresses these challenges by providing advanced features for understanding various document formats and transforming them into structured JSON. An LLM pipeline can process such JSON far more readily than a binary format such as an Excel workbook, which gives developers a direct way to solve this problem.

Introducing Docling: The Versatile “Python Document Parser”

Docling: Multi Document Parser
  • Multi-Format Support: Docling supports a wide array of popular document formats, including PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc, and Markdown.
  • Versatile Export Options: It allows you to export the parsed content into JSON, HTML, and Markdown. JSON export retains detailed metadata, making it ideal for further processing by AI models.
  • Advanced PDF Understanding: Docling employs AI to understand page layout, reading order, and table structures within PDF documents accurately, including OCR for scanned PDFs.
  • Seamless Integration: Docling seamlessly integrates with LlamaIndex and LangChain, allowing you to create powerful Retrieval Augmented Generation (RAG) and QA applications.
  • Simple Interface: Docling offers a simple command-line interface for those who prefer working in the terminal, alongside the Python API.
  • Free and Open Source: Docling is released as a completely open-source library, making it accessible to everyone at no cost.

Parse PDF and export into JSON

Docling accurately parses PDF documents and exports the content into structured JSON format. This includes not only text but also metadata about layouts, tables, and figures.

Parse XLSX and export into JSON

The tool extracts data from XLSX files, preserving the table structure, and exports the extracted information into a JSON format, making it easy to process.

Parse PPTX and export into JSON

Docling converts the text and metadata from PPTX files and exports them to JSON, capturing presentation slides and related content in the desired output format.
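
The three JSON conversions above follow the same pattern in the Python API. The sketch below is a minimal illustration rather than an official recipe: `convert_to_json` and its `converter` parameter are hypothetical names introduced here, and the code assumes the converted document offers an `export_to_dict()` method.

```python
import json
from pathlib import Path


def convert_to_json(source: str, out_path: str, converter=None) -> Path:
    """Convert a PDF, XLSX, or PPTX file and write the result as JSON.

    `converter` is any object with a Docling-style `convert(source)` method.
    When omitted, a real DocumentConverter is created (needs `pip install docling`).
    """
    if converter is None:
        # Imported lazily so the helper can be exercised without Docling installed.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    result = converter.convert(source)
    # Assumption: the document exposes its content and metadata as a nested dict.
    data = result.document.export_to_dict()

    target = Path(out_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    return target
```

A call such as `convert_to_json("report.pdf", "out/report.json")` would then handle a PDF, and the same call with an `.xlsx` or `.pptx` path would handle the other two formats.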

Parse PDF and export into HTML and Markdown

Users can export PDFs to well-formatted HTML and Markdown documents. While JSON retains more metadata, these options are useful for display purposes.

Parse XLSX and export into HTML and Markdown

Similarly, XLSX files can be converted to HTML and Markdown, with tables and text preserved for web display.

Parse PPTX and export into HTML and Markdown

Docling enables you to convert PPTX files to HTML and Markdown. It retains the content and layout of the presentation slides, which makes it useful for content display.
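
For display-oriented output, one converted document can be written to both formats in a single step. This is a hedged sketch: `export_for_display` is a hypothetical helper, and it assumes the document object behaves like Docling's `result.document`, offering `export_to_markdown()` and `export_to_html()`.

```python
from pathlib import Path


def export_for_display(document, out_dir: str, stem: str) -> dict:
    """Write Markdown and HTML renderings of a converted document.

    `document` is expected to provide `export_to_markdown()` and
    `export_to_html()`, as a Docling document is assumed to do.
    Returns a mapping from file suffix to the path that was written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    renderings = {
        ".md": document.export_to_markdown(),
        ".html": document.export_to_html(),
    }

    written = {}
    for suffix, text in renderings.items():
        path = out / f"{stem}{suffix}"
        path.write_text(text, encoding="utf-8")
        written[suffix] = path
    return written
```

With Docling installed, `export_for_display(DocumentConverter().convert("slides.pptx").document, "out", "slides")` would be expected to produce `out/slides.md` and `out/slides.html`.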

Getting Started with Docling: A Quick Tutorial

Here is a quick guide on how to get started with Docling.

Installation

To begin, you must install Docling using pip:

pip install docling

Docling is compatible with macOS, Linux, and Windows environments, supporting both x86_64 and arm64 architectures.

Basic Usage

Here’s an example of using Docling’s Python API:

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # Document URL or local path
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # Output: "## Docling Technical Report[...]"

This code snippet downloads the Docling technical report from arXiv, converts it, and prints the result in Markdown format.

CLI Usage

For command-line users, Docling offers a straightforward interface:

docling input.pdf --to json --output ./output

The command converts the input file and saves the result, in the requested format, to the output directory.
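
When the CLI is driven from a script, it can help to build the argument list in one place. The helper below is a hypothetical sketch that assumes the CLI accepts a source followed by `--to` (output format) and `--output` (target directory); check `docling --help` for the options your installed version supports.

```python
import shlex


def docling_cli_command(source: str, to_format: str = "json", output_dir: str = ".") -> list[str]:
    """Build the argument list for a Docling CLI call without running it."""
    # Assumed flags: --to selects the export format, --output the directory.
    return ["docling", source, "--to", to_format, "--output", output_dir]


cmd = docling_cli_command("input.pdf", to_format="json", output_dir="out")
print(shlex.join(cmd))  # prints: docling input.pdf --to json --output out
# To execute it from Python: subprocess.run(cmd, check=True)
```

Keeping the command as a list, rather than a single string, avoids shell-quoting problems with file names that contain spaces.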

For more advanced options such as document chunking, custom pipelines and more, please refer to the official documentation.

Document parser with 🦙 LlamaIndex & 🦜🔗 LangChain

Here is an example of how to use Docling with LlamaIndex.

# Assumes the LlamaIndex Docling reader is installed:
#   pip install llama-index llama-index-readers-docling
from llama_index.core import VectorStoreIndex
from llama_index.readers.docling import DoclingReader

# Configure the Docling reader to read from a given path or URL.
reader = DoclingReader()  # creates a default Docling DocumentConverter internally
documents = reader.load_data("https://arxiv.org/pdf/2408.09869")

# Build the LlamaIndex vector index.
# Note: this uses LlamaIndex's default embedding model, which needs credentials configured.
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What is Docling?")

print(response)
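
The LangChain side can be sketched in a similar way. This example is an assumption-laden illustration: it presumes the separately installed `langchain-docling` package and its `DoclingLoader` class, and `load_langchain_documents` with its `loader_cls` parameter is a hypothetical wrapper introduced here.

```python
def load_langchain_documents(source: str, loader_cls=None):
    """Load a file or URL as LangChain documents through a Docling-backed loader.

    `loader_cls` is any class that takes `file_path=` and offers `load()`.
    When omitted, DoclingLoader is imported (assumes `pip install langchain-docling`).
    """
    if loader_cls is None:
        # Imported lazily so the wrapper itself has no hard dependency.
        from langchain_docling import DoclingLoader

        loader_cls = DoclingLoader

    loader = loader_cls(file_path=source)
    return loader.load()
```

The returned documents could then be passed to a LangChain text splitter or vector store in the usual way, for example after calling `load_langchain_documents("https://arxiv.org/pdf/2408.09869")`.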

Advanced Features and Use Cases of Docling

Docling offers more than simple document conversion. It supports document chunking, which is crucial for RAG applications: by splitting documents into smaller, context-aware chunks, it lets LLMs retrieve and query content more accurately and efficiently. Docling can run in both local and cloud environments and on both x86 and ARM architectures, so it fits a wide range of hardware configurations and applications.
It also supports customizable document processing pipelines and different model configurations, which lets users tailor its performance to their specific use cases and requirements.
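
To make the idea of context-aware chunking concrete, here is a deliberately simple stand-in written in plain Python. It is not Docling's own chunker (see the official documentation for that); it only illustrates the principle by splitting Markdown output on headings, so that each chunk carries its nearest heading as context. The function name and the `max_chars` budget are hypothetical.

```python
def chunk_markdown_by_heading(markdown: str, max_chars: int = 1000) -> list[dict]:
    """Split Markdown into chunks that each record their nearest heading."""
    chunks = []
    heading = ""
    buffer = []

    def flush():
        # Emit the buffered lines as one chunk under the current heading.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"heading": heading, "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()  # close the chunk that belongs to the previous heading
            heading = line.lstrip("#").strip()
        else:
            buffered = sum(len(item) + 1 for item in buffer)
            if buffered + len(line) > max_chars:
                flush()  # keep each chunk within the size budget
            buffer.append(line)

    flush()
    return chunks
```

Feeding it the output of `result.document.export_to_markdown()` would yield a list of dictionaries that can be embedded one by one, with the heading available as retrieval context.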

Performance and Benchmarking of the Docling Document Parser

Docling’s performance has been tested on different systems, and the results are promising. On an x86 CPU, Docling processes a page in about 3.1 seconds, while on an M3 Max SoC it takes around 1.27 seconds per page. With an NVIDIA L4 GPU, Docling processes a page in about 0.49 seconds. The major contributor to the overall processing time is the OCR functionality. The results show that enabling the GPU significantly accelerates the process.
Docling stands out as a significantly faster solution compared to many of its open-source alternatives, making it a strong choice for document processing tasks.
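
The per-page figures quoted above translate directly into batch estimates. The short calculation below uses only those three numbers; the 1,000-page workload is an arbitrary example, and real throughput will vary with document content and OCR settings.

```python
# Seconds per page, as quoted in this section.
SECONDS_PER_PAGE = {
    "x86 CPU": 3.1,
    "M3 Max SoC": 1.27,
    "NVIDIA L4 GPU": 0.49,
}


def estimate_minutes(pages: int, seconds_per_page: float) -> float:
    """Estimated wall-clock minutes to process `pages` pages sequentially."""
    return pages * seconds_per_page / 60


baseline = SECONDS_PER_PAGE["x86 CPU"]
for hardware, spp in SECONDS_PER_PAGE.items():
    minutes = estimate_minutes(1000, spp)
    speedup = baseline / spp
    print(f"{hardware}: {minutes:.1f} min per 1,000 pages ({speedup:.1f}x vs x86 CPU)")
```

On these figures, 1,000 pages take roughly 52 minutes on the x86 CPU, 21 minutes on the M3 Max, and 8 minutes on the L4 GPU, which is about a 6x speedup from CPU to GPU.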

How Docling is Different from Other Python Document Parsers

Docling differentiates itself from other document parsers through its combination of accuracy, speed, and accessibility. Its permissive MIT license allows organizations to integrate Docling freely into their solutions without incurring licensing fees. Additionally, it offers highly accurate, resource-efficient, and fast models. Docling stands out as a cost-effective, open-source library, offering a reliable solution for document conversion without hallucinations.

Ecosystem and Integrations

Docling is quickly becoming a mainstream package for document conversion. Several integrations are provided by the Docling team and the growing community. Native integration with LlamaIndex and LangChain lets it support a range of generative AI applications. Docling is integrated with the open IBM data-prep-kit, which enables large-scale, multi-modal data processing. It is also integrated with InstructLab, where it supports the enhancement of knowledge taxonomy and the fine-tuning of LLMs.
Docling is a system package in the Red Hat® Enterprise Linux® AI distribution, which helps develop, test, and run the Granite family of large language models.

Future Development

The Docling team is continuously working on improving the project. Future updates will include:

  • Equation and Code Extraction: Further enhancing Docling’s capabilities to extract equations and code snippets accurately.
  • Metadata Extraction: Improving metadata extraction, including titles, authors, references, and language.
  • Native LangChain Extension: Enhancing its native LangChain extension for smoother workflows.

Conclusion

Docling stands out as an exceptional Python document parser that addresses the challenges of data extraction from various document types. With its strong feature set, high performance, and integration capabilities, Docling is an indispensable tool for developers, researchers, and companies working with Gen AI and other data processing tasks. The ease of setup and the comprehensive features make it a must-have tool in your arsenal.

Now is the perfect time to explore Docling and contribute to its growth. I highly encourage you to leverage its features, refer to its documentation, and share your feedback within the community.

Author’s Bio

Vineet Tiwari

Vineet Tiwari is an accomplished Solution Architect with over 5 years of experience in AI, ML, Web3, and Cloud technologies. Specializing in Large Language Models (LLMs) and blockchain systems, he excels in building secure AI solutions and custom decentralized platforms tailored to unique business needs.

Vineet’s expertise spans cloud-native architectures, data-driven machine learning models, and innovative blockchain implementations. Passionate about leveraging technology to drive business transformation, he combines technical mastery with a forward-thinking approach to deliver scalable, secure, and cutting-edge solutions. With a strong commitment to innovation, Vineet empowers businesses to thrive in an ever-evolving digital landscape.
