Author: Vineet Tiwari

  • Python Document Parser: Guide with Docling Tutorial and Setup

    Python Document Parser: Guide with Docling Tutorial and Setup

In the age of Artificial Intelligence (AI) and Machine Learning (ML), data is the new gold. However, much of this valuable data lies trapped within diverse document formats, making it difficult to access and analyze. This is where a robust Python document parser becomes indispensable. For developers, researchers, and AI enthusiasts seeking to leverage the power of Large Language Models (LLMs) and generative AI, the ability to extract clean, structured data from documents is paramount. This article introduces you to Docling, a versatile Python document parser that not only simplifies this process but also offers seamless integration with leading AI frameworks. We will guide you through a detailed tutorial and setup, and showcase the power of Docling.

    Why a Robust Document Parser is Essential for Gen AI

Large Language Models (LLMs), such as GPT, Claude, and Gemini, have revolutionized many fields. However, these models thrive on structured data. Unstructured data, such as that found in PDFs, Word documents, and presentations, presents a challenge. A reliable Python-based document parser is the key to unlocking the potential of these documents for Gen AI applications. Without an efficient way to extract and organize information, the full power of LLMs remains untapped. Docling addresses these challenges by understanding diverse document formats and transforming them into structured JSON, a representation that LLMs can process far more effectively than raw formats such as Excel spreadsheets.

    Introducing Docling: The Versatile “Python Document Parser”

Docling: Multi-Format Document Parser
    • Multi-Format Support: Docling supports a wide array of popular document formats, including PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc, and Markdown.
    • Versatile Export Options: It allows you to export the parsed content into JSON, HTML, and Markdown. JSON export retains detailed metadata, making it ideal for further processing by AI models.
    • Advanced PDF Understanding: Docling employs AI to understand page layout, reading order, and table structures within PDF documents accurately, including OCR for scanned PDFs.
    • Seamless Integration: Docling seamlessly integrates with LlamaIndex and LangChain, allowing you to create powerful Retrieval Augmented Generation (RAG) and QA applications.
    • Simple Interface: Docling offers a simple command-line interface for those who prefer working in the terminal, alongside the Python API.
• Free and Open Source: Docling is released as a completely open-source library, making it accessible to everyone without any cost.

Parse PDF and export into JSON

    Docling accurately parses PDF documents and exports the content into structured JSON format. This includes not only text but also metadata about layouts, tables, and figures.
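As a minimal sketch, assuming a local file named report.pdf and Docling's export_to_dict() method for a JSON-serializable representation of the parsed document:

import json

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")  # hypothetical local file

# export_to_dict() returns the parsed document, including layout and
# table metadata, as a JSON-serializable dictionary.
doc_dict = result.document.export_to_dict()
with open("report.json", "w", encoding="utf-8") as f:
    json.dump(doc_dict, f, ensure_ascii=False, indent=2)

The same pattern applies to the XLSX and PPTX exports described below; only the source path changes.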

    Parse XLSX and export into JSON

    The tool extracts data from XLSX files, preserving the table structure, and exports the extracted information into a JSON format, making it easy to process.

    Parse PPTX and export into JSON

    Docling converts the text and metadata from PPTX files and exports them to JSON, capturing presentation slides and related content in the desired output format.

Parse PDF and export into HTML and Markdown

    Users can export PDFs to well-formatted HTML and Markdown documents. While JSON retains more metadata, these options are useful for display purposes.
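A minimal sketch of both exports, again assuming a local report.pdf; export_to_markdown() appears in the tutorial below, and export_to_html() is assumed as its HTML counterpart:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")  # hypothetical local file

# Export the parsed document to display-oriented formats.
with open("report.md", "w", encoding="utf-8") as f:
    f.write(result.document.export_to_markdown())
with open("report.html", "w", encoding="utf-8") as f:
    f.write(result.document.export_to_html())

As with JSON, the same calls work for XLSX and PPTX sources.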

    Parse XLSX and export into HTML and Markdown

    Similarly, XLSX files can be converted to HTML and Markdown, with tables and text preserved for web display.

    Parse PPTX and export into HTML and Markdown

    Docling enables you to convert PPTX files to HTML and Markdown. It retains the content and layout of the presentation slides, which makes it useful for content display.

    Getting Started with Docling: A Quick Tutorial

    Here is a quick guide on how to get started with Docling:
    Installation
To begin, install Docling using pip:

    pip install docling

    Docling is compatible with macOS, Linux, and Windows environments, supporting both x86_64 and arm64 architectures.

    Basic Usage

    Here’s an example of using Docling’s Python API:

    from docling.document_converter import DocumentConverter
    
    source = "https://arxiv.org/pdf/2408.09869"  # Document URL or local path
    converter = DocumentConverter()
    result = converter.convert(source)
    print(result.document.export_to_markdown())  # Output: "## Docling Technical Report[...]"

    This code snippet downloads the Docling technical report from arXiv, converts it, and prints the result in Markdown format.

    CLI Usage

    For command-line users, Docling offers a straightforward interface:

docling input.pdf --to json

The command converts the input file and writes the result in the requested format (Markdown by default; --to selects JSON, HTML, or other formats). Exact flags can vary between releases, so check docling --help for your installed version.

    For more advanced options such as document chunking, custom pipelines and more, please refer to the official documentation.

Document Parsing with 🦙 LlamaIndex & 🦜🔗 LangChain

Here is an example of how to use Docling with LlamaIndex. This sketch assumes the llama-index-readers-docling integration package (pip install llama-index llama-index-readers-docling) and an embedding/LLM backend configured for LlamaIndex (OpenAI by default):

from llama_index.core import VectorStoreIndex
from llama_index.readers.docling import DoclingReader

# Configure the Docling reader to parse a given path or URL.
reader = DoclingReader()
documents = reader.load_data("https://arxiv.org/pdf/2408.09869")

# Build the LlamaIndex vector index over the parsed document.
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What is Docling?")

print(response)
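For LangChain, a similar sketch applies, assuming the langchain-docling integration package (pip install langchain-docling):

from langchain_docling import DoclingLoader

# Load and parse the document through Docling into LangChain documents.
loader = DoclingLoader(file_path="https://arxiv.org/pdf/2408.09869")
docs = loader.load()

print(docs[0].page_content[:200])

The resulting documents can then be chunked, embedded, and placed in any LangChain vector store for RAG.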

    Advanced Features and Use Cases of Docling

Docling offers more than simple document conversion. It supports document chunking, which is crucial for RAG applications: by splitting documents into smaller, context-aware chunks, it enables more accurate and efficient querying by LLMs. Docling is versatile enough to be used in both local and cloud environments, and it runs on both x86 and ARM architectures, making it flexible for various hardware configurations. It also supports customizable document processing pipelines and different model configurations, letting users tailor its performance to their specific use cases and requirements.
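As a sketch of the chunking workflow, assuming the HybridChunker exposed by recent Docling releases under docling.chunking:

from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
doc = converter.convert("https://arxiv.org/pdf/2408.09869").document

# Split the parsed document into smaller, context-aware chunks for RAG.
chunker = HybridChunker()
for chunk in chunker.chunk(doc):
    print(chunk.text[:80])  # preview the start of each chunk

Each chunk exposes its text, ready to be embedded and indexed in a RAG pipeline.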

Performance and Benchmarking of the Docling Document Parser

Docling’s performance has been tested on different systems, and the results are promising. On an x86 CPU, Docling processes a page in about 3.1 seconds; on an Apple M3 Max SoC, it takes around 1.27 seconds per page; and with an NVIDIA L4 GPU, it processes a page in about 0.49 seconds. The major contributor to overall processing time is the OCR functionality, and the results show that enabling the GPU significantly accelerates it.
Docling stands out as significantly faster than many of its open-source alternatives, making it a strong choice for document processing tasks.

How Docling Differs from Other Python Document Parsers

Docling differentiates itself from other document parsers through its combination of accuracy, speed, and accessibility. Its permissive MIT license allows organizations to integrate Docling freely into their solutions without incurring licensing fees. Its models are highly accurate, resource-efficient, and fast. Because conversion is performed by purpose-built models rather than a generative LLM, the output faithfully reflects the source document without hallucinated content, making Docling a reliable, cost-effective, open-source choice for document conversion.

    Ecosystem and Integrations

Docling is quickly becoming a mainstream package for document conversion, with several integrations provided by the Docling team and a growing community. Native integrations with LlamaIndex and LangChain support a wide range of generative AI applications. Docling is integrated with IBM’s open-source data-prep-kit, which enables large-scale, multi-modal data processing, and with InstructLab, where it supports knowledge-taxonomy enhancement and LLM fine-tuning.
Docling also ships as a system package in the Red Hat® Enterprise Linux® AI distribution, which is used to develop, test, and run the Granite family of large language models.

    Future Development

    The Docling team is continuously working on improving the project. Future updates will include:

    • Equation and Code Extraction: Further enhancing Docling’s capabilities to extract equations and code snippets accurately.
    • Metadata Extraction: Improving metadata extraction, including titles, authors, references, and language.
• Native LangChain Extension: Enhancing its native LangChain extension for smoother workflows.

    Conclusion

Docling stands out as an exceptional Python document parser that addresses the challenges of data extraction from various document types. With its strong feature set, high performance, and integration capabilities, Docling is an indispensable tool for developers, researchers, and companies working with Gen AI and other data processing tasks. The ease of setup and the comprehensive features make it a must-have tool in your arsenal.

    Now is the perfect time to explore Docling and contribute to its growth. I highly encourage you to leverage its features, refer to its documentation, and share your feedback within the community.

  • LLM Continual Learning Without Fine-Tuning: The InCA Revolution

    LLM Continual Learning Without Fine-Tuning: The InCA Revolution

    The Challenge of LLM Continual Learning

LLM continual learning is a complex issue. Large Language Models (LLMs) are powerful and can perform a huge range of tasks, yet they struggle with continual learning: the ability to learn new things without forgetting what they already know. Traditional methods rely on fine-tuning, which updates the LLM’s core parameters and therefore struggles to learn new tasks without forgetting old ones. These problems make effective LLM continual learning a significant challenge, and new approaches are needed.

    Introducing InCA: A New Paradigm for LLM Continual Learning

    Enter InCA. InCA, or “In-context Continual Learning Assisted by an External Continual Learner”, offers a new paradigm for LLM continual learning. It avoids fine-tuning. It uses in-context learning and an external learner instead. In this system, the LLM is a black box with unchanged parameters. The external learner manages the learning process. It stores information and selects the most relevant context for the LLM. This design prevents catastrophic forgetting. It also enables scalable LLM continual learning.

    How InCA Works & How InCA Achieves Effective LLM Continual Learning

    Overview of the InCA framework. The diagram depicts the stages of generating semantic tags for the input, identifying the most similar classes via the ECL, and constructing the prediction prompt with class summaries, which together enable efficient in-context continual learning without retaining any training data.

    InCA works in three steps:

• Tag Generation: The system extracts semantic tags from the input text. Tags include topics, keywords, and relevant entities, and capture the core meaning of the text. An LLM generates these tags.
• External Continual Learning (ECL): The tags are used by the ECL, which identifies the most probable classes for each input without any training. It represents each class statistically with a Gaussian distribution and uses the Mahalanobis distance to measure how close an input is to each class. This step efficiently selects the most relevant context for the LLM (see the sketch after this list).
• In-context Learning with Class Summaries: A summary of each class is prepared at the time the class is added. At inference, the summaries of the top ‘k’ classes selected by the ECL are combined with the input test instance to form a prompt, which the LLM uses to predict the final class.

    InCA is entirely ‘replay-free’. It does not require storing previous task data. This makes it memory efficient.
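To make the ECL step concrete, here is a minimal sketch. It assumes a generic embed(text) -> vector sentence-embedding function (not specified above), uses a single covariance matrix shared across classes, and recomputes that covariance naively; it illustrates the Gaussian-plus-Mahalanobis idea rather than reproducing the authors’ implementation.

import numpy as np


class ExternalContinualLearner:
    def __init__(self, dim: int):
        self.dim = dim
        self.means = {}         # class label -> mean tag embedding
        self.cov = np.eye(dim)  # covariance shared across classes
        self._embeddings = []   # kept only for this naive sketch

    def add_class(self, label: str, tag_embeddings: np.ndarray) -> None:
        """Register a new class from the embeddings of its semantic tags."""
        self.means[label] = tag_embeddings.mean(axis=0)
        self._embeddings.extend(tag_embeddings)
        if len(self._embeddings) > 1:
            # Naive re-estimate; an incremental update would avoid
            # retaining embeddings and keep the method replay-free.
            stacked = np.vstack(self._embeddings)
            self.cov = np.cov(stacked, rowvar=False) + 1e-6 * np.eye(self.dim)

    def top_k(self, x: np.ndarray, k: int) -> list:
        """Return the k classes closest to x under Mahalanobis distance."""
        inv_cov = np.linalg.inv(self.cov)
        dists = {
            label: float((x - mu) @ inv_cov @ (x - mu))
            for label, mu in self.means.items()
        }
        return sorted(dists, key=dists.get)[:k]

At inference, the semantic tags of a test input are embedded and averaged into x, and the labels returned by top_k decide which class summaries go into the LLM prompt.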

    The Benefits of InCA for LLM Continual Learning

    InCA offers several benefits:

    • No Fine-Tuning: This saves significant computational resources. It also reduces the complexities associated with fine-tuning.
    • Avoids Catastrophic Forgetting: The external learner helps preserve previous knowledge.
    • Scalable Learning: InCA can handle an increasing number of tasks without issue. It avoids long prompts and the associated performance problems.
    • Efficient Context Selection: The ECL ensures the LLM only focuses on the most relevant information. This speeds up processing and improves accuracy.
    • Memory Efficient: InCA doesn’t require storing large amounts of previous training data.

    InCA’s Performance in LLM Continual Learning

    Research shows that InCA outperforms traditional continual learning methods. Fine-tuning approaches, like EWC and L2P, fall short of the performance achieved by InCA. InCA performs better than long-context LLMs. These results show the effectiveness of the external learner and the overall InCA approach.

    Key Takeaways

    InCA presents a significant advancement in continual learning for LLMs. It provides a more efficient and scalable approach. This approach could enable LLMs to adapt to new information more readily, and open up new possibilities for using them in diverse scenarios.

    Looking Ahead

    Although the early outcomes are quite encouraging, additional investigation is needed. In the future, researchers plan to explore how to apply InCA to various other NLP tasks. They also plan to improve InCA’s overall performance.

  • Garbage In, Garbage Out: Why Data Quality is the Cornerstone of AI Success

    Garbage In, Garbage Out: Why Data Quality is the Cornerstone of AI Success

    AI projects fail more often due to poor data quality than flawed algorithms. Learn why focusing on data cleansing, preparation, and governance is crucial for successful AI, Machine Learning, and Generative AI initiatives.

    We all know AI is the buzzword of the decade. From chatbots and virtual assistants to advanced predictive analytics, the possibilities seem limitless. But behind every successful AI application lies a critical, often overlooked, component: data.

Wrong AI response and hallucination due to bad data

    It’s easy to get caught up in the excitement of cutting-edge algorithms and powerful models, but the reality is stark: if your data is poor, your AI will be poor. The old adage “Garbage In, Garbage Out” (GIGO) has never been more relevant than in the world of Artificial Intelligence. This isn’t just about missing values or misspellings; it’s about a fundamental understanding that data quality is the bedrock of any AI initiative.

    Why Data Quality Matters More Than You Think

Data Flow for Good AI Response

    You might be thinking, “Yeah, yeah, data quality. I know.” But consider this:

    • Machine Learning & Model Accuracy: Machine learning models learn from data. If the data is biased, inconsistent, or inaccurate, the model will learn to make biased, inconsistent, and inaccurate predictions. No matter how sophisticated your model is, it won’t overcome flawed input.
    • Generative AI Hallucinations: Even the most impressive generative AI models can produce nonsensical outputs (known as “hallucinations”) when fed unreliable data. These models learn patterns from data, and if the underlying data is flawed, the patterns will be flawed too.
    • The Impact on Business Decisions: Ultimately, AI is meant to drive better business decisions. If the data underlying these decisions is unreliable, the outcomes will be detrimental, leading to missed opportunities, financial losses, and damage to reputation.
    • Increased Development Time & Costs: Debugging problems caused by bad data can consume vast amounts of development time. Identifying and correcting data quality issues is time-consuming and can require specialised expertise. This significantly increases project costs and delays time-to-market.

    Beyond the Basic Clean-Up

    Data quality goes beyond just removing duplicates and correcting spelling mistakes. It involves a comprehensive approach encompassing:

    • Completeness: Ensuring all relevant data is present. Are you missing vital fields? Are critical records incomplete?
    • Accuracy: Making sure data is correct and truthful. Are values consistent across different systems?
    • Consistency: Data should be uniform across your different sources.
    • Validity: Data should conform to defined rules and formats.
    • Timeliness: Keeping data up-to-date and relevant. Outdated data can lead to inaccurate results.
    • Data Governance: Implementing policies and processes to ensure data is managed effectively.

    Key Steps to Improve Data Quality for AI:

    1. Data Audit: Start by understanding your current data landscape. Where is your data coming from? What are the potential quality issues?
    2. Define Data Quality Metrics: Identify which aspects of data quality matter most for your specific AI use case.
    3. Data Cleansing & Preparation: Develop processes to correct errors, fill missing data, and transform data into a usable format.
    4. Implement Data Governance: Define clear ownership and responsibilities for data quality.
    5. Continuous Monitoring: Data quality is an ongoing process. Implement monitoring to identify and address issues proactively.
6. Invest in Data Engineering: A team with experience in data processing and ETL pipelines is important for the success of the project.

    Don’t Neglect the Foundation

    AI has the potential to transform businesses, but its success hinges on the quality of its fuel – data. Instead of chasing the latest algorithms, make sure you’re not skipping the important part. Prioritising data quality is not just a technical consideration; it’s a strategic imperative. By investing in building a robust data foundation, you can unlock the true power of AI and realize its full potential. Remember, the best AI strategy always begins with the best data.