Author: Vineet Tiwari

  • ECL vs RAG, What is ETL: AI Learning, Data, and Transformation

    ECL vs RAG, What is ETL: AI Learning, Data, and Transformation

    ECL vs RAG: A Deep Dive into Two Innovative AI Approaches

    In the world of advanced AI, particularly with large language models (LLMs), two innovative approaches stand out: the External Continual Learner (ECL) and Retrieval-Augmented Generation (RAG). While both aim to enhance the capabilities of AI models, they serve different purposes and use distinct mechanisms. Understanding the nuances of ECL vs RAG is essential for choosing the right method for your specific needs.

    ECL vs ETL vs RAG

    What is an External Continual Learner (ECL)?

    An External Continual Learner (ECL) is a method designed to assist large language models (LLMs) in incremental learning without suffering from catastrophic forgetting. The ECL functions as an external module that intelligently selects relevant information for each new input, ensuring that the LLM can learn new tasks without losing its previously acquired knowledge.

    The core features of the ECL include:

    • Incremental Learning: The ability to learn continuously without forgetting past knowledge.
    • Tag Generation: Using the LLM to generate descriptive tags for input text.
    • Gaussian Class Representation: Representing each class with a statistical distribution of its tag embeddings.
    • Mahalanobis Distance Scoring: Selecting the most relevant classes for each input using distance calculations.

    The goal of the ECL is to streamline the in-context learning (ICL) process by reducing the number of relevant examples that need to be included in the prompt, addressing scalability issues.
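
    To make the Gaussian class representation and Mahalanobis distance scoring concrete, here is a minimal sketch of how such an external learner could be organized. It is an illustration rather than the authors’ implementation: tag embeddings are assumed to come from any sentence encoder you choose, and a single shared covariance matrix is used for simplicity.

    import numpy as np

    class GaussianClassStore:
        """Toy external continual learner: one Gaussian per class over tag embeddings."""

        def __init__(self, dim):
            self.means = {}            # class id -> mean of its tag embeddings
            self.cov = np.eye(dim)     # shared covariance, updated incrementally

        def add_class(self, cls, tag_embeddings):
            # Represent the new class by the statistics of its tag embeddings.
            X = np.asarray(tag_embeddings)
            self.means[cls] = X.mean(axis=0)
            self.cov = 0.9 * self.cov + 0.1 * np.cov(X, rowvar=False)

        def top_k(self, query_embedding, k=3):
            # Rank classes by Mahalanobis distance from the query's tag embedding.
            inv_cov = np.linalg.pinv(self.cov)
            def mahalanobis(mu):
                d = query_embedding - mu
                return float(np.sqrt(d @ inv_cov @ d))
            scores = {c: mahalanobis(mu) for c, mu in self.means.items()}
            return sorted(scores, key=scores.get)[:k]  # smallest distance = most relevant

    Only these per-class statistics are kept between tasks, which is what allows the in-context prompt to stay short as the number of classes grows.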

    What is Retrieval-Augmented Generation (RAG)?

    Retrieval-Augmented Generation (RAG) is a framework that enhances the performance of large language models by providing them with external information during the generation process. Instead of relying solely on their pre-trained knowledge, RAG models access a knowledge base and retrieve relevant snippets of information to inform the generation.

    The key aspects of RAG include:

    • External Knowledge Retrieval: Accessing an external repository (e.g., a database or document collection) for relevant information.
    • Contextual Augmentation: Using the retrieved information to enhance the input given to the LLM.
    • Generation Phase: The LLM generates text based on the augmented input.
    • Focus on Content: RAG aims to add domain-specific or real-time knowledge to content generation.
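
    As a rough illustration of this retrieve-then-generate flow, the sketch below uses a tiny bag-of-words retriever over an in-memory knowledge base and simply prints the augmented prompt. KNOWLEDGE_BASE, embed, and build_prompt are illustrative stand-ins for a real vector store and an actual LLM call.

    from collections import Counter
    import math

    KNOWLEDGE_BASE = [
        "Docling converts PDFs into structured JSON.",
        "ETL stands for Extract, Transform, Load.",
        "RAG augments prompts with retrieved snippets.",
    ]

    def embed(text):
        # Toy bag-of-words "embedding"; a real system would use a neural encoder.
        return Counter(text.lower().split())

    def cosine(a, b):
        num = sum(a[t] * b[t] for t in set(a) & set(b))
        den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def retrieve(query, k=1):
        q = embed(query)
        return sorted(KNOWLEDGE_BASE, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

    def build_prompt(query):
        context = "\n".join(retrieve(query))
        return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

    print(build_prompt("What does RAG do?"))  # pass the augmented prompt to any LLM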

    Key Differences: ECL vs RAG

    While both ECL and RAG aim to enhance LLMs, their fundamental approaches differ. Here’s a breakdown of the key distinctions between ECL and RAG:

    • Purpose: The ECL is focused on enabling continual learning and preventing forgetting, while RAG is centered around providing external knowledge for enhanced generation.
    • Method of Information Use: The ECL filters context to select relevant classes for an in-context learning prompt, using statistical measures. RAG retrieves specific text snippets from an external source and uses that for text generation.
    • Learning Mechanism: The ECL learns class statistics incrementally and does not store training instances, which lets it handle catastrophic forgetting (CF) and inter-task class separation (ICS). RAG does not learn from external data; it simply retrieves and uses it during the generation process.
    • Scalability and Efficiency: The ECL focuses on managing the context length of the prompt, making ICL scalable. RAG adds extra steps in content retrieval and processing, which can be less efficient and more computationally demanding.
    • Application: ECL is well-suited for class-incremental learning, where the goal is to learn a sequence of classification tasks. RAG excels in scenarios that require up-to-date information or context from an external knowledge base.
    • Text Retrieval vs Tag-based Classification: RAG uses text-based similarity search to find similar instances, whereas the ECL uses tag embeddings to classify and determine class similarity.

    When to Use ECL vs RAG

    The choice between ECL and RAG depends on the specific problem you are trying to solve.

    • Choose ECL when:
      • You need to train a classifier with class-incremental learning.
      • You want to avoid catastrophic forgetting and improve scalability in ICL settings.
      • Your task requires focus on relevant class information from past experiences.
    • Choose RAG when:
      • You need to incorporate external knowledge into the output of LLMs.
      • You are working with information that is not present in the model’s pre-training.
      • The aim is to provide up-to-date information or domain-specific context for text generation.

    What is ETL? A Simple Explanation of Extract, Transform, Load

    In the realm of data management, ETL stands for Extract, Transform, Load. It’s a fundamental process used to integrate data from multiple sources into a unified, centralized repository, such as a data warehouse or data lake. Understanding what is ETL is crucial for anyone working with data, as it forms the backbone of data warehousing and business intelligence (BI) systems.

    Breaking Down the ETL Process

    The ETL process involves three main stages: Extract, Transform, and Load. Let’s explore each of these steps in detail:

    1. Extract

    The extract stage is the initial step in the ETL process, where data is gathered from various sources. These sources can be diverse, including:

    • Relational Databases: Such as MySQL, PostgreSQL, Oracle, and SQL Server.
    • NoSQL Databases: Like MongoDB, Cassandra, and Couchbase.
    • APIs: Data extracted from various applications or platforms via their APIs.
    • Flat Files: Data from CSV, TXT, JSON, and XML files.
    • Cloud Services: Data sources like AWS, Google Cloud, and Azure platforms.

    During the extract stage, the ETL tool reads data from these sources, ensuring all required data is captured while minimizing the impact on the source system’s performance. This data is often pulled in its raw format.
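
    As a small illustration of the extract stage, the snippet below reads from a flat file and a relational database with pandas; the file names, table name, and API endpoint are placeholders for whatever sources your pipeline actually uses.

    import sqlite3
    import pandas as pd

    # Flat-file source (CSV); the path is a placeholder.
    orders = pd.read_csv("orders.csv")

    # Relational source; SQLite keeps the example dependency-free.
    with sqlite3.connect("crm.db") as conn:
        customers = pd.read_sql_query("SELECT * FROM customers", conn)

    # API source (commented out; the endpoint is hypothetical).
    # import requests
    # payments = pd.DataFrame(requests.get("https://api.example.com/payments").json())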

    2. Transform

    The transform stage is where the extracted data is cleaned, processed, and converted into a format suitable for the target system and ready for analysis. This stage often involves various tasks:

    • Data Cleaning: Removing or correcting errors, inconsistencies, duplicates, and incomplete data.
    • Data Standardization: Converting data to a common format (e.g., date and time, units of measure) for consistency.
    • Data Mapping: Ensuring that the data fields from source systems correspond correctly to fields in the target system.
    • Data Aggregation: Combining data to provide summary views and derived calculations.
    • Data Enrichment: Enhancing the data with additional information from other sources.
    • Data Filtering: Removing unnecessary data based on specific rules.
    • Data Validation: Ensuring that the data conforms to predefined business rules and constraints.

    The transformation process is crucial for ensuring the quality, reliability, and consistency of the data.
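
    The sketch below shows what a few of these transformations might look like with pandas, assuming an orders DataFrame with illustrative columns such as order_date, country, and amount.

    import pandas as pd

    def transform(orders: pd.DataFrame) -> pd.DataFrame:
        df = orders.copy()
        df = df.drop_duplicates()                               # cleaning: remove duplicates
        df["order_date"] = pd.to_datetime(df["order_date"])     # standardization: common date format
        df["country"] = df["country"].str.strip().str.upper()   # standardization: consistent codes
        df = df[df["amount"] > 0]                                # validation/filtering: business rule
        df["amount"] = df["amount"].round(2)                     # mapping: match target schema precision
        return df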

    3. Load

    The load stage is the final step, where the transformed data is written into the target system. This target can be a:

    • Data Warehouse: A central repository for large amounts of structured data.
    • Data Lake: A repository for storing both structured and unstructured data in its raw format.
    • Relational Databases: Where processed data will be used for reporting and analysis.
    • Specific Application Systems: Data used by business applications for various purposes.

    The load process can involve a full load, which loads all data, or an incremental load, which loads only the changes since the last load. The goal is to ensure data is written efficiently and accurately.
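
    A minimal load sketch follows, using SQLite to stand in for the target warehouse; table and column names are illustrative. It contrasts a full load with an incremental load that appends only rows newer than the last load.

    import sqlite3
    import pandas as pd

    def full_load(df: pd.DataFrame, db_path: str = "warehouse.db") -> None:
        with sqlite3.connect(db_path) as conn:
            df.to_sql("orders", conn, if_exists="replace", index=False)   # rewrite the whole table

    def incremental_load(df: pd.DataFrame, last_loaded_at, db_path: str = "warehouse.db") -> None:
        new_rows = df[df["order_date"] > last_loaded_at]                  # only the changes
        with sqlite3.connect(db_path) as conn:
            new_rows.to_sql("orders", conn, if_exists="append", index=False)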

    Why is ETL Important?

    The ETL process is critical for several reasons:

    • Data Consolidation: It brings together data from different sources into a unified view, breaking down data silos.
    • Data Quality: By cleaning, standardizing, and validating data, ETL enhances the reliability and accuracy of the information.
    • Data Preparation: It transforms raw data into an analysis-ready form, making it usable for reporting and business intelligence.
    • Data Accessibility: ETL makes data accessible and actionable, allowing organizations to gain insights and make data-driven decisions.
    • Improved Efficiency: By automating data integration, ETL saves time and resources while reducing the risk of human errors.

    When to Use ETL?

    The ETL process is particularly useful for organizations that:

    • Handle a diverse range of data from various sources.
    • Require high-quality, consistent, and reliable data.
    • Need to create data warehouses or data lakes.
    • Use data to enable business intelligence (BI) or data-driven decision-making.

    ECL vs RAG

    Feature | ECL (External Continual Learner) | RAG (Retrieval-Augmented Generation)
    Purpose | Incremental learning, prevents forgetting | Enhanced text generation via external knowledge
    Method | Tag-based filtering and statistical selection of relevant classes | Text-based retrieval of relevant information from an external source
    Learning | Incremental statistical learning; no LLM parameter updates | No learning; only retrieval of external information
    Data Handling | Uses tagged data to optimize prompts | Uses text queries to retrieve from external knowledge bases
    Focus | Managing prompt size for effective ICL | Augmenting text generation with external knowledge
    Parameter Updates | External module statistics updated; no LLM parameter updates | No parameter updates at all

    ETL vs RAG

    Feature | ETL (Extract, Transform, Load) | RAG (Retrieval-Augmented Generation)
    Purpose | Data migration, transformation, and preparation | Enhanced text generation via external knowledge
    Method | Data extraction, transformation, and loading | Text-based retrieval of relevant information from an external source
    Learning | No machine learning; a data processing pipeline | No learning; only retrieval of external information
    Data Handling | Works with bulk data at rest | Uses text queries for dynamic data retrieval
    Focus | Preparing data for storage or analytics | Augmenting text generation with external knowledge
    Parameter Updates | No parameter updates; rules are predefined | No parameter updates at all

    The terms ECL, RAG, and ETL represent distinct but important approaches in AI and data management. The External Continual Learner (ECL) helps LLMs to learn incrementally. Retrieval-Augmented Generation (RAG) enhances text generation with external knowledge. ETL is a data management process for data migration and preparation. A clear understanding of ECL vs RAG vs ETL allows developers and data professionals to select the right tools for the right tasks. By understanding these core differences, you can effectively enhance your AI capabilities and optimize your data management workflows, thereby improving project outcomes.

  • Python Document Parser: Guide with Docling Tutorial and Setup

    Python Document Parser: Guide with Docling Tutorial and Setup

    In the age of Artificial Intelligence (AI) and Machine Learning (ML), data is the new gold. However, much of this valuable data lies trapped within diverse document formats, making it difficult to access and analyze. This is where a robust Python document parser becomes indispensable. For developers, researchers, and AI enthusiasts seeking to leverage the power of Large Language Models (LLMs) and generative AI, the ability to extract clean, structured data from documents is paramount. This article introduces you to Docling, a versatile Python document parser that not only simplifies this process but also offers seamless integration with leading AI frameworks. We will guide you through a detailed tutorial and setup, and showcase the power of Docling.

    Why a Robust Document Parser is Essential for Gen AI

    Large Language Models (LLMs), such as GPT, Claude, and Gemini, have revolutionized many fields. However, these models thrive on structured data. Unstructured data, such as that found in PDFs, Word documents, and presentations, presents a challenge. A reliable Python-based document parser is the key to unlocking the potential of these documents for Gen AI applications. Without an efficient way to extract and organize information, the full power of LLMs remains untapped. Docling is designed to address these challenges by providing advanced features for understanding various document formats and transforming them into structured JSON, which LLMs can process far more effectively than raw formats such as Excel spreadsheets. This makes the extraction problem much easier for developers to solve.

    Introducing Docling: The Versatile “Python Document Parser”

    Docling is a multi-format document parser. Its key features include:
    • Multi-Format Support: Docling supports a wide array of popular document formats, including PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc, and Markdown.
    • Versatile Export Options: It allows you to export the parsed content into JSON, HTML, and Markdown. JSON export retains detailed metadata, making it ideal for further processing by AI models.
    • Advanced PDF Understanding: Docling employs AI to understand page layout, reading order, and table structures within PDF documents accurately, including OCR for scanned PDFs.
    • Seamless Integration: Docling seamlessly integrates with LlamaIndex and LangChain, allowing you to create powerful Retrieval Augmented Generation (RAG) and QA applications.
    • Simple Interface: Docling offers a simple command-line interface for those who prefer working in the terminal, alongside the Python API.
    • Free and Open Source: Docling is released as a completely open-source document parser, making it accessible to everyone without any cost.

    Parse PDF and export into JSON

    Docling accurately parses PDF documents and exports the content into structured JSON format. This includes not only text but also metadata about layouts, tables, and figures.
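
    As a short sketch of this PDF-to-JSON path, the snippet below converts a document and serializes it with export_to_dict(); "report.pdf" is a placeholder path, and the exact JSON layout follows whatever your installed Docling version produces.

    import json
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert("report.pdf")  # placeholder local path or URL
    with open("report.json", "w", encoding="utf-8") as f:
        json.dump(result.document.export_to_dict(), f, ensure_ascii=False, indent=2)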

    Parse XLSX and export into JSON

    The tool extracts data from XLSX files, preserving the table structure, and exports the extracted information into a JSON format, making it easy to process.

    Parse PPTX and export into JSON

    Docling converts the text and metadata from PPTX files and exports them to JSON, capturing presentation slides and related content in the desired output format.

    Parse PDF and export into HTML and Markdown

    Users can export PDFs to well-formatted HTML and Markdown documents. While JSON retains more metadata, these options are useful for display purposes.
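
    A minimal sketch of the Markdown path is shown below using export_to_markdown(); the HTML call in the comment is assumed to exist analogously, so check your Docling version’s API before relying on it.

    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert("report.pdf")  # placeholder path or URL
    with open("report.md", "w", encoding="utf-8") as f:
        f.write(result.document.export_to_markdown())
    # HTML export is assumed to mirror the Markdown call, e.g.:
    # with open("report.html", "w", encoding="utf-8") as f:
    #     f.write(result.document.export_to_html())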

    Parse XLSX and export into HTML and Markdown

    Similarly, XLSX files can be converted to HTML and Markdown, with tables and text preserved for web display.

    Parse PPTX and export into HTML and Markdown

    Docling enables you to convert PPTX files to HTML and Markdown. It retains the content and layout of the presentation slides, which makes it useful for content display.

    Getting Started with Docling: A Quick Tutorial

    Here is a quick guide on how to get started with Docling:
    Installation
    To begin, you must install Docling using pip:

    pip install docling

    Docling is compatible with macOS, Linux, and Windows environments, supporting both x86_64 and arm64 architectures.

    Basic Usage

    Here’s an example of using Docling’s Python API:

    from docling.document_converter import DocumentConverter
    
    source = "https://arxiv.org/pdf/2408.09869"  # Document URL or local path
    converter = DocumentConverter()
    result = converter.convert(source)
    print(result.document.export_to_markdown())  # Output: "## Docling Technical Report[...]"

    This code snippet downloads the Docling technical report from arXiv, converts it, and prints the result in Markdown format.

    CLI Usage

    For command-line users, Docling offers a straightforward interface:

    docling input.pdf --to json --output ./out

    This converts input.pdf and writes the resulting JSON into the ./out directory; other formats such as Markdown or HTML can be selected with the --to option.

    For more advanced options such as document chunking, custom pipelines and more, please refer to the official documentation.

    Document parser with 🦙 LlamaIndex & 🦜🔗 LangChain

    Here is an example of how to use Docling with LlamaIndex.

    # Requires: pip install llama-index llama-index-readers-docling
    from llama_index.core import VectorStoreIndex
    from llama_index.readers.docling import DoclingReader

    # The DoclingReader runs Docling's conversion under the hood and returns
    # LlamaIndex Document objects for the given path or URL.
    reader = DoclingReader()
    documents = reader.load_data("https://arxiv.org/pdf/2408.09869")

    # Build a vector index over the parsed document
    # (uses the embedding model configured in your LlamaIndex settings).
    index = VectorStoreIndex.from_documents(documents)

    # Query the index
    query_engine = index.as_query_engine()
    response = query_engine.query("What is Docling?")

    print(response)

    Advanced Features and Use Cases of Docling

    Docling offers more than simple document conversion. It supports document chunking, which is crucial for RAG applications. By converting documents into smaller, context-aware chunks, it enables more accurate and efficient querying by LLMs. Also, Docling is versatile enough to be used in local and cloud environments, making it suitable for a wide array of applications.
    Docling runs on both x86 and ARM architectures, making it flexible for various hardware configurations.
    It also supports customizable document processing pipelines and different model configurations. This helps users to tailor its performance to their own specific use cases and requirements.
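
    For chunking, a minimal sketch assuming the HybridChunker shipped with recent Docling releases might look like this; the source path is a placeholder, and chunking settings are left at their defaults.

    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert("report.pdf").document  # placeholder path or URL
    chunker = HybridChunker()
    for chunk in chunker.chunk(doc):
        print(chunk.text[:80])  # each chunk is a context-aware piece ready for embedding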

    Performance and Benchmarking of the Docling Document Parser

    Docling’s performance has been tested on different systems, and the results are promising. On an x86 CPU, Docling processes a page in about 3.1 seconds, while on an Apple M3 Max SoC it takes around 1.27 seconds per page. With an Nvidia L4 GPU, Docling processes a page in about 0.49 seconds. The major contributor to the overall processing time is the OCR functionality, and the results show that enabling the GPU significantly accelerates the process.
    Docling stands out as a significantly faster solution compared to many of its open-source alternatives, making it a strong choice for document processing tasks.

    How Docling Differs from Other Python Document Parsers

    Docling differentiates itself from other document parsers through its combination of accuracy, speed, and accessibility. Its permissive MIT license allows organizations to integrate Docling freely into their solutions without incurring licensing fees. Additionally, it offers highly accurate, resource-efficient, and fast models. Docling stands out as a cost-effective, open-source library, offering a reliable solution for document conversion without hallucinations.

    Ecosystem and Integrations

    Docling is quickly becoming a mainstream package for document conversion. There are several integrations provided by the Docling team and the growing community. The native integration with LlamaIndex and LangChain helps it to provide support for various generative AI applications. Docling is integrated with the open IBM data-prep-kit which enables large-scale, multi-modal data processing. It is also integrated with InstructLab which helps to support the enhancement of knowledge taxonomy and the fine-tuning of LLMs.
    Docling is a system package in Red Hat® Enterprise Linux® AI distribution that helps develop, test, and run the Granite family of large language models.

    Future Development

    The Docling team is continuously working on improving the project. Future updates will include:

    • Equation and Code Extraction: Further enhancing Docling’s capabilities to extract equations and code snippets accurately.
    • Metadata Extraction: Improving metadata extraction, including titles, authors, references, and language.
    • Native LangChain Extension: Enhancing its native LangChain extension for better workflows.

    Conclusion

    Docling stands out as an exceptional Python document parser that addresses the challenges of data extraction from various document types. With its strong feature set, high performance, and integration capabilities, Docling is an indispensable tool for developers, researchers, and companies working with Gen AI and other data processing tasks. The ease of setup and the comprehensive features make it a must-have tool in your arsenal.

    Now is the perfect time to explore Docling and contribute to its growth. I highly encourage you to leverage its features, refer to its documentation, and share your feedback within the community.

  • LLM Continual Learning Without Fine-Tuning: The InCA Revolution

    LLM Continual Learning Without Fine-Tuning: The InCA Revolution

    The Challenge of LLM Continual Learning

    LLM continual learning is a complex issue. Large Language Models (LLMs) are powerful and can perform a huge range of tasks, yet they struggle with continual learning: the ability to learn new things without forgetting what they already know. Traditional methods rely on fine-tuning, which updates the LLM’s core parameters and tends to erase old knowledge as new tasks are learned. These problems make effective LLM continual learning a significant challenge, so new approaches are needed.

    Introducing InCA: A New Paradigm for LLM Continual Learning

    Enter InCA, or “In-context Continual Learning Assisted by an External Continual Learner”, a new paradigm for LLM continual learning. It avoids fine-tuning, relying instead on in-context learning and an external learner. In this system, the LLM is a black box whose parameters never change, while the external learner manages the learning process: it stores class statistics and selects the most relevant context for the LLM. This design prevents catastrophic forgetting and enables scalable LLM continual learning.

    How InCA Works and Achieves Effective LLM Continual Learning

    Overview of the InCA framework. The diagram depicts the stages of generating semantic tags for the input, identifying the most similar classes via the ECL, and constructing the prediction prompt with class summaries, which together enable efficient in-context continual learning without retaining any training data.

    InCA works in three steps:

    • Tag Generation: The system extracts semantic tags from the input text, including topics, keywords, and relevant entities. These tags capture the core meaning of the text and are generated by an LLM.
    • External Continual Learning (ECL): The tags are passed to the ECL, which identifies the most probable classes for each input without any gradient training. It represents each class with a Gaussian distribution over tag embeddings and uses the Mahalanobis distance to measure class similarity, efficiently selecting the most relevant context for the LLM.
    • In-context Learning with Class Summaries: A summary of each class is prepared when the class is added. At prediction time, the summaries of the top k classes are combined with the input test instance to form a prompt, and the LLM uses this prompt to predict the final class.

    InCA is entirely ‘replay-free’. It does not require storing previous task data. This makes it memory efficient.
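
    For orientation, here is a high-level sketch of those three steps as a single function; llm and embed are hypothetical callables wrapping your model and encoder, ecl is an external class store like the one sketched in the ECL article above, and class_summaries maps class names to their stored summaries. Nothing in it fine-tunes the model.

    def classify_with_inca(text, llm, embed, ecl, class_summaries, k=3):
        # 1. Tag generation: ask the LLM for topics, keywords, and entities.
        tags = llm(f"List short semantic tags (topics, keywords, entities) for:\n{text}")

        # 2. External continual learning: rank classes by Mahalanobis distance
        #    between the tag embedding and each class's Gaussian statistics.
        candidate_classes = ecl.top_k(embed(tags), k=k)

        # 3. In-context prediction: prompt the LLM with only the top-k class summaries.
        summaries = "\n".join(f"- {c}: {class_summaries[c]}" for c in candidate_classes)
        prompt = f"Classes:\n{summaries}\n\nText:\n{text}\n\nWhich class fits best?"
        return llm(prompt)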

    The Benefits of InCA for LLM Continual Learning

    InCA offers several benefits:

    • No Fine-Tuning: This saves significant computational resources. It also reduces the complexities associated with fine-tuning.
    • Avoids Catastrophic Forgetting: The external learner helps preserve previous knowledge.
    • Scalable Learning: InCA can handle an increasing number of tasks without issue. It avoids long prompts and the associated performance problems.
    • Efficient Context Selection: The ECL ensures the LLM only focuses on the most relevant information. This speeds up processing and improves accuracy.
    • Memory Efficient: InCA doesn’t require storing large amounts of previous training data.

    InCA’s Performance in LLM Continual Learning

    Research shows that InCA outperforms traditional continual learning methods. Fine-tuning approaches, like EWC and L2P, fall short of the performance achieved by InCA. InCA performs better than long-context LLMs. These results show the effectiveness of the external learner and the overall InCA approach.

    Key Takeaways

    InCA presents a significant advancement in continual learning for LLMs. It provides a more efficient and scalable approach. This approach could enable LLMs to adapt to new information more readily, and open up new possibilities for using them in diverse scenarios.

    Looking Ahead

    Although the early outcomes are quite encouraging, additional investigation is needed. In the future, researchers plan to explore how to apply InCA to various other NLP tasks. They also plan to improve InCA’s overall performance.

  • Garbage In, Garbage Out: Why Data Quality is the Cornerstone of AI Success

    Garbage In, Garbage Out: Why Data Quality is the Cornerstone of AI Success

    AI projects fail more often due to poor data quality than flawed algorithms. Learn why focusing on data cleansing, preparation, and governance is crucial for successful AI, Machine Learning, and Generative AI initiatives.

    We all know AI is the buzzword of the decade. From chatbots and virtual assistants to advanced predictive analytics, the possibilities seem limitless. But behind every successful AI application lies a critical, often overlooked, component: data.

    Wrong AI response and hallucination due to bad data

    It’s easy to get caught up in the excitement of cutting-edge algorithms and powerful models, but the reality is stark: if your data is poor, your AI will be poor. The old adage “Garbage In, Garbage Out” (GIGO) has never been more relevant than in the world of Artificial Intelligence. This isn’t just about missing values or misspellings; it’s about a fundamental understanding that data quality is the bedrock of any AI initiative.

    Why Data Quality Matters More Than You Think

    Data Flow for Good AI Response

    You might be thinking, “Yeah, yeah, data quality. I know.” But consider this:

    • Machine Learning & Model Accuracy: Machine learning models learn from data. If the data is biased, inconsistent, or inaccurate, the model will learn to make biased, inconsistent, and inaccurate predictions. No matter how sophisticated your model is, it won’t overcome flawed input.
    • Generative AI Hallucinations: Even the most impressive generative AI models can produce nonsensical outputs (known as “hallucinations”) when fed unreliable data. These models learn patterns from data, and if the underlying data is flawed, the patterns will be flawed too.
    • The Impact on Business Decisions: Ultimately, AI is meant to drive better business decisions. If the data underlying these decisions is unreliable, the outcomes will be detrimental, leading to missed opportunities, financial losses, and damage to reputation.
    • Increased Development Time & Costs: Debugging problems caused by bad data can consume vast amounts of development time. Identifying and correcting data quality issues is time-consuming and can require specialised expertise. This significantly increases project costs and delays time-to-market.

    Beyond the Basic Clean-Up

    Data quality goes beyond just removing duplicates and correcting spelling mistakes. It involves a comprehensive approach encompassing:

    • Completeness: Ensuring all relevant data is present. Are you missing vital fields? Are critical records incomplete?
    • Accuracy: Making sure data is correct and truthful. Are values consistent across different systems?
    • Consistency: Data should be uniform across your different sources.
    • Validity: Data should conform to defined rules and formats.
    • Timeliness: Keeping data up-to-date and relevant. Outdated data can lead to inaccurate results.
    • Data Governance: Implementing policies and processes to ensure data is managed effectively.

    Key Steps to Improve Data Quality for AI:

    1. Data Audit: Start by understanding your current data landscape. Where is your data coming from? What are the potential quality issues?
    2. Define Data Quality Metrics: Identify which aspects of data quality matter most for your specific AI use case.
    3. Data Cleansing & Preparation: Develop processes to correct errors, fill missing data, and transform data into a usable format.
    4. Implement Data Governance: Define clear ownership and responsibilities for data quality.
    5. Continuous Monitoring: Data quality is an ongoing process. Implement monitoring to identify and address issues proactively.
    6. Invest in Data Engineering: A team with experience in data processing and ETL pipelines is important for the success of the project (a minimal sketch of automated quality checks follows below).
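
    As referenced above, here is a minimal sketch of automated quality checks with pandas; the thresholds and column names (amount, updated_at) are illustrative and should be replaced by the metrics you defined in step 2.

    import pandas as pd

    def quality_report(df: pd.DataFrame) -> dict:
        report = {
            "completeness": float(1 - df.isna().mean().mean()),  # share of non-missing cells
            "duplicate_rows": int(df.duplicated().sum()),        # exact duplicate records
        }
        if "amount" in df:
            report["invalid_amounts"] = int((df["amount"] < 0).sum())          # validity rule
        if "updated_at" in df:
            age_days = (pd.Timestamp.now() - pd.to_datetime(df["updated_at"])).dt.days
            report["stale_records"] = int((age_days > 365).sum())              # timeliness rule
        return report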

    Don’t Neglect the Foundation

    AI has the potential to transform businesses, but its success hinges on the quality of its fuel: data. Instead of chasing the latest algorithms, make sure the data that feeds them is sound. Prioritizing data quality is not just a technical consideration; it’s a strategic imperative. By investing in a robust data foundation, you can unlock the true power of AI and realize its full potential. Remember, the best AI strategy always begins with the best data.