Preprocessing Unstructured Data for LLMs and RAG Systems

Packt via Coursera

Overview

Unlock the full potential of unstructured data by mastering preprocessing techniques for LLMs and Retrieval-Augmented Generation (RAG) systems. This course equips you with the skills to prepare unstructured data for advanced AI applications, ensuring high-quality input for improved outcomes. From understanding the complexities of data preprocessing to hands-on projects, you'll gain insights into current frameworks and tools.

Your journey begins with setting up a development environment, including API accounts and key integrations. You'll then dive into preprocessing unstructured data, tackling challenges such as data normalization, chunking, and metadata extraction. With the Unstructured framework, you'll preprocess HTML, PDF, and PPTX documents into well-structured data. The course emphasizes real-world applications, offering hands-on experience with semantic similarity, vector databases, and hybrid search strategies. You'll explore document layout detection techniques, using tools like Vision Transformers and LangChain to preprocess complex documents and extract meaningful insights. Finally, you'll apply all these skills to build a fully functional RAG system, integrating the learned techniques for dynamic data interaction.

This course is ideal for data engineers, AI practitioners, and developers looking to refine their preprocessing skills. Familiarity with Python and basic API usage is helpful; the course is aimed at intermediate learners and those seeking more advanced expertise.

Syllabus

  • Introduction
    • In this module, we will introduce you to the course, highlighting its goals, the skills and knowledge you'll need to succeed, and how the content is organized to guide you through the process of preparing unstructured data for large language models (LLMs) and retrieval-augmented generation (RAG) systems.
  • Development Environment Setup
    • In this module, we will guide you through setting up the necessary development environment, including creating and configuring API accounts, integrating the Unstructured framework, and performing a test run to ensure everything is operational before proceeding with data preprocessing tasks.
  • Data Preprocessing for LLMs - Deep Dive
    • In this module, we will explore the intricacies of data preprocessing for LLMs, delving into the challenges posed by unstructured data and the techniques required to overcome them. You'll learn about the entire workflow—from cleaning and normalizing data to structuring and chunking it—culminating in a comprehensive overview of the Unstructured framework.
  • Hands-on: The Unstructured Framework - Preprocessing HTML, PDFs & PPTX Documents
    • In this module, we will dive into hands-on exercises using the Unstructured framework to preprocess different document types. You'll explore the steps involved in extracting and normalizing data from PDFs, PPTX files, and HTML, and discover how these processes improve data quality for downstream use cases in LLMs and RAG systems.
  • Chunking and Metadata Extraction
    • In this module, we will focus on chunking and metadata extraction, exploring how to segment document content into logical units and enrich it with metadata for advanced applications like semantic similarity and hybrid search. Through hands-on activities, you’ll learn how to optimize document processing workflows, structure document elements effectively, and integrate results into a vector database.
  • Preprocessing Complex Documents - PDFs and Images
    • In this module, we will tackle the challenges of preprocessing complex documents, including PDFs and images, by leveraging document layout detection (DLD) and Vision Transformers (ViT). You’ll explore hands-on methods for extracting and summarizing table content, gain insights into preprocessing HTML and PDF files efficiently, and evaluate the trade-offs between different preprocessing techniques.
  • Build a RAG System Using Learned Techniques - Full Use Case
    • In this module, we will synthesize the skills and techniques learned throughout the course to build a complete RAG system. From preprocessing and structuring complex documents to creating a searchable database and enabling conversational interactions with your documents, you’ll gain hands-on experience in deploying an end-to-end solution tailored for real-world applications.
  • Wrap up
    • In this module, we will conclude the course by revisiting the major milestones and skills acquired. You’ll receive guidance on applying your knowledge to real-world scenarios and discover resources to continue your journey in advanced data preprocessing and RAG system development.
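
To give a feel for what the chunking and metadata extraction module covers, here is a minimal sketch in plain Python. It does not use the Unstructured framework's API: the `Chunk` class, the `chunk_by_section` function, the `(kind, text)` element format, and the `max_chars` limit are all hypothetical names chosen for illustration. The idea is to start a new chunk at each title, split a section that grows too long, and attach metadata (source, section title, position) to every chunk so it can later be filtered or ranked in a vector database.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of document text plus the metadata attached to it."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_section(elements, source, max_chars=500):
    """Group (kind, text) pairs into chunks, one or more per titled section.

    `elements` is an assumed, simplified stand-in for parsed document
    elements: kind is "title" or "text".
    """
    chunks = []
    section = None   # title of the section currently being collected
    buffer = []      # text pieces waiting to be emitted as one chunk

    def flush():
        # Emit the buffered text as a chunk, tagged with its metadata.
        if buffer:
            chunks.append(Chunk(
                text="\n".join(buffer),
                metadata={
                    "source": source,
                    "section": section,
                    "chunk_index": len(chunks),
                },
            ))
            buffer.clear()

    for kind, text in elements:
        if kind == "title":
            flush()          # a title closes the previous section
            section = text
            continue
        # Split the section if adding this piece would exceed the limit.
        if buffer and sum(len(t) for t in buffer) + len(text) > max_chars:
            flush()
        buffer.append(text)

    flush()                  # emit whatever is left at the end
    return chunks


elements = [
    ("title", "Intro"),
    ("text", "Hello world."),
    ("text", "More text."),
    ("title", "Setup"),
    ("text", "Install things."),
]
chunks = chunk_by_section(elements, source="doc.html")
small_chunks = chunk_by_section(elements, source="doc.html", max_chars=15)
```

With the default limit, each titled section becomes one chunk; with `max_chars=15`, the "Intro" section is split in two while both pieces keep the same `section` value in their metadata.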

Taught by

Packt - Course Instructors
