diff --git a/samples/multi-modal/img/multimodal-rag1.png b/samples/multi-modal/img/multimodal-rag1.png new file mode 100644 index 00000000..3592ca2d Binary files /dev/null and b/samples/multi-modal/img/multimodal-rag1.png differ diff --git a/samples/multi-modal/img/vectordb.png b/samples/multi-modal/img/vectordb.png new file mode 100644 index 00000000..7cd82268 Binary files /dev/null and b/samples/multi-modal/img/vectordb.png differ diff --git a/samples/multi-modal/multimodal_rag_with_nova.ipynb b/samples/multi-modal/multimodal_rag_with_nova.ipynb new file mode 100644 index 00000000..6894f4e2 --- /dev/null +++ b/samples/multi-modal/multimodal_rag_with_nova.ipynb @@ -0,0 +1,655 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Multimodal RAG with Amazon Bedrock, Amazon Nova and LangChain" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebook demonstrates how to implement a multi-modal Retrieval-Augmented Generation (RAG) system using **Amazon Bedrock with Amazon Nova and LangChain**. Many documents contain a mixture of content types, including text and images. Traditional RAG applications often lose valuable information captured in images. With the emergence of Multimodal Large Language Models (MLLMs), we can now leverage both text and image data in our RAG systems.\n", + "\n", + "In this notebook, we'll explore one approach to multi-modal RAG (`Option 1`):\n", + "\n", + "1. Use multimodal embeddings (such as Amazon Titan) to embed both images and text\n", + "2. Retrieve relevant information using similarity search\n", + "3. Pass raw images and text chunks to a multimodal LLM for answer synthesis using [Amazon Nova](https://aws.amazon.com/ai/generative-ai/nova/)\n", + "\n", + "We'll use the following tools and technologies:\n", + "\n", + "- [LangChain](https://python.langchain.com/docs/introduction/) to build a multimodal RAG system\n", + "- [faiss](https://github.com/facebookresearch/faiss) for similarity search\n", + "- [Amazon Nova](https://docs.aws.amazon.com/nova/latest/userguide/what-is-nova.html ) for answer synthesis\n", + "- [Amazon Titan Multimodal Embeddings](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-multiemb-models.html) for image embeddings\n", + "- [Amazon Bedrock](https://aws.amazon.com/bedrock/) for accessing powerful AI models, like the ones above\n", + "- [pymupdf](https://pymupdf.readthedocs.io/en/latest/) to parse images, text, and tables from documents (PDFs)\n", + "- [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) for interacting with Amazon Bedrock\n", + "\n", + "This approach allows us to create a more comprehensive RAG system that can understand and utilize both textual and visual information from our documents.\n", + "\n", + "## Prerequisites\n", + "\n", + "Before running this notebook, ensure you have the following packages and dependencies installed:\n", + "\n", + "- Python 3.10 or later\n", + "- langchain\n", + "- boto3\n", + "- faiss\n", + "- pymupdf\n", + "- tabula\n", + "- tesseract\n", + "- requests\n", + "\n", + "Let's get started with building our multi-modal RAG system using Amazon Bedrock!" 
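Before diving in, it can help to confirm that the AWS environment is wired up. The short sketch below is an addition (not part of the original notebook): it only checks that boto3 resolves a region and can construct a Bedrock runtime client; access to Amazon Nova and Titan Multimodal Embeddings still has to be granted separately in the Amazon Bedrock console.

```python
import boto3

# Confirm a region is configured and a Bedrock runtime client can be created.
# Model access (Nova Pro, Titan Multimodal Embeddings) is enabled in the Bedrock console.
session = boto3.session.Session()
print("Configured region:", session.region_name)

bedrock_runtime = boto3.client("bedrock-runtime")
print("Bedrock runtime client ready in:", bedrock_runtime.meta.region_name)
```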
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "![Multimodal RAG with Amazon Bedrock](img/multimodal-rag1.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Importing the libs" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# !pip install --upgrade jpype1 tabula-py PyMuPDF\n", + "# !pip install --upgrade boto3 requests numpy tqdm botocore ipython\n", + "# !pip install --upgrade faiss-cpu\n", + "# !pip install --upgrade langchain-aws\n", + "# !pip install --upgrade langchain-text-splitters" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "import boto3\n", + "import tabula\n", + "import faiss\n", + "import json\n", + "import base64\n", + "import pymupdf\n", + "import requests\n", + "import os\n", + "import logging\n", + "import numpy as np\n", + "import warnings\n", + "from tqdm import tqdm\n", + "from botocore.exceptions import ClientError\n", + "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", + "from IPython import display\n", + "\n", + "\n", + "logger = logging.getLogger(__name__)\n", + "logger.setLevel(logging.ERROR)\n", + "\n", + "warnings.filterwarnings(\"ignore\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Data Loading" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Downloading the dataset - URL of the \"Attention Is All You Need\" paper (Replace it with the URL of the PDF file/dataset you want to download)\n", + "url = \"https://arxiv.org/pdf/1706.03762.pdf\"\n", + "\n", + "# Set the filename and filepath\n", + "filename = \"attention_paper.pdf\"\n", + "filepath = os.path.join(\"data\", filename)\n", + "\n", + "# Create the data directory if it doesn't exist\n", + "os.makedirs(\"data\", exist_ok=True)\n", + "\n", + "# Download the file\n", + "response = requests.get(url)\n", + "if response.status_code == 200:\n", + " with open(filepath, 'wb') as file:\n", + " file.write(response.content)\n", + " print(f\"File downloaded successfully: {filepath}\")\n", + "else:\n", + " print(f\"Failed to download the file. 
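One small caveat about the download cell: `requests.get` is called without a timeout, so a stalled connection can hang the notebook. A slightly more defensive variant is sketched below under the same assumptions (`url` and `filepath` as defined above); only the error handling changes.

```python
# Defensive download: bounded wait time and an exception on HTTP errors.
response = requests.get(url, timeout=30)
response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses

with open(filepath, "wb") as file:
    file.write(response.content)
print(f"File downloaded successfully: {filepath}")
```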
Status code: {response.status_code}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Display the PDF file\n", + "display.IFrame(filepath, width=1000, height=600)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Data Extraction" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "# Create the directories\n", + "def create_directories(base_dir):\n", + " directories = [\"images\", \"text\", \"tables\", \"page_images\"]\n", + " for dir in directories:\n", + " os.makedirs(os.path.join(base_dir, dir), exist_ok=True)\n", + "\n", + "# Process tables\n", + "def process_tables(doc, page_num, base_dir, items):\n", + " try:\n", + " tables = tabula.read_pdf(filepath, pages=page_num + 1, multiple_tables=True)\n", + " if not tables:\n", + " return\n", + " for table_idx, table in enumerate(tables):\n", + " table_text = \"\\n\".join([\" | \".join(map(str, row)) for row in table.values])\n", + " table_file_name = f\"{base_dir}/tables/{os.path.basename(filepath)}_table_{page_num}_{table_idx}.txt\"\n", + " with open(table_file_name, 'w') as f:\n", + " f.write(table_text)\n", + " items.append({\"page\": page_num, \"type\": \"table\", \"text\": table_text, \"path\": table_file_name})\n", + " except Exception as e:\n", + " print(f\"Error extracting tables from page {page_num}: {str(e)}\")\n", + "\n", + "# Process text chunks\n", + "def process_text_chunks(text, text_splitter, page_num, base_dir, items):\n", + " chunks = text_splitter.split_text(text)\n", + " for i, chunk in enumerate(chunks):\n", + " text_file_name = f\"{base_dir}/text/{os.path.basename(filepath)}_text_{page_num}_{i}.txt\"\n", + " with open(text_file_name, 'w') as f:\n", + " f.write(chunk)\n", + " items.append({\"page\": page_num, \"type\": \"text\", \"text\": chunk, \"path\": text_file_name})\n", + "\n", + "# Process images\n", + "def process_images(page, page_num, base_dir, items):\n", + " images = page.get_images()\n", + " for idx, image in enumerate(images):\n", + " xref = image[0]\n", + " pix = pymupdf.Pixmap(doc, xref)\n", + " image_name = f\"{base_dir}/images/{os.path.basename(filepath)}_image_{page_num}_{idx}_{xref}.png\"\n", + " pix.save(image_name)\n", + " with open(image_name, 'rb') as f:\n", + " encoded_image = base64.b64encode(f.read()).decode('utf8')\n", + " items.append({\"page\": page_num, \"type\": \"image\", \"path\": image_name, \"image\": encoded_image})\n", + "\n", + "# Process page images\n", + "def process_page_images(page, page_num, base_dir, items):\n", + " pix = page.get_pixmap()\n", + " page_path = os.path.join(base_dir, f\"page_images/page_{page_num:03d}.png\")\n", + " pix.save(page_path)\n", + " with open(page_path, 'rb') as f:\n", + " page_image = base64.b64encode(f.read()).decode('utf8')\n", + " items.append({\"page\": page_num, \"type\": \"page\", \"path\": page_path, \"image\": page_image})\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "doc = pymupdf.open(filepath)\n", + "num_pages = len(doc)\n", + "base_dir = \"data\"\n", + "\n", + "# Creating the directories\n", + "create_directories(base_dir)\n", + "text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=200, length_function=len)\n", + "items = []\n", + "\n", + "# Process each page of the PDF\n", + "for page_num in tqdm(range(num_pages), desc=\"Processing PDF 
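One subtlety in the helpers above: `process_tables` and `process_images` read the module-level `filepath` and `doc` rather than relying only on their parameters, so they silently depend on globals that are set in the processing cell. As an illustration, a variant of `process_images` with explicit parameters is sketched below (the name `process_images_explicit` is ours, not part of the original notebook).

```python
def process_images_explicit(doc, page, page_num, base_dir, filepath, items):
    """Same behaviour as process_images above, but with doc and filepath passed in explicitly."""
    for idx, image in enumerate(page.get_images()):
        xref = image[0]
        pix = pymupdf.Pixmap(doc, xref)
        image_name = f"{base_dir}/images/{os.path.basename(filepath)}_image_{page_num}_{idx}_{xref}.png"
        pix.save(image_name)
        with open(image_name, "rb") as f:
            encoded_image = base64.b64encode(f.read()).decode("utf8")
        items.append({"page": page_num, "type": "image", "path": image_name, "image": encoded_image})
```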
pages\"):\n", + " page = doc[page_num]\n", + " text = page.get_text()\n", + " process_tables(doc, page_num, base_dir, items)\n", + " process_text_chunks(text, text_splitter, page_num, base_dir, items)\n", + " process_images(page, page_num, base_dir, items)\n", + " process_page_images(page, page_num, base_dir, items)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Looking at the first text item\n", + "[i for i in items if i['type'] == 'text'][0]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# Looking at the first table item\n", + "[i for i in items if i['type'] == 'table'][0]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# Looking at the first image item\n", + "[i for i in items if i['type'] == 'image'][0]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Generating Multimodal Embeddings" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Generating Multimodal Embeddings using Amazon Titan Multimodal Embeddings model\n", + "def generate_multimodal_embeddings(prompt=None, image=None, output_embedding_length=384):\n", + " \"\"\"\n", + " Invoke the Amazon Titan Multimodal Embeddings model using Amazon Bedrock runtime.\n", + "\n", + " Args:\n", + " prompt (str): The text prompt to provide to the model.\n", + " image (str): A base64-encoded image data.\n", + " Returns:\n", + " str: The model's response embedding.\n", + " \"\"\"\n", + " if not prompt and not image:\n", + " raise ValueError(\"Please provide either a text prompt, base64 image, or both as input\")\n", + " \n", + " # Initialize the Amazon Bedrock runtime client\n", + " client = boto3.client(service_name=\"bedrock-runtime\")\n", + " model_id = \"amazon.titan-embed-image-v1\"\n", + " \n", + " body = {\"embeddingConfig\": {\"outputEmbeddingLength\": output_embedding_length}}\n", + " \n", + " if prompt:\n", + " body[\"inputText\"] = prompt\n", + " if image:\n", + " body[\"inputImage\"] = image\n", + "\n", + " try:\n", + " response = client.invoke_model(\n", + " modelId=model_id,\n", + " body=json.dumps(body),\n", + " accept=\"application/json\",\n", + " contentType=\"application/json\"\n", + " )\n", + "\n", + " # Process and return the response\n", + " result = json.loads(response.get(\"body\").read())\n", + " return result.get(\"embedding\")\n", + "\n", + " except ClientError as err:\n", + " print(f\"Couldn't invoke Titan embedding model. 
Error: {err.response['Error']['Message']}\")\n", + " return None" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Set embedding vector dimension\n", + "embedding_vector_dimension = 384\n", + "\n", + "# Count the number of each type of item\n", + "item_counts = {\n", + " 'text': sum(1 for item in items if item['type'] == 'text'),\n", + " 'table': sum(1 for item in items if item['type'] == 'table'),\n", + " 'image': sum(1 for item in items if item['type'] == 'image'),\n", + " 'page': sum(1 for item in items if item['type'] == 'page')\n", + "}\n", + "\n", + "# Initialize counters\n", + "counters = dict.fromkeys(item_counts.keys(), 0)\n", + "\n", + "# Generate embeddings for all items\n", + "with tqdm(total=len(items), desc=\"Generating embeddings\", bar_format=\"{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}]\") as pbar:\n", + " for item in items:\n", + " item_type = item['type']\n", + " counters[item_type] += 1\n", + " \n", + " if item_type in ['text', 'table']:\n", + " # For text or table, use the formatted text representation\n", + " item['embedding'] = generate_multimodal_embeddings(prompt=item['text'], output_embedding_length=embedding_vector_dimension)\n", + " else:\n", + " # For images, use the base64-encoded image data\n", + " item['embedding'] = generate_multimodal_embeddings(image=item['image'], output_embedding_length=embedding_vector_dimension)\n", + " \n", + " # Update the progress bar\n", + " pbar.set_postfix_str(f\"Text: {counters['text']}/{item_counts['text']}, Table: {counters['table']}/{item_counts['table']}, Image: {counters['image']}/{item_counts['image']}\")\n", + " pbar.update(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Creating Vector Database/Index" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Vector Database](img/vectordb.png)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# All the embeddings\n", + "all_embeddings = np.array([item['embedding'] for item in items])\n", + "\n", + "# Create FAISS Index\n", + "index = faiss.IndexFlatL2(embedding_vector_dimension)\n", + "\n", + "# Clear any pre-existing index\n", + "index.reset()\n", + "\n", + "# Add embeddings to the index\n", + "index.add(np.array(all_embeddings, dtype=np.float32))" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_aws import ChatBedrock\n", + "\n", + "# Generating RAG response with Amazon Nova\n", + "def invoke_nova_multimodal(prompt, matched_items):\n", + " \"\"\"\n", + " Invoke the Amazon Nova model.\n", + " \"\"\"\n", + "\n", + "\n", + " # Define your system prompt(s).\n", + " system_msg = [\n", + " { \"text\": \"\"\"You are a helpful assistant for question answering. \n", + " The text context is relevant information retrieved. 
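Because every chunk, table, and image triggers one Bedrock embedding call, it can be worth persisting the FAISS index and the item metadata so later sessions can skip re-embedding. A small sketch follows; the file names are assumptions.

```python
# Persist the FAISS index and the items (minus the embedding vectors) for reuse.
faiss.write_index(index, "data/multimodal_index.faiss")
with open("data/items.json", "w") as f:
    json.dump([{k: v for k, v in item.items() if k != "embedding"} for item in items], f)

# In a later session:
# index = faiss.read_index("data/multimodal_index.faiss")
# with open("data/items.json") as f:
#     items = json.load(f)
```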
\n", + " The provided image(s) are relevant information retrieved.\"\"\"}\n", + " ]\n", + "\n", + " # Define one or more messages using the \"user\" and \"assistant\" roles.\n", + " message_content = []\n", + "\n", + " for item in matched_items:\n", + " if item['type'] == 'text' or item['type'] == 'table':\n", + " message_content.append({\"text\": item['text']})\n", + " else:\n", + " message_content.append({\"image\": {\n", + " \"format\": \"png\",\n", + " \"source\": {\"bytes\": item['image']},\n", + " }\n", + " })\n", + "\n", + "\n", + " # Configure the inference parameters.\n", + " inf_params = {\"max_new_tokens\": 300, \n", + " \"top_p\": 0.9, \n", + " \"top_k\": 20, \n", + " \"temperature\": 0.7}\n", + "\n", + " # Define the final message list\n", + " message_list = [\n", + " {\"role\": \"user\", \"content\": message_content}\n", + " ]\n", + " \n", + " # Adding the prompt to the message list\n", + " message_list.append({\"role\": \"user\", \"content\": [{\"text\": prompt}]})\n", + "\n", + " native_request = {\n", + " \"messages\": message_list,\n", + " \"system\": system_msg,\n", + " \"inferenceConfig\": inf_params,\n", + " }\n", + "\n", + " # Initialize the Amazon Bedrock runtime client\n", + " model_id = \"amazon.nova-pro-v1:0\"\n", + " client = ChatBedrock(model_id=model_id)\n", + "\n", + " # Invoke the model and extract the response body.\n", + " response = client.invoke(json.dumps(native_request))\n", + " model_response = response.content\n", + " \n", + " return model_response\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Test the RAG Pipeline" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "# User Query\n", + "query = \"Which optimizer was used when training the models?\"\n", + "\n", + "# Generate embeddings for the query\n", + "query_embedding = generate_multimodal_embeddings(prompt=query,output_embedding_length=embedding_vector_dimension)\n", + "\n", + "# Search for the nearest neighbors in the vector database\n", + "distances, result = index.search(np.array(query_embedding, dtype=np.float32).reshape(1,-1), k=5)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Check the result (matched chunks)\n", + "result.flatten()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Retrieve the matched items\n", + "matched_items = [{k: v for k, v in items[index].items() if k != 'embedding'} for index in result.flatten()]\n", + "\n", + "# Generate RAG response with Amazon Nova\n", + "response = invoke_nova_multimodal(query, matched_items)\n", + "\n", + "display.Markdown(response)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# `Your Turn`: Test the RAG Pipeline" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# List of queries (Replace with any query of your choice)\n", + "other_queries = [\"How long were the base and big models trained?\",\n", + " \"Which optimizer was used when training the models?\",\n", + " \"What is the position-wise feed-forward neural network mentioned in the paper?\",\n", + " \"What is the BLEU score of the model in English to German translation (EN-DE)?\",\n", + " \"How is the 
scaled dot-product attention calculated?\",\n", + " ]\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "query = other_queries[0] # Replace with any query from the list above\n", + "\n", + "# Generate embeddings for the query\n", + "query_embedding = generate_multimodal_embeddings(prompt=query,output_embedding_length=embedding_vector_dimension)\n", + "\n", + "# Search for the nearest neighbors in the vector database\n", + "distances, result = index.search(np.array(query_embedding, dtype=np.float32).reshape(1,-1), k=5)\n", + "\n", + "# Retrieve the matched items\n", + "matched_items = [{k: v for k, v in items[index].items() if k != 'embedding'} for index in result.flatten()]\n", + "\n", + "# Generate RAG response with Amazon Nova\n", + "response = invoke_nova_multimodal(query, matched_items)\n", + "\n", + "# Display the response\n", + "display.Markdown(response)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Thank you!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "lc-aws", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}
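Finally, since the embed-search-synthesize steps are repeated for every question, they can be folded into one helper for further experimentation. This is a sketch that only composes names already defined above (`generate_multimodal_embeddings`, `index`, `items`, `invoke_nova_multimodal`, `embedding_vector_dimension`).

```python
def answer_query(query, k=5):
    """Embed the query, retrieve the k nearest chunks/images from the FAISS index,
    and ask Amazon Nova to synthesize an answer from the matched items."""
    query_embedding = generate_multimodal_embeddings(
        prompt=query, output_embedding_length=embedding_vector_dimension
    )
    _, result = index.search(np.array(query_embedding, dtype=np.float32).reshape(1, -1), k=k)
    matched = [
        {key: value for key, value in items[idx].items() if key != "embedding"}
        for idx in result.flatten()
    ]
    return invoke_nova_multimodal(query, matched)

# Example usage:
# display.Markdown(answer_query("What is multi-head attention?"))
```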