Showing 16 changed files with 6,411 additions and 3 deletions.
```tsx
export default function Footer() {
  return (
    <div style={{ display: "flex", flexDirection: "column", alignItems: "center" }}>
      <div>Released under the MIT License.</div>
      <div>
        Built by <a href="https://www.raidguild.org/" target="_blank" rel="noopener noreferrer">RaidGuild.org</a>
      </div>
    </div>
  );
}
```
# RAG API Pipeline Architecture Overview

## Introduction

This document provides an overview of the RAG (Retrieval-Augmented Generation) API Pipeline architecture. The system is designed to extract, process, and store data from the Boardroom Governance API, creating a knowledge base that can be queried using natural language processing techniques.

## Diagram

<iframe src="https://mermaid.live/view#pako:eNplVF1vozAQ_CuWXy8l4SPQoNNJiZqmVZM2Ou7pnD44eAOogJEx7eWa_PdbQ0lzqSUke3Z2Zz22eaexFEBDmihepeTXbFMSHHWz7YBpprZ7DWxDP2bkG1lznb7x_YY-d2Qz1iu2zirIsxLIipfZDmr9TK6ufhxsyyYKuDiQ6ZJN1_dLyQWos9SniD1VUGKIRBXEfZZzyvqkTpdt0MGSCZSguIYDiVYsko2KgRS98GdGtGozXMyoNVf6QH5OFww_0rd7RjawYXstWyos_jBjD6V8y0EkQGa8Pqc_zD7YTs_GAl0YSnFhJIYiUK-g0Eqj0263Bf7zcblcMfxIgaeSn5s0ve9cujeZZ4H5is2LLQiRlcmXrItmZgs2k1wJJWVBFhK1S16ib6ai2Up_2h27d2NsjcgOuG4UEOilDp8b6tgfZvhIjkpe1anUXzgXy8UjW_CMP4ImayW1jGVOcPEm1cvzRQL5btpbPHbw3S27a5LE7PiWx4C9WyhtLtYXyZP_S0hOPrRG2wxFY6jrc_cddsM1J5E5znPcZfM_2tiVt4eWxX0Ua9IBLUAVPBP4jN4NvKE6hQI2NMSp4OplQzflEXm80TLalzENtWpgQJVskpSGO57XuGoqgRf6JuPYcXGBzkWGPZ3AvH1ENHynel-1zzerNQrEstxlicEblSOcal3V4XBowlaS6bTZWrEshnUmUnwN6evEH_qOf80dF_zA5WPXFfHWnlzvHM_eiWBkO5wejwMKrf6q-1e0v4wBrXj5G2_SqSlcG-U_NPS8seU7I8_3XMcbjQN_PKB7GrquhdBoPPa8IJg49gQL_20rjKxrN_B8Z-Lb2Ebg-sHxHxjOZcQ"
  width="100%"
  height="900px">
</iframe>

[Full Image](https://res.cloudinary.com/dwx9alovg/image/upload/v1725856208/rag-pipeline/toms0dzpmfbrmuw87xmd.png)
## Components

### 1. Pipeline Manifest

- A YAML file that defines the configuration settings and API endpoints for extraction.
- Read at the start of the pipeline process (step 1.1).
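A minimal sketch of what such a manifest might look like; the field names below are illustrative, not the project's actual schema:

```yaml
# Hypothetical manifest sketch; consult the repository for the real schema
api_name: boardroom_governance
api_parameters:
  protocol_cname: aave
endpoints:
  protocols:
    path: /protocols/{cname}
  proposals:
    path: /protocols/{cname}/proposals
```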
### 2. OpenAPI Spec

- A YAML file containing the OpenAPI specification for the Boardroom Governance API.
- Read by the APILoader component (step 1.2).
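For reference, an OpenAPI 3 spec has the following general shape; the path and descriptions below are abbreviated illustrations, not Boardroom's actual spec:

```yaml
openapi: 3.0.0
info:
  title: Boardroom Governance API
  version: "1.0"
paths:
  /proposals:
    get:
      summary: List governance proposals
      responses:
        "200":
          description: A list of proposals
```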
### 3. APILoader

- Reads the Pipeline Manifest and OpenAPI Spec.
- Generates a Source Manifest (step 2) based on the input configurations.

### 4. Source Manifest

- A YAML file generated by the APILoader.
- Contains detailed information about the data sources and extraction parameters.

### 5. Boardroom Governance API

- The primary data source for the pipeline.
- Data is extracted from this API (step 4).

### 6. Airbyte + Pathway

- Airbyte is used for data extraction and initial processing.
- Pathway is used for data transformation and pipelining.
- These components work together to process the extracted data (step 5).

### 7. RAG Pipeline

- Consists of several sub-steps:
  - Preprocessing
  - Normalization
  - Semantic chunking
  - Feature embeddings
- Processes the data extracted by Airbyte (step 5).
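To illustrate the chunking sub-step, here is a naive fixed-size chunker with overlap in Python. This is a simplified stand-in: the actual pipeline performs semantic chunking, which splits on meaning rather than on word counts.

```python
def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word-window chunks (naive stand-in
    for semantic chunking)."""
    words = text.split()
    step = max(1, max_words - overlap)  # guard against a non-positive step
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
        if start + max_words >= len(words):
            break  # the remaining words are already covered by this chunk
    return chunks
```

Overlapping windows help keep context that straddles a chunk boundary retrievable from either side.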
### 8. Qdrant Vector Store

- A vector database used to store the processed and embedded data.
- Data is stored here after processing (step 5.5).
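Conceptually, the vector store maps embeddings to payloads and answers nearest-neighbor queries by similarity. A tiny in-memory sketch of that behavior (Qdrant itself is far more capable, with persistence, filtering, and approximate search):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class InMemoryVectorStore:
    """Toy stand-in for a vector database such as Qdrant."""
    def __init__(self):
        self.points = []  # list of (id, vector, payload)

    def upsert(self, pid: str, vector: list[float], payload: dict) -> None:
        self.points.append((pid, vector, payload))

    def search(self, query: list[float], limit: int = 1,
               score_threshold: float = 0.0) -> list[tuple[str, float, dict]]:
        scored = [(cosine(query, v), pid, payload)
                  for pid, v, payload in self.points]
        scored.sort(key=lambda t: t[0], reverse=True)
        return [(pid, s, p) for s, pid, p in scored[:limit]
                if s >= score_threshold]
```

The `qdrant_limit` and `qdrant_score_threshold` settings in the node configuration later in these docs play the same roles as `limit` and `score_threshold` here.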
### 9. RAG API Server

- Hosts the following components:
  - LLM (Language Model)
  - Embedding model
  - OpenAI API integration
- Interfaces with the Qdrant Vector Store to retrieve relevant information.
- Connects to the GaiaNet Protocol Network.

### 10. OpenAI API

- Used by the RAG API Server for advanced natural language processing tasks.

### 11. GaiaNet Protocol Network

- The broader network that the RAG API Server interfaces with.
- Consists of multiple Gaia Nodes.

## Process Flow

1. The pipeline starts by reading the Pipeline Manifest (1.1) and the OpenAPI Spec (1.2).
2. The APILoader generates a Source Manifest based on these inputs.
3. The pipeline begins data extraction from the Boardroom Governance API.
4. Extracted data is processed through the Airbyte + Pathway components.
5. The RAG Pipeline performs preprocessing, normalization, semantic chunking, and feature embedding.
6. Processed data is stored in the Qdrant Vector Store.
7. The RAG API Server can now access this data to respond to queries.
8. The RAG API Server may use the OpenAI API for additional processing or generation tasks.
9. The RAG API Server interfaces with the GaiaNet Protocol Network to provide its services.
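The extraction-to-embedding part of this flow can be sketched as a composition of stage functions. The function bodies below are illustrative stand-ins, not the project's actual implementation:

```python
def extract(api_name: str) -> list[dict]:
    # Stand-in for Airbyte extraction from the Boardroom Governance API
    return [{"id": 1, "text": "  Proposal One  "}]

def normalize(records: list[dict]) -> list[dict]:
    # Stand-in normalization: trim whitespace and lowercase the text
    return [{**r, "text": r["text"].strip().lower()} for r in records]

def chunk(records: list[dict]) -> list[str]:
    # Real pipeline: semantic chunking; here, one chunk per record
    return [r["text"] for r in records]

def embed(chunks: list[str]) -> list[tuple[str, list[float]]]:
    # Real pipeline: feature embeddings; here, a dummy 1-d vector
    return [(c, [float(len(c))]) for c in chunks]

def run_pipeline(api_name: str = "boardroom") -> list[tuple[str, list[float]]]:
    # Steps 3-6 of the process flow, composed end to end
    return embed(chunk(normalize(extract(api_name))))
```

Each stage consumes the previous stage's output, which is why the CLI (documented below) can resume from cached normalized or chunked data.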
## Using Pre-generated Snapshots and Models

- You can also use models and snapshots supported by a GaiaNet node by defining them in the GaiaNet node config file.
- Check out the [GaiaNet docs](http://docs.gaianet.ai) for more information.
# RAG API Pipeline CLI Documentation

## Overview

The CLI tool provides functionality for running a RAG (Retrieval-Augmented Generation) API pipeline.
It offers various commands to execute different stages of the pipeline, from data extraction to embedding generation.

## Installation

This project uses Poetry for dependency management. To install the project and all its dependencies:

1. Ensure you have Poetry installed. If not, install it by following the instructions at https://python-poetry.org/docs/#installation
2. Clone the repository:
   ```
   git clone https://github.com/raid-guild/gaianet-rag-api-pipeline
   cd gaianet-rag-api-pipeline
   ```
3. Install dependencies using Poetry:
   ```
   poetry install
   ```

This will create a virtual environment and install all necessary dependencies specified in the `pyproject.toml` file.

## Usage

To run any command, use the `poetry run` prefix:

```
poetry run rag-api-pipeline [OPTIONS] COMMAND [ARGS]...
```

## Global Options

- `--debug`: Enable logging at debug level.
|
||
### 1. run-all | ||
|
||
Run the complete RAG API pipeline. | ||
|
||
``` | ||
poetry run rag-api-pipeline run-all [OPTIONS] API_MANIFEST_FILE | ||
``` | ||
|
||
#### Arguments | ||
|
||
- `API_MANIFEST_FILE`: Pipeline YAML manifest that defines the Pipeline config settings and API endpoints to extract. | ||
|
||
#### Options | ||
|
||
- `--llm-provider [ollama|openai]`: Embedding model provider (default: openai) | ||
- `--api-key TEXT`: API Auth key | ||
- `--openapi-spec-file FILE`: OpenAPI YAML spec file (default: config/openapi.yaml) | ||
- `--source-manifest-file FILE`: Source YAML manifest | ||
- `--full-refresh`: Clean up cache and extract API data from scratch | ||
- `--normalized-only`: Run pipeline until the normalized data stage | ||
- `--chunked-only`: Run pipeline until the chunked data stage | ||
|
||
### 2. from-normalized | ||
|
||
Execute the RAG API pipeline from normalized data. | ||
|
||
``` | ||
poetry run rag-api-pipeline from-normalized [OPTIONS] API_MANIFEST_FILE | ||
``` | ||
|
||
#### Arguments | ||
|
||
- `API_MANIFEST_FILE`: Pipeline YAML manifest that defines the Pipeline config settings and API endpoints to extract. | ||
|
||
#### Options | ||
|
||
- `--llm-provider [ollama|openai]`: Embedding model provider (default: openai) | ||
- `--normalized-data-file FILE`: Normalized data in JSONL format (required) | ||
|
||
### 3. from-chunked | ||
|
||
Execute the RAG API pipeline from (cached) data chunks. | ||
|
||
``` | ||
poetry run rag-api-pipeline from-chunked [OPTIONS] API_MANIFEST_FILE | ||
``` | ||
|
||
#### Arguments | ||
|
||
- `API_MANIFEST_FILE`: Pipeline YAML manifest that defines the Pipeline config settings and API endpoints to extract. | ||
|
||
#### Options | ||
|
||
- `--llm-provider [ollama|openai]`: Embedding model provider (default: openai) | ||
- `--chunked-data-file FILE`: Chunked data in JSONL format (required) | ||
|
||
## Environment Variables | ||
|
||
- `BOARDROOM_API_KEY`: Can be used to set the API key instead of passing it as a command-line option. | ||
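For example, in a POSIX shell (the key value below is a placeholder):

```shell
# Set the API key via the environment instead of passing --api-key
export BOARDROOM_API_KEY="your-api-key"
```

Subsequent `poetry run rag-api-pipeline` invocations in the same shell session can then pick the key up from the environment.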
## Examples

1. Run the complete pipeline:
   ```
   poetry run rag-api-pipeline run-all config/api_pipeline.yaml --llm-provider ollama
   ```
2. Run the pipeline from normalized data:
   ```
   poetry run rag-api-pipeline from-normalized config/api_pipeline.yaml --normalized-data-file path/to/normalized_data.jsonl --llm-provider ollama
   ```
3. Run the pipeline from chunked data:
   ```
   poetry run rag-api-pipeline from-chunked config/api_pipeline.yaml --chunked-data-file path/to/chunked_data.jsonl --llm-provider ollama
   ```

## Notes

- The CLI uses the `click` library for command-line interface creation.
- It integrates with a custom logging setup and uses the `codetiming` library for performance timing.
- The pipeline is built using the `pathway` library for data processing.
- Make sure to properly configure your API manifest file and OpenAPI spec file before running the pipeline.
- Using Poetry for dependency management ensures consistent environments across different setups.
- Always use `poetry run` to execute CLI commands within the Poetry environment.
# RAG API Pipeline Deployment

## Overview

This page outlines the deployment process for the AI pipeline, including the necessary components and their configurations.

## Deployment Process

### Local development

Start by building your containers: `docker compose -f local.yml build`.

You are now ready to start developing your application.
Define your custom logic in `gaianet_rag_api_pipeline/pipeline.py`. It already contains sample code that sums all the input values.

You can test it in the following modes:

- [debug (batch mode)] Run your Pathway app code with pytest: `docker compose -f local.yml run --rm pathway_app pytest`
- [streaming] Run your Pathway app: `docker compose -f local.yml up`. Modify `InfiniteStream` in `gaianet_rag_api_pipeline/input.py` to feed it with different data. The results are streamed to the `output.csv` file (you can change this in `gaianet_rag_api_pipeline/output.py`).

### Production environment

The production environment streams data from `redpanda`.
Build production containers with `docker compose -f prod.yml build`.

To run your application:
1. `docker compose -f prod.yml rm -svf` to clean the state so that `redpanda` can start without issues
2. `docker compose -f prod.yml up`

For testing, you can push messages to redpanda. Run
`docker compose -f prod.yml exec redpanda rpk topic create gaianet_rag_api_pipeline` to make sure the topic is created,
then `docker compose -f prod.yml exec redpanda rpk topic produce gaianet_rag_api_pipeline`

and type in messages, e.g.:
`{"value":10}`
## Working with a Snapshot

You can optionally use Ollama to generate a snapshot and work with it on your GaiaNet node.

### Setting up Ollama

Download and install Ollama from the official [website](https://ollama.com/download).

### Setting up the Embeddings Model with Ollama

1. Download the preferred embedding model from [HuggingFace](https://huggingface.co/gaianet/Nomic-embed-text-v1.5-Embedding-GGUF/resolve/main/nomic-embed-text-v1.5.f16.gguf)
2. Create a `Modelfile` to use the embedding model with Ollama:

```
# Path to the downloaded embedding model
FROM ./nomic-embed-text-v1.5.f16.gguf
```

Learn more about the Ollama Modelfile format [here](https://github.com/ollama/ollama/blob/main/docs/modelfile.md).

To use the new Modelfile, save it as a file (e.g. `Modelfile`) and run the following commands:

```bash
ollama create choose-a-model-name -f ./Modelfile
ollama run choose-a-model-name
```
### Set up a Qdrant vector DB instance

- Run the following command to start a Qdrant vector DB instance (make sure the Docker daemon is running):

```bash
docker run -p 6333:6333 -p 6334:6334 -v ./qdrant_dev:/qdrant/storage:z qdrant/qdrant:v1.10.1
```

### Generating a Snapshot

### Using the Snapshot

To use the generated snapshot with a GaiaNet node without running the full pipeline, edit the node config file and add the following (the value can be an HTTP URL or a local path):

```
"snapshot": "/your-snapshot-path-or-url",
```

After adding the snapshot to the config file, run the following commands to start the node:

```bash
gaianet init
gaianet start
```
## Recommended GaiaNet Node Configuration

- Tested on a Mac Studio with 32 GB RAM.
- `embedding_collection_name` must match the name of the collection in the snapshot.

```json
{
  "address": "your-node-address",
  "chat": "https://huggingface.co/gaianet/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",
  "chat_batch_size": "64",
  "chat_ctx_size": "8192",
  "chat_name": "Boardroom-Llama-3-Chat",
  "description": "Llama-3-chat model with Boardroom API snapshot",
  "domain": "us.gaianet.network",
  "embedding": "https://huggingface.co/gaianet/Nomic-embed-text-v1.5-Embedding-GGUF/resolve/main/nomic-embed-text-v1.5.f16.gguf",
  "embedding_batch_size": "2048",
  "embedding_collection_name": "boardroom_api_collection",
  "embedding_ctx_size": "2048",
  "embedding_name": "Nomic-embed-text-v1.5",
  "llamaedge_port": "8080",
  "prompt_template": "llama-3-chat",
  "qdrant_limit": "1",
  "qdrant_score_threshold": "0.5",
  "rag_policy": "system-message",
  "rag_prompt": "Use the following pieces of context to answer the user's question. Respond directly to the user with your answer; do not say 'this is the answer' or similar language. Never mention your knowledge base or say 'according to the context' or 'hypothetical' or other similar language. Use the JSON metadata included in the knowledge base whenever possible to enrich your answers. The term aave refers to the DAO protocol where discussions and proposals are posted. If you don't know the answer, don't try to make up an answer. \n----------------\n",
  "reverse_prompt": "",
  "snapshot": "/your-snapshot-path-or-url",
  "system_prompt": "You are an AI assistant designed to provide clear, concise, and accurate answers to user queries. Your primary functions include retrieving relevant information from the provided RAG (Retrieval-Augmented Generation) data and utilizing your pre-training data when necessary. Use the JSON metadata included in the RAG data whenever possible to enrich your answers. The term aave refers to the DAO protocol where discussions and proposals are posted. If no relevant information is found, inform the user that you are not familiar with the topic."
}
```