Commit: update docs

santteegt committed Sep 18, 2024
1 parent 458fd2d commit 4e903e2
Showing 7 changed files with 281 additions and 77 deletions.
221 changes: 177 additions & 44 deletions README.md
# GaiaNet x RAG API Pipeline

`rag-api-pipeline` is a Python-based data pipeline tool that allows you to easily generate a vector knowledge base from any REST API data source. The
resulting database snapshot can then be plugged into a Gaia node's LLM model, alongside a prompt, to provide contextual responses to user queries using RAG
(Retrieval-Augmented Generation).

The following sections help you quickly set up and execute the pipeline on your REST API. If you're looking for more in-depth information about how to use
the tool, its tech stack, and/or how it works under the hood, check the content menu on the left.

## System Requirements

- Poetry ([Docs](https://python-poetry.org/docs/))
- Python 3.11.x
- (Optional): a Python virtual environment manager of your preference (e.g. conda, venv)
- Docker and Docker Compose
- Qdrant vector database ([Docs](https://qdrant.tech/documentation/))
- LLM model provider (either a Gaianet node or Ollama)

## Setup Instructions

### 1. Clone this repository

Git clone or download this repository to your local machine.
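
For example (the repository URL and target folder below are placeholders — use the actual URL of this repository):

```bash [terminal]
# clone the repository and enter its root folder
git clone <repository-url> gaianet-rag-api-pipeline
cd gaianet-rag-api-pipeline
```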

### 2. Activate your virtual environment

If you're using a custom virtual environment, activate it now; otherwise, Poetry will handle the environment for you.
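
As a minimal sketch using the built-in `venv` module (assuming Python 3.11 is available as `python3.11`):

```bash [terminal]
# create and activate a virtual environment (optional; Poetry can manage this for you)
python3.11 -m venv .venv
source .venv/bin/activate
```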

### 3. Install project dependencies

Navigate to the directory where this repository was cloned/downloaded and execute the following in a terminal:

```bash [terminal]
poetry install
```

### 4. Set environment variables

Copy `config/.env/sample` to a `config/.env` file and set the environment variables accordingly. Check the [environment variables](#environment-variables) section
for details.
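
For instance, assuming you're at the repository root (the sample path follows the naming used above):

```bash [terminal]
# copy the sample file and then edit the values to match your setup
cp config/.env/sample config/.env
```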

### 5. Define your API Pipeline manifest

Define the pipeline manifest for the REST API you're looking to extract data from. Check how to define an API pipeline manifest in
[Defining an API Pipeline Manifest](/manifest-definition) for details, or take a look at the in-depth review of the sample manifests available in
[API Examples](/examples).

### 6. Set the REST API Key

Set the REST API key in a `config/secrets/api_key` file, or specify it using the `--api-key` CLI argument.
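
A quick sketch of both options (the key value and manifest path are placeholders):

```bash [terminal]
# option 1: store the key in the secrets file the CLI reads by default
mkdir -p config/secrets
echo "<YOUR_API_KEY>" > config/secrets/api_key

# option 2: pass it explicitly on each run
poetry run rag-api-pipeline run-all config/api_pipeline.yaml --api-key <YOUR_API_KEY>
```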

### 7. Set up a Qdrant DB instance

Get the base URL of your Qdrant vector DB, or deploy a local `Qdrant` ([Docs](https://qdrant.tech/documentation/)) vector database instance using Docker:

```bash [terminal]
# IMPORTANT: make sure you use `qdrant:v1.10.1` for compatibility with Gaianet node
docker run -p 6333:6333 -p 6334:6334 -v ./qdrant_dev:/qdrant/storage:z qdrant/qdrant:v1.10.1
```
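
Once the container is up, you can sanity-check that the instance is reachable. The `/collections` endpoint is part of Qdrant's REST API and should return an empty collection list on a fresh instance:

```bash [terminal]
# verify the Qdrant REST API is responding on the default port
curl http://localhost:6333/collections
```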

### 8. Select and set up an LLM provider

Get your Gaianet node running ([Docs](https://docs.gaianet.ai/node-guide/quick-start)), or install the Ollama ([Docs](https://ollama.com/)) provider locally.
The latter is recommended if you're looking to run the pipeline on consumer hardware.
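
If you go with Ollama, a quick way to confirm the service is available locally (a sketch using the standard Ollama CLI):

```bash [terminal]
# start the Ollama service (skip if it already runs as a background service)
ollama serve
# in a separate terminal, confirm the service responds
ollama list
```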

### 9. Load an LLM embeddings model

Load the LLM embeddings model of your preference into the LLM provider you chose in the previous step:
- You can find info on how to customize a Gaianet node [here](https://docs.gaianet.ai/node-guide/customize)
- If you chose Ollama, follow these instructions to import the LLM embeddings model:
- Make sure the Ollama service is up and running
- Go to the folder where the embeddings model is located. For this example, the LLM model file is `nomic-embed-text-v1.5.f16.gguf` ([Source](https://huggingface.co/gaianet/Nomic-embed-text-v1.5-Embedding-GGUF/tree/main?show_file_info=nomic-embed-text-v1.5.f16.gguf))
- Create a file with name `Modelfile` and paste the following (replace `<path/to/model>` with your local directory):

```docker
FROM <path/to/model>/nomic-embed-text-v1.5.f16.gguf
```
- Import the model by running the following command on a terminal:
```bash [terminal]
ollama create Nomic-embed-text-v1.5
```
- Verify the model settings by running the following command:
```bash [terminal]
ollama show Nomic-embed-text-v1.5
```

## Pipeline CLI

Now you're ready to use the `rag-api-pipeline` CLI commands to execute different tasks of the RAG pipeline, from extracting data from an API source to generating vector embeddings
and a database snapshot. If you need more details about the parameters available on each command, you can execute:

```bash [Terminal]
poetry run rag-api-pipeline <command> --help
```

### CLI available commands

Below you can find the available commands and an in-depth review of both the functionality and the arguments that each command offers:

```bash [Terminal]
# run the entire pipeline
poetry run rag-api-pipeline run-all API_MANIFEST_FILE --openapi-spec-file <openapi-spec-yaml-file> [--full-refresh] [--llm-provider openai|ollama]
# or run using an already normalized dataset
poetry run rag-api-pipeline from-normalized API_MANIFEST_FILE --normalized-data-file <jsonl-file> [--llm-provider openai|ollama]
# or run using an already chunked dataset
poetry run rag-api-pipeline from-chunked API_MANIFEST_FILE --chunked-data-file <jsonl-file> [--llm-provider openai|ollama]
```

- **run-all**: executes the entire RAG data pipeline, including API endpoint data stream extraction, data normalization, data chunking, vector embeddings and
database snapshot generation. You can specify the following arguments to the command:
* `API_MANIFEST_FILE`: API pipeline manifest file (mandatory)
* `--llm-provider [ollama|openai]`: backend embeddings model provider. default: openai-like backend (e.g. gaia rag-api-server)
* `--api-key`: API Auth key. If not specified, it will try to get it from `config/secrets/api_key`
* `--openapi-spec-file`: API OpenAPI YAML spec file. Defaults to `config/openapi.yaml`
* `--source-manifest-file`: Airbyte API Connector YAML manifest. If specified, it will omit the API Connector manifest generation step.
* `--full-refresh`: clean up cache and extract API data from scratch.
* `--normalized-only`: run pipeline until the data normalization stage.
* `--chunked-only`: run pipeline until the data chunking stage.

- **from-normalized**: executes the RAG data pipeline using an already normalized JSONL dataset. You can specify the following arguments to the command:
* `API_MANIFEST_FILE`: API pipeline manifest file (mandatory)
* `--llm-provider [ollama|openai]`: backend embeddings model provider. default: openai-like backend (e.g. gaia rag-api-server)
* `--normalized-data-file`: path to the normalized dataset in JSONL format (mandatory). Check the [Architecture](/architecture) section for details on the
required data schema.

- **from-chunked**: executes the RAG data pipeline using an already chunked dataset in JSONL format. You can specify the following arguments to the command:
* `API_MANIFEST_FILE`: API pipeline manifest file (mandatory)
* `--llm-provider [ollama|openai]`: backend embeddings model provider. default: openai-like backend (e.g. gaia rag-api-server)
* `--chunked-data-file`: path to the chunked dataset in JSONL format (mandatory). Check the [Architecture](/architecture) section for details on the
required data schema.
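
As a concrete illustration, a full run against a manifest stored in `config/api_pipeline.yaml` might look like this (file paths are illustrative and depend on your setup):

```bash [Terminal]
# full pipeline run using a local Ollama provider, re-extracting API data from scratch
poetry run rag-api-pipeline run-all config/api_pipeline.yaml \
  --openapi-spec-file config/openapi.yaml \
  --llm-provider ollama \
  --full-refresh
```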

## CLI Output

Cached API stream data and results produced from running any of the CLI commands are stored in `<OUTPUT_FOLDER>/<api_name>`. The following files and folders
are created by the tool within this `baseDir` folder:

- `{baseDir}/cache/{api_name}/*`: extracted API data is cached into a local DuckDB. Database files are stored in this directory. If the `--full-refresh` argument
is specified to the `run-all` command, the cache will be cleared and API data will be extracted from scratch.
- `{baseDir}/{api_name}_stream_{x}_preprocessed.jsonl`: data streams from each API endpoint are preprocessed and stored in JSONL format
- `{baseDir}/{api_name}_normalized.jsonl`: preprocessed data streams from each API endpoint are joined together and stored in JSONL format
- `{baseDir}/{api_name}_chunked.jsonl`: normalized data that goes through the data chunking stage is then stored in JSONL format
- `{baseDir}/{api_name}_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot`: vector embeddings snapshot file that was exported from Qdrant DB
- `{baseDir}/{api_name}_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot.tar.gz`: compressed knowledge base that contains the vector embeddings snapshot

## Environment variables

The following environment variables can be adjusted in `config/.env` based on user needs:

- Pipeline config parameters:
- `API_DATA_ENCODING` (="utf-8"): default data encoding used by the REST API
- `OUTPUT_FOLDER` (="./output"): output folder where cached data streams, intermediary stage results and generated knowledge base snapshot are stored
- LLM provider settings:
- `LLM_API_BASE_URL` (="http://localhost:8080/v1"): LLM provider base URL (defaults to a local OpenAI-compatible provider, e.g. a Gaia node)
- `LLM_API_KEY` (="empty-api-key"): API key to authenticate requests to the LLM provider
- `LLM_EMBEDDINGS_MODEL` (="Nomic-embed-text-v1.5"): name of the embeddings model to be consumed through the LLM provider
- `LLM_EMBEDDINGS_VECTOR_SIZE` (=768): embeddings vector size
- `LLM_PROVIDER` (="openai"): LLM provider backend to use. It can be either `openai` or `ollama` (Gaianet offers an OpenAI-compatible API)
- Qdrant DB settings:
- `QDRANTDB_URL` (="http://localhost:6333"): Qdrant DB base URL
- `QDRANTDB_TIMEOUT` (=60): timeout for requests made to the Qdrant DB
- `QDRANTDB_DISTANCE_FN` (="COSINE"): score function to use during vector similarity search. Available functions: ['COSINE', 'EUCLID', 'DOT', 'MANHATTAN']
- Pathway-related variables:
- `AUTOCOMMIT_DURATION_MS` (=1000): the maximum time between two commits. Every autocommit_duration_ms milliseconds, the updates received by the connector are
committed automatically and pushed into Pathway's dataflow. More info can be found [here](https://pathway.com/developers/user-guide/connect/connectors/custom-python-connectors#connector-method-reference)
- `FixedDelayRetryStrategy` ([docs](https://pathway.com/developers/api-docs/udfs#pathway.udfs.FixedDelayRetryStrategy)) config parameters:
- `PATHWAY_RETRY_MAX_ATTEMPTS` (=10): max retries to be performed if a UDF async execution fails
- `PATHWAY_RETRY_DELAY_MS` (=1000): delay in milliseconds to wait for the next execution attempt
- *UDF async execution*: sets the maximum number of concurrent operations per batch during UDF async execution. Zero means no specific limit. Be careful when setting
these parameters for the embeddings stage, as too many concurrent requests could break the LLM provider
- `CHUNKING_BATCH_CAPACITY` (=0): max number of concurrent operations during data chunking operations
- `EMBEDDINGS_BATCH_CAPACITY` (=15): max number of concurrent operations during vector embeddings operations
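
Putting the defaults above together, a minimal `config/.env` might look like this (values shown are the documented defaults; adjust them to your setup):

```bash [terminal]
# config/.env — minimal sketch based on the defaults documented above
API_DATA_ENCODING="utf-8"
OUTPUT_FOLDER="./output"
LLM_API_BASE_URL="http://localhost:8080/v1"
LLM_API_KEY="empty-api-key"
LLM_EMBEDDINGS_MODEL="Nomic-embed-text-v1.5"
LLM_EMBEDDINGS_VECTOR_SIZE=768
LLM_PROVIDER="openai"
QDRANTDB_URL="http://localhost:6333"
QDRANTDB_TIMEOUT=60
QDRANTDB_DISTANCE_FN="COSINE"
AUTOCOMMIT_DURATION_MS=1000
PATHWAY_RETRY_MAX_ATTEMPTS=10
PATHWAY_RETRY_DELAY_MS=1000
CHUNKING_BATCH_CAPACITY=0
EMBEDDINGS_BATCH_CAPACITY=15
```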


## Using Docker Compose for local development or in production

TBD

- Start with building your containers: `docker compose -f local.yml build`.

- Build production containers with `docker compose -f prod.yml build`

- To run your application invoke:
1. `docker compose -f prod.yml rm -svf`
2. `docker compose -f prod.yml up`

## Troubleshooting

### Workarounds for missing dependencies

- If the `pillow-heif` module is missing when trying to install it:
  - Add the following flag: `export CFLAGS="-Wno-nullability-completeness"`
- Libraries required to get libmagic working:
- MacOS:
- `brew install libmagic`
- `pip install python-magic-bin`

## License

[MIT](LICENSE)

## Authors

🛠️ Built 🛠️ with ❤️ by [RaidGuild](https://www.raidguild.org/)
7 changes: 7 additions & 0 deletions docs/pages/apis.mdx
# API Examples

The repository already includes API pipeline manifest definitions for generating knowledge bases from a few REST APIs.

## Boardroom Governance API

## Optimism Agora API
3 changes: 3 additions & 0 deletions docs/pages/apis/agora-api.mdx
# Optimism Agora API

TBD
3 changes: 3 additions & 0 deletions docs/pages/apis/boardroom-api.mdx
# Boardroom Governance API

TBD
5 changes: 0 additions & 5 deletions docs/pages/examples.mdx

This file was deleted.

