Commit: update docs

santteegt committed Sep 18, 2024
1 parent 458fd2d commit 4e903e2
Showing 7 changed files with 281 additions and 77 deletions.
221 changes: 177 additions & 44 deletions README.md
# GaiaNet x RAG API Pipeline

`rag-api-pipeline` is a Python-based data pipeline tool that allows you to easily generate a vector knowledge base from any REST API data source. The
resulting database snapshot can then be plugged into a Gaia node's LLM model, alongside a prompt, to provide contextual responses to user queries using RAG
(Retrieval-Augmented Generation).

The following sections help you quickly set up and execute the pipeline on your REST API. If you're looking for more in-depth information about how to use
the tool, its tech stack, and/or how it works under the hood, check the content menu on the left.

## System Requirements

- Poetry ([Docs](https://python-poetry.org/docs/))
- Python 3.11.x
- (Optional): a Python virtual environment manager of your preference (e.g. conda, venv)
- Docker and Docker Compose
- Qdrant vector database ([Docs](https://qdrant.tech/documentation/))
- LLM model provider (either a Gaianet node or Ollama)

## Setup Instructions

### 1. Clone this repository

Git clone or download this repository to your local machine.
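
For example (the repository URL and target folder below are placeholders — use the actual URL of this repository):

```bash [terminal]
# clone the repository and enter its root folder
git clone <repository-url> gaianet-rag-api-pipeline
cd gaianet-rag-api-pipeline
```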

### 2. Activate your virtual environment

If you're using a custom virtual environment, activate it now; otherwise, Poetry will handle the environment for you.
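
As a minimal sketch using the built-in `venv` module (assuming Python 3.11 is available as `python3.11`):

```bash [terminal]
# create and activate a virtual environment (optional; Poetry can manage this for you)
python3.11 -m venv .venv
source .venv/bin/activate
```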

### 3. Install project dependencies

Navigate to the directory where this repository was cloned/downloaded and execute the following in a terminal:

```bash [terminal]
poetry install
```

### 4. Set environment variables

Copy `config/.env/sample` to a `config/.env` file and set the environment variables accordingly. Check the [environment variables](#environment-variables) section
for details.
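
For instance, assuming you're at the repository root (the sample path follows the naming used above):

```bash [terminal]
# copy the sample file and then edit the values to match your setup
cp config/.env/sample config/.env
```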

### 5. Define your API Pipeline manifest

Define the pipeline manifest for the REST API you're looking to extract data from. Check how to define an API pipeline manifest in
[Defining an API Pipeline Manifest](/manifest-definition) for details, or take a look at the in-depth review of the sample manifests available in
[API Examples](/examples).

### 6. Set the REST API Key

Set the REST API key in a `config/secrets/api_key` file, or specify it using the `--api-key` CLI argument.
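
A quick sketch of both options (the key value and manifest path are placeholders):

```bash [terminal]
# option 1: store the key in the secrets file the CLI reads by default
mkdir -p config/secrets
echo "<YOUR_API_KEY>" > config/secrets/api_key

# option 2: pass it explicitly on each run
poetry run rag-api-pipeline run-all config/api_pipeline.yaml --api-key <YOUR_API_KEY>
```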

### 7. Set up a Qdrant DB instance

Get the base URL of your Qdrant vector DB, or deploy a local `Qdrant` ([Docs](https://qdrant.tech/documentation/)) vector database instance using Docker:

```bash [terminal]
# IMPORTANT: make sure you use `qdrant:v1.10.1` for compatibility with Gaianet node
docker run -p 6333:6333 -p 6334:6334 -v ./qdrant_dev:/qdrant/storage:z qdrant/qdrant:v1.10.1
```
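
Once the container is up, you can sanity-check that the instance is reachable. The `/collections` endpoint is part of Qdrant's REST API and should return an empty collection list on a fresh instance:

```bash [terminal]
# verify the Qdrant REST API is responding on the default port
curl http://localhost:6333/collections
```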

### 8. Select and set up an LLM provider

Get your Gaianet node running ([Docs](https://docs.gaianet.ai/node-guide/quick-start)), or install the Ollama ([Docs](https://ollama.com/)) provider locally.
The latter is recommended if you're looking to run the pipeline on consumer hardware.
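
If you go with Ollama, a quick way to confirm the service is available locally (a sketch using the standard Ollama CLI):

```bash [terminal]
# start the Ollama service (skip if it already runs as a background service)
ollama serve
# in a separate terminal, confirm the service responds
ollama list
```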

### 9. Load an LLM embeddings model

Load the LLM embeddings model of your preference into the LLM provider you chose in the previous step:
- You can find info on how to customize a Gaianet node [here](https://docs.gaianet.ai/node-guide/customize)
- If you chose Ollama, follow these instructions to import the LLM embeddings model:
- Make sure the Ollama service is up and running
- Go to the folder where the embeddings model is located. For this example, the LLM model file is `nomic-embed-text-v1.5.f16.gguf` ([Source](https://huggingface.co/gaianet/Nomic-embed-text-v1.5-Embedding-GGUF/tree/main?show_file_info=nomic-embed-text-v1.5.f16.gguf))
- Create a file with name `Modelfile` and paste the following (replace `<path/to/model>` with your local directory):

```docker
FROM <path/to/model>/nomic-embed-text-v1.5.f16.gguf
```
- Import the model by running the following command on a terminal:
```bash [terminal]
ollama create Nomic-embed-text-v1.5
```
- Verify the model settings by running the following command:
```bash [terminal]
ollama show Nomic-embed-text-v1.5
```

## Pipeline CLI

Now you're ready to use the `rag-api-pipeline` CLI commands to execute different tasks of the RAG pipeline, from extracting data from an API source to generating vector embeddings
and a database snapshot. If you need more details about the parameters available on each command, you can execute:

```bash [Terminal]
poetry run rag-api-pipeline <command> --help
```

### CLI available commands

Below you can find the available commands and an in-depth review of both the functionality and the arguments that each command offers:

```bash [Terminal]
# run the entire pipeline
poetry run rag-api-pipeline run-all API_MANIFEST_FILE --openapi-spec-file <openapi-spec-yaml-file> [--full-refresh] [--llm-provider openai|ollama]
# or run using an already normalized dataset
poetry run rag-api-pipeline from-normalized API_MANIFEST_FILE --normalized-data-file <jsonl-file> [--llm-provider openai|ollama]
# or run using an already chunked dataset
poetry run rag-api-pipeline from-chunked API_MANIFEST_FILE --chunked-data-file <jsonl-file> [--llm-provider openai|ollama]
```

- **run-all**: executes the entire RAG data pipeline, including API endpoint data stream extraction, data normalization, data chunking, vector embeddings and
database snapshot generation. You can specify the following arguments to the command:
* `API_MANIFEST_FILE`: API pipeline manifest file (mandatory)
* `--llm-provider [ollama|openai]`: backend embeddings model provider. default: openai-like backend (e.g. gaia rag-api-server)
* `--api-key`: API Auth key. If not specified, it will try to get it from `config/secrets/api_key`
* `--openapi-spec-file`: API OpenAPI YAML spec file. Defaults to `config/openapi.yaml`
* `--source-manifest-file`: Airbyte API Connector YAML manifest. If specified, it will omit the API Connector manifest generation step.
* `--full-refresh`: clean up cache and extract API data from scratch.
* `--normalized-only`: run pipeline until the data normalization stage.
* `--chunked-only`: run pipeline until the data chunking stage.

- **from-normalized**: executes the RAG data pipeline using an already normalized JSONL dataset. You can specify the following arguments to the command:
* `API_MANIFEST_FILE`: API pipeline manifest file (mandatory)
* `--llm-provider [ollama|openai]`: backend embeddings model provider. default: openai-like backend (e.g. gaia rag-api-server)
* `--normalized-data-file`: path to the normalized dataset in JSONL format (mandatory). Check the [Architecture](/architecture) section for details on the
required data schema.

- **from-chunked**: executes the RAG data pipeline using an already chunked dataset in JSONL format. You can specify the following arguments to the command:
* `API_MANIFEST_FILE`: API pipeline manifest file (mandatory)
* `--llm-provider [ollama|openai]`: backend embeddings model provider. default: openai-like backend (e.g. gaia rag-api-server)
* `--chunked-data-file`: path to the chunked dataset in JSONL format (mandatory). Check the [Architecture](/architecture) section for details on the
required data schema.
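
As a concrete illustration, a full run against a manifest stored in `config/api_pipeline.yaml` might look like this (file paths are illustrative and depend on your setup):

```bash [Terminal]
# full pipeline run using a local Ollama provider, re-extracting API data from scratch
poetry run rag-api-pipeline run-all config/api_pipeline.yaml \
  --openapi-spec-file config/openapi.yaml \
  --llm-provider ollama \
  --full-refresh
```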

## CLI Output

Cached API stream data and results produced from running any of the CLI commands are stored in `<OUTPUT_FOLDER>/<api_name>`. The following files and folders
are created by the tool within this `baseDir` folder:

- `{baseDir}/cache/{api_name}/*`: extracted API data is cached into a local DuckDB. Database files are stored in this directory. If the `--full-refresh` argument
is specified to the `run-all` command, the cache will be cleared and API data will be extracted from scratch.
- `{baseDir}/{api_name}_stream_{x}_preprocessed.jsonl`: data streams from each API endpoint are preprocessed and stored in JSONL format
- `{baseDir}/{api_name}_normalized.jsonl`: preprocessed data streams from each API endpoint are joined together and stored in JSONL format
- `{baseDir}/{api_name}_chunked.jsonl`: normalized data that goes through the data chunking stage is then stored in JSONL format
- `{baseDir}/{api_name}_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot`: vector embeddings snapshot file that was exported from Qdrant DB
- `{baseDir}/{api_name}_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot.tar.gz`: compressed knowledge base that contains the vector embeddings snapshot

## Environment variables

The following environment variables can be adjusted in `config/.env` based on user needs:

- Pipeline config parameters:
- `API_DATA_ENCODING` (="utf-8"): default data encoding used by the REST API
- `OUTPUT_FOLDER` (="./output"): output folder where cached data streams, intermediary stage results and generated knowledge base snapshot are stored
- LLM provider settings:
- `LLM_API_BASE_URL` (="http://localhost:8080/v1"): LLM provider base URL (defaults to a local OpenAI-compatible provider, e.g. a Gaia node)
- `LLM_API_KEY` (="empty-api-key"): API key to authenticate requests to the LLM provider
- `LLM_EMBEDDINGS_MODEL` (="Nomic-embed-text-v1.5"): name of the embeddings model to be consumed through the LLM provider
- `LLM_EMBEDDINGS_VECTOR_SIZE` (=768): embeddings vector size
- `LLM_PROVIDER` (="openai"): LLM provider backend to use. It can be either `openai` or `ollama` (Gaianet offers an OpenAI-compatible API)
- Qdrant DB settings:
- `QDRANTDB_URL` (="http://localhost:6333"): Qdrant DB base URL
- `QDRANTDB_TIMEOUT` (=60): timeout for requests made to the Qdrant DB
- `QDRANTDB_DISTANCE_FN` (="COSINE"): score function to use during vector similarity search. Available functions: ['COSINE', 'EUCLID', 'DOT', 'MANHATTAN']
- Pathway-related variables:
- `AUTOCOMMIT_DURATION_MS` (=1000): the maximum time between two commits. Every autocommit_duration_ms milliseconds, the updates received by the connector are
committed automatically and pushed into Pathway's dataflow. More info can be found [here](https://pathway.com/developers/user-guide/connect/connectors/custom-python-connectors#connector-method-reference)
- `FixedDelayRetryStrategy` ([docs](https://pathway.com/developers/api-docs/udfs#pathway.udfs.FixedDelayRetryStrategy)) config parameters:
- `PATHWAY_RETRY_MAX_ATTEMPTS` (=10): max retries to be performed if a UDF async execution fails
- `PATHWAY_RETRY_DELAY_MS` (=1000): delay in milliseconds to wait for the next execution attempt
- *UDF async execution*: sets the maximum number of concurrent operations per batch during UDF async execution. Zero means no specific limit. Be careful when setting
these parameters for the embeddings stage, as too many concurrent requests could break the LLM provider
- `CHUNKING_BATCH_CAPACITY` (=0): max number of concurrent operations during data chunking operations
- `EMBEDDINGS_BATCH_CAPACITY` (=15): max number of concurrent operations during vector embeddings operations
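
Putting the defaults above together, a minimal `config/.env` might look like this (values shown are the documented defaults; adjust them to your setup):

```bash [terminal]
# config/.env — minimal sketch based on the defaults documented above
API_DATA_ENCODING="utf-8"
OUTPUT_FOLDER="./output"
LLM_API_BASE_URL="http://localhost:8080/v1"
LLM_API_KEY="empty-api-key"
LLM_EMBEDDINGS_MODEL="Nomic-embed-text-v1.5"
LLM_EMBEDDINGS_VECTOR_SIZE=768
LLM_PROVIDER="openai"
QDRANTDB_URL="http://localhost:6333"
QDRANTDB_TIMEOUT=60
QDRANTDB_DISTANCE_FN="COSINE"
AUTOCOMMIT_DURATION_MS=1000
PATHWAY_RETRY_MAX_ATTEMPTS=10
PATHWAY_RETRY_DELAY_MS=1000
CHUNKING_BATCH_CAPACITY=0
EMBEDDINGS_BATCH_CAPACITY=15
```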


## Using Docker Compose for local development or in production

TBD

- Start with building your containers: `docker compose -f local.yml build`.

- Build production containers with `docker compose -f prod.yml build`

- To run your application invoke:
1. `docker compose -f prod.yml rm -svf`
2. `docker compose -f prod.yml up`

## Troubleshooting

### Workarounds for missing dependencies

- If the `pillow-heif` module is missing when trying to install it:
  - Add the following flag: `export CFLAGS="-Wno-nullability-completeness"`
- Libraries required to get libmagic working:
- MacOS:
- `brew install libmagic`
- `pip install python-magic-bin`

## License

[MIT](LICENSE)

## Authors

🛠️ Built 🛠️ with ❤️ by [RaidGuild](https://www.raidguild.org/)
7 changes: 7 additions & 0 deletions docs/pages/apis.mdx
# API Examples

The repository already includes API pipeline manifest definitions for generating knowledge bases from a few REST APIs.

## Boardroom Governance API

## Optimism Agora API
3 changes: 3 additions & 0 deletions docs/pages/apis/agora-api.mdx
# Optimism Agora API

TBD
3 changes: 3 additions & 0 deletions docs/pages/apis/boardroom-api.mdx
# Boardroom Governance API

TBD
5 changes: 0 additions & 5 deletions docs/pages/examples.mdx

This file was deleted.

