Commit: replaced gaianet with gaia
wtfsayo committed Nov 26, 2024
1 parent 5f57fb3 commit 6f5d489
Showing 10 changed files with 100 additions and 98 deletions.
22 changes: 11 additions & 11 deletions README.md
@@ -1,4 +1,4 @@
-# GaiaNet x RAG API Pipeline
+# Gaia x RAG API Pipeline

`rag-api-pipeline` is a Python-based data pipeline tool that allows you to easily generate a vector knowledge base from any REST API data source. The resulting database snapshot can then be plugged into a Gaia node's LLM model with a prompt and provide contextual responses to user queries using RAG (Retrieval Augmented Generation).

@@ -11,7 +11,7 @@ The following sections help you to quickly setup and execute the pipeline on you
- (Optional): a Python virtual environment manager of your preference (e.g. conda, venv)
- Qdrant vector database ([Docs](https://qdrant.tech/documentation/))
  - (Optional): Docker to spin up a local container (a sample command is sketched just after this list)
-- LLM model provider ([spin up your own Gaia node](docs/pages/cli/node-deployment.mdx) or pick one from the [GaiaNet public network](https://www.gaianet.ai/chat))
+- LLM model provider ([spin up your own Gaia node](docs/pages/cli/node-deployment.mdx) or pick one from the [Gaia public network](https://www.gaianet.ai/chat))
- An Embeddings model (e.g. [Nomic-embed-text-v1.5](https://huggingface.co/gaianet/Nomic-embed-text-v1.5-Embedding-GGUF/tree/main?show_file_info=nomic-embed-text-v1.5.f16.gguf))
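
If you opt for the local Qdrant container, here is a minimal sketch using the official image from the Qdrant quick start (ports and storage path are Qdrant's documented defaults; adjust them to your environment):

```bash
# Run a local Qdrant instance: REST API on 6333, gRPC on 6334;
# data is persisted to ./qdrant_storage (an arbitrary host path)
docker run -p 6333:6333 -p 6334:6334 \
    -v "$(pwd)/qdrant_storage:/qdrant/storage" \
    qdrant/qdrant
```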

## Setup Instructions
@@ -24,9 +24,9 @@ Git clone or download this repository to your local machine.
```bash
git clone https://github.com/raid-guild/gaianet-rag-api-pipeline.git
```

### 2. Install the Pipeline CLI

It is recommended to activate your [own virtual environment](https://python-poetry.org/docs/basic-usage/#using-your-virtual-environment).
Then, navigate to the directory where this repository was cloned/downloaded and execute the following command to install the `rag-api-pipeline` CLI:

```bash
# (install command collapsed in this diff view)
```

@@ -53,24 +53,24 @@
```bash
rag-api-pipeline run all config/boardroom_api_pipeline.yaml config/boardroom_openapi.yaml
```

You are required to specify two main arguments to the pipeline:
- The path to the OpenAPI specification file (e.g. `config/boardroom_openapi.yaml`): the OpenAPI spec for the REST API data source
you're looking to extract data from.
- The path to the API pipeline manifest file (e.g. `config/boardroom_api_pipeline.yaml`): a YAML file that defines the API endpoints you're
looking to extract data from, among other parameters (more details in the next section).

Once the pipeline execution is completed, you'll find the vector database snapshot and extracted/processed datasets under the `output/molochdao_boardroom_api` folder.

## Define your own API Pipeline manifest

Now it's time to define the pipeline manifest for the REST API you're looking to extract data from. Make sure you get the OpenAPI specification
for the API you're targeting. Check the
[Defining an API Pipeline Manifest](docs/pages/manifest-definition/overview.mdx) page for details on how to get the OpenAPI spec and define an API pipeline manifest,
or take a look at the in-depth review of the sample manifests available in the [API Examples](docs/pages/apis) folder.

## Using the Pipeline CLI

Once you have both the API pipeline manifest and OpenAPI spec files, you're ready to start using the `rag-api-pipeline run` command to execute different tasks of the RAG pipeline,
from extracting data from an API source to generating vector embeddings and a database snapshot. If you need more details about the parameters available
on each task you can execute:

```bash [Terminal]
# (command collapsed in this diff view)
```
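
The exact command is collapsed in this view; as an assumption (standard behavior for most Python CLIs, not confirmed by this diff), task-level help is usually exposed via a `--help` flag:

```bash [Terminal]
# assumption: conventional --help flags; verify against the actual CLI
rag-api-pipeline --help
rag-api-pipeline run --help
```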
12 changes: 6 additions & 6 deletions docs/pages/apis/boardroom-api.mdx
@@ -1,11 +1,11 @@
# Boardroom Governance API

The repository already contains the [OpenAPI specification](https://github.com/raid-guild/gaianet-rag-api-pipeline/blob/main/config/boardroom_openapi.yaml) and the [API pipeline manifest](https://github.com/raid-guild/gaianet-rag-api-pipeline/blob/main/config/boardroom_api_pipeline.yaml) needed to create a RAG API pipeline.
This pipeline generates a knowledge base from any DAO/Protocol hosted by the Boardroom Governance API.

## Pre-requisites

To use this API, you'll need an API key. Request one from [Boardroom's developer portal](https://boardroom.io/developers/billing). You can run the `rag-api-pipeline setup` command to set the REST API Key,
or you can directly store the key in the `config/secrets/api-key` file. A less secure option is to provide it using the `--api-key` CLI argument.
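
For the file-based option, a minimal sketch (the path comes from the paragraph above; replace the placeholder with your actual key):

```bash [Terminal]
# store the Boardroom API key where the pipeline expects it
mkdir -p config/secrets
printf '%s' 'YOUR_BOARDROOM_API_KEY' > config/secrets/api-key
```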

## Getting the Boardroom API OpenAPI Spec
@@ -213,7 +213,7 @@ schemas:
```yaml
    type: integer
```

On the other hand, the endpoint's `textSchema` reference specifies the list of fields for text parsing. Note that all properties are also listed in the `responseSchema`.
In this case, `title`, `content`, and `summary` will be parsed as text, while the other fields will be included as metadata properties in a JSON object:

```yaml [boardroom_api_pipeline.yaml]
# (manifest excerpt collapsed in this diff view)
```

@@ -259,15 +259,15 @@
```bash [Terminal]
rag-api-pipeline run all config/boardroom_api_pipeline.yaml config/boardroom_openapi.yaml
```
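
Because the manifest excerpt referenced above is collapsed in this diff view, here is a rough, hypothetical sketch of how such a `responseSchema`/`textSchema` pairing can look (field names are taken from the prose above; the real `boardroom_api_pipeline.yaml` may structure this differently):

```yaml
# hypothetical sketch only, not the actual manifest
schemas:
  proposals:
    responseSchema:      # all extracted fields, kept as metadata
      title:
        type: string
      content:
        type: string
      summary:
        type: string
      totalVotes:
        type: integer
    textSchema:          # subset of responseSchema parsed as text
      title:
        type: string
      content:
        type: string
      summary:
        type: string
```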

The processed data and knowledge base snapshot for Aave will be available in the `output/aave_boardroom_api` folder. You can also find a public knowledge base snapshot on [Hugging Face](https://huggingface.co/datasets/uxman/aave_snapshot_boardroom/tree/main).
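
If you prefer the public snapshot, one way to fetch it (assuming the `huggingface_hub` CLI is installed; the dataset id comes from the link above):

```bash [Terminal]
pip install -U huggingface_hub
# download the public Aave snapshot dataset to ./snapshots
huggingface-cli download uxman/aave_snapshot_boardroom --repo-type dataset --local-dir ./snapshots
```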

-### Import the KB Snapshot into a Gaianet Node
+### Import the KB Snapshot into a Gaia Node

1. Locate the generated snapshot in `output/aave_boardroom_api/` (named `aave_boardroom_api_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot.tar.gz`) or download it from the HuggingFace link above.
2. Follow the official [knowledge base selection guide](https://docs.gaianet.ai/node-guide/customize#select-a-knowledge-base).
3. Configure your node using the recommended settings from the [node deployment guide](/cli/node-deployment#recommended-gaianet-node-configuration).
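
Alternatively, if you'd rather inspect the snapshot in a standalone Qdrant instance instead of a Gaia node, Qdrant's documented snapshot-recovery endpoint can restore it (the collection name and file path below are illustrative, and the `.snapshot` file must first be extracted from the `.tar.gz` archive to a location readable by the Qdrant server):

```bash [Terminal]
# restore an extracted snapshot into a local Qdrant collection
curl -X PUT 'http://localhost:6333/collections/aave_boardroom_api_collection/snapshots/recover' \
  -H 'Content-Type: application/json' \
  -d '{"location": "file:///qdrant/snapshots/aave_boardroom_api_collection.snapshot"}'
```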

Once the command above finishes, you'll find a compressed knowledge base snapshot in
-`{OUTPUT_FOLDER}/aave_boardroom_api/` with name `aave_boardroom_api_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot.tar.gz`. Now it's time to import it
-into your gaianet node. You can find the instructions on how to select a knowledge base [here](https://docs.gaianet.ai/node-guide/customize#select-a-knowledge-base).
+`{OUTPUT_FOLDER}/aave_boardroom_api/` with name `aave_boardroom_api_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot.tar.gz`. Now it's time to import it
+into your Gaia node. You can find the instructions on how to select a knowledge base [here](https://docs.gaianet.ai/node-guide/customize#select-a-knowledge-base).
The recommended prompts and node config settings can be found [here](/cli/node-deployment#recommended-gaianet-node-configuration).

### Example user prompts
6 changes: 3 additions & 3 deletions docs/pages/architecture/tech-stack.mdx
@@ -5,7 +5,7 @@ This page outlines the technologies and tools integrated into the `rag-api-pipel
## Tools & Frameworks

### 1. RAG Pipeline over Data Stream: Pathway ([Docs](https://pathway.com/developers/user-guide/introduction/welcome/))
- **Description**: A Python-based data processing framework designed for creating AI-driven pipelines over data streams
- **Core Technology**:
- **Rust Engine** with multithreading and multiprocessing capabilities for high performance
- **Use Case**: Efficient data processing, enabling integration with third-party data-related tools and AI models to process large, real-time data streams
@@ -31,7 +31,7 @@ This page outlines the technologies and tools integrated into the `rag-api-pipel
### 5. Feature Embedding Generation:
- **Description**: connects to an LLM provider and is responsible for generating feature embeddings, which create dense vector representations of the extracted data.
- **Technologies Used**:
-- **Gaianet Node** ([Docs](https://docs.gaianet.ai/category/node-operator-guide)): Offers a *RAG API Server* that provides an *OpenAI-like API* to interact with hosted LLM models
+- **Gaia Node** ([Docs](https://docs.gaianet.ai/category/node-operator-guide)): Offers a *RAG API Server* that provides an *OpenAI-like API* to interact with hosted LLM models
- **Ollama** ([Docs](https://ollama.com/)): Easy-to-install LLM engine for running large language models on a local machine
- **Python Libraries**:
- [litellm](https://docs.litellm.ai/docs/providers/openai_compatible) Python library for connecting with OpenAI-compatible LLM providers
@@ -41,4 +41,4 @@
- **Description**: A **vector database** and **vector similarity search engine**
- **Key Features**:
- Provides efficient vector searches based on similarity, crucial for tasks like nearest-neighbor search in large datasets
  - Acts as a **knowledge base snapshot** repository, storing vectors generated from processed data and feature embeddings
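
To illustrate the similarity-search role described above, a hypothetical query against a local Qdrant instance (the collection name and vector are placeholders; a real query vector must match the collection's configured size, e.g. 768 for the Nomic embeddings mentioned earlier):

```bash [Terminal]
# nearest-neighbor search over stored embeddings (truncated placeholder vector)
curl -X POST 'http://localhost:6333/collections/example_collection/points/search' \
  -H 'Content-Type: application/json' \
  -d '{"vector": [0.05, 0.12, 0.87], "limit": 5, "with_payload": true}'
```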
10 changes: 5 additions & 5 deletions docs/pages/cli/node-deployment.mdx
@@ -2,13 +2,13 @@

## Quick start guide

-We recommend to follow the GaiaNet Official [quick start guide](https://docs.gaianet.ai/node-guide/quick-start). Your GaiaNet node will
+We recommend following the Gaia official [quick start guide](https://docs.gaianet.ai/node-guide/quick-start). Your Gaia node will
be set up in the `GAIANET_BASE_DIR` (default: `"$HOME/gaianet"`) directory.
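
For reference, the quick start guide's installer one-liner is reproduced below as an assumption (verify against the linked guide, since the exact URL can change):

```bash [Terminal]
# assumption: installer script as documented in the Gaia quick start
curl -sSfL 'https://github.com/GaiaNet-AI/gaianet-node/releases/latest/download/install.sh' | bash
```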

## Deploying your GaiaNet node in *embeddings* running mode (⚠️**Recommended**)

-The `rag-api-pipeline` requires an embeddings model to generate vector embeddings from the API data source. At this stage, we recommend to
-start your GaiaNet node in *embeddings-only* mode (thus consuming less resources than starting the full node) by running the following command:
+The `rag-api-pipeline` requires an embeddings model to generate vector embeddings from the API data source. At this stage, we recommend
+starting your Gaia node in *embeddings-only* mode (it consumes fewer resources than the full node) by running the following command:

```bash [Terminal]
cd $GAIANET_BASE_DIR
# (rest of the command collapsed in this diff view)
```

@@ -22,9 +22,9 @@ wasmedge --dir .:./dashboard --env NODE_VERSION=0.4.7 \
- `--model-name <model_name>`: specifies the embeddings model name
- `--ctx-size` and `--batch-size` should be set according to the selected embeddings model

-## Selecting a knowledge base and custom prompts for your GaiaNet node
+## Selecting a knowledge base and custom prompts for your Gaia node

In order to supplement the LLM model hosted on your Gaia node with a custom knowledge base and prompts, follow the instructions outlined in this [link](https://docs.gaianet.ai/node-guide/customize#select-a-knowledge-base).
Remember to re-initialize and re-start the node after you make any configuration changes.

```bash [Terminal]
# (commands collapsed in this diff view)
```
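
The collapsed block presumably performs that re-initialize/re-start cycle; with the documented `gaianet` CLI, the sequence is typically along these lines (an assumption, not confirmed by this diff):

```bash [Terminal]
# assumption: standard gaianet subcommands from the node operator guide
gaianet stop
gaianet init
gaianet start
```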
16 changes: 8 additions & 8 deletions docs/pages/cli/other-llm-providers.mdx
@@ -1,13 +1,13 @@
# Supported LLM providers

-The `rag-api-pipeline` currently supports two types of LLM providers: `openai` and `ollama`. A Gaianet node for example, uses a Rust-based [RAG API Server](https://github.com/LlamaEdge/rag-api-server)
+The `rag-api-pipeline` currently supports two types of LLM providers: `openai` and `ollama`. A Gaia node, for example, uses a Rust-based [RAG API Server](https://github.com/LlamaEdge/rag-api-server)
to offer OpenAI-compatible web APIs for creating RAG applications.

In the following sections, you'll find more details on the LLM providers the pipeline currently supports and how to set them up.

## OpenAI

-By default, the pipeline supports any LLM provider that offers OpenAI-compatible web APIs. If you wanna work with a provider other than Gaianet,
+By default, the pipeline supports any LLM provider that offers OpenAI-compatible web APIs. If you want to work with a provider other than Gaia,
you can setup the connection using the setup wizard via the `rag-api-pipeline setup` command:

@@ -17,7 +17,7 @@
```bash [Terminal]
Init pipeline...
(Step 1/3) Setting Pipeline LLM provider settings...
Select a custom LLM provider (openai, ollama): openai
LLM provider API URL [http://127.0.0.1:8080/v1]: https://api.openai.com/v1
LLM provider API Key:
LLM Provider API connection OK!
Embeddings model Name [Nomic-embed-text-v1.5]: text-embedding-ada-002
Embeddings Vector Size [768]: 2048
Pipeline LLM Provider settings OK!
```

@@ -26,8 +26,8 @@

## Ollama

-If you're planning to use the pipeline on consumer hardware that cannot handle a GaiaNet node running in the background, you can opt-in to use Ollama
-as LLM provider. Depending on the use case and resources available, some of the advantages of using Ollama for example are that it is more lighweight,
+If you're planning to use the pipeline on consumer hardware that cannot handle a Gaia node running in the background, you can opt in to using Ollama
+as the LLM provider. Depending on the use case and resources available, some of the advantages of Ollama are that it is more lightweight,
easier to install and ready to use with Mac GPU devices.

### Getting Ollama
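
The body of this subsection is collapsed in the diff view; for reference, Ollama's documented Linux install one-liner is shown below (macOS and Windows use the installers from ollama.com):

```bash [Terminal]
# install Ollama on Linux (script from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh
```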
@@ -46,13 +46,13 @@
```bash [Terminal]
Which LLM provider you want to use? (gaia, other) [gaia]: other
Init pipeline...
(Step 1/3) Setting Pipeline LLM provider settings...
Select a custom LLM provider (openai, ollama): ollama
LLM provider API URL [http://127.0.0.1:11434]:
ERROR: LLM Provider API (@ http://127.0.0.1:11434/v1/models) is down. HTTPConnectionPool(host='127.0.0.1', port=11434): Max retries exceeded with url: /v1/models (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1091c2490>: Failed to establish a new connection: [Errno 61] Connection refused'))
Try again...
LLM provider API URL [http://127.0.0.1:11434]:
LLM Provider API connection OK!
Embeddings model Name [Nomic-embed-text-v1.5]:
Embeddings Vector Size [768]:
Enter the Absolute Path to the Embeddings model file: /home/user/rag-api-pipeline/models/nomic-embed-text-v1.5.f16.gguf
Importing embeddings model into Ollama...
Pipeline LLM Provider settings OK!
```
6 changes: 3 additions & 3 deletions docs/pages/cli/settings.mdx
@@ -1,6 +1,6 @@
# Customizing the Pipeline Config Settings

Most of the pipeline configuration settings are set by running the setup wizard via the `rag-api-pipeline setup` command. However, there are
more advanced features that can also be set via environment variables in `config/.env`.

## Environment variables
@@ -21,7 +21,7 @@ The following environment variables can be adjusted in `config/.env` based on th
- Default value: `Nomic-embed-text-v1.5`
- `LLM_EMBEDDINGS_VECTOR_SIZE`: embeddings vector size
- Default value: `768`
-- `LLM_PROVIDER`: LLM provider backend to use. It can be either `openai` or `ollama` (Gaianet offers an OpenAI-compatible API)
+- `LLM_PROVIDER`: LLM provider backend to use. It can be either `openai` or `ollama` (Gaia offers an OpenAI-compatible API)
- Default value: `openai`
- **Qdrant DB settings**:
- `QDRANTDB_URL`: Qdrant DB base URL
@@ -31,7 +31,7 @@ The following environment variables can be adjusted in `config/.env` based on th
- `QDRANTDB_DISTANCE_FN`: score function to use during vector similarity search. Available functions: ['COSINE', 'EUCLID', 'DOT', 'MANHATTAN']
- Default value: `COSINE`
- **Pathway-related variables**:
- `AUTOCOMMIT_DURATION_MS`: the maximum time between two commits. Every autocommit_duration_ms milliseconds, the updates received by the connector are
committed automatically and pushed into Pathway's dataflow. More information can be found [here](https://pathway.com/developers/user-guide/connect/connectors/custom-python-connectors#connector-method-reference)
- Default value: `1000`
- `FixedDelayRetryStrategy` ([docs](https://pathway.com/developers/api-docs/udfs#pathway.udfs.FixedDelayRetryStrategy)) config parameters:
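
As a worked example, a `config/.env` assembled only from the defaults listed above (plain `KEY=value` syntax assumed; variables whose names are collapsed in this diff are omitted):

```bash
# config/.env sketch; values are the documented defaults
LLM_EMBEDDINGS_VECTOR_SIZE=768
LLM_PROVIDER=openai
QDRANTDB_DISTANCE_FN=COSINE
AUTOCOMMIT_DURATION_MS=1000
```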
