Commit: update docs
santteegt committed Oct 28, 2024
1 parent 8c92ba3 commit fc439e9
Showing 3 changed files with 3,297 additions and 2,691 deletions.
76 changes: 49 additions & 27 deletions docs/pages/getting-started.mdx
@@ -5,7 +5,7 @@

resulting database snapshot can then be plugged into a Gaia node's LLM model
(Retrieval Augmented Generation).

The following sections help you quickly set up and execute the pipeline on your REST API. If you're looking for more in-depth information about how to use
this tool, the tech stack and/or how it works under the hood, check the content menu on the left.

## System Requirements

@@ -23,39 +23,46 @@

Git clone or download this repository to your local machine.

```bash [Terminal]
git clone https://github.com/raid-guild/gaianet-rag-api-pipeline.git
```

### Activate your virtual environment

It is recommended to activate your [own virtual environment](https://python-poetry.org/docs/basic-usage/#using-your-virtual-environment),
otherwise Poetry will create/use a brand new environment. Check how Poetry [manages environments](https://python-poetry.org/docs/managing-environments/) for details.
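For example, one way to do this with Python's built-in `venv` module (the environment name `.venv` is just a convention, not something this project requires):

```bash [Terminal]
# create and activate a virtual environment in the project folder
python3 -m venv .venv
source .venv/bin/activate
```

With the environment active, Poetry will detect and reuse it instead of creating a new one.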

### Install project dependencies

Navigate to the directory where this repository was cloned/downloaded and execute the following command to install the project dependencies:

```bash [Terminal]
cd gaianet-rag-api-pipeline
poetry install
```

### Set environment variables

Copy `config/.env/sample` into `config/.env` and set the environment variables accordingly. Check the [environment variables](/getting-started#environment-variables) section below
for details.
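For example, on a Unix-like shell:

```bash [Terminal]
# create your local config file from the provided sample
cp config/.env/sample config/.env
```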

### Define your API Pipeline manifest

Now it's time to define the pipeline manifest for the REST API you're looking to extract data from. Check how to define an API pipeline manifest in
[Defining an API Pipeline Manifest](/manifest-definition) for details, or take a look at the in-depth review of the sample manifests available in
[API Examples](/apis).

### Set the REST API Key

The API must require an API key when sending requests. You can set it in the `config/secrets/api_key` file, or specify it using the `--api-key` CLI argument.
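For example, assuming `<your-api-key>` is the key issued by the API provider (the `mkdir -p` is only needed if the secrets folder does not exist yet):

```bash [Terminal]
# store the API key where the pipeline expects it
mkdir -p config/secrets
echo "<your-api-key>" > config/secrets/api_key
```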

### Setup a Qdrant DB instance

Get the base URL of your Qdrant Vector DB or deploy a local `Qdrant` ([Docs](https://qdrant.tech/documentation/)) vector database instance using Docker,
and update the `QDRANTDB_URL` variable in the `config/.env` file:

```bash [terminal]
docker run -p 6333:6333 -p 6334:6334 -v ./qdrant:/qdrant/storage:z qdrant/qdrant:v1.10.1
```
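Once the container is up, you can quickly check that the instance is reachable; `/collections` is a standard Qdrant REST endpoint that lists existing collections:

```bash [Terminal]
curl http://localhost:6333/collections
# a fresh instance should return something like:
# {"result":{"collections":[]},"status":"ok","time":0.0001}
```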

:::warning
@@ -73,8 +80,8 @@

Load the LLM embeddings model of your preference into the LLM provider you chose:
- You can find info on how to customize a Gaianet node [here](https://docs.gaianet.ai/node-guide/customize)
- If you chose Ollama, follow these instructions to import the LLM embeddings model:
- Make sure the Ollama service is up and running
- Go to the folder where the embeddings model is located. Then, create a file named `Modelfile` and paste the following (replace `<path/to/model>`
with your local directory, for example [Nomic-embed-text-v1.5.f16.gguf](https://huggingface.co/gaianet/Nomic-embed-text-v1.5-Embedding-GGUF/tree/main?show_file_info=nomic-embed-text-v1.5.f16.gguf)):

```docker [Modelfile]
FROM <path/to/model>/nomic-embed-text-v1.5.f16.gguf
```

@@ -83,7 +90,7 @@

```bash [Terminal]
ollama create Nomic-embed-text-v1.5
```
- Check the embeddings model settings by running the command:
```bash [terminal]
ollama show Nomic-embed-text-v1.5
```
@@ -104,7 +111,7 @@

Below you can find the default instructions available and an in-depth review of each command:

```bash [Terminal]
# run the entire pipeline
poetry run rag-api-pipeline run-all API_MANIFEST_FILE --openapi-spec-file <openapi-spec-yaml-file> [--full-refresh] [--llm-provider openai|ollama]
# or run using an already normalized dataset
poetry run rag-api-pipeline from-normalized API_MANIFEST_FILE --normalized-data-file <jsonl-file> [--llm-provider openai|ollama]
# or run using an already chunked dataset
```

@@ -153,28 +160,43 @@

If the `--full-refresh` flag is specified to the `run-all` command, the cache will be cleared and API data will be fetched again.
The following environment variables can be adjusted in `config/.env` based on user needs:

- Pipeline config parameters:
  - `API_DATA_ENCODING`: data encoding used by the REST API
    - Default value: `utf-8`
  - `OUTPUT_FOLDER`: output folder where cached data streams, intermediary stage results and the generated knowledge base snapshot are stored
    - Default value: `./output`
- LLM provider settings:
  - `LLM_API_BASE_URL`: LLM provider base URL (defaults to a local OpenAI-compatible provider such as a Gaia node)
    - Default value: `http://localhost:8080/v1`
  - `LLM_API_KEY`: API key to authenticate requests to the LLM provider
    - Default value: `empty-api-key`
  - `LLM_EMBEDDINGS_MODEL`: name of the embeddings model to be consumed through the LLM provider
    - Default value: `Nomic-embed-text-v1.5`
  - `LLM_EMBEDDINGS_VECTOR_SIZE`: embeddings vector size
    - Default value: `768`
  - `LLM_PROVIDER`: LLM provider backend to use. It can be either `openai` or `ollama` (Gaianet offers an OpenAI-compatible API)
    - Default value: `openai`
- Qdrant DB settings:
  - `QDRANTDB_URL`: Qdrant DB base URL
    - Default value: `http://localhost:6333`
  - `QDRANTDB_TIMEOUT`: timeout for requests made to the Qdrant DB
    - Default value: `60`
  - `QDRANTDB_DISTANCE_FN`: score function to use during vector similarity search. Available functions: ['COSINE', 'EUCLID', 'DOT', 'MANHATTAN']
    - Default value: `COSINE`
- Pathway-related variables:
  - `AUTOCOMMIT_DURATION_MS`: the maximum time between two commits. Every `autocommit_duration_ms` milliseconds, the updates received by the connector are committed automatically and pushed into Pathway's dataflow. More info can be found [here](https://pathway.com/developers/user-guide/connect/connectors/custom-python-connectors#connector-method-reference)
    - Default value: `1000`
  - `FixedDelayRetryStrategy` ([docs](https://pathway.com/developers/api-docs/udfs#pathway.udfs.FixedDelayRetryStrategy)) config parameters:
    - `PATHWAY_RETRY_MAX_ATTEMPTS`: max retries to be performed if a UDF async execution fails
      - Default value: `10`
    - `PATHWAY_RETRY_DELAY_MS`: delay in milliseconds to wait for the next execution attempt
      - Default value: `1000`
  - *UDF async execution*: set the maximum number of concurrent operations per batch during UDF async execution. Zero means no specific limit. Be careful when setting these parameters for the embeddings stage, as too many concurrent requests could overwhelm the LLM provider
    - `CHUNKING_BATCH_CAPACITY`: max number of concurrent operations during data chunking
      - Default value: `0`
    - `EMBEDDINGS_BATCH_CAPACITY`: max number of concurrent operations during vector embeddings generation
      - Default value: `15`
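As a quick reference, a minimal `config/.env` using the default values listed above would look like the following sketch (the exact set of variables should always match `config/.env/sample`):

```bash [config/.env]
# Pipeline config
API_DATA_ENCODING="utf-8"
OUTPUT_FOLDER="./output"

# LLM provider
LLM_API_BASE_URL="http://localhost:8080/v1"
LLM_API_KEY="empty-api-key"
LLM_EMBEDDINGS_MODEL="Nomic-embed-text-v1.5"
LLM_EMBEDDINGS_VECTOR_SIZE=768
LLM_PROVIDER="openai"

# Qdrant DB
QDRANTDB_URL="http://localhost:6333"
QDRANTDB_TIMEOUT=60
QDRANTDB_DISTANCE_FN="COSINE"

# Pathway
AUTOCOMMIT_DURATION_MS=1000
PATHWAY_RETRY_MAX_ATTEMPTS=10
PATHWAY_RETRY_DELAY_MS=1000
CHUNKING_BATCH_CAPACITY=0
EMBEDDINGS_BATCH_CAPACITY=15
```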


## Using Docker Compose for local development or in production
36 changes: 19 additions & 17 deletions docs/pages/manifest-definition/overview.mdx
@@ -19,7 +19,7 @@

An alphanumeric-only name for the API pipeline.

### api_parameters

Contains any parameters required for building the API requests. These parameters MUST be included in the [spec](/manifest-definition/overview#spec) section, and their values are accessible
through the `config` object over the rest of the manifest.
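For illustration, a hypothetical manifest for an API that scopes requests by a `space_id` parameter might declare it as follows (all names here are made up) and then reference it elsewhere via Airbyte's `{{ config['...'] }}` interpolation:

```yaml
api_parameters:
  space_id: "my-dao"

# later in the manifest, e.g. inside an endpoint definition:
# path: "/spaces/{{ config['space_id'] }}/proposals"
```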

### api_config
@@ -49,15 +49,15 @@

parameter in this [link](https://docs.unstructured.io/open-source/core-functiona

An Airbyte declarative manifest requires the following schema definitions:

#### spec

A source specification made up of connector metadata and how it can be configured ([Docs](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/reference#/definitions/Spec)).
All parameters defined in the `api_parameters` section must be listed under `required` and `properties`.
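As a sketch, a spec for the hypothetical `space_id` parameter above could look like this (field names follow Airbyte's connection specification schema; `airbyte_secret` marks values that must not be logged):

```yaml
spec:
  type: Spec
  connection_specification:
    $schema: http://json-schema.org/draft-07/schema#
    type: object
    required:
      - api_key
      - space_id
    properties:
      api_key:
        type: string
        airbyte_secret: true
      space_id:
        type: string
```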

#### selector

The record selector is responsible for translating an HTTP response into a list of Airbyte records by extracting records from the response ([Docs](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/record-selector)).
The API pipeline manifest includes two base selectors that can be applied for single/multiple record responses:

```yaml
selector:
@@ -72,16 +72,17 @@
selector_single:
  field_path: []
```

#### requester_base

The Requester defines how to prepare HTTP requests to send to the source API ([Docs](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/requester)).
Here you specify the API `base_url` and the `authenticator` schema used. Airbyte supports the most commonly used authentication methods:
`ApiKeyAuthenticator`, `BearerAuthenticator`, `BasicHttpAuthenticator` and `OAuth`. You can find a detailed explanation on how to configure each of them in this
[link](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/authentication).
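As a sketch, a requester using `ApiKeyAuthenticator` could look like the following (the base URL and header name are hypothetical; `{{ config['api_key'] }}` is resolved from the `api_parameters` values):

```yaml
requester_base:
  type: HttpRequester
  url_base: "https://api.example.com/v1"
  http_method: "GET"
  authenticator:
    type: ApiKeyAuthenticator
    api_token: "{{ config['api_key'] }}"
    inject_into:
      type: RequestOption
      inject_into: header
      field_name: "X-API-KEY"
```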

#### retriever_base

A SimpleRetriever object in charge of fetching records via synchronous API requests ([Docs](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/reference#/definitions/SimpleRetriever)).
The retriever acts as an orchestrator between the requester, the record selector and the paginator. The API pipeline manifest includes one base retriever per defined selector:

```yaml
retriever_base:
@@ -94,10 +95,11 @@
retriever_single_base:
  $ref: "#/definitions/selector_single"
```
#### paginator

Set the pagination strategy for API endpoints that return multiple records ([Docs](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/pagination)).
Airbyte supports the `Page increment`, `Offset increment` and `Cursor based` pagination strategies. On the other hand, `"#/definitions/NoPagination"` is automatically set for endpoints
that return a single record.
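For instance, a hypothetical offset-based paginator could be declared as follows (the page size and parameter names are illustrative):

```yaml
paginator:
  type: DefaultPaginator
  pagination_strategy:
    type: OffsetIncrement
    page_size: 100
  page_size_option:
    type: RequestOption
    field_name: "limit"
    inject_into: "request_parameter"
  page_token_option:
    type: RequestOption
    field_name: "offset"
    inject_into: "request_parameter"
```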

### Endpoints definitions

