txtai: AI-powered search engine

txtai builds an AI-powered index over sections of text. It supports building text indices to perform similarity searches and to create extractive question-answering systems. txtai also provides zero-shot classification.
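To make the idea of a similarity search concrete, the following is a minimal, self-contained sketch. It does not use txtai; the toy three-dimensional vectors and the `cosine` and `search` helpers are assumptions for illustration only. A real index stores model-generated embeddings and typically uses an approximate nearest neighbor backend rather than a full scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query, index, limit=1):
    """Return the top `limit` (id, score) pairs, best match first."""
    scores = [(uid, cosine(query, vector)) for uid, vector in index.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:limit]

# Toy 3-dimensional "embeddings" (hypothetical values, not model output)
index = {
    "sports": [0.9, 0.1, 0.0],
    "health": [0.1, 0.9, 0.2],
    "finance": [0.0, 0.2, 0.9],
}

# Query vector that points mostly in the "sports" direction
results = search([0.8, 0.2, 0.1], index, limit=2)
print(results)
```

Sections of text are ranked by how closely their vectors align with the query vector, so the nearest section is returned first.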

NeuML uses txtai and/or the concepts behind it to power all of our Natural Language Processing (NLP) applications. Example applications:

paperai - AI-powered literature discovery and review engine for medical/scientific papers
tldrstory - AI-powered understanding of headlines and story text
neuspo - Fact-driven, real-time sports event and news site
codequestion - Ask coding questions directly from the terminal

txtai is built on the following stack:

Installation

The easiest way to install is via pip and PyPI.

pip install txtai

You can also install txtai directly from GitHub. Using a Python Virtual Environment is recommended.

pip install git+https://github.com/neuml/txtai

Python 3.6+ is supported.

Troubleshooting

This project has dependencies that require compiling native code. Windows and macOS systems require the following additional steps. Most Linux environments will install without any additional steps.

Windows

Install C++ Build Tools - https://visualstudio.microsoft.com/visual-cpp-build-tools/
PyTorch now has Windows binaries on PyPI and should work with the standard install. If issues arise, try pointing the install at the PyTorch package index:
```
pip install txtai -f https://download.pytorch.org/whl/torch_stable.html
```
See pytorch.org for more information.

macOS

Run the following before installing:
```
brew install libomp
```
See this link for more information.

See this GitHub workflow file for an example of environment-dependent installation procedures.

Examples

The examples directory has a series of examples and notebooks giving an overview of txtai. See the list of notebooks below.

Notebooks

| Notebook | Description |
|---|---|
| Introducing txtai | Overview of the functionality provided by txtai |
| Extractive QA with txtai | Extractive question-answering with txtai |
| Build an Embeddings index from a data source | Embeddings index from a data source backed by word embeddings |
| Extractive QA with Elasticsearch | Extractive question-answering with Elasticsearch |
| Labeling with zero-shot classification | Labeling with zero-shot classification |

Configuration

The following sections cover the available settings for Embeddings, Extractor and Labels instances.

Embeddings

Embeddings parameters are set through the constructor. Examples below.

from txtai.embeddings import Embeddings

# Transformers embeddings model
Embeddings({"method": "transformers",
            "path": "sentence-transformers/bert-base-nli-mean-tokens"})

# Word embeddings model (vectors holds the path to a local word vectors file)
Embeddings({"path": vectors,
            "storevectors": True,
            "scoring": "bm25",
            "pca": 3,
            "quantize": True})

method

method: transformers|words

Sets the sentence embeddings method to use. When set to transformers, the embeddings object builds sentence embeddings using the sentence-transformers library. Otherwise, a word embeddings model is used. Defaults to words.

path

path: string

Required field that sets the path for a vectors model. When method is set to transformers, this must be a path to a Hugging Face transformers model. Otherwise, it must be a path to a local word embeddings model.

storevectors

storevectors: boolean

Enables copying of the vectors model set in path into the embeddings model's output directory on save. This option enables a fully encapsulated index with no external file dependencies.

scoring

scoring: bm25|tfidf|sif

For word embedding models, a scoring model allows building weighted averages of word vectors for a given sentence. Supports BM25, tf-idf and SIF (smooth inverse frequency) methods. If a scoring method is not provided, mean sentence embeddings are built.
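As a rough illustration of the weighted averaging described above, the sketch below weights each word vector by a smoothed inverse document frequency and falls back to a plain mean when no weights are given. It is not txtai's scoring code: the `idf_weights` and `sentence_embedding` helpers, the toy vectors and the specific IDF formula are assumptions chosen for brevity, and BM25 and SIF compute their weights differently.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Smoothed inverse document frequency for each token."""
    total = len(documents)
    freq = Counter(token for tokens in documents for token in set(tokens))
    return {token: math.log((1 + total) / (1 + count)) + 1
            for token, count in freq.items()}

def sentence_embedding(tokens, vectors, weights=None):
    """Weighted average of word vectors; plain mean when weights is None."""
    known = [t for t in tokens if t in vectors]
    if not known:
        return None
    w = [weights.get(t, 1.0) if weights else 1.0 for t in known]
    total = sum(w)
    dims = len(vectors[known[0]])
    return [sum(wi * vectors[t][d] for wi, t in zip(w, known)) / total
            for d in range(dims)]

# Toy corpus and 2-dimensional word vectors (hypothetical values)
documents = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}

weights = idf_weights(documents)
mean = sentence_embedding(["the", "cat"], vectors)
weighted = sentence_embedding(["the", "cat"], vectors, weights)
print(mean, weighted)
```

Because "the" appears in every document, it receives the lowest weight, so the weighted embedding leans toward the rarer and more informative word "cat".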

pca

pca: int

Removes n principal components from generated sentence embeddings. When enabled, a TruncatedSVD model is built to help with dimensionality reduction. The removal is applied after the word vectors have been pooled into a single sentence embedding.
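The sketch below shows what removing a principal component means in practice, limited to a single component to keep it short. It is an assumption-laden illustration, not txtai's code: it estimates the top component with power iteration in plain Python instead of a TruncatedSVD model, and the helper names and toy embeddings are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_component(rows, iterations=100):
    """Estimate the first right singular vector of `rows` by power iteration."""
    dims = len(rows[0])
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        # Multiply v by X^T X: project every row onto v, then accumulate back
        proj = [dot(row, v) for row in rows]
        w = [sum(p * row[d] for p, row in zip(proj, rows)) for d in range(dims)]
        norm = math.sqrt(dot(w, w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def remove_component(rows, component):
    """Subtract each row's projection onto the unit-length `component`."""
    return [[x - dot(row, component) * c for x, c in zip(row, component)]
            for row in rows]

# Toy sentence embeddings that share a large common direction along axis 0
embeddings = [[2.0, 0.1, 0.0], [1.9, -0.1, 0.1], [2.1, 0.0, -0.1]]

component = top_component(embeddings)
cleaned = remove_component(embeddings, component)
print(cleaned)
```

After removal, the shared direction no longer dominates every vector, which tends to make the remaining differences between sentences easier to compare.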

backend

backend: annoy|faiss|hnsw

Approximate Nearest Neighbor (ANN) index backend for storing generated sentence embeddings. Defaults to Faiss for Linux/macOS and Annoy for Windows. Faiss currently is not supported on Windows.
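The default selection described above can be expressed as a small helper. This is a sketch of the stated rule only: `default_backend` is a hypothetical function, not part of txtai, and the resulting value is simply what would be set under the backend key of the configuration.

```python
import platform

def default_backend(system=None):
    """Pick an ANN backend name: Annoy on Windows, Faiss elsewhere."""
    system = system or platform.system()
    return "annoy" if system == "Windows" else "faiss"

# Explicitly set the backend in an Embeddings configuration dictionary
config = {"backend": default_backend()}
print(config)
```

Setting the backend explicitly can be useful when the same index needs to be built on Linux and loaded on Windows, since an index built with one backend is not interchangeable with another.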

quantize

quantize: boolean

Enables quantization of generated sentence embeddings. If the index backend supports it, sentence embeddings will be stored with 8-bit precision instead of 32-bit. Only Faiss currently supports quantization.
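To illustrate the trade-off between 8-bit and 32-bit storage, the following sketch maps each float to one of 256 integer codes and back. It is a simplified illustration under assumed min/max scaling, not the scheme Faiss implements; the `quantize` and `dequantize` helpers and the sample values are hypothetical.

```python
def quantize(vector, low, high):
    """Map floats in [low, high] to integer codes in 0..255."""
    scale = (high - low) / 255 or 1.0
    return [min(255, max(0, round((x - low) / scale))) for x in vector]

def dequantize(codes, low, high):
    """Recover approximate floats from 8-bit codes."""
    scale = (high - low) / 255 or 1.0
    return [low + c * scale for c in codes]

# Toy embedding values (hypothetical)
embedding = [-0.75, -0.1, 0.0, 0.33, 0.9]
low, high = min(embedding), max(embedding)

codes = quantize(embedding, low, high)
restored = dequantize(codes, low, high)

# Largest round-trip error is bounded by half a quantization step
error = max(abs(a - b) for a, b in zip(embedding, restored))
print(codes, error)
```

Each value takes one byte instead of four, so storage shrinks by roughly 4x at the cost of a small, bounded rounding error per dimension.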

Extractor

Extractor parameters are set as constructor arguments. Examples below.

Extractor(embeddings, path, quantize)

embeddings

embeddings: Embeddings object instance

Embeddings object instance. Used to query and find candidate text snippets to run the question-answer model against.

path

path: string

Required path to a Hugging Face SQuAD fine-tuned model. Used to answer questions.

quantize

quantize: boolean

Enables dynamic quantization of the Hugging Face model. This is a runtime setting and doesn't save space. It is used to improve the inference time performance of the QA model.

Labels

Labels parameters are set as constructor arguments. Examples below.

Labels()
Labels("roberta-large-mnli")

path

path: string

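Path to a Hugging Face MNLI fine-tuned model. Used to label text via zero-shot classification. As the Labels() example above shows, the path can be omitted, in which case a default model is used.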

API

txtai has a full-featured API that can optionally be enabled for any txtai process. All functionality found in txtai can be accessed via the API. The following is an example configuration and startup script for the API.

Note that this configuration file enables all functionality (embeddings, extractor and labels). It is suggested that separate processes are used for each instance of a txtai component.

# Index file path
path: /tmp/index

# Allow indexing of documents
writable: True

# Embeddings settings
embeddings:
  method: transformers
  path: sentence-transformers/bert-base-nli-mean-tokens

# Extractor settings
extractor:
  path: distilbert-base-cased-distilled-squad

# Labels settings
labels:

Assuming this YAML content is stored in a file named index.yml, the following command starts the API process.

CONFIG=index.yml uvicorn "txtai.api:app"
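Once the process is running, the API can be queried over HTTP. The sketch below uses only the Python standard library. It assumes uvicorn's default address of 127.0.0.1:8000 and assumes a /search endpoint that takes q (query) and n (result limit) parameters; the endpoint name and parameters are assumptions that may differ between txtai versions, so check them against the running API before relying on this.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# uvicorn's default bind address; change if the API runs elsewhere
BASE = "http://127.0.0.1:8000"

def search_url(query, limit=10, base=BASE):
    """Build the URL for a search request (endpoint and params are assumed)."""
    return f"{base}/search?{urlencode({'q': query, 'n': limit})}"

def search(query, limit=10, base=BASE):
    """Run a search against a running API process and decode the JSON reply."""
    with urlopen(search_url(query, limit, base)) as response:
        return json.load(response)

# Building the URL does not require a running server
url = search_url("feel good story", 1)
print(url)
```

Calling `search("feel good story", 1)` performs the actual request and requires the API process started above to be running.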

Supported language bindings

The following programming languages have txtai bindings:

External implementations of txtai bindings are welcome. We're happy to add any additional implementations to this list.
