Auctus

This project is a web crawler and search engine for datasets, specifically meant for data augmentation tasks in machine learning. It is able to find datasets in different repositories and index them for later retrieval.

Requirements:

Docker should be installed and started.

Python should be installed.

Documentation is available here

Documentation of the modified architecture and implemented features

The system is divided into multiple components:

Libraries
- Geospatial database datamart_geo. This contains data about administrative areas extracted from Wikidata and OpenStreetMap. It lives in its own repository and is used here as a submodule.
- Profiling library datamart_profiler. This can be installed by clients, which allows the client library to profile datasets locally instead of sending them to the server. It is also used by the apiserver and profiler services.
- Materialization library datamart_materialize. This is used to materialize datasets from the various sources that Auctus supports. It can be installed by clients, which allows them to materialize datasets locally instead of using the server as a proxy.
- Data augmentation library datamart_augmentation. This performs the join or union of two datasets and is used by the apiserver service, but could conceivably be used stand-alone.
- Core server library datamart_core. This contains common code for services and is only used by the server components. The filesystem locking code is kept separate as datamart_fslock for performance reasons (it has to import fast).

Services
- Discovery services: these are responsible for discovering datasets. Each plugin can talk to a specific repository. Materialization metadata is recorded for each dataset, to allow future retrieval of that dataset.
- Profiler: this service downloads a discovered dataset and computes additional metadata that can be used for search (for example, dimensions, semantic types, value distributions). Uses the profiling and materialization libraries.
- Lazo Server: this service is responsible for indexing textual and categorical attributes using Lazo. The code for the server and client is available here.
- apiserver: this service responds to requests from clients to search for datasets in the index (triggering on-demand query by discovery services that support it), upload new datasets, profile datasets, or perform augmentation. Uses the profiling and materialization libraries. Implements a JSON API using the Tornado web framework.
- The cache-cleaner: this service makes sure the dataset cache stays under a given size limit by removing least-recently-used datasets when the configured size is reached.
- The coordinator: this service collects some metrics and offers a maintenance interface for the system administrator.
- The frontend: this is a React app implementing a user-friendly web interface on top of the API.
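
The least-recently-used eviction described for the cache-cleaner can be illustrated with a minimal sketch. The function name `pick_evictions`, the (name, size, last access) tuple layout, and the numbers are assumptions made for illustration; this is not the service's actual code.

```python
from typing import List, Tuple

# Hypothetical cache entry: (name, size in bytes, last access timestamp)
Entry = Tuple[str, int, float]


def pick_evictions(entries: List[Entry], limit: int) -> List[str]:
    """Return names to remove, least recently used first, until the
    total size is at or below `limit`."""
    total = sum(size for _, size, _ in entries)
    evicted = []
    # Walk entries from oldest access time to newest
    for name, size, _ in sorted(entries, key=lambda e: e[2]):
        if total <= limit:
            break
        evicted.append(name)
        total -= size
    return evicted


if __name__ == "__main__":
    cache = [("a", 40, 100.0), ("b", 30, 300.0), ("c", 50, 200.0)]
    # Total is 120 bytes; removing the oldest entry brings it down to 80
    print(pick_evictions(cache, limit=80))
```

Returning the list of names instead of deleting files directly keeps the policy easy to test separately from the filesystem.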

Elasticsearch is used as the search index, storing one document per known dataset.

The services exchange messages through RabbitMQ, which provides queueing and retry semantics and supports more complex messaging patterns such as on-demand querying.

Deployment

The system is currently running at https://auctus.vida-nyu.org/. You can see the system status at https://grafana.auctus.vida-nyu.org/.

Local deployment / development setup

To deploy the system locally using docker-compose, follow these steps:

Set up environment

Make sure you have checked out the submodule with git submodule init && git submodule update

Make sure you have Git LFS installed and configured (git lfs install)

Copy env.default to .env and update the variables there. For a production deployment, change the default passwords.

Make sure your node is set up for running Elasticsearch. You will probably have to raise the mmap limit.
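
As a quick check on a Linux host, the current mmap limit can be compared against the minimum of 262144 that Elasticsearch documents for vm.max_map_count. The helper name `mmap_limit_ok` is hypothetical, and the sketch assumes the setting is readable under /proc.

```python
from pathlib import Path

# Minimum that Elasticsearch documents for vm.max_map_count
REQUIRED_MAX_MAP_COUNT = 262144


def mmap_limit_ok(path: str = "/proc/sys/vm/max_map_count") -> bool:
    """Return True if the kernel setting read from `path` is high enough."""
    current = int(Path(path).read_text().strip())
    return current >= REQUIRED_MAX_MAP_COUNT


if __name__ == "__main__":
    try:
        ok = mmap_limit_ok()
    except OSError:
        print("Could not read vm.max_map_count (not a Linux host?)")
    else:
        print("mmap limit OK" if ok else
              "Raise it with: sudo sysctl -w vm.max_map_count=262144")
```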

The API_URL is the URL at which the apiserver containers will be visible to clients. In a production deployment, this is probably a public-facing HTTPS URL. It can be the same URL that the "coordinator" component will be served at if using a reverse proxy (see nginx.conf).

To run scripts locally, you can load the environment variables into your shell by running: . scripts/load_env.sh (that's dot space scripts...)
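
For Python scripts, a similar effect can be had by reading .env directly. This is a minimal sketch that assumes the file contains plain KEY=VALUE lines with optional # comments; it does not reproduce scripts/load_env.sh, and the `parse_env` helper is hypothetical.

```python
import os
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "="
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


if __name__ == "__main__":
    try:
        with open(".env") as f:
            os.environ.update(parse_env(f.read()))
    except FileNotFoundError:
        print("No .env file here; copy env.default to .env first")
    else:
        print("API_URL =", os.environ.get("API_URL"))
```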

Prepare data volumes

Run scripts/setup.sh to initialize the data volumes. This will set the correct permissions on the volumes/ subdirectories.

Should you ever want to start from scratch, you can delete volumes/ but make sure to run scripts/setup.sh again afterwards to set permissions.

Build the containers

$ docker-compose build --build-arg version=$(git describe) apiserver

Start the base containers

$ docker-compose up -d elasticsearch rabbitmq redis minio lazo

These will take a few seconds to get up and running. Then you can start the other components:

$ docker-compose up -d cache-cleaner coordinator profiler apiserver apilb frontend

You can use the --scale option to start more profiler or apiserver containers, for example:

$ docker-compose up -d --scale profiler=4 --scale apiserver=8 cache-cleaner coordinator profiler apiserver apilb frontend

Reproducibility:

Clone this repository.

Follow the steps above to bring up the Docker environment.

Then install the required Python packages:

pip3 install flask 
pip3 install flask_cors 
pip3 install duckdb
pip3 install minio 
pip3 install pandas 
pip3 install pyarrow

Then go to the auctus directory and run the following command to start the Flask app:

python3 main.py

Import a snapshot of our index

$ scripts/docker_import_snapshot.sh

This will download an Elasticsearch dump from auctus.vida-nyu.org and import it into your local Elasticsearch container.

Now open the Auctus web interface at http://localhost:8001

Search for a term such as 'cities'.

Each search result shows an "Execute SQL" button.

First click on "Copy ID".

Then click on "Execute SQL", which opens a new tab.

Paste the dataset ID into the Dataset ID field.

Run this command in the SQL section:

"SELECT * FROM data LIMIT 10;"

You can also download the query result using the "Download as CSV" button.

You can change the limit and select specific columns, but the table name must remain "data". Whichever dataset ID is supplied, the backend exposes that dataset under the temporary table name "data" for ease of use.
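
The idea of exposing a dataset under the fixed table name "data" can be sketched as follows. This is an illustration only: the standard-library sqlite3 module stands in for whatever engine the backend uses, the `query_dataset` helper is hypothetical, and the sample CSV is made up.

```python
import csv
import io
import sqlite3

# Made-up sample standing in for a dataset fetched by its ID
SAMPLE_CSV = "city,country\nParis,France\nRome,Italy\nBerlin,Germany\n"


def query_dataset(csv_text: str, sql: str):
    """Load CSV text into a temporary table called "data" and run `sql`."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    # Quote column names taken from the CSV header
    columns = ", ".join('"{}"'.format(name) for name in header)
    conn.execute("CREATE TABLE data ({})".format(columns))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany("INSERT INTO data VALUES ({})".format(placeholders), body)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(query_dataset(SAMPLE_CSV, "SELECT city FROM data LIMIT 2;"))
```

Because the table name is always "data", the same query text works against any dataset the user selects.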

Reproducibility for parsing Parquet in the Profiling pipeline

We have modified the profiling pipeline to handle Parquet files in addition to CSV. To test it, run the commands below, which create a sample Parquet file and pass it to the profiler library's process_dataset function.

Open the profiler container in Docker Desktop and click on Terminal.

$ cd  
$ ls
$ python3
Python 3.8.16 (default, May  4 2023, 06:20:30) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datamart_profiler
>>> import pandas
>>> df = pandas.DataFrame({
...     'place': ['france', 'france', 'italy', 'germany'],
...     'favorite': ['Brittany', 'Normandie', 'Hamburg', 'Bavaria'],
... })
>>> df.to_parquet("sample.parquet")
>>> datamart_profiler.process_dataset("sample.parquet")
{'nb_rows': 4, 'nb_profiled_rows': 4, 'nb_columns': 2, 'columns': [{'name': 'place', 'structural_type': 'http://schema.org/Text', 'semantic_types': [], 'num_distinct_values': 3}, {'name': 'favorite', 'structural_type': 'http://schema.org/Text', 'semantic_types': [], 'num_distinct_values': 4}], 'types': [], 'attribute_keywords': ['place', 'favorite']}
>>>
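
The returned metadata is a plain dictionary, so it can be inspected with ordinary Python. The sketch below uses the dictionary from the session above; the `summarize` helper is hypothetical and not part of datamart_profiler.

```python
# Metadata copied from the session above
metadata = {
    'nb_rows': 4, 'nb_profiled_rows': 4, 'nb_columns': 2,
    'columns': [
        {'name': 'place', 'structural_type': 'http://schema.org/Text',
         'semantic_types': [], 'num_distinct_values': 3},
        {'name': 'favorite', 'structural_type': 'http://schema.org/Text',
         'semantic_types': [], 'num_distinct_values': 4},
    ],
    'types': [], 'attribute_keywords': ['place', 'favorite'],
}


def summarize(metadata):
    """Return one summary line for the dataset and one per column."""
    lines = ["{} rows, {} columns".format(
        metadata["nb_rows"], metadata["nb_columns"])]
    for column in metadata["columns"]:
        # Keep only the last segment of the type URL, e.g. "Text"
        type_name = column["structural_type"].rsplit("/", 1)[-1]
        lines.append("{}: {} ({} distinct)".format(
            column["name"], type_name, column.get("num_distinct_values")))
    return lines


if __name__ == "__main__":
    for line in summarize(metadata):
        print(line)
```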

Minio Creds

Username: devkey
Password: devpassword

Ports:

The web interface is at http://localhost:8001
Flask is at http://localhost:5001
The API for executing SQL queries is at http://localhost:5001/query
The API is at http://localhost:8002/api/v1 (behind HAProxy)
Elasticsearch is at http://localhost:8020
The Lazo server is at http://localhost:8030
The RabbitMQ management interface is at http://localhost:8010
The RabbitMQ metrics are at http://localhost:8012
The Minio interface is at http://localhost:8050 (if you use that)
The HAProxy statistics are at http://localhost:8004
Prometheus is at http://localhost:8040
Grafana is at http://localhost:8041

Prometheus is configured to automatically find the containers (see prometheus.yml)
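
To see which services are up after starting the containers, a small script can try a TCP connection to each port listed above. This is a minimal sketch: the service labels are informal, the `is_open` helper is hypothetical, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports taken from the list above; the labels are informal
SERVICES = {
    "frontend": 8001,
    "flask": 5001,
    "api": 8002,
    "elasticsearch": 8020,
    "lazo": 8030,
    "rabbitmq-management": 8010,
    "rabbitmq-metrics": 8012,
    "minio": 8050,
    "haproxy-stats": 8004,
    "prometheus": 8040,
    "grafana": 8041,
}


def is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "up" if is_open("localhost", port) else "down"
        print("{:<20} {:>5}  {}".format(name, port, state))
```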
