Tag-Pag: A Dedicated Tool for Systematic Web Page Annotations

Description
Quickstart
Quick Configuration
Further Documentation
Additional Resources

Description

Tag-Pag is a web application for labeling scraped webpages. It lets you annotate and tag web content for further analysis, using a set of labels that you define for your own project.

Quickstart

For more information, see the "Further Documentation" section below.

Clone the Repository

git clone https://github.com/Pantonius/TagPag.git
cd TagPag

Set Up a Virtual Environment

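For example, install pyenv (together with the pyenv-virtualenv plugin, which provides the pyenv virtualenv command) by following their instructions, then set up and activate a virtual environment for the project:

pyenv install 3.12.7
pyenv virtualenv 3.12.7 tagpag-env
pyenv activate tagpag-env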

On Windows, you can instead install conda by following its instructions and set up a virtual environment for the project:

conda create -n tagpag-env python=3.12.7
conda activate tagpag-env

Install the Requirements

pip install -r requirements.txt

or

conda install --file requirements.txt

Start the Project

streamlit run src/app.py

Notice that a new .env file has been created from the .env-example file, which uses the example data located in example_workdir.
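The copy step described above can be pictured with the following minimal sketch. It is not TagPag's actual startup code; the function name ensure_env_file is hypothetical, and the sketch only assumes that .env-example sits in the project root.

```python
import shutil
from pathlib import Path


def ensure_env_file(project_dir: str = ".") -> bool:
    """Copy .env-example to .env if .env is missing.

    Hypothetical helper for illustration; returns True if a copy was made.
    """
    root = Path(project_dir)
    env_path = root / ".env"
    example_path = root / ".env-example"
    if env_path.exists():
        # Never overwrite an existing configuration.
        return False
    shutil.copyfile(example_path, env_path)
    return True
```

Because an existing .env is left untouched, any edits you make to it survive later restarts under this scheme.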

At this point you can explore the application. The usage documentation may be helpful here.

Quick Configuration

Open the .env file with any text editor. If the file does not exist, create it by copying .env-example to .env (e.g., cp .env-example .env).
Set WORKING_DIR (the directory that will contain all the data of the project) and LABELS (a comma-separated list of the labels that will be used to tag the webpages).

WORKING_DIR = '/PATH/TO/WORKING_DIRECTORY'
LABELS = 'label_1,label_2,label_3'
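To make the format of these two settings concrete, here is a small standard-library sketch that reads lines of the form KEY = 'value' and splits LABELS on commas. It is an illustration only: parse_env_text and parse_labels are hypothetical names, and TagPag itself may load its configuration differently.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY = 'value' lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and single or double quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def parse_labels(raw: str) -> list[str]:
    """Split a comma-separated label string into a clean list."""
    return [label.strip() for label in raw.split(",") if label.strip()]


example = """
WORKING_DIR = '/PATH/TO/WORKING_DIRECTORY'
LABELS = 'label_1,label_2,label_3'
"""
config = parse_env_text(example)
print(config["WORKING_DIR"])           # /PATH/TO/WORKING_DIRECTORY
print(parse_labels(config["LABELS"]))  # ['label_1', 'label_2', 'label_3']
```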

Make sure that the TASKS_FILE and HTML_DIR are located inside WORKING_DIR.

The TASKS_FILE must contain at least two columns, whose names are set by TASKS_ID_COLUMN (default: _id) and TASKS_URL_COLUMN (default: url).

The HTML_DIR should contain the HTML files associated with the tasks. Name each file TASK_ID.html, where TASK_ID is the task's value in the TASKS_ID_COLUMN.

Your folder structure should look like this:

WORKING_DIR/
├── TASKS_FILE
└── HTML_DIR
    ├── FIRST_ID.html
    ├── SECOND_ID.html
    └── ...
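Before starting the app, the layout above can be checked with a short script such as the sketch below. It is not part of TagPag: check_working_dir is a hypothetical helper, it assumes the tasks file is a CSV, and its default file, folder, and column names are taken from the example layout and the defaults mentioned in this section.

```python
import csv
from pathlib import Path


def check_working_dir(working_dir, tasks_file="tasks.csv", html_dir="html",
                      id_column="_id", url_column="url") -> list[str]:
    """Return a list of problems found in the working directory (empty if none)."""
    root = Path(working_dir)
    tasks_path = root / tasks_file
    html_path = root / html_dir

    problems = []
    if not tasks_path.is_file():
        problems.append(f"missing tasks file: {tasks_path.name}")
    if not html_path.is_dir():
        problems.append(f"missing HTML directory: {html_path.name}")
    if problems:
        return problems

    with tasks_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        for column in (id_column, url_column):
            if column not in columns:
                problems.append(f"missing column: {column}")
        if id_column in columns:
            # Every task ID needs a matching TASK_ID.html file.
            for row in reader:
                page = html_path / f"{row[id_column]}.html"
                if not page.is_file():
                    problems.append(f"missing HTML file: {page.name}")
    return problems
```

Calling check_working_dir("example_workdir") from the repository root should return an empty list if the example data follows the naming scheme described here.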

If you copied .env from .env-example, the program will assume the following layout and names:

example_workdir
├── tasks.csv
└── html
    ├── FIRST_ID.html
    ├── SECOND_ID.html
    └── ...

(Re)start TagPag

streamlit run src/app.py

Further Documentation

A more detailed guide to setting up the project can be found in the doc folder. It will lead you through the process using the example data of the example_workdir.

Additional Resources

For more information on how to use Streamlit, refer to the Streamlit Documentation.
