This web application is designed to label scraped webpages. It lets you annotate, tag, and categorize web content according to your own requirements so that it can be analyzed further.
For more information, see the next section "Further Documentation".
- Clone the Repository
git clone https://github.com/Pantonius/TagPag.git
cd TagPag
- Setup a Virtual Environment
For example, install pyenv as per its instructions and set up a virtual environment for the project:
pyenv install 3.12.7
pyenv virtualenv 3.12.7 tagpag-env
On Windows, for example, install conda as per its instructions and set up a virtual environment for the project:
conda create -n tagpag-env python=3.12.7
conda activate tagpag-env
- Install the Requirements
pip install -r requirements.txt
or
conda install --file requirements.txt
- Start the Project
streamlit run src/app.py
Notice that a new .env file has been created from the .env-example file, which uses the example data located in example_workdir.
At this point you can take a look around. Maybe the usage documentation can be of service.
- Open the .env file with any text editor. If the file does not exist, create one by copying the content of .env-example into .env (e.g., use cp .env-example .env).
- Set up a WORKING_DIR (i.e., a directory that will contain all the data of the project) and LABELS (i.e., the labels that will be used to tag the webpages).
WORKING_DIR = '/PATH/TO/WORKING_DIRECTORY'
LABELS = 'label_1,label_2,label_3'
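The two settings above are plain key/value pairs, with LABELS holding a comma-separated list. How TagPag itself reads the file is not described here, so the following is only a minimal standard-library sketch of how such lines could be parsed; the helper names `parse_env_text` and `split_labels` are hypothetical, and trimming whitespace around labels is an assumption.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY = 'value' lines; blank lines and # comments are skipped."""
    settings: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        settings[key.strip()] = value
    return settings


def split_labels(raw: str) -> list[str]:
    """Split a comma-separated label string (whitespace trimming is an assumption)."""
    return [label.strip() for label in raw.split(",") if label.strip()]


example = """
WORKING_DIR = '/PATH/TO/WORKING_DIRECTORY'
LABELS = 'label_1,label_2,label_3'
"""

settings = parse_env_text(example)
labels = split_labels(settings["LABELS"])
print(settings["WORKING_DIR"])
print(labels)
```

A quick check like this can help spot a stray quote or a trailing comma in LABELS before starting the application.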
- Make sure that the TASKS_FILE and HTML_DIR are in the WORKING_DIR.
The TASKS_FILE should contain at least two columns, which are defined by TASKS_ID_COLUMN (by default, _id) and TASKS_URL_COLUMN (by default, url).
The HTML_DIR should contain the HTML files that are associated with the tasks. The following naming scheme should be used for the HTML files: TASK_ID.html, where TASK_ID is the value of the TASKS_ID_COLUMN.
Your folder structure should look like this:
WORKING_DIR/
├── TASKS_FILE
└── HTML_DIR
    ├── FIRST_ID.html
    ├── SECOND_ID.html
    └── ...
If you copied .env from .env-example, the program will assume the following naming:
example_workdir
├── tasks.csv
└── html
    ├── FIRST_ID.html
    ├── SECOND_ID.html
    └── ...
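Before restarting, it can be useful to confirm that the working directory matches the layout described above. The sketch below is not part of TagPag; `check_workdir` is a hypothetical helper that assumes the tasks file is a comma-separated CSV and uses the example names (tasks.csv, html) and the default column names (_id, url) as defaults.

```python
import csv
from pathlib import Path


def check_workdir(working_dir, tasks_file="tasks.csv", html_dir="html",
                  id_column="_id", url_column="url"):
    """Return a list of problems found in the working directory (empty if none)."""
    root = Path(working_dir)
    tasks_path = root / tasks_file
    html_path = root / html_dir

    if not tasks_path.is_file():
        return [f"missing tasks file: {tasks_path}"]
    if not html_path.is_dir():
        return [f"missing HTML directory: {html_path}"]

    problems = []
    with tasks_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        missing = [c for c in (id_column, url_column) if c not in columns]
        if missing:
            return [f"tasks file lacks column(s): {', '.join(missing)}"]
        # Every task ID needs a matching TASK_ID.html file.
        for row in reader:
            task_id = row[id_column]
            expected = html_path / f"{task_id}.html"
            if not expected.is_file():
                problems.append(
                    f"no HTML file for task {task_id}: expected {expected.name}"
                )
    return problems


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for problem in check_workdir("example_workdir"):
        print(problem)
```

An empty result means every task ID has a matching HTML file; otherwise each entry names what is missing.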
- (Re)start TagPag
streamlit run src/app.py
A more detailed guide to setting up the project can be found in the doc folder. It will lead you through the process using the example data of the example_workdir.
For more information on how to use Streamlit, refer to the Streamlit Documentation.