Link Content Scraper

A FastAPI web application that scrapes content from web pages and their linked pages, converting them to clean markdown format using the Jina Reader API.

Features

Scrapes content from a given URL and all its linked pages
Converts web content to clean markdown using Jina's Reader API
Handles PDF content (especially arXiv papers)
Rate limiting to respect API limits
Real-time progress tracking with SSE
Downloads results as a ZIP file
Filters out social media and image URLs
Batched processing with configurable timeouts

Requirements

Python 3.12+
uv for dependency management
Docker (optional)

Installation

Using uv (Recommended)

# Install dependencies
uv sync

# Activate virtual environment
uv venv

# Run the application
uvicorn main:app --reload

Using Docker

# Build the image
docker build -t scraper .

# Run the container
docker run -p 8000:8000 scraper

Usage

Open http://localhost:8000 in your browser
Enter a URL to scrape
Watch the real-time progress
Download the results as a ZIP file when complete

Configuration

The application has several configurable settings in main.py:

RATE_LIMIT = 15          # Requests per minute
RATE_PERIOD = 60         # Seconds
MAX_RETRIES = 3          # Number of retries for failed requests
RETRY_DELAY = 5          # Seconds between retries
PDF_TIMEOUT = 60.0       # Timeout for PDF content (seconds)
DEFAULT_TIMEOUT = 30.0   # Default timeout (seconds)

Dependencies

fastapi[standard]: Web framework
httpx: Async HTTP client
beautifulsoup4: HTML parsing
aiofiles: Async file operations
uvicorn: ASGI server

Development

The project uses modern Python tooling:

uv for dependency management
FastAPI for the web framework
Docker with multi-stage builds
Type hints throughout the codebase

Notes

The application uses Jina's free Reader API which has a rate limit of 20 requests per minute
PDF content (especially from arXiv) may take longer to process
Some URLs may be filtered out to avoid rate limits and irrelevant content

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
templates		templates
.dockerignore		.dockerignore
.gitignore		.gitignore
.python-version		.python-version
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
fly.toml		fly.toml
main.py		main.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Link Content Scraper

Features

Requirements

Installation

Using uv (Recommended)

Using Docker

Usage

Configuration

Dependencies

Development

Notes

License

About

Releases

Packages

Languages

License

JoshuaOliphant/Link-Content-Scraper

Folders and files

Latest commit

History

Repository files navigation

Link Content Scraper

Features

Requirements

Installation

Using uv (Recommended)

Using Docker

Usage

Configuration

Dependencies

Development

Notes

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages