A robust, asynchronous web crawler designed to extract product URLs from e-commerce websites. The crawler supports parallel processing of multiple domains while handling dynamic content and implementing various optimization strategies.
- Parallel crawling of multiple domains
- Dynamic content handling via Selenium
- Configurable depth and URL limits
- Automatic handling of JavaScript-loaded content
- Custom URL pattern matching for products
- Domain-specific exclusion patterns
- Comprehensive logging system
- Rate limiting and timeout controls
python >= 3.7
aiohttp
beautifulsoup4
selenium
- Clone the repository
- Create virtual environment:
python3 -m venv venv
- Activate the environment:
C:> venv\Scripts\activate.bat # For Windows
source venv/bin/activate # For Linux/MacOS
- Install dependencies:
pip install aiohttp beautifulsoup4 selenium
- Start the program:
python3 version2.py
Basic usage example:
domains = [
    "https://flipkart.com",
    "https://amazon.in"
]

crawler = ParallelCrawler(
    domains=domains,
    max_urls_per_domain=500,
    max_depth=3,
    timeout_seconds=3600
)

crawler.run()
- `domains`: List of domains to be crawled
- `max_urls_per_domain`: Maximum number of URLs to crawl per domain (default: 10000)
- `max_depth`: Maximum depth for crawling links (default: 10)
- `timeout_seconds`: Timeout for the crawling operation per domain, in seconds (default: 3600)
- `output_file`: Path for the JSON output file (default: 'product_urls.json')
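Putting the documented defaults together, a fully spelled-out configuration might look like the sketch below. The `version2` import path is an assumption based on the `python3 version2.py` run command above, not something stated in the project.

```python
from version2 import ParallelCrawler  # assumed module name, from `python3 version2.py`

crawler = ParallelCrawler(
    domains=["https://flipkart.com", "https://amazon.in"],
    max_urls_per_domain=10000,        # default
    max_depth=10,                     # default
    timeout_seconds=3600,             # default, applied per domain
    output_file="product_urls.json",  # default
)
crawler.run()
```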
Results are saved in JSON format:
{
  "flipkart.com": [
    "https://flipkart.com/product/1",
    "https://flipkart.com/product/2"
  ],
  "amazon.in": [
    "https://amazon.in/product/1"
  ]
}
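Assuming the default output path, the saved results can be consumed directly with the standard library:

```python
import json

# Load the saved results (default output path; adjust if output_file was changed)
with open("product_urls.json") as f:
    product_urls = json.load(f)

for domain, urls in product_urls.items():
    print(f"{domain}: {len(urls)} product URLs")
```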
The crawler logs all operations to both the console and the 'crawler.log' file, including:
- Crawling progress
- Error messages
- CAPTCHA detection
- Timeout warnings
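A dual console/file logging setup of this kind can be built with Python's standard `logging` module; the configuration below is an illustrative sketch, not the project's actual logging code.

```python
import logging

# Illustrative setup: send every log record to both the console and crawler.log
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[
        logging.StreamHandler(),             # console output
        logging.FileHandler("crawler.log"),  # persistent log file
    ],
)
logging.info("Crawling progress, errors, CAPTCHA detections and timeouts are reported here")
```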
- A ParallelCrawler object is initialized with the domains and configuration.
- The `ParallelCrawler.run()` method is called, which in turn calls the async method `ParallelCrawler.crawl_all_domains()` via asyncio to execute the asynchronous crawls (see the orchestration sketch after this list).
- A DomainCrawler object (`crawler`) is initialized for each domain, setting up parallel processing.
- Each DomainCrawler's `start()` method:
  - Initiates crawling from the domain's root URL
  - Manages the Selenium WebDriver lifecycle
  - Returns the collected product URLs
- `DomainCrawler.crawl_url()`: Extracts all URLs on a page and checks whether each URL is a product page by matching it against product-page URL patterns (see the classification sketch after this list).
  - If YES: adds it to the product URLs list
  - If NO: checks that it is within the depth limit and not excluded, then crawls it to find more product URLs
- `DomainCrawler.handle_dynamic_content()`: If a page uses JavaScript to load content (see the scrolling sketch after this list):
  - Opens the page in a headless browser
  - Scrolls to load all content
  - Extracts URLs from the fully loaded page
- The process continues until:
  - The maximum URL limit is reached
  - The maximum depth is reached
  - A timeout occurs
  - The user stops the program
- Finally, all collected product URLs are saved to a JSON file, organized by domain.
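Based on the `run()`/`crawl_all_domains()` description above, the orchestration likely resembles the following sketch. The method bodies, the placeholder `DomainCrawler`, and its constructor signature are assumptions for illustration, not the actual project source.

```python
import asyncio
import json

class DomainCrawler:
    """Placeholder standing in for the project's per-domain crawler."""
    def __init__(self, domain):
        self.domain = domain

    async def start(self):
        # The real start() crawls the domain; this stub just returns an empty list
        return []

class ParallelCrawler:
    def __init__(self, domains, max_urls_per_domain=10000, max_depth=10,
                 timeout_seconds=3600, output_file="product_urls.json"):
        self.domains = domains
        self.output_file = output_file  # remaining settings would be passed to each DomainCrawler

    async def crawl_all_domains(self):
        # One DomainCrawler per domain, all awaited concurrently
        crawlers = [DomainCrawler(domain) for domain in self.domains]
        results = await asyncio.gather(*(c.start() for c in crawlers))
        return dict(zip(self.domains, results))

    def run(self):
        # Synchronous entry point that drives the async crawl
        results = asyncio.run(self.crawl_all_domains())
        with open(self.output_file, "w") as f:
            json.dump(results, f, indent=2)
```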
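The product-page check in `crawl_url()` amounts to matching each discovered URL against product-page and exclusion patterns. The patterns below are illustrative examples only; the project's actual pattern lists may differ.

```python
import re

# Illustrative product-page patterns (e.g. /product/..., /p/..., /item/..., /dp/...)
PRODUCT_PATTERNS = [r"/product/", r"/p/", r"/item/", r"/dp/"]
# Illustrative exclusion patterns (login pages, carts, help sections, ...)
EXCLUDE_PATTERNS = [r"/login", r"/cart", r"/help"]

def classify_url(url: str) -> str:
    """Return 'product', 'crawl', or 'skip' for a discovered URL."""
    if any(re.search(p, url) for p in PRODUCT_PATTERNS):
        return "product"   # added to the product URLs list
    if any(re.search(p, url) for p in EXCLUDE_PATTERNS):
        return "skip"      # excluded from further crawling
    return "crawl"         # crawled further, subject to the depth limit
```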
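For dynamic content, `handle_dynamic_content()` renders the page in a headless browser and scrolls until no new content appears. A minimal version of that scroll loop, assuming Selenium with headless Chrome, could look like this sketch:

```python
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def fetch_fully_loaded_html(url: str, pause: float = 2.0) -> str:
    """Open the page headlessly, scroll until the height stops growing, return the HTML."""
    options = Options()
    options.add_argument("--headless")           # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        last_height = driver.execute_script("return document.body.scrollHeight")
        while True:
            # Scroll to the bottom so lazily loaded items are requested
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)                    # give new content time to load
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:        # no new content appeared
                break
            last_height = new_height
        return driver.page_source                # URLs can be extracted from this HTML
    finally:
        driver.quit()
```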
The project was developed in three phases, progressively enhancing functionality and robustness. Below is a detailed breakdown of each phase and the final solution:
In the initial phase, we focused on the basics of web crawling, such as:
- Extracting URLs from web pages.
- Navigating to related URLs within the same domain.
This phase laid the foundation for understanding web crawling mechanics and URL traversal.
Building on the basics, the second phase introduced advanced topics, including:
- Error Handling: Ensuring the crawler is resilient to broken links, timeouts, and other runtime errors.
- Dynamic Content Loading: Handling challenges like infinite scrolling and AJAX-loaded content.
- Captcha Handling: Investigating techniques to bypass or manage captchas efficiently.
- Authentication Handling: Addressing scenarios where login credentials are required to access certain pages.
This phase aimed to make the crawler more versatile and capable of handling real-world complexities.
In the final phase, we combined the learnings from both previous phases to create a comprehensive and efficient solution. Key features include:
- Basic URL Handling and Navigation: Leveraging the robust URL extraction and traversal techniques from `main.py`.
- Dynamic Content Handling: Incorporating strategies for managing infinite scrolling, AJAX-loaded elements, and other dynamic content challenges from `advance.py`.
- Product Page Identification:
  - Visiting the homepages of target domains and extracting all available URLs.
  - Filtering URLs based on patterns indicative of product pages.
  - Excluding generic and domain-specific patterns to improve accuracy.
This holistic approach ensures efficient URL discovery, robust error handling, and effective product page identification.