# Tutorial: Clicking Buttons to Load More Content with Crawl4AI

## Introduction

When scraping dynamic websites, it’s common to encounter “Load More” or “Next” buttons that must be clicked to reveal new content. Crawl4AI provides a straightforward way to handle these situations using JavaScript execution and waiting conditions. In this tutorial, we’ll cover two approaches:

1. **Step-by-step (session-based) approach:** multiple calls to `arun()` to progressively load more content.
2. **Single-call approach:** execute a more complex JavaScript snippet inside a single `arun()` call to handle all clicks before extraction.

## Prerequisites

- A working installation of Crawl4AI
- Basic familiarity with Python’s async/await syntax
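
If Crawl4AI isn’t installed yet, a typical setup looks like this (a sketch; recent releases also ship a `crawl4ai-setup` helper that installs the Playwright browser binaries the crawler needs):

```bash
pip install crawl4ai
crawl4ai-setup  # installs/verifies the headless browser (assumes a recent release)
```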

## Step-by-Step Approach

Use a session ID to maintain browser state across multiple `arun()` calls:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

js_code = [
    # This JS finds the "Next" button and clicks it (if present)
    "const nextButton = document.querySelector('button.next'); nextButton && nextButton.click();"
]

# Wait until an element matching this selector appears in the page
wait_for_condition = "css:.new-content-class"

async def main():
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        # 1. Load the initial page
        result_initial = await crawler.arun(
            url="https://example.com",
            cache_mode=CacheMode.BYPASS,
            session_id="my_session"
        )

        # 2. Click the 'Next' button and wait for the new content
        result_next = await crawler.arun(
            url="https://example.com",
            session_id="my_session",
            js_code=js_code,
            wait_for=wait_for_condition,
            js_only=True,
            cache_mode=CacheMode.BYPASS
        )
        # `result_next` now contains the updated HTML after clicking 'Next'

asyncio.run(main())
```

**Key Points:**

- `session_id`: keeps the same browser context (tab) open across calls.
- `js_code`: executes JavaScript in the context of the already loaded page.
- `wait_for`: makes the crawler wait until the new content is fully loaded.
- `js_only=True`: runs the JS in the current session without reloading the page.
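
Besides `css:` selectors, `wait_for` also accepts a `js:` prefix followed by a JavaScript predicate, which helps when “new content” is easier to express as a condition than as a selector. A minimal sketch (the `.item` selector and the threshold are illustrative):

```python
# Wait until the page holds more than 20 result items
wait_for_condition = "js:() => document.querySelectorAll('.item').length > 20"
```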

By repeating the `arun()` call and varying the `js_code` (e.g., clicking different modules or pages), you can iteratively load all of the desired content.
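
As a concrete sketch of that loop, the following keeps clicking “Next” until the `wait_for` condition stops being met or a page cap is reached (the selectors, the cap, and the stop condition are assumptions to adapt to the target site):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

async def crawl_all_pages(url: str, max_pages: int = 20):
    js_click_next = "const b = document.querySelector('button.next'); b && b.click();"
    pages = []
    async with AsyncWebCrawler(headless=True) as crawler:
        # Load page 1 in a named session so later calls reuse the same tab
        result = await crawler.arun(url=url, session_id="pager", cache_mode=CacheMode.BYPASS)
        pages.append(result.html)
        for _ in range(max_pages - 1):
            result = await crawler.arun(
                url=url,
                session_id="pager",
                js_code=[js_click_next],
                wait_for="css:.new-content-class",  # adjust to the site's markup
                js_only=True,
                cache_mode=CacheMode.BYPASS,
            )
            if not result.success:
                break  # e.g., wait_for timed out because no new page appeared
            pages.append(result.html)
    return pages

pages = asyncio.run(crawl_all_pages("https://example.com"))
```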

## Single-call Approach

If the page allows it, you can run a single `arun()` call with a more elaborate JavaScript snippet that:

- Iterates over all the modules or "Next" buttons
- Clicks them one by one
- Waits for content updates between each click
- Once done, returns control to Crawl4AI for extraction

Example snippet:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

js_code = [
    # Example JS that clicks each module in turn:
    """
    (async () => {
      const modules = document.querySelectorAll('.module-item');
      for (let i = 0; i < modules.length; i++) {
        modules[i].scrollIntoView();
        modules[i].click();
        // Wait for each module's content to load; adjust 100 ms as needed
        await new Promise(r => setTimeout(r, 100));
      }
    })();
    """
]

async def main():
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            js_code=js_code,
            wait_for="css:.final-loaded-content-class",
            cache_mode=CacheMode.BYPASS
        )
        # `result` now contains all content after every module
        # has been clicked in one go

asyncio.run(main())
```
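
Before consuming the output, it is worth checking that the crawl (including the `wait_for` condition) actually succeeded; a sketch using the result object's `success`, `markdown`, and `error_message` fields:

```python
# Inside main(), after the arun() call:
if result.success:
    print(result.markdown[:300])   # preview the extracted markdown
else:
    print("Crawl failed:", result.error_message)
```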

**Key Points:**

- All interactions (clicks and waits) happen before the extraction.
- Ideal for pages where all steps can be done in a single pass.
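
When the number of clicks isn’t known up front, the same single-call pattern works with a JS loop that keeps clicking until the button disappears. A sketch (the `button.load-more` selector and the 500 ms delay are assumptions to tune):

```python
js_code = [
    """
    (async () => {
      // Keep clicking 'Load More' until the button is gone or disabled
      let btn;
      while ((btn = document.querySelector('button.load-more')) && !btn.disabled) {
        btn.scrollIntoView();
        btn.click();
        // Give the page time to append new items
        await new Promise(r => setTimeout(r, 500));
      }
    })();
    """
]
```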

## Choosing the Right Approach

- **Step-by-Step (Session-based):**
  - Good when you need fine-grained control over each step.
  - Useful when conditions must be checked at runtime (e.g., whether a "Next" button still exists) before clicking again.
- **Single-call:**
  - Perfect if the sequence of interactions is known in advance.
  - Cleaner code when the page's structure is consistent and predictable.

## Conclusion

Crawl4AI makes it easy to handle dynamic content:

- Use session IDs and multiple `arun()` calls for stepwise crawling.
- Pack all actions into one `arun()` call when the interactions are well-defined up front.

This flexibility ensures you can handle a wide range of dynamic web pages efficiently.