When scraping dynamic websites, it’s common to encounter “Load More” or “Next” buttons that must be clicked to reveal new content. Crawl4AI provides a straightforward way to handle these situations using JavaScript execution and waiting conditions. In this tutorial, we’ll cover two approaches:
- Step-by-step (Session-based) Approach: Multiple calls to `arun()` to progressively load more content.
- Single-call Approach: Execute a more complex JavaScript snippet inside a single `arun()` call to handle all clicks at once before the extraction.
- A working installation of Crawl4AI
- Basic familiarity with Python’s `async`/`await` syntax
Use a session ID to maintain state across multiple `arun()` calls:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

js_code = [
    # This JS finds the "Next" button and clicks it
    "const nextButton = document.querySelector('button.next'); nextButton && nextButton.click();"
]

wait_for_condition = "css:.new-content-class"

async def main():
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        # 1. Load the initial page
        result_initial = await crawler.arun(
            url="https://example.com",
            cache_mode=CacheMode.BYPASS,
            session_id="my_session"
        )

        # 2. Click the 'Next' button and wait for new content
        result_next = await crawler.arun(
            url="https://example.com",
            session_id="my_session",
            js_code=js_code,
            wait_for=wait_for_condition,
            js_only=True,
            cache_mode=CacheMode.BYPASS
        )

        # `result_next` now contains the updated HTML after clicking 'Next'

asyncio.run(main())
```
Key Points:
- `session_id`: Keeps the same browser context open across calls.
- `js_code`: Executes JavaScript in the context of the already loaded page.
- `wait_for`: Ensures the crawler waits until new content is fully loaded.
- `js_only=True`: Runs the JS in the current session without reloading the page.
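If a single CSS selector can’t express “the new content has arrived,” `wait_for` can also take a JavaScript predicate via the `js:` prefix. The sketch below is illustrative; the exact syntax may vary between Crawl4AI versions, and the `.item-card` selector and count are placeholder assumptions:

```python
# Sketch: wait until more than 20 items are rendered on the page.
# The `js:` prefix and the '.item-card' selector are assumptions for illustration.
wait_for_condition = "js:() => document.querySelectorAll('.item-card').length > 20"
```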
By repeating the `arun()` call multiple times and modifying the `js_code` (e.g., clicking different modules or pages), you can iteratively load all the desired content.
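For instance, a paginated listing could be walked with a loop like the sketch below. It reuses the selectors from the example above; the stop condition (comparing the returned HTML to the previous page), the `max_pages` cap, and reading the page markup from `result.html` are illustrative assumptions you may need to adapt:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

# Same example selector as above: click the "Next" button if it exists.
click_next = [
    "const nextButton = document.querySelector('button.next'); nextButton && nextButton.click();"
]

async def crawl_all_pages(url: str, max_pages: int = 10) -> list[str]:
    pages_html = []
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        # Initial load creates the session.
        result = await crawler.arun(
            url=url,
            session_id="my_session",
            cache_mode=CacheMode.BYPASS
        )
        pages_html.append(result.html)  # assumes `.html` holds the current page markup

        for _ in range(max_pages - 1):
            # Click "Next" in the same session and wait for fresh content.
            result = await crawler.arun(
                url=url,
                session_id="my_session",
                js_code=click_next,
                wait_for="css:.new-content-class",  # same example condition as above
                js_only=True,
                cache_mode=CacheMode.BYPASS
            )
            if result.html == pages_html[-1]:
                # Nothing changed after the click; assume the last page was reached.
                break
            pages_html.append(result.html)
    return pages_html

if __name__ == "__main__":
    all_pages = asyncio.run(crawl_all_pages("https://example.com"))
```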
If the page allows it, you can run a single `arun()` call with a more elaborate JavaScript snippet that:
- Iterates over all the modules or "Next" buttons
- Clicks them one by one
- Waits for content updates between each click
- Returns control to Crawl4AI for extraction once done.
Example snippet:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

js_code = [
    # Example JS that clicks multiple modules:
    """
    (async () => {
        const modules = document.querySelectorAll('.module-item');
        for (let i = 0; i < modules.length; i++) {
            modules[i].scrollIntoView();
            modules[i].click();
            // Wait for each module's content to load; adjust 100ms as needed
            await new Promise(r => setTimeout(r, 100));
        }
    })();
    """
]

async def main():
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            js_code=js_code,
            wait_for="css:.final-loaded-content-class",
            cache_mode=CacheMode.BYPASS
        )
        # `result` now contains all content after all modules have been clicked in one go.

asyncio.run(main())
```
Key Points:
- All interactions (clicks and waits) happen before the extraction.
- Ideal for pages where all steps can be done in a single pass.
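The same single-call pattern also covers the “Load More” buttons mentioned in the introduction: instead of iterating over known modules, the snippet keeps clicking until the button disappears. The sketch below is illustrative only; the `button.load-more` selector, the click cap, and the 500ms delay are placeholder assumptions to adapt to the target page.

```python
# A variation of the single-call approach for "Load More" buttons.
# The selector, click cap, and delay are illustrative assumptions.
js_click_load_more = [
    """
    (async () => {
        for (let i = 0; i < 20; i++) {                   // safety cap on clicks
            const btn = document.querySelector('button.load-more');
            if (!btn) break;                              // button gone: nothing left to load
            btn.scrollIntoView();
            btn.click();
            await new Promise(r => setTimeout(r, 500));   // give new items time to render
        }
    })();
    """
]
```

Pass `js_click_load_more` as the `js_code` argument of the same `arun()` call shown above, together with a `wait_for` condition that matches the fully loaded list.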
When choosing between the two approaches:

- Step-by-Step (Session-based):
  - Good when you need fine-grained control or must dynamically check conditions before clicking the next page.
  - Useful if the page requires multiple conditions checked at runtime.
- Single-call:
  - Perfect if the sequence of interactions is known in advance.
  - Cleaner code if the page’s structure is consistent and predictable.
Crawl4AI makes it easy to handle dynamic content:
- Use session IDs and multiple `arun()` calls for stepwise crawling.
- Or pack all actions into one `arun()` call if the interactions are well-defined upfront.
This flexibility ensures you can handle a wide range of dynamic web pages efficiently.