Multipage command
Sometimes information on a website is spread over multiple pages that all share the same structure, as on blogs or news sites. Some sites even load more content dynamically upon some trigger event, without reloading the page at all. The Multipage command allows the parser to save and combine all of this content. The inputs provided to this command dictate how to obtain the new page or content and, optionally, when to stop if there is too much of it.
Depending on the page we are dealing with, the Multipage command must be given different inputs. The simplest case is the typical static page with a navigation button that sends the user to the next page. The Next option tells the parser how to find the element that must be clicked. The parser clicks that element on each page for as long as one is found, and stops otherwise. Besides the Next option, the standard Group or Save input is required as well. A simple example is shown below, which collects the names of the last 3 Pokemon, each shown on a separate page.
Url: https://iilaurens.github.io/urlsave/pages/249.html
Multipage:
  Next: //a[contains(.,"next")]
  Save: //td[@title="name"]
>> ["Lugia", "Ho-Oh", "Celebi"]
Please note, however, that the current Multipage implementation is not optimized to handle the Save input, which might result in unexpected behaviour. It is advised to use Group instead.
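The stopping rule described above (keep following the Next element until none is found) can be illustrated with a minimal Python sketch. It is not the parser's actual implementation: the in-memory `PAGES` dictionary and the `follow_next` function are hypothetical stand-ins for the three static pages and for the click-and-save loop.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the three static pages: each entry holds the
# value to save and the id of the page its "next" link points to.
PAGES: Dict[str, dict] = {
    "249": {"name": "Lugia", "next": "250"},
    "250": {"name": "Ho-Oh", "next": "251"},
    "251": {"name": "Celebi", "next": None},  # no "next" link on the last page
}


def follow_next(start: str) -> List[str]:
    """Save the name on each page, then follow "next" until none is found."""
    saved: List[str] = []
    current: Optional[str] = start
    while current is not None:
        page = PAGES[current]
        saved.append(page["name"])
        current = page["next"]
    return saved


print(follow_next("249"))  # ['Lugia', 'Ho-Oh', 'Celebi']
```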
If there are too many pages and you only want to visit a few, you can set a limit using Max pages. The example below obtains the names and corresponding types of the first 9 starter Pokemon.
Url: https://iilaurens.github.io/urlsave/pages/001.html
Multipage:
  Next: //a[contains(.,"next")]
  Max pages: 9
  Group:
    By: /html
    Keys: .//td[@title="name"]
    Save: .//td[@title="type"]/span --keep-list
>> {"Bulbasaur": ["Grass", "Poison"],
>> "Ivysaur": ["Grass", "Poison"],
>> "Venusaur": ["Grass", "Poison"],
>> "Charmander": ["Fire"],
>> "Charmeleon": ["Fire"],
>> "Charizard": ["Fire", "Flying"],
>> "Squirtle": ["Water"],
>> "Wartortle": ["Water"],
>> "Blastoise": ["Water"]}
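As a rough illustration of how a page limit interacts with grouping, the sketch below merges one key and value list per page into a single dictionary and stops after `max_pages` pages. It assumes, in line with the example above where a limit of 9 yields 9 entries, that the starting page counts towards the limit. The `PAGES` data and the `group_pages` function are hypothetical and only mimic the behaviour.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the site: page id -> (name, types, next page id).
PAGES: Dict[str, Tuple[str, List[str], Optional[str]]] = {
    "001": ("Bulbasaur", ["Grass", "Poison"], "002"),
    "002": ("Ivysaur", ["Grass", "Poison"], "003"),
    "003": ("Venusaur", ["Grass", "Poison"], "004"),
    "004": ("Charmander", ["Fire"], None),
}


def group_pages(start: str, max_pages: Optional[int] = None) -> Dict[str, List[str]]:
    """Combine name -> types over successive pages, visiting at most max_pages."""
    combined: Dict[str, List[str]] = {}
    current: Optional[str] = start
    visited = 0
    # Stop when there is no next page or when the page limit is reached.
    while current is not None and (max_pages is None or visited < max_pages):
        name, types, next_id = PAGES[current]
        combined[name] = list(types)
        visited += 1
        current = next_id
    return combined


print(group_pages("001", max_pages=3))
```

Without a limit, the same call walks all four hypothetical pages and stops only when no next page exists.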
Some websites dynamically load more content on the same page, for example when a load button is pressed or when the user scrolls to the bottom of the page. Both cases are supported. Note that sites generally add new data without removing the content already shown on the page. To prevent double counting, we therefore need to tell the Multipage command that the content is cumulative.
The example below shows a page that dynamically loads more Pokemon at the press of a button. With the statement Cumulative: True added, the parser attempts to load all the Pokemon before parsing the page's content. The number of load actions can again be limited by the optional Max pages input.
Url: https://iilaurens.github.io/urlsave/pages/pokemons button load.html
Multipage:
  Next: //a[contains(., "Load")]
  Cumulative: True
  Max pages: 3
  Group:
    By: //tr[@class="row"]
    Keys: .//td[@title="name"]
    Save: .//td[@title="type"]/span --keep-list
>> {"Bulbasaur": ["Grass", "Poison"],
>> "Ivysaur": ["Grass", "Poison"],
>> ...
>> ...
>> "Mewtwo": ["Psychic"],
>> "Mew": ["Psychic"]}
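To see why cumulative content needs special treatment, the following Python sketch contrasts parsing after every load action with loading first and parsing once. Everything in it is hypothetical: `FakePage` simulates a page that reveals more rows per click, and the sketch assumes that the initial view counts as the first page towards `max_pages`, which the text above does not specify.

```python
from typing import List


class FakePage:
    """Hypothetical page that reveals `batch` more rows per click on "Load"."""

    def __init__(self, rows: List[str], batch: int) -> None:
        self._rows = rows
        self._batch = batch
        self._shown = batch

    def visible_rows(self) -> List[str]:
        return self._rows[: self._shown]

    def click_load(self) -> bool:
        """Reveal more rows; return False when there is nothing left to load."""
        if self._shown >= len(self._rows):
            return False
        self._shown += self._batch
        return True


def parse_each_step(page: FakePage, max_pages: int) -> List[str]:
    """Non-cumulative handling: parse after every load, so old rows repeat."""
    saved = list(page.visible_rows())
    pages = 1
    while pages < max_pages and page.click_load():
        saved.extend(page.visible_rows())
        pages += 1
    return saved


def parse_cumulative(page: FakePage, max_pages: int) -> List[str]:
    """Cumulative handling: perform all load actions first, then parse once."""
    pages = 1
    while pages < max_pages and page.click_load():
        pages += 1
    return list(page.visible_rows())


ROWS = ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander"]
print(parse_each_step(FakePage(ROWS, 2), 3))   # first two names appear twice
print(parse_cumulative(FakePage(ROWS, 2), 3))  # each name appears once
```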
Alternatively, if the page requires scrolling to the bottom to load new content (instead of pressing a button), one can add Scrolling: True as an input. The number of scrolls is again limited by the optional Max pages argument.
Url: https://iilaurens.github.io/urlsave/pages/pokemons scroll load.html
Multipage:
  Scrolling: True
  Max pages: 3
  Group:
    By: //tr[@class="row"]
    Keys: .//td[@title="name"]
    Save: .//td[@title="type"]/span --keep-list
>> {"Bulbasaur": ["Grass", "Poison"],
>> "Ivysaur": ["Grass", "Poison"],
>> ...
>> ...
>> "Mewtwo": ["Psychic"],
>> "Mew": ["Psychic"]}