
Save command

iiLaurens edited this page Sep 23, 2018 · 18 revisions

The top-level Save instruction parses the webpage's source code using XPath, structures the result, and stores it in the storage property of the Parser instance. Only XPath 1.0 is supported in this command, with the exception of an EXSLT extension for regular expressions.

Using the 'Save' instruction

The content described by the Save instruction determines the type of the output:

  • A single XPath:

    A single XPath returns a list containing every match of the given XPath, or collapses to a scalar if only a single match was found.
    Url: https://iilaurens.github.io/urlsave/pages/pokemons.html
    Save: //td[@title='name']
    
    >> ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Charmander', 'Charmeleon', ...]
    
    Url: https://iilaurens.github.io/urlsave/pages/pokemons.html
    Save: (//td[@title='name'])[1]
    
    >> 'Bulbasaur'
  • A list of XPaths:

    If a list with XPaths is provided, then each XPath in the list will be evaluated individually. The returned output is a list with size equal to the input list. Note that each individual XPath evaluation can return a list too, so that the end result is a nested list.
    Url: https://iilaurens.github.io/urlsave/pages/pokemons.html
    Save:
      - //td[@title='name']
      - //td[@title='French']
      - //td[@title='German']
      - //td[@title='Korean']
      - //td[@title='Japan']
    
    >> [['Bulbasaur',  'Ivysaur',    'Venusaur',   'Charmander','Charmeleon', ...],
    >>  ['Bulbizarre', 'Herbizarre', 'Florizarre', 'Salamèche', 'Reptincel',  ...],
    >>  ['Bisasam',    'Bisaknosp',  'Bisaflor',   'Glumanda',  'Glutexo',    ...],
    >>  ['이상해씨',    '이상해풀',    '이상해꽃',   '파이리',     '리자드',     ...],
    >>  ['フシギダネ',  'フシギソウ',  'フシギバナ',  'ヒトカゲ',   'リザード',   ...]]
  • A dict of XPaths:

    If a dict with XPaths is provided, then each XPath in the dict will be evaluated individually. The output is also a dict. The keys of the output dict equal the keys of the input dict, and the XPath evaluations are their respective values.
    Url: https://iilaurens.github.io/urlsave/pages/pokemons.html
    Save:
      English: //td[@title='name']
      French: //td[@title='French']
      German: //td[@title='German']
      Korean: //td[@title='Korean']
      Japanese: //td[@title='Japan']
    
    >> {'English':  ['Bulbasaur',  'Ivysaur',    'Venusaur',   'Charmander','Charmeleon', ...],
    >>  'French':   ['Bulbizarre', 'Herbizarre', 'Florizarre', 'Salamèche', 'Reptincel',  ...],
    >>  'German':   ['Bisasam',    'Bisaknosp',  'Bisaflor',   'Glumanda',  'Glutexo',    ...],
    >>  'Korean':   ['이상해씨',    '이상해풀',    '이상해꽃',   '파이리',     '리자드',     ...],
    >>  'Japanese': ['フシギダネ',  'フシギソウ',  'フシギバナ',  'ヒトカゲ',   'リザード',   ...]}
  • A nested save:

    For advanced uses it is also possible to nest Save instructions. The Save keyword only needs to appear once at the top; it does not have to be repeated at the nested levels.
    Url: https://iilaurens.github.io/urlsave/pages/pokemons.html
    Save:
      European:
        English: //td[@title='name']
        French: //td[@title='French']
        German: //td[@title='German']
      Asian:
        Korean: //td[@title='Korean']
        Japanese: //td[@title='Japan']
    
    >> {'European': {'English':  ['Bulbasaur',  'Ivysaur',    'Venusaur',   'Charmander','Charmeleon', ...],
    >>               'French':   ['Bulbizarre', 'Herbizarre', 'Florizarre', 'Salamèche', 'Reptincel',  ...],
    >>               'German':   ['Bisasam',    'Bisaknosp',  'Bisaflor',   'Glumanda',  'Glutexo',    ...]},
    >>  'Asian':    {'Korean':   ['이상해씨',    '이상해풀',    '이상해꽃',   '파이리',     '리자드',     ...],
    >>               'Japanese': ['フシギダネ',  'フシギソウ',  'フシギバナ',  'ヒトカゲ',   'リザード',   ...]}}
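
The way the output mirrors the shape of the Save content can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the library's implementation: the `save` helper and the inline `PAGE` markup are hypothetical, and the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) stands in for the real XPath engine.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the example page (hypothetical markup, not the real pokemons.html).
PAGE = """
<html><body>
  <h1>Pokemons</h1>
  <table>
    <tr><td title="name">Bulbasaur</td><td title="French">Bulbizarre</td></tr>
    <tr><td title="name">Ivysaur</td><td title="French">Herbizarre</td></tr>
  </table>
</body></html>
"""


def save(root, spec):
    """Hypothetical helper: return a result with the same shape as `spec`."""
    if isinstance(spec, dict):
        # A dict of XPaths (possibly nested) yields a dict with the same keys.
        return {key: save(root, value) for key, value in spec.items()}
    if isinstance(spec, list):
        # A list of XPaths yields a list of the same length.
        return [save(root, item) for item in spec]
    # A single XPath yields every match, collapsed to a scalar for one match.
    matches = [element.text for element in root.findall(spec)]
    return matches[0] if len(matches) == 1 else matches


root = ET.fromstring(PAGE)
names = save(root, ".//td[@title='name']")
title = save(root, ".//h1")
nested = save(root, {"European": {"English": ".//td[@title='name']",
                                  "French": ".//td[@title='French']"}})
print(names)
print(title)
print(nested)
```

The recursion is what makes nesting work without repeating the keyword: every dict or list level is handled the same way until a plain XPath string is reached.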

Using optional arguments

If an XPath is provided in a Save instruction, it is possible to provide additional arguments. These arguments are:

--text: This argument stops the preceding text from being parsed as an XPath and uses it *as-is*.
Url: https://iilaurens.github.io/urlsave/pages/pokemons.html
Save:
  Pokemon: (//td[@title='name'])[25]
  Owner: Ash --text

>> {"Pokemon": "Pikachu", "Owner": "Ash"}
--keep-list: Force a list as output, even if there is only one match.
# With --keep-list:
Url: https://iilaurens.github.io/urlsave/pages/pokemons.html
Save: (//td[@title='name'])[25] --keep-list

>> ["Pikachu"]

# Without --keep-list:
Url: https://iilaurens.github.io/urlsave/pages/pokemons.html
Save: (//td[@title='name'])[25]

>> "Pikachu"
--unique: This removes any duplicate values from the list whilst preserving the order.
Url: https://iilaurens.github.io/urlsave/pages/pokemons.html
Save: //span[@class="type"] --unique

>> ["Grass", "Poison", "Normal", "Fire", "Dragon", "Flying", "Water", "Dark", "Bug", "Psychic", "Ground", "Ice", "Electric", "Rock", "Fighting", "Ghost", "Curse", "Steel"]
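
Removing duplicates while preserving order can be sketched in plain Python as follows. This is an assumption about the behaviour described above, not the library's code; `unique_in_order` and the sample `types` list are hypothetical.

```python
def unique_in_order(values):
    """Drop duplicates, keeping the first occurrence of each value."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(values))


types = ["Grass", "Poison", "Grass", "Fire", "Poison"]
deduped = unique_in_order(types)
print(deduped)
```

A plain `set` would also remove the duplicates but would not guarantee the original order.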
--number: Attempts to convert the values from strings to numbers.
Url: https://iilaurens.github.io/urlsave/pages/pokemons.html
Save:
  Pokemon: (//td[@title="name"])[1]
  Capture rate: (//td[@title="capture"])[1] --number

# Notice the unquoted number
>> {"Pokemon": "Bulbasaur", "Capture rate": 45.0}
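
A rough sketch of such a conversion in Python is shown below. The result `45.0` in the example above suggests a conversion to float; what happens to values that are not numeric is not stated here, so leaving them unchanged is an assumption of this sketch. The `to_number` helper is hypothetical.

```python
def to_number(value):
    """Try to convert a string to a float; return it unchanged on failure (assumed)."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return value


converted = to_number("45")
unchanged = to_number("unknown")
print(converted, unchanged)
```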
--url: If the output is expected to be a URL, then relative URLs are resolved and transformed into absolute URLs based on the currently open page. Absolute URLs remain untouched.
Url: https://iilaurens.github.io/urlsave/pages/starters.html
Save: //a/@href --url

>> ["https://iilaurens.github.io/urlsave/pages/001.html",
>>  "https://iilaurens.github.io/urlsave/pages/002.html",
>>  "https://iilaurens.github.io/urlsave/pages/003.html",
>>  "https://iilaurens.github.io/urlsave/pages/004.html",
>>  "https://iilaurens.github.io/urlsave/pages/005.html",
>>  "https://iilaurens.github.io/urlsave/pages/006.html",
>>  "https://iilaurens.github.io/urlsave/pages/007.html",
>>  "https://iilaurens.github.io/urlsave/pages/008.html",
>>  "https://iilaurens.github.io/urlsave/pages/009.html"]
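
Resolving relative URLs against the current page can be illustrated with Python's standard `urllib.parse.urljoin`. Whether the library uses this function internally is an assumption; the `hrefs` values below are hypothetical and only cover the relative and absolute cases described above.

```python
from urllib.parse import urljoin

# The currently open page acts as the base URL.
base = "https://iilaurens.github.io/urlsave/pages/starters.html"

# Hypothetical href values: relative, root-relative and already absolute.
hrefs = ["001.html", "/urlsave/pages/002.html", "https://example.com/003.html"]

absolute = [urljoin(base, href) for href in hrefs]
for url in absolute:
    print(url)
```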

Dynamic keys

Sometimes a static definition of what we would like to obtain can become very verbose, particularly if there are many values that each need a different key in a dictionary. In that case dynamic keys might help. The following example shows how dynamic keys can be used to get the attack name and attack type as key-value pairs in a dictionary.

Url: https://iilaurens.github.io/urlsave/pages/025.html
Save:
  Keys(//table[@title="moves"]//tr//td[2]): //table[@title="moves"]//tr//td[@class="cen"][1]/span

>> { "ThunderShock": "Electric",
>>   "Growl":        "Normal",
>>   "Tail Whip":    "Normal",
>>   "Thunder Wave": "Electric",
>>   "Quick Attack": "Normal",
>>   "Double Team":  "Normal",
>>   "Slam":         "Normal",
>>   "Thunderbolt":  "Electric",
>>   "Agility":      "Psychic",
>>   "Thunder":      "Electric",
>>   "Light Screen": "Psychic"}
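
Conceptually, the result above looks like the matches of the key XPath paired one-to-one with the matches of the value XPath. The sketch below shows that pairing in plain Python; pairing by position is an assumption about the behaviour, and the two short lists are stand-ins for the XPath results.

```python
# Stand-ins for the matches of the key XPath and the value XPath.
keys = ["ThunderShock", "Growl", "Tail Whip"]
values = ["Electric", "Normal", "Normal"]

# Pair the n-th key with the n-th value.
moves = dict(zip(keys, values))
print(moves)
```

In this sketch, a key that occurs twice keeps only its last value, and `zip` silently stops at the shorter list, so both XPaths should match the same number of nodes.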