Scrap files

Adding a new website to Cathode means creating a JSON file, which we call a scrap file.

Each scrap includes the author's contacts, general information about the website and a list of recipes.

Each recipe tells Cathode's engine what to do for a certain section of the website the scrap was made for. For instance, you can have one recipe to parse the website's homepage and another for the article pages. You can have as many recipes per website as you want. The correct recipe is chosen by matching the request against the regexp in each recipe's url field.

Below is a scrap for Twitter, twitter.com.json:

{
    "name": "Twitter",
    "author": {
        "name": "Celso Martinho",
        "email": "celso at brpx dot com",
        "github": "celso",
        "twitter": "celso"
    },
    "recipes": [
        {
            "title": "User feed",
            "url": "/.*",
            "cache": 320,
            "scope": ".content",
            "fields": [
                {
                    "title": "p.TweetTextSize@html | spaceout | strip | trim | clean",
                    "embed": "p.TweetTextSize a[data-pre-embedded=\"true\"]@html",
                    "url": "small.time a@href",
                    "date": "small.time span._timestamp@data-time",
                    "thumb": "div.AdaptiveMedia-photoContainer@data-image-url"
                }
            ]
        }
    ]
}

The scrap file must be named [sub.]domain.com.json.
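As a rough illustration of the naming rule, the sketch below derives a scrap file name from a URL. The helper name scrap_filename and the blog.example.com host are hypothetical, and keeping the hostname exactly as given (subdomain included) is an assumption about how the rule is applied.

```python
from urllib.parse import urlsplit


def scrap_filename(url: str) -> str:
    """Return the scrap file name for a URL or bare hostname (hypothetical helper)."""
    # urlsplit only fills in the hostname when the input has a scheme or starts with //.
    host = urlsplit(url if "//" in url else "//" + url).hostname
    if not host:
        raise ValueError(f"no hostname in {url!r}")
    # Assumption: the hostname is used as-is, so a subdomain stays in the name.
    return host + ".json"


print(scrap_filename("https://twitter.com/celso"))
print(scrap_filename("blog.example.com"))
```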

Header

The header of a scrap starts with a few attributes:

name (mandatory) - Name of the scrap. Usually the name of the website.
author/name (mandatory) - The name of the author of this scrap.
author/email (optional) - E-mail of the author. In the future we may use the author's contacts to notify them of deprecated or failing scraps.
author/github (optional) - GitHub username
author/twitter (optional) - Twitter username (no @)
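A quick way to catch a missing mandatory attribute before submitting a scrap is a small check like the one below. This is only a sketch: the function name is made up, and Cathode itself may validate scraps differently.

```python
import json

# The two header attributes the list above marks as mandatory.
MANDATORY = [("name",), ("author", "name")]


def missing_header_fields(scrap: dict) -> list[str]:
    """Return the mandatory header attributes that are absent or empty."""
    missing = []
    for path in MANDATORY:
        node = scrap
        for key in path:
            node = node.get(key) if isinstance(node, dict) else None
        if not node:
            missing.append("/".join(path))
    return missing


scrap = json.loads('{"name": "Twitter", "author": {"email": "celso at brpx dot com"}}')
print(missing_header_fields(scrap))  # ['author/name']
```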

Recipe

Each recipe includes the following fields:

title (mandatory) - Title of the recipe
url (mandatory) - URL regexp. It can be relative to the scrap's top level domain (as the example shows), protocol-relative (ex: //twitter.com/.*) or a fully qualified URL (ex: https://twitter.com/.*).
cache (optional) - Cache lifetime in seconds for this recipe. Cathode defaults to 320 seconds; set this field to override it.
scope (optional) - You can scope the query to a specific element selector that matches a class, id or attribute.
fields (mandatory) - The fields you'd like to extract with the recipe. Details below.
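To make the url and cache fields more concrete, here is a minimal sketch of how a recipe could be picked for a request. Several things are assumptions rather than Cathode's documented behaviour: the first matching recipe wins, the regexp is anchored at the start of the string, the query string is ignored, and the "Status" recipe is invented for the example.

```python
import re
from urllib.parse import urlsplit

DEFAULT_CACHE = 320  # seconds, the default stated above


def pick_recipe(recipes, request_url):
    """Return the first recipe whose url regexp matches the request, else None."""
    parts = urlsplit(request_url)
    path = parts.path or "/"
    for recipe in recipes:
        pattern = recipe["url"]
        if pattern.startswith("//"):
            # Protocol-relative pattern: compare against //host/path.
            target = f"//{parts.netloc}{path}"
        elif "://" in pattern:
            # Fully qualified pattern: compare against scheme://host/path.
            target = f"{parts.scheme}://{parts.netloc}{path}"
        else:
            # Relative pattern: compare against the path only.
            target = path
        if re.match(pattern, target):
            return recipe
    return None


recipes = [
    {"title": "Status", "url": r"/[^/]+/status/\d+", "cache": 60},  # hypothetical
    {"title": "User feed", "url": "/.*"},
]
recipe = pick_recipe(recipes, "https://twitter.com/celso")
print(recipe["title"], recipe.get("cache", DEFAULT_CACHE))
```

Listing the more specific pattern first matters in this sketch, because /.* would otherwise match every path.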

Fields

fields is either a flat object or an array containing a single object, and that object holds a list of name/query pairs. If the object is embedded in an array, [ object ], you're looking for a collection of items that repeats with the same pattern across the document.

Common field names you should use if you want multiple output formats (ex: rss) to work as expected:

title - title of the item
description or body - longer description
url - item external link
date - date
image - image for item
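The two shapes of fields, and the common names listed above, can be checked with a few lines of code. The sketch below is a hypothetical lint helper, not part of Cathode: it reports whether a recipe describes a collection and which of the common names it does not use.

```python
# Groups of interchangeable common names, taken from the list above.
COMMON_NAMES = [("title",), ("description", "body"), ("url",), ("date",), ("image",)]


def describe_fields(fields):
    """Return (is_collection, unused_common_names) for a recipe's fields."""
    # An array wrapping one object means "collection of repeated items".
    is_collection = isinstance(fields, list)
    mapping = fields[0] if is_collection else fields
    # Drop anything after a pipe in the field name, e.g. "text | join" -> "text".
    names = {key.split("|")[0].strip() for key in mapping}
    unused = [" or ".join(group) for group in COMMON_NAMES if not names.intersection(group)]
    return is_collection, unused


twitter_fields = [{
    "title": "p.TweetTextSize@html | spaceout | strip | trim | clean",
    "embed": "p.TweetTextSize a[data-pre-embedded=\"true\"]@html",
    "url": "small.time a@href",
    "date": "small.time span._timestamp@data-time",
    "thumb": "div.AdaptiveMedia-photoContainer@data-image-url",
}]
print(describe_fields(twitter_fields))
```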

The query follows a jQuery-selector-like syntax. The examples below show it in action.
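Judging from the Twitter scrap shown earlier, a query reads as a selector, optionally followed by @attribute, optionally followed by one or more | filter steps. The sketch below splits a query string into those three parts. It is an illustration only: parse_query is a made-up name, and this naive split would break on a selector or a match regexp that itself contains @ or |.

```python
def parse_query(query: str):
    """Split 'selector@attribute | filter | filter' into its parts."""
    head, *filters = [part.strip() for part in query.split("|")]
    selector, sep, attribute = head.rpartition("@")
    if not sep:
        # No @ present: the whole head is the selector and there is no attribute.
        selector, attribute = head, None
    return selector.strip(), attribute, filters


print(parse_query("p.TweetTextSize@html | spaceout | strip | trim | clean"))
print(parse_query("small.time a@href"))
print(parse_query("head title"))
```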

Query examples

Here are a few examples to give an idea of the kinds of queries you can perform on a page.

Let's use the following document structure as the input:

<html>
    <head><title>Titly</title></head>
    <body>
        <img src="/imgs/logo.png" class="center">
        <p>Company logo</p>
        <h1>List of items</h1>
        <ul>
            <li>First item</li>
            <li>Second item</li>
        </ul>
        <h2>Story</h2>
        <div class="story">
            <p>Would you tell me, please, which way I ought to go from here?</p>
            <p>That depends a good deal on where you want to get to, said the Cat.</p>
        </div>
        <table>
            <tr>
                <td>John</td>
                <td>Doe</td>
            </tr>
            <tr>
                <td>Mike</td>
                <td>Albert</td>
            </tr>
        </table>
        <div id="disclaimer">Generic disclaimer</div>
    </body>
</html>

Grabbing the title

{
    "fields": { "title": "head title" }
}

[ { "title": "Titly" } ]

Grabbing the items

{
    "scope": "ul",
    "fields": [ { "items": "li" } ]
}

{ "items": [ "First item", "Second item" ] }

Grabbing the image location

{
    "fields": { "imgsrc": "img@src" }
}

[ { "imgsrc": "/imgs/logo.png" } ]

Grabbing the story, cleaning the paragraphs, and joining them in a single string

{
    "fields": { "text | join": [ "body .story p | clean" ] }
}

[ { "text": "Would you tell me, please, which way I ought to go from here? That depends a good deal on where you want to get to, said the Cat." } ]

Grabbing the table rows, using scope

{
    "scope": "table tr",
    "fields": [ {
        "firstName": "td:nth-child(1)",
        "secondName": "td:nth-child(2)"
    } ]
}

[ { "firstName": "John", "secondName": "Doe" },
  { "firstName": "Mike", "secondName": "Albert" } ]
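The table example shows the key idea behind scope with an array of fields: every element matched by the scope yields one item, and each field query is evaluated inside that element. The sketch below reproduces just this one case with Python's standard html.parser. It is not a general selector engine and not how Cathode works internally; RowCollector and scoped_collection are hypothetical names.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each td, grouped by the tr that contains it."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # a new scope match starts a new item
        elif tag == "td" and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def scoped_collection(html, names):
    """Map the n-th td of every tr to the n-th field name."""
    parser = RowCollector()
    parser.feed(html)
    return [dict(zip(names, row)) for row in parser.rows]


table = """<table>
  <tr><td>John</td><td>Doe</td></tr>
  <tr><td>Mike</td><td>Albert</td></tr>
</table>"""
print(scoped_collection(table, ["firstName", "secondName"]))
```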

Available filters

There are two types of filters you can use: pre and post.

Pre filters are applied to the queries, at the extraction level.

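Post filters are applied to the end result, at the delivery level.

You can see both at work in the "Grabbing the story" example above: clean is a pre filter attached to the query, and join is a post filter attached to the field name.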

Pre filters

trim - trims spaces from both ends of the string
reverse - reverse string
slice:start,end - slice string from offset start to end
strip - strips html from result
spaceout - unglues words from tags, adding a space between
clean - cleans up excessive spaces and carriage returns, turns them into a single space
match:regexp - powerful regular expression filter
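To give a feel for what each pre filter does, here is a rough Python approximation of the list above. The exact behaviour in Cathode may differ; in particular, the regular expressions used for strip and spaceout, and the choice to have match return the first match or an empty string, are assumptions.

```python
import re


def apply_pre_filter(value: str, spec: str) -> str:
    """Apply one pre filter, written as 'name' or 'name:argument'."""
    name, _, arg = spec.partition(":")
    if name == "trim":
        return value.strip()
    if name == "reverse":
        return value[::-1]
    if name == "slice":
        start, _, end = arg.partition(",")
        return value[int(start) if start.strip() else None:int(end) if end.strip() else None]
    if name == "strip":
        return re.sub(r"<[^>]+>", "", value)  # drop anything that looks like a tag
    if name == "spaceout":
        return re.sub(r"(<[^>]+>)", r" \1 ", value)  # pad tags with spaces
    if name == "clean":
        return re.sub(r"\s+", " ", value)  # collapse whitespace runs
    if name == "match":
        found = re.search(arg, value)
        return found.group(0) if found else ""  # assumption: first match or ""
    raise ValueError(f"unknown pre filter: {name}")


def run_pre_filters(value: str, specs) -> str:
    """Apply the filters left to right, as in 'spaceout | strip | trim | clean'."""
    for spec in specs:
        value = apply_pre_filter(value, spec)
    return value


html = '<p>Hello<a href="#">world</a></p>'
print(run_pre_filters(html, ["spaceout", "strip", "trim", "clean"]))  # Hello world
```

Order matters here: without spaceout before strip, the two words in the example would be glued together as "Helloworld".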

Post filters

join - joins an array of results into a single string
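In the "Grabbing the story" example, the post filter is written on the field name ("text | join") rather than on the query. The sketch below shows one way that could be processed: split the name from its post filters, then apply join to the list of extracted strings. The single-space separator follows the output shown in that example; the function names are hypothetical.

```python
def split_field_name(key: str):
    """Split 'text | join' into the output name and its post filters."""
    name, *post_filters = [part.strip() for part in key.split("|")]
    return name, post_filters


def apply_post_filters(value, post_filters):
    """Apply post filters to the extracted result; only join is listed above."""
    for post_filter in post_filters:
        if post_filter == "join":
            value = " ".join(value)
        else:
            raise ValueError(f"unknown post filter: {post_filter}")
    return value


paragraphs = [
    "Would you tell me, please, which way I ought to go from here?",
    "That depends a good deal on where you want to get to, said the Cat.",
]
name, post_filters = split_field_name("text | join")
print({name: apply_post_filters(paragraphs, post_filters)})
```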

Output formats

The output format is determined by the extension of the Cathode feed URL.

Currently, we support the following output transformations:

JSON - JSON format (example here).
RSS - RSS feed format (example here).
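Choosing the format from the URL extension could look like the sketch below. The .json and .rss extensions and the cathode.example feed URL are assumptions made for illustration; the real feed URL layout is not described here.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed mapping from feed URL extension to output format.
FORMATS = {".json": "JSON", ".rss": "RSS"}


def output_format(feed_url: str):
    """Return the output format implied by the feed URL extension, or None."""
    extension = posixpath.splitext(urlsplit(feed_url).path)[1].lower()
    return FORMATS.get(extension)


print(output_format("https://cathode.example/twitter.com/celso.rss"))
print(output_format("https://cathode.example/twitter.com/celso.json"))
```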

Other notes

Cathode's engine is loosely based on X-ray, with changes. X-ray is an excellent, modern web scraper known for its flexible schema, composable API and modular architecture. Its query syntax is based on enhanced jQuery-like strings.

cathode-cli

cathode-cli is a command line tool that uses Cathode's API and helps developers build and test their scraps.
