GitHub - kepano/defuddle: Extract the main content from web pages.

de·fud·dle /diˈfʌdl/ transitive verb
to remove unnecessary elements from a web page, and make it easily readable.

Beware! Defuddle is very much a work in progress!

Defuddle extracts the main content from web pages. It cleans up web pages by removing clutter like comments, sidebars, headers, footers, and other non-essential elements, leaving only the primary content.

Key features

Defuddle aims to be a replacement for Mozilla Readability, with a few differences:

More forgiving, removes fewer uncertain elements
Uses a page's mobile styles to guess at unnecessary elements
Extracts more metadata from the page, including schema.org data
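
To illustrate what "schema.org data" usually means in practice: many pages embed it as JSON-LD inside `<script type="application/ld+json">` tags. The sketch below collects those blocks from an HTML string. It is not Defuddle's implementation; the helper name `extractJsonLd`, the regular expression, and the sample markup are assumptions made for illustration only.

```typescript
// Illustrative sketch only, not Defuddle's code.
// `extractJsonLd` is a hypothetical helper that collects JSON-LD blocks,
// a common carrier of schema.org data, from an HTML string.
function extractJsonLd(html: string): unknown[] {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const blocks: unknown[] = [];
  let match: RegExpExecArray | null;
  // With the global flag, each exec() call resumes after the previous match.
  while ((match = pattern.exec(html)) !== null) {
    try {
      blocks.push(JSON.parse(match[1]));
    } catch (err) {
      // Skip malformed JSON rather than failing the whole extraction.
    }
  }
  return blocks;
}

// Placeholder markup, invented for this example.
const sampleHtml = `
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}
</script>
</head><body><p>Body text</p></body></html>`;

const found = extractJsonLd(sampleHtml);
console.log(found.length);
```

A DOM-based approach (for example `document.querySelectorAll`) would be more robust than a regular expression in a browser; the string-based version is used here only so the example runs without a DOM.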

Installation

npm install defuddle

Usage

import { Defuddle } from 'defuddle';

const article = new Defuddle(document).parse();

// Use the extracted content and metadata
console.log(article.content);  // HTML string of the main content
console.log(article.title);    // Title of the article

Response

The parse() method returns an object with the following properties:

Property	Type	Description
`content`	string	HTML string of the extracted main content
`title`	string	Title of the article
`description`	string	Description or summary of the article
`domain`	string	Domain name of the website
`favicon`	string	URL of the website's favicon
`image`	string	URL of the article's main image
`published`	string	Publication date of the article
`author`	string	Author of the article
`site`	string	Name of the website
`schemaOrgData`	object	Raw schema.org data extracted from the page
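
As a rough sketch, the table above can be transcribed into a TypeScript interface and used to type code that consumes the result. The interface name `DefuddleResult` and the helper `formatCitation` are hypothetical, and the package may export its own type under a different name. The example also assumes that missing fields come back as empty strings, which the table does not confirm.

```typescript
// Assumed shape, transcribed from the property table.
// `DefuddleResult` is a hypothetical name, not a confirmed export of the package.
interface DefuddleResult {
  content: string;
  title: string;
  description: string;
  domain: string;
  favicon: string;
  image: string;
  published: string;
  author: string;
  site: string;
  schemaOrgData: object;
}

// Hypothetical helper: join the available metadata into one line,
// skipping fields that are empty (assumed representation of "missing").
function formatCitation(result: DefuddleResult): string {
  const parts = [result.title, result.author, result.site, result.published];
  return parts.filter((part) => part.trim().length > 0).join(" | ");
}

// Placeholder values, invented for this example.
const example: DefuddleResult = {
  content: "<p>Body text</p>",
  title: "Example article",
  description: "",
  domain: "example.com",
  favicon: "",
  image: "",
  published: "2025-01-01",
  author: "",
  site: "Example Site",
  schemaOrgData: {},
};

const citation = formatCitation(example);
console.log(citation); // Example article | Example Site | 2025-01-01
```

In real use, the object passed to `formatCitation` would be the value returned by `new Defuddle(document).parse()` rather than a hand-written literal.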

Development

Build

To build the package, you'll need Node.js and npm installed. Then run:

# Install dependencies
npm install

# Clean and build
npm run build

This will generate:

dist/index.js - UMD build for both Node.js and browsers
dist/index.d.ts - TypeScript declaration file
