A broad spider for all kinds of sites.
- crawl all `index` from `home`
  `$ scrapy crawl index -a sites=a.com,b.com -a no_cache`
- crawl all `article` from an `index` (`list` spider)
  `$ scrapy crawl list`
- crawl `meta` information of an `article`
  `$ scrapy crawl meta`
- crawl `toc` from an `article`
  `$ scrapy crawl toc`
- crawl `chapter` from `meta` of an `article`
  `$ scrapy crawl chapter`
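
Each of these commands selects a spider by its `name` attribute. For reference, a minimal sketch of such a spider under standard Scrapy conventions; the class name, URLs, and parsing below are placeholders, not this repo's code:

```python
import scrapy


class IndexSpider(scrapy.Spider):
    # `scrapy crawl index` selects this spider by its `name`, not the class name.
    name = "index"

    def start_requests(self):
        # Placeholder home pages; the real spider derives these from the supported sites.
        for site in ("a.com", "b.com"):
            yield scrapy.Request(f"https://{site}/", callback=self.parse)

    def parse(self, response):
        # Placeholder: the real index spider would extract index pages here.
        yield {"index_url": response.url}
```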
- argument `-a value1`
  - `nocache`: do not use cached web pages for the current command.
- keyword argument `-a key=value1,value2`
  - multiple values are separated by a comma (`,`)
  - `s`: crawl only the listed `site` ids, e.g. `-a s=a.com,b.net,c.org`. If no sites are specified, all supported sites are crawled.
  - `i`: crawl only the listed `index` ids, e.g. `-a i=1,3`. If not specified, all indexes in the DB are crawled.
  - `a`: crawl only the specified `article` ids, e.g. `-a a=3,8,23`. If not specified, all articles matching the `article weight` are crawled:
    - weight = LISTED: only listed in the DB (meta, toc, chapter will not be crawled)
    - weight = META: only crawl meta (toc, chapter will not be crawled)
    - weight = TOC_PREVIEW: crawl n entries of the TOC (n == ARTICLE_PREVIEW_CHAPTER_COUNT)
    - weight = TOC: only crawl toc (chapter content will not be crawled)
    - weight = PREVIEW: crawl n chapters' content (n == ARTICLE_PREVIEW_CHAPTER_COUNT)
    - weight = NORMAL: crawl all chapter content
    - weight = CHOICE: editor's choice, good articles; crawled first
    - weight = CLASSIC: classic articles; crawled first
    - weight = PREMIUM: premium articles; crawled first
  - `cf`: chapter-from id, `chapter` spider only; the id must be in the downloaded TOC.
  - `ct`: chapter-to id, `chapter` spider only; the id must be in the downloaded TOC.
  - `p`: number of pages to crawl when there is a `next` page, e.g. `-a p=2` means crawl only 2 pages.
  - `ac`: count of articles that will be crawled.
  - `cc`: count of chapters of articles that will be crawled.
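
In Scrapy, every `-a key=value` pair is passed to the spider's constructor as a keyword argument. A minimal sketch of how the comma-separated values above might be split, assuming attribute names and defaults that are illustrative rather than this repo's actual code:

```python
import scrapy


class MetaSpider(scrapy.Spider):
    name = "meta"

    def __init__(self, s=None, i=None, a=None, p=None, ac=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Multi-value arguments such as `-a s=a.com,b.net` arrive as one
        # comma-separated string and are split into lists here.
        self.site_ids = s.split(",") if s else []                # empty => all supported sites
        self.index_ids = [int(x) for x in i.split(",")] if i else []
        self.article_ids = [int(x) for x in a.split(",")] if a else []
        # Scalar limits such as `-a p=2` or `-a ac=10` are plain integers.
        self.page_limit = int(p) if p else None
        self.article_count = int(ac) if ac else None
```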
- Available arguments for spiders
  - index: `s`
  - list: `s`, `i`, `p`
  - meta: `s`, `i`, `a`, `p`, `ac`
  - toc: `s`, `i`, `a`, `p` (todo), `ac`, `cc`
  - chapter: `s`, `i`, `a`, `ac`, `p` (todo)
  - `nocache` is available for all spiders.
- remember the last crawled position: use scrapy's persistent job state (`JOBDIR`). Re-running the same command with the same `JOBDIR` resumes the crawl where it left off.
  - `scrapy crawl list -s JOBDIR=./path/to/state/list-1`
  - `scrapy crawl meta -s JOBDIR=./path/to/state/novel-1`
  - `scrapy crawl chapter -s JOBDIR=./path/to/state/novel-1`
- run `python reader.py 8080`
- browse http://localhost:8080
- if the port is not specified, 8080 is the default.
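
For reference, a minimal sketch of how an entry point like `reader.py` could handle the optional port argument with 8080 as the default; the actual reader, its storage backend, and its handlers are not shown here, so everything below is an assumption:

```python
# Hypothetical sketch, NOT the repo's reader.py.
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer


class ReaderHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A real reader would render crawled articles from the DB here.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"<h1>reader placeholder</h1>")


def main() -> None:
    # Port comes from the first CLI argument; default to 8080 as described above.
    port = int(sys.argv[1]) if len(sys.argv) > 1 else 8080
    HTTPServer(("localhost", port), ReaderHandler).serve_forever()


if __name__ == "__main__":
    main()
```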
- add file `debug.py` with content

  ```python
  import sys
  from scrapy.cmdline import execute

  execute(['scrapy', 'crawl', *sys.argv[1:]])
  ```

- In the IDE, set the start script to `debug.py`.
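- For example, setting the IDE's script arguments to `chapter -a s=a.com` makes `debug.py` run the equivalent of `scrapy crawl chapter -a s=a.com` under the debugger (the spider name and site id here are just placeholders).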