[Suggestion] filter non-html page from collector #7

ghost · 2020-06-01T06:08:45Z

Hi,

Hope you are all well!

It would be interesting to exclude non-HTML pages from being indexed into Elasticsearch, or to add a MIME type detector that recognizes images and PDF documents and routes those binary types to dedicated sub-processing.

For now, I just added:

		if title == "" {
			spider.Logger.Error(errors.New("not an html page"))
			return
		}
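
The empty-title check above is only a heuristic: an HTML page without a `<title>` would be skipped too. A minimal sketch of a MIME-based check, using only the Go standard library, could look like the following. The helper name `isHTML` and the way it would be wired into the collector are assumptions, not part of the project's existing code.

```go
package main

import (
	"mime"
	"net/http"
	"strings"
)

// isHTML is a hypothetical helper that reports whether a response looks like
// an HTML page. contentType is the raw Content-Type header value (may be
// empty); body holds the first bytes of the response body and is only used
// when the header is missing or cannot be parsed.
func isHTML(contentType string, body []byte) bool {
	if contentType != "" {
		// ParseMediaType lowercases the media type and strips parameters
		// such as "; charset=utf-8".
		if mediaType, _, err := mime.ParseMediaType(contentType); err == nil {
			return mediaType == "text/html" || mediaType == "application/xhtml+xml"
		}
	}
	// Fallback: sniff the content. DetectContentType looks at no more than
	// the first 512 bytes and returns e.g. "text/html; charset=utf-8",
	// "image/png" or "application/pdf".
	return strings.HasPrefix(http.DetectContentType(body), "text/html")
}
```

In the collector, the idea would be to call `isHTML` with the response's `Content-Type` header and body before indexing, and to return early (or hand the response to a dedicated binary handler) when it reports `false`.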

Cheers,
X
