Scan results not consistently updated #1125

Closed
eloquence opened this issue May 28, 2024 · 7 comments · Fixed by #1126

@eloquence
Member

https://securedrop.org/admin/scanresult/ ordered by "Result last seen" currently shows 6 results for May 28 (today), 2 for earlier in May, and then drops off into March, February and January.

I would expect the whole directory to be scanned regularly (daily?) with results being updated across the board.

@soleilera

What does "Result last seen" mean?

@chigby
Contributor

chigby commented May 28, 2024

Result last seen is a timestamp on the scan result data that is updated (1) when the scan result is first created, and (2) if a subsequent scan obtains exactly the same data. Here's the code that deals with that: https://github.com/freedomofpress/securedrop.org/blob/develop/scanner/scanner.py#L108-L114
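The update rule described above can be sketched roughly as follows. This is only an illustration of the two cases, not the code at the linked lines: the `ScanResult` class, its field names, and `record_scan` are hypothetical stand-ins.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ScanResult:
    # Hypothetical stand-in for the real scan result model.
    data: dict
    result_last_seen: datetime


def record_scan(previous: Optional[ScanResult], new_data: dict, now: datetime) -> ScanResult:
    """Return the result row that should be current after a scan."""
    if previous is not None and previous.data == new_data:
        # Case (2): identical data, so only the timestamp moves forward.
        previous.result_last_seen = now
        return previous
    # Case (1): first scan or changed data, so a new result is created.
    return ScanResult(data=new_data, result_last_seen=now)


day1 = datetime(2024, 5, 27, tzinfo=timezone.utc)
day2 = datetime(2024, 5, 28, tzinfo=timezone.utc)
first = record_scan(None, {"https": True}, day1)
second = record_scan(first, {"https": True}, day2)
```

Under this rule, an entry that is never rescanned keeps its old "Result last seen" value, which would be consistent with the stale dates reported above.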

@mig5

mig5 commented May 28, 2024

The daily-tasks job is throwing a fatal error at the end, after about 5 minutes.

Here is the full output of a job:

[www-production] user@fpf-ssh:~/git/k8s-configs$ ksdo logs -f sdo-daily-tasks-manual-mig5-cpqnp
Updating backend: default
default: Rebuilding index default
default: blog.BlogPage             .
default: blog.CategoryPage         .
default: blog.BlogIndexPage        .
default: common.CustomImage        .
default: home.HomePage             .
default: marketing.MarketingIndexPage .
default: marketing.FeaturePage     .
default: simple.SimplePage         .
default: simple.SimplePageWithMenuSidebar .
default: simple.FAQPage            .
default: forms.FormPage            .
default: directory.DirectoryEntry  .
default: directory.DirectoryPage   .
default: wagtaildocs.Document      .
default: wagtailimages.Image       
default: wagtailcore.Page          .
default: wagtailmedia.Media        .
default: indexed 438 objects

- 72 SearchDocuments created
- 0 SearchDocuments updated (does not necessarily indicate changes)
- 224 SearchDocuments created
- 0 SearchDocuments updated (does not necessarily indicate changes)
/usr/local/lib/python3.12/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  warnings.warn(

Traceback (most recent call last):
  File "/django/./manage.py", line 12, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.12/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.12/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.12/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.12/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/django/directory/management/commands/scan.py", line 46, in handle
    bulk_scan(securedrop_pages)
  File "/django/scanner/scanner.py", line 95, in bulk_scan
    current_result = perform_scan(entry.landing_page_url, permitted_domains)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/django/scanner/scanner.py", line 46, in perform_scan
    assets = extract_assets(soup, page.url)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/django/scanner/assets.py", line 105, in extract_assets
    response = fetch_asset(link.attrs['href'], site_url)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/django/scanner/assets.py", line 184, in fetch_asset
    return requests.get(asset_url.geturl(), headers=HEADERS, timeout=5)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 697, in send
    adapter = self.get_adapter(url=request.url)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 794, in get_adapter
    raise InvalidSchema(f"No connection adapters were found for {url!r}")
requests.exceptions.InvalidSchema: No connection adapters were found for 'file://www.forbes.com/Users/dcao/Downloads/securedrop.css'

@eloquence
Member Author

eloquence commented May 28, 2024

I can reproduce an error if I try to scan the Forbes entry manually through the UI (this is done every time an entry is published):

[Screenshot of the scan error]

However, the scan manage.py subcommand should probably fail gracefully and just record the error for individual scans, instead of aborting the entire scan :/
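One way to picture the "fail gracefully" behaviour is a loop that isolates each entry's scan and records the failure instead of letting it propagate. The sketch below is an assumption about the shape of such a loop, not the project's `bulk_scan`; `scan_all` and the `scan_one` callable are hypothetical names.

```python
from typing import Callable, Dict, Iterable, Tuple


def scan_all(
    urls: Iterable[str],
    scan_one: Callable[[str], object],
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Scan every URL, collecting per-URL errors instead of aborting."""
    results: Dict[str, object] = {}
    errors: Dict[str, str] = {}
    for url in urls:
        try:
            results[url] = scan_one(url)
        except Exception as exc:  # broad on purpose: one bad entry must not stop the run
            errors[url] = f"{type(exc).__name__}: {exc}"
    return results, errors
```

With this shape, a single failing entry ends up in `errors` while the entries after it are still scanned.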

@mig5

mig5 commented May 28, 2024

Yeah, a try/except (pass) on requests.exceptions.InvalidSchema around https://github.com/freedomofpress/securedrop.org/blob/develop/scanner/scanner.py#L46-L48 might do it (or ignore the file:// scheme in https://github.com/freedomofpress/securedrop.org/blob/develop/scanner/assets.py#L173-L184 ?)
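The second option, skipping schemes that cannot be fetched over HTTP, could look something like the sketch below. It uses only the standard library; `resolve_fetchable` and `FETCHABLE_SCHEMES` are hypothetical names, and this is not the existing `fetch_asset` code.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit

# Assumption: only these schemes are worth requesting during a scan.
FETCHABLE_SCHEMES = {"http", "https"}


def resolve_fetchable(href: str, site_url: str) -> Optional[str]:
    """Resolve href against site_url; return None if it is not an HTTP(S) URL."""
    absolute = urljoin(site_url, href)
    if urlsplit(absolute).scheme.lower() in FETCHABLE_SCHEMES:
        return absolute
    # file://, mailto:, data: and similar are skipped rather than fetched.
    return None
```

A caller would skip the asset when `None` comes back, so a stray `file://` link like the one in the traceback never reaches `requests.get`.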

@eloquence
Member Author

I can confirm that this is fixed and scan results appear to be rolling in consistently again. @nathandyer Heads up. This may have resulted in some warning status changes on existing directory entries that hadn't been scanned in a while :/.

@nathandyer
Contributor

Thanks for the heads up @eloquence, I'll take a look through the directory for any new status changes.
