Persist verifier monitoring after agent restarts

Release Signoff Checklist

  • Enhancement issue in release milestone, which links to pull request in [keylime/enhancements]
  • Core members have approved the issue with the label implementable
  • Design details are appropriately documented
  • Test plan is in place
  • User-facing documentation has been created in [keylime/keylime-docs]

Summary

Should someone restart an agent-based server or force an agent offline, the agent will no longer be monitored by the verifier. On startup the agent simply registers with the registrar again, and IMA monitoring ceases.

This behavior was originally discussed on the Keylime mailing list.

Motivation

It is reasonable that someone may want to manually restart a server (or that the server restarts as part of an automated workflow) while retaining the configuration set up during the initial "adding" of the agent to the verifier (allowlist, tpm_policy). They should not have to add (or update) the agent on the verifier again every time if there is no change in configuration or trust mapping (e.g. software CA).

Goals

A user restarts the agent on a target node. When the agent becomes active again, the verifier resumes monitoring using the measurements delegated when the target agent was first added to the verifier and registrar.

Non-Goals

Any sort of migration or fault redundancy (although both areas benefit from this change).

Proposal

A target machine is rebooted with no change in state (measured properties). This machine should not need to be "re-added" with the Keylime tenant.

Once the target node / agent returns to an online / reachable state, the verifier should resume run-time monitoring.

A new tornado web handler will be created within the verifier to listen for requests that an agent will emit when it (re)starts.

Code will be introduced within the agent that will perform a POST request to inform the verifier an agent has been (re)started. This in turn will cause the verifier to perform an operational_state query for the UUID of that agent and then proceed to perform run time integrity monitoring again.

User Stories (optional)

My server reboots for whatever reason. Keylime handles this event and provides trust monitoring once the server and agent are back online and can be reached by the verifier.

Should the machine's state have been tampered with during the offline period, Keylime will immediately fail the target node. Otherwise it will show that the machine is still in the expected trust state according to the delegated measurements.

If I want to change measurements, I use the existing update command available in the Keylime Tenant CLI.

Risks and Mitigations

We should be sure we do not introduce security risks and be mindful of future enhancements such as multi tenancy, auth and migration.

Design Details

Verifier Changes

A new tornado web handler is created within the verifier to listen for requests that an agent will emit when it starts. We will call this /nudge for now; a more suitable name can be agreed during this review.
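As a rough illustration of what such a handler might validate, the sketch below keeps the request-checking logic in a plain function so it can be exercised without a running server. The JSON body shape (`agent_id`), the status codes, and the function name `handle_nudge` are assumptions made for this example; the proposal does not define them. In the verifier, a tornado `RequestHandler.post` method could pass `self.request.path` and `self.request.body` to a function like this and write the result back.

```python
import json
import uuid
from typing import Tuple

NUDGE_PATH = "/nudge"  # placeholder endpoint name taken from the proposal


def handle_nudge(path: str, body: bytes) -> Tuple[int, dict]:
    """Validate a nudge request and return (HTTP status, JSON-able response).

    Hypothetical helper: the body format {"agent_id": "<uuid>"} is an
    assumption of this sketch, not part of the proposal.
    """
    if path != NUDGE_PATH:
        return 404, {"error": "not found"}
    try:
        payload = json.loads(body)
        # uuid.UUID both validates the value and normalizes it to lowercase.
        agent_id = str(uuid.UUID(payload["agent_id"]))
    except (ValueError, KeyError, TypeError, AttributeError):
        return 400, {"error": "body must be JSON with a valid 'agent_id' UUID"}
    return 200, {"agent_id": agent_id, "status": "nudge accepted"}
```

Keeping validation separate from the web framework also makes the unit tests called for in the test plan straightforward to write.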

A new operational_state named OFFLINE will be created for when a machine becomes unreachable while in the GET_QUOTE operational_state. The verifier will set this state once the agent has failed to respond for the whole retry period configured in the keylime.conf configuration file.

A new database row will need to be introduced for the OFFLINE operational_state.
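The transition into OFFLINE described above can be sketched as follows. This is a minimal model, not the verifier's actual state machine: the `OperationalState` enum, the `retry_limit` parameter, and the `fetch_quote` callable are hypothetical names for this example, and a real implementation would wait between attempts and persist the result to the database.

```python
from enum import Enum
from typing import Callable


class OperationalState(Enum):
    """Only the two states relevant to this proposal (names as in the text)."""
    GET_QUOTE = "GET_QUOTE"
    OFFLINE = "OFFLINE"


def next_state(fetch_quote: Callable[[], None], retry_limit: int) -> OperationalState:
    """Return the state after one polling round.

    fetch_quote is a hypothetical callable that raises ConnectionError when
    the agent cannot be reached. retry_limit stands in for the retry setting
    read from keylime.conf (assumed name).
    """
    # One initial attempt plus retry_limit retries.
    for _ in range(retry_limit + 1):
        try:
            fetch_quote()
            return OperationalState.GET_QUOTE  # agent answered, keep monitoring
        except ConnectionError:
            continue  # a real verifier would back off before retrying
    # Every attempt failed: park the agent instead of dropping it.
    return OperationalState.OFFLINE
```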

Agent Changes

Code will be introduced to the agent that will perform a POST request to /nudge to inform the verifier that an agent has been (re)started. This in turn will instruct the verifier to perform an operational_state query for the UUID of the agent concerned. Should the operational_state be OFFLINE, the verifier will change it to GET_QUOTE and (re)start continuous monitoring of the node with the previously set measurements (allowlist, tpm_policy).
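To make the exchange concrete, here is a small sketch of both halves: the agent building the POST request, and the verifier applying the OFFLINE to GET_QUOTE transition. The URL layout, the JSON body, and the function names `build_nudge_request` and `apply_nudge` are assumptions for illustration only, and a plain dict stands in for the verifier's database. Authentication of the nudge is deliberately left out and would need to be settled by the design.

```python
import json
import urllib.request
from typing import Dict


def build_nudge_request(verifier_url: str, agent_id: str) -> urllib.request.Request:
    """Agent side: build (but do not send) the POST to the verifier.

    The '/nudge' path is the placeholder name from the proposal; the JSON
    body shape is an assumption of this sketch.
    """
    body = json.dumps({"agent_id": agent_id}).encode("utf-8")
    return urllib.request.Request(
        url=verifier_url.rstrip("/") + "/nudge",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def apply_nudge(states: Dict[str, str], agent_id: str) -> bool:
    """Verifier side: resume monitoring only for a known OFFLINE agent.

    Returns True when the agent was moved from OFFLINE to GET_QUOTE.
    Unknown agents and agents in any other state are left untouched.
    """
    if states.get(agent_id) != "OFFLINE":
        return False
    states[agent_id] = "GET_QUOTE"
    return True
```

The agent could send the request with `urllib.request.urlopen(req, timeout=...)` once its own web service is listening. Restricting the transition to agents already in OFFLINE means a nudge cannot add a new agent or alter the stored measurements; those still go through the tenant.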

Registrar Changes

No immediate changes come to mind, but we should be mindful of this as the design evolves.

Keylime TPM comms changes

We will need to assess changes required within our TPM communications. For example, the agent calls tpm_startup -c and takes ownership of the TPM every time it starts. The AK handle is also flushed.

We may need to consider having some sort of flag that the agent queries to establish whether it is already associated with a verifier.

Rather than bootstrapping itself as a fresh agent, it would retain its TPM setup and just instantiate its web service to allow REST API interactions with the verifier again. These interactions would be a continuation of the previous quote GET requests from the verifier, while retaining the existing root of trust already set up by the registrar (EKpub and AKPub).
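One possible shape for such a flag is a small marker file that the agent writes once it has been added to a verifier and reads on startup. The sketch below is only an assumption about how this could look: the file format, the field names `agent_id` and `verifier_url`, and the function names are all hypothetical. The marker only selects the startup path; it is not a trust decision, since the verifier still checks every quote against the stored measurements.

```python
import json
from pathlib import Path
from typing import Optional


def load_association(marker: Path) -> Optional[dict]:
    """Return the recorded verifier association, or None if absent or unusable.

    The marker file and its fields are hypothetical, chosen for this sketch.
    """
    try:
        data = json.loads(marker.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or corrupt marker: treat as never associated.
        return None
    if isinstance(data, dict) and data.get("agent_id") and data.get("verifier_url"):
        return data
    return None


def startup_mode(marker: Path) -> str:
    """Choose between a fresh bootstrap and resuming with the existing TPM setup."""
    return "resume" if load_association(marker) is not None else "bootstrap"
```

Falling back to "bootstrap" on any read or parse error keeps today's behavior as the default when the marker cannot be relied on.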

Test Plan

Functional tests will be needed to play out the use case of restarting an agent, persisting state, and re-establishing measurements upon its restart.

Unit tests will be needed to test the new nudge API functionality.

Upgrade / Downgrade Strategy

We may need to consider the impact of upgrading while an agent is offline, and of the new TPM code interacting with the TPM setup from the previous release.

Drawbacks

TBD

Alternatives

Evolve the retry handler in the verifier to wait for indefinite periods instead of adding a wake-up API. This is hazardous: we risk bottlenecks and would need to manage more state (for example, a node that goes offline and never returns).

Infrastructure Needed (optional)

Some changes may be needed to Travis CI, but none are expected currently.

No new repos required.