Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimistic Relay #2

Open
wants to merge 12 commits into
base: main
Choose a base branch
from
Open

Optimistic Relay #2

wants to merge 12 commits into from

Conversation

michaelneuder
Copy link
Owner

@michaelneuder michaelneuder commented Jan 12, 2023

tl;dr; This PR implements an "optimistic" version of flashbots' mev-boost-relay. The core idea is that we can improve the throughput and latency of block submissions by using asynchronous processing. Instead of validating each block that comes in immediately, we return early and add the block to the datastore before checking its validity. This (hopefully) allows builders to submit more blocks, and critically, submit blocks later in the time period of the slot (because they don't have to wait for the simulation of the block to return successfully). Proposers still use the Proposer API to select the highest bid, but there is no longer a guarantee that this block is valid. To submit blocks in the optimistic mode, a builder must put up a collateral that is greater than the value of the blocks they are proposing. If a proposer ends up signing an invalid block, collateral from the builder of that block will be used to refund the proposer.


The changes can be described through 3 sequences:

  1. Submitting optimistic blocks.
  2. Validating optimistic blocks.
  3. Proposing optimistic blocks.

1. Submitting optimistic blocks.

Screen Shot 2023-01-12 at 9 44 38 PM

  1. Block builders submit blocks to the Builder API endpoint of the relay.
  2. The Builder API fetches the status of the block builder from redis.
  3. Based on the status of the builder:
    a. if the builder status is optimisticActive, send the block to the Optimistic Block Channel.
    b. if the builder status is optimisticLocked, send the block to the high-prio queue of the prio-load-balancer.
    c. if the builder status is optimisticDemoted, send the block to the low-prio queue of the prio-load-balancer.
    d. if the builder status is highPrio, send the block to the high-prio queue of the prio-load-balancer.
    e. if the builder status is lowPrio, send the block to the low-prio queue of the prio-load-balancer.
    f. if the builder is in blacklisted mode, discard the block.
  4. In cases (2b-2f), wait until the block has been successfully simulated on the validation nodes.
  5. Update the builders current bid in redis.

Notice that for builders in optimistic mode, we skip step 3 and update the bid after sending the block to the Optimistic Block Channel without validating it. This is where the improved performance can (hopefully) be achieved.


2. Validating optimistic blocks.

Screen Shot 2023-01-12 at 9 42 32 PM

  1. The OptimisticBlockProcessor receives a block from the Optimistic Block Channel (this happens in a different goroutine than the Builder API).
  2. If the builder status is OptimisticActive, the OptimisticBlockProcessor sends the block as high-prio to the prio-load-balancer for simulation.
  3. The block is simulated on the validating nodes, and the status is returned to the OptimisticBlockProcessor.
  4. If the simulation failed, the builder is demoted to OptimisticDemoted and the details of the failure are written to the database.

This flow handles the simulation of all the blocks that were optimistically skipped. An invalid block here results in a demotion, but not necessarily a refund for the proposer because we don't yet know whether this block was the highest bid and thus proposed.


3. Proposing optimistic blocks

Screen Shot 2023-01-12 at 9 51 33 PM

  1. mev-boost calls getHeader on the Proposer API of the relay. This is part of the MEV-Block Block Proposal as documented in https://docs.flashbots.net/flashbots-mev-boost/architecture-overview/block-proposal.
  2. mev-boost calls getPayload on the Proposer API of the relay. This triggers the publication of a SignedBeaconBlock.
  3. In a separate goroutine, the Proposer API checks the builder status.
    a. if the status is OptimisticActive, the block that was just proposed is sent to the high-prio queue on the prio-load-balancer for simulation.
    b. if the status is OptimisticDemoted, then write the SignedBeaconBlock to the database for a refund justification.
  4. The block simulation result from (2a) is returned to the Proposer API.
  5. If the block simulation failed, an invalid block was published. We upsert a new row into the Builder Demotions Table with the details of the failure so the refund can be justified.

This flow represents the process of checking if our optimistic block processing ever results in a validator submitting an invalid block. Since block builders will post collateral, this will be used to reimburse the validator in that case. Since refunds should be a relatively rare event, we plan on handling them manually.


Builder status edge cases.

There are two additional situations where the status of a builder is temporarily modified:

  1. If the builder's collateral is less than the value of the block they submit. In this case, they wouldn't be able to cover the cost of the refund. Instead of demoting them in this case, we just process their block "pessimistically" by setting their status as high prio for that block (but not modifying the persistent state so that subsequent blocks will still be optimistic).
  2. If the builder's block is selected as the winner of a previous slot. While we are processing the winning block, we switch the builder to OptimisticLocked temporarily to avoid optimistic blocks being built on optimistic blocks.

The following diagram depicts the state machine of the builder status:
Screen Shot 2023-01-21 at 4 59 53 PM

Since builders may use many different public keys to submit blocks we allow all of those keys to be backed by a single collateral through the use of the a "Collateral ID". This is aimed at simplifying the process of posting collateral, but if any public key using that Collateral ID submits an invalid block, all of the public keys are demoted.

is_blacklisted boolean NOT NULL,
status int NOT NULL,
collateral_value NUMERIC(48, 0),
collateral_id bigint NOT NULL,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'd want to create a new migration file applies the changes to the database schema.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. figured that might be the case. leaving this thread open as i work through that! thanks!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright. i see that the code is outdated, but this is not yet done in the code. All existing migrations need to be left untouched, and a single new migration file created that modifies the previous database. Take a look at this migration for reference: https://github.com/flashbots/mev-boost-relay/blob/main/database/migrations/002_bid_remove_isbest_add_receivedat.go

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, Chris. i took a swing at this here: c4bf788. still need to test this to make sure the DB changes make sense, but i think this overall approach of creating the new columns, updating the column values based on the is_high_prio and is_blacklisted columns, and then dropping those two should be a good approach. let me know if i am missing something here! thanks again :-)

@ralexstokes
Copy link

ralexstokes commented Jan 13, 2023

this is an interesting idea and I imagine many builders would want to use it

that being said I want to call out that this does weaken the integrity of the PBS construction as it is hard to correctly price the value of an invalid block

you address this partially:

collateral from the builder of that block will be used to refund the proposer.

but in the event a builder sends an invalid block to the proposer not only does the proposer lose out on significant revenue but also we have a missed slot on-chain which I argue affects the entire Ethereum user base.

To demonstrate, let's say I am a user not touching any of this MEV stuff and it is important for me to get my transaction on-chain at a given block height (e.g. some kind of on-chain deadman switch) and I just happen to end up in this optimistic block submission mechanism. In the event of an invalid block, my transaction doesn't get through in time. It looks like the proposer would be compensated for the missed slot, but I would not be.

Have you thought about pricing the total cost of a missed slot, beyond the proposer payment when considering the second-order effects of this proposal?

@JustinDrake
Copy link

JustinDrake commented Jan 13, 2023

Have you thought about pricing the total cost of a missed slot, beyond the proposer payment when considering the second-order effects of this proposal?

Only a single slot can be lost per collateral ID without manual intervention. The corresponding builder would lose reputation in addition to the collateral if they abuse empty slots. We could add a penalty (e.g. X ETH) in addition to the block value but I don't think it's necessary at this point. We could also have a "cooloff" period of 1 day before reactivation as an optimistic builder.

As a side note, Ethereum is designed to handle empty slots. 1% of slots are already empty and if an attacker really cares about empty slots they can get some for free by DoSing proposers since we don't have SSLE yet.

simErr := api.blockSimRateLimiter.send(opts.ctx, opts.req, opts.highPrio)
log := opts.log.WithFields(logrus.Fields{
"duration": time.Since(t).Seconds(),
"numWaiting": api.blockSimRateLimiter.currentCounter(),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth logging something similar to numWaiting but for optimistic block processors?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems like it could be valuable to provide metrics into how many of the block processors are working at a time! this is a slightly bigger change because we don't have a currentCounter like the block sim rate limiter has. let me tinker on it and get back to this thread. leaving it open for now.

@Ruteri
Copy link

Ruteri commented Jan 16, 2023

I'd be wary of going in this direction. We saw that bugs in the builders code propagated if the blocks were not explicitly verified by the relays, causing three or so extended periods of missed blocks.
Every relay has implemented block validity checks due to pressure from the community, and this would be a step back on that front.

I'd definitely at least make this an opt-in from both the validator and the relay sides.

@JustinDrake
Copy link

We saw that bugs in the builders code propagated if the blocks were not explicitly verified by the relays, causing three or so extended periods of missed blocks.

We are definitely assuming buggy builders—that's part of our threat model! Every block received from builders is explicitly verified by the relay. The worst case is a single missed block per builder. Even that worst case is unlikely because an invalid block that does not win the auction is sufficient to trigger a demotion (itself requiring manual intervention to go back to optimistic mode).

@Ruteri
Copy link

Ruteri commented Jan 16, 2023

a builder must put up a collateral that is greater than the value of the blocks they are proposing

This does make participating more difficult, the builders who don't have huge upfront capital (the blocks are 10+ eth surprisingly often) will be significantly worse off.
How exactly does the collateral look like?

@JustinDrake
Copy link

How exactly does the collateral look like?

Every relay can have their own policy.

For the ultra sound relay here's our current thinking (which may change):

  • During the testing phase we credit reliable builders (according to some criteria) 1 ETH of our own collateral.
  • To prevent "rich" builders from having an unfair edge we will cap the maximum collateral amount, e.g. to 10 ETH.

blocks are 10+ eth surprisingly often

The ultra sound relay has had 2 of 5,536 blocks over 10 ETH (scroll down to the "top blocks" leaderboard here). That's roughly 0.036%. We're totally happy having the bulk of blocks (say, >95%) be processed optimistically—if <5% of blocks are processed "pessimistically" that's fine :)

// Check if the block was valid in the optimistic case.
if builderStatus == common.Optimistic {
// Set to high-prio while we process the winning block.
err := api.datastore.SetBlockBuilderStatus(bidTrace.BuilderPubkey.String(), common.HighPrio)
Copy link

@JustinDrake JustinDrake Jan 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to set all builder pubkeys sharing the same collateral ID to HighPrio. My suggestion is to refactor demoteBuildersByCollateralID to a more general function titled setOptimisticSubstatusByCollateralID (see substatuses suggestion below) and call that function here. Same thing at the end of this function where we set back to optimistic.

To avoid overloading the HighPrio and LowPrio statuses I suggest introducing substatuses for optimistic builders. For example:

  • optimistic.active (in the default case)
  • optimistic.demoted (after a simulation error)
  • optimistic.getPayload (in the case where the optimistic builder won the slot auction and we're still processing getPayload)

In hindsight I feel that having a completely segregated state machine for optimistic builders will be cleaner, safer, and make code review easier.

Another advantage of the optimistic.demoted substatus is that we won't have to wastefully re-simulate the winning block—upserting the proposer signature will be sufficient. It will also make it easier for operators to identify demoted builders.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes a lot of sense! i think in terms of implementing this i would just go for 3 statuses rather than try to encode the semantics of "sub-status", but having optimisticActive, optimisticDemoted, optimisticGetPayload in addition to lowPrio, highPrio, blacklisted feels correct.

also agree that a single datastore function that sets the status over the set of builders with matching collateral ids makes sense and will be useful every time we need to modify that collection of statuses!

i will work on this ASAP. leaving the thread open to track and will circle back!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just going for 3 statuses makes sense to me!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's a better name than optimisticGetPayload. Maybe optimisticLocked?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check out 51edd33! this is taking into account the discussion above. breaking down the following cases, i think this covers the expected behavior.

Case 1.

  1. optimisticBlockProcesor receives the block first. Block simulation fails. Builder status set to common.OptimisticDemoted.
  2. getPayload sees common.OptimisticDemoted, so just inserts the signedBeaconBlock into the builder demotion table (doesn't resimulate).

Case 2.

  1. getPayload receives the block first. Block simulation fails. Builder status set to common.OptimisticDemoted and signedBeaconBlock inserted into the demotion table.
  2. optimisticBlockProcesor sees that that status of the builder is no longer common.OptimisticActive, so just skips the simulation.

Case 3.

  1. getPayload receives the block first. Block simulation in progress. Builder status set to common.OptimisticLocked.
  2. optimisticBlockProcesor sees that that status of the builder is no longer common.OptimisticActive, so just skips the simulation.

There are more cases than this, and especially if the block simulation succeeds, we will probably end up simulating the blocks twice. e.g., optimisticBlockProcesor gets block first, successfully simulates it so no status change is made. getPayload gets the block second and sees status is still OptimisticActive, so simulates as well. But I think trying to handle each of these individually adds complexity without really solving much (especially because getPayload is a once per slot operaion.

LMK what you think!

@metachris
Copy link

I'm a bit concerned that proposers are forced into a more unsafe interaction (i.e. trusting the relay and then not receiving the promised working block) without having a choice.

How do you decide when to pay the proposer from the collateral? By alerts on specific log messages, or by proposers reaching out to you? I'm sure there's lots of lazy proposers that wouldn't reach out to you because they don't notice in time, so you'd need to automate this payment somehow.

Such payments might also have negative tax implications (i.e. how can you explain the payment from an EOA wallet that's not the block builder)?

@JustinDrake
Copy link

How do you decide when to pay the proposer from the collateral?

That is the purpose of the demotions table. The demotion table contains sufficient information to refund all affected proposers. Our plan as relay operators is to hook up the demotion table to our alerting system to get appropriate automatic calls and emails. A reasonable soft commitment could be for refunds to happen within 24 hours.

(As a side note, refunds are "irrational" for builders. The reason is that a cheating or buggy builder wouldn't extract MEV but still have to pay the bid to the proposer. Builders are incentivised to not mess up—I don't expect refunds to happen often in practice.)

by proposers reaching out to you?

Proposers do not need to reach out. The relay operator can pay the bid to the proposer's fee recipient without any interaction with the proposer.

Such payments might also have negative tax implications (i.e. how can you explain the payment from an EOA wallet that's not the block builder)?

Ha, that's an interesting point! :) I'm curious how it could have a negative tax implication for proposers.

Below are a couple thoughts. (I'm not a tax expert so take the following with a huge grain of salt.)

  1. Proposer rewards are taxed as income. Income tax is "worst case" from a tax perspective. So if refunds are somehow taxed differently to in-block rewards that would, if anything, potentially have a positive tax implication for proposers.
  2. I don't see why delayed bid payments would be treated differently to in-block bid payments. As far as I can tell both would be taxed as income.

@JustinDrake
Copy link

@metachris: I had a chat with bloXroute today. They have refunded proposers several times, e.g. when the block body was not published on time—apparently it's not that uncommon. They said proposers haven't had issues with such delayed bid payments.

@ralexstokes
Copy link

I'll chime in that while relays certainly can reimburse proposers for missed slots, it invites a certain moral hazard that in aggregate makes the network weaker overall. If you can just paper over buggy code with money then you care less about writing bug-free code, and in doing so raise the ambient risk of a serious bug getting to mainnet.

For similar reasons, an "optimistic" relay invites the same kind of moral hazard as I see it and so I would be very careful about deploying this change.


// Demotes a set of builders who have matching collateral IDs. This function
// continues to try demoting even if an error is encountered for any builder.
func (api *RelayAPI) demoteBuildersByCollateralID(builderPubkey string) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's multiple ways this function can error out, and any of these might need to be manually followed up for executing the payment while the demotion not being part of the database. I suggest to at least label them commonly with a certain "relay-must-pay-but-db-failed"-like log tag.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great point! in dd2fd60 i added a logging field

"errorWritingRefundToDB": true,

this should be seen as a critically important log!!

@michaelneuder
Copy link
Owner Author

Justin and I discussed this offline, but we think it makes sense to abandon this specific PR in favor of a new iteration with fewer commits and comments. Additionally, the new PR will have unit tests exercising the code. Because we want to be extra confident in the code, I am writing a test plan that I will attach to that PR for comment as well. the test plan will describe (1) the unit tests, (2) the integration tests we need (setting up the database, running the relay standalone, etc) and (3) the e2e tests we need (complete journeys that include block builders and validators interacting with the relay).

Once that PR is open, I will reference it here (it will also be against my forked version of the repo rather than the actual flashbots repo for now).

@ralexstokes
Copy link

the test plan will describe (1) the unit tests, (2) the integration tests we need (setting up the database, running the relay standalone, etc) and (3) the e2e tests we need (complete journeys that include block builders and validators interacting with the relay).

not to derail your efforts, but it could provide a lot of value to the community if we worked on making these tests generic for all relay software (e.g. like how we have https://github.com/ethereum/consensus-spec-tests generated from the consensus-specs) -- if you are interested in exploring this route, perhaps DM me on telegram or something (same handle) and we can discuss further :)

@michaelneuder
Copy link
Owner Author

the test plan will describe (1) the unit tests, (2) the integration tests we need (setting up the database, running the relay standalone, etc) and (3) the e2e tests we need (complete journeys that include block builders and validators interacting with the relay).

not to derail your efforts, but it could provide a lot of value to the community if we worked on making these tests generic for all relay software (e.g. like how we have https://github.com/ethereum/consensus-spec-tests generated from the consensus-specs) -- if you are interested in exploring this route, perhaps DM me on telegram or something (same handle) and we can discuss further :)

would love to explore this route. dmed you on telegram! thanks, Alex :-)

// Check if the block was valid if the builder is OptimisticActive or OptimisticDemoted.
if builderStatus == common.OptimisticActive || builderStatus == common.OptimisticDemoted {
// Set to locked while we process the winning block.
api.setStatusByCollateralID(bidTrace.BuilderPubkey.String(), common.OptimisticLocked)
Copy link

@JustinDrake JustinDrake Jan 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if handleSubmitNewBlock for the next slot is somehow called before handleGetPayload for the current slot executes this line code? For example, what if api.db.SaveDeliveredPayload (a few lines above) takes a really long time? There is a race condition here, right?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from discussions offline, we are going to set value for currentOptimisticSlot and only optimistically process blocks one slot at a time. this will avoid these tricky race conditions and make our design easier to reason about. it also allows us to get rid of the OptimisticLocked status, because we know while we are validating the block that won the slot n auction, if the optimistic processing begins for slot n+1, then we must have already processed all the blocks in slot n, so we can be sure that if the winning block ended up being invalid, that builder has been demoted and thus doesn't have an opportunity to build invalid blocks on invalid blocks.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is going to be one of the major design changes of the PR 3 :-)

// Check if the block was valid if the builder is OptimisticActive or OptimisticDemoted.
if builderStatus == common.OptimisticActive || builderStatus == common.OptimisticDemoted {
// Set to locked while we process the winning block.
api.setStatusByCollateralID(bidTrace.BuilderPubkey.String(), common.OptimisticLocked)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be a more fundamental problem: we can't rely on handleGetPayload to determine the winning block. The reason is that the validator could collude with the builder to get the payload out of band, without ever calling handleGetPayload. This could even happen without collusion when multiple relays forwarded the same header and the proposer gets the payload from another relay first.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed offline, i don't see this as being a significant issue. if handleGetPayload is never called, then we never publish the block, thus we aren't responsible for refunding the proposer even if the block turns out to be invalid. i don't think it is possible for a proposer to trick us into thinking they deserve a refund unless we publish a signed, invalid block. if that is the case and they are colluding with the builder to make sure they get a bad block, we refund them from the builders collateral anyway, so it is just their own money changing hands. let me know if i am missing something here!

// Check if the block was valid if the builder is OptimisticActive or OptimisticDemoted.
if builderStatus == common.OptimisticActive || builderStatus == common.OptimisticDemoted {
// Set to locked while we process the winning block.
api.setStatusByCollateralID(bidTrace.BuilderPubkey.String(), common.OptimisticLocked)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Self-TODO: think about what can go wrong with reorgs.)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking about this a bit, i dont see any issue actually. once they call getPayload through us, that beacon block at slot n is published. even if they were to try and use a valid block for slot n to reorg the n+1 slot block out, now they have signed two blocks at slot n which is a slashing condition. basically we know that once that block is out there, they can never get that reward back. also another nice aspect of handling the refunds manually is that we can sanity check that the slot we are refunding is empty. given it will take place O(hours) later, we can be confident that no reorg is going to change that.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lmk if i am missing something here!

michaelneuder added a commit that referenced this pull request Mar 24, 2023
include more error details for the context error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants