-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimistic Relay #2
base: main
Are you sure you want to change the base?
Conversation
is_blacklisted boolean NOT NULL, | ||
status int NOT NULL, | ||
collateral_value NUMERIC(48, 0), | ||
collateral_id bigint NOT NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you'd want to create a new migration file applies the changes to the database schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. figured that might be the case. leaving this thread open as i work through that! thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alright. i see that the code is outdated, but this is not yet done in the code. All existing migrations need to be left untouched, and a single new migration file created that modifies the previous database. Take a look at this migration for reference: https://github.com/flashbots/mev-boost-relay/blob/main/database/migrations/002_bid_remove_isbest_add_receivedat.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, Chris. i took a swing at this here: c4bf788. still need to test this to make sure the DB changes make sense, but i think this overall approach of creating the new columns, updating the column values based on the is_high_prio
and is_blacklisted
columns, and then dropping those two should be a good approach. let me know if i am missing something here! thanks again :-)
this is an interesting idea and I imagine many builders would want to use it that being said I want to call out that this does weaken the integrity of the PBS construction as it is hard to correctly price the value of an invalid block you address this partially:
but in the event a builder sends an invalid block to the proposer not only does the proposer lose out on significant revenue but also we have a missed slot on-chain which I argue affects the entire Ethereum user base. To demonstrate, let's say I am a user not touching any of this MEV stuff and it is important for me to get my transaction on-chain at a given block height (e.g. some kind of on-chain deadman switch) and I just happen to end up in this optimistic block submission mechanism. In the event of an invalid block, my transaction doesn't get through in time. It looks like the proposer would be compensated for the missed slot, but I would not be. Have you thought about pricing the total cost of a missed slot, beyond the proposer payment when considering the second-order effects of this proposal? |
Only a single slot can be lost per collateral ID without manual intervention. The corresponding builder would lose reputation in addition to the collateral if they abuse empty slots. We could add a penalty (e.g. X ETH) in addition to the block value but I don't think it's necessary at this point. We could also have a "cooloff" period of 1 day before reactivation as an optimistic builder. As a side note, Ethereum is designed to handle empty slots. 1% of slots are already empty and if an attacker really cares about empty slots they can get some for free by DoSing proposers since we don't have SSLE yet. |
simErr := api.blockSimRateLimiter.send(opts.ctx, opts.req, opts.highPrio) | ||
log := opts.log.WithFields(logrus.Fields{ | ||
"duration": time.Since(t).Seconds(), | ||
"numWaiting": api.blockSimRateLimiter.currentCounter(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth logging something similar to numWaiting
but for optimistic block processors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that seems like it could be valuable to provide metrics into how many of the block processors are working at a time! this is a slightly bigger change because we don't have a currentCounter like the block sim rate limiter has. let me tinker on it and get back to this thread. leaving it open for now.
I'd be wary of going in this direction. We saw that bugs in the builders code propagated if the blocks were not explicitly verified by the relays, causing three or so extended periods of missed blocks. I'd definitely at least make this an opt-in from both the validator and the relay sides. |
We are definitely assuming buggy builders—that's part of our threat model! Every block received from builders is explicitly verified by the relay. The worst case is a single missed block per builder. Even that worst case is unlikely because an invalid block that does not win the auction is sufficient to trigger a demotion (itself requiring manual intervention to go back to optimistic mode). |
This does make participating more difficult, the builders who don't have huge upfront capital (the blocks are 10+ eth surprisingly often) will be significantly worse off. |
Every relay can have their own policy. For the ultra sound relay here's our current thinking (which may change):
The ultra sound relay has had 2 of 5,536 blocks over 10 ETH (scroll down to the "top blocks" leaderboard here). That's roughly 0.036%. We're totally happy having the bulk of blocks (say, >95%) be processed optimistically—if <5% of blocks are processed "pessimistically" that's fine :) |
services/api/service.go
Outdated
// Check if the block was valid in the optimistic case. | ||
if builderStatus == common.Optimistic { | ||
// Set to high-prio while we process the winning block. | ||
err := api.datastore.SetBlockBuilderStatus(bidTrace.BuilderPubkey.String(), common.HighPrio) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to set all builder pubkeys sharing the same collateral ID to HighPrio
. My suggestion is to refactor demoteBuildersByCollateralID
to a more general function titled setOptimisticSubstatusByCollateralID
(see substatuses suggestion below) and call that function here. Same thing at the end of this function where we set back to optimistic.
To avoid overloading the HighPrio
and LowPrio
statuses I suggest introducing substatuses for optimistic builders. For example:
optimistic.active
(in the default case)optimistic.demoted
(after a simulation error)optimistic.getPayload
(in the case where the optimistic builder won the slot auction and we're still processinggetPayload
)
In hindsight I feel that having a completely segregated state machine for optimistic builders will be cleaner, safer, and make code review easier.
Another advantage of the optimistic.demoted
substatus is that we won't have to wastefully re-simulate the winning block—upserting the proposer signature will be sufficient. It will also make it easier for operators to identify demoted builders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes a lot of sense! i think in terms of implementing this i would just go for 3 statuses rather than try to encode the semantics of "sub-status", but having optimisticActive, optimisticDemoted, optimisticGetPayload
in addition to lowPrio, highPrio, blacklisted
feels correct.
also agree that a single datastore function that sets the status over the set of builders with matching collateral ids makes sense and will be useful every time we need to modify that collection of statuses!
i will work on this ASAP. leaving the thread open to track and will circle back!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just going for 3 statuses makes sense to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there's a better name than optimisticGetPayload
. Maybe optimisticLocked
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check out 51edd33! this is taking into account the discussion above. breaking down the following cases, i think this covers the expected behavior.
Case 1.
optimisticBlockProcesor
receives the block first. Block simulation fails. Builder status set tocommon.OptimisticDemoted
.getPayload
seescommon.OptimisticDemoted
, so just inserts thesignedBeaconBlock
into the builder demotion table (doesn't resimulate).
Case 2.
getPayload
receives the block first. Block simulation fails. Builder status set tocommon.OptimisticDemoted
andsignedBeaconBlock
inserted into the demotion table.optimisticBlockProcesor
sees that that status of the builder is no longercommon.OptimisticActive
, so just skips the simulation.
Case 3.
getPayload
receives the block first. Block simulation in progress. Builder status set tocommon.OptimisticLocked
.optimisticBlockProcesor
sees that that status of the builder is no longercommon.OptimisticActive
, so just skips the simulation.
There are more cases than this, and especially if the block simulation succeeds, we will probably end up simulating the blocks twice. e.g., optimisticBlockProcesor
gets block first, successfully simulates it so no status change is made. getPayload
gets the block second and sees status is still OptimisticActive
, so simulates as well. But I think trying to handle each of these individually adds complexity without really solving much (especially because getPayload
is a once per slot operaion.
LMK what you think!
I'm a bit concerned that proposers are forced into a more unsafe interaction (i.e. trusting the relay and then not receiving the promised working block) without having a choice. How do you decide when to pay the proposer from the collateral? By alerts on specific log messages, or by proposers reaching out to you? I'm sure there's lots of lazy proposers that wouldn't reach out to you because they don't notice in time, so you'd need to automate this payment somehow. Such payments might also have negative tax implications (i.e. how can you explain the payment from an EOA wallet that's not the block builder)? |
That is the purpose of the demotions table. The demotion table contains sufficient information to refund all affected proposers. Our plan as relay operators is to hook up the demotion table to our alerting system to get appropriate automatic calls and emails. A reasonable soft commitment could be for refunds to happen within 24 hours. (As a side note, refunds are "irrational" for builders. The reason is that a cheating or buggy builder wouldn't extract MEV but still have to pay the bid to the proposer. Builders are incentivised to not mess up—I don't expect refunds to happen often in practice.)
Proposers do not need to reach out. The relay operator can pay the bid to the proposer's fee recipient without any interaction with the proposer.
Ha, that's an interesting point! :) I'm curious how it could have a negative tax implication for proposers. Below are a couple thoughts. (I'm not a tax expert so take the following with a huge grain of salt.)
|
@metachris: I had a chat with bloXroute today. They have refunded proposers several times, e.g. when the block body was not published on time—apparently it's not that uncommon. They said proposers haven't had issues with such delayed bid payments. |
I'll chime in that while relays certainly can reimburse proposers for missed slots, it invites a certain moral hazard that in aggregate makes the network weaker overall. If you can just paper over buggy code with money then you care less about writing bug-free code, and in doing so raise the ambient risk of a serious bug getting to mainnet. For similar reasons, an "optimistic" relay invites the same kind of moral hazard as I see it and so I would be very careful about deploying this change. |
services/api/service.go
Outdated
|
||
// Demotes a set of builders who have matching collateral IDs. This function | ||
// continues to try demoting even if an error is encountered for any builder. | ||
func (api *RelayAPI) demoteBuildersByCollateralID(builderPubkey string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's multiple ways this function can error out, and any of these might need to be manually followed up for executing the payment while the demotion not being part of the database. I suggest to at least label them commonly with a certain "relay-must-pay-but-db-failed"-like log tag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great point! in dd2fd60 i added a logging field
"errorWritingRefundToDB": true,
this should be seen as a critically important log!!
…Active or OptimisticDemoted
…ess to the fee recipient
Justin and I discussed this offline, but we think it makes sense to abandon this specific PR in favor of a new iteration with fewer commits and comments. Additionally, the new PR will have unit tests exercising the code. Because we want to be extra confident in the code, I am writing a test plan that I will attach to that PR for comment as well. the test plan will describe (1) the unit tests, (2) the integration tests we need (setting up the database, running the relay standalone, etc) and (3) the e2e tests we need (complete journeys that include block builders and validators interacting with the relay). Once that PR is open, I will reference it here (it will also be against my forked version of the repo rather than the actual flashbots repo for now). |
not to derail your efforts, but it could provide a lot of value to the community if we worked on making these tests generic for all relay software (e.g. like how we have https://github.com/ethereum/consensus-spec-tests generated from the |
would love to explore this route. dmed you on telegram! thanks, Alex :-) |
// Check if the block was valid if the builder is OptimisticActive or OptimisticDemoted. | ||
if builderStatus == common.OptimisticActive || builderStatus == common.OptimisticDemoted { | ||
// Set to locked while we process the winning block. | ||
api.setStatusByCollateralID(bidTrace.BuilderPubkey.String(), common.OptimisticLocked) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if handleSubmitNewBlock
for the next slot is somehow called before handleGetPayload
for the current slot executes this line code? For example, what if api.db.SaveDeliveredPayload
(a few lines above) takes a really long time? There is a race condition here, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from discussions offline, we are going to set value for currentOptimisticSlot and only optimistically process blocks one slot at a time. this will avoid these tricky race conditions and make our design easier to reason about. it also allows us to get rid of the OptimisticLocked
status, because we know while we are validating the block that won the slot n
auction, if the optimistic processing begins for slot n+1
, then we must have already processed all the blocks in slot n
, so we can be sure that if the winning block ended up being invalid, that builder has been demoted and thus doesn't have an opportunity to build invalid blocks on invalid blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is going to be one of the major design changes of the PR 3 :-)
// Check if the block was valid if the builder is OptimisticActive or OptimisticDemoted. | ||
if builderStatus == common.OptimisticActive || builderStatus == common.OptimisticDemoted { | ||
// Set to locked while we process the winning block. | ||
api.setStatusByCollateralID(bidTrace.BuilderPubkey.String(), common.OptimisticLocked) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be a more fundamental problem: we can't rely on handleGetPayload
to determine the winning block. The reason is that the validator could collude with the builder to get the payload out of band, without ever calling handleGetPayload
. This could even happen without collusion when multiple relays forwarded the same header and the proposer gets the payload from another relay first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed offline, i don't see this as being a significant issue. if handleGetPayload
is never called, then we never publish the block, thus we aren't responsible for refunding the proposer even if the block turns out to be invalid. i don't think it is possible for a proposer to trick us into thinking they deserve a refund unless we publish a signed, invalid block. if that is the case and they are colluding with the builder to make sure they get a bad block, we refund them from the builders collateral anyway, so it is just their own money changing hands. let me know if i am missing something here!
// Check if the block was valid if the builder is OptimisticActive or OptimisticDemoted. | ||
if builderStatus == common.OptimisticActive || builderStatus == common.OptimisticDemoted { | ||
// Set to locked while we process the winning block. | ||
api.setStatusByCollateralID(bidTrace.BuilderPubkey.String(), common.OptimisticLocked) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Self-TODO: think about what can go wrong with reorgs.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking about this a bit, i dont see any issue actually. once they call getPayload
through us, that beacon block at slot n
is published. even if they were to try and use a valid block for slot n
to reorg the n+1
slot block out, now they have signed two blocks at slot n
which is a slashing condition. basically we know that once that block is out there, they can never get that reward back. also another nice aspect of handling the refunds manually is that we can sanity check that the slot we are refunding is empty. given it will take place O(hours) later, we can be confident that no reorg is going to change that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lmk if i am missing something here!
include more error details for the context error
tl;dr; This PR implements an "optimistic" version of flashbots' mev-boost-relay. The core idea is that we can improve the throughput and latency of block submissions by using asynchronous processing. Instead of validating each block that comes in immediately, we return early and add the block to the datastore before checking its validity. This (hopefully) allows builders to submit more blocks, and critically, submit blocks later in the time period of the slot (because they don't have to wait for the simulation of the block to return successfully). Proposers still use the Proposer API to select the highest bid, but there is no longer a guarantee that this block is valid. To submit blocks in the optimistic mode, a builder must put up a collateral that is greater than the value of the blocks they are proposing. If a proposer ends up signing an invalid block, collateral from the builder of that block will be used to refund the proposer.
The changes can be described through 3 sequences:
1. Submitting optimistic blocks.
Builder API
endpoint of the relay.a. if the builder status is
optimisticActive
, send the block to theOptimistic Block Channel
.b. if the builder status is
optimisticLocked
, send the block to thehigh-prio queue
of theprio-load-balancer
.c. if the builder status is
optimisticDemoted
, send the block to thelow-prio queue
of theprio-load-balancer
.d. if the builder status is
highPrio
, send the block to thehigh-prio queue
of theprio-load-balancer
.e. if the builder status is
lowPrio
, send the block to thelow-prio queue
of theprio-load-balancer
.f. if the builder is in
blacklisted
mode, discard the block.2. Validating optimistic blocks.
OptimisticBlockProcessor
receives a block from theOptimistic Block Channel
(this happens in a different goroutine than theBuilder API
).OptimisticActive
, theOptimisticBlockProcessor
sends the block ashigh-prio
to theprio-load-balancer
for simulation.OptimisticBlockProcessor
.OptimisticDemoted
and the details of the failure are written to the database.3. Proposing optimistic blocks
mev-boost
callsgetHeader
on theProposer API
of the relay. This is part of theMEV-Block Block Proposal
as documented in https://docs.flashbots.net/flashbots-mev-boost/architecture-overview/block-proposal.mev-boost
callsgetPayload
on theProposer API
of the relay. This triggers the publication of aSignedBeaconBlock
.Proposer API
checks the builder status.a. if the status is
OptimisticActive
, the block that was just proposed is sent to the high-prio queue on theprio-load-balancer
for simulation.b. if the status is
OptimisticDemoted
, then write theSignedBeaconBlock
to the database for a refund justification.Proposer API
.Builder Demotions Table
with the details of the failure so the refund can be justified.Builder status edge cases.
There are two additional situations where the status of a builder is temporarily modified:
OptimisticLocked
temporarily to avoid optimistic blocks being built on optimistic blocks.The following diagram depicts the state machine of the builder status:
