
take #3 :-) #3

Open · wants to merge 10 commits into main from mikeneuder-20230128-1
Conversation

michaelneuder (Owner) commented Jan 29, 2023

tl;dr: This PR implements an "optimistic" version of Flashbots' mev-boost-relay. The core idea is that we can improve the throughput and latency of block submissions by using asynchronous processing. Instead of immediately validating each block that comes in, we return early and add the block to the datastore before checking its validity. This (hopefully) allows builders to submit more blocks and, critically, to submit blocks later in the slot, because they don't have to wait for the block simulation to return successfully.

Proposers still use the Proposer API to select the highest bid, but there is no longer a guarantee that this block is valid. To submit blocks in the optimistic mode, a builder must put up collateral that is greater than the value of the blocks they are submitting. If a proposer ends up signing an invalid block, collateral from the builder of that block is used to refund the proposer.


The changes can be described through 3 sequences:

  1. Submitting optimistic blocks.
  2. Validating optimistic blocks.
  3. Proposing optimistic blocks.

1. Submitting optimistic blocks.

[Screenshot: optimistic block submission flow]

  1. Block builders submit blocks to the Builder API endpoint of the relay.
  2. Based on the collateral and status of the builder:
    a. if the builder's collateral is greater than the value of the block and the builder is not demoted, run the block simulation optimistically (asynchronously).
    b. else, send the block to the appropriate queue of the prio-load-balancer.
  3. For non-optimistic blocks, wait for the validation result.
  4. Update the builder's current bid in redis.

Notice that for builders with sufficient collateral, we update the bid without validating the incoming block. This is where the improved performance can (hopefully) be achieved.
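
For concreteness, here is a minimal Go sketch of the branch in step 2. The type and field names are stand-ins rather than the PR's exact identifiers; the real check lives in handleSubmitNewBlock and uses the per-slot builder cache described in the misc notes below.

package relay

import "math/big"

// blockBuilderCacheEntry is a simplified stand-in for the per-slot cache entry
// holding a builder's status and collateral (field names are illustrative).
type blockBuilderCacheEntry struct {
	isDemoted  bool
	collateral *big.Int
}

// shouldSimulateOptimistically returns true only when the builder is not demoted
// and its collateral exceeds the value of the submitted block; otherwise the
// block takes the normal synchronous path through the prio-load-balancer.
func shouldSimulateOptimistically(entry blockBuilderCacheEntry, blockValue *big.Int) bool {
	return !entry.isDemoted && entry.collateral.Cmp(blockValue) > 0
}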


2. Validating optimistic blocks.

[Screenshot: optimistic block validation flow]

  1. The OptimisticBlockProcessor sends the block as high-prio to the prio-load-balancer for simulation.
  2. The block is simulated on the validating nodes, and the status is returned to the OptimisticBlockProcessor.
  3. If the simulation failed, the builder is demoted and the details of the failure are written to the database.

This flow handles the simulation of all the blocks that were optimistically skipped. An invalid block here results in a demotion, but not necessarily a refund for the proposer because we don't yet know whether this block was the winning bid and thus proposed.
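
As a rough illustration of this flow, the sketch below uses assumed interface and type names (simulationClient, submitBlockRequest); the PR's OptimisticBlockProcessor and the prio-load-balancer implement the real versions.

package relay

// submitBlockRequest is a minimal stand-in for a builder's block submission.
type submitBlockRequest struct {
	builderPubkey string
}

// simulationClient is an assumed interface over the prio-load-balancer.
type simulationClient interface {
	// SimulateHighPrio queues the block as high-prio and blocks until the
	// validation nodes return a result.
	SimulateHighPrio(block *submitBlockRequest) error
}

// validateOptimisticBlock simulates a block that was accepted optimistically and
// demotes the builder if the simulation fails. No refund happens here; that is
// only relevant if the invalid block also ends up being proposed.
func validateOptimisticBlock(sim simulationClient, demote func(pubkey string, reason error), block *submitBlockRequest) {
	if err := sim.SimulateHighPrio(block); err != nil {
		demote(block.builderPubkey, err)
	}
}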


3. Proposing optimistic blocks.

[Screenshot: optimistic block proposal flow]

  1. mev-boost calls getHeader on the Proposer API of the relay. This is part of the MEV-Boost block proposal flow as documented in https://docs.flashbots.net/flashbots-mev-boost/architecture-overview/block-proposal.
  2. mev-boost calls getPayload on the Proposer API of the relay. This triggers the publication of a SignedBeaconBlock.
  3. The block that was just proposed is sent to the high-prio queue on the prio-load-balancer for simulation.
  4. The prio-load-balancer runs the simulation of the block against the validation nodes.
  5. If the block simulation failed, an invalid block was published. We upsert a new row into the Builder Demotions Table with the details of the failure so the refund can be justified.

This flow represents the process of checking whether our optimistic block processing ever results in a validator proposing an invalid block. Since block builders post collateral, that collateral will be used to reimburse the validator in that case. Since refunds should be a relatively rare event, we plan on handling them manually.
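
A sketch of the post-publication check, reusing the stand-in types from the earlier sketches plus an assumed demotionStore interface. The key difference from the asynchronous path above is that a failed simulation here means an invalid block was already published, so the demotion row stores the signed block and signed registration needed to justify the refund.

// demotionStore is an assumed wrapper over the Builder Demotions Table.
type demotionStore interface {
	UpsertBuilderDemotion(submission *submitBlockRequest, signedBlock, signedRegistration any) error
}

// checkProposedPayload is called after getPayload has published the block.
func checkProposedPayload(sim simulationClient, db demotionStore, submission *submitBlockRequest, signedBlock, signedRegistration any) error {
	if err := sim.SimulateHighPrio(submission); err != nil {
		// Record everything needed to justify a (manually processed) refund.
		return db.UpsertBuilderDemotion(submission, signedBlock, signedRegistration)
	}
	return nil
}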


Misc notes

We only allow optimistic processing of one slot at a time. We use a waitgroup to ensure that, before we update the optimistic slot, all blocks from the previous slot have been processed. This processing may bleed into the subsequent slot, but that is OK because we are shifting that work away from the end of the previous slot, which is where timing is much more critical for winning bids.
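
A minimal sketch of that gating with sync.WaitGroup, using hypothetical helper names; the real logic sits in the submission handler and updateOptimisticSlot.

package relay

import "sync"

var optimisticBlocksWG sync.WaitGroup

// Every optimistic submission registers with the waitgroup before its
// asynchronous simulation starts.
func enqueueOptimisticSimulation(block *submitBlockRequest, simulate func(*submitBlockRequest)) {
	optimisticBlocksWG.Add(1)
	go func() {
		defer optimisticBlocksWG.Done()
		simulate(block)
	}()
}

// advanceOptimisticSlot only moves to the new slot once every block from the
// previous optimistic slot has finished simulating, even if that waiting
// bleeds into the start of the next slot.
func advanceOptimisticSlot(refreshBuilderCache func()) {
	optimisticBlocksWG.Wait()
	refreshBuilderCache()
}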

At the beginning of each slot we also cache the status and collateral of each builder. We access this cache during the block submission flow to (1) avoid repeatedly reading the same status + collateral for a builder and (2) avoid race conditions where the status + collateral of a builder changes over the course of a slot due to either a demotion or an internal API call.
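
Something like the following, assuming an illustrative builderRow shape and a caller-supplied query function; the PR populates the cache in updateOptimisticSlot by reading the block-builder table. It reuses the blockBuilderCacheEntry stand-in from the earlier sketch.

package relay

import "math/big"

// builderRow is an illustrative projection of the block-builder table.
type builderRow struct {
	Pubkey     string
	IsDemoted  bool
	Collateral *big.Int
}

// refreshBuilderCache rebuilds the in-memory snapshot once per slot so the hot
// submission path never reads status or collateral from Redis or the database.
func refreshBuilderCache(loadBuilders func() ([]builderRow, error)) (map[string]blockBuilderCacheEntry, error) {
	rows, err := loadBuilders()
	if err != nil {
		return nil, err
	}
	cache := make(map[string]blockBuilderCacheEntry, len(rows))
	for _, r := range rows {
		cache[r.Pubkey] = blockBuilderCacheEntry{isDemoted: r.IsDemoted, collateral: r.Collateral}
	}
	return cache, nil
}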

Since builders may use many different public keys to submit blocks, we allow all of those keys to be backed by a single collateral through the use of a "Collateral ID". This is aimed at simplifying the process of posting collateral, but if any public key using that Collateral ID submits an invalid block, all of the public keys are demoted.

michaelneuder added commits that referenced this pull request Jan 29, 2023 (small fixes)
// Upsert into the builder demotion table but without the signed block
// or the signed registration, because we don't know if this bid will be
// accepted.
err = api.db.UpsertBuilderDemotion(&opts.req.BuilderSubmitBlockRequest, nil, nil)


We should maybe use logs as a makeshift DB in case of an error here. We get that for free when we abstract away the logic in a separate function (as previously commented).

michaelneuder (Owner, Author):

nice, this led to a good cleanup IMO. i created a function called demoteBuilder which wraps the calls to SetBlockBuilderStatusByCollateralID and UpsertBuilderDemotion, which we call both from the optimisticBlockProcessor and the getPayload handler.

this function takes the logging that we reasoned we could use as our backup DB when we were discussing the getPayload logging, so now we will have consistency in our demotion logging. thanks!
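
A sketch of what such a demoteBuilder-style wrapper can look like, with assumed interfaces and a simplified demoted flag (the actual signatures in the PR differ), reusing the submitBlockRequest stand-in from the earlier sketch:

type builderStore interface {
	SetBlockBuilderStatusByCollateralID(builderPubkey string, demoted bool) error
	UpsertBuilderDemotion(submission *submitBlockRequest, signedBlock, signedRegistration any) error
}

type logger interface {
	Errorf(format string, args ...any)
}

// demoteBuilder is the single place that flips the status of every pubkey
// sharing the collateral ID, records the demotion row, and logs enough detail
// to serve as a backup record if the database write fails.
func demoteBuilder(ds builderStore, log logger, submission *submitBlockRequest, signedBlock, signedRegistration any) {
	if err := ds.SetBlockBuilderStatusByCollateralID(submission.builderPubkey, true); err != nil {
		log.Errorf("failed to set demoted status for builder %s: %v", submission.builderPubkey, err)
	}
	if err := ds.UpsertBuilderDemotion(submission, signedBlock, signedRegistration); err != nil {
		log.Errorf("failed to upsert demotion for builder %s: %v", submission.builderPubkey, err)
	}
}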

@@ -950,13 +1097,35 @@ func (api *RelayAPI) handleSubmitNewBlock(w http.ResponseWriter, req *http.Reque
 		}
 	}

-	builderIsHighPrio, builderIsBlacklisted, err := api.redis.GetBlockBuilderStatus(payload.Message.BuilderPubkey.String())
+	builderStatus, err := api.redis.GetBlockBuilderStatus(payload.Message.BuilderPubkey.String())


We should maybe consider caching api.redis.GetBlockBuilderStatus and api.redis.GetBlockBuilderCollateral for all builders, e.g. taking a snapshot once per slot. Here's my thinking:

  1. With optimistic relaying we can have looser rate limiting (e.g. 100 bids/sec/builder). Each Redis read takes on the order of 1ms so these Redis calls can add up, even with just tens of builder pubkeys. The current codebase was designed for a 2 bids/sec/builder paradigm and may need a bit of performance refactoring to handle the load.
  2. The status of builders can change mid-slot in two different ways: a) through demotions, and b) through the internal API. As such there are potential race conditions whenever the internal API is used. It would be conceptually cleaner if at the beginning of every slot (e.g. in updateOptimisticSlot) we take a snapshot of the relevant subset of vars.TableBlockBuilder.
  3. Those Redis calls are in the critical path so we're looking to shave a couple milliseconds of latency—a nice-to-have micro-optimisation.

michaelneuder (Owner, Author):

i think this is a nice change! i added logic to updateOptimisticSlot to cache (create a map as a field of the relay type) the builder status and collaterals at each slot by reading from the TableBlockBuilder as you mentioned. then in the critical path of the block submission, we can just read from this cache. this has a small added benefit of getting rid of our need for the Get and Set collateral on the redis interface (because we only ever read the collateral from the db). and it makes the code more readable on the fast path --- feels like a win!

i was thinking about whether we should filter the builders put in the cache, but I don't really think it's necessary. we could use a heuristic like "only add builders who have submitted a block in the last 100 slots" or something, but unless we need to, this feels like http://wiki.c2.com/?PrematureOptimization :-) especially considering this is happening at the beginning of each slot, one db read and iterating through the list of builders just doesn't seem that crazy to me. but lmk what you think!


lmk what you think!

I think it looks great :)

michaelneuder (Owner, Author):

responded to your comments in this commit fc8ded6. i can squash down into the base commit once you have a chance to look, just figured it would be nice to see these specific diffs here. thanks :-)

@michaelneuder michaelneuder force-pushed the mikeneuder-20230128-1 branch 4 times, most recently from 478e48f to c69f94b on February 4, 2023 at 18:16
Comment on lines 152 to 179
func (ds *Datastore) SetBlockBuilderStatusByCollateralID(builderPubkey string, status common.BuilderStatus) []error {
	// Fetch builder collateral_id.
	builder, err := ds.db.GetBlockBuilderByPubkey(builderPubkey)
	if err != nil {
		err = fmt.Errorf("unable to get builder from database: %v", err)
		ds.log.Error(err.Error())
		return []error{err}
	}

	// Fetch all builder pubkeys using the collateral_id.
	builderPubkeys, err := ds.db.GetBlockBuilderPubkeysByCollateralID(builder.CollateralID)
	if err != nil {
		err = fmt.Errorf("unable to get builder pubkeys by collateral id: %v", err)
		ds.log.Error(err.Error())
		return []error{err}
	}

	var errs []error
	for _, pubkey := range builderPubkeys {
		err := ds.db.SetBlockBuilderStatus(pubkey, status)
		if err != nil {
			err = fmt.Errorf("failed to set block builder: %v status in db: %v", pubkey, err)
			ds.log.Error(err.Error())
			errs = append(errs, err)
		}
	}
	return errs
}


Since Redis is no longer involved, should this go in database.go? Also, would it be possible to do a one-shot SQL update (instead of 1 + 1 reads and n writes)? That may be cleaner and more performant. A single error object would suffice.

michaelneuder (Owner, Author):

nice, good points! here is a commit that implements this. 8425fbc

the tl;dr is we can bundle this all into a single SetBuilderStatus function, which checks if the collateral ID of a builder is set. if it is, then we update all the statuses in one DB update call. if it is not, then we just update the single builder pubkey. this is a nice cleanup i think!
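
As a rough illustration of the one-shot update, assuming a Postgres table with builder_pubkey, collateral_id, and is_demoted columns (the PR's actual table and column names may differ), a single statement can cover both the collateral-ID case and the single-pubkey case:

package database

import "database/sql"

// setBuilderDemoted demotes the given builder and, if it has a collateral ID,
// every other builder pubkey sharing that collateral ID, in one statement.
func setBuilderDemoted(db *sql.DB, builderPubkey string) error {
	_, err := db.Exec(`
		UPDATE blockbuilder
		SET is_demoted = true
		WHERE builder_pubkey = $1
		   OR (collateral_id != '' AND collateral_id = (
		       SELECT collateral_id FROM blockbuilder WHERE builder_pubkey = $1))`,
		builderPubkey)
	return err
}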


nice!

Comment on lines +1128 to +1136
if !ok {
	log.Warnf("unable to read builder: %x from the builder cache, using low-prio and no collateral", builderPubkey)
	builderEntry = &blockBuilderCacheEntry{
		status: common.BuilderStatus{
			IsHighPrio: false,
		},
		collateral: ZeroU256,
	}
}


I'm tempted to error out and return. The reason is that this path should never happen. And if it does, alerting the builder instead of silently failing may be better. It's also fewer lines of code :)

michaelneuder (Owner, Author):

so the builder won't be in the DB until the first call to UpsertBlockBuilderEntryAfterSubmission, which takes place after this (line 1248 in the current rev). since the cache reads from the DB, if we don't have this fallback, the builder will never have a way to get added to the table.

the builder status and collateral also can't be modified until the builder is in the table. we could add an endpoint to manually add the builder, but this would add complexity. IMO logging with a warning here feels correct, but happy to continue discussing if you still think we should change it!


Is there an equivalent of log.Warnf but as an error?

michaelneuder (Owner, Author):

yes! there is log.Errorf. i didn't use it because every new builder that connects will hit this code path, so it could potentially be a lot of noise in terms of errors that are actually expected. happy to switch to Errorf if you think it is better though!

michaelneuder added a commit that referenced this pull request Mar 24, 2023: read optimistic_submission field from db during GetBuilderSubmissions