
Proposal: Make MemBuffer Simple and Robust #58289

Open
you06 opened this issue Dec 16, 2024 · 6 comments
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments


you06 commented Dec 16, 2024

MemBuffer stores staging mutations before a transaction is committed. It offers a rich interface to meet TiDB's requirements.

  • A high-performance, single-threaded in-memory indexing data structure, supporting set/get operations and iterators.
  • Cascading transactions to support statement- or row-level rollback.
  • Checkpointing functionality.
  • Snapshot reads/scans to provide historical versions.

The current usage in TiDB exceeds the capabilities MemBuffer was designed for. This issue identifies improper usages and tracks improvements.

Unsafe Concurrent Operations

MemBuffer is not thread-safe, meaning it may panic due to data races if TiDB attempts concurrent operations.

Even with mutexes to avoid data races, TiDB cannot guarantee correctness, as explained in this comment.

There is already an RWMutex in MemBuffer, used to prevent SnapshotGetter races with write operations. While it feels unusual to have an RWMutex in a single-threaded data structure, at least the write operations will not affect snapshot read results.

Improper Snapshot Usage

Consider a simple scenario: given a B-tree, how should we update it during iteration?

package main

import (
	"fmt"

	"github.com/tidwall/btree"
)

func main() {
	// create a map
	var tree btree.Map[int, int]

	// init
	for i := 0; i < 10; i++ {
		tree.Set(i, i)
	}

	// iterate while updating
	tree.Scan(func(key, value int) bool {
		fmt.Printf("%d %d\n", key, value)
		tree.Set(2*key, 2*key) // mutating the tree mid-scan is effectively undefined behavior
		return true
	})
}

MemBuffer provides SnapshotIter and SnapshotIterReverse methods. However, during iteration over a snapshot, TiDB does not fully drain it at once and may write new data into MemBuffer midway. To ensure the correctness of the snapshot iterator, ART introduces a complex node retention mechanism (tikv/client-go#1503). This seems overly complex for an in-memory data structure. IMO, the iterator should fully drain the data in a single call, as MemBuffer is not designed to support MVCC.

Little changes:

type MemBuffer interface {
	...
+ 	// deprecated
	SnapshotIter([]byte, []byte) Iterator
+ 	// deprecated
	SnapshotIterReverse([]byte, []byte) Iterator

+ 	SnapshotScan([]byte, []byte) [][]byte
+ 	SnapshotScanReverse([]byte, []byte) [][]byte
	...
}

More changes:

type MemBuffer interface {
	...
+ 	// deprecated
	SnapshotIter([]byte, []byte) Iterator
+ 	// deprecated
	SnapshotIterReverse([]byte, []byte) Iterator
+ 	// deprecated
	SnapshotGetter() Getter

+ 	Snapshot() MemBufferSnapshot
	...
}

// TiDB might read from a snapshot across multiple concurrent threads. Each operation within `MemBufferSnapshot` is guarded by an `RWMutex` to prevent data races with writes to the `MemBuffer`.
+type MemBufferSnapshot interface {
+	Scan([]byte, []byte) [][]byte
+	ScanReverse([]byte, []byte) [][]byte
+	Get(ctx context.Context, k []byte) ([]byte, error)
+}

During each statement's execution, TiDB should use the snapshot following this procedure:

  1. Check whether the MemBuffer is dirty; if so, create a MemBufferSnapshot via the Snapshot method.
  2. Whenever snapshot data needs to be read, always read from the snapshot fetched in step 1; it is thread-safe.
  3. When the statement ends, drop the snapshot and never use it again.

you06 commented Dec 18, 2024

To the developers of TiDB, we need to reach a consensus that during execution, read executors (see below) should exclusively use snapshot read operations on the MemBuffer. It's clear that we cannot completely avoid concurrent read and write operations, but at the very least, we must ensure that the read results are consistent (from a snapshot).

The related datasource executors are:

The following usage can be considered improper:

| Executor Thread 1 | Executor Thread 2 |
| --- | --- |
| exec1.Next | exec2.Next |
| write to MemBuffer | read through MemBuffer if txn is not read-only |

*: Executor Thread 1 and 2 are two concurrent threads of the same SQL.

The read-only check is performed by MemBuffer.Dirty, which can race with write operations on MemBuffer. Furthermore, the result of MemBuffer.Dirty() depends on the order of write operations, meaning that sometimes the query reads from the MemBuffer and sometimes it does not. Even though the two execution paths produce the same result set, this behavior is improper and high-risk.


My proposal is to introduce a snapshot interface for MemBuffer. Before executing a SQL statement, TiDB should take a snapshot of the MemBuffer if the transaction is not read-only. During execution, when the datasource executors need to read through the MemBuffer, the read should be handled by the snapshot. Each read operation through the snapshot is protected by RWMutex.RLock. Additionally, TiDB should enforce the rule that only one thread can write to the MemBuffer, with all write operations guarded by RWMutex.Lock.

The execution procedure is like:

parse & optimize

memBufferSnapshot := txn.IsReadOnly() ? nil : txn.GetMemBuffer().Snapshot()

build the executor and start execution
  - write thread: txn.GetMemBuffer().Set(...)
  - read thread:
      if memBufferSnapshot != nil
        union read
          memBufferSnapshot.Get / memBufferSnapshot.Scan
          tikvSnapshot.Get / tikvSnapshot.Scan
      else
        tikvSnapshot.Get / tikvSnapshot.Scan

Welcome to vote by emoji, 👍 if agree, 👎 if you have other ideas.


ekexium commented Dec 18, 2024

> the iterator should fully drain the data in a single call, as MemBuffer is not designed to support MVCC.

That seems like a safer way. We just need to know whether there will be regressions, in both latency and memory.


you06 commented Dec 18, 2024

> It seems a safer way. Just need to know if there will be regressions, in both latency and memory

I haven't tested it.

I expect no regression in most cases; in latency-sensitive cases the MemBuffer is usually small, so there won't be much difference between an iterator and a full scan.

As for memory consumption, the returned KVs are just slice headers, not cloned memory, so the risk is not that high, I think.


ekexium commented Jan 13, 2025

Based on discussion, here are the subtasks:

cc @you06 @cfzjywxk

@cfzjywxk

@ekexium

> Protect each memdb with a single RWLock to prevent data race

Do we need to abandon the thread-unsafe exposed interfaces as well?


ekexium commented Jan 13, 2025

> @ekexium
>
> Protect each memdb with a single RWLock to prevent data race
>
> Do we also need to abandon the thread unsafe exposed interfaces as well?

We hope to make memdb fully thread-safe via locking. In practice, these need to be addressed case by case, particularly when assessing potential performance impacts.
