
Proposal: Make MemBuffer Simple and Robust #58289

Open
you06 opened this issue Dec 16, 2024 · 6 comments
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments


you06 commented Dec 16, 2024

MemBuffer stores staging mutations before a transaction is committed. It offers a rich interface to meet TiDB's requirements.

  • A high-performance, single-threaded in-memory indexing data structure, supporting set/get operations and iterators.
  • Cascading transactions to support statement- or row-level rollback.
  • Checkpointing functionality.
  • Snapshot reads/scans to provide historical versions.

The current usage in TiDB exceeds the capabilities MemBuffer was designed for. This issue identifies improper usages and tracks improvements.

Unsafe Concurrent Operations

MemBuffer is not thread-safe, meaning it may panic due to data races if TiDB attempts concurrent operations.

Even with mutexes to avoid data races, TiDB cannot guarantee correctness, as explained in this comment.

There is already an RWMutex in MemBuffer, used to prevent SnapshotGetter races with write operations. While it feels unusual to have an RWMutex in a single-threaded data structure, at least the write operations will not affect snapshot read results.

Improper Snapshot Usage

Consider a simple scenario: given a B-tree, how should we update it during iteration?

package main

import (
	"fmt"

	"github.com/tidwall/btree"
)

func main() {
	// create a map
	var tree btree.Map[int, int]

	// init
	for i := 0; i < 10; i++ {
		tree.Set(i, i)
	}

	// iterate while updating
	tree.Scan(func(key, value int) bool {
		fmt.Printf("%d %d\n", key, value)
		tree.Set(2*key, 2*key) // mutating the tree mid-scan is effectively undefined behavior
		return true
	})
}

MemBuffer provides SnapshotIter and SnapshotIterReverse methods. However, during iteration over a snapshot, TiDB does not fully drain it at once and may write new data into MemBuffer midway. To ensure the correctness of the snapshot iterator, ART introduces a complex node retention mechanism (tikv/client-go#1503). This seems overly complex for an in-memory data structure. IMO, the iterator should fully drain the data in a single call, as MemBuffer is not designed to support MVCC.

Little changes:

type MemBuffer interface {
	...
+ 	// deprecated
	SnapshotIter([]byte, []byte) Iterator
+ 	// deprecated
	SnapshotIterReverse([]byte, []byte) Iterator

+ 	SnapshotScan([]byte, []byte) [][]byte
+ 	SnapshotScanReverse([]byte, []byte) [][]byte
	...
}

More changes:

type MemBuffer interface {
	...
+ 	// deprecated
	SnapshotIter([]byte, []byte) Iterator
+ 	// deprecated
	SnapshotIterReverse([]byte, []byte) Iterator
+ 	// deprecated
	SnapshotGetter() Getter

+ 	Snapshot() MemBufferSnapshot
	...
}

// TiDB might read from a snapshot across multiple concurrent threads. Each operation within `MemBufferSnapshot` is guarded by an `RWMutex` to prevent data races with writes to the `MemBuffer`.
+type MemBufferSnapshot interface {
+	Scan([]byte, []byte) [][]byte
+	ScanReverse([]byte, []byte) [][]byte
+	Get(ctx context.Context, k []byte) ([]byte, error)
+}

During each statement's execution, TiDB should use the snapshot following this procedure:

  1. Check whether the MemBuffer is dirty; if so, create a MemBufferSnapshot via the Snapshot method.
  2. Whenever snapshot data needs to be read, always read from the snapshot fetched in step 1; it is thread-safe.
  3. When the statement ends, drop the snapshot and never use it again.

you06 commented Dec 18, 2024

To the developers of TiDB, we need to reach a consensus that during execution, read executors (see below) should exclusively use snapshot read operations on the MemBuffer. It's clear that we cannot completely avoid concurrent read and write operations, but at the very least, we must ensure that the read results are consistent (from a snapshot).

The related datasource executors are:

The following usage can be considered improper:

| Executor Thread 1 | Executor Thread 2 |
| --- | --- |
| exec1.Next | exec2.Next |
| write to MemBuffer | read through MemBuffer if txn is not read-only |

*: Executor Thread 1 and 2 are two concurrent threads of the same SQL.

The read-only check is performed by MemBuffer.Dirty, which can race with write operations on MemBuffer. Furthermore, the result of MemBuffer.Dirty() depends on the order of write operations, meaning that sometimes the query reads from the MemBuffer and sometimes it does not. Even though the two execution paths produce the same result set, this behavior is improper and high-risk.


My proposal is to introduce a snapshot interface for MemBuffer. Before executing a SQL statement, TiDB should take a snapshot of the MemBuffer if the transaction is not read-only. During execution, when the datasource executors need to read through the MemBuffer, the read should be handled by the snapshot. Each read operation through the snapshot is protected by RWMutex.RLock. Additionally, TiDB should enforce the rule that only one thread can write to the MemBuffer, with all write operations guarded by RWMutex.Lock.

The execution procedure is like:

parse & optimize

memBufferSnapshot := txn.IsReadOnly() ? nil : txn.GetMemBuffer().Snapshot()

build the executor and start execution
  - write thread: txn.GetMemBuffer().Set(...)
  - read thread:
      if memBufferSnapshot != nil
        union read
          memBufferSnapshot.Get / memBufferSnapshot.Scan
          tikvSnapshot.Get / tikvSnapshot.Scan
      else
        tikvSnapshot.Get / tikvSnapshot.Scan

Welcome to vote by emoji, 👍 if agree, 👎 if you have other ideas.


ekexium commented Dec 18, 2024

> the iterator should fully drain the data in a single call, as MemBuffer is not designed to support MVCC.

That seems like a safer way. We just need to know whether there will be regressions, in both latency and memory.


you06 commented Dec 18, 2024

> It seems a safer way. Just need to know if there will be regressions, in both latency and memory

I haven't tested it.

I expect no regression in most cases; in latency-sensitive cases the MemBuffer is usually small, so there won't be much difference between an iterator and a full scan.

As for memory consumption, the returned KVs are just slice headers, not cloned memory, so the risk is not that high, I think.


ekexium commented Jan 13, 2025

Based on discussion, here are the subtasks:

cc @you06 @cfzjywxk

@cfzjywxk

@ekexium

> Protect each memdb with a single RWLock to prevent data race

Do we need to abandon the thread-unsafe exposed interfaces as well?


ekexium commented Jan 13, 2025

> @ekexium
>
> Protect each memdb with a single RWLock to prevent data race
>
> Do we also need to abandon the thread unsafe exposed interfaces as well?

We hope to make memdb fully thread-safe via locking. In practice, these need to be addressed case by case, particularly when assessing potential performance impacts.
