Design
Keys in dtm-redis are mapped to hash buckets, and each bucket is managed by a separate Erlang process. Since Erlang makes distributed programming rather easy, the buckets can live on separate Erlang nodes and on separate physical hosts. dtm-redis performs distributed transactions using an implementation of the two-phase commit protocol. The role of transaction coordinator is shared between the client session and the transaction monitor components: the voting phase is managed by the client session, and the transaction monitor records the status of the transaction before the cohorts (the buckets) are informed of the decision. During recovery, a bucket checks with the transaction monitor associated with a transaction to learn whether to commit or roll back. Because there can be an arbitrary number of buckets and an arbitrary number of transaction monitors, there is no single scalability bottleneck, so dtm-redis can in theory scale to arbitrary levels of throughput.
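As a rough illustration of the key-to-bucket mapping, the sketch below hashes a key to a slot with erlang:phash2/2 and forwards the command to the owning bucket process. The message shapes and the bucket table are assumptions for the sketch, not dtm-redis internals.

```erlang
-module(routing_sketch).
-export([route/3]).

%% Buckets is a tuple of bucket pids, one entry per hash slot. Because
%% pids are location-transparent, the bucket may live on another node.
route(Key, Command, Buckets) ->
    Slot = erlang:phash2(Key, tuple_size(Buckets)) + 1,
    Bucket = element(Slot, Buckets),
    Bucket ! {command, self(), Command},
    receive
        {reply, Bucket, Reply} -> Reply
    after 5000 ->
        {error, timeout}
    end.
```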
- Server (server.erl): listens on a TCP port for connection requests from clients, spawning a client session for each connection received (sketched after this list)
- Client session (session.erl): manages communication with clients, maintains the session state, including allocating a transaction identifier from the transaction monitor and brokering transactions with the buckets
- Transaction monitor (txn_monitor.erl): maintains the state of transactions in memory and on disk
- Bucket (bucket.erl): maintains the state of the keys stored by a bucket and acts as a middleman between the client session and the back-end store
- Binlog (binlog.erl): records transaction state to disk for recovery in case of failure
- Redis store (redis_store.erl): handles reading and writing to redis back-end storage
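The accept-and-spawn loop of the server component might look roughly like the following. This is a minimal sketch, not the actual server.erl; a real session would parse the redis wire protocol rather than echo.

```erlang
-module(server_sketch).
-export([start/1]).

%% Listen on a TCP port and spawn one session process per connection.
start(Port) ->
    {ok, Listen} = gen_tcp:listen(Port, [binary, {active, false}, {reuseaddr, true}]),
    accept_loop(Listen).

accept_loop(Listen) ->
    {ok, Socket} = gen_tcp:accept(Listen),
    Pid = spawn(fun() -> session_loop(Socket) end),
    ok = gen_tcp:controlling_process(Socket, Pid),
    accept_loop(Listen).

session_loop(Socket) ->
    case gen_tcp:recv(Socket, 0) of
        {ok, Data} ->
            %% a real session would parse the redis protocol here
            ok = gen_tcp:send(Socket, Data),
            session_loop(Socket);
        {error, closed} ->
            ok
    end.
```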
The two main areas of complexity in dtm-redis currently lie in the management of keys by the bucket component and the management of transactions by the client session.
The bucket component stores per-key state for as long as at least one client session has the key watched during a transaction, or the key is being read from or written to. When the bucket performs the voting phase of a transaction and the vote will be "yes", the keys associated with that transaction are locked by the bucket until the transaction is committed or rolled back. If a second transaction enters the voting phase for a key already locked by another transaction, its vote is pended until the first transaction completes. Since transactions can overlap on keys managed by multiple buckets, deadlock is possible. Deadlock is not currently detected or avoided, though this would be relatively easy to add by placing a timeout on the voting phase. After a transaction completes on a bucket, the keys associated with it are unlocked and any pended operations are flushed. The state maintained by the bucket for a key contains a version field that is incremented each time the key is modified. The state is deleted when the key is no longer active in any transaction, which resets the version back to 1 the next time it is needed.
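A sketch of the per-key state a bucket might keep is shown below. The record and field names are assumptions, but the sketch captures the locking, pending, and versioning behavior described above.

```erlang
-module(key_state_sketch).
-export([lock/2, unlock/1]).

%% Per-key state a bucket might keep (field names are assumptions).
-record(key_state, {
    version = 1,      %% incremented on each modification
    locked_by = none, %% transaction id holding the lock, or none
    pending = []      %% vote requests queued behind the lock
}).

lock(Txn, #key_state{locked_by = none} = State) ->
    {granted, State#key_state{locked_by = Txn}};
lock(Txn, #key_state{pending = Pending} = State) ->
    %% another transaction holds the lock; pend the vote until it completes
    {pended, State#key_state{pending = Pending ++ [Txn]}}.

unlock(#key_state{pending = []} = State) ->
    %% no waiters: the state can now be dropped, resetting the version
    %% to 1 the next time the key becomes active
    State#key_state{locked_by = none};
unlock(#key_state{pending = [Next | Rest]} = State) ->
    %% hand the lock to the next pended transaction
    State#key_state{locked_by = Next, pending = Rest}.
```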
Transaction management by the client session component is the other major area of complexity. The key associated with a command is hashed to find the appropriate bucket, and the command is passed through to that bucket. The multi and exec commands are handled directly by the client session. When a multi or a watch command is received, a transaction is allocated from a randomly chosen transaction monitor. The transaction identifier is passed to the bucket with any watch command and with every command between a multi and an exec. When an exec command is received, the client session begins the voting phase by sending a message to each bucket that the current transaction identifier has been sent to (every bucket that was the target of a watch command or of any command after the multi). Once it has received a vote response from every bucket it asked to participate, it makes a decision about the transaction. If all buckets involved have voted "yes", the client session sends a message to the transaction monitor asking it to record the transaction status. When the transaction monitor has persisted the status, it sends a message back to the client session, and the client session sends a commit message to the buckets, which then carry out the requested operations.
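The voting phase run by the client session might be sketched as follows, assuming simple message shapes ({vote_request, ...} and {vote, ...}) that are not taken from the actual sources.

```erlang
-module(vote_sketch).
-export([vote/2]).

%% Ask every participating bucket to vote, collect the replies, and
%% commit only if all vote yes.
vote(TxnId, Buckets) ->
    [B ! {vote_request, self(), TxnId} || B <- Buckets],
    collect(TxnId, Buckets, []).

collect(_TxnId, [], Votes) ->
    case lists:all(fun(V) -> V =:= yes end, Votes) of
        true  -> commit;   %% next: ask the monitor to persist the status
        false -> rollback
    end;
collect(TxnId, Waiting, Votes) ->
    receive
        {vote, Bucket, TxnId, Vote} ->
            collect(TxnId, lists:delete(Bucket, Waiting), [Vote | Votes])
    end.
```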
dtm-redis can operate in either durable or non-durable mode. In durable mode, all operations are logged to a file before being performed. A bucket logs the operations of a transaction to a file when asked to vote, before responding "yes". If the bucket crashes before committing the transaction, it can read the transaction log on restart to find out which transactions were outstanding when it terminated. For each outstanding transaction, it can ask the appropriate transaction monitor for the state of the transaction, applying it if the transaction monitor indicates that it was committed and discarding it otherwise (crash recovery of the bucket is not currently implemented). When the transaction monitor is told by the client session that the transaction should be committed, the state of the transaction is recorded in memory and in a log file. If the transaction monitor crashes, it can read its log on restart and answer questions about all transactions it successfully recorded to disk (crash recovery of the transaction monitor is not currently implemented). The point at which a transaction is actually finalized is when its state is written to disk by the transaction monitor: up until this point, a failure leads to the transaction being aborted; after this point, a failure will be recoverable once crash recovery is implemented, assuming the disk is not corrupted. For this mode to function correctly, redis (or any other backing store) must also operate durably (appendfsync always). Durable mode is orders of magnitude slower than non-durable mode.
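The write-ahead step in durable mode can be sketched as below: append the transaction record and fsync before the bucket replies "yes". The file layout is an assumption; a real log would also need framing so records can be read back during recovery.

```erlang
-module(binlog_sketch).
-export([log_before_vote/2]).

%% Append the transaction's operations to the log and force them to
%% disk before the bucket is allowed to vote "yes".
log_before_vote(Path, TxnRecord) ->
    {ok, File} = file:open(Path, [append, raw, binary]),
    ok = file:write(File, term_to_binary(TxnRecord)),
    ok = file:sync(File),   %% durable only once the data reaches disk
    ok = file:close(File).
```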
Non-durable mode currently does not guarantee consistency. The only persistence mechanism is that of redis itself, and since the point of this mode is performance, it doesn't make sense to make redis durable. Because redis persistence is not synchronized between buckets, there is no way to ensure that a transaction spanning buckets will remain consistent if the redis backing store for one of those buckets fails immediately after committing the transaction. It may be possible to force saving to be synchronized by disabling automatic saving in the redis configuration and forcing a synchronized save across all buckets periodically. For this to work, the redis save command may need to be modified to accept a filename to write the database to, so that the save files can have names corresponding to their generation. It may also be possible to simply alter the save filename using the "config set" command followed by a bgsave command. This is still just an idea, and additional investigation is needed to determine whether it is possible and, perhaps more importantly, whether it would work well.
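As an idea-stage sketch of that last approach, something like the following could rename the dump file per generation and trigger a background save on every bucket. The redis_command/2 transport is hypothetical (a real implementation would go through redis_store.erl), while CONFIG SET dbfilename and BGSAVE are real redis commands.

```erlang
-module(sync_save_sketch).
-export([save_generation/2]).

%% Point every bucket's redis at a generation-specific dump file, then
%% kick off a background save on each.
save_generation(Buckets, Generation) ->
    Filename = io_lib:format("dump-~B.rdb", [Generation]),
    lists:foreach(fun(Bucket) ->
        ok = redis_command(Bucket, ["CONFIG", "SET", "dbfilename", Filename]),
        ok = redis_command(Bucket, ["BGSAVE"])
    end, Buckets).

%% hypothetical transport, stubbed so the sketch compiles
redis_command(_Bucket, _Command) -> ok.
```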