Database — Part Three

Sanket Saxena
8 min read · Jul 10, 2023


Distributed Locking: A Detailed Overview

In a distributed system, multiple nodes work together to perform operations. A common challenge that arises in such a system is ensuring data consistency and avoiding race conditions — this is where the concept of distributed locking comes into play.

Problem Statement

Imagine a simple operation like incrementing a counter. In a single-node system, this operation is atomic. However, in a distributed system with multiple nodes trying to increment the counter simultaneously, we could encounter a race condition.

In another scenario, suppose we have an e-commerce system where multiple users attempt to purchase the last item of a product. Without a locking mechanism, it’s possible for multiple users to successfully place an order for the same item, leading to an overselling situation.

The crux of the problem lies in the fact that distributed systems lack a global, atomic view of operations. In distributed systems, separate nodes can’t inherently know what others are doing at a given moment. Without a mechanism to synchronize these operations, we risk inconsistencies and race conditions.

The Solution: Distributed Locking

Distributed locking is a method used to prevent multiple nodes from reading and writing to the same resource simultaneously. The idea is simple: before a node starts to perform an operation, it attempts to acquire a lock. If the lock acquisition is successful, the node can safely perform the operation knowing no other node will interfere. Once it’s done, it releases the lock, allowing other nodes to acquire it.

Distributed locks can be implemented in various ways, often with the help of coordination services like ZooKeeper, etcd, or distributed databases like Google’s Cloud Spanner.
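As a concrete illustration, here is a minimal sketch of a Redis-based lock using the redis-py client: the lock is acquired with SET ... NX PX so it exists only if no one else holds it and expires automatically, and it is released with a small Lua script so a node can only delete its own lock. This assumes a reachable Redis server; `do_critical_work` is a placeholder for the protected operation, and a production system would need more care (retries, fencing, or a multi-node scheme).

```python
import uuid

import redis

# Assumes a Redis server reachable on localhost:6379 and the redis-py client.
r = redis.Redis()

# Release only if the key still holds our token, so we never delete a lock
# that expired and was re-acquired by another node.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""
release_lock = r.register_script(RELEASE_SCRIPT)

def do_critical_work():
    # Placeholder for the protected operation (e.g. decrementing stock).
    pass

def with_lock(resource, ttl_ms=30_000):
    token = str(uuid.uuid4())  # unique per acquisition
    # SET key token NX PX ttl: acquire only if no one else holds the lock,
    # and let it expire automatically if this node dies before releasing it.
    if not r.set(resource, token, nx=True, px=ttl_ms):
        raise RuntimeError("lock is held by another node")
    try:
        do_critical_work()
    finally:
        release_lock(keys=[resource], args=[token])

with_lock("lock:item:42")
```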

Saga Pattern

The Saga Pattern is an architectural pattern that provides an approach to handling system failures during a long-running distributed transaction. It breaks the operation into a series of local transactions, each paired with a compensating transaction that can undo its effects; together, this series of local transactions makes up a single logical operation.

There are two ways to implement the Saga Pattern: orchestration and choreography.

Orchestration

In the orchestration method, there is a central saga orchestrator (or coordinator) responsible for creating and executing each local transaction and corresponding compensating transactions in case of failures.

Orchestration ensures a straightforward flow of control, making it easier to understand the system’s behavior. However, the orchestrator can become a potential single point of failure or a bottleneck.
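A minimal sketch of the orchestration style, assuming a hypothetical order-placement saga: the orchestrator runs each local transaction in order and, if one fails, invokes the compensating transactions of the steps that already completed, in reverse order. The service calls here are print-stubs standing in for real ones.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; undo completed steps on failure."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        # Compensate in reverse order, then surface the failure.
        for compensate in reversed(completed):
            compensate()
        raise

# Hypothetical local transactions and their compensations (stubs for illustration).
def reserve_inventory(): print("inventory reserved")
def release_inventory(): print("inventory released")
def charge_payment():    print("payment charged")
def refund_payment():    print("payment refunded")
def create_shipment():   print("shipment created")
def cancel_shipment():   print("shipment cancelled")

run_saga([
    (reserve_inventory, release_inventory),
    (charge_payment,    refund_payment),
    (create_shipment,   cancel_shipment),
])
```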

Choreography

In the choreography method, each service publishes an event when its local transaction finishes, and the other services subscribe to these events and run their own local transactions in response. Each service also knows how to execute its own compensating transaction, so no central coordinator is needed.

While choreography can eliminate bottlenecks, it can also introduce complexity, as it may be harder to understand the overall flow of control due to the decentralized nature of event communication.
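The same order flow sketched in the choreography style: there is no coordinator, only services reacting to each other's events. A real system would publish these events to a message broker such as Kafka or RabbitMQ; the in-process handler registry below is just a stand-in to show the shape of the interaction.

```python
from collections import defaultdict

# In-process stand-in for a message broker: event name -> list of handlers.
handlers = defaultdict(list)

def subscribe(event, handler):
    handlers[event].append(handler)

def publish(event, payload=None):
    for handler in handlers[event]:
        handler(payload)

# Order service: placing an order asks for payment.
subscribe("order_placed", lambda _: publish("payment_requested"))
# Payment service: succeeds here; on failure it would publish "payment_failed".
subscribe("payment_requested", lambda _: publish("payment_completed"))
# Shipping service reacts to successful payment.
subscribe("payment_completed", lambda _: print("shipment created"))
# Compensation path: the order service cancels the order if payment fails.
subscribe("payment_failed", lambda _: print("order cancelled (compensation)"))

publish("order_placed")  # ends with "shipment created"
```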

The Complications with Distributed Locking

While distributed locking solves critical issues in distributed systems, it also introduces its own set of challenges:

  1. Deadlocks: If two or more operations hold locks and each is waiting for the other to release its lock, we have a deadlock situation.
  2. Lock Timeout: What should be the maximum time a process should wait to acquire a lock? A long timeout could lead to process starvation, while a short timeout might cause frequent retry overheads.
  3. Failure to Release Locks: If a node fails after acquiring a lock but before releasing it, other nodes waiting for the lock will be stuck indefinitely.
  4. Split-Brain Scenario: In the event of a network partition, two nodes might assume that they have the lock, leading to a split-brain scenario and potentially causing data corruption.

Implementing distributed locking requires careful consideration of these complexities and the inclusion of mechanisms like lock timeouts, heartbeat checks, and quorum-based lock acquisition to avoid potential pitfalls.
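One widely discussed safeguard for the stale-holder problem (a node whose lock has expired or been revoked but that still believes it holds it) is a fencing token: the lock service hands out a monotonically increasing number with every acquisition, and the storage layer rejects writes carrying a number lower than the highest it has seen. A minimal sketch, with both services modeled as in-process classes for illustration:

```python
class LockService:
    """Hands out a fencing token with every successful acquisition."""
    def __init__(self):
        self.counter = 0
        self.holder = None

    def acquire(self, node):
        if self.holder is None:
            self.counter += 1
            self.holder = node
            return self.counter          # the fencing token
        return None                      # lock is busy

    def release(self, node):
        if self.holder == node:
            self.holder = None

class Storage:
    """Rejects writes whose token is older than the newest one seen."""
    def __init__(self):
        self.highest_token_seen = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token_seen:
            raise RuntimeError("stale lock holder rejected")
        self.highest_token_seen = token
        self.value = value

lock = LockService()
storage = Storage()

t1 = lock.acquire("node-1")      # token 1
lock.release("node-1")
t2 = lock.acquire("node-2")      # token 2
storage.write(t2, "new value")   # accepted
try:
    storage.write(t1, "late write")  # node-1's token is now stale
except RuntimeError as e:
    print(e)                         # "stale lock holder rejected"
```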

Two-Phase Commit (2PC) and Three-Phase Commit (3PC) Protocols

In distributed systems, the need for transactions that span across multiple nodes is commonplace. This brings up the problem of how to achieve consensus between these nodes to ensure atomicity and consistency. The two-phase commit (2PC) and three-phase commit (3PC) protocols are popular solutions to this problem.

Two-Phase Commit (2PC)

The 2PC protocol is a distributed algorithm that ensures all participating processes in a distributed system agree to either commit or abort a transaction. This is achieved in two phases, hence the name.

Phase 1 — Voting Phase

  1. The coordinator sends a “prepare” message to all the participants, asking if they are ready to commit the transaction.
  2. Each participant executes the transaction up to the point where it will be asked to commit. It then sends a “Yes” vote if it can commit or a “No” vote if it can’t (e.g., due to constraints or failures).

Phase 2 — Commit Phase

  1. If the coordinator receives a “Yes” vote from all participants, it sends a “commit” message; otherwise, it sends an “abort” message.
  2. Participants follow the coordinator’s decision to either commit or abort the transaction.

While 2PC ensures atomicity and consistency, it has a significant drawback: it’s a blocking protocol. If the coordinator fails after some nodes have voted “Yes” but before a decision is made, the nodes remain blocked, unsure whether to commit or abort.
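A single-process sketch of the flow just described, with the durability details (write-ahead logging of votes and decisions, crash recovery) deliberately left out:

```python
class Participant:
    def prepare(self):
        # A real participant would do everything needed to guarantee a later
        # commit (acquire locks, durably log the change) before voting.
        return True   # vote "Yes"; return False to vote "No"

    def commit(self):
        print("committed")

    def abort(self):
        print("aborted")

def two_phase_commit(participants):
    # Phase 1: voting.
    votes = [p.prepare() for p in participants]
    # Phase 2: the coordinator's decision is broadcast to everyone.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

print(two_phase_commit([Participant(), Participant()]))
```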

Three-Phase Commit (3PC)

The 3PC protocol is an enhancement of 2PC that aims to overcome the blocking problem. It introduces an additional phase so the system can recover from failure scenarios more gracefully.

Phase 1 — Can Commit?

Just like in 2PC, the coordinator asks the participants if they can commit, and the participants respond with a “Yes” or “No”.

Phase 2 — Pre-Commit

Assuming the coordinator has received a “Yes” from all participants:

  1. The coordinator sends a “pre-commit” message to all participants and moves to the “pre-committed” state.
  2. The participants, after receiving the “pre-commit” message, also enter the “pre-committed” state and acknowledge the coordinator.

Phase 3 — Do Commit

  1. After receiving an acknowledgment from all participants, the coordinator sends a “do-commit” message, moving to the “committed” state.
  2. Upon receiving the “do-commit” message, the participants also move to the “committed” state.

This additional phase ensures that as long as a majority of processes are alive, the protocol will make progress. Unlike 2PC, the participants in 3PC don’t block waiting for the coordinator’s decision; they can decide based on a timeout.
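The essential difference can be boiled down to the timeout rule a participant applies when the coordinator goes silent. This is a simplified sketch: the real termination protocol elects a new coordinator and exchanges states, and it assumes crashed (not partitioned) nodes.

```python
def on_coordinator_timeout(state):
    if state == "pre-committed":
        # The pre-commit message proved that every participant voted "Yes",
        # so it is safe to go ahead and commit.
        return "commit"
    if state == "uncertain":
        # Voted "Yes" but never saw a pre-commit: the coordinator cannot have
        # told anyone to commit yet, so aborting is safe.
        return "abort"
    return "abort"

print(on_coordinator_timeout("pre-committed"))  # commit
print(on_coordinator_timeout("uncertain"))      # abort
```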

Comparison

Blocking vs. Non-blocking: 2PC is a blocking protocol, while 3PC is a non-blocking protocol. In 2PC, participants can’t decide to commit or abort without the coordinator, whereas, in 3PC, they can make a decision based on a timeout.

Communication Overhead: 3PC has more communication overhead than 2PC due to the extra phase.

Complexity: 3PC is more complex than 2PC. This complexity arises from the need to handle more states and transitions, as well as timeouts.

Performance: If there’s a low risk of coordinator failure, 2PC is faster since it only requires two rounds of communication. However, in systems where the coordinator might fail, 3PC offers better availability because participants are not left blocked waiting for its decision.

Examples

Let’s consider an example of a banking system where we need to transfer funds from Account A to Account B, across different nodes.

2PC Example

  1. Coordinator sends a “prepare” message to nodes holding Account A and Account B.
  2. The node holding Account A prepares the debit and the node holding Account B prepares the credit (logging the changes durably without making them visible), then both respond with a “Yes”.
  3. The coordinator, having received “Yes” from both nodes, sends a “commit” message.
  4. Both nodes finalize the transaction.

If a “No” was received at any point, the coordinator would send an “abort” message, and the transaction would be rolled back.

3PC Example

  1. Coordinator sends a “Can commit?” message to both nodes.
  2. Both nodes respond with “Yes”.
  3. The coordinator sends a “Pre-commit” message, and both nodes acknowledge it.
  4. Finally, the coordinator sends a “Do commit” message, finalizing the transaction.

Again, if at any point a “No” vote is received or a timeout occurs, the transaction is aborted, and the changes are rolled back.

In conclusion, both 2PC and 3PC are protocols designed to achieve consensus in distributed systems. The choice between the two often depends on the specific requirements of the system, such as the likelihood of node failures and the system’s performance needs.

Understanding Database Data Structures

Different types of databases use various data structures to optimize the storage and retrieval of data. This article will delve into some of the most popular data structures used in databases, providing an overview of each one, along with some examples, pros, and cons.

Skiplist

A skiplist is a probabilistic data structure that is used for storing ordered sequences of elements. It allows for efficient search, deletion, and insertion operations. It consists of several layers of linked lists, where each layer skips over a certain number of elements, making search operations more efficient than a single linked list.

Example: Redis uses skiplists to implement its Sorted Sets.
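For example, the sorted-set commands below are the ones Redis backs with a skiplist; ordered range and rank queries stay efficient because members are kept sorted by score. This assumes a local Redis server and the redis-py client.

```python
import redis

r = redis.Redis()
r.zadd("leaderboard", {"alice": 120, "bob": 95, "carol": 150})

# Range queries are cheap because members are kept ordered by score.
print(r.zrange("leaderboard", 0, -1, withscores=True))
# Rank lookups also traverse the skiplist.
print(r.zrank("leaderboard", "alice"))
```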

Pros:

  • Efficient search, deletion, and insertion operations
  • Simplicity of implementation

Cons:

  • Uses more space than other data structures due to pointers
  • Performance is probabilistic, not deterministic

Hash Index

A Hash index uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

Example: In-memory key-value stores such as Memcached, as well as Python’s built-in dictionary type, are built on hash tables.
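A toy bucketed hash index makes both the strength and the weakness visible: point lookups hash straight to a bucket, but nothing about the layout helps a range scan.

```python
class HashIndex:
    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing entry
                return
        bucket.append((key, value))        # collisions share a bucket

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        return None   # O(1) on average for point lookups; no help for ranges

idx = HashIndex()
idx.put("user:1", {"name": "Ada"})
print(idx.get("user:1"))
```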

Pros:

  • Very efficient for point lookups
  • Offers constant-time average complexity for search operations

Cons:

  • Not efficient for range queries
  • Issues with hash collisions where two keys have the same hash

SSTable (Sorted String Table)

An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte streams. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range.

Example: SSTables are used in Google’s LevelDB and Apache Cassandra.
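A toy version of the idea: write key/value pairs to disk once, in sorted order, keep an in-memory index of byte offsets, and answer range queries by seeking near the start key and scanning forward. Real SSTables add sparse indexes, blocks, compression, and checksums.

```python
import bisect

def write_sstable(path, items):
    index = []                        # (key, byte offset); real SSTables keep a sparse index
    with open(path, "w") as f:
        for key, value in sorted(items.items()):
            index.append((key, f.tell()))
            f.write(f"{key}\t{value}\n")
    return index

def range_query(path, index, lo, hi):
    keys = [k for k, _ in index]
    start = index[max(bisect.bisect_left(keys, lo) - 1, 0)][1]
    results = []
    with open(path) as f:
        f.seek(start)                 # jump close to the start key, then scan
        for line in f:
            key, value = line.rstrip("\n").split("\t")
            if key > hi:
                break
            if key >= lo:
                results.append((key, value))
    return results

index = write_sstable("data.sst", {"banana": "2", "apple": "1", "cherry": "3"})
print(range_query("data.sst", index, "apple", "banana"))
```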

Pros:

  • Efficiently supports range queries
  • Write operations are fast

Cons:

  • Not efficient for point queries
  • High memory usage during compactions

LSM Tree (Log-Structured Merge-Tree)

An LSM Tree is a disk-based data structure designed to provide low latency for a high rate of random writes, and it’s the foundation of many key-value store databases. It utilizes a combination of both in-memory (Skiplist, for example) and on-disk (SSTable) data structures to accomplish this.

Example: Databases like LevelDB, RocksDB, and Cassandra use LSM Trees.
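A toy write/read path shows how the pieces fit together: writes land in an in-memory memtable, full memtables are frozen into immutable segments (standing in for on-disk SSTables), and reads consult the memtable first and then segments from newest to oldest. Compaction, write-ahead logging, and Bloom filters are omitted.

```python
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.segments = []            # dicts standing in for on-disk SSTables
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: freeze the memtable as a sorted, immutable segment.
            self.segments.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # newest segment wins
            if key in segment:
                return segment[key]
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
print(db.get("k3"), db.get("k9"))
```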

Pros:

  • High write throughput
  • Efficient for write-heavy workloads

Cons:

  • Read operations can be slower than in other structures
  • Compaction can impact performance

B-tree

B-trees are self-balancing tree data structures that maintain sorted data and allow for efficient insertion, deletion, and search operations. They are designed to work well with systems that read and write large blocks of data, like disks.

Example: B-trees are used extensively in filesystems and databases, including PostgreSQL, MySQL, and SQLite.
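Because SQLite stores tables and indexes as B-trees, Python’s standard-library sqlite3 module is a convenient way to see one in use: the secondary index below keeps email values sorted, which serves both equality and range predicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users(email)")  # a B-tree index
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [("a@example.com",), ("b@example.com",), ("c@example.com",)])

# EXPLAIN QUERY PLAN shows the index being used for the range predicate.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email BETWEEN ? AND ?",
    ("a@example.com", "b@example.com")).fetchall()
print(plan)
```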

Pros:

  • Good performance for a wide range of operations
  • Efficient for large amounts of data
  • Consistent read/write performance

Cons:

  • Can be complex to implement
  • Rebalancing can impact performance

Inverted Index

An Inverted Index is a data structure used to support full-text search. It maps each word (or term) to the list of documents in which it appears, so a keyword query can go straight from the word to every document that contains it.

Example: Apache Lucene, Elasticsearch, and the search feature in most databases use Inverted Indexes.
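A minimal in-memory version: each term maps to the set of document ids containing it, so a keyword query is a dictionary lookup and a multi-keyword query is a set intersection.

```python
from collections import defaultdict

docs = {
    1: "distributed locking avoids race conditions",
    2: "the saga pattern handles distributed transactions",
    3: "b-trees keep data sorted on disk",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)          # posting list for each term

def search(*terms):
    # Intersect the posting sets of every query term.
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(search("distributed"))             # {1, 2}
print(search("distributed", "locking"))  # {1}
```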

Pros:

  • Efficient for text searches and information retrieval
  • Can handle large amounts of data

Cons:

  • Updates can be expensive
  • Requires additional processing to create the index

Suffix Tree

A Suffix Tree is a compressed trie containing all the suffixes of a given text as its keys and their positions in the text as its values. It is a powerful data structure for text manipulation, allowing for efficient string operations.

Example: Suffix Trees are used in bioinformatics for sequence alignment.
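To illustrate the underlying idea without building a compressed suffix tree, the sketch below uses its simpler cousin, a sorted list of suffixes (a suffix array): once every suffix is sorted, any substring query becomes a binary search for a suffix that starts with the pattern.

```python
import bisect

text = "banana"
suffixes = sorted(text[i:] for i in range(len(text)))
# ['a', 'ana', 'anana', 'banana', 'na', 'nana']

def contains(pattern):
    # Binary-search for the first suffix >= pattern, then check the prefix.
    i = bisect.bisect_left(suffixes, pattern)
    return i < len(suffixes) and suffixes[i].startswith(pattern)

print(contains("nan"))   # True  ("nana" starts with "nan")
print(contains("nab"))   # False
```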

Pros:

  • Efficient for string pattern search
  • Supports various complex string operations

Cons:

  • Requires a lot of space
  • Complex to implement

R-tree

R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles, or polygons.

Example: R-trees are commonly used in geospatial systems like PostGIS.
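A small example using the third-party rtree Python package (a wrapper around libspatialindex), which answers the two queries R-trees are built for: what intersects a region, and what is nearest to a point. The bounding boxes are made up for illustration.

```python
from rtree import index

idx = index.Index()
# insert(id, (min_x, min_y, max_x, max_y))
idx.insert(1, (0.0, 0.0, 1.0, 1.0))      # e.g. a park
idx.insert(2, (2.0, 2.0, 3.0, 3.0))      # e.g. a lake
idx.insert(3, (0.5, 0.5, 2.5, 2.5))      # a region overlapping both

# Which stored boxes overlap this query rectangle?
print(list(idx.intersection((0.9, 0.9, 1.1, 1.1))))
# Which stored box is nearest to the point (4, 4)?
print(list(idx.nearest((4.0, 4.0, 4.0, 4.0), 1)))
```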

Pros:

  • Can efficiently store spatial data
  • Efficient for nearest-neighbor search operations

Cons:

  • Can be complex to implement
  • Rebalancing and splitting can impact performance

Each of these data structures has its place depending on the requirements of the system, the type of data, and the operations that will be performed most frequently. By understanding these data structures, we can make more informed decisions about how we build and manage our databases.
