Parallel Algorithm For Red-Black Trees

Zhaokun Xue

Zhaokun.Xue@tufts.edu

Introduction

In our previous study, we have done a lot of detailed analysis of basic operations for the red-black trees based on the traditional sequential algorithm. In this project, I am going to use parallel algorithm to implement red-black tree's search and insertion operations. I will implement these two operations by both lock-based and lock-free synchronization in parallel algorithm, and then compare the performance of these two methods.

The red-black tree is a balanced binary search tree with height O(log n), and efficient search, insertion, and deletion operations, which makes it a better choice than regular binary search in search-intensive applications. And it only requires few rotations to rebalance the tree and keep it red-black properties.

As we known, it takes O(log n) for red-black tree's search and insertion operations by sequential algorithms. As for using parallel algorithms, the search and insertion time would be O(log n + logk), where k is the number threads/concurrents. Although they both have the upper bound O(log n), and parallel algorithm even have another add-on term log k, parallel algorithms would provide a smaller coefficient for the big-O notation. And in my project, I will do experiments to compare the performance on lock-based and lock-free parallel algorithms, and I will claim that lock-free synchronization has a better performance than lock-based.

Parallel Algorithm

A parallel algorithm, as opposed to a traditional sequential algorithm, is an algorithm which can be executed a piece at a time on many different processing devices, and then combined together again at the end to get the correct result.

People usually build cost-effective parallel algorithms by defining multiple operations on each step to improve the performance of traditional sequential algorithm, which specifies a sequence of operations and each step does one operation.

Lock-Based Algorithm

A lock or mutex (from mutual exclusion) is a synchronization mechanism for enforcing limits on access to a resource in an environment where there are many threads of execution.

Lock-based parallel algorithms are usually implemented on the parallel random-access machine (PRAM), which is a shared-memory model of parallel computation that consists of a collection of identical processors and a shared memory. There are four types of PRAM.

Advantages Disadvantages
  • Simple to use
  • Ubiquitous
  • Easily to amenable to performance analysis
  • Potential for deadlock
  • Unavoidable overhead at each synchronization point, which usually can be a system call
  • Possible scheduling anomalies (such as priority inversion), where high-priority processes are delayed waiting for a lock held by a low-priority process
  • Convoying, where a delayed process holding a lock effectively blocks all other processes waiting for the lock

Lock-Free Algorithm

Opposite to lock-based synchronization, lock-free synchronization is algorithm that implements concurrents' operations without using "lock" or "mutex".

Advantage Disadvantage
  • Lock-free algorithms avoid problems in lock-based algorithms.
  • Lock-free algorithms are usually hard to implement, especially for complicate data structure.
  • It is usually harder to analyze the time complexity for Lock-free algorithms than Lock-based algorithms.

CAS (Compare and Swap)

In this project, I use CAS (Compare And Swap) synchronization primitive to implement my lock-free algorithm, which is supported by my most modern computer architectures.
Rather than acquiring a lock, a process makes a copy of the data it wishes to modify, changes the copy and then replaces the original data with the copy only if the original data is unchanged. If there is no contention, the update is made with no appreciable overhead. If there is contention, all but one concurrent process accessing the shared data must “roll-back” and redo its computation using the updated data. Most modern programming languages support CAS as an atomic operation. For example, in this project, I am going to use JAVA's AtomicBoolean class to realize the CAS strategy. We can also use other Atomic classes to implement CAS operation. The following piece of pesudocode just depicts how CAS function works.

CAS(shVble,savedValue,newValue)
Begin Atomic
if (shVble==savedValue) {
    shVble=newValue;
    return True;
} else {
    return False; 
}
End Atomic

do  {
    savedValue=shVble;
    newValue=computeNewValue(savedValue);
}while (!CAS(shVble,savedValue,newValue));
			

Red-Black Trees

red-black tree

Operations

For this project, I just focus on search and insertion operations on red-black tree.

Search

Just like any regular binary search tree, we use recursion on search operation.

Insertion

Before working on the details of the insertion operation, we need to introduce an important helper operation, rotation, for red-black tree.

Left_Rotate(T,x) {
    y = right[x];
    right[x] = left[y];
    p[left[y]] = x;
    p[y] = p[x];
    if (p[x] == p[root]) 
        root[T] = y;
    else if (x == left[p[x]])
        left[p[x]] = y;
    else 
        right[p[x]] = y;
    left[y] = x;
    p[x] = y;
}
							
rotation
Right_Rotate(T,y) {
    x = left[y];
    left[y] = right[x];
    p[right[x]] = y;
    p[x] = p[y];
    if (p[y] == p[root]) 
        root[T] = x;
    else if (y == left[p[y]])
        left[p[y]] = x;
    else 
        right[p[y]] = x;
    right[x] = y;
    p[y] = x;
}
							

When we insert a new node to the red-black tree, the first step is to physically insert the node to the given red-black tree, just as we do for regular binary search tree, and color the node to red. Then we need to fix upward from the inserted node to the root to maintain the red-black tree's properties. In general, we have three cases for insertion. Suppose we insert node N to the red-black tree, the following table explains each of the three cases.

Case 1: The uncle of inserted node is red. Case 2: The inserted node is right child of its parent and its uncle is black. Case 3: The inserted node is left child of its parent and its uncle is black.
case 1 case 2 case 3
For case 1, we just need to recolor the inserted node, its parent, uncle and its grandparent. Then we continue to fix upward and reuse the grandparent node as the "inserted" node N. We can to a left rotation on the inserted node's parent to turn it to case 3. And switch label for the parent and inserted node. In case 3, we first do a right rotation on the inserted node's parent, then recolor parent and grandparent. Then we move up and fix upward. In this case, we do not reuse the grandparent node.

Implementation

I programmed in JAVA for this project. I used JAVA's "Thread to simulate concurrents' operations. And I used "ReentrantLock" lock class to implement EREW lock-based synchronization. For lock-free synchronization, I implemented the CAS process by using "AtomicBoolean" class, which provides the convenience for doing the CAS atomic operation. My lock-free implementation is based on the idea of Ma's Algorithm and pseudocode provided in Kim's paper. Finally, I modified the code of treeGUI from GitHub to realize a small JFrame visualization for displaying the red-black tree built by my algorithm. As the code limits, the visualization only works and layouts well for small data set, like trees with less than 20 nodes.

Ma's Algorithm

Ma's algorithm is built upon the basic sequential insertion algorithm with the addition of a "local area" concept and the use of lock-free primitives to control concurrency. And Ma also adds an extra flag field to each node to indicate whether a process gains control of the node, which is used for CAS operation check.

Local Area

local area The local area consists of the set of nodes that a process must have full control of to ensure successful completion of an operation. As we see, the local area for insertion is just made of nodes involved in the rotation fix-up operation. With the local area concept, if several concurrents in a localized region have overlapping areas, only the one that obtains all the flags in its local area will be able to proceed. Other concurrents will have to re-attempt to gain control of their local areas. As one insertion concurrent completes its processing in one part of the tree, it may either finish entirely (and release all its flags) or move up the tree. In the later case, the concurrent first obtains the flags of nodes in its new local area, moves up, and then releases the flags that it set for the nodes in its old local area. Releasing flags allows other processes that may have been waiting to advance. The figure on the right shows the "local area" defined by Ma. The following slides demonstrate how local area works for insertion case 1.


insertion With the idea of Ma's Algorithm, I add a flag field to my "LockFreeRBNode" object, which indicates whether the current node is occupied by any concurrent. And then I added CAS and SetupLocalAreaForInsert functions to the regular sequential insertion operation as Kim's pseudocode did. The pseudocode on the right shows the shows the necessary changes for insertion function, and the red text indicates the differences from sequential algorithm. insertion

Experiment Results

Sample Output for visualization

I modified the TreeGUI.java code from GitHub to visualize red-black trees I got from my experiments. Due to the visualization code limits, it can only layout and display well for small trees. In my project, I applied the code to display trees with less than 20 nodes. I generated 20 random nodes from 0 to 1000. Then I ran lock-based insertion and lock-free insertion on the same dataset. As we known, different insertion order could result in different red-black trees. Here are one sample output from my experiment. Each figure demonstrates the red-black tree built by their own algorithm and concurrents flow. You can click to enlarge the figures.
Lock-Based Algorithm Output Lock-Free Algorithm Output
lock based tree demo lock free tree demo

Then I did experiments for comparing the performance of these two kinds of synchronizations. For every experiment, I ran my program 10 times. I collected results from eclipse console output, and imported the data to excel. Finally I calculated the averages and plotted the line charts in excel.

Insert 1000 nodes to an empty red-black tree

In this experiment I totally inserted 1000 random nodes to an empty red-black tree by using different number of threads. For example, when the number of threads is 10, it means each thread inserts 100 nodes to the tree.

Results for Lock-Based Synchronization

Threads # Time Cost (ms) Average
1 2118212122222123 192120.9
2 2625262627262628272926.6
4 2828342628303033332629.6
5 3230282724253030272828.1
8 2829292627252528282727.2
10 2529272628303026292827.8
20 2830343136292731303030.6

Results for Lock-Free Synchronization

Threads # Time Cost (ms) Average
1 1515161416191712181215.4
2 139121214141217121913.4
4 1518301410121619112617.1
5 2231162017151126281019.6
8 1317131521321531143921
10 2030131355271021271723.3
20 1514481629131219152020.1

Line Chart for Comparing the Average Time Costs

insertion node

Search 10000 times from the tree

I used different number of threads to do 10000 times search on the 1000-node red-black tree I built from the above experiment.

Results for Lock-Based Synchronization

Threads # Time Cost (ms) Average
1 810109108101011119.7
2 1825222021181824232521.4
4 4346345343466952784751.1
5 4354474240546546394947.9
8 6553548760596056547262
10 5974796493686259696369
20 9383128949487861029911297.8

Results for Lock-Free Synchronization

Threads # Time Cost (ms) Average
1 67667485686.3
2 77879876897.6
4 1639113123161826163423
5 1444162143473913202027.7
8 2320347520311830222930.2
10 4798312895202343312243.8
20 4537634048383847435044.9

Line Chart for Comparing the Average Time Costs

search node

Use 4 threads to insert different number of nodes

Then I fixed the number of threads and changed the number of nodes that each thread needs to insert. Combining results from the above two experiments, I noticed that when using 4 threads we have the largest differences for insertion and search. Therefore, I fixed the number of threads to 4 for this experiment. And I will use the set of data to roughly calculate the constants for Big-O nations of these two algorithms.

Results for Lock-Based Synchronization

Nodes # Time Cost (ms) Average
10 22232232342.5
20 33433423423.1
50 54555655665.2
100 1011101099119101110
200 2320222122212024272022
500 6867698166716869807171
1000 168167177161164162198185174178173.4

Results for Lock-Free Synchronization

Nodes # Time Cost (ms) Average
10 22221221121.7
20 23422322222.4
50 43534444634
100 676658510566.4
200 9121313111371881612
500 1719212317172523183721.7
1000 2328343833294036333232.6

Line Chart for Comparing the Average Time Costs

4 threads

Roughly Estimate the Constants for Big-O Notation

I roughly calculate the constants for Big-O Notations of these two algorithms by O(logn + logk), where k is the number of threads. I used Excel to find the logarithmic trendlines for these these two datasets. As I found, at inserting 4000 nodes, it causes outliers of the trendlines, so I just removed that data point from my dataset, and plotted the following figures. Based on my calculation, the constant for lock-based synchronization is around 30.72, and the constant for lock-free synchronization is about 6.43. The lock-free synchronization has a much smaller constant which means lock-free synchronization performs better than lock-based synchronization.

Lock-Based Synchronization Lock-Free Synchronization
lock-based trend lock-free trend

How to run my code

code layout You can checkout the code from my GitHub repository. Then you can just import the project to Eclipse to test it. The figure on the right shows the content of my program. The main function is in "TestTrees.java" file. You can change "num_threads" and "insert_nodes_per_thread" parameters to test the performance for different number of threads and different number of nodes. You can also change "visulize_locked_tree" and "visulize_lock_free_tree" parameters to enable or disable the visualization for these two trees. Instead, you can just check the video below. It demonstrates how to run some sample experiments with my program.

Conclusion and Future Work

Based on my implementation and experiment results, we can reach the conclusion that lock-free algorithms have a better performance on search and insertion operations on relatively large red-black tree. From the line charts, we can conclude that when the tree gets larger, the performance difference would be more obvious. For future work, I think we can work on finding the hidden constant in the Big-O notation for these two algorithms, so that we can compare the performance of these two algorithms theoretically. The other thing we can work on is to implement lock-free algorithm for deletion operation. Since red-black trees store information not only on leaf nodes but also on internal nodes, deletion operation for red-black trees would result in a more complex structural change, which makes the implementation much harder. Another thing we can do in the future is to find a way to visualize the process for lock-free insertion, which could give users a more straightforward demonstration on how these concurrents work. In addition, in Kim's paper, he proposed an improved version on Ma's algorithm, which needs fewer CAS operations by defining a larger "local area". We might try out his method, and compare its performance with the original Ma's algorithm. We can also study the better concurrent control method "wait-free" to improve the performance for lock-free synchronization.

References

  1. Park, Heejin, and Kunsoo Park. "Parallel algorithms for red–black trees." Theoretical Computer Science 262.1-2 (2001): 415-435.
  2. Kim, Jong Ho, Helen Cameron, and Peter Graham. "Lock-free red-black trees using cas." Concurrency and Computation: Practice and Experience(2006): 1-40.
  3. Jianwen Ma. Lock-Free Insertions on Red-Black Trees. MSc thesis, University of Manitoba, October,2003.
  4. Natarajan, Aravind, Lee H. Savoie, and Neeraj Mittal. "Concurrent wait-free red black trees." Symposium on Self-Stabilizing Systems. Springer International Publishing, 2013.
  5. GitHub TreeGui: https://github.com/EslaMx7/AI-Tasks-JADE-Tests/blob/master/src/trees/tasks/treeGUI.java
  6. Red-Black Tree figure: https://upload.wikimedia.org/wikipedia/commons/6/66/Red-black_tree_example.svg
  7. Rotation GIF figure: https://upload.wikimedia.org/wikipedia/commons/3/31/Tree_rotation_animation_250x250.gif