In our previous study, we have done a lot of detailed analysis of basic operations for the red-black trees based on the traditional sequential algorithm. In this project, I am going to use parallel algorithm to implement red-black tree's search and insertion operations. I will implement these two operations by both lock-based and lock-free synchronization in parallel algorithm, and then compare the performance of these two methods.
The red-black tree is a balanced binary search tree with height O(log n), and efficient search, insertion, and deletion operations, which makes it a better choice than regular binary search in search-intensive applications. And it only requires few rotations to rebalance the tree and keep it red-black properties.
As we known, it takes O(log n) for red-black tree's search and insertion operations by sequential algorithms. As for using parallel algorithms, the search and insertion time would be O(log n + logk), where k is the number threads/concurrents. Although they both have the upper bound O(log n), and parallel algorithm even have another add-on term log k, parallel algorithms would provide a smaller coefficient for the big-O notation. And in my project, I will do experiments to compare the performance on lock-based and lock-free parallel algorithms, and I will claim that lock-free synchronization has a better performance than lock-based.
A parallel algorithm, as opposed to a traditional sequential algorithm, is an algorithm which can be executed a piece at a time on many different processing devices, and then combined together again at the end to get the correct result.
People usually build cost-effective parallel algorithms by defining multiple operations on each step to improve the performance of traditional sequential algorithm, which specifies a sequence of operations and each step does one operation.
A lock or mutex (from mutual exclusion) is a synchronization mechanism for enforcing limits on access to a resource in an environment where there are many threads of execution.
Lock-based parallel algorithms are usually implemented on the parallel random-access machine (PRAM), which is a shared-memory model of parallel computation that consists of a collection of identical processors and a shared memory. There are four types of PRAM.
Advantages | Disadvantages |
---|---|
|
|
Opposite to lock-based synchronization, lock-free synchronization is algorithm that implements concurrents' operations without using "lock" or "mutex".
Advantage | Disadvantage |
---|---|
|
|
In this project, I use CAS (Compare And Swap) synchronization primitive to implement my lock-free algorithm, which is supported by my most modern computer architectures.
Rather than acquiring a lock, a process makes a copy of the data it wishes to modify, changes the copy and then replaces the original data with the copy only if the original data is unchanged. If there is no contention, the update is made with no appreciable overhead. If there is contention, all but one concurrent process accessing the shared data must “roll-back” and redo its computation using the updated data. Most modern programming languages support CAS as an atomic operation. For example, in this project, I am going to use JAVA's AtomicBoolean class to realize the CAS strategy. We can also use other Atomic classes to implement CAS operation.
The following piece of pesudocode just depicts how CAS function works.
CAS(shVble,savedValue,newValue) Begin Atomic if (shVble==savedValue) { shVble=newValue; return True; } else { return False; } End Atomic do { savedValue=shVble; newValue=computeNewValue(savedValue); }while (!CAS(shVble,savedValue,newValue));
For this project, I just focus on search and insertion operations on red-black tree.
Just like any regular binary search tree, we use recursion on search operation.
Before working on the details of the insertion operation, we need to introduce an important helper operation, rotation, for red-black tree.
Left_Rotate(T,x) { y = right[x]; right[x] = left[y]; p[left[y]] = x; p[y] = p[x]; if (p[x] == p[root]) root[T] = y; else if (x == left[p[x]]) left[p[x]] = y; else right[p[x]] = y; left[y] = x; p[x] = y; } |
![]() |
Right_Rotate(T,y) { x = left[y]; left[y] = right[x]; p[right[x]] = y; p[x] = p[y]; if (p[y] == p[root]) root[T] = x; else if (y == left[p[y]]) left[p[y]] = x; else right[p[y]] = x; right[x] = y; p[y] = x; } |
When we insert a new node to the red-black tree, the first step is to physically insert the node to the given red-black tree, just as we do for regular binary search tree, and color the node to red. Then we need to fix upward from the inserted node to the root to maintain the red-black tree's properties. In general, we have three cases for insertion. Suppose we insert node N to the red-black tree, the following table explains each of the three cases.
Case 1: The uncle of inserted node is red. | Case 2: The inserted node is right child of its parent and its uncle is black. | Case 3: The inserted node is left child of its parent and its uncle is black. |
---|---|---|
|
|
|
For case 1, we just need to recolor the inserted node, its parent, uncle and its grandparent. Then we continue to fix upward and reuse the grandparent node as the "inserted" node N. | We can to a left rotation on the inserted node's parent to turn it to case 3. And switch label for the parent and inserted node. | In case 3, we first do a right rotation on the inserted node's parent, then recolor parent and grandparent. Then we move up and fix upward. In this case, we do not reuse the grandparent node. |
I programmed in JAVA for this project. I used JAVA's "Thread to simulate concurrents' operations. And I used "ReentrantLock" lock class to implement EREW lock-based synchronization. For lock-free synchronization, I implemented the CAS process by using "AtomicBoolean" class, which provides the convenience for doing the CAS atomic operation. My lock-free implementation is based on the idea of Ma's Algorithm and pseudocode provided in Kim's paper. Finally, I modified the code of treeGUI from GitHub to realize a small JFrame visualization for displaying the red-black tree built by my algorithm. As the code limits, the visualization only works and layouts well for small data set, like trees with less than 20 nodes.
Ma's algorithm is built upon the basic sequential insertion algorithm with the addition of a "local area" concept and the use of lock-free primitives to control concurrency. And Ma also adds an extra flag field to each node to indicate whether a process gains control of the node, which is used for CAS operation check.
The local area consists of the set of nodes that a process must have full control of
to ensure successful completion of an operation. As we see, the local area for insertion is just made of nodes involved in the rotation fix-up operation. With the local area concept, if several concurrents in a localized region have overlapping areas, only the one that obtains all the flags in its local area will be able to proceed. Other concurrents will have to re-attempt to gain control of their local areas. As one insertion concurrent completes its processing in one part of the tree, it may either finish entirely (and release all its flags) or move up the tree. In the later case, the concurrent first obtains the flags of nodes in its new local area, moves up, and then releases the flags that it set for the nodes in its old local area. Releasing flags allows other processes that may have been waiting to advance. The figure on the right shows the "local area" defined by Ma. The following slides demonstrate how local area works for insertion case 1.
With the idea of Ma's Algorithm, I add a flag field to my "LockFreeRBNode" object, which indicates whether the current node is occupied by any concurrent. And then I added CAS and SetupLocalAreaForInsert functions to the regular sequential insertion operation as Kim's pseudocode did. The pseudocode on the right shows the shows the necessary changes for insertion function, and the red text indicates the differences from sequential algorithm.
Lock-Based Algorithm Output | Lock-Free Algorithm Output |
![]() |
![]() |
Threads # | Time Cost (ms) | Average | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 21 | 18 | 21 | 21 | 22 | 22 | 21 | 23 | 19 | 21 | 20.9 |
2 | 26 | 25 | 26 | 26 | 27 | 26 | 26 | 28 | 27 | 29 | 26.6 |
4 | 28 | 28 | 34 | 26 | 28 | 30 | 30 | 33 | 33 | 26 | 29.6 |
5 | 32 | 30 | 28 | 27 | 24 | 25 | 30 | 30 | 27 | 28 | 28.1 |
8 | 28 | 29 | 29 | 26 | 27 | 25 | 25 | 28 | 28 | 27 | 27.2 |
10 | 25 | 29 | 27 | 26 | 28 | 30 | 30 | 26 | 29 | 28 | 27.8 |
20 | 28 | 30 | 34 | 31 | 36 | 29 | 27 | 31 | 30 | 30 | 30.6 |
Threads # | Time Cost (ms) | Average | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 15 | 15 | 16 | 14 | 16 | 19 | 17 | 12 | 18 | 12 | 15.4 |
2 | 13 | 9 | 12 | 12 | 14 | 14 | 12 | 17 | 12 | 19 | 13.4 |
4 | 15 | 18 | 30 | 14 | 10 | 12 | 16 | 19 | 11 | 26 | 17.1 |
5 | 22 | 31 | 16 | 20 | 17 | 15 | 11 | 26 | 28 | 10 | 19.6 |
8 | 13 | 17 | 13 | 15 | 21 | 32 | 15 | 31 | 14 | 39 | 21 |
10 | 20 | 30 | 13 | 13 | 55 | 27 | 10 | 21 | 27 | 17 | 23.3 |
20 | 15 | 14 | 48 | 16 | 29 | 13 | 12 | 19 | 15 | 20 | 20.1 |
Threads # | Time Cost (ms) | Average | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 8 | 10 | 10 | 9 | 10 | 8 | 10 | 10 | 11 | 11 | 9.7 |
2 | 18 | 25 | 22 | 20 | 21 | 18 | 18 | 24 | 23 | 25 | 21.4 |
4 | 43 | 46 | 34 | 53 | 43 | 46 | 69 | 52 | 78 | 47 | 51.1 |
5 | 43 | 54 | 47 | 42 | 40 | 54 | 65 | 46 | 39 | 49 | 47.9 |
8 | 65 | 53 | 54 | 87 | 60 | 59 | 60 | 56 | 54 | 72 | 62 |
10 | 59 | 74 | 79 | 64 | 93 | 68 | 62 | 59 | 69 | 63 | 69 |
20 | 93 | 83 | 128 | 94 | 94 | 87 | 86 | 102 | 99 | 112 | 97.8 |
Threads # | Time Cost (ms) | Average | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 6 | 7 | 6 | 6 | 7 | 4 | 8 | 5 | 6 | 8 | 6.3 |
2 | 7 | 7 | 8 | 7 | 9 | 8 | 7 | 6 | 8 | 9 | 7.6 |
4 | 16 | 39 | 11 | 31 | 23 | 16 | 18 | 26 | 16 | 34 | 23 |
5 | 14 | 44 | 16 | 21 | 43 | 47 | 39 | 13 | 20 | 20 | 27.7 |
8 | 23 | 20 | 34 | 75 | 20 | 31 | 18 | 30 | 22 | 29 | 30.2 |
10 | 47 | 98 | 31 | 28 | 95 | 20 | 23 | 43 | 31 | 22 | 43.8 |
20 | 45 | 37 | 63 | 40 | 48 | 38 | 38 | 47 | 43 | 50 | 44.9 |
Nodes # | Time Cost (ms) | Average | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
10 | 2 | 2 | 2 | 3 | 2 | 2 | 3 | 2 | 3 | 4 | 2.5 |
20 | 3 | 3 | 4 | 3 | 3 | 4 | 2 | 3 | 4 | 2 | 3.1 |
50 | 5 | 4 | 5 | 5 | 5 | 6 | 5 | 5 | 6 | 6 | 5.2 |
100 | 10 | 11 | 10 | 10 | 9 | 9 | 11 | 9 | 10 | 11 | 10 |
200 | 23 | 20 | 22 | 21 | 22 | 21 | 20 | 24 | 27 | 20 | 22 |
500 | 68 | 67 | 69 | 81 | 66 | 71 | 68 | 69 | 80 | 71 | 71 |
1000 | 168 | 167 | 177 | 161 | 164 | 162 | 198 | 185 | 174 | 178 | 173.4 |
Nodes # | Time Cost (ms) | Average | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
10 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 1 | 1 | 2 | 1.7 |
20 | 2 | 3 | 4 | 2 | 2 | 3 | 2 | 2 | 2 | 2 | 2.4 |
50 | 4 | 3 | 5 | 3 | 4 | 4 | 4 | 4 | 6 | 3 | 4 |
100 | 6 | 7 | 6 | 6 | 5 | 8 | 5 | 10 | 5 | 6 | 6.4 |
200 | 9 | 12 | 13 | 13 | 11 | 13 | 7 | 18 | 8 | 16 | 12 |
500 | 17 | 19 | 21 | 23 | 17 | 17 | 25 | 23 | 18 | 37 | 21.7 |
1000 | 23 | 28 | 34 | 38 | 33 | 29 | 40 | 36 | 33 | 32 | 32.6 |
I roughly calculate the constants for Big-O Notations of these two algorithms by O(logn + logk), where k is the number of threads. I used Excel to find the logarithmic trendlines for these these two datasets. As I found, at inserting 4000 nodes, it causes outliers of the trendlines, so I just removed that data point from my dataset, and plotted the following figures. Based on my calculation, the constant for lock-based synchronization is around 30.72, and the constant for lock-free synchronization is about 6.43. The lock-free synchronization has a much smaller constant which means lock-free synchronization performs better than lock-based synchronization.
Lock-Based Synchronization | Lock-Free Synchronization |
---|---|
![]() |
![]() |
You can checkout the code from my GitHub repository. Then you can just import the project to Eclipse to test it. The figure on the right shows the content of my program. The main function is in "TestTrees.java" file. You can change "num_threads" and "insert_nodes_per_thread" parameters to test the performance for different number of threads and different number of nodes. You can also change "visulize_locked_tree" and "visulize_lock_free_tree" parameters to enable or disable the visualization for these two trees. Instead, you can just check the video below. It demonstrates how to run some sample experiments with my program.
Based on my implementation and experiment results, we can reach the conclusion that lock-free algorithms have a better performance on search and insertion operations on relatively large red-black tree. From the line charts, we can conclude that when the tree gets larger, the performance difference would be more obvious. For future work, I think we can work on finding the hidden constant in the Big-O notation for these two algorithms, so that we can compare the performance of these two algorithms theoretically. The other thing we can work on is to implement lock-free algorithm for deletion operation. Since red-black trees store information not only on leaf nodes but also on internal nodes, deletion operation for red-black trees would result in a more complex structural change, which makes the implementation much harder. Another thing we can do in the future is to find a way to visualize the process for lock-free insertion, which could give users a more straightforward demonstration on how these concurrents work. In addition, in Kim's paper, he proposed an improved version on Ma's algorithm, which needs fewer CAS operations by defining a larger "local area". We might try out his method, and compare its performance with the original Ma's algorithm. We can also study the better concurrent control method "wait-free" to improve the performance for lock-free synchronization.