Parallel Algorithm For Red-Black Trees

Zhaokun Xue

Zhaokun.Xue@tufts.edu

Introduction

In our previous study, we have done a lot of detailed analysis of basic operations for the red-black trees based on the traditional sequential algorithm. In this project, I am going to use parallel algorithm to implement red-black tree's search and insertion operations. I will implement these two operations by both lock-based and lock-free synchronization in parallel algorithm, and then compare the performance of these two methods.

The red-black tree is a balanced binary search tree with height O(log n), and efficient search, insertion, and deletion operations, which makes it a better choice than regular binary search in search-intensive applications. And it only requires few rotations to rebalance the tree and keep it red-black properties.

As we known, it takes O(log n) for red-black tree's search and insertion operations by sequential algorithms. As for using parallel algorithms, the search and insertion time would be O(log n + logk), where k is the number threads/concurrents. Although they both have the upper bound O(log n), and parallel algorithm even have another add-on term log k, parallel algorithms would provide a smaller coefficient for the big-O notation. And in my project, I will do experiments to compare the performance on lock-based and lock-free parallel algorithms, and I will claim that lock-free synchronization has a better performance than lock-based.

Parallel Algorithm

A parallel algorithm, as opposed to a traditional sequential algorithm, is an algorithm which can be executed a piece at a time on many different processing devices, and then combined together again at the end to get the correct result.

People usually build cost-effective parallel algorithms by defining multiple operations on each step to improve the performance of traditional sequential algorithm, which specifies a sequence of operations and each step does one operation.

Lock-Based Algorithm

A lock or mutex (from mutual exclusion) is a synchronization mechanism for enforcing limits on access to a resource in an environment where there are many threads of execution.

Lock-based parallel algorithms are usually implemented on the parallel random-access machine (PRAM), which is a shared-memory model of parallel computation that consists of a collection of identical processors and a shared memory. There are four types of PRAM.

EREW (exclusive read exclusive write)
CREW (concurrent read exclusive write)
CRCW (concurrent read concurrent write)
ERCW (Exclusive read concurrent write)

Advantages	Disadvantages
Simple to use Ubiquitous Easily to amenable to performance analysis	Potential for deadlock Unavoidable overhead at each synchronization point, which usually can be a system call Possible scheduling anomalies (such as priority inversion), where high-priority processes are delayed waiting for a lock held by a low-priority process Convoying, where a delayed process holding a lock effectively blocks all other processes waiting for the lock

Lock-Free Algorithm

Opposite to lock-based synchronization, lock-free synchronization is algorithm that implements concurrents' operations without using "lock" or "mutex".

Advantage	Disadvantage
Lock-free algorithms avoid problems in lock-based algorithms.	Lock-free algorithms are usually hard to implement, especially for complicate data structure. It is usually harder to analyze the time complexity for Lock-free algorithms than Lock-based algorithms.

CAS (Compare and Swap)

In this project, I use CAS (Compare And Swap) synchronization primitive to implement my lock-free algorithm, which is supported by my most modern computer architectures.
Rather than acquiring a lock, a process makes a copy of the data it wishes to modify, changes the copy and then replaces the original data with the copy only if the original data is unchanged. If there is no contention, the update is made with no appreciable overhead. If there is contention, all but one concurrent process accessing the shared data must “roll-back” and redo its computation using the updated data. Most modern programming languages support CAS as an atomic operation. For example, in this project, I am going to use JAVA's AtomicBoolean class to realize the CAS strategy. We can also use other Atomic classes to implement CAS operation. The following piece of pesudocode just depicts how CAS function works.

CAS(shVble,savedValue,newValue)
Begin Atomic
if (shVble==savedValue) {
    shVble=newValue;
    return True;
} else {
    return False; 
}
End Atomic

do  {
    savedValue=shVble;
    newValue=computeNewValue(savedValue);
}while (!CAS(shVble,savedValue,newValue));

Red-Black Trees

red-black tree

Nodes are colored red or black
Root is always black
Add black "dummy" leaves so every "real" node has 2 children
Every red node has a black parent
For any node x: all paths down to leaves contain equal number of black nodes

Operations

For this project, I just focus on search and insertion operations on red-black tree.

Search

Just like any regular binary search tree, we use recursion on search operation.

Insertion

Before working on the details of the insertion operation, we need to introduce an important helper operation, rotation, for red-black tree.

Left_Rotate(T,x) {
    y = right[x];
    right[x] = left[y];
    p[left[y]] = x;
    p[y] = p[x];
    if (p[x] == p[root]) 
        root[T] = y;
    else if (x == left[p[x]])
        left[p[x]] = y;
    else 
        right[p[x]] = y;
    left[y] = x;
    p[x] = y;
}

Right_Rotate(T,y) {
    x = left[y];
    left[y] = right[x];
    p[right[x]] = y;
    p[x] = p[y];
    if (p[y] == p[root]) 
        root[T] = x;
    else if (y == left[p[y]])
        left[p[y]] = x;
    else 
        right[p[y]] = x;
    right[x] = y;
    p[y] = x;
}

When we insert a new node to the red-black tree, the first step is to physically insert the node to the given red-black tree, just as we do for regular binary search tree, and color the node to red. Then we need to fix upward from the inserted node to the root to maintain the red-black tree's properties. In general, we have three cases for insertion. Suppose we insert node N to the red-black tree, the following table explains each of the three cases.

Case 1: The uncle of inserted node is red.	Case 2: The inserted node is right child of its parent and its uncle is black.	Case 3: The inserted node is left child of its parent and its uncle is black.

For case 1, we just need to recolor the inserted node, its parent, uncle and its grandparent. Then we continue to fix upward and reuse the grandparent node as the "inserted" node N.	We can to a left rotation on the inserted node's parent to turn it to case 3. And switch label for the parent and inserted node.	In case 3, we first do a right rotation on the inserted node's parent, then recolor parent and grandparent. Then we move up and fix upward. In this case, we do not reuse the grandparent node.

Implementation

I programmed in JAVA for this project. I used JAVA's "Thread to simulate concurrents' operations. And I used "ReentrantLock" lock class to implement EREW lock-based synchronization. For lock-free synchronization, I implemented the CAS process by using "AtomicBoolean" class, which provides the convenience for doing the CAS atomic operation. My lock-free implementation is based on the idea of Ma's Algorithm and pseudocode provided in Kim's paper. Finally, I modified the code of treeGUI from GitHub to realize a small JFrame visualization for displaying the red-black tree built by my algorithm. As the code limits, the visualization only works and layouts well for small data set, like trees with less than 20 nodes.

Ma's Algorithm

Ma's algorithm is built upon the basic sequential insertion algorithm with the addition of a "local area" concept and the use of lock-free primitives to control concurrency. And Ma also adds an extra flag field to each node to indicate whether a process gains control of the node, which is used for CAS operation check.

Local Area

The local area consists of the set of nodes that a process must have full control of to ensure successful completion of an operation. As we see, the local area for insertion is just made of nodes involved in the rotation fix-up operation. With the local area concept, if several concurrents in a localized region have overlapping areas, only the one that obtains all the flags in its local area will be able to proceed. Other concurrents will have to re-attempt to gain control of their local areas. As one insertion concurrent completes its processing in one part of the tree, it may either finish entirely (and release all its flags) or move up the tree. In the later case, the concurrent first obtains the flags of nodes in its new local area, moves up, and then releases the flags that it set for the nodes in its old local area. Releasing flags allows other processes that may have been waiting to advance. The figure on the right shows the "local area" defined by Ma. The following slides demonstrate how local area works for insertion case 1.

With the idea of Ma's Algorithm, I add a flag field to my "LockFreeRBNode" object, which indicates whether the current node is occupied by any concurrent. And then I added CAS and SetupLocalAreaForInsert functions to the regular sequential insertion operation as Kim's pseudocode did. The pseudocode on the right shows the shows the necessary changes for insertion function, and the red text indicates the differences from sequential algorithm.

Experiment Results

Sample Output for visualization

I modified the TreeGUI.java code from GitHub to visualize red-black trees I got from my experiments. Due to the visualization code limits, it can only layout and display well for small trees. In my project, I applied the code to display trees with less than 20 nodes. I generated 20 random nodes from 0 to 1000. Then I ran lock-based insertion and lock-free insertion on the same dataset. As we known, different insertion order could result in different red-black trees. Here are one sample output from my experiment. Each figure demonstrates the red-black tree built by their own algorithm and concurrents flow. You can click to enlarge the figures.

Lock-Based Algorithm Output	Lock-Free Algorithm Output
Close	Close

Then I did experiments for comparing the performance of these two kinds of synchronizations. For every experiment, I ran my program 10 times. I collected results from eclipse console output, and imported the data to excel. Finally I calculated the averages and plotted the line charts in excel.

Insert 1000 nodes to an empty red-black tree

In this experiment I totally inserted 1000 random nodes to an empty red-black tree by using different number of threads. For example, when the number of threads is 10, it means each thread inserts 100 nodes to the tree.

Results for Lock-Based Synchronization

Threads #	Time Cost (ms)										Average
1	21	18	21	21	22	22	21	23	19	21	20.9
2	26	25	26	26	27	26	26	28	27	29	26.6
4	28	28	34	26	28	30	30	33	33	26	29.6
5	32	30	28	27	24	25	30	30	27	28	28.1
8	28	29	29	26	27	25	25	28	28	27	27.2
10	25	29	27	26	28	30	30	26	29	28	27.8
20	28	30	34	31	36	29	27	31	30	30	30.6

Results for Lock-Free Synchronization

Threads #	Time Cost (ms)										Average
1	15	15	16	14	16	19	17	12	18	12	15.4
2	13	9	12	12	14	14	12	17	12	19	13.4
4	15	18	30	14	10	12	16	19	11	26	17.1
5	22	31	16	20	17	15	11	26	28	10	19.6
8	13	17	13	15	21	32	15	31	14	39	21
10	20	30	13	13	55	27	10	21	27	17	23.3
20	15	14	48	16	29	13	12	19	15	20	20.1

Line Chart for Comparing the Average Time Costs

Search 10000 times from the tree

I used different number of threads to do 10000 times search on the 1000-node red-black tree I built from the above experiment.

Results for Lock-Based Synchronization

Threads #	Time Cost (ms)										Average
1	8	10	10	9	10	8	10	10	11	11	9.7
2	18	25	22	20	21	18	18	24	23	25	21.4
4	43	46	34	53	43	46	69	52	78	47	51.1
5	43	54	47	42	40	54	65	46	39	49	47.9
8	65	53	54	87	60	59	60	56	54	72	62
10	59	74	79	64	93	68	62	59	69	63	69
20	93	83	128	94	94	87	86	102	99	112	97.8

Results for Lock-Free Synchronization

Threads #	Time Cost (ms)										Average
1	6	7	6	6	7	4	8	5	6	8	6.3
2	7	7	8	7	9	8	7	6	8	9	7.6
4	16	39	11	31	23	16	18	26	16	34	23
5	14	44	16	21	43	47	39	13	20	20	27.7
8	23	20	34	75	20	31	18	30	22	29	30.2
10	47	98	31	28	95	20	23	43	31	22	43.8
20	45	37	63	40	48	38	38	47	43	50	44.9

Line Chart for Comparing the Average Time Costs

Use 4 threads to insert different number of nodes

Then I fixed the number of threads and changed the number of nodes that each thread needs to insert. Combining results from the above two experiments, I noticed that when using 4 threads we have the largest differences for insertion and search. Therefore, I fixed the number of threads to 4 for this experiment. And I will use the set of data to roughly calculate the constants for Big-O nations of these two algorithms.

Results for Lock-Based Synchronization

Nodes #	Time Cost (ms)										Average
10	2	2	2	3	2	2	3	2	3	4	2.5
20	3	3	4	3	3	4	2	3	4	2	3.1
50	5	4	5	5	5	6	5	5	6	6	5.2
100	10	11	10	10	9	9	11	9	10	11	10
200	23	20	22	21	22	21	20	24	27	20	22
500	68	67	69	81	66	71	68	69	80	71	71
1000	168	167	177	161	164	162	198	185	174	178	173.4

Results for Lock-Free Synchronization

Nodes #	Time Cost (ms)										Average
10	2	2	2	2	1	2	2	1	1	2	1.7
20	2	3	4	2	2	3	2	2	2	2	2.4
50	4	3	5	3	4	4	4	4	6	3	4
100	6	7	6	6	5	8	5	10	5	6	6.4
200	9	12	13	13	11	13	7	18	8	16	12
500	17	19	21	23	17	17	25	23	18	37	21.7
1000	23	28	34	38	33	29	40	36	33	32	32.6

Line Chart for Comparing the Average Time Costs

Roughly Estimate the Constants for Big-O Notation

I roughly calculate the constants for Big-O Notations of these two algorithms by O(logn + logk), where k is the number of threads. I used Excel to find the logarithmic trendlines for these these two datasets. As I found, at inserting 4000 nodes, it causes outliers of the trendlines, so I just removed that data point from my dataset, and plotted the following figures. Based on my calculation, the constant for lock-based synchronization is around 30.72, and the constant for lock-free synchronization is about 6.43. The lock-free synchronization has a much smaller constant which means lock-free synchronization performs better than lock-based synchronization.

Lock-Based Synchronization	Lock-Free Synchronization

How to run my code

code layout You can checkout the code from my GitHub repository. Then you can just import the project to Eclipse to test it. The figure on the right shows the content of my program. The main function is in "TestTrees.java" file. You can change "num_threads" and "insert_nodes_per_thread" parameters to test the performance for different number of threads and different number of nodes. You can also change "visulize_locked_tree" and "visulize_lock_free_tree" parameters to enable or disable the visualization for these two trees. Instead, you can just check the video below. It demonstrates how to run some sample experiments with my program.

Conclusion and Future Work

Based on my implementation and experiment results, we can reach the conclusion that lock-free algorithms have a better performance on search and insertion operations on relatively large red-black tree. From the line charts, we can conclude that when the tree gets larger, the performance difference would be more obvious. For future work, I think we can work on finding the hidden constant in the Big-O notation for these two algorithms, so that we can compare the performance of these two algorithms theoretically. The other thing we can work on is to implement lock-free algorithm for deletion operation. Since red-black trees store information not only on leaf nodes but also on internal nodes, deletion operation for red-black trees would result in a more complex structural change, which makes the implementation much harder. Another thing we can do in the future is to find a way to visualize the process for lock-free insertion, which could give users a more straightforward demonstration on how these concurrents work. In addition, in Kim's paper, he proposed an improved version on Ma's algorithm, which needs fewer CAS operations by defining a larger "local area". We might try out his method, and compare its performance with the original Ma's algorithm. We can also study the better concurrent control method "wait-free" to improve the performance for lock-free synchronization.

References

Park, Heejin, and Kunsoo Park. "Parallel algorithms for red–black trees." Theoretical Computer Science 262.1-2 (2001): 415-435.
Kim, Jong Ho, Helen Cameron, and Peter Graham. "Lock-free red-black trees using cas." Concurrency and Computation: Practice and Experience(2006): 1-40.
Jianwen Ma. Lock-Free Insertions on Red-Black Trees. MSc thesis, University of Manitoba, October,2003.
Natarajan, Aravind, Lee H. Savoie, and Neeraj Mittal. "Concurrent wait-free red black trees." Symposium on Self-Stabilizing Systems. Springer International Publishing, 2013.
GitHub TreeGui: https://github.com/EslaMx7/AI-Tasks-JADE-Tests/blob/master/src/trees/tasks/treeGUI.java
Red-Black Tree figure: https://upload.wikimedia.org/wikipedia/commons/6/66/Red-black_tree_example.svg
Rotation GIF figure: https://upload.wikimedia.org/wikipedia/commons/3/31/Tree_rotation_animation_250x250.gif