Parallel merge algorithm on GPUs using CUDA

Given two sorted arrays A and B, we want to merge them into a single sorted array C. We formulate three parallel merge algorithms in CUDA for GPUs:

1) Algorithm 1: uses non-coalesced accesses to global memory
2) Algorithm 2: uses shared memory to reduce the non-coalesced global memory accesses
3) Algorithm 3: reduces the shared memory requirement by using a circular buffer
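
To make Algorithm 1 concrete, here is a minimal sketch of one common way to write a global-memory merge kernel. It is not necessarily the implementation used here. It assumes int elements, and the names co_rank, merge_basic and merge_on_gpu, the block size of 128 and the target of 8 outputs per thread are all hypothetical choices made for illustration. Each thread takes a contiguous slice of C, binary-searches for the matching start positions in A and B, and then merges its slice sequentially. Neighbouring threads read A and B at addresses that are a whole slice apart, which is why the global memory accesses are not coalesced. Error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Co-rank: returns i such that the first k merged outputs are exactly
// A[0..i) and B[0..k-i). Ties are resolved in favor of A.
__host__ __device__ int co_rank(int k, const int* A, int m, const int* B, int n) {
    int lo = (k > n) ? k - n : 0;  // at least k - n elements must come from A
    int hi = (k < m) ? k : m;      // at most min(k, m) elements can come from A
    while (lo < hi) {
        int i = lo + (hi - lo) / 2;
        int j = k - i;             // here 1 <= j <= n, so B[j - 1] is valid
        if (A[i] > B[j - 1]) hi = i;      // A[i] sorts after B[j-1]: i is an upper bound
        else                 lo = i + 1;  // A[i] is among the first k outputs
    }
    return lo;
}

// Each thread merges one contiguous slice of C directly from global memory.
__global__ void merge_basic(const int* A, int m, const int* B, int n, int* C) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * gridDim.x;
    int total = m + n;
    int per_thread = (total + nthreads - 1) / nthreads;
    int k = min(tid * per_thread, total);      // first output index of this thread
    int k_end = min(k + per_thread, total);    // one past its last output index
    int i = co_rank(k, A, m, B, n);
    int i_end = co_rank(k_end, A, m, B, n);
    int j = k - i;
    int j_end = k_end - i_end;
    // Sequential merge of A[i..i_end) and B[j..j_end) into C[k..k_end).
    while (i < i_end && j < j_end) C[k++] = (A[i] <= B[j]) ? A[i++] : B[j++];
    while (i < i_end) C[k++] = A[i++];
    while (j < j_end) C[k++] = B[j++];
}

// Hypothetical host helper: copies inputs to the device, merges, copies back.
std::vector<int> merge_on_gpu(const std::vector<int>& a, const std::vector<int>& b) {
    const int m = static_cast<int>(a.size());
    const int n = static_cast<int>(b.size());
    const int total = m + n;
    std::vector<int> c(total);
    if (total == 0) return c;

    int *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, m * sizeof(int));
    cudaMalloc(&dB, n * sizeof(int));
    cudaMalloc(&dC, total * sizeof(int));
    cudaMemcpy(dA, a.data(), m * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 128;                   // assumed block size
    const int per_block = threads * 8;         // aim for about 8 outputs per thread
    const int blocks = (total + per_block - 1) / per_block;
    merge_basic<<<blocks, threads>>>(dA, m, dB, n, dC);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(c.data(), dC, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

For example, merge_on_gpu({1, 3, 5, 7}, {2, 3, 6, 8, 10}) returns 1, 2, 3, 3, 5, 6, 7, 8, 10. A shared-memory variant in the spirit of Algorithm 2 would keep the same co_rank search but first load each block's portion of A and B into shared memory with coalesced reads.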