# Bitonic Merge Via Transposes

So far, you’ve derived two potential schemes for a distributed Bitonic Merge. The first is the block distribution scheme: log P stages of communication followed by log(n/P) stages of purely local computation. The second is the cyclic scheme: log(n/P) stages of purely local computation followed by log P stages of communication. The running time for the two schemes is the same: basically, α log P plus β (n/P) log P.

Looking at these terms, you might worry about the β term. If n/P is really large and the β term dominates the α term, then it looks like you’re paying to send n/P words log P times.
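Keep this cost in front of you for the comparison below (α is the per-message latency, β the per-word transfer cost):

$$
T_{\text{block}} = T_{\text{cyclic}} \approx \alpha \log P + \beta\,\frac{n}{P}\,\log P
$$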

So a natural question is: is there an alternative that would let you reduce the cost of the β term, possibly at the cost of increasing the α term? In fact, there is. First, let’s start cyclic; under the cyclic distribution the early stages are purely local, so there’s no communication initially. Then, let’s switch to block; under the block distribution the late stages are purely local, so at the end there’s no communication either. Of course, to make this scheme work, you’re going to need to reshuffle the data in between.

This reshuffling is called a transpose. You can view the transpose either as a matrix transpose or as an all-to-all personalized exchange. Let’s take the first process, for instance: notice that it has to send P − 1 messages, each of size n/P², one to each of the other processes. The same thing goes for every other process; the last process, for example, has to do exactly the same.
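To make the exchange concrete, here is a minimal sketch of the transpose step written against MPI’s `MPI_Alltoall`, which performs exactly this kind of all-to-all personalized exchange. The function name `transpose_exchange`, the `int` element type, and the assumption that P divides n/P evenly are illustrative choices, not part of the lecture:

```c
#include <mpi.h>

/* Sketch of the cyclic-to-block reshuffle as an all-to-all
 * personalized exchange. Each process holds n_local = n/P
 * elements and sends a distinct block of n/P^2 elements to
 * every process (MPI_Alltoall copies the "self" block locally).
 * In practice you may first need to pack the local array so
 * that the elements destined for process i sit contiguously
 * at offset i * block. */
void transpose_exchange(const int *local, int *shuffled,
                        int n_local, MPI_Comm comm)
{
    int P;
    MPI_Comm_size(comm, &P);

    int block = n_local / P;   /* n/P^2 elements per destination */

    MPI_Alltoall(local,    block, MPI_INT,
                 shuffled, block, MPI_INT,
                 comm);
}
```

On a fully connected network, each of those P − 1 messages can travel directly to its destination, which is what the cost analysis below assumes.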

So how does this scheme compare to the block or cyclic schemes? Again, let’s recall the block and cyclic communication times, which assumed a hypercube network. Now, since the transpose needs to do an all-to-all, let’s make a stronger assumption: that the network is fully connected. In that case, computing the cost is easy: each process sends P − 1 messages, each of size n/P².
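With P − 1 messages of n/P² words each, the transpose’s communication time on a fully connected network works out to

$$
T_{\text{transpose}} \approx (P - 1)\left(\alpha + \beta\,\frac{n}{P^{2}}\right) \approx \alpha P + \beta\,\frac{n}{P}.
$$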

So, as you can see, each scheme exhibits a latency-bandwidth trade-off relative to the other: the transpose gets rid of the log P factor on the β term, at the cost of increasing the α term’s factor from log P to P.
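Side by side:

$$
\underbrace{\alpha \log P + \beta\,\frac{n}{P}\log P}_{\text{block or cyclic}}
\qquad \text{versus} \qquad
\underbrace{\alpha P + \beta\,\frac{n}{P}}_{\text{transpose}}
$$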

Now, in practice it’s typically very hard for the block or cyclic schemes to actually beat the transpose scheme for typical values of n/P. And that’s true even given the much stronger assumption of a fully connected network versus a hypercube. Hm, can you think of any reasons why that might be true?