# Bitonic Merge Via Transposes

So far, you’ve derived two potential
schemes for a distributed Bitonic Merge. The first is a block
distribution scheme. Remember that it is log P stages of
communication followed by log n over P stages of purely local computation. The second scheme was the cyclic scheme. It has log n over P stages of purely
local computation followed by log P stages of communication. Now, the running time for
the two schemes is the same. It’s basically alpha log P plus
beta times n over P log P. Now, looking at these terms,
you might worry about the beta term. If n over P is really large and
the beta term dominates the alpha term, then it looks like you’re paying to
send n over P words log P times. So, a natural question is, is there
an alternative that would let you reduce the cost of the beta term, possibly at
the cost of increasing the alpha term. In fact, there is. First, let’s start cyclic. Starting cyclic means there’s
no communication initially. Then, let’s switch to block. Switching to block means that at
the end there’s no communication. Of course, to make this scheme work, you’re going to need to
reshuffle the data in between. This reshuffling is called a transpose. You can view the transpose as
either a matrix transpose or also as an all to all
personalized exchange. Let’s take the first process,
for instance. Notice that it has to send p minus one
messages, each of size n over p squared. Notice that it has to send these
messages to each of the other processes. The same thing goes for
every other process. For example,
let’s look at the last process. So how does this scheme compare
to the block of cyclic schemes. Again let’s recall the block and cyclic communication times
assuming a hypercube. Now since the transpose
needs to do an all-to-all let’s make a stronger assumption, which
is that the network is fully connected. In that case,
computing the cost would be really easy. So if the network is fully connected,
and then each process will send (P- 1) messages, and the messages
will be of size n / P squared. So as you can see, each scheme exhibits a latency bandwidth
trade off relative to the other. So here we get rid of the log P factor
on the beta term at the cost of increasing the alpha term
factor from log P to P. Now in practice, it’s typically very
hard for the block or cyclic schemes to actually beat the transposed schemes for
typical values of n over P. And that’s true even given the much
stronger assumption of a fully connected network versus a hypercube. Hm, can you think of any
reasons why that might be true?