Comparing Sourcery VSIPL++ to MPI

Sourcery VSIPL++ provides support for data-parallel programming using multiple processors. The Message Passing Interface (MPI) API is an alternative API that also supports data-parallel programming. Many signal-processing applications are presently built using low-level math libraries on individual CPUs, with MPI used to communicate data between processors.

Sourcery VSIPL++ unifies mathematical operations with communication operations; you no longer need to explicitly manage communication. Sourcery VSIPL++ also provides a global view of arrays; in contrast, MPI forces you to treat what is logically a single array as a collection of local data objects. The Sourcery VSIPL++ approach allows you to write much less code. In addition, Sourcery VSIPL++'s high-level view of data allows it to automatically perform optimizations that would require substantial additional effort when programming directly with MPI.

This page shows how to implement a "corner-turn" using both Sourcery VSIPL++ and MPI. A corner-turn is useful in algorithms where some computations are performed along the rows of a matrix while others are performed along the columns. (For example, a row-wise FFT might be followed by a column-wise FFT.) When using multiple processors, it is efficient to distribute the matrix by rows (so that each processor has one or more complete rows) when performing the row-wise operations. Then, the matrix is redistributed by columns (so that each processor has one or more complete columns) before performing the column-wise operations. The redistribution operation is called a corner-turn.
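The data movement in a corner-turn can be sketched in a single process, with no real communication. The block below is an illustration only: `LocalBlock`, `distribute_by_rows`, and `corner_turn` are hypothetical names that belong to neither Sourcery VSIPL++ nor MPI, each "processor" is just an entry in a `std::vector`, and the sketch assumes that the number of processors evenly divides both the rows and the columns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type for illustration: the data held by one "processor".
using LocalBlock = std::vector<float>;

// Split a row-major rows x cols matrix into `procs` blocks of complete rows.
// Assumes procs evenly divides rows.
std::vector<LocalBlock> distribute_by_rows(const std::vector<float>& global,
                                           std::size_t rows, std::size_t cols,
                                           std::size_t procs)
{
  assert(procs > 0 && rows % procs == 0 && global.size() == rows * cols);
  const std::size_t rows_per_proc = rows / procs;
  std::vector<LocalBlock> blocks(procs);
  for (std::size_t p = 0; p < procs; ++p)
    blocks[p].assign(global.begin() + p * rows_per_proc * cols,
                     global.begin() + (p + 1) * rows_per_proc * cols);
  return blocks;
}

// Corner-turn: redistribute row blocks (row-major local storage) into
// column blocks (column-major local storage).
// Assumes the number of blocks evenly divides both rows and cols.
std::vector<LocalBlock> corner_turn(const std::vector<LocalBlock>& by_rows,
                                    std::size_t rows, std::size_t cols)
{
  const std::size_t procs = by_rows.size();
  assert(procs > 0 && rows % procs == 0 && cols % procs == 0);
  const std::size_t rows_per_proc = rows / procs;
  const std::size_t cols_per_proc = cols / procs;
  std::vector<LocalBlock> by_cols(procs, LocalBlock(rows * cols_per_proc));
  for (std::size_t src = 0; src < procs; ++src)
    for (std::size_t r = 0; r < rows_per_proc; ++r)
      for (std::size_t c = 0; c < cols; ++c) {
        const std::size_t dst        = c / cols_per_proc;       // owner of column c
        const std::size_t global_row = src * rows_per_proc + r; // row in full matrix
        const std::size_t local_col  = c % cols_per_proc;       // column within dst
        by_cols[dst][local_col * rows + global_row] = by_rows[src][r * cols + c];
      }
  return by_cols;
}
```

After the call, each entry of the result holds complete columns of the matrix, so a column-wise operation can run on it without touching any other entry.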

  • Sourcery VSIPL++ Implementation

    The Sourcery VSIPL++ implementation requires just a few lines of code to express the corner-turn.

  • MPI Implementation

    The MPI implementation requires much more code to perform a similar operation.

  • Sourcery VSIPL++ Advantages

    The Sourcery VSIPL++ implementation has other advantages over the MPI implementation beyond the fact that it requires many fewer lines of code.

Sourcery VSIPL++ Implementation

In the VSIPL++ API, every matrix has an associated "map" which describes how the matrix is distributed across processors. If you assign a matrix with one distribution to a matrix with another, Sourcery VSIPL++ will automatically reorganize the data. (On a system using MPI, Sourcery VSIPL++ will use MPI to do this for you, but you will not need to directly make use of MPI.)

The following seven-statement example shows how easy it is to implement a corner-turn with Sourcery VSIPL++. You declare maps that correspond to the row-wise and column-wise distributions, create matrices using those maps, and then assign one matrix to the other.

// Define the input matrix type.
Map<> map_in (num_processors(), 1);
typedef Matrix<float, Dense<2, float, row2_type, Map<> > >
  view_in_type;

// Define the output matrix type.
Map<> map_out(1, num_processors());
typedef Matrix<float, Dense<2, float, col2_type, Map<> > >
  view_out_type;

// Create the matrices.
view_in_type  in (rows, cols, map_in);
view_out_type out(rows, cols, map_out);

// Perform the corner turn.
out = in;

MPI Implementation

The MPI implementation is far more complicated. MPI does not have a global picture of the data layout; instead, each node manages its own local slice of a global object. To perform the corner-turn, you must therefore manually create MPI datatypes for the input and output of the operation. Constructing these datatypes is complex, and each one must also be manually allocated and deallocated. As a result, the example below requires 29 statements to perform the corner-turn.

void corner_turn(float* in,
		 float* out,
		 size_t rows,
		 size_t cols,
		 size_t num_processors) {
  MPI_Datatype src_datatype;
  MPI_Datatype dst_datatype;
  MPI_Datatype tmp0_datatype;
  MPI_Datatype tmp1_datatype;
  MPI_Datatype tmp2_datatype;

  size_t nrows_per_send = rows / num_processors;
  size_t ncols_per_recv = cols / num_processors;

  /* Create send-side datatype. */
  MPI_Type_vector(nrows_per_send,
		  ncols_per_recv,
		  cols,          
		  MPI_FLOAT,
		  &tmp0_datatype);
  MPI_Type_commit(&tmp0_datatype);

  int          lena[2]   = { 1, 1 };
  MPI_Aint     loca[2]   = { 0, ncols_per_recv * sizeof(float) };
  MPI_Datatype typesa[2] = { tmp0_datatype, MPI_UB };
  MPI_Type_struct(2, lena, loca, typesa, &src_datatype);
  MPI_Type_commit(&src_datatype);

  /* Create recv-side datatype.  */
  MPI_Type_vector (ncols_per_recv,
		   1,
		   rows,
		   MPI_FLOAT,
		   &tmp1_datatype);
  MPI_Type_commit(&tmp1_datatype);

  int          len[3]   = { 1, 1, 1 };
  MPI_Aint     loc[3]   = { 0, 0, sizeof(float) };
  MPI_Datatype types[3] = { tmp1_datatype, MPI_LB, MPI_UB };
  MPI_Type_struct(3, len, loc, types, &tmp2_datatype);
  MPI_Type_commit(&tmp2_datatype);

  MPI_Type_hvector(nrows_per_send, 
		   1, 
		   sizeof(float), 
		   tmp2_datatype,
		   &dst_datatype);
  MPI_Type_commit(&dst_datatype);

  /* Perform corner-turn.  */
  MPI_Alltoall(in,  1, src_datatype,
	       out, 1, dst_datatype,
	       MPI_COMM_WORLD);

  /* Cleanup.  */
  MPI_Type_free(&tmp0_datatype);
  MPI_Type_free(&tmp1_datatype);
  MPI_Type_free(&tmp2_datatype);
  MPI_Type_free(&src_datatype);
  MPI_Type_free(&dst_datatype);
}
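To see what the two derived datatypes describe, it can help to list the buffer positions they select. The sketch below calls no MPI functions; `send_offsets` and `recv_offsets` are hypothetical helpers that reproduce, as plain index arithmetic, the layouts built above from `MPI_Type_vector`, the resized extents, and `MPI_Type_hvector`. Offsets are counted in floats from the start of the local buffer, and the processor count is assumed to evenly divide both dimensions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Positions in the row-major send buffer that go to processor `dest`:
// nrows_per_send blocks of ncols_per_recv floats, one block per local row
// (stride = cols), shifted by dest * ncols_per_recv because the send
// datatype's extent is ncols_per_recv floats.
std::vector<std::size_t> send_offsets(std::size_t dest, std::size_t rows,
                                      std::size_t cols, std::size_t procs)
{
  assert(procs > 0 && rows % procs == 0 && cols % procs == 0 && dest < procs);
  const std::size_t nrows_per_send = rows / procs;
  const std::size_t ncols_per_recv = cols / procs;
  std::vector<std::size_t> offsets;
  for (std::size_t r = 0; r < nrows_per_send; ++r)
    for (std::size_t c = 0; c < ncols_per_recv; ++c)
      offsets.push_back(dest * ncols_per_recv + r * cols + c);
  return offsets;
}

// Positions in the column-major receive buffer filled by data from
// processor `src`, in arrival order: each incoming row is scattered with
// stride `rows`, successive rows start one float apart, and the whole
// block is shifted by src * nrows_per_send floats.
std::vector<std::size_t> recv_offsets(std::size_t src, std::size_t rows,
                                      std::size_t cols, std::size_t procs)
{
  assert(procs > 0 && rows % procs == 0 && cols % procs == 0 && src < procs);
  const std::size_t nrows_per_send = rows / procs;
  const std::size_t ncols_per_recv = cols / procs;
  std::vector<std::size_t> offsets;
  for (std::size_t r = 0; r < nrows_per_send; ++r)
    for (std::size_t c = 0; c < ncols_per_recv; ++c)
      offsets.push_back(src * nrows_per_send + r + c * rows);
  return offsets;
}
```

The k-th offset in the sender's list is stored at the k-th offset in the receiver's list. This is how the datatypes perform the transpose as part of the communication.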

Sourcery VSIPL++ Advantages

Sourcery VSIPL++ has other advantages, beyond requiring significantly less code:

  • The MPI implementation only works if the number of processors in use evenly divides both the number of rows and the number of columns in the matrix. If that condition is not true, then even more MPI code is required to compensate. In contrast, the Sourcery VSIPL++ implementation works whether or not this condition holds.

  • The MPI implementation only works if the processors that will hold the matrix before and after the corner-turn precisely overlap. If, instead, the corner-turn is being performed between two separate sets of processors, it is no longer possible to use MPI_Alltoall, i.e., the algorithm itself would change. In contrast, the Sourcery VSIPL++ implementation would require only a change to the maps used to declare the matrices.

  • Better performance would be obtained with most MPI implementations by transposing the data locally, after receiving it, rather than using MPI datatypes to handle the transpose. However, this approach is more complex, and therefore not shown above. Sourcery VSIPL++ can perform this optimization automatically.

  • Most MPI implementations impose some overhead on communications from a processor to itself, relative to a simple memory block copy. Sourcery VSIPL++ recognizes this case, and handles it as a direct copy, without involving MPI (or an alternative underlying parallel communication API) at all.
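As an illustration of the first point, handling a processor count that does not evenly divide the matrix starts with deciding which rows (or columns) each processor owns. The sketch below shows one common convention, in which the first `n % procs` processors each receive one extra element. `block_range` is a hypothetical helper, and the text above does not say whether Sourcery VSIPL++ uses this exact split. A hand-written MPI version would also need per-processor counts and displacements derived from these ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: half-open range [first, second) of the n rows (or
// columns) owned by processor p when n is split into `procs` contiguous
// blocks whose sizes differ by at most one.
std::pair<std::size_t, std::size_t> block_range(std::size_t n,
                                                std::size_t procs,
                                                std::size_t p)
{
  assert(procs > 0 && p < procs);
  const std::size_t base  = n / procs;  // minimum block size
  const std::size_t extra = n % procs;  // number of blocks with one more
  const std::size_t begin = p * base + (p < extra ? p : extra);
  const std::size_t size  = base + (p < extra ? 1 : 0);
  return std::make_pair(begin, begin + size);
}
```

For example, with 10 rows and 4 processors this split yields block sizes of 3, 3, 2, and 2.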