New Conventional and Parallel Programming Language Planning C 2.0

Hello dear readers.

I want to write here about one of my projects – the language Planning C (v2.0). It is an extension of C++ that adds a number of new constructs to the core language. The project is currently available in repositories (source code of the prototypical translator-preprocessor, many examples, MPI->Planning C converter for simple programs). Planning C differs from other languages ​​in that many of its new constructs are based on so-called re-entry scheduling procedures, which are primarily convenient for programming some algorithms using a stack, deck, or queue (but can also be used for programming arbitrary algorithms) . The language contains various means of algorithmization and parallelization, more or less unified for modern computers with multi-core processors, and for video cards, and for cluster systems. The second version of the language introduced standard tools for extending the language with new constructions, “intelligent” memoization, and some other features. I hope that this language will seem interesting to someone, maybe even promising for application and / or development. I myself sometimes use it to quickly write some computational parallel programs.

In this article I will write only about the most basic features of the language, mainly with examples. If the topic arouses interest, then perhaps later I will write one or two more articles about “advanced” / unusual features.

A bit of theory

A procedure with reentry planning is a procedure with a plan. The procedure is executed as many times (sequentially or in parallel) as there are elements in the plan. Elements are retrieved from the beginning of the plan, for each retrieved element the procedure is called again (in an implicitly present loop). Each element contains a set of procedure parameter values ​​that are passed to it when processing the current plan element. The plan can be filled outside the procedure or inside it [с помощью специальных функций вставки в начало плана (plan_first) или в конец плана (plan_last)].

Parameters passed by reference are handled in a slightly special way – their values ​​simply pass from stage to stage. However, if in one of the elements of the plan such a parameter was scheduled with a cutoff (with the sign “!” after the parameter value), then exactly the specified value will be passed to the procedure when executing this element.

A function with reentry scheduling differs from a procedure in that it has a return value, which is determined by the result of the last executed element of the plan.

Procedures and functions with reentry scheduling can be combined into chains (sets of such procedures or functions), which can be executed simply in parallel (vector mode) or in parallel with the transfer of data along this chain (pipeline mode, while passing the value to the beginning of the plan of the next procedure used throw_first, and at the end – throw_last).

Examples of conventional calculations

Example. Consider a procedure that loops from 0 to 5:

reenterable Loop(int x, int last) {

 cout << x << “ “;

 if (++x <= last) plan_first(x);

}

/* Вызов: Loop(0,5); */

Example. Consider the sequential numbering of nodes of a binary tree by levels from top to bottom and from left to right.

typedef struct TreeTag {

  int Data;

  struct TreeTag * Left;

  struct TreeTag * Right;

} TreeNode;

int Number;

reenterable EnumerateByLevel(TreeNode * Cur) {

 Cur->Data = Number++;

 if (Cur->Left) plan_last(Cur->Left);

 if (Cur->Right) plan_last(Cur->Right);

}

/* Вызов: Number = 1; EnumerateByLevel(Root); */

the same function EnumerateByLevel can also be used as a lambda function, with the difference that instead of functions plan_first/plan_last are used retain_first/retain_last:

auto f = reenterable (TreeNode * Cur) {

 Cur->Data = Number++;

 if (Cur->Left) reent_last(Cur->Left);

 if (Cur->Right) reent_last(Cur->Right);

}

/* Вызов: Number = 1; f(Root); */

Examples of Parallel Computing

When planning execution elements, markers can be inserted into the plan, which divide it into groups. By default (without markers), the group is the entire plan. Also, by default, any group is executed sequentially, but if you call the directive plan_group_parallelize, then the parallel execution mode of such a group will immediately turn on. And if you call plan_group_vectorize, then a group of plan elements will be launched on the video card (I’ll make a reservation right away that in this case there are a number of features and restrictions on the syntax).

Example. In parallel, we sum up the elements of the tree, while using the reduction for the sum parameter:

reenterable TreeSum(TreeNode * Cur, reduction(+) int & SumResult)

{

 if (Cur==Root) plan_group_parallelize;

 if (Cur->Left) plan_last(Cur->Left,SumResult);

 if (Cur->Right) plan_last(Cur->Right,SumResult);

 SumResult = Cur->Data;

}

An example for working on a video card (we multiply the array of numbers A[N] by k, we get the array B[N]):

#pragma plan vectorized

#pragma plan common begin

#define N 5

#pragma plan common end

reenterable Mul(bool init, int k, _global(N) int * A, int i, _local(1) int * out) {

  int ii;

  if (init) { // Заполняем план и далее запускаем на распараллеливание

     for (ii = 0; ii < N; ii++) {

         plan_last(false, k, A, ii, out);

         out++;

     }

     plan_group_vectorize(NULL);

  } else { // Собственно параллельная работа

         *out = A[i]*k;

  }

}

/* Вызов: Mul(true, k, A, 0, B); */

An example of parallel computing on a pipeline (for a multi-core processor)

Let the conveyor process a tree, and at its first stage there is a maximum element, and at the second stage – a minimum element. The first stage, in turn, is parallelized into 4 processors, and the second stage is similarly parallelized. Thus, here we have combined parallelism.

#ifndef __min

#define __min(a,b) ((a)<(b) ? (a) : (b))

#endif

#ifndef __max

#define __max(a,b) ((a)>(b) ? (a) : (b))

#endif

chain TreeMax(TreeNode * Cur, reduction(__max) int & MaxResult)

{ // Первая стадия конвейера

 static int DummyMin;

 throw_last(Cur, DummyMin);

 if (Cur==Root) plan_group_parallelize;

 if (Cur->Left) plan_last(Cur->Left, MaxResult);

 if (Cur->Right) plan_last(Cur->Right, MaxResult);

 MaxResult = Cur->Data;

}

chain TreeMin(TreeNode * Cur, reduction(__min) int & MinResult)

{ // Вторая стадия конвейера

 if (Cur==Root) plan_group_parallelize;

 MinResult = Cur->Data;

}

void TreeMinMax(int & Min, int & Max)

{

 Max = 0.0;

 Min = 1000.0;

// Запуск ковейера в параллельном режиме

plan_parallel_chain(0, TreeMax(Root,Max)/4, TreeMin(Root,Min)/4);

}

An example of parallel computing (abstract) on a pipeline, on a cluster

On the node with ID 1 prints numbers from 0 to 29 to standard output.

#pragma plan clustered

#include <iostream>

using namespace std;

chain A1() throw(int DATA) { // Первая стадия конвейера

  for (int i = 0; i < 30; i++)

      throw_last(i);

}

chain B1(int DATA) { // Вторая стадия конвейера

  cout<<DATA<<endl;

}

int main() {

  int IDs[2] = {0,1}; // MPI-Идентификаторы узлов, на которых будет запущен конвейер

  clustered(IDs) plan_parallel_chain(0, A1(), B1(0));

  return 0;

}

A bit about transactional memory

Transactional memory is available, even in two versions: a) implemented by the compiler (if it supports it) and b) software (partially transactional, implemented in Planning C, involving the use of both transactional and conventional memory in one block of code). It is programmed in a very similar way to the above example of a program for a video card: a group of plan stages is formed, after which a directive is given to enable parallel calculation in the mode of one or another type of transactional memory. I will give an example with software partially transactional memory – a standard example (calculation of a histogram of array values). The program sums the frequencies and displays the sum on the screen – naturally, this is the number of array elements (in this case, 10000).

#include <stdlib.h>

#include <iostream>

using namespace std;

const int N = 10000;

const int M = 600;

const int NF = 20;

reenterable calc_histo(bool init, int np, int * A, soft_transact_array(int) * F, double grain, int k) {

     if (init) { // Формируем план для распараллеливания

         for (int i = 0; i < N; i += np) {

              for (int j = 0; j < np; j++)

                   plan_last(false, np, A, F, grain, i + j);

              plan_group_soft_atomize; // Включаем режим параллельной работы в программной транзакционной памяти

         }

     } else { // Работа, собственно, в программной транзакционной памяти

         if (k < N) {

              int _k = A[k] / grain;

              if (_k >= NF) _k = NF - 1;

              ++(*F)[_k];

         }

     }

}

int main() {

     int * A = new int[N];

     soft_transact_array(int) F(NF);

     srand(184415);

     for (int i = 0; i < N; i++)

         A[i] = rand() % M;

     double grain = 1.0 * M / NF;

     F = 0;

     calc_histo(true, 4, A, &F, grain, 0);

// Гистограмму подсчитали, делаем проверку

     int SF = 0;

     for (int i = 0; i < NF; i++)

         SF += F[i];

     cout << SF;

     delete[] A;

     return 0;

}

More about parallel computing

A natural question is whether this language has traditional means of parallel programming in shared memory – semaphores, mutexes, barriers, critical sections … Yes, all this is there, I just didn’t give examples with them, since there are some significant features of working with them not in the language.

Also in the language there are some less standard tools for working on cluster systems, namely atomic global variables, semaphore, channels, and barrier. I will give (without comments) a small, also quite abstract example of working with an atomic global DATA variable on a “clique” cluster topology.

#pragma plan clustered

#include <iostream>>

using namespace std;

cvar(int) * DATA = NULL;

int Counter;

chain A(input_proc Src, int N) {

  int id = plan_linear_num()-1;

  if (Src == empty_proc) {

     (*DATA)++;

     Counter = 1;

     for (int i = 0; i < N; i++)

         if (i != id)

            throw_first(A[i+1], N);

  } else {

     (*DATA)++;

     Counter++;

     if (Counter == N) {

        plan_registered_barrier(topology);

        if (id == N-1) {

           cout<<**DATA<<endl;

           plan_topology_quit();

        }

     }

  }

}

int main() {

  const int N = 5;

  DATA = new cvar(int)(1);

  if (plan_cluster_id() == 0) *DATA = 0;

  int IDS[N];

  for (int i = 0; i < N; i++)

      IDS[i] = i;

  clustered(IDS) clique(N)/N;

  delete DATA;

  return 0;

}

Instead of a conclusion

In the examples given, the emphasis was on parallel programming, since, in my opinion, these are the most interesting features of the language. In more detail (albeit in a more complex style) you can read about the features of the language in the documentation posted in repositories

Similar Posts

Leave a Reply