Tuesday, September 16, 2014

Autothreading

needle & thread

Part of the design of the Kori language is to incorporate autothreading. Autothreading is when the language decides where threads would be useful, removing the burden from the programmer. The main problem with autothreading is determining what size chunks should allotted to each thread. Too small and the program spends more time deciding what to thread than doing its job. Too large and the program does not benefit greatly from threading. Getting it just right is a difficult balancing act.

feather

Threads are often described a light-weight processes since they don't have the overhead a process has. But they still have some overhead. A program that would create and destroy lots of threads during its execution would encounter this overhead. Is it possible to have lots of threads and not encounter this overhead? Yes, by borrowing techniques from server processes.

Many web servers, like Apache, start up child processes and leave them running. When it gets a request, it hand it off to one of the child processes. This way, there is no delay and no overhead to start a new process; one is already running.

The same can be done with threads. By breaking the program into virtual threads, and by assign each virtual thread to a real thread in a round-robin manner, the overhead of starting and stopping a thread is eliminated.

The only problem that remains is how to break the program up into virtual threads. What follows is one such scheme.

Intertwined Threads

Celtic knot

One way to look at a program is to say that the syntax of it shows the control flow of it. That is, the syntax describes the steps that the program follows. And, by extension, the semantics of a program describe the data flow it follows. That is, the what, where, and how of its variables show the what, where, and how of the data. By analyzing the placement of the variables, one can create the intertwined threads of its data flow.

Autothreading requires that these intertwined threads be separated so they can be executed without interfering with each other.

Statements As Threading Chunks

Consider the following code:

  function slope ( x1, y1, x2, y2 )
  {
      new Δx = x2 - x1;
      new Δy = y2 - y1;
      new m = Δy ÷ Δx;
      new b = y1 - m × x1;
      return m, b;
  }

It calculates the slope and Y-intercept given two points. Clearly, the first two statements can be done in parallel and the remaining ones in series.

To create virtual threads for it, a bit vector is created call in_use which represents every variable in a block. Other bit vectors are created for each statement, called uses, which is set for every variables the statements uses. A task queue is created and each statement is enqueued in order of appearance.

  bit vector: x1, y1, x2, y2, Δx, Δy, m, b

  is_use: 00000000

  Task Queue                  Uses
  ----------                  --------
  new Δx = x2 - x1;           10101000
  new Δy = y2 - y1;           01010100
  new m = Δy ÷ Δx;            00001110
  new b = y1 - m × x1;        11000011
  return m, b;                00000011

The in_use bit vector is initially cleared.

The first thread makes a copy of in_use. For each statement, it ands its in_use with the statement's uses and if the result is all zero, it can execute the statements. But first, it ors into the block's in_use, the statement's uses. This is to prevent another thread from accessing any variables it uses. It removes the statement from the queue and executes it. Later, when it has finished the statement, it will and the complement of the statement's uses with the block's in_use to indicate the variables are free to use.

The next thread does the same thing: copies the in_use and ands it with each statement's uses until it finds one that has a cleared result. In this case, it finds and removes the second statement. The queue now looks like this:

  bit vector: x1, y1, x2, y2, Δx, Δy, m, b

  is_use: 11111100

  Task Queue                  Uses
  ----------                  --------
  new m = Δy ÷ Δx;            00001110
  new b = y1 - m × x1;        11000011
  return m, b;                00000011

The third thread copies the in_use and examines the next statement. Since the and result is not clear, it ors the statement's uses with its in_use. This is necessary because chains of variables can look like latter variables are free when they're not. Consider this:

  bit vector: a, b, c

  is_use: 000

  Task Queue                  Uses
  ----------                  ----
  new a = 7;                  100
  new b = a;                  110
  new c = b;                  011

When the first statement is being executed, the block's in_use is 100. If the second statement is not ored in in passing, the third statement would be executed before the second, which is an error. That's why each thread must keep its own copy of in_use.

Synchronization

needle & thread

How would synchronization of the virtual threads happen? By placing a pseudo-task in the queue that has all its uses bits set. This prevents any thread from accepting the pseudo-task until all variables are free.

if's and Loops

if's and loop are enqueue as one statement; their blocks are not. Whether and which block is enqueued is determine when they are executed. That is, if an if condition is true, its then block is added to the queue. If it is false, its else block is added to the queue. Statements are not added until it is determined that they should be executed.

Their uses vector is set to all the variables from its surrounding blocks that it uses. None of the variables that are scoped inside of them are included. As noted, those will be added in a new block of tasks.

Multiple Blocks

needle & thread

The above works well for a single block, but if each block has its own in_use vector, keeping track of the will create a lot of bookkeeping. To reduce the bookkeeping, a block queue is used. Each entry in the queue has a in_use variable, a task queue, and a quiescence flag.

The threads then examine each block in turn to see if it has any task to run. If a thread examines a block and finds no executable task in it, the quiescence flag is set. This allows the next thread to skip the block entirely. It is only when a task is done and the block's in_use is changed would the flag be cleared. This is because until the in_use is changed, all tasks in the block will not executable.

Multiple Reads

Consider the following code:

  new a = 7;
  new b = a + 1;
  new c = -a;

The last two statements can be done in parallel since the variable a is only read in them. But to do this, the bit vectors have to be changed to arrays of integers. A statement's uses array would have 1 for a read, -1 of a write, and 0 if it's unused. The block's in_use array would count up the number of reads as each statement is executed, and count down when completed. It's only when the counter is zero can a statement with a write be executed.

This would take more time and more memory and with computers getting faster and faster, its additional complexity may not be needed.

Virtual Threads

Virtual threads offer a way to do autothreading without incurring the overhead of real threads. But it is not without its own bookkeeping. With some care, the bookkeeping can be kept to a minimum.

The above scheme is organic. That is, the execution of its virtual threads are determine at run time. There is no need for a program to determine its virtual threads in advance. Parallel tasks may be done in any order, though those that depend on one another are executed in order.

Virtual threads offer the best to design programs for the future, when computers with hundreds and thousands of CPUs become available.


Acknowledgements