14.2. Hello Threading! Writing Your First Multithreaded Program
In this section, we examine the ubiquitous POSIX thread library Pthreads. POSIX is an acronym for Portable Operating System Interface. It is an IEEE standard that specifies how UNIX systems look, act, and feel. The POSIX threads API is available on almost all UNIX-like operating systems, each of which meets the standard in its entirety or to some great degree. So, if you write parallel code using POSIX threads on a Linux machine, it will certainly work on other Linux machines, and it will likely work on machines running macOS or other UNIX variants.
Let's begin by analyzing an example "Hello World" Pthreads program (hellothreads.c). For brevity, we have excluded error handling in the listing, though the downloadable version contains sample error handling.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
/* The "thread function" passed to pthread_create. Each thread executes this
* function and terminates when it returns from this function. */
void *HelloWorld(void *id) {
/* We know the argument is a pointer to a long, so we cast it from a
* generic (void *) to a (long *). */
long *myid = (long *) id;
printf("Hello world! I am thread %ld\n", *myid);
return NULL; // We don't need our threads to return anything.
}
int main(int argc, char **argv) {
int i;
int nthreads; //number of threads
pthread_t *thread_array; //pointer to future thread array
long *thread_ids;
// Read the number of threads to create from the command line.
if (argc !=2) {
fprintf(stderr, "usage: %s <n>\n", argv[0]);
fprintf(stderr, "where <n> is the number of threads\n");
return 1;
}
nthreads = strtol(argv[1], NULL, 10);
// Allocate space for thread structs and identifiers.
thread_array = malloc(nthreads * sizeof(pthread_t));
thread_ids = malloc(nthreads * sizeof(long));
// Assign each thread an ID and create all the threads.
for (i = 0; i < nthreads; i++) {
thread_ids[i] = i;
pthread_create(&thread_array[i], NULL, HelloWorld, &thread_ids[i]);
}
/* Join all the threads. Main will pause in this loop until all threads
* have returned from the thread function. */
for (i = 0; i < nthreads; i++) {
pthread_join(thread_array[i], NULL);
}
free(thread_array);
free(thread_ids);
return 0;
}
Let's examine this program in smaller components.
-
Notice the inclusion of the
pthread.h
header file, which declarespthread
types and functions. -
Next, the
HelloWorld
function defines the thread function that we later pass topthread_create
. A thread function is analogous to amain
function for a worker (created) thread --- a thread begins execution at the start of its thread function and terminates when it reaches the end. Each thread executes the thread function using its private execution state (i.e., its own stack memory and register values). Note also that the thread function is of typevoid*
. Specifying an anonymous pointer in this context allows programmers to write thread functions that deal with arguments and return values of different types. -
Lastly, in the
main
function, the main thread initializes the program state before creating and joining the worker threads.
14.2.1. Creating and Joining Threads
The program first starts as a single-threaded process. As it executes
the main
function, it reads the number of threads to create, and it
allocates memory for two arrays: thread_array
and thread_ids
. The
thread_array
array contains the set of addresses for each thread
created. The thread_ids
array stores the set of arguments that each
thread is passed. In this example, each thread is passed the address of
its rank (or ID, represented by thread_ids[i]
).
After all the preliminary variables are allocated and initialized, the
main
thread executes the two major steps of multithreading:
-
The creation step, in which the main thread spawns one or more worker threads. After being spawned, each worker thread runs within its own execution context concurrently with the other threads and processes on the system.
-
The join step, in which the main thread waits for all the workers to complete before proceeding as a single-thread process. Joining a thread that has terminated frees the thread's execution context and resources. Attempting to join a thread that hasn't terminated blocks the caller until the thread terminates, similar to the semantics of the wait function for processes.
The Pthreads library offers a pthread_create
function for creating
threads and a pthread_join
function for joining them. The
pthread_create
function has the following signature:
pthread_create(pthread_t *thread, const pthread_attr_t *attr,
void *thread_function, void *thread_args)
The function takes a pointer to a thread struct (of type pthread_t
), a
pointer to an attribute struct (normally set to NULL
), the name of the
function the thread should execute, and the array of arguments to pass
to the thread function when it starts.
The Hello World program calls pthread_create
in the main
function
using:
pthread_create(&thread_array[i], NULL, HelloWorld, &thread_ids[i]);
Here:
-
&thread_array[i]
contains the address of thread i. Thepthread_create
function allocates apthread_t
thread object and stores its address at this location, enabling the programmer to reference the thread later (e.g., when joining it). -
NULL
specifies that the thread should be created with default attributes. In most programs, it is safe to leave this second parameter asNULL
. -
HelloWorld
names the thread function that the created thread should execute. This function behaves like the "main" function for the thread. For an arbitrary thread function (e.g.,function
), its prototype must match the formvoid * function(void *)
. -
&thread_ids[i]
specifies the address of the arguments to be passed to thread i. In this case,thread_ids[i]
contains a singlelong
representing the thread's ID. Since the last argument topthread_create
must be a pointer, we pass the address of the thread's ID.
To generate several threads that execute the HelloWorld
thread
function, the program assigns each thread a unique ID and creates each
thread within a for
loop:
for (i = 0; i < nthreads; i++) {
thread_ids[i] = i;
pthread_create(&thread_array[i], NULL, HelloWorld, &thread_ids[i]);
}
The OS schedules the execution of each created thread; the user cannot make any assumption on the order in which the threads will execute.
The pthread_join
function suspends the execution of its caller until
the thread it references terminates. Its signature is:
pthread_join(pthread_t thread, void **return_val)
The pthread_join
takes as input a pthread_t
struct, indicating which
thread to wait on, and an optional pointer argument that specifies where
the thread's return value should be stored.
The Hello World program calls pthread_join
in main
using:
pthread_join(thread_array[t], NULL);
This line indicates that the main thread must wait on the termination of
thread t
. Passing NULL
as the second argument indicates that the
program does not use the thread's return value.
In the previous program, main
calls pthread_join
in a loop because
all of the worker threads need to terminate before the main
function
proceeds to clean up memory and terminate the process:
for (i = 0; i < nthreads; i++) {
pthread_join(thread_array[i], NULL);
}
14.2.2. The Thread Function
In the previous program, each spawned thread prints out
Hello world! I am thread n
, where n
is the thread's unique id. After
the thread prints out its message, it terminates. Let's take a closer
look at the HelloWorld
function:
void *HelloWorld(void *id) {
long *myid = (long*)id;
printf("Hello world! I am thread %ld\n", *myid);
return NULL;
}
Recall that pthread_create
passes the arguments to the thread function
using the thread_args
parameter. In the pthread_create
function in
main
, the Hello World program specified that this parameter is in fact
the thread's ID. Note that the parameter to HelloWorld
must be
declared as a generic or anonymous pointer
(void *
).
The Pthreads library uses void *
to make pthread_create
more general
purpose by not prescribing a parameter type. As a programmer, the
void *
is mildly inconvenient given that it must be recast before use.
Here, we know the parameter is of type long *
because that's what we
passed to pthread_create
in main
. Thus, we can safely cast the value
as a long *
and dereference the pointer to access the long
value.
Many parallel programs follow this structure.
Similar to the thread function's parameter, the Pthreads library avoids
prescribing the thread function's return type by specifying another
void *
--- the programmer is free to return any pointer from the
thread function. If the program needs to access the thread's return
value, it can retrieve it via the second argument to pthread_join
. In
our example, the thread has no need to return a value, so it simply
returns a NULL
pointer.
14.2.3. Running the Code
The command that follows shows how to use GCC to compile
hellothreads.c. Building a Pthreads
application requires that the -pthread
linker flag be passed to GCC to
ensure that the Pthreads functions and types are accessible:
$ gcc -o hellothreads hellothreads.c -pthread
Running the program without a command line argument results in a usage message:
$ ./hellothreads
usage: ./hellothreads <n>
where <n> is the number of threads
Running the program with four threads yields the following output:
$ ./hellothreads 4
Hello world! I am thread 1
Hello world! I am thread 2
Hello world! I am thread 3
Hello world! I am thread 0
Notice that each thread prints its unique ID number. In this run, thread 1's output displays first, followed by threads 2, 3, and 0. If we run the program again, we may see the output displayed in a different order:
$ ./hellothreads 4
Hello world! I am thread 0
Hello world! I am thread 1
Hello world! I am thread 2
Hello world! I am thread 3
Recall that the operating system's scheduler determines the thread
execution order. From a user's perspective, the order is effectively
random due to being influenced by many factors that vary outside the
user's control (e.g., available system resources, the system receiving
input, or OS scheduling). Since all threads are running concurrently
with one another and each thread executes a call to printf
(which
prints to stdout
), the first thread that prints to stdout
will have
its output show up first. Subsequent executions may (or may not) result
in different output.
+-----------------------------------+-----------------------------------+ | | | | | Thread Execution Order | | | ::: | | | | | | ::: paragraph | | | You should never make any | | | assumptions about the order in | | | which threads will execute. If | | | the correctness of your program | | | requires that threads run in a | | | particular order, you must add | | | [sync | | | hronization](synchronization.ht | | | ml#_synchronizing_threads) | | | to your program to prevent | | | threads from running when they | | | shouldn't. | | | ::: | +-----------------------------------+-----------------------------------+
14.2.4. Revisiting Scalar Multiplication
Let's explore how to create a multithreaded implementation of the
scalar
multiplication
program from the previous section. Recall that our general strategy for
parallelizing scalar_multiply
is to:
-
Create multiple threads,
-
Assign each thread a subset of the input array,
-
Instruct each thread to multiply the elements in its array subset by
s
.
The following is a thread function that accomplishes this task. Notice
that we have moved array
, length
, and s
to the global scope of the
program.
long *array; //allocated in main
long length; //set in main (1 billion)
long nthreads; //number of threads
long s; //scalar
void *scalar_multiply(void *id) {
long *myid = (long *) id;
int i;
//assign each thread its own chunk of elements to process
long chunk = length / nthreads;
long start = *myid * chunk;
long end = start + chunk;
if (*myid == nthreads - 1) {
end = length;
}
//perform scalar multiplication on assigned chunk
for (i = start; i < end; i++) {
array[i] *= s;
}
return NULL;
}
Let's break this down into parts. Recall that the first step is to assign each thread a component of the array. The following lines accomplish this task:
long chunk = length / nthreads;
long start = *myid * chunk;
long end = start + chunk;
The variable chunk
stores the number of elements that each thread is
assigned. To ensure that each thread gets roughly the same amount of
work, we first set the chunk size to the number of elements divided by
the number of threads, or length / nthreads
.
Next, we assign each thread a distinct range of elements to process.
Each thread computes its range's start
and end
index using the
chunk
size and its unique thread ID.
For example, with four threads (with IDs 0-3) operating over an array
with 100 million elements, each thread is responsible for processing a
25 million element chunk
. Incorporating the thread ID assigns each
thread a unique subset of the input.
The next two lines account for the case in which length
is not evenly
divisible by the number of threads:
if (*myid == nthreads - 1) {
end = length;
}
Suppose that we specified three rather than four threads. The nominal chunk size would be 33,333,333 elements, leaving one element unaccounted for. The code in the previous example would assign the remaining element to the last thread.
+-----------------------------------+-----------------------------------+ | | | | | Creating balanced input | | | ::: | | | | | | ::: paragraph | | | The chunking code just shown is | | | imperfect. In the case where the | | | number of threads does not evenly | | | divide the input, the remainder | | | is assigned to the last thread. | | | Consider a sample run in which | | | the array has 100 elements, and | | | 12 threads are specified. The | | | nominal chunk size would be 8, | | | and the remainder would be 4. | | | With the example code, the first | | | 11 threads will each have 8 | | | assigned elements, whereas the | | | last thread will be assigned 12 | | | elements. Consequently, the last | | | thread performs 50% more work | | | than the other threads. A | | | potentially better way to chunk | | | this example is to have the first | | | 4 threads process 9 elements | | | each, while the last 8 threads | | | process 8 elements each. This | | | will result in better load | | | balancing of the input across | | | the threads. | | | ::: | +-----------------------------------+-----------------------------------+
With an appropriate local start
and end
index computed, each thread
is now ready to perform scalar multiplication on its component of the
array. The last portion of the scalar_multiply
function accomplishes
this:
for (i = start; i < end; i++) {
array[i] *= s;
}
14.2.5. Improving Scalar Multiplication: Multiple Arguments
A key weakness of the previous implementation is the wide use of global
variables. Our original discussion of global
variables
showed that although useful, global variables should generally be
avoided in C. To reduce the number of global variables in the program,
one solution is to declare a t_arg
struct as follows in the global
scope:
struct t_arg {
int *array; // pointer to shared array
long length; // num elements in array
long s; //scaling factor
long numthreads; // total number of threads
long id; // logical thread id
};
Our main function would, in addition to allocating array
and setting
local variables length
, nthreads
, and s
(our scaling factor),
allocate an array of t_arg
records:
long nthreads = strtol(argv[1], NULL, 10); //get number of threads
long length = strtol(argv[2], NULL, 10); //get length of array
long s = strtol( argv[3], NULL, 10 ); //get scaling factor
int *array = malloc(length*sizeof(int));
//allocate space for thread structs and identifiers
pthread_t *thread_array = malloc(nthreads * sizeof(pthread_t));
struct t_arg *thread_args = malloc(nthreads * sizeof(struct t_arg));
//Populate thread arguments for all the threads
for (i = 0; i < nthreads; i++){
thread_args[i].array = array;
thread_args[i].length = length;
thread_args[i].s = s;
thread_args[i].numthreads = nthreads;
thread_args[i].id = i;
}
Later in main
, when pthread_create
is called, the thread's
associated t_args
struct is passed as an argument:
for (i = 0; i < nthreads; i++){
pthread_create(&thread_array[i], NULL, scalar_multiply, &thread_args[i]);
}
Lastly, our scalar_multiply
function would look like the following:
void * scalar_multiply(void* args) {
//cast to a struct t_arg from void*
struct t_arg * myargs = (struct t_arg *) args;
//extract all variables from struct
long myid = myargs->id;
long length = myargs->length;
long s = myargs->s;
long nthreads = myargs->numthreads;
int * ap = myargs->array; //pointer to array in main
//code as before
long chunk = length/nthreads;
long start = myid * chunk;
long end = start + chunk;
if (myid == nthreads-1) {
end = length;
}
int i;
for (i = start; i < end; i++) {
ap[i] *= s;
}
return NULL;
}
Implementing this program fully is an exercise we leave to the reader. Please note that error handling has been omitted for the sake of brevity.