
Multithreading Programming: Concepts, Synchronization, and Best Practices

Multithreaded programming splits work across logical and hardware threads to exploit multicore CPUs. Doing it correctly requires careful use of synchronization primitives such as mutexes, read‑write locks, condition variables, and lock‑free atomics, and avoidance of pitfalls like race conditions, deadlocks, and false sharing.


Multithreaded programming is a key technology in modern software development that allows developers to split complex tasks into independent threads for parallel execution, fully utilizing multi‑core processors. However, it introduces challenges such as thread synchronization, deadlocks, and race conditions.

1. Thread Basics

A thread is an execution context with its own flow, call stack, registers, and other state. In Linux, a thread is represented as a Task in the kernel.
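A minimal sketch of spawning a thread with C++11's std::thread (the worker function and its argument are illustrative):

#include <cstdio>
#include <thread>

void worker(int id) {
  // This function body is the thread's execution flow; the thread
  // gets its own stack and register state to run it.
  printf("worker %d running\n", id);
}

int main() {
  std::thread t(worker, 1); // create a new execution context
  t.join();                 // wait for it to finish
  return 0;
}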

1.1 Execution Flow

Each thread has its own instruction sequence (execution flow). The following C++ code illustrates a simple function that will be compiled into a series of machine instructions executed by a thread:

#include <cstdio>

int calc(int a, int b, char op) {
  int c = 0;
  if (op == '+')
    c = a + b;
  else if (op == '-')
    c = a - b;
  else if (op == '*')
    c = a * b;
  else if (op == '/')
    c = a / b;
  else
    printf("invalid operation
");
  return c;
}

Because of branch instructions, instruction reordering, and out‑of‑order execution, the actual order of machine instructions may differ from the source order.
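To make this visible, here is a minimal sketch using C++11 relaxed atomics: under the C++ memory model both threads may read 0, an outcome only possible when a store is reordered after the following load (by the compiler or the CPU's store buffer).

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void t1() {
  x.store(1, std::memory_order_relaxed);
  r1 = y.load(std::memory_order_relaxed);
}

void t2() {
  y.store(1, std::memory_order_relaxed);
  r2 = x.load(std::memory_order_relaxed);
}

int main() {
  std::thread a(t1), b(t2);
  a.join(); b.join();
  // r1 == 0 && r2 == 0 is allowed: each store may be reordered
  // after the other thread's load under relaxed ordering.
  printf("r1=%d r2=%d\n", r1, r2);
}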

1.2 Logical vs. Hardware Threads

Logical thread: a software concept describing a sequence of operations to perform (e.g., a function that sums an array).
Hardware thread: the physical execution unit on a CPU core (or hyper‑thread) that actually runs a logical thread.

int sum(int a[], int n) {
  int x = 0;
  for (int i = 0; i < n; ++i)
    x += a[i];
  return x;
}
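To run this logical work on two hardware threads, it can be split across OS threads; a minimal sketch (parallel_sum and the halving split are illustrative):

#include <thread>

int sum(int a[], int n); // defined above

// Illustrative: give each half of the array to its own thread,
// then combine the partial sums.
int parallel_sum(int a[], int n) {
  int x1 = 0, x2 = 0;
  std::thread t1([&]{ x1 = sum(a, n / 2); });
  std::thread t2([&]{ x2 = sum(a + n / 2, n - n / 2); });
  t1.join();
  t2.join();
  return x1 + x2;
}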

1.3 Processes, Threads, and Coroutines

A process is an instance of a program in execution; a thread is the smallest schedulable unit within a process. Coroutines are user‑level execution flows with lower context‑switch cost than threads.

2. Thread Synchronization

When multiple threads share data, unsynchronized access can cause race conditions. Synchronization coordinates access to shared resources and enforces ordering.

2.1 Why Synchronization Is Needed

The following examples show the inconsistency that arises when two threads write to a shared buffer without protection (Example 1), or increment a shared integer concurrently (Example 2).

// Example 1: two threads share a message buffer without protection.
char msg[256] = "this is old msg";
char* read_msg() { return msg; }
void write_msg(char new_msg[], size_t len) {
  memcpy(msg, new_msg, std::min(len, sizeof(msg)));
}
void thread1() {
  char new_msg[256] = "this is new msg, it's too looooooong";
  write_msg(new_msg, sizeof(new_msg));
}
void thread2() { printf("msg=%s\n", read_msg()); } // may observe a half-written msg
// Example 2: two threads increment a shared integer. ++x is not
// atomic (load, add, store), so one increment can be lost.
int x = 0;
void thread1() { ++x; }
void thread2() { ++x; }

2.2 Locks

Mutexes provide exclusive access. The typical pattern is lock → access → unlock:

DataType shared_resource;
Mutex shared_resource_mutex;

void shared_resource_visitor1() {
  shared_resource_mutex.lock();
  // operate on shared_resource
  shared_resource_mutex.unlock();
}
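In C++, the same pattern is usually written with an RAII guard so the mutex is released even when the critical section returns early or throws; a minimal sketch with std::lock_guard (the int resource stands in for any shared data):

#include <mutex>

int shared_resource = 0;       // stands in for any shared data
std::mutex shared_resource_mutex;

void shared_resource_visitor1() {
  // lock() in the constructor, unlock() in the destructor, so
  // early returns and exceptions cannot leak the lock.
  std::lock_guard<std::mutex> guard(shared_resource_mutex);
  ++shared_resource;
}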

Read‑write locks allow multiple concurrent readers but exclusive writers.
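A sketch using C++17's std::shared_mutex (the counter and accessor names are illustrative):

#include <shared_mutex>

int counter = 0;               // illustrative shared data
std::shared_mutex counter_rwlock;

int read_counter() {
  std::shared_lock<std::shared_mutex> lock(counter_rwlock); // many readers at once
  return counter;
}

void write_counter(int v) {
  std::unique_lock<std::shared_mutex> lock(counter_rwlock); // exclusive writer
  counter = v;
}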

2.3 Condition Variables

Used in producer‑consumer scenarios to avoid busy‑waiting. A condition variable must be used together with a mutex.

#include <condition_variable>
#include <deque>
#include <mutex>

struct Msg;                      // defined elsewhere
Msg* read_msg_from_socket();     // defined elsewhere
void process(Msg* msg);          // defined elsewhere

std::deque<Msg*> msg_queue;
std::mutex msg_queue_mutex;
std::condition_variable msg_queue_not_empty;

void io_thread() {
  while (1) {
    Msg* msg = read_msg_from_socket();
    {
      std::lock_guard<std::mutex> lock(msg_queue_mutex);
      msg_queue.push_back(msg);
    }
    msg_queue_not_empty.notify_all();
  }
}

void work_thread() {
  while (1) {
    Msg* msg = nullptr;
    {
      std::unique_lock<std::mutex> lock(msg_queue_mutex);
      // wait() releases the mutex and blocks until notified; the
      // predicate guards against spurious wakeups.
      msg_queue_not_empty.wait(lock, []{ return !msg_queue.empty(); });
      msg = msg_queue.front();
      msg_queue.pop_front();
    }
    process(msg);
  }
}

2.4 Lock‑Free and Non‑Blocking Algorithms

Lock‑free algorithms guarantee system‑wide progress even if some threads are paused. They rely on atomic primitives such as Compare‑And‑Swap (CAS).

// Semantics of CAS, shown as ordinary code; the real operation
// executes atomically in hardware (e.g., lock cmpxchg on x86).
bool CAS(T* ptr, T expect, T new_val) {
  if (*ptr != expect) return false;
  *ptr = new_val;
  return true;
}

// Typical CAS loop: re-read the current value and retry until the
// swap succeeds.
T expect;
do {
  expect = *ptr;
} while (!CAS(ptr, expect, new_val));
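As a concrete application, the lost-update race from Example 2 can be repaired with a CAS loop on a std::atomic counter; a minimal sketch:

#include <atomic>

std::atomic<int> x{0};

// Increment via a CAS loop; equivalent to x.fetch_add(1).
void increment() {
  int expect = x.load(std::memory_order_relaxed);
  // On failure, compare_exchange_weak reloads expect with the
  // current value, so the loop retries with fresh data.
  while (!x.compare_exchange_weak(expect, expect + 1,
                                  std::memory_order_relaxed)) {}
}

void thread1() { increment(); }
void thread2() { increment(); }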

A lock‑free stack using C++ atomics:

template<typename T>
struct node { T data; node* next; };

template<typename T>
class stack {
  std::atomic<node<T>*> head;
public:
  void push(const T& v) {
    node<T>* n = new node<T>{v, nullptr};
    n->next = head.load(std::memory_order_relaxed);
    while (!head.compare_exchange_weak(n->next, n,
                                 std::memory_order_release,
                                 std::memory_order_relaxed)) {}
  }
};
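A brief usage sketch follows. A matching pop() is deliberately omitted: a safe lock-free pop must handle the ABA problem and memory reclamation (e.g., hazard pointers), which is beyond this example.

#include <thread>

stack<int> s;

void producer() {
  for (int i = 0; i < 1000; ++i)
    s.push(i); // safe to call concurrently from many threads
}

int main() {
  std::thread t1(producer), t2(producer);
  t1.join();
  t2.join();
  // Nodes are intentionally leaked in this sketch.
}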

3. False Sharing

False sharing occurs when multiple threads write to different variables that reside on the same cache line, causing the line to bounce between cores and degrading performance.

#include <atomic>

const size_t shm_size = 16 * 1024 * 1024; // 16 MiB
static char shm[shm_size];
std::atomic<size_t> shm_offset{0};

void f() {
  for (;;) {
    auto off = shm_offset.fetch_add(sizeof(long));
    if (off >= shm_size) break;
    // Adjacent 8-byte writes from different threads land on the
    // same 64-byte cache line, which ping-pongs between cores.
    *(long*)(shm + off) = off;
  }
}

Running two threads on this function takes ~3.4 s because each write invalidates the cache line used by the other core.

An improved version groups 16 writes per atomic increment, eliminating the cache‑line bounce:

void f_fast() {
  const long inner_loop = 16; // 16 * 8 B = 128 B claimed per fetch_add
  for (;;) {
    auto off = shm_offset.fetch_add(sizeof(long) * inner_loop);
    for (long j = 0; j < inner_loop; ++j) {
      if (off >= shm_size) return;
      // All writes within this batch come from one thread, so each
      // cache line is written by a single core.
      *(long*)(shm + off) = off;
      off += sizeof(long);
    }
  }
}

This version runs in ~0.06 s. The key is to keep writes that belong to the same cache line within a single thread, avoiding false sharing.

Another classic false‑sharing example is two threads updating adjacent fields of a struct:

struct Data { int a; int b; } data; // a and b share one cache line
void thread1() { data.a = 1; }
void thread2() { data.b = 2; }

Padding each field to a separate cache line removes the contention:

struct Data {
  int a;
  char padding[60]; // with 4-byte a, b starts at offset 64 (the next cache line)
  int b;
};
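Since C++11, the same effect can be expressed with alignas, which is less brittle than hand-counted padding; a sketch assuming a 64-byte cache line:

struct alignas(64) AlignedInt {
  int value; // each instance occupies its own 64-byte cache line
};

struct Data {
  AlignedInt a;
  AlignedInt b; // a and b can no longer share a cache line
};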

4. Summary

Effective multithreaded programming requires an understanding of thread concepts, synchronization primitives (mutexes, read‑write locks, condition variables), lock‑free techniques, memory ordering, and performance pitfalls such as false sharing. Choosing the right primitive for each job, keeping hot data on separate cache lines, and applying memory barriers where needed leads to correct, high‑performance concurrent software.
