
Threading and Parallelism
We open up the task manager, see 16 logical processors, and assume we can run 16 tasks in parallel at the same time. That's not quite true, and I am going to show you why by running an experiment in C++.
First Principles: What Is a "Core" vs a "Thread"?
Physical Cores
A physical core is an independent processing unit with its own ALU (arithmetic logic unit), registers, and execution pipeline. It can fetch, decode, and execute instructions completely independently of every other core on the chip.
This is real hardware parallelism. If you have 8 physical cores, you can genuinely execute 8 independent instruction streams simultaneously. Not time-sliced, not interleaved, truly simultaneous at the transistor level.
Logical Threads (SMT / Hyperthreading)
Intel's Hyper-Threading (and AMD's Simultaneous Multi-Threading) doubles the advertised thread count by allowing two instruction streams to share one physical core's execution resources.
But here's the critical detail: they share the same ALU, the same execution units, the same cache. The CPU holds two complete sets of architectural state simultaneously. The out-of-order scheduler dispatches instructions from both threads to available execution units each cycle — when Thread A can't fill every execution port (due to a data dependency or cache miss), Thread B's instructions fill the gap in that same cycle.
It does NOT double your throughput. For workloads that keep the execution pipeline fully occupied, the sibling thread gets almost nothing.
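To see which logical CPUs are SMT siblings of a given core, Linux exposes the topology through sysfs. Here is a minimal sketch (Linux-specific; it assumes the sysfs topology files are present, which they are on typical kernels):

#include <fstream>
#include <iostream>
#include <string>

int main() {
    // SMT siblings of logical CPU 0, e.g. "0,8" on an 8-core / 16-thread chip
    std::ifstream f("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");
    std::string siblings;
    if (f && std::getline(f, siblings)) {
        std::cout << "CPU 0 shares its physical core with logical CPU(s): " << siblings << "\n";
    } else {
        std::cout << "topology information not available on this system\n";
    }
}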
OS Threads
When you create a std::thread in C++, you're creating an OS-level thread managed by the kernel's scheduler. The OS can create thousands of them, far more than the number of physical cores.
The scheduler gives each runnable thread a small time slice (CFS uses a dynamic targeted latency of ~6ms by default, divided across all runnable threads — so with many threads active, each slice can drop well below 1ms), then preempts it and runs the next one.
But if two CPU-bound threads share one core, the total work done doesn't increase; the CPU just alternates between them.
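As a quick illustration, here is a minimal sketch (not part of the experiment itself) that spawns far more std::thread objects than the machine has logical CPUs; the kernel happily runs all of them by time-slicing:

#include <iostream>
#include <thread>
#include <vector>

int main() {
    unsigned logical = std::thread::hardware_concurrency();  // logical CPUs, e.g. 16
    std::cout << "logical CPUs: " << logical << "\n";

    // Spawn far more threads than the machine can execute at once;
    // the kernel time-slices them across the available cores.
    std::vector<std::thread> threads;
    for (int i = 0; i < 100; ++i) {
        threads.emplace_back([] {
            volatile unsigned long x = 0;
            for (unsigned long j = 0; j < 1000000; ++j) x = x + j;  // small CPU burst
        });
    }
    for (auto& t : threads) t.join();
    std::cout << "all 100 threads completed\n";
}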
The Experiment Design
I have two components:
worker.cpp — A CPU Torture Tool
It's a C++ program that:
- Pins itself to a specific CPU core using pthread_setaffinity_np() — this bypasses the OS scheduler's freedom to migrate threads between cores
- Runs a tight arithmetic loop (xorshift + golden ratio mixing) that keeps the integer execution pipeline fully occupied
- Uses a per-thread sink array to prevent dead-code elimination while avoiding false sharing between threads at shutdown
- Reports precise operation counts so we can measure throughput
static uint64_t cpu_intensive_work(std::atomic<bool>& running,
                                   int core_id,
                                   int thread_index,
                                   int sleep_us = 0,
                                   int batch_size = 100000) {
    pin_to_core(core_id);
    uint64_t counter = 0;
    uint64_t state = 0xdeadbeefcafe1234ULL ^ (uint64_t)core_id;

    while (running.load(std::memory_order_relaxed)) {
        // Tight arithmetic batch — no branches, no memory ops
        for (int i = 0; i < batch_size; ++i) {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            state += 0x9e3779b97f4a7c15ULL; // golden ratio constant
            ++counter;
        }
        // Optional idle window (simulates I/O wait)
        if (sleep_us > 0) {
            usleep(static_cast<useconds_t>(sleep_us));
        }
    }

    global_sink[thread_index].value = state;
    return counter;
}
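The pin_to_core helper and global_sink array are not shown above. Here is a minimal sketch of how pin_to_core might look on Linux using pthread_setaffinity_np (the exact implementation in worker.cpp may differ):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for CPU_SET and pthread_setaffinity_np on glibc
#endif
#include <pthread.h>
#include <sched.h>

// Restrict the calling thread to one logical CPU so the scheduler
// cannot migrate it to a different core mid-experiment.
static void pin_to_core(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}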
orchestrator.py — A Python script that runs the worker with the right core pinning and idle settings for each experiment and collects the reported throughput
All experiments ran on an 8-core / 16-thread machine for 5 seconds each, using a tight arithmetic loop (xorshift + golden ratio mixing) that keeps the execution pipeline fully saturated.
Experiment 1: Single Thread Baseline
One thread, one core
Result: ~651 million operations per second on a single core
Lesson: This is the maximum throughput a single core can deliver for CPU-bound work. Every subsequent experiment is measured against this number.
Experiment 2: Two Busy Threads, Same Core
Both threads pinned to core 0. Both doing continuous computation.
Result: Total throughput equals the baseline — not double. Each thread got exactly half the core's time.
Inference: The second thread didn't bring additional compute power. The scheduler split the single core's execution budget between two threads. No extra work was produced.
Lesson: On a CPU-saturated core, adding a second thread produces zero additional throughput. You're dividing the same pie into smaller pieces, not baking a second one.
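For illustration, here is a hypothetical C++ driver for this experiment (the actual runs were coordinated by orchestrator.py): two busy workers, both pinned to core 0, running for 5 seconds.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

// Hypothetical driver for Experiment 2: two busy workers on core 0.
// cpu_intensive_work (and global_sink) come from worker.cpp above.
int main() {
    std::atomic<bool> running{true};
    uint64_t ops[2] = {0, 0};

    std::thread t1([&] { ops[0] = cpu_intensive_work(running, /*core_id=*/0, /*thread_index=*/0); });
    std::thread t2([&] { ops[1] = cpu_intensive_work(running, /*core_id=*/0, /*thread_index=*/1); });

    std::this_thread::sleep_for(std::chrono::seconds(5));
    running.store(false);

    t1.join();
    t2.join();
    std::printf("total: %.1fM ops over 5s\n", (ops[0] + ops[1]) / 1e6);
}

Experiment 3 is the same driver with sleep_us > 0 for one of the workers, and Experiment 4 simply changes the second worker's core_id to 2.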
Experiment 3: What if Thread 1 Has Idle Time?
Thread 1 periodically sleeps (simulating I/O). Thread 2 stays busy. Both on core 0.
Result: At 1000μs idle time, Thread 1 only produces 401.8M ops, but Thread 2 fills those idle windows and produces 2.79B — approaching full core capacity. Total throughput stays near baseline.
Inference: When one thread yields (sleeps, waits for I/O), the OS scheduler immediately gives the core to another runnable thread. The core stays busy even though individual threads are idle.
Lesson: Threading works for I/O-bound applications (web servers, database drivers, file readers) because threads spend most of their time waiting. During that wait, other threads use the core. But for CPU-bound work that never yields, there's nothing to fill — the core is already at 100%.
Experiment 4: Two Threads on Different Cores
Thread 1 on core 0, Thread 2 on core 2 (a different physical core).
Result: ~1.95× speedup — each core independently executing its own instruction stream.
Inference: This is real parallelism — two pieces of silicon simultaneously doing computation. The drop from a perfect 2× is due to shared L3 cache, memory bus, and power delivery contention under load.
Lesson: True parallelism comes from separate physical cores, not from more threads on one core. If you need 2× throughput for CPU-bound work, you need 2 physical cores.
Experiment 5: Scaling from 1 to 16 Threads
Each worker is pinned to its own logical CPU. Workers beyond 8 land on SMT siblings of already-busy physical cores.
Result: Even before reaching the SMT boundary, 8 physical cores yield only 6.16× — not the 8× you might expect. Cache contention, memory-bandwidth saturation, and thermal throttling already cost nearly two cores' worth of theoretical throughput on physical cores alone. Crossing into SMT territory (9–16 workers) gives only ~10.85× total — not 16×.
Inference: The extra 8 SMT siblings add some throughput by using execution units during stalls, but they don't double it. Per-worker throughput decreases as we add threads due to:
- Shared resources — All cores contend for L3 cache and memory bandwidth
- SMT competition — At 16 threads, each physical core runs 2 threads competing for the same execution units
- Thermal throttling — More active cores = more heat = lower clock speeds
Lesson: Your "16-thread" CPU gives you ~6× from physical cores under contention, plus a modest boost from SMT. Scaling past the physical core count yields diminishing returns. When sizing infrastructure for CPU-bound workloads, count physical cores, not threads.
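If you want to count physical cores rather than logical CPUs at runtime, std::thread::hardware_concurrency() is not enough, since it reports logical CPUs. A Linux-specific sketch that counts unique (package, core) pairs from sysfs (paths assumed to be available):

#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <thread>
#include <utility>

int main() {
    unsigned logical = std::thread::hardware_concurrency();  // logical CPUs (threads)
    std::set<std::pair<int, int>> physical;                  // unique (package, core) pairs
    for (unsigned cpu = 0; cpu < logical; ++cpu) {
        std::string base = "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/topology/";
        int pkg = -1, core = -1;
        std::ifstream(base + "physical_package_id") >> pkg;
        std::ifstream(base + "core_id") >> core;
        if (pkg >= 0 && core >= 0) physical.insert({pkg, core});
    }
    std::cout << logical << " logical CPUs, " << physical.size() << " physical cores\n";
}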
Conclusion
OS threads do not guarantee parallelism. A thread is a scheduling abstraction — whether it executes in parallel depends entirely on whether it gets mapped to a separate physical core. Threads on distinct physical cores are genuine simultaneous execution. Threads sharing a core are time-sliced illusions. Real parallelism comes from physically separate execution units.
Next time you hit a bottleneck, don't just "add more threads." Check which cores are saturated, whether you're I/O-bound or CPU-bound, and plan accordingly.
Thank you to my seniors Amrutansh and Priyanshu Sir for their motivation behind this experiment.
P.S. I wrote this after only a few hours of searching and reading on the internet, so I am open to corrections and better reasoning for the results put up in this blog.
Email: ysinghcin@gmail.com