Hyper-Threading Speeds Linux

By Duc Vianney, Ph.D. - 2003-12-31

Hyper-Threading support in Linux kernel 2.5.x

Linux kernel 2.4.x has been aware of HT since the release of 2.4.17. Kernel 2.4.17 recognizes logical processors, and it treats a Hyper-Threaded processor as two physical processors. However, the scheduler in the stock 2.4.x kernel is still considered naive, because it cannot distinguish resource contention between two logical processors on the same physical processor from contention between two separate physical processors.

Ingo Molnar has pointed out scenarios in which the current scheduler gets things wrong (see Resources for a link). Consider a system with two physical CPUs, each of which provides two virtual processors. If two tasks are running, the current scheduler would let them both run on a single physical processor, even though far better performance would result from migrating one of them to the other physical CPU. The scheduler also doesn't understand that migrating a process from one virtual processor to its sibling (a logical CPU on the same physical CPU) is cheaper than migrating it across physical processors, because siblings share the physical processor's caches and less cache reloading is needed.
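The scenario above can be made concrete with a small model. This is a minimal sketch, not kernel code: the CPU numbering, the `PACKAGE_OF` map, the helper names, and the simple "difference of more than one task" threshold are all assumptions chosen for illustration.

```python
# Hypothetical topology: two physical CPUs, each exposing two logical CPUs.
PACKAGE_OF = {0: 0, 1: 0, 2: 1, 3: 1}  # logical CPU id -> physical CPU id


def logical_view_is_balanced(tasks_per_logical, max_diff=1):
    """Naive check: compare task counts across logical CPUs only."""
    loads = list(tasks_per_logical.values())
    return max(loads) - min(loads) <= max_diff


def physical_loads(tasks_per_logical):
    """Sum the task counts of the logical CPUs that share a physical CPU."""
    loads = {}
    for cpu, count in tasks_per_logical.items():
        package = PACKAGE_OF[cpu]
        loads[package] = loads.get(package, 0) + count
    return loads


def physical_view_is_balanced(tasks_per_logical, max_diff=1):
    """HT-aware check: compare task counts across physical CPUs."""
    loads = list(physical_loads(tasks_per_logical).values())
    return max(loads) - min(loads) <= max_diff


# Two tasks, both placed on the logical CPUs of physical CPU 0.
tasks = {0: 1, 1: 1, 2: 0, 3: 0}
naive_ok = logical_view_is_balanced(tasks)      # no logical CPU looks overloaded
ht_aware_ok = physical_view_is_balanced(tasks)  # physical CPU 0 carries everything
```

In this toy model the logical view reports no imbalance, because no logical CPU holds more than one task, while the physical view shows one package carrying both tasks and the other carrying none.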

The solution is to change the way the run queues work. The 2.5 scheduler maintains one run queue per processor and attempts to avoid moving tasks between queues. The change is to have one run queue per physical processor that is able to feed tasks into all of the virtual processors. Throw in a smarter sense of what makes an idle CPU (all virtual processors must be idle), and the resulting code "magically fulfills" the needs of scheduling on a Hyper-Threading system.
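A shared run queue of this kind might be modeled roughly as follows. The class, its method names, and the FIFO policy are hypothetical and exist only to illustrate the idea; the real scheduler uses priority arrays and locking that are not shown here.

```python
from collections import deque


class SharedRunqueue:
    """Toy model: one run queue per physical CPU, feeding all its logical CPUs."""

    def __init__(self, logical_cpus):
        self.queue = deque()                                # runnable, not yet running
        self.running = {cpu: None for cpu in logical_cpus}  # current task per logical CPU

    def enqueue(self, task):
        self.queue.append(task)

    def schedule(self, cpu):
        """Let one logical CPU take the next task from the shared queue."""
        self.running[cpu] = self.queue.popleft() if self.queue else None
        return self.running[cpu]

    def is_idle(self):
        """The physical CPU counts as idle only if every logical CPU is idle."""
        return all(task is None for task in self.running.values())


rq = SharedRunqueue(logical_cpus=[0, 1])
rq.enqueue("task-A")
picked = rq.schedule(1)          # either sibling may pull from the same queue
idle_while_running = rq.is_idle()
```

Because both siblings draw from one queue, a task never has to be migrated between queues to move from one logical CPU to its sibling.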

In addition to the run queue change in the 2.5 scheduler, other changes are needed to give the Linux kernel the ability to leverage HT for optimal performance. Molnar discussed those changes (again, see Resources for more) as follows.

HT-aware passive load-balancing:
The IRQ-driven balancing has to be per-physical-CPU, not per-logical-CPU. Otherwise, one physical CPU might run two tasks while another physical CPU runs none, and the stock scheduler does not recognize this condition as an "imbalance." To the scheduler, it appears as if the first two CPUs have 1-1 tasks running while the second two CPUs have 0-0 tasks running. The stock scheduler does not realize that the two logical CPUs belong to the same physical CPU.
"Active" load-balancing:
This is when a logical CPU goes idle and causes a physical CPU imbalance. This is a mechanism that simply does not exist in the stock 1:1 scheduler. The imbalance caused by an idle CPU can be solved via the normal load-balancer. In the case of HT, the situation is special because the source physical CPU might have just two tasks running, both runnable. This is a situation that the stock load-balancer is unable to handle, because running tasks are hard to migrate away. This migration is essential -- otherwise a physical CPU can get stuck running two tasks while another physical CPU stays idle.
HT-aware task pickup:
When the scheduler picks a new task, it should prefer all tasks that share the same physical CPU before trying to pull in tasks from other CPUs. The stock scheduler only picks tasks that were scheduled to that particular logical CPU.
HT-aware affinity:
Tasks should attempt to "stick" to physical CPUs, not logical CPUs.
HT-aware wakeup:
The stock scheduler only knows about the "current" CPU; it does not know about any sibling. On HT, if a thread is woken up on a logical CPU that is already executing a task, and a sibling CPU is idle, then the sibling CPU has to be woken up and has to execute the newly woken-up task immediately.
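The HT-aware wakeup rule can be sketched as a small decision function. This is an illustration under stated assumptions, not the logic of Molnar's patch: the `SIBLINGS` map, the `pick_wakeup_cpu` name, and the dictionary of running tasks are all made up for this example.

```python
# Hypothetical sibling map: logical CPUs 0/1 share one physical CPU, 2/3 the other.
SIBLINGS = {0: [1], 1: [0], 2: [3], 3: [2]}


def pick_wakeup_cpu(current_cpu, running):
    """Choose the logical CPU that should run a newly woken task.

    `running` maps each logical CPU id to its current task, or None if idle.
    """
    if running[current_cpu] is None:
        return current_cpu            # the waking CPU is free, so use it
    for sibling in SIBLINGS[current_cpu]:
        if running[sibling] is None:
            return sibling            # busy here, but an idle sibling can run it now
    return current_cpu                # no idle sibling, fall back to the current CPU


running = {0: "task-A", 1: None, 2: None, 3: None}
target = pick_wakeup_cpu(0, running)
```

A scheduler that knows only the "current" CPU would always take the fallback branch, leaving the woken task waiting behind `task-A` while the sibling sits idle.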

At this writing, Molnar has provided a patch to stock kernel 2.5.32 implementing all the above changes by introducing the concept of a shared runqueue: multiple CPUs can share the same runqueue. A shared, per-physical-CPU runqueue fulfills all of the HT-scheduling needs listed above. Obviously this complicates scheduling and load-balancing, and the effects on the SMP and uniprocessor scheduler are still unknown.

The change in Linux kernel 2.5.32 was designed to affect Xeon systems with more than two CPUs, especially in the load-balancing and thread affinity arenas. Due to hardware resource constraints, we were only able to measure its effects in our one-CPU test environment. Using the same testing process employed in 2.4.19, we ran the three workloads, chat, dbench, and tbench, on 2.5.32. For chat, HT could bring as much as a 60% speed-up in the case of 40 chat rooms. The overall improvement was about 45%. For dbench, 27% was the high speed-up mark, with the overall improvement about 12%. For tbench, the overall improvement was about 35%.

Table 7. Effects of Hyper-Threading on Linux kernel 2.5.32

chat workload
Number of chat rooms	2532s-noht	2532s-ht	Speed-up
20	137,792	207,788	51%
30	138,832	195,765	41%
40	144,454	231,509	47%
50	137,745	191,834	39%
Geometric Mean	139,678	202,034	45%
dbench workload
Number of clients	2532s-noht	2532s-ht	Speed-up
20	142.02	180.87	27%
30	129.63	141.19	9%
60	84.76	86.02	1%
90	67.89	70.37	4%
120	57.44	70.59	23%
Geometric Mean	90.54	101.76	12%
tbench workload
Number of clients	2532s-noht	2532s-ht	Speed-up
20	60.28	82.23	36%
30	60.12	81.72	36%
60	59.73	81.2	36%
90	59.71	80.79	35%
120	59.73	79.45	33%
Geometric Mean	59.91	81.07	35%
Note: chat data are the number of messages sent by the client per second; dbench and tbench data are throughput in MB/sec.
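For readers who want to rework the summary rows, the sketch below recomputes the geometric means and the overall speed-up for the dbench figures in Table 7. It assumes the Speed-up column is the HT value divided by the non-HT value, minus one, which matches the dbench rows; the function names are arbitrary.

```python
import math


def geometric_mean(values):
    """Geometric mean, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def speedup_percent(baseline, with_ht):
    """Relative gain of the HT run over the non-HT run, in percent."""
    return (with_ht / baseline - 1.0) * 100.0


# dbench throughput in MB/sec from Table 7 (20, 30, 60, 90, 120 clients).
noht = [142.02, 129.63, 84.76, 67.89, 57.44]
ht = [180.87, 141.19, 86.02, 70.37, 70.59]

gm_noht = geometric_mean(noht)
gm_ht = geometric_mean(ht)
overall = speedup_percent(gm_noht, gm_ht)

print(f"geometric means: {gm_noht:.2f} -> {gm_ht:.2f}, speed-up {overall:.0f}%")
```

The geometric mean is used rather than the arithmetic mean so that no single client count dominates the summary.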

First published by IBM developerWorks