
Commit c7b5f7c

[Support] Use atomic counter in parallelFor instead of per-task spawning (llvm#187989)
This function is primarily used by lld and debug info tools. Instead of pre-splitting work into up to MaxTasksPerGroup (1024) tasks and spawning each through the Executor's mutex+condvar, use an atomic counter for work distribution. Only ThreadCount workers are spawned; each grabs the next chunk via atomic fetch_add.

This reduces futex calls from ~31K (glibc, release+assertions build) to ~1.4K when linking clang-14 (191MB PIE with --export-dynamic) with `ld.lld --threads=8`: previously each parallelFor spawned up to 1024 tasks, each requiring a mutex lock + condvar signal.

```
                            Wall    System  futex
glibc (assertions) before:  927ms   897ms   31K
glibc (assertions) after:   879ms   765ms   1.4K
mimalloc before:            872ms   694ms   25K
mimalloc after:             830ms   661ms   1K
```
1 parent e73d8f8 commit c7b5f7c

1 file changed

Lines changed: 23 additions & 18 deletions


llvm/lib/Support/Parallel.cpp

```diff
@@ -256,26 +256,31 @@ void llvm::parallelFor(size_t Begin, size_t End,
                        llvm::function_ref<void(size_t)> Fn) {
 #if LLVM_ENABLE_THREADS
   if (parallel::strategy.ThreadsRequested != 1) {
-    auto NumItems = End - Begin;
-    // Limit the number of tasks to MaxTasksPerGroup to limit job scheduling
-    // overhead on large inputs.
-    auto TaskSize = NumItems / parallel::detail::MaxTasksPerGroup;
-    if (TaskSize == 0)
-      TaskSize = 1;
+    size_t NumItems = End - Begin;
+    if (NumItems == 0)
+      return;
+    // Distribute work via an atomic counter shared by NumWorkers threads,
+    // keeping the task count (and thus Linux futex calls) at O(ThreadCount).
+    // For lld, per-file work is somewhat uneven, so a multiplier > 1 is safer.
+    // While 2 vs 4 vs 8 makes no measurable difference, 4 is used as a
+    // reasonable default.
+    size_t NumWorkers = std::min<size_t>(NumItems, parallel::getThreadCount());
+    size_t ChunkSize = std::max(size_t(1), NumItems / (NumWorkers * 4));
+    std::atomic<size_t> Idx{Begin};
+    auto Worker = [&] {
+      while (true) {
+        size_t I = Idx.fetch_add(ChunkSize, std::memory_order_relaxed);
+        if (I >= End)
+          break;
+        size_t IEnd = std::min(I + ChunkSize, End);
+        for (; I < IEnd; ++I)
+          Fn(I);
+      }
+    };
 
     parallel::TaskGroup TG;
-    for (; Begin + TaskSize < End; Begin += TaskSize) {
-      TG.spawn([=, &Fn] {
-        for (size_t I = Begin, E = Begin + TaskSize; I != E; ++I)
-          Fn(I);
-      });
-    }
-    if (Begin != End) {
-      TG.spawn([=, &Fn] {
-        for (size_t I = Begin; I != End; ++I)
-          Fn(I);
-      });
-    }
+    for (size_t I = 0; I != NumWorkers; ++I)
+      TG.spawn(Worker);
     return;
   }
 #endif
```
