fix(core): optimize warm cache performance for task execution #35172

FrozenPandaz wants to merge 8 commits into master
Conversation
Batch daemon calls, add a bulk cache resolution fast path, and parallelize output hash checking to dramatically improve warm cache hit performance.

- Batch daemon calls for recordOutputsHash and outputsHashesMatch
- Add resolveCachedTasksBulk fast path for bulk cache resolution
- Cache readProjectsConfigurationFromProjectGraph in getExecutorForTask
- Add verified-match cache in daemon to skip redundant filesystem scans
- Add Rayon-parallel get_files_for_outputs_batch in Rust
- Batch-hash unhashed tasks per topological level
- Skip recordOutputsHash for local-cache-kept-existing
- Add NxCache.get_batch() in Rust: single SQL query + Rayon-parallel file reads instead of N individual JS→Rust round-trips
- Wire resolveCachedTasksBulk into the coordinator loop to bulk-resolve cache hits with batched daemon calls
- Batch task scheduling with a single sort instead of re-sorting per insert
- Remove unused executeDiscreteTaskLoop
Co-authored-by: FrozenPandaz <FrozenPandaz@users.noreply.github.com>
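The batching idea in these commits can be sketched as follows; `FakeDaemon` and the entry shape are hypothetical stand-ins for the daemon IPC client, used only to show that N per-task round-trips collapse into one:

```typescript
// Hypothetical sketch: one daemon round-trip carrying every hash entry
// instead of one round-trip per task. Only the call count matters here.
type HashEntry = { taskId: string; hash: string };

class FakeDaemon {
  roundTrips = 0;
  private recorded = new Map<string, string>();

  // Per-task call: one round-trip each (the old hot path).
  async outputsHashesMatch(e: HashEntry): Promise<boolean> {
    this.roundTrips++;
    return this.recorded.get(e.taskId) === e.hash;
  }

  // Batched call: a single round-trip carrying every entry.
  async outputsHashesMatchBatch(entries: HashEntry[]): Promise<boolean[]> {
    this.roundTrips++;
    return entries.map((e) => this.recorded.get(e.taskId) === e.hash);
  }

  async recordOutputsHashBatch(entries: HashEntry[]): Promise<void> {
    this.roundTrips++;
    for (const e of entries) this.recorded.set(e.taskId, e.hash);
  }
}

async function demo() {
  const daemon = new FakeDaemon();
  const entries: HashEntry[] = Array.from({ length: 1100 }, (_, i) => ({
    taskId: `proj${i}:build`,
    hash: `h${i}`,
  }));

  await daemon.recordOutputsHashBatch(entries); // 1 round-trip for 1,100 tasks
  const matches = await daemon.outputsHashesMatchBatch(entries); // 1 round-trip

  return { roundTrips: daemon.roundTrips, allMatch: matches.every(Boolean) };
}
```

With the per-task API, the same workload would cost ~2,200 round-trips; the batched API does it in two.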
Two bugs in the batch scheduling optimization:

1. Collecting all schedulable roots at once bypassed parallelism checks: scheduling a non-parallel task must block subsequent tasks from being scheduled in the same pass.
2. The sort comparator was non-transitive when two tasks both lacked historical timing data, causing non-deterministic ordering.
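The comparator bug can be illustrated with a minimal sketch; the names here are hypothetical, not the actual Nx scheduler code. A comparator must be consistent: if `compare(a, b)` is positive, `compare(b, a)` must be negative.

```typescript
interface SchedulableTask {
  id: string;
  avgDuration?: number; // historical timing; undefined if never recorded
}

// Buggy shape: when both durations are undefined, it returns 1 for both
// compare(a, b) and compare(b, a), so the final order depends on input order.
function buggyCompare(a: SchedulableTask, b: SchedulableTask): number {
  if (a.avgDuration === undefined) return 1;
  if (b.avgDuration === undefined) return -1;
  return b.avgDuration - a.avgDuration; // longer historical tasks first
}

// Fixed: a deterministic tiebreak (task id) makes the comparator a total
// order, so sorting no longer depends on insertion order.
function fixedCompare(a: SchedulableTask, b: SchedulableTask): number {
  if (a.avgDuration === undefined && b.avgDuration === undefined) {
    return a.id.localeCompare(b.id);
  }
  if (a.avgDuration === undefined) return 1;
  if (b.avgDuration === undefined) return -1;
  return b.avgDuration - a.avgDuration;
}
```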
The coordinator loop used a separate workerCompletedCallbacks list while the continuous task loop used waitingForTasks. This meant when a task completed via scheduleNextTasksAndReleaseThreads, only the continuous loop was woken — the coordinator would stay blocked, causing deadlocks in e2e scenarios with both discrete and continuous tasks. Consolidate both loops to use waitingForTasks.
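A minimal sketch of the consolidated wake-list pattern, with hypothetical names: both loops park on the same array of resolvers, so one completion notifies every waiter regardless of which loop is blocked.

```typescript
// Illustrative only, not Nx's actual TaskOrchestrator. Loops with nothing
// to do park on `waitingForTasks`; a completing task wakes all of them.
class WakeList {
  private waitingForTasks: Array<(v: null) => void> = [];

  // Park the calling loop until someone calls notifyAll().
  wait(): Promise<null> {
    return new Promise((res) => this.waitingForTasks.push(res));
  }

  // Wake every parked loop and clear the list (mirrors the pattern in the
  // diff below: forEach(f => f(null)) then length = 0).
  notifyAll(): void {
    this.waitingForTasks.forEach((f) => f(null));
    this.waitingForTasks.length = 0;
  }
}
```

With two separate lists, a completion that only notified one list would leave the other loop parked forever, which is exactly the deadlock described above.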
…ht count

Race condition: scheduleNextTasksAndReleaseThreads wakes the coordinator before .finally() decrements inFlightWorkers. The coordinator sees inFlightWorkers > 0, goes back to sleep, then .finally() decrements but nobody wakes the coordinator again — deadlock. Fix: also fire waitingForTasks from .finally() so the coordinator re-evaluates the exit condition after the decrement.
Important
At least one additional CI pipeline execution has run since the conclusion below was written and it may no longer be applicable.
Nx Cloud is proposing a fix for your failed CI:
We added a .catch() handler on the fire-and-forget applyFromCacheOrRunTask call to capture errors (e.g. remote cache 401/connection failures from cache.put) that were previously silently swallowed by .finally(). The captured error is re-thrown after the coordinator loop exits, restoring the original behavior where these errors propagated to the CLI, printed the diagnostic message, and exited with a non-zero code.
Note
⏳ We are verifying this fix by re-running e2e-nx:e2e-ci--src/cache.test.ts.
Suggested fix
diff --git a/packages/nx/src/tasks-runner/task-orchestrator.ts b/packages/nx/src/tasks-runner/task-orchestrator.ts
index 0c75521158..bbf924e089 100644
--- a/packages/nx/src/tasks-runner/task-orchestrator.ts
+++ b/packages/nx/src/tasks-runner/task-orchestrator.ts
@@ -227,6 +227,7 @@ export class TaskOrchestrator {
parallelism: number
) {
let inFlightWorkers = 0;
+ let firstWorkerError: unknown = null;
while (true) {
if (this.bailed || this.stopRequested) break;
@@ -287,8 +288,18 @@ export class TaskOrchestrator {
dispatched = true;
inFlightWorkers++;
const groupId = this.closeGroup();
- this.applyFromCacheOrRunTask(doNotSkipCache, task, groupId).finally(
- () => {
+ this.applyFromCacheOrRunTask(doNotSkipCache, task, groupId)
+ .catch((e) => {
+ // Capture the first worker error so it can be re-thrown after
+ // the coordinator loop exits. This preserves the old behavior
+ // where errors (e.g. remote cache 401/connection failures) from
+ // cache.put propagated to crash the build with a visible message.
+ if (!firstWorkerError) firstWorkerError = e;
+ this.bailed = true;
+ this.waitingForTasks.forEach((f) => f(null));
+ this.waitingForTasks.length = 0;
+ })
+ .finally(() => {
this.openGroup(groupId);
inFlightWorkers--;
// Wake coordinator — the decrement above may satisfy the
@@ -296,8 +307,7 @@ export class TaskOrchestrator {
// when scheduleNextTasksAndReleaseThreads fired earlier.
this.waitingForTasks.forEach((f) => f(null));
this.waitingForTasks.length = 0;
- }
- );
+ });
}
if (dispatched) continue;
@@ -309,6 +319,11 @@ export class TaskOrchestrator {
// 7. Wait for a worker to finish (woken by scheduleNextTasksAndReleaseThreads)
await new Promise((res) => this.waitingForTasks.push(res));
}
+
+ // Re-throw any error captured from a fire-and-forget worker so it
+ // propagates to run() and ultimately the CLI (which prints the message
+ // and exits with non-zero code).
+ if (firstWorkerError) throw firstWorkerError;
}
private async executeContinuousTaskLoop(continuousTaskCount: number) {
When applyFromCacheOrRunTask rejects (e.g. remote cache 401), the fire-and-forget dispatch silently swallowed the error. The task would never complete and the error message was lost. Add .catch() that prints the error via the lifecycle and marks the task as failed through postRunSteps, matching the behavior of the original sequential dispatch.
Current Behavior
When running cached tasks, the task orchestrator processes each task individually: separate cache lookups, individual daemon IPC calls for output hash checking/recording, and per-task scheduling with a full array re-sort on each insert. For a workspace with 1,110 projects, this means ~1,100 individual JS→Rust→SQLite cache lookups, ~2,200 sequential daemon round-trips, and O(n² log n) scheduling overhead.
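The scheduling portion of that overhead comes from re-sorting the queue on every insert. A minimal sketch of the difference (numbers stand in for tasks; this is illustrative, not Nx's actual scheduler):

```typescript
// Per-insert re-sort: O(n^2 log n) over a run of n tasks.
function schedulePerInsert(queue: number[], ready: number[]): number[] {
  for (const task of ready) {
    queue.push(task);
    queue.sort((a, b) => a - b); // full re-sort on every insert
  }
  return queue;
}

// Batched: O(n log n) per scheduling pass, same final order.
function scheduleBatched(queue: number[], ready: number[]): number[] {
  queue.push(...ready);
  queue.sort((a, b) => a - b); // single sort per pass
  return queue;
}
```

Both produce the same ordering; only the amount of sorting work differs.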
Expected Behavior
Warm cache runs should resolve quickly by batching all hot-path operations: cache lookups, daemon calls, scheduling, and filesystem scans.
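The bulk cache lookup can be sketched in TypeScript as follows. This is a hypothetical stand-in, not Nx's real schema or native binding: `db`, the table and column names, and the `terminalOutput` file name are invented, and the real native query is an `UPDATE ... RETURNING` (simplified here to a `SELECT`). The point is one query with N placeholders plus concurrent file reads (Rayon on the Rust side; `Promise.all` here).

```typescript
import { readFile } from "node:fs/promises";

interface CachedRow {
  hash: string;
  code: number;
  outputsPath: string;
}

async function getBatch(
  db: { all(sql: string, params: string[]): CachedRow[] }, // stand-in handle
  hashes: string[]
): Promise<Map<string, { code: number; terminalOutput: string }>> {
  if (hashes.length === 0) return new Map();

  // Single round-trip: one query with N placeholders instead of N queries.
  const placeholders = hashes.map(() => "?").join(", ");
  const rows = db.all(
    `SELECT hash, code, outputs_path AS outputsPath
       FROM cache_outputs
      WHERE hash IN (${placeholders})`,
    hashes
  );

  // Read all terminal outputs concurrently; a missing file yields "".
  const entries = await Promise.all(
    rows.map(async (row) => {
      const terminalOutput = await readFile(
        `${row.outputsPath}/terminalOutput`,
        "utf8"
      ).catch(() => "");
      return [row.hash, { code: row.code, terminalOutput }] as const;
    })
  );
  return new Map(entries);
}
```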
Key Changes
**Rust Native (`cache.rs`)**

- `NxCache.get_batch()`: single SQL `UPDATE ... WHERE hash IN (...) RETURNING` query + Rayon-parallel terminal output file reads. Replaces N individual JS→Rust boundary crossings and SQLite queries with 1.
- `get_files_for_outputs_batch()`: Rayon-parallel filesystem scanning for output expansion (from first commit).

**Task Orchestrator (`task-orchestrator.ts`)**

- Wires `resolveCachedTasksBulk` into the coordinator loop: bulk-resolves all cache hits before falling through to individual workers. Uses batched daemon calls for output hash checking.
- Skips `processTask` lifecycle calls for tasks resolved from cache — only processes remaining cache misses.
- Removes `executeDiscreteTaskLoop`: unused after the coordinator introduction.

**Task Scheduler (`tasks-schedule.ts`)**

- Batches task scheduling with a single sort per pass instead of re-sorting on each insert.

**Cache (`cache.ts`)**

- `DbCache.getBatch()`: TypeScript wrapper for the native batch cache lookup.

**Daemon Outputs Tracking (`outputs-tracking.ts`)**

- Short-circuits `outputsHashesMatch` and `outputsHashesMatchBatch` when the daemon has no recorded hashes, avoiding unnecessary Rayon filesystem scans after `nx reset`.

**From First Commit**

- `recordOutputsHashBatch` and `outputsHashesMatchBatch`: batched daemon calls
- Cached `readProjectsConfigurationFromProjectGraph`: avoids rebuilding the project map per task
- Skip `recordOutputsHash` for `local-cache-kept-existing`: already has correct hashes

**Benchmark Results (1,110 projects, warm cache)**
Related Issue(s)