Optimizing AutoResearch on RTX 5090: What Failed and What Worked

Initial Problems and Working Path
The initial setup for running AutoResearch on an RTX 5090 (Blackwell) system was "badly broken": the code technically ran, but throughput was only a few thousand tokens per second and MFU (Model FLOPs Utilization) was too low to be useful.
The working configuration path involved:
- Avoiding the broken full-model compile path on this setup
- Keeping the good fused optimizer compile improvements where they actually helped
- Using the stable SDPA/CuDNN attention path
- Tuning total batch and time budget empirically instead of guessing
- Automating the benchmark/extract/strategize/rerun loop
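The last item, the automated loop, can be sketched roughly as follows. This is a minimal illustration, not the developer's actual tooling: `run_lock`, `auto_loop`, the `run_benchmark` callback, and the lock-file path are all hypothetical names. The point is only that the lock is released in a `finally` block, so a crashed run cannot leave a stale lock that blocks the next dispatch.

```python
import os
from contextlib import contextmanager


@contextmanager
def run_lock(path):
    """Hold an exclusive lock file and always remove it, even if the run fails."""
    # O_EXCL makes creation fail if another run already holds the lock.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.remove(path)


def auto_loop(configs, run_benchmark, lock_path):
    """Run each config in order and return the (config, metrics) pair with the lowest val_bpb.

    run_benchmark is a hypothetical callback: it takes a config dict and
    returns a metrics dict that contains at least "val_bpb".
    """
    best = None
    for cfg in configs:
        with run_lock(lock_path):
            metrics = run_benchmark(cfg)  # benchmark + extract
        if best is None or metrics["val_bpb"] < best[1]["val_bpb"]:
            best = (cfg, metrics)  # strategize: remember the current winner
    return best
```

A real loop would also generate new candidate configs from the current winner; here the candidate list is fixed to keep the sketch short.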
What Failed
Several failure modes were misleading:
- A path that was technically correct but catastrophically slow
- MFU figures that were misleading until the peak-FLOPs denominator was corrected for the 5090
- Higher per-device batch settings that looked like they should help but actually made things much worse
- Automation bugs around lock cleanup/completion hooks/dispatch order
As the developer noted: "There were several ways to get a run that looked alive while doing something stupid."
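To see why the MFU denominator matters, here is a minimal sketch of the usual calculation. It assumes the common approximation of about 6 FLOPs per parameter per token for a forward plus backward pass (ignoring the attention term). The function name and all numbers below are hypothetical and are not taken from the original runs or from any 5090 specification.

```python
def mfu(tokens_per_sec, n_params, peak_flops_per_sec):
    """Model FLOPs Utilization: achieved training FLOPs/s divided by hardware peak FLOPs/s."""
    # Approximation: ~6 FLOPs per parameter per token (forward + backward).
    achieved = 6 * n_params * tokens_per_sec
    return achieved / peak_flops_per_sec


# Hypothetical throughput and model size, chosen only for illustration.
tokens_per_sec = 300_000
n_params = 50_000_000

# The same run reported against two different assumed peak values:
mfu_low_peak = mfu(tokens_per_sec, n_params, 200e12)
mfu_high_peak = mfu(tokens_per_sec, n_params, 400e12)
print(f"{mfu_low_peak:.1%} vs {mfu_high_peak:.1%}")
```

The achieved FLOPs are identical in both cases; only the assumed peak changes, and the reported MFU halves. A denominator left over from a different GPU therefore makes the number meaningless for comparing runs.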
What Helped
Real improvements came from:
- Re-enabling the fused optimizer compile path
- Reducing total batch from the original larger setting
- Validating 2**17 as the better total batch region
- Increasing time budget once the stable batch regime was found
- Treating automation as part of the benchmark system, not an afterthought
Performance Progression
The progression of useful runs showed clear improvements:
- Baseline healthy run: val_bpb: 1.165452, mfu: 40.49%
- Fused optimizer compile improvement: val_bpb: 1.155400, mfu: 42.88%
- TOTAL_BATCH_SIZE = 2**18: val_bpb: 1.108381, mfu: 43.18%
- TOTAL_BATCH_SIZE = 2**17 validation: val_bpb: 1.089424, mfu: 43.03%
- Best current auto-loop result: TOTAL_BATCH_SIZE = 2**17, TIME_BUDGET = 1200, LR multiplier = 1.0, val_bpb: 0.999445, mfu: 42.56%, total_tokens_M: 387.8, num_steps: 2959
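The "extract" step of the loop, plus a quick sanity check on the reported numbers, might look like the sketch below. It assumes metrics appear as `key: value` pairs in the format quoted above; the `parse_metrics` helper is hypothetical. The check also relies on the assumption that TOTAL_BATCH_SIZE counts tokens per optimizer step, which the best run's reported totals are consistent with.

```python
import re


def parse_metrics(text):
    """Extract 'key: number' pairs such as 'val_bpb: 0.999445' or 'mfu: 42.56%'."""
    pairs = re.findall(r"(\w+):\s*(-?\d+(?:\.\d+)?)%?", text)
    return {key: float(value) for key, value in pairs}


line = "val_bpb: 0.999445, mfu: 42.56%, total_tokens_M: 387.8, num_steps: 2959"
metrics = parse_metrics(line)

# Cross-check: steps * tokens-per-step should match the reported token total.
total_batch_size = 2**17
tokens = int(metrics["num_steps"]) * total_batch_size
print(round(tokens / 1e6, 1), metrics["total_tokens_M"])
```

Cross-checks like this are cheap, and they help catch a run that "looks alive" while reporting inconsistent numbers.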
Current Best Configuration
The best result found so far:
- TOTAL_BATCH_SIZE = 2**17
- TIME_BUDGET = 1200
- LR multiplier = 1.0
This combination beat the larger batch variants, a smaller 2**16 variant, a lower-LR test, and shorter training budgets.
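For context on how total batch and per-device batch interact, here is a hedged sketch of the usual gradient-accumulation arithmetic. `DEVICE_BATCH_SIZE` and `MAX_SEQ_LEN` are assumed names and the values are hypothetical; the source only reports TOTAL_BATCH_SIZE.

```python
def grad_accum_steps(total_batch_size, device_batch_size, max_seq_len, n_devices=1):
    """Micro-steps needed per optimizer step to reach the total batch, measured in tokens."""
    tokens_per_micro_step = device_batch_size * max_seq_len * n_devices
    if total_batch_size % tokens_per_micro_step != 0:
        raise ValueError("total batch must be a multiple of tokens per micro-step")
    return total_batch_size // tokens_per_micro_step


TOTAL_BATCH_SIZE = 2**17   # from the best configuration above
DEVICE_BATCH_SIZE = 16     # hypothetical value
MAX_SEQ_LEN = 2048         # hypothetical value

steps = grad_accum_steps(TOTAL_BATCH_SIZE, DEVICE_BATCH_SIZE, MAX_SEQ_LEN)
print(steps)
```

Under this arithmetic, raising the per-device batch only reduces the number of accumulation steps and leaves the tokens per optimizer step unchanged. Any speedup or slowdown from it has to be measured, which matches the finding that higher per-device settings made things worse on this hardware.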
Key Takeaways
The main lesson was that the winning configuration wasn't a "max everything" setup. The better path involved a stable batch regime, a longer training horizon, and careful elimination of automation and backend mistakes.
The developer emphasized that if you're working on Blackwell/5090 training and seeing bizarre behavior, "it may not be your imagination. Some paths are simply much worse than they first appear." The useful part of this exercise was finding a path that is stable, automatable, reproducible, and good enough to build real follow-on experiments on top of.
📖 Read the full source: r/LocalLLaMA