Developer Seeks Architecture Advice for Serving Embed, Rerank, and Zero-Shot Models on 8GB VRAM

Problem Overview
A developer is building a unified Knowledge Graph/RAG service for a local coding agent that runs in a single Docker container via FastAPI. The system ran acceptably on Windows (WSL), but moving to native Linux exposed severe memory-limit problems under stress tests.
Hardware and Model Constraints
Hardware:
- 8GB VRAM (Laptop GPU)
- ~16GB System RAM (Docker limits hit fast, usually only ~6GB free when models are loaded)
Model Stack:
- Embedding: nomic-ai/nomic-embed-text-v2-moe
- Reranking: BAAI/bge-reranker-base
- Classification: MoritzLaurer/ModernBERT-large-zeroshot-v2.0 (used to classify text pairs into 4 relations: dependency, expansion, contradiction, unrelated)
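A zero-shot NLI classifier of this kind is typically driven by scoring one hypothesis per candidate label and normalizing the entailment scores across labels. The sketch below illustrates that idea for the four relations in plain Python. The hypothesis wording, the premise separator, and the function names are assumptions for illustration; the source does not show the developer's actual templates, and the entailment logits would come from the model in practice.

```python
import math

RELATIONS = ["dependency", "expansion", "contradiction", "unrelated"]


def build_pairs(text_a, text_b):
    """Return one (premise, hypothesis) pair per candidate relation.

    The template wording below is hypothetical, not taken from the source.
    """
    premise = f"{text_a}\n---\n{text_b}"
    return [
        (premise, f"The relation between the two passages is {relation}.")
        for relation in RELATIONS
    ]


def pick_relation(entailment_logits):
    """Softmax over the per-hypothesis entailment logits.

    Expects one logit per entry in RELATIONS, in the same order.
    Returns the winning label and its probability.
    """
    if len(entailment_logits) != len(RELATIONS):
        raise ValueError("expected one entailment logit per relation")
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]  # numerically stable
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return RELATIONS[best], probs[best]
```

Each text pair therefore costs four forward passes (or one batch of four), which is part of why memory use grows quickly with long inputs.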
Technical Challenges
The developer cannot aggressively truncate text because they're feeding code chunks and natural text into these models and need to process variable, long sequences.
Specific issues encountered:
- Latency vs. OOM: Using torch.cuda.empty_cache() to keep the GPU clean causes latency spikes of 18-20 seconds per request due to driver syncs. Removing it causes the GPU to instantly OOM when concurrent requests hit.
- System RAM Explosion (Linux Exit 137): Using the Hugging Face pipeline("zero-shot-classification") caused massive CPU RAM bloat. Without truncation, the pipeline generates massive combination matrices in memory before sending them to the GPU, causing the Linux kernel to instantly kill the container.
- VRAM Spikes: cudnn.benchmark = True was caching workspaces for every unique sequence length, draining 3GB of free VRAM in seconds during stress tests.
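If shape-dependent caching is the cause of the VRAM spikes, the direct option is to set torch.backends.cudnn.benchmark = False. A complementary idea is to bound the number of distinct input shapes by padding every sequence up to a fixed bucket size. The sketch below shows that bucketing step in plain Python. The bucket width of 64 and the cap of 2048 tokens are assumed values, not figures from the source; the cap would need to match the model's real context limit so that no extra truncation is introduced.

```python
def bucket_length(n_tokens, multiple=64, max_len=2048):
    """Round a token count up to the next multiple, capped at max_len.

    multiple and max_len are illustrative defaults, not values from the source.
    """
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    rounded = -(-n_tokens // multiple) * multiple  # ceiling division
    return min(rounded, max_len)


def group_by_bucket(lengths, multiple=64, max_len=2048):
    """Map each padded length to the indices of the inputs that share it."""
    buckets = {}
    for index, n_tokens in enumerate(lengths):
        key = bucket_length(n_tokens, multiple, max_len)
        buckets.setdefault(key, []).append(index)
    return buckets
```

With these defaults the model would see at most 32 distinct sequence lengths, at the cost of some wasted padding per input.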
Current Implementation
The developer has a pure Python/FastAPI setup with the following workarounds:
- Bypassed the HF pipeline and wrote a manual NLI inference loop for ModernBERT
- Using asyncio.Lock() to force serial execution (only one model touches the GPU at a time)
- Using deterministic deallocation (del inputs + gc.collect()) via FastAPI background tasks
This approach is better but still unstable under a 3-minute stress test.
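The serialization workaround described above can be sketched roughly as follows. This is not the developer's code: SerialGpuRunner, model_fn, and the chunk size are hypothetical names and values, and model_fn stands in for any blocking inference call that takes a list of inputs and returns a list of outputs. Processing inputs in small chunks is an added assumption here, meant to cap how much is resident on the GPU at once.

```python
import asyncio


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class SerialGpuRunner:
    """Let only one request at a time run inference (hypothetical helper)."""

    def __init__(self, chunk_size=8):
        self.chunk_size = chunk_size
        self._lock = None  # created lazily inside the running event loop

    async def run(self, model_fn, items):
        if self._lock is None:
            self._lock = asyncio.Lock()
        results = []
        async with self._lock:
            for chunk in chunked(items, self.chunk_size):
                # Run the blocking call in a worker thread so the event
                # loop can keep accepting requests while the GPU is busy.
                results.extend(await asyncio.to_thread(model_fn, chunk))
        return results
```

A single shared runner instance would be used by all three models, so that embedding, reranking, and classification requests queue behind the same lock.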
Questions for the Community
The developer is seeking advice on:
- Model Alternatives: Smaller/faster models that maintain high accuracy for Zero-Shot NLI and Reranking that fit better in an 8GB envelope
- Prebuilt Architectures: The developer previously looked at infinity_emb but struggled to integrate custom 4-way NLI classification logic without double-loading models, and is now considering TEI (Text Embeddings Inference), TensorRT, or other solutions optimized for encoder models
- Serving Strategy: Standard design patterns for hosting 3 transformer models on a single consumer GPU without them stepping on each other's memory
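One commonly used pattern for variable-length encoder workloads is to batch by a padded-token budget rather than by a fixed batch size, so that long code chunks run in small batches and short texts in large ones. The sketch below is a plain-Python illustration of that idea and is not drawn from the original thread; the function name and the 8192-token budget are assumptions that would need tuning against the actual 8GB card.

```python
def batches_by_token_budget(lengths, max_tokens=8192):
    """Yield lists of input indices whose padded size fits the budget.

    Padded size is (number of inputs) * (longest input in the batch).
    An input longer than max_tokens still gets a batch of its own.
    """
    order = sorted(range(len(lengths)), key=lengths.__getitem__)
    batch, longest = [], 0
    for index in order:
        new_longest = max(longest, lengths[index])
        if batch and new_longest * (len(batch) + 1) > max_tokens:
            yield batch
            batch, new_longest = [], lengths[index]
        batch.append(index)
        longest = new_longest
    if batch:
        yield batch
```

For example, token lengths of 100, 4000, 120, and 3900 would be split into one batch holding the two short inputs and another holding the two long ones, instead of padding all four to 4000 tokens.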
📖 Read the full source: r/LocalLLaMA