Performance engineering for the GPUs you already own.
Fixed-fee engagements where our inference engineers go deep on your stack — custom CUDA kernels, quantization, migration, on-prem deployment. Same team that built FireAttention. Zero hourly billing.
We ship code, not slides
Every engagement ends with deployed software in your repo or cluster. Decks are an artifact, not the deliverable.
Fixed-fee, not hourly
You know the cost before we start. We absorb the risk of going long. That's only possible because we're confident.
Same engineers as the platform
The people who built FireAttention v3, FP8 fused prefill, and tree speculative decoding. Veterans of frontier AI labs and large open-model serving teams.
Five productised engagements
Each one has a fixed scope, fixed fee, and concrete deliverables. Need something custom? We do that too — talk to us.
Inference Performance Audit
We benchmark your stack and tell you exactly where the bottleneck is.
Two-week deep dive: profile your current inference setup (any framework — vLLM, TGI, TensorRT-LLM, Ollama, custom), identify the top 3 throughput / latency bottlenecks, and deliver a prioritised roadmap with quantified impact estimates per fix.
Deliverables
- Profiling report with flame graphs (CUDA, kernel-level)
- Top 3 bottleneck analysis with quantified gain estimates
- Prioritised optimisation roadmap (effort × impact)
- Optional: 30-min readout call with engineering leadership
Custom Kernel Development
Hand-tuned CUDA / Triton kernels for your hottest path.
Our team writes custom CUDA, Triton, or CUTLASS kernels for the operations dominating your forward pass. Typical wins: 1.5-3× throughput on attention, MoE expert routing, or fused activation + matmul. Includes verification, regression tests, and deployment artifacts.
Deliverables
- Production-ready kernel(s) with test suite
- Numerical equivalence proof vs reference
- Integration with your existing serving stack
- Performance report (before/after, multiple batch sizes)
Migration Engagement
We move you from your current provider to optimised inference.
Full migration from existing closed-API providers (OpenAI, Anthropic) or self-hosted setups (vLLM, TGI, NVIDIA Triton) to a tuned inference stack — either on Luminet Cloud, your dedicated GPUs, or your own VPC. Zero downtime, instrumented cutover, post-migration support.
Deliverables
- End-to-end migration plan with rollback strategy
- API compatibility shim if your code uses non-OpenAI shapes
- Side-by-side traffic shadowing for 7 days
- 30-day on-call post-cutover
On-Prem / VPC Deployment
Bring FireAttention into your own data center.
We install and tune the FireAttention runtime inside your own AWS / GCP / Azure / on-prem cluster. Includes architecture review, deployment automation, monitoring stack integration, and performance benchmarking against your specific hardware (H100, H200, MI300X, B200).
Deliverables
- Air-gapped deployment with Helm charts / Terraform
- Custom hardware-specific tuning (NVLink, NVSwitch, IB topology)
- Prometheus + Grafana + alerting integration
- Knowledge transfer to your SRE team
Custom Model Optimization
Take a research checkpoint to production-grade inference.
You have a fine-tuned or proprietary model that runs slow. We quantize (FP8 / INT4 with quality validation), apply continuous batching, configure speculative decoding with a custom draft, and benchmark on your target hardware. Output: a deployment-ready artifact running 2-5× faster than your starting point.
Deliverables
- Quantized variants (FP8, INT4) with quality validation reports
- Custom draft model trained for speculative decoding
- Deployment recipe with Docker images
- Benchmark vs your prior baseline
What we actually deliver, in three diagrams
Concrete artifacts from real engagements. Click through to the relevant case study or service for the full story.
Service · Custom Kernel Development
Fuse three kernels into one
Stock inference engines launch attention, quantize, and matmul as separate CUDA kernels. We fuse them in registers — the numerical result is bit-exact, the wallclock cost drops 1.76× on the prefill path.
Read FireAttention v3 deep-diveBefore — 3 kernel launches
attention_bf164.2 msquantize_to_fp80.9 msmatmul_fp82.1 msAfter — fused kernel
fused_fa3_fp84.1 msBefore
After
Service · Migration Engagement
Your client code stays. Everything below it gets faster.
We sit a compatibility shim in front of your existing OpenAI / Anthropic / vLLM endpoint, run side-by-side traffic for 7 days to validate quality, then cut over with zero downtime. Typical outcome: −40-60% inference cost.
Estimate your migrationService · Custom Model Optimization
Hundreds of customer fine-tunes on a single H100
Multi-LoRA serving lets you give every downstream customer a private fine-tune at the per-token economics of a shared model. Adapter switch overhead under 2 ms. Pin the hot ones, evict the long-tail automatically.
Read multi-LoRA architectureIncoming requests
req_001→acme-v3req_002→globex-v1req_003→acme-v3req_004→initech-v2req_005→globex-v1Single H100 (shared base + N adapters)
From first call to deployed code in under a month
Discovery call (free)
30 minutes with an inference engineer. We diagnose at a high level, decide if we can help, and scope an engagement.
Statement of Work
Fixed-fee proposal with deliverables, timeline, and success criteria. Signed in 1-2 business days.
Engagement kickoff
Dedicated Slack channel, daily standups (or async if you prefer), shared GitHub repo for code deliverables.
Delivery & handoff
Production-ready artifacts, knowledge transfer, and 30-day post-engagement support included.
Three honest options. Here's when each makes sense.
We won't pretend Luminet is the right answer for every team. For some workloads DIY is fine. For others a generalist contractor beats us on price. Here's the honest comparison.
| Dimension | DIY (in-house) | Generalist contractor | Luminet |
|---|---|---|---|
| Time to first deliverable | 2-6 months | 4-8 weeks | 2-3 weeks |
| Pricing model | Fully loaded engineer cost | Hourly / T&M (open-ended) | Fixed-fee, signed SoW |
| Risk of going over budget | High (scope drift) | High (hourly meter) | Zero (fixed-fee) |
| Inference-specific expertise | Depends on hire | Variable; rarely deep | Built FireAttention |
| Knowledge stays in your team | Yes, fully | Partial — varies | Code in your repo + 30-day on-call |
| Best for | Large eng teams with slack capacity | Generic infra plumbing | Specialised inference work, deadline-bound |
If you have an inference engineer with FP8 / CUDA / MoE serving on their resume sitting on the bench — DIY. If you don't, talk to us before you spend 6 months building one.
Common questions before signing an SoW
How is fixed-fee handled if scope changes mid-engagement?+
We absorb scope drift inside the original deliverable. If you ask for genuinely new work outside the SoW, we issue a change order with its own fixed fee — no hourly billing, no surprise invoices. Original scope stays at the original price.
Do you require us to migrate to your platform?+
No. Every engagement deliverable is portable code that runs on your infrastructure (or any cloud you choose). The Luminet Platform is a separate product — useful if you want managed inference, but never required for Solutions engagements.
What if our model is proprietary or unreleased?+
Standard. We sign mutual NDA before discovery. Engagements involving proprietary weights are common — about 60% of our pipeline. Single-tenant code review repos, no cross-customer code sharing, IP fully assigned to you on completion.
Can we hire one of your engineers full-time afterwards?+
Yes — with a 12-month fee equivalent to the engineer's loaded annual cost (industry standard for placement). We don't lock anyone in; people leave Luminet eventually, and we'd rather it be to a happy customer than a competitor.
What happens to the deliverables if we churn from Luminet?+
Nothing. The code is yours, in your repo, under your license. The 30-day on-call period continues even if you've decided not to renew. We've never put a dependency on the Luminet runtime in delivered code unless explicitly contracted to.
What if you can't deliver on the promised performance lift?+
Performance Audits include quantified gain estimates per fix, with a confidence range. If we ship the work and miss the low end of that range by more than 20%, we extend the engagement at no charge until we hit it — or we refund the difference. We've not had to do this yet, but it's in every SoW.
Can you work on-site at our office?+
Yes for kickoff (1-2 days) and final handoff (1-2 days). Day-to-day work is async — we ship code, not slides, and slides are what would justify on-site work.
What hardware do you support?+
Any NVIDIA GPU from A100 onwards (A100, L40S, H100, H200, B200, GB200 NVL72). AMD MI300X support is in beta — we'll do it but expect a longer timeline. Inference-specific TPUs and Trainium engagements are case-by-case.
Tools to scope your engagement
Quote calculator
30-second rough estimate based on your hardware, model size, and timeline.
Get an estimateTeamMeet the engineers
The people who get assigned to your project. Names, faces, prior work.
See the teamFree PDFOptimization Playbook
104 pages of distilled inference engineering knowledge. Free with email.
Download playbook“The audit told us where 38% of our GPU-time was going. Six weeks later it was gone. Best $25K we've ever spent.”
— VP Infrastructure, $400M-ARR AI startup