Solutions

Performance engineering for the GPUs you already own.

Fixed-fee engagements where our inference engineers go deep on your stack — custom CUDA kernels, quantization, migration, on-prem deployment. Same team that built FireAttention. Zero hourly billing.

FireAttention vs vLLM 0.6: 2.4×
Discovery → signed SoW: < 5 days
Engagement billing model: Fixed-fee

We ship code, not slides

Every engagement ends with deployed software in your repo or cluster. Decks are an artifact, not the deliverable.

Fixed-fee, not hourly

You know the cost before we start. We absorb the risk of going long. That's only possible because we're confident.

Same engineers as the platform

The people who built FireAttention v3, FP8 fused prefill, and tree speculative decoding. Veterans of frontier AI labs and large open-model serving teams.

Services

Five productised engagements

Each one has a fixed scope, fixed fee, and concrete deliverables. Need something custom? We do that too — talk to us.

2 weeks · From $25,000 fixed-fee

Inference Performance Audit

We benchmark your stack and tell you exactly where the bottleneck is.

Two-week deep dive: profile your current inference setup (any framework — vLLM, TGI, TensorRT-LLM, Ollama, custom), identify the top 3 throughput / latency bottlenecks, and deliver a prioritised roadmap with quantified impact estimates per fix.

Deliverables

  • Profiling report with flame graphs (CUDA, kernel-level)
  • Top 3 bottleneck analysis with quantified gain estimates
  • Prioritised optimisation roadmap (effort × impact)
  • Optional: 30-min readout call with engineering leadership

4-12 weeks per kernel · $80K-$240K per engagement

Custom Kernel Development

Hand-tuned CUDA / Triton kernels for your hottest path.

Our team writes custom CUDA, Triton, or CUTLASS kernels for the operations dominating your forward pass. Typical wins: 1.5-3× throughput on attention, MoE expert routing, or fused activation + matmul. Includes verification, regression tests, and deployment artifacts.
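
As an illustration of the verification step, the sketch below compares a candidate implementation against a reference at several input sizes and reports the worst absolute error. It is a pure-Python stand-in, not a kernel test harness: `softmax_reference`, `softmax_candidate`, `max_abs_error`, and `TOLERANCE` are hypothetical names, the single-pass softmax merely plays the role of an optimised kernel, and the tolerance is an assumed value.

```python
import math
import random


def softmax_reference(xs):
    """Reference: numerically stable softmax (subtract the max first)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_candidate(xs):
    """Stand-in for an optimised kernel: single-pass 'online' softmax."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the partial sum whenever the running max changes.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in xs]


def max_abs_error(reference, candidate, sizes=(1, 8, 32, 128), seed=0):
    """Worst absolute difference between the two functions across input sizes."""
    rng = random.Random(seed)
    worst = 0.0
    for n in sizes:
        xs = [rng.uniform(-10.0, 10.0) for _ in range(n)]
        ref, out = reference(xs), candidate(xs)
        worst = max(worst, max(abs(a - b) for a, b in zip(ref, out)))
    return worst


TOLERANCE = 1e-12  # assumed acceptance threshold for this toy example
worst = max_abs_error(softmax_reference, softmax_candidate)
print(f"worst absolute error: {worst:.3e}, within tolerance: {worst <= TOLERANCE}")
```

A real kernel test would run on the target GPU, compare against the framework's reference operator, and pick tolerances per data type (FP8 and BF16 need much looser bounds than the double precision used here).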

Deliverables

  • Production-ready kernel(s) with test suite
  • Numerical equivalence proof vs reference
  • Integration with your existing serving stack
  • Performance report (before/after, multiple batch sizes)

3-6 weeks · From $40,000 fixed-fee

Migration Engagement

We move you from your current provider to optimised inference.

Full migration from closed-API providers (OpenAI, Anthropic) or self-hosted setups (vLLM, TGI, NVIDIA Triton) to a tuned inference stack on Luminet Cloud, your dedicated GPUs, or your own VPC. Zero downtime, instrumented cutover, post-migration support.

Deliverables

  • End-to-end migration plan with rollback strategy
  • API compatibility shim if your code uses non-OpenAI shapes
  • Side-by-side traffic shadowing for 7 days
  • 30-day on-call post-cutover

4-8 weeks · From $96K annual + setup

On-Prem / VPC Deployment

Bring FireAttention into your own data center.

We install and tune the FireAttention runtime inside your own AWS / GCP / Azure / on-prem cluster. Includes architecture review, deployment automation, monitoring stack integration, and performance benchmarking against your specific hardware (H100, H200, MI300X, B200).

Deliverables

  • Air-gapped deployment with Helm charts / Terraform
  • Custom hardware-specific tuning (NVLink, NVSwitch, IB topology)
  • Prometheus + Grafana + alerting integration
  • Knowledge transfer to your SRE team

3-8 weeks · From $60,000 per model

Custom Model Optimization

Take a research checkpoint to production-grade inference.

You have a fine-tuned or proprietary model that runs slowly. We quantize (FP8 / INT4 with quality validation), apply continuous batching, configure speculative decoding with a custom draft model, and benchmark on your target hardware. Output: a deployment-ready artifact running 2-5× faster than your starting point.
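
To make "quantize with quality validation" concrete, here is a minimal sketch of symmetric per-tensor INT4 quantization with a round-trip error check. It assumes the simplest possible scheme (one scale per tensor, integers in [-8, 7]); the function names and sample weights are hypothetical, and real validation would measure model-level quality (for example task accuracy or perplexity) rather than only weight error.

```python
def quantize_int4(values):
    """Symmetric per-tensor INT4: integers in [-8, 7] plus one float scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


def max_round_trip_error(values):
    """Largest absolute difference between original and dequantized values."""
    quantized, scale = quantize_int4(values)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored)), scale


weights = [0.70, -0.35, 0.10, 0.02, -0.70]  # made-up sample weights
error, scale = max_round_trip_error(weights)
# Rounding to the nearest integer step bounds the error by half a step.
print(f"scale={scale:.4f}, max error={error:.4f}, bound={scale / 2:.4f}")
```

Production schemes usually use per-channel or per-group scales, which shrink the step size for small-magnitude weights; the half-step error bound applies per scale group in the same way.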

Deliverables

  • Quantized variants (FP8, INT4) with quality validation reports
  • Custom draft model trained for speculative decoding
  • Deployment recipe with Docker images
  • Benchmark vs your prior baseline

Architecture

What we actually deliver, in three diagrams

Concrete artifacts from real engagements. Click through to the relevant case study or service for the full story.

Service · Custom Kernel Development

Fuse three kernels into one

Stock inference engines launch attention, quantize, and matmul as separate CUDA kernels. We fuse them into one kernel that keeps intermediates in registers: the numerical result is bit-exact, and wall-clock time on the prefill path drops by 1.76× (7.2 ms to 4.1 ms).

Read FireAttention v3 deep-dive
Figure: custom-kernel.svg

Before (3 kernel launches)

attention_bf16: 4.2 ms
quantize_to_fp8: 0.9 ms
matmul_fp8: 2.1 ms
total: 7.2 ms

After (fused kernel)

fused_fa3_fp8: 4.1 ms (attn + quant + matmul)
saved 2 launches and their kernel overhead
total: 4.1 ms (1.76×)
Real measurement on H100 SXM, batch 32, FP8 prefill, 1K input tokens.
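
The 1.76× figure follows directly from the per-kernel timings quoted in the diagram. The short script below only redoes that arithmetic; the variable names are illustrative.

```python
# Per-kernel timings (ms) as quoted in the before/after diagram.
unfused_ms = {"attention_bf16": 4.2, "quantize_to_fp8": 0.9, "matmul_fp8": 2.1}
fused_ms = 4.1

total_unfused_ms = sum(unfused_ms.values())
speedup = total_unfused_ms / fused_ms
saved_ms = total_unfused_ms - fused_ms

print(f"unfused total: {total_unfused_ms:.1f} ms")  # unfused total: 7.2 ms
print(f"speedup: {speedup:.2f}x")                   # speedup: 1.76x
print(f"saved per call: {saved_ms:.1f} ms")         # saved per call: 3.1 ms
```
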

Figure: migration-architecture.svg

Before

Your client code (unchanged): OpenAI / Anthropic SDK
Closed API: OpenAI / Anthropic / Google
Their GPUs: $$$ per token, no control

After

Your client code (unchanged): 1-line URL swap
Compatibility shim: side-by-side traffic for 7 days
FireAttention runtime (optimised): FP8 + spec decode + paged KV
Your dedicated GPUs, or our hosted cluster

We deliver: zero downtime · instrumented cutover · 30-day post-engagement support

Service · Migration Engagement

Your client code stays. Everything below it gets faster.

We put a compatibility shim in front of your existing OpenAI / Anthropic / vLLM endpoint, run side-by-side traffic for 7 days to validate quality, then cut over with zero downtime. Typical outcome: 40-60% lower inference cost.
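
Side-by-side (shadow) traffic can be pictured as follows: every request is still answered by the existing endpoint, while a copy goes to the candidate stack and the two responses are compared offline. This is a minimal sketch under stated assumptions, not the shim itself: `shadow_compare`, `ShadowReport`, and the two stub endpoints are hypothetical, and the exact-match comparison is a placeholder for whatever quality metric an engagement actually agrees on.

```python
from dataclasses import dataclass


@dataclass
class ShadowReport:
    total: int
    matches: int

    @property
    def match_rate(self) -> float:
        return self.matches / self.total if self.total else 0.0


def shadow_compare(prompts, primary, candidate,
                   same=lambda a, b: a.strip() == b.strip()):
    """Serve from `primary`; mirror each request to `candidate` and compare."""
    served, total, matches = [], 0, 0
    for prompt in prompts:
        live = primary(prompt)
        served.append(live)  # callers only ever see the primary response
        try:
            shadow = candidate(prompt)
        except Exception:
            shadow = None  # a failing shadow must never affect live traffic
        total += 1
        if shadow is not None and same(live, shadow):
            matches += 1
    return served, ShadowReport(total, matches)


# Hypothetical stand-ins for the existing and the tuned endpoint.
def existing_endpoint(prompt):
    return prompt.upper()


def tuned_endpoint(prompt):
    return "STATUS?" if prompt == "status" else prompt.upper()


served, report = shadow_compare(["ping", "hello", "status"],
                                existing_endpoint, tuned_endpoint)
print(f"{report.matches}/{report.total} responses matched")
```

Because LLM outputs are rarely identical token for token, a practical comparator would score semantic or task-level agreement and also record latency and cost for both paths.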

Estimate your migration

Service · Custom Model Optimization

Hundreds of customer fine-tunes on a single H100

Multi-LoRA serving lets you give every downstream customer a private fine-tune at the per-token economics of a shared model. Adapter switch overhead under 2 ms. Pin the hot ones, evict the long-tail automatically.
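
The "pin the hot ones, evict the long tail" policy can be sketched as a small cache. The example below assumes least-recently-used eviction with a pinned set, which is one common choice and not necessarily what the runtime does; `AdapterCache` and `load_adapter` are hypothetical names, and the request sequence mirrors the adapter names shown in the routing diagram.

```python
from collections import OrderedDict


class AdapterCache:
    """Keeps at most `capacity` adapters resident; pinned ones are never evicted."""

    def __init__(self, capacity, pinned=()):
        self.capacity = capacity
        self.pinned = set(pinned)
        self._resident = OrderedDict()  # adapter name -> weights, oldest first

    def get(self, name, load):
        if name in self._resident:
            self._resident.move_to_end(name)  # mark as most recently used
            return self._resident[name]
        weights = load(name)
        self._resident[name] = weights
        self._evict_if_needed()
        return weights

    def _evict_if_needed(self):
        while len(self._resident) > self.capacity:
            # Oldest adapter that is not pinned, if any.
            victim = next((n for n in self._resident if n not in self.pinned), None)
            if victim is None:
                break  # everything resident is pinned; allow the overflow
            del self._resident[victim]

    def resident(self):
        return list(self._resident)


loads = []


def load_adapter(name):
    """Stand-in for fetching adapter weights from storage."""
    loads.append(name)
    return f"weights:{name}"


cache = AdapterCache(capacity=2, pinned={"acme-v3"})
for adapter in ["acme-v3", "globex-v1", "acme-v3", "initech-v2", "globex-v1"]:
    cache.get(adapter, load_adapter)

print("resident:", cache.resident())
print("loads:", loads)
```

With room for only two adapters, the pinned `acme-v3` stays resident throughout while `globex-v1` and `initech-v2` displace each other, so `globex-v1` is loaded twice.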

Read multi-LoRA architecture
Figure: multi-lora-routing.svg

Incoming requests

req_001 → acme-v3
req_002 → globex-v1
req_003 → acme-v3
req_004 → initech-v2
req_005 → globex-v1

Single H100 (shared base + N adapters)

Base model · llama-4-maverick FP8 · 140 GB HBM
Loaded once. Computes the shared forward pass.
Adapters: acme-v3, globex-v1, initech-v2, hooli-v4, pied-v2, …395 more
Adapter switch overhead: < 2 ms
400+ customer adapters per replica. Same per-token economics as a single shared model.

How we work

From first call to deployed code in under a month

1. Discovery call (free)

30 minutes with an inference engineer. We diagnose at a high level, decide if we can help, and scope an engagement.

2. Statement of Work

Fixed-fee proposal with deliverables, timeline, and success criteria. Signed in 1-2 business days.

3. Engagement kickoff

Dedicated Slack channel, daily standups (or async if you prefer), shared GitHub repo for code deliverables.

4. Delivery & handoff

Production-ready artifacts, knowledge transfer, and 30-day post-engagement support included.

Why us

Three honest options. Here's when each makes sense.

We won't pretend Luminet is the right answer for every team. For some workloads DIY is fine. For others a generalist contractor beats us on price. Here's the honest comparison.

Dimension | DIY (in-house) | Generalist contractor | Luminet
Time to first deliverable | 2-6 months | 4-8 weeks | 2-3 weeks
Pricing model | Fully loaded engineer cost | Hourly / T&M (open-ended) | Fixed-fee, signed SoW
Risk of going over budget | High (scope drift) | High (hourly meter) | Zero (fixed-fee)
Inference-specific expertise | Depends on hire | Variable; rarely deep | Built FireAttention
Knowledge stays in your team | Yes, fully | Partial (varies) | Code in your repo + 30-day on-call
Best for | Large eng teams with slack capacity | Generic infra plumbing | Specialised inference work, deadline-bound

If you have an inference engineer with FP8 / CUDA / MoE serving on their resume sitting on the bench — DIY. If you don't, talk to us before you spend 6 months building one.

FAQ

Common questions before signing an SoW

How is fixed-fee handled if scope changes mid-engagement?

We absorb scope drift inside the original deliverable. If you ask for genuinely new work outside the SoW, we issue a change order with its own fixed fee — no hourly billing, no surprise invoices. Original scope stays at the original price.

Do you require us to migrate to your platform?

No. Every engagement deliverable is portable code that runs on your infrastructure (or any cloud you choose). The Luminet Platform is a separate product — useful if you want managed inference, but never required for Solutions engagements.

What if our model is proprietary or unreleased?

That is standard. We sign a mutual NDA before discovery. Engagements involving proprietary weights are common (about 60% of our pipeline). We use single-tenant code review repos, never share code across customers, and assign IP fully to you on completion.

Can we hire one of your engineers full-time afterwards?

Yes, for a placement fee equal to 12 months of the engineer's loaded cost (industry standard for placement). We don't lock anyone in; people leave Luminet eventually, and we'd rather it be to a happy customer than a competitor.

What happens to the deliverables if we churn from Luminet?

Nothing. The code is yours, in your repo, under your license. The 30-day on-call period continues even if you've decided not to renew. We've never put a dependency on the Luminet runtime in delivered code unless explicitly contracted to.

What if you can't deliver on the promised performance lift?

Performance Audits include quantified gain estimates per fix, with a confidence range. If we ship the work and miss the low end of that range by more than 20%, we extend the engagement at no charge until we hit it — or we refund the difference. We've not had to do this yet, but it's in every SoW.
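
As a worked reading of that clause, the sketch below treats "missing the low end by more than 20%" as a relative shortfall against the low end of the estimated range. That interpretation is an assumption; the SoW wording governs, and `misses_guarantee` and the sample numbers are purely illustrative.

```python
def misses_guarantee(estimated_low, achieved, tolerance=0.20):
    """True if `achieved` falls short of `estimated_low` by more than `tolerance`.

    Both values are gains in the same unit (e.g. percent throughput improvement).
    """
    if estimated_low <= 0:
        raise ValueError("estimated_low must be positive")
    shortfall = (estimated_low - achieved) / estimated_low
    return shortfall > tolerance


# Estimated range starts at a 30% gain (made-up number).
print(misses_guarantee(30.0, 25.0))  # False: shortfall is about 17%
print(misses_guarantee(30.0, 20.0))  # True: shortfall is about 33%
```
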

Can you work on-site at our office?

Yes, for kickoff (1-2 days) and final handoff (1-2 days). Day-to-day work is async: we ship code, not slides, and slides are what would justify on-site work.

What hardware do you support?

Any NVIDIA GPU from A100 onwards (A100, L40S, H100, H200, B200, GB200 NVL72). AMD MI300X support is in beta — we'll do it but expect a longer timeline. Inference-specific TPUs and Trainium engagements are case-by-case.

Resources

Tools to scope your engagement

“The audit told us where 38% of our GPU-time was going. Six weeks later it was gone. Best $25K we've ever spent.”

— VP Infrastructure, $400M-ARR AI startup

Book a discovery call · General inquiry

Response within 1 business day