Solutions

Performance engineering for the GPUs you already own.

Fixed-fee engagements where our inference engineers go deep on your stack — custom CUDA kernels, quantization, migration, on-prem deployment. Same team that built FireAttention. Zero hourly billing.

Book a discovery call View services

FireAttention vs vLLM 0.6: 2.4×

Discovery → signed SoW: < 5 days

Engagement billing model: Fixed-fee

We ship code, not slides

Every engagement ends with deployed software in your repo or cluster. Decks are an artifact, not the deliverable.

Fixed-fee, not hourly

You know the cost before we start. We absorb the risk of going long. That's only possible because we're confident.

Same engineers as the platform

The people who built FireAttention v3, FP8 fused prefill, and tree speculative decoding. Veterans of frontier AI labs and large open-model serving teams.

Services

Five productised engagements

Each one has a fixed scope, fixed fee, and concrete deliverables. Need something custom? We do that too — talk to us.

2 weeks

From $25,000 fixed-fee

Inference Performance Audit

We benchmark your stack and tell you exactly where the bottleneck is.

Two-week deep dive: profile your current inference setup (any framework — vLLM, TGI, TensorRT-LLM, Ollama, custom), identify the top 3 throughput / latency bottlenecks, and deliver a prioritised roadmap with quantified impact estimates per fix.

Deliverables

Profiling report with flame graphs (CUDA, kernel-level)
Top 3 bottleneck analysis with quantified gain estimates
Prioritised optimisation roadmap (effort × impact)
Optional: 30-min readout call with engineering leadership

4-12 weeks per kernel

$80K-$240K per engagement

Custom Kernel Development

Hand-tuned CUDA / Triton kernels for your hottest path.

Our team writes custom CUDA, Triton, or CUTLASS kernels for the operations dominating your forward pass. Typical wins: 1.5-3× throughput on attention, MoE expert routing, or fused activation + matmul. Includes verification, regression tests, and deployment artifacts.

Deliverables

Production-ready kernel(s) with test suite
Numerical equivalence proof vs reference
Integration with your existing serving stack
Performance report (before/after, multiple batch sizes)

3-6 weeks

From $40,000 fixed-fee

Migration Engagement

We move you from your current provider to optimised inference.

Full migration from existing closed-API providers (OpenAI, Anthropic) or self-hosted setups (vLLM, TGI, NVIDIA Triton) to a tuned inference stack on Luminet Cloud, your dedicated GPUs, or your own VPC. Zero downtime, instrumented cutover, post-migration support.

Deliverables

End-to-end migration plan with rollback strategy
API compatibility shim if your code uses non-OpenAI shapes
Side-by-side traffic shadowing for 7 days
30-day on-call post-cutover

4-8 weeks

From $96K annual + setup

On-Prem / VPC Deployment

Bring FireAttention into your own data center.

We install and tune the FireAttention runtime inside your own AWS / GCP / Azure / on-prem cluster. Includes architecture review, deployment automation, monitoring stack integration, and performance benchmarking against your specific hardware (H100, H200, MI300X, B200).

Deliverables

Air-gapped deployment with Helm charts / Terraform
Custom hardware-specific tuning (NVLink, NVSwitch, IB topology)
Prometheus + Grafana + alerting integration
Knowledge transfer to your SRE team

3-8 weeks

From $60,000 per model

Custom Model Optimization

Take a research checkpoint to production-grade inference.

You have a fine-tuned or proprietary model that runs slow. We quantize (FP8 / INT4 with quality validation), apply continuous batching, configure speculative decoding with a custom draft, and benchmark on your target hardware. Output: a deployment-ready artifact running 2-5× faster than your starting point.

Deliverables

Quantized variants (FP8, INT4) with quality validation reports
Custom draft model trained for speculative decoding
Deployment recipe with Docker images
Benchmark vs your prior baseline
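
To make "quantize with quality validation" concrete, here is a minimal Python sketch of symmetric per-tensor INT4 quantization followed by a round-trip error check. It is an illustration under stated assumptions, not the engagement's actual pipeline: the function names and sample weights are hypothetical, and real validation would compare model outputs on an evaluation set rather than individual weights.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the signed 4-bit range [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to the nearest code, then clamp into the representable range.
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def max_abs_error(a: List[float], b: List[float]) -> float:
    """Largest element-wise deviation between two equally long vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical weights, chosen only for illustration.
weights = [0.70, -0.30, 0.12, 0.0, -0.70]
codes, scale = quantize_int4(weights)
error = max_abs_error(weights, dequantize(codes, scale))

# Round-to-nearest keeps the error within half a quantization step.
print(codes, f"scale={scale:.3f}", f"max error={error:.3f}")
```

The half-step bound is the simplest acceptance criterion; a quality report would normally add task-level metrics on top of it.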

Architecture

What we actually deliver, in three diagrams

Concrete artifacts from real engagements. Click through to the relevant case study or service for the full story.

Service · Custom Kernel Development

Fuse three kernels into one

Stock inference engines launch attention, quantize, and matmul as separate CUDA kernels. We fuse them in registers: the numerical result is bit-exact, and the prefill path runs 1.76× faster in wall-clock time.

Read FireAttention v3 deep-dive

Figure (custom-kernel.svg): kernel timings before and after fusion.

Before (3 kernel launches): attention_bf16 4.2 ms, quantize_to_fp8 0.9 ms, matmul_fp8 2.1 ms; total 7.2 ms

After (fused kernel): fused_fa3_fp8 (attn + quant + matmul) 4.1 ms; total 4.1 ms (1.76×), saving 2 launches and their kernel overhead

Real measurement on H100 SXM, batch 32, FP8 prefill, 1K input tokens.
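
The 1.76× figure follows directly from the per-kernel timings quoted above. A small arithmetic check in Python, using only those numbers (the variable names are illustrative):

```python
# Per-kernel timings quoted in the figure (H100 SXM, batch 32, FP8 prefill, 1K input tokens).
before_ms = {
    "attention_bf16": 4.2,
    "quantize_to_fp8": 0.9,
    "matmul_fp8": 2.1,
}
fused_ms = 4.1  # fused_fa3_fp8: attention + quantize + matmul in one launch

total_before_ms = sum(before_ms.values())
speedup = total_before_ms / fused_ms
launches_saved = len(before_ms) - 1  # three launches collapse into one

print(f"before: {total_before_ms:.1f} ms, after: {fused_ms:.1f} ms")
print(f"speedup: {speedup:.2f}x, launches saved: {launches_saved}")
```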

Figure (migration-architecture.svg): request path before and after migration.

Before: your client code (unchanged; OpenAI / Anthropic SDK) → closed API (OpenAI / Anthropic / Google) → their GPUs ($$$ per token, no control)

After: your client code (unchanged · 1 line URL swap) → compatibility shim (side-by-side traffic for 7 days) → FireAttention runtime, optimised (FP8 + spec decode + paged KV) → your dedicated GPUs (or our hosted cluster)

We deliver: zero downtime · instrumented cutover · 30-day post-engagement support

Service · Migration Engagement

Your client code stays. Everything below it gets faster.

We place a compatibility shim in front of your existing OpenAI / Anthropic / vLLM endpoint, run side-by-side traffic for 7 days to validate quality, then cut over with zero downtime. Typical outcome: 40-60% lower inference cost.
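
The side-by-side phase can be pictured as a router that always answers from the current provider while mirroring each request to the new stack and recording whether the two agree. The Python sketch below is a simplified illustration, not the shim itself: `ShadowRouter`, `Backend`, and `exact_match` are hypothetical names, the backends are stand-in functions rather than real API clients, and a production shim would mirror traffic off the request path and use a task-specific quality check.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Backend = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def exact_match(a: str, b: str) -> bool:
    """Simplest possible quality check; real validation would be task-specific."""
    return a.strip() == b.strip()


@dataclass
class ShadowRouter:
    """Serve from the primary backend and mirror each request to a candidate."""
    primary: Backend
    candidate: Backend
    same: Callable[[str, str], bool] = exact_match
    results: List[bool] = field(default_factory=list)

    def handle(self, prompt: str) -> str:
        answer = self.primary(prompt)  # the caller only ever sees this response
        try:
            shadow = self.candidate(prompt)  # mirrored call to the new stack
            self.results.append(self.same(answer, shadow))
        except Exception:
            self.results.append(False)  # candidate failures count as mismatches
        return answer

    def agreement(self) -> float:
        """Fraction of mirrored requests where both backends agreed."""
        return sum(self.results) / len(self.results) if self.results else 0.0


# Stand-in backends: the candidate disagrees on prompts of six or more characters.
router = ShadowRouter(
    primary=lambda p: p.upper(),
    candidate=lambda p: p.upper() if len(p) < 6 else p,
)
for prompt in ["hi", "hello", "goodbye"]:
    router.handle(prompt)
print(f"agreement: {router.agreement():.2f}")
```

Cutover would then be gated on the agreement rate (and latency) observed over the shadowing window.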

Estimate your migration

Service · Custom Model Optimization

Hundreds of customer fine-tunes on a single H100

Multi-LoRA serving lets you give every downstream customer a private fine-tune at the per-token economics of a shared model. Adapter switch overhead is under 2 ms. Hot adapters are pinned and the long tail is evicted automatically.
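
One common way to "pin the hot ones and evict the long tail" is a least-recently-used cache in which pinned entries are exempt from eviction. The Python sketch below assumes that policy purely for illustration; the page does not say which policy the runtime uses, and `AdapterCache` and its loader are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, List, Set


class AdapterCache:
    """LRU cache of loaded adapters; pinned adapters are never evicted."""

    def __init__(self, capacity: int, load: Callable[[str], object]):
        self.capacity = capacity
        self.load = load  # hypothetical loader: adapter name -> weights
        self.pinned: Set[str] = set()
        self._resident: "OrderedDict[str, object]" = OrderedDict()

    def pin(self, name: str) -> None:
        """Mark an adapter as hot and make sure it is loaded."""
        self.pinned.add(name)
        self.get(name)

    def get(self, name: str) -> object:
        if name in self._resident:
            self._resident.move_to_end(name)  # mark as most recently used
            return self._resident[name]
        weights = self.load(name)
        self._resident[name] = weights
        self._evict()
        return weights

    def _evict(self) -> None:
        # Walk from least to most recently used, skipping pinned adapters.
        for victim in list(self._resident):
            if len(self._resident) <= self.capacity:
                return
            if victim not in self.pinned:
                del self._resident[victim]

    def resident(self) -> List[str]:
        """Loaded adapter names, least recently used first."""
        return list(self._resident)


cache = AdapterCache(capacity=2, load=lambda name: f"weights:{name}")
cache.pin("acme-v3")
for name in ["globex-v1", "initech-v2", "acme-v3", "hooli-v4"]:
    cache.get(name)
print(cache.resident())
```

With capacity 2, the pinned adapter stays resident throughout while the unpinned ones rotate in and out.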

Read multi-LoRA architecture

Figure (multi-lora-routing.svg): routing requests to adapters on one GPU.

Incoming requests: req_001 → acme-v3, req_002 → globex-v1, req_003 → acme-v3, req_004 → initech-v2, req_005 → globex-v1

Single H100 (shared base + N adapters)

Base model: llama-4-maverick FP8, 140 GB HBM. Loaded once. Computes the shared forward pass.

Adapters: acme-v3, globex-v1, initech-v2, hooli-v4, pied-v2, …395 more

Adapter switch overhead: < 2 ms

400+ customer adapters per replica. Same per-token economics as a single shared model.

How we work

From first call to deployed code in under a month

Discovery call (free)

30 minutes with an inference engineer. We diagnose at a high level, decide if we can help, and scope an engagement.

Statement of Work

Fixed-fee proposal with deliverables, timeline, and success criteria. Signed in 1-2 business days.

Engagement kickoff

Dedicated Slack channel, daily standups (or async if you prefer), shared GitHub repo for code deliverables.

Delivery & handoff

Production-ready artifacts, knowledge transfer, and 30-day post-engagement support included.

Why us

Three honest options. Here's when each makes sense.

We won't pretend Luminet is the right answer for every team. For some workloads DIY is fine. For others a generalist contractor beats us on price. Here's the honest comparison.

Dimension	DIY (in-house)	Generalist contractor	Luminet
Time to first deliverable	2-6 months	4-8 weeks	2-3 weeks
Pricing model	Fully loaded engineer cost	Hourly / T&M (open-ended)	Fixed-fee, signed SoW
Risk of going over budget	High (scope drift)	High (hourly meter)	Zero (fixed-fee)
Inference-specific expertise	Depends on hire	Variable; rarely deep	Built FireAttention
Knowledge stays in your team	Yes, fully	Partial — varies	Code in your repo + 30-day on-call
Best for	Large eng teams with slack capacity	Generic infra plumbing	Specialised inference work, deadline-bound

If you have an inference engineer with FP8 / CUDA / MoE serving on their resume sitting on the bench — DIY. If you don't, talk to us before you spend 6 months building one.

FAQ

Common questions before signing an SoW

How is fixed-fee handled if scope changes mid-engagement?

We absorb scope drift inside the original deliverable. If you ask for genuinely new work outside the SoW, we issue a change order with its own fixed fee — no hourly billing, no surprise invoices. Original scope stays at the original price.

Do you require us to migrate to your platform?

No. Every engagement deliverable is portable code that runs on your infrastructure (or any cloud you choose). The Luminet Platform is a separate product — useful if you want managed inference, but never required for Solutions engagements.

What if our model is proprietary or unreleased?

That's standard. We sign a mutual NDA before discovery. Engagements involving proprietary weights are common: about 60% of our pipeline. You get single-tenant code review repos, no cross-customer code sharing, and IP fully assigned to you on completion.

Can we hire one of your engineers full-time afterwards?

Yes, for a placement fee equal to 12 months of the engineer's fully loaded cost (industry standard for placement). We don't lock anyone in; people leave Luminet eventually, and we'd rather it be to a happy customer than a competitor.

What happens to the deliverables if we churn from Luminet?

Nothing. The code is yours, in your repo, under your license. The 30-day on-call period continues even if you've decided not to renew. We've never put a dependency on the Luminet runtime in delivered code unless explicitly contracted to.

What if you can't deliver on the promised performance lift?

Performance Audits include quantified gain estimates per fix, with a confidence range. If we ship the work and miss the low end of that range by more than 20%, we extend the engagement at no charge until we hit it — or we refund the difference. We've not had to do this yet, but it's in every SoW.
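
As a worked example of the 20% rule, the Python sketch below reads "miss the low end by more than 20%" as a relative shortfall against the low end of the estimated range. That reading is an assumption made for illustration, as are the function name and the sample numbers; the SoW wording governs in practice.

```python
def guarantee_triggered(low_end_gain: float, measured_gain: float,
                        tolerance: float = 0.20) -> bool:
    """True when the measured gain misses the low end of the estimate by more than `tolerance`.

    Gains are fractions (0.30 means a 30% improvement). The shortfall is taken
    relative to the low end of the range, which is an assumed interpretation.
    """
    if low_end_gain <= 0:
        raise ValueError("low_end_gain must be positive")
    shortfall = (low_end_gain - measured_gain) / low_end_gain
    return shortfall > tolerance


# Hypothetical estimate: 30-45% throughput gain, so the low end is 0.30.
print(guarantee_triggered(0.30, 0.25))  # shortfall of about 17%: within tolerance
print(guarantee_triggered(0.30, 0.20))  # shortfall of about 33%: extension or refund applies
```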

Can you work on-site at our office?

Yes for kickoff (1-2 days) and final handoff (1-2 days). Day-to-day work is async — we ship code, not slides, and slides are what would justify on-site work.

What hardware do you support?

Any NVIDIA GPU from A100 onwards (A100, L40S, H100, H200, B200, GB200 NVL72). AMD MI300X support is in beta — we'll do it but expect a longer timeline. Inference-specific TPUs and Trainium engagements are case-by-case.

Resources

Tools to scope your engagement

Calculator

“The audit told us where 38% of our GPU-time was going. Six weeks later it was gone. Best $25K we've ever spent.”

— VP Infrastructure, $400M-ARR AI startup

Book a discovery call General inquiry

Response within 1 business day

Tools to scope your engagement

Quote calculator

Meet the engineers

Optimization Playbook