FP8 Quantization Support for Older Nvidia GPUs
Request to support NVFP4 models on Turing and Ampere GPUs by implementing FP8ScaledMMLinearKernel via Marlin FP8.
Similar Problems
ML Inference Lacks Generalized Low-Latency GEMM Kernels with Broad Precision Support
Current low-latency GPU GEMM kernels for ML inference only support specific shapes and bf16 precision. Engineers need generalized versions supporting fp8, nvfp4, and arbitrary shapes for flexible model deployment with PDL after auto-regressive decoding.
llama.cpp lacks native support for 1-bit quantized Bonsai LLM models
The new 1-bit Bonsai 8B model achieves competitive performance at 14x smaller size, but requires a fork of llama.cpp to run. Users want native support in the main project to enable efficient local inference with this architecture.
VLM Model Wrapper Lacks Piecewise CUDAGraph Support
Piecewise cudagraph is not supported for VLM model wrappers in the auto-deploy pipeline. Users deploying vision-language models like Qwen3.5 cannot leverage cudagraph optimizations for the text model component.
LoRA Support Missing for Gemma 4 Models in vLLM
vLLM added Gemma 4 model support but LoRA adapters do not work for Gemma4ForCausalLM or Gemma4ForConditionalGeneration, blocking fine-tuned model deployment.
LLM Inference Frameworks Leave Most GPU Bandwidth Untapped
Conventional LLM inference stacks dispatch one kernel per operation, resulting in hundreds of kernel launches per token, repeated CPU round-trips, and significant memory re-fetching — leaving the majority of available GPU compute and bandwidth unused. This affects developers and researchers running local or self-hosted inference on consumer and prosumer NVIDIA hardware. The gap between theoretical hardware capability and realized throughput is large, but this post is primarily a project announcement rather than a problem statement from users experiencing pain.