AntAngelMed-eagle3

Model Overview

AntAngelMed-eagle3 is a draft model designed for inference acceleration. It uses EAGLE3 speculative sampling to substantially increase inference throughput while preserving the target model's output quality.

The model is trained on high-quality medical datasets, significantly boosting inference throughput while maintaining accuracy, making it well suited to high-load production environments.

Key Features

  • Speculative Sampling Optimization: Built on EAGLE3, achieving a high acceptance rate with a speculative length of 4
  • Outstanding Throughput Performance: FP8 quantization + EAGLE3 yields throughput improvements of up to ~90%
  • Production-Grade Optimization: Achieves 3267 tokens/s output throughput on a single NVIDIA H200

Performance

Speculative Sampling Efficiency

Average acceptance length at a speculative length of 4:

| Benchmark | Average Acceptance Length |
|---|---|
| HumanEval | 2.816 |
| GSM8K | 3.24 |
| Math-500 | 3.326 |
| Med_MCPA | 2.600 |
| Health_Bench | 2.446 |
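To see how acceptance length translates into end-to-end speedup, here is a back-of-envelope sketch. The formula and the draft-to-target cost ratio `c` are illustrative assumptions, not measurements from the card:

```python
# Rough speedup model for speculative decoding.
# Assumptions (not from the model card): each round drafts k tokens,
# the target verifies them in one forward pass, and one draft forward
# costs a fraction `c` of a target forward.

def estimated_speedup(accept_len: float, k: int = 4, c: float = 0.05) -> float:
    """Tokens produced per round, divided by the round's relative cost.

    Plain autoregressive decoding yields 1 token per target forward;
    one speculative round yields ~accept_len tokens at the cost of
    one target verification plus k draft forwards.
    """
    return accept_len / (1 + k * c)

acceptance = {
    "HumanEval": 2.816,
    "GSM8K": 3.24,
    "Math-500": 3.326,
    "Med_MCPA": 2.600,
    "Health_Bench": 2.446,
}

for bench, tau in acceptance.items():
    print(f"{bench:13s} ~{estimated_speedup(tau):.2f}x")
```

Under these assumptions the higher acceptance lengths on math-style benchmarks line up with the larger throughput gains reported below.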

Throughput Improvement

Throughput improvement of FP8 quantization + EAGLE3 over an FP8-only baseline at a concurrency of 16:

| Benchmark | Throughput Improvement |
|---|---|
| HumanEval | +67.3% |
| GSM8K | +58.6% |
| Math-500 | +89.8% |
| Med_MCPA | +46.0% |
| Health_Bench | +45.3% |

Peak Inference Performance

  • Hardware Environment: NVIDIA H200 single GPU


Figure: Throughput performance comparison and accuracy metrics under equal compute on 1xH200

Technical Specifications

  • Model Architecture: LlamaForCausalLMEagle3
  • Parameters: ~0.4B
  • Number of Layers: 1 (draft model)
  • Hidden Size: 4096
  • Attention Heads: 32 (KV heads: 8)
  • Intermediate Size: 14336
  • Vocabulary Size: 157,184
  • Max Position Embeddings: 32,768
  • Data Type: bfloat16
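A quick sanity check on the figures above: the single decoder layer's parameter count can be estimated assuming a standard Llama-style layer (GQA attention plus a SwiGLU MLP, no biases). This deliberately excludes the embedding and lm_head, which EAGLE-style drafts typically share with the target, and any EAGLE3-specific fusion layers, so it underestimates the full ~0.4B total:

```python
# Rough per-layer parameter count from the specs above.
# Assumes a standard Llama decoder layer; excludes embeddings,
# lm_head, norms, and EAGLE3-specific components.

hidden = 4096
heads, kv_heads = 32, 8
head_dim = hidden // heads           # 128
intermediate = 14336

attn = (hidden * hidden                        # q_proj
        + 2 * hidden * kv_heads * head_dim     # k_proj, v_proj (GQA)
        + hidden * hidden)                     # o_proj
mlp = 3 * hidden * intermediate                # gate, up, down projections

print(f"attention ≈ {attn/1e6:.1f}M, mlp ≈ {mlp/1e6:.1f}M, "
      f"layer total ≈ {(attn + mlp)/1e6:.1f}M")
```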

Quick Start

Requirements

  • NVIDIA H200-class GPU (or comparable hardware)
  • CUDA 12.0+
  • PyTorch 2.0+

Installation

pip install sglang==0.5.6

Note: speculative decoding with this model additionally requires the changes in PR https://github.com/sgl-project/sglang/pull/15119.

Inference with SGLang

python3 -m sglang.launch_server  \
    --model-path MedAIBase/AntAngelMed-FP8 \
    --host 0.0.0.0 --port 30012  \
    --trust-remote-code  \
    --attention-backend fa3  \
    --mem-fraction-static 0.9 \
    --tp-size 1  \
    --speculative-algorithm EAGLE3  \
    --speculative-draft-model-path MedAIBase/AntAngelMed-eagle3 \
    --speculative-num-steps 3  \
    --speculative-eagle-topk 1   \
    --speculative-num-draft-tokens 4 
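Once the server is running, it exposes an OpenAI-compatible endpoint. A minimal client sketch, using only the standard library (the host/port match the launch command above; the prompt is illustrative):

```python
# Minimal chat request against the SGLang OpenAI-compatible endpoint.
# Assumes the server launched above is reachable at localhost:30012.
import json
import urllib.request

def build_request(prompt: str) -> dict:
    """Build a chat-completions payload for the target model."""
    return {
        "model": "MedAIBase/AntAngelMed-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.0,
    }

def chat(prompt: str, host: str = "localhost", port: int = 30012) -> str:
    """Send the request and return the assistant's reply text."""
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # chat("What are common symptoms of anemia?")  # requires a running server
    print(json.dumps(build_request("What are common symptoms of anemia?"), indent=2))
```

The draft model is never addressed directly by the client; SGLang uses it transparently during decoding.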

Training Data

  • Data Quality: Medical-domain training data, rigorously filtered and cleaned

Use Cases

  • High-concurrency inference services
  • Real-time dialogue systems
  • Code generation and completion
  • Mathematical reasoning and computation
  • Production environments requiring low-latency responses

Open Source Contribution

We actively contribute back to the open-source community; the related optimizations have been submitted upstream as PR https://github.com/sgl-project/sglang/pull/15119.

Limitations and Notes

  • This is a draft model; it must be paired with its target model (e.g. MedAIBase/AntAngelMed-FP8) to perform speculative sampling
  • FP8 quantization of the target model is recommended for optimal performance
  • Performance may vary across different hardware platforms
  • Medical domain applications must comply with relevant regulations; model outputs are for reference only

License

This code repository is licensed under the MIT License.
