AntAngelMed-eagle3

Model Overview

AntAngelMed-eagle3 is a draft model designed for inference acceleration. It uses EAGLE3 speculative sampling to substantially increase inference throughput while preserving the target model's output quality.

The model is trained on high-quality medical datasets, significantly boosting inference throughput while maintaining accuracy, making it well suited to high-load production environments.

Key Features

  • Speculative Sampling Optimization: Built on EAGLE3, achieving a high acceptance rate with a speculative length of 4
  • Outstanding Throughput Performance: FP8 quantization + EAGLE3 yields throughput improvements of up to ~90%
  • Production-Grade Optimization: Achieves 3267 tokens/s output throughput on a single NVIDIA H200

Performance

Speculative Sampling Efficiency

Average acceptance length at a speculative length of 4:

| Benchmark | Average Acceptance Length |
|---|---|
| HumanEval | 2.816 |
| GSM8K | 3.24 |
| Math-500 | 3.326 |
| Med_MCPA | 2.600 |
| Health_Bench | 2.446 |
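To see how acceptance length translates into end-to-end speedup, here is a back-of-envelope sketch. The formula and the draft-to-target cost ratio `c` are illustrative assumptions, not measurements from the card:

```python
# Rough speedup model for speculative decoding.
# Assumptions (not from the model card): each round drafts k tokens,
# the target verifies them in one forward pass, and one draft forward
# costs a fraction `c` of a target forward.

def estimated_speedup(accept_len: float, k: int = 4, c: float = 0.05) -> float:
    """Tokens produced per round, divided by the round's relative cost.

    Plain autoregressive decoding yields 1 token per target forward;
    one speculative round yields ~accept_len tokens at the cost of
    one target verification plus k draft forwards.
    """
    return accept_len / (1 + k * c)

acceptance = {
    "HumanEval": 2.816,
    "GSM8K": 3.24,
    "Math-500": 3.326,
    "Med_MCPA": 2.600,
    "Health_Bench": 2.446,
}

for bench, tau in acceptance.items():
    print(f"{bench:13s} ~{estimated_speedup(tau):.2f}x")
```

Under these assumptions the higher acceptance lengths on math-style benchmarks line up with the larger throughput gains reported below.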

Throughput Improvement

Throughput improvement of FP8 quantization + EAGLE3 over an FP8-only baseline at a concurrency of 16:

| Benchmark | Throughput Improvement |
|---|---|
| HumanEval | +67.3% |
| GSM8K | +58.6% |
| Math-500 | +89.8% |
| Med_MCPA | +46.0% |
| Health_Bench | +45.3% |

Peak Inference Performance

  • Hardware Environment: NVIDIA H200 single GPU


Figure: Throughput performance comparison and accuracy metrics under equal compute on 1xH200

Technical Specifications

  • Model Architecture: LlamaForCausalLMEagle3
  • Parameters: ~0.4B
  • Number of Layers: 1 (draft model)
  • Hidden Size: 4096
  • Attention Heads: 32 (KV heads: 8)
  • Intermediate Size: 14336
  • Vocabulary Size: 157,184
  • Max Position Embeddings: 32,768
  • Data Type: bfloat16
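A quick sanity check on the figures above: the single decoder layer's parameter count can be estimated assuming a standard Llama-style layer (GQA attention plus a SwiGLU MLP, no biases). This deliberately excludes the embedding and lm_head, which EAGLE-style drafts typically share with the target, and any EAGLE3-specific fusion layers, so it underestimates the full ~0.4B total:

```python
# Rough per-layer parameter count from the specs above.
# Assumes a standard Llama decoder layer; excludes embeddings,
# lm_head, norms, and EAGLE3-specific components.

hidden = 4096
heads, kv_heads = 32, 8
head_dim = hidden // heads           # 128
intermediate = 14336

attn = (hidden * hidden                        # q_proj
        + 2 * hidden * kv_heads * head_dim     # k_proj, v_proj (GQA)
        + hidden * hidden)                     # o_proj
mlp = 3 * hidden * intermediate                # gate, up, down projections

print(f"attention ≈ {attn/1e6:.1f}M, mlp ≈ {mlp/1e6:.1f}M, "
      f"layer total ≈ {(attn + mlp)/1e6:.1f}M")
```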

Quick Start

Requirements

  • NVIDIA H200-class GPU (or comparable hardware)
  • CUDA 12.0+
  • PyTorch 2.0+

Installation

pip install sglang==0.5.6

Note: speculative decoding with this model additionally requires the changes in PR https://github.com/sgl-project/sglang/pull/15119.

Inference with SGLang

python3 -m sglang.launch_server  \
    --model-path MedAIBase/AntAngelMed-FP8 \
    --host 0.0.0.0 --port 30012  \
    --trust-remote-code  \
    --attention-backend fa3  \
    --mem-fraction-static 0.9 \
    --tp-size 1  \
    --speculative-algorithm EAGLE3  \
    --speculative-draft-model-path MedAIBase/AntAngelMed-eagle3 \
    --speculative-num-steps 3  \
    --speculative-eagle-topk 1   \
    --speculative-num-draft-tokens 4 
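Once the server is running, it exposes an OpenAI-compatible endpoint. A minimal client sketch, using only the standard library (the host/port match the launch command above; the prompt is illustrative):

```python
# Minimal chat request against the SGLang OpenAI-compatible endpoint.
# Assumes the server launched above is reachable at localhost:30012.
import json
import urllib.request

def build_request(prompt: str) -> dict:
    """Build a chat-completions payload for the target model."""
    return {
        "model": "MedAIBase/AntAngelMed-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.0,
    }

def chat(prompt: str, host: str = "localhost", port: int = 30012) -> str:
    """Send the request and return the assistant's reply text."""
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # chat("What are common symptoms of anemia?")  # requires a running server
    print(json.dumps(build_request("What are common symptoms of anemia?"), indent=2))
```

The draft model is never addressed directly by the client; SGLang uses it transparently during decoding.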

Training Data

  • Data Quality: Medical-domain training data, rigorously filtered and cleaned

Use Cases

  • High-concurrency inference services
  • Real-time dialogue systems
  • Code generation and completion
  • Mathematical reasoning and computation
  • Production environments requiring low-latency responses

Open Source Contribution

We actively contribute back to the open-source community; the related optimizations have been submitted upstream as PR https://github.com/sgl-project/sglang/pull/15119.

Limitations and Notes

  • This is a draft model; it must be paired with its target model (e.g. MedAIBase/AntAngelMed-FP8) to perform speculative sampling
  • FP8 quantization of the target model is recommended for optimal performance
  • Performance may vary across different hardware platforms
  • Medical domain applications must comply with relevant regulations; model outputs are for reference only

License

This code repository is licensed under the MIT License.
