Upload README.md with huggingface_hub
README.md CHANGED
@@ -15,6 +15,10 @@ tags:
 - Stable Diffusion
 - quantization
 - fp8
+- 8-bit
+- e4m3
+- reduced-precision
+base_model: Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0
 inference:
   parameters:
     torch_dtype: torch.float8_e4m3fn
@@ -24,8 +28,17 @@ inference:
 
 This repository contains an FP8 quantized version of the [Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0](https://huggingface.co/Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0) model. **This is NOT a fine-tuned model** but a direct quantization of the original BFloat16 model to FP8 format for optimized inference performance. We provide an [online demo](https://huggingface.co/spaces/Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0).
 
-#
-This model has been quantized from the original BFloat16 format to FP8 format.
+# Quantization Details
+This model has been quantized from the original BFloat16 format to FP8 format using PyTorch's native FP8 support. Here are the specifics:
+
+- **Quantization Technique**: Native FP8 quantization
+- **Precision**: E4M3 format (4 bits for exponent, 3 bits for mantissa)
+- **Library Used**: PyTorch's built-in FP8 support
+- **Data Type**: `torch.float8_e4m3fn`
+- **Original Model**: BFloat16 format (Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0)
+- **Model Size Reduction**: ~50% smaller than the original model
+
+The benefits of FP8 quantization include:
 - **Reduced Memory Usage**: Approximately 50% smaller model size compared to BFloat16/FP16
 - **Faster Inference**: Potential speed improvements, especially on hardware with FP8 support
 - **Minimal Quality Loss**: Carefully calibrated quantization process to preserve output quality
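
For reference, a minimal PyTorch sketch of the storage cast the new "Quantization Details" section describes: a BFloat16 weight is stored as `torch.float8_e4m3fn` (1 byte per element instead of 2, hence the ~50% size reduction) and upcast again for compute. The tensor below is a stand-in, not an actual weight from this checkpoint, and the snippet assumes a PyTorch build with float8 dtypes (2.1+).

```python
import torch

# Stand-in for a single BFloat16 weight tensor from the original checkpoint.
weight_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16)

# Storage cast: BFloat16 -> FP8 E4M3 (torch.float8_e4m3fn), PyTorch's native FP8 dtype.
weight_fp8 = weight_bf16.to(torch.float8_e4m3fn)

# 1 byte per element vs. 2 bytes for BFloat16 -> roughly half the memory.
print(weight_bf16.element_size(), "->", weight_fp8.element_size())  # 2 -> 1

# At inference time the FP8 weight is upcast back to BFloat16 for compute,
# unless the hardware and kernels support FP8 matmuls directly.
weight_compute = weight_fp8.to(torch.bfloat16)
```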
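And a hedged loading sketch for the `torch_dtype` inference parameter in the YAML above, following the usual diffusers pattern for FLUX ControlNets. The repo id is a placeholder, and the assumption that diffusers upcasts the FP8-stored weights to BFloat16 on load is illustrative, not confirmed by this commit.

```python
import torch
from diffusers import FluxControlNetModel, FluxControlNetPipeline

# Placeholder: substitute the id of this FP8 repository.
fp8_repo = "<this-fp8-controlnet-repo>"

# Assumption: diffusers reads the FP8-stored weights and upcasts them to BFloat16.
controlnet = FluxControlNetModel.from_pretrained(fp8_repo, torch_dtype=torch.bfloat16)

# The ControlNet targets FLUX.1-dev, so it is paired with that base pipeline.
pipe = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    controlnet=controlnet,
    torch_dtype=torch.bfloat16,
).to("cuda")
```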