---
tags:
- multimodal
- NPU
- On-device
- Snapdragon PC
- Android
license: other
license_name: nexa-research
license_link: LICENSE
---
<p align="center">
<img alt="omnineural" src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/zRUnoWmw43fl9hrXHg0pE.png">
</p>
# **OmniNeural** — World’s First NPU-aware Multimodal Model
## **Overview**
**OmniNeural** is the first fully multimodal model designed specifically for Neural Processing Units (NPUs). It natively understands **text, images, and audio**, and runs across PCs, mobile devices, automotive systems, IoT, and robotics.
## Demos
### 📱 Mobile Phone NPU - Demo on Samsung Galaxy S25 Ultra
The first-ever fully local, multimodal, and conversational AI assistant that hears you and sees what you see, running **natively on Snapdragon NPU** for long battery life and low latency.
<video controls width="720" preload="metadata"
src="https://huggingface.co/NexaAI/OmniNeural-4B/resolve/main/assets/MOBILE_50MB.mp4"></video>
---
## ✨ PC NPU - Capability Highlights
<table>
<tr>
<td width="33%">
<video controls width="100%" preload="metadata"
src="https://huggingface.co/NexaAI/OmniNeural-4B/resolve/main/assets/PC_demo_2_image.mov"></video>
<p align="center"><b>🖼️ Multi-Image Reasoning</b><br>Spot the difference across two images in multi-round dialogue.</p>
</td>
<td width="33%">
<video controls width="100%" preload="metadata"
src="https://huggingface.co/NexaAI/OmniNeural-4B/resolve/main/assets/PC_Demo_Agent.mov"></video>
<p align="center"><b>🤖 Image + Text → Function Call</b><br>Snap a poster, add a text instruction, and AI agent creates a calendar event.</p>
</td>
<td width="33%">
<video controls width="100%" preload="metadata"
src="https://huggingface.co/NexaAI/OmniNeural-4B/resolve/main/assets/PC_Demo_Audio.mov"></video>
<p align="center"><b>🎶 Multi-Audio Comparison</b><br>Tell the difference between two music clips locally.</p>
</td>
</tr>
</table>
---
## **Key Features**
- **Multimodal Intelligence** – Processes **text, image, and audio** in a unified model for richer reasoning and perception.
- **NPU-Optimized Architecture** – Uses ReLU ops, sparse tensors, convolutional layers, and static graph execution for maximum throughput: **20% faster than non-NPU-aware models**.
- **Hardware-Aware Attention** – Attention patterns tuned for NPUs, lowering compute and memory demand.
- **Native Static Graph** – Supports variable-length multimodal inputs with stable, predictable latency.
- **Performance Gains** – **9× faster audio processing** and **3.5× faster image processing** on NPUs compared to baseline encoders.
- **Privacy-First Inference** – All computation stays local: private, offline-capable, and cost-efficient.
---
## **Performance / Benchmarks**
### Human Evaluation (vs baselines)
- **Vision**: Wins or ties in ~75% of prompts against Apple Foundation, Gemma-3n-E4B, and Qwen2.5-Omni-3B.
- **Audio**: Clear lead over the baselines, with the largest margins over Gemma-3n and the Apple foundation model.
- **Text**: Matches or outperforms leading multimodal baselines.
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/vsrg43GxTOSAj7q_SI60o.png" width="1560" alt="Human eval chart" />
</p>
### Nexa Attention Speedups
- **9× faster** audio encoding (vs Whisper encoder).
- **3.5× faster** image encoding (vs SigLIP encoder).
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/1039SN5JBQkS04z4YnoIi.png" width="400" alt="Human eval chart" />
</p>
---
## **Architecture Overview**
OmniNeural’s design is tightly coupled with NPU hardware:
- **NPU-friendly ops** (ReLU rather than GELU/SiLU).
- **Sparse and small tensor multiplications** for efficiency.
- **Convolutional layers** favored over linear layers for better NPU parallelization.
- **Hardware-aware attention** patterns to cut compute cost.
- **Static graph execution** for predictable latency.

---
## **Production Use Cases**
- **PC & Mobile** – On-device AI agents combine **voice, vision, and text** for natural, accurate responses.
  - Examples: summarize slides into an email (PC); extract action items from a chat (mobile).
  - Benefits: private, offline, battery-efficient.
- **Automotive** – In-car assistants handle **voice control, cabin safety, and environment awareness**.
  - Examples: detect risks (child unbuckled, pet left behind, loose objects) and road conditions (fog, construction).
  - Benefits: decisions run locally in milliseconds.
- **IoT & Robotics** – Multimodal sensing for **factories, AR/VR, drones, and robots**.
  - Examples: defect detection, technician overlays, hazard spotting mid-flight, natural robot interaction.
  - Benefits: works without network connectivity.
---
## How to use
> ⚠️ **Hardware requirement:** OmniNeural-4B currently runs **only on Qualcomm NPUs** (e.g., a Snapdragon-powered AI PC).
> Apple NPU support is planned next.
### 1) Install Nexa-SDK
- Download the SDK and follow the steps under the "Deploy" section on Nexa's model page: [Download Windows arm64 SDK](https://sdk.nexa.ai/model/OmniNeural-4B)
- (Other platforms coming soon)
### 2) Get an access token
Create a token in the Model Hub, then log in:
```bash
nexa config set license '<access_token>'
```
### 3) Run the model
Start an interactive session:
```bash
nexa infer NexaAI/OmniNeural-4B
```
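Once loaded, nexa-sdk drops you into an interactive prompt where you can chat with the model. A minimal text-only turn might look like this (the question is just an illustration):
```bash
> Summarize what NPU-aware models are good for, in two sentences.
```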
To use your voice instead of typing, enter `/mic` mode: once the model is running, type the command below to record your voice directly in the terminal.
```bash
> /mic
```
For images and audio, simply drag your files into the command line, or type their paths; leave a space between file paths.
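For example, a hypothetical multimodal turn combining a text instruction with an image and an audio file (the paths below are placeholders):
```bash
> Does this music fit the scene in the photo? /path/to/photo.jpg /path/to/clip.mp3
```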
---
## Links & Community
[Discord](https://discord.com/invite/nexa-ai) · [X (Twitter)](https://x.com/nexa_ai) · [Website](https://nexa.ai)
- **Issues / Feedback:** Use the **HF Discussions** tab, or open an issue on our Discord or the nexa-sdk GitHub.
- **Roadmap & updates:** Follow us on X and Discord.
> If you want to see more **NPU-first, multimodal** releases on HF, please give our model a like ❤️.
## Limitations
The current model is optimized mainly for English; optimizing for other languages is our next step.
---
## **Citation**
```bibtex
@misc{omnineural2025,
title={OmniNeural: World’s First NPU-aware Multimodal Model},
author={Nexa AI},
year={2025},
url={https://huggingface.co/NexaAI/OmniNeural-4B},
}
```