|
--- |
|
tags: |
|
- multimodal |
|
- NPU |
|
- On-device |
|
- Snapdragon PC |
|
- Android |
|
license: other |
|
license_name: nexa-research |
|
license_link: LICENSE |
|
--- |
|
<p align="center"> |
|
<img alt="omnineural" src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/zRUnoWmw43fl9hrXHg0pE.png"> |
|
</p> |
|
|
|
# **OmniNeural** — World’s First NPU-aware Multimodal Model |
|
|
|
|
|
## **Overview** |
|
**OmniNeural** is the first fully multimodal model designed specifically for Neural Processing Units (NPUs). It natively understands **text, images, and audio**, and runs across PC, mobile, automotive, IoT, and robotics platforms.
|
|
|
## Demos |
|
|
|
### 📱 Mobile Phone NPU - Demo on Samsung S25 Ultra |
|
The first-ever fully local, multimodal, and conversational AI assistant that hears you and sees what you see, running **natively on Snapdragon NPU** for long battery life and low latency. |
|
|
|
<video controls width="720" preload="metadata" |
|
src="https://huggingface.co/NexaAI/OmniNeural-4B/resolve/main/assets/MOBILE_50MB.mp4" |
|
type="video/mp4"></video> |
|
|
|
--- |
|
|
|
## ✨ PC NPU - Capabilities Highlights |
|
|
|
<table> |
|
<tr> |
|
<td width="33%"> |
|
<video controls width="100%" preload="metadata" |
|
src="https://huggingface.co/NexaAI/OmniNeural-4B/resolve/main/assets/PC_demo_2_image.mov"></video> |
|
<p align="center"><b>🖼️ Multi-Image Reasoning</b><br>Spot the difference across two images in multi-round dialogue.</p> |
|
</td> |
|
|
|
<td width="33%"> |
|
<video controls width="100%" preload="metadata" |
|
src="https://huggingface.co/NexaAI/OmniNeural-4B/resolve/main/assets/PC_Demo_Agent.mov"></video> |
|
<p align="center"><b>🤖 Image + Text → Function Call</b><br>Snap a poster, add a text instruction, and the AI agent creates a calendar event.</p>
|
</td> |
|
|
|
<td width="33%"> |
|
<video controls width="100%" preload="metadata" |
|
src="https://huggingface.co/NexaAI/OmniNeural-4B/resolve/main/assets/PC_Demo_Audio.mov"></video> |
|
<p align="center"><b>🎶 Multi-Audio Comparison</b><br>Tell the difference between two music clips locally.</p> |
|
</td> |
|
</tr> |
|
</table> |
|
|
|
|
|
|
|
--- |
|
|
|
## **Key Features** |
|
- **Multimodal Intelligence** – Processes **text, image, and audio** in a unified model for richer reasoning and perception. |
|
- **NPU-Optimized Architecture** – Uses ReLU activations, sparse tensor operations, convolutional layers, and static graph execution for maximum throughput — **20% faster than non-NPU-aware models**.

- **Hardware-Aware Attention** – Attention patterns tuned for the NPU, lowering compute and memory demand.

- **Native Static Graph** – Supports variable-length multimodal inputs with stable, predictable latency.

- **Performance Gains** – **9× faster audio processing** and **3.5× faster image processing** on NPUs compared to baseline encoders.
|
- **Privacy-First Inference** – All computation stays local: private, offline-capable, and cost-efficient. |
|
|
|
--- |
|
|
|
## **Performance / Benchmarks** |
|
### Human Evaluation (vs baselines) |
|
- **Vision**: Wins or ties on ~75% of prompts against the Apple Foundation Model, Gemma-3n-E4B, and Qwen2.5-Omni-3B.

- **Audio**: Clear lead over all baselines, well ahead of Gemma-3n-E4B and the Apple Foundation Model.
|
- **Text**: Matches or outperforms leading multimodal baselines. |
|
|
|
|
|
<p align="center"> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/vsrg43GxTOSAj7q_SI60o.png" width="1560" alt="Human eval chart" /> |
|
</p> |
|
|
|
### Nexa Attention Speedups |
|
- **9× faster** audio encoding (vs Whisper encoder). |
|
- **3.5× faster** image encoding (vs SigLIP encoder). |
|
|
|
|
|
<p align="center"> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/1039SN5JBQkS04z4YnoIi.png" width="400" alt="Nexa Attention speedup chart" />
|
</p> |
|
|
|
--- |
|
|
|
## **Architecture Overview** |
|
OmniNeural’s design is tightly coupled with NPU hardware: |
|
- **NPU-friendly ops** (ReLU instead of GELU/SiLU).
|
- **Sparse + small tensor multiplications** for efficiency. |
|
- **Convolutional layers** favored over linear layers for better NPU parallelization.
|
- **Hardware-aware attention** patterns to cut compute cost. |
|
- **Static graph execution** for predictable latency. |
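
Below is a minimal PyTorch-style sketch of the ideas above (ReLU activations, convolutions, and a static input shape via padding). It is purely illustrative: every module, name, and size is a hypothetical stand-in, not OmniNeural's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only (hypothetical sizes and modules, not OmniNeural code).
MAX_LEN = 512  # one fixed sequence length so the compiled NPU graph never changes shape

class NPUFriendlyBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Convolutions map well onto NPU vector/tensor units.
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        # ReLU avoids the transcendental math in GELU/SiLU and emits exact
        # zeros, which sparse tensor paths can exploit.
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.conv(x))

def pad_to_static(x: torch.Tensor) -> torch.Tensor:
    # Pad variable-length input up to MAX_LEN for static graph execution.
    return nn.functional.pad(x, (0, MAX_LEN - x.shape[-1]))

block = NPUFriendlyBlock(channels=64)
tokens = torch.randn(1, 64, 300)        # variable-length multimodal features
out = block(pad_to_static(tokens))      # fixed-shape compute, predictable latency
print(out.shape)                        # torch.Size([1, 64, 512])
```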
|
|
|
|
|
 |
|
|
|
--- |
|
|
|
## **Production Use Cases** |
|
|
|
- **PC & Mobile** – On-device AI agents combine **voice, vision, and text** for natural, accurate responses. |
|
- Examples: Summarize slides into an email (PC), extract action items from chat (mobile).
|
- Benefits: Private, offline, battery-efficient. |
|
|
|
- **Automotive** – In-car assistants handle **voice control, cabin safety, and environment awareness**. |
|
- Examples: Detect risks (unbuckled child, pet left behind, loose objects) and road conditions (fog, construction).
|
- Benefits: Decisions run locally in milliseconds. |
|
|
|
- **IoT & Robotics** – Multimodal sensing for **factories, AR/VR, drones, and robots**. |
|
- Examples: Defect detection, technician overlays, hazard spotting mid-flight, natural robot interaction. |
|
- Benefits: Works without network connectivity. |
|
|
|
--- |
|
|
|
## How to use |
|
|
|
> ⚠️ **Hardware requirement:** OmniNeural-4B currently runs **only on Qualcomm NPUs** (e.g., Snapdragon-powered AI PCs).
|
> Apple NPU support is planned next. |
|
|
|
### 1) Install Nexa-SDK |
|
|
|
- Download the SDK and follow the steps under the "Deploy" section on Nexa's model page: [Download Windows arm64 SDK](https://sdk.nexa.ai/model/OmniNeural-4B)
|
- (Other platforms coming soon) |
|
|
|
### 2) Get an access token |
|
Create a token in the Model Hub, then log in: |
|
|
|
```bash |
|
nexa config set license '<access_token>' |
|
``` |
|
|
|
### 3) Run the model |
|
Start an interactive session:
|
|
|
```bash |
|
nexa infer NexaAI/OmniNeural-4B |
|
``` |
|
|
|
**Mic mode.** Once the model is running, type `/mic` to record your voice directly in the terminal:
|
```bash |
|
> /mic |
|
``` |
|
|
|
For images and audio, simply drag your files into the command line. Remember to leave a space between file paths.
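
For example, a single turn mixing a text instruction with an audio clip and an image might look like the following (the prompt and file paths are placeholders; substitute your own files):

```bash
> Does the melody in this clip match the mood of this photo? /path/to/clip.mp3 /path/to/photo.jpg
```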
|
|
|
--- |
|
|
|
## Links & Community |
|
|
|
[](https://discord.com/invite/nexa-ai) |
|
|
|
[](https://x.com/nexa_ai) |
|
|
|
[](https://nexa.ai) |
|
|
|
- **Issues / Feedback:** Use the **HF Discussions** tab, or open an issue on our Discord or in the nexa-sdk GitHub repository.
|
- **Roadmap & updates:** Follow us on X and Discord. |
|
|
|
> If you want to see more **NPU-first, multimodal** releases on HF, please give our model a like ❤️. |
|
|
|
## Limitations

The current model is optimized mainly for English. Support for additional languages is planned as a next step.
|
|
|
--- |
|
|
|
## **Citation** |
|
|
|
```bibtex |
|
@misc{omnineural2025,
  title={OmniNeural: World’s First NPU-aware Multimodal Model},
  author={Nexa AI},
  year={2025},
  url={https://huggingface.co/NexaAI/OmniNeural-4B},
}
|
``` |