---
tags:
- multimodal
- NPU
- On-device
- Snapdragon PC
- Android
license: other
license_name: nexa-research
license_link: LICENSE
---
<p align="center">
  <img alt="omnineural" src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/zRUnoWmw43fl9hrXHg0pE.png">
</p>

# **OmniNeural** — World’s First NPU-aware Multimodal Model


## **Overview**  
**OmniNeural** is the first fully multimodal model designed specifically for Neural Processing Units (NPUs). It natively understands **text, images, and audio**, and runs across PC, mobile, automotive, IoT, and robotics platforms.  

## Demos

### 📱 Mobile Phone NPU - Demo on Samsung S25 Ultra
The first-ever fully local, multimodal, and conversational AI assistant that hears you and sees what you see, running **natively on Snapdragon NPU** for long battery life and low latency.

<video controls width="720" preload="metadata"
  src="https://huggingface.co/NexaAI/OmniNeural-4B/resolve/main/assets/MOBILE_50MB.mp4"
  type="video/mp4"></video>

---

## ✨ PC NPU - Capabilities Highlights

<table>
<tr>
<td width="33%">
<video controls width="100%" preload="metadata"
  src="https://huggingface.co/NexaAI/OmniNeural-4B/resolve/main/assets/PC_demo_2_image.mov"></video>
<p align="center"><b>🖼️ Multi-Image Reasoning</b><br>Spot the difference across two images in multi-round dialogue.</p>
</td>

<td width="33%">
<video controls width="100%" preload="metadata"
  src="https://huggingface.co/NexaAI/OmniNeural-4B/resolve/main/assets/PC_Demo_Agent.mov"></video>
<p align="center"><b>🤖 Image + Text → Function Call</b><br>Snap a poster, add a text instruction, and AI agent creates a calendar event.</p>
</td>

<td width="33%">
<video controls width="100%" preload="metadata"
  src="https://huggingface.co/NexaAI/OmniNeural-4B/resolve/main/assets/PC_Demo_Audio.mov"></video>
<p align="center"><b>🎶 Multi-Audio Comparison</b><br>Tell the difference between two music clips locally.</p>
</td>
</tr>
</table>



---

## **Key Features**  
- **Multimodal Intelligence** – Processes **text, image, and audio** in a unified model for richer reasoning and perception.  
- **NPU-Optimized Architecture** – Uses ReLU ops, sparse tensors, convolutional layers, and static graph execution for maximum throughput — **20% faster than non-NPU-aware models**.  
- **Hardware-Aware Attention** – Attention patterns tuned for NPUs, lowering compute and memory demand.  
- **Native Static Graph** – Supports variable-length multimodal inputs with stable, predictable latency (see the sketch after this list).  
- **Performance Gains** – **9× faster audio processing** and **3.5× faster image processing** on NPUs compared to baseline encoders.  
- **Privacy-First Inference** – All computation stays local: private, offline-capable, and cost-efficient.
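
To make the static-graph claim concrete, here is a minimal, hypothetical sketch of one common approach: pad each variable-length input up to the nearest of a few pre-compiled bucket shapes. The bucket sizes and tensor shapes are our own assumptions for illustration, not Nexa's implementation.

```python
import torch

# Assumed bucket sizes for illustration; a real deployment would tune these.
BUCKETS = (128, 256, 512)

def pad_to_bucket(tokens: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Pad a [seq, dim] tensor up to the smallest bucket that fits it."""
    seq, dim = tokens.shape
    bucket = next(b for b in BUCKETS if b >= seq)  # assumes seq <= max(BUCKETS)
    padded = torch.zeros(bucket, dim, dtype=tokens.dtype)
    padded[:seq] = tokens
    mask = torch.zeros(bucket, dtype=torch.bool)
    mask[:seq] = True  # marks real tokens so attention can ignore padding
    return padded, mask

x = torch.randn(200, 64)         # variable-length input (200 tokens)
padded, mask = pad_to_bucket(x)  # fixed shape (256, 64): one static graph per bucket
```

Because only a handful of fixed shapes ever reach the NPU, each graph can be compiled ahead of time, which is what keeps latency stable and predictable.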

---

## **Performance / Benchmarks**  
### Human Evaluation (vs baselines)   
- **Vision**: Wins or ties in ~75% of prompts against Apple Foundation, Gemma-3n-E4B, and Qwen2.5-Omni-3B.  
- **Audio**: Clear lead over the baselines, well ahead of Gemma-3n and the Apple foundation model.  
- **Text**: Matches or outperforms leading multimodal baselines.  


<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/vsrg43GxTOSAj7q_SI60o.png" width="1560" alt="Human eval chart" />
</p>

### Nexa Attention Speedups   
- **9× faster** audio encoding (vs Whisper encoder).  
- **3.5× faster** image encoding (vs SigLIP encoder).  


<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/1039SN5JBQkS04z4YnoIi.png" width="400" alt="Nexa attention speedup chart" />
</p>

---

## **Architecture Overview**  
OmniNeural’s design is tightly coupled to NPU hardware; a minimal code sketch of these choices follows the diagram below.  
- **NPU-friendly ops** (ReLU over GELU/SiLU).  
- **Sparse + small tensor multiplications** for efficiency.  
- **Convolutional layers** favored over linear layers for better NPU parallelization.  
- **Hardware-aware attention** patterns to cut compute cost.  
- **Static graph execution** for predictable latency.


![image/png](https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/oINYbgXILJgTuKxKc1aO_.png)
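
The sketch below illustrates the conv-plus-ReLU and static-graph choices in PyTorch. It is our own simplified example (the layer sizes, kernel size, and tracing step are assumptions), not Nexa's actual implementation.

```python
import torch
import torch.nn as nn

class NpuFriendlyBlock(nn.Module):
    """Toy block combining the listed choices: convolution + ReLU."""
    def __init__(self, channels: int = 256):
        super().__init__()
        # Convolution rather than a plain linear layer: convs map well onto
        # NPU vector units and reuse weights across positions.
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        # ReLU rather than GELU/SiLU: just max(0, x), no transcendental math.
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.conv(x))

# Static graph execution: trace once with a fixed example shape so an
# ahead-of-time compiler can plan memory and latency.
block = NpuFriendlyBlock()
example = torch.randn(1, 256, 128)              # fixed [batch, channels, seq]
static_graph = torch.jit.trace(block, example)  # freezes one compute graph
```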

---

## **Production Use Cases**  

- **PC & Mobile** – On-device AI agents combine **voice, vision, and text** for natural, accurate responses.  
   - Examples: Summarize slides into an email (PC), extract action items from chat (mobile).  
   - Benefits: Private, offline, battery-efficient.  

- **Automotive** – In-car assistants handle **voice control, cabin safety, and environment awareness**.  
   - Examples: Detects risks (child unbuckled, pet left, loose objects) and road conditions (fog, construction).  
   - Benefits: Decisions run locally in milliseconds.  

- **IoT & Robotics** – Multimodal sensing for **factories, AR/VR, drones, and robots**.  
   - Examples: Defect detection, technician overlays, hazard spotting mid-flight, natural robot interaction.  
   - Benefits: Works without network connectivity.  

---

## How to use

> ⚠️ **Hardware requirement:** OmniNeural-4B currently runs **only on Qualcomm NPUs** (e.g., Snapdragon-powered AI PCs).  
> Apple NPU support is planned next.

### 1) Install Nexa-SDK

- Download the SDK and follow the steps under the "Deploy" section on Nexa's model page: [Download Windows arm64 SDK](https://sdk.nexa.ai/model/OmniNeural-4B)
- (Other platforms coming soon)

### 2) Get an access token
Create a token in the Model Hub, then log in:

```bash
nexa config set license '<access_token>'
```

### 3) Run the model

```bash
nexa infer NexaAI/OmniNeural-4B
```

**/mic mode.** Once the model is running, type `/mic` at the prompt to record your voice directly in the terminal:
```bash
> /mic
```

For images and audio, simply drag your files into the command line, or type their paths as part of the prompt, e.g. `> What changed between these photos? photo1.jpg photo2.jpg` (file names illustrative). Remember to leave a space between file paths.

---

## Links & Community

[![Discord](https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&logoColor=white)](https://discord.com/invite/nexa-ai)

[![X (Twitter) Follow](https://img.shields.io/badge/Follow-@nexa_ai-111?logo=x&logoColor=white)](https://x.com/nexa_ai)

[![Website](https://img.shields.io/badge/Website-nexa.ai-0A84FF)](https://nexa.ai)

- **Issues / Feedback:** Use the **HF Discussions** tab, or submit an issue on our Discord or the nexa-sdk GitHub.
- **Roadmap & updates:** Follow us on X and Discord.

> If you want to see more **NPU-first, multimodal** releases on HF, please give our model a like ❤️.

## Limitations
The current model is optimized mainly for English. Support for other languages is planned as a next step.

---

## **Citation**  

```bibtex
@misc{nexaai2025omnineural,
      title={OmniNeural: World’s First NPU-aware Multimodal Model}, 
      author={Nexa AI},
      year={2025},
      url={https://huggingface.co/NexaAI/OmniNeural-4B}, 
}
```