---
license: apache-2.0
datasets:
- HuggingFaceH4/ultrachat_200k
- yahma/alpaca-cleaned
language:
- en
pipeline_tag: text-generation
tags:
- mesh
- moe
- mesh-labs
- alpha
- preview
- research
- experiment
- routing
- innovative
- innovation
- mesh-moe
- custom_code
new_version: mesh-labs/v0.1-2x2-stage003
---

# Mesh-v0.1-2x2 (Stage 002)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6747320df82ae35f0327cdd3/ZNkZsfTe-tVUmp2a-SfNa.png)

## Introducing mesh

This is our first-ever model! Allow us to explain in detail how the `mesh` architecture works.

- Neural Mesh extends the Mixture of Experts concept by allowing bidirectional communication between experts.

- The experts are arranged in a two-dimensional grid (2x2, 4x4, etc.), which lets each expert communicate with its neighbors through the "Neighbor Exchange" method.

- Just like MoE models, Mesh models use dynamic routing: the `routing_k` parameter controls how many experts are active per token, and with it the number of active parameters (see the usage sketch after this list). For this model (2x2):

  - top-1 routing: 173M active parameters

  - top-2 routing: 242M active parameters (default)

  - dense routing: 302M active parameters
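
As a usage sketch, here is how loading the model and changing the routing mode might look. This assumes the checkpoint follows the standard Transformers custom-code flow and exposes `routing_k` on its config; the repo id below is a hypothetical placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id; replace with the actual checkpoint name.
repo_id = "mesh-labs/v0.1-2x2-stage002"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
# trust_remote_code=True is required because mesh ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# Assumption: routing_k selects how many experts are active per token and is
# read at inference time. On the 2x2 grid: 1 -> ~173M, 2 -> ~242M (default),
# 4 (dense) -> ~302M active parameters.
model.config.routing_k = 2

inputs = tokenizer("Hello, mesh!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```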

## How the mesh architecture works

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6747320df82ae35f0327cdd3/zSpJ3TlHpYa9fivM7aYSC.png)
## Evaluation |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/6747320df82ae35f0327cdd3/gYBBCS2d7mUCvSFHE8fBc.png" width="512px"/> |
|
|
|
## Disclaimer |
|
This small language model is just a proof of concept, paving the way for the final release, which is likely to happen in Q4 2025 and to include more models and better support from external libraries such as Transformers and llama.cpp.