---
license: apache-2.0
language:
- en
tags:
- robotics
- vla
- lerobot
- imitation-learning
- diffusion-policy
- gemma-3
- siglip
- scaledp
- multimodal
---

# Gemma-Le: SigLIP + Gemma 3 + ScaleDP (LeRobot VLA Policy)

Gemma-Le is a compact Vision-Language-Action (VLA) policy for robotic manipulation, built on top of LeRobot.
It replaces the NVIDIA Eagle backbone with standard Hugging Face components:

- SigLIP `google/siglip-so400m-patch14-384` for vision
- Gemma 3 `google/gemma-3-4b-it` for language/reasoning (with LoRA PEFT)
- ScaleDP (Scalable Diffusion Transformer) as the action head

This repo hosts exported checkpoints trained on LeRobot-format datasets (e.g., `robot_sim.PickNPlace`).

## Architecture

- Vision: SigLIP ViT encoder (384 px input, patch size 14), pooled embedding
- Text: Gemma 3 4B-IT, mean-pooled hidden states
- LoRA: rank 16 on `[q_proj, k_proj, v_proj, o_proj]`
- Fusion: an MLP projects the concatenated `[vision || text]` embedding to `conditioning_dim=768`
- Action head: ScaleDP Transformer (layers=12, d_model=320, heads=8, ff=1280) that predicts the diffusion noise
- Temporal context: `chunk_size=8`; diffusion uses `num_diffusion_steps=50`
- Mixed precision: AMP auto-selects bf16 or fp16; the bf16 path runs without a GradScaler
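
To get a feel for the sizes implied by the list above, the following back-of-the-envelope sketch may help. It assumes non-overlapping square patches for the ViT and a standard Transformer block (biased linear layers, two LayerNorms) for the action head. The helper names are hypothetical, and the real ScaleDP blocks likely carry extra conditioning and embedding parameters that are not counted here.

```python
# Rough size estimates for the components listed above.
# Assumptions (not from the model card): non-overlapping square patches,
# and a standard Transformer block with biased linear layers and two LayerNorms.


def vit_patch_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping patches a ViT sees for a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


def transformer_block_params(d_model: int, d_ff: int) -> int:
    """Parameter count of one standard attention + feed-forward block."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V and output projections
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # scale and shift for two norms
    return attention + feed_forward + layer_norms


# Values taken from the architecture list.
patches = vit_patch_count(image_size=384, patch_size=14)
head_dim = 320 // 8
action_head_params = 12 * transformer_block_params(d_model=320, d_ff=1280)

print(f"SigLIP patches per image: {patches}")
print(f"ScaleDP head dimension:   {head_dim}")
print(f"ScaleDP block parameters: {action_head_params:,}")
```

Under these assumptions the 12 blocks come to about 14.8 million parameters, which is small next to the 4B-parameter language model.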

## Default config (excerpt)

```yaml
vision_model_id: google/siglip-so400m-patch14-384
text_model_id: google/gemma-3-4b-it
image_features: ["observation.images.ego_view"]
action_feature: "action"
chunk_size: 8
num_diffusion_steps: 50
conditioning_dim: 768
plan_update_interval: 10
scaledp_num_layers: 12
scaledp_dim_model: 320
scaledp_num_heads: 8
scaledp_dim_feedforward: 1280
use_lora: true
lora_rank: 16
lora_target_modules: ["q_proj","k_proj","v_proj","o_proj"]
optimizer_lr: 1e-4
optimizer_weight_decay: 1e-6
```
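
The excerpt can be mirrored as a small typed object to catch inconsistent overrides early. This is only a sketch: `GemmaLeConfigSketch` is a hypothetical name rather than the fork's actual config class, and the checks are assumptions about what a valid configuration looks like.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GemmaLeConfigSketch:
    """Hypothetical mirror of the YAML excerpt; not the fork's real config class."""

    vision_model_id: str = "google/siglip-so400m-patch14-384"
    text_model_id: str = "google/gemma-3-4b-it"
    image_features: list[str] = field(
        default_factory=lambda: ["observation.images.ego_view"]
    )
    action_feature: str = "action"
    chunk_size: int = 8
    num_diffusion_steps: int = 50
    conditioning_dim: int = 768
    plan_update_interval: int = 10
    scaledp_num_layers: int = 12
    scaledp_dim_model: int = 320
    scaledp_num_heads: int = 8
    scaledp_dim_feedforward: int = 1280
    use_lora: bool = True
    lora_rank: int = 16
    lora_target_modules: list[str] = field(
        default_factory=lambda: ["q_proj", "k_proj", "v_proj", "o_proj"]
    )
    optimizer_lr: float = 1e-4
    optimizer_weight_decay: float = 1e-6

    def __post_init__(self) -> None:
        # Some YAML 1.1 loaders (PyYAML, for example) read `1e-4` as a string,
        # so coerce the optimizer settings to float defensively.
        self.optimizer_lr = float(self.optimizer_lr)
        self.optimizer_weight_decay = float(self.optimizer_weight_decay)
        # Assumed constraint: attention heads must split d_model evenly.
        if self.scaledp_dim_model % self.scaledp_num_heads != 0:
            raise ValueError("scaledp_dim_model must be divisible by scaledp_num_heads")
        if self.use_lora and self.lora_rank < 1:
            raise ValueError("lora_rank must be positive when use_lora is true")

    @property
    def scaledp_head_dim(self) -> int:
        """Per-head width of the ScaleDP attention layers."""
        return self.scaledp_dim_model // self.scaledp_num_heads


cfg = GemmaLeConfigSketch()
print(cfg.scaledp_head_dim, cfg.optimizer_lr)
```

An override such as `GemmaLeConfigSketch(scaledp_num_heads=7)` fails immediately with a `ValueError` instead of surfacing later as a shape mismatch.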

## Usage (with this repo’s LeRobot fork)

Install the dependencies and set `PYTHONPATH` to include the `lerobot` directory of this repository.

Evaluation-style load:

```python
import torch
from huggingface_hub import snapshot_download

from lerobot.common.policies.gemma_le.modeling_gemma_le import GemmaLePolicy

# Download a snapshot of this repo from the Hugging Face Hub.
ckpt_dir = snapshot_download(repo_id="Ryukijano/gemma-groot", revision="main")

# Load the policy in bfloat16 and switch it to inference mode.
policy = GemmaLePolicy.from_pretrained(ckpt_dir, torch_dtype=torch.bfloat16)
policy.eval()
```

Training entrypoint:

```bash
python lerobot/lerobot/scripts/train.py \
  --policy.type gemma_le \
  --dataset.repo_id local/robot_sim.PickNPlace \
  --dataset.root /path/to/robot_sim.PickNPlace \
  --dataset.episodes "[0,1,2,3,4]" \
  --batch_size 3 \
  --steps 200000 \
  --log_freq 100 \
  --save_freq 5000 \
  --policy.vision_model_id google/siglip-so400m-patch14-384 \
  --policy.text_model_id google/gemma-3-4b-it \
  --policy.use_amp true \
  --progress_bar true \
  --push_to_hub true \
  --push_repo_id Ryukijano/gemma-groot \
  --push_branch main \
  --push_exist_ok true
```

### Slurm (3× L40)

See `submit_job.sh`. Keep the caches on scratch storage and set:

- `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
- `HF_HOME`, `HUGGINGFACE_HUB_CACHE`, and `TRANSFORMERS_CACHE` to paths on scratch
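
If the job is launched from a Python wrapper rather than directly from `submit_job.sh`, the same variables can be set in-process. The sketch below is illustrative only: the `SCRATCH` variable, the fallback path, and the directory layout under it are assumptions, not something `submit_job.sh` is known to use.

```python
import os
from pathlib import PurePosixPath


def cache_environment(scratch_root: str) -> dict[str, str]:
    """Build the environment variables listed above for a scratch directory."""
    hf_home = PurePosixPath(scratch_root) / "hf_home"  # hypothetical layout
    return {
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "HF_HOME": str(hf_home),
        "HUGGINGFACE_HUB_CACHE": str(hf_home / "hub"),
        "TRANSFORMERS_CACHE": str(hf_home / "transformers"),
    }


# Apply the settings before importing torch or transformers so they take effect.
# `SCRATCH` and the fallback path are placeholders for the cluster's scratch area.
env = cache_environment(os.environ.get("SCRATCH", "/tmp/scratch"))
os.environ.update(env)

for name, value in env.items():
    print(f"{name}={value}")
```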

## Checkpoints

- The latest runs are uploaded under `runs/<date>/<run>/<step>` in this repo.
- Example: `runs/2025-08-12/13-06-07_gemma_le/020000/`.
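
The layout above is regular enough to parse when selecting a checkpoint programmatically. The following sketch uses hypothetical helper names and assumes that dates are ISO-formatted and that run names start with a sortable time stamp, as in the example path.

```python
from typing import NamedTuple


class CheckpointRef(NamedTuple):
    """Components of a `runs/<date>/<run>/<step>` path."""

    date: str
    run: str
    step: int


def parse_checkpoint_path(path: str) -> CheckpointRef:
    """Split a checkpoint path into date, run name, and integer step."""
    parts = [p for p in path.split("/") if p]  # tolerate a trailing slash
    if len(parts) != 4 or parts[0] != "runs":
        raise ValueError(f"expected runs/<date>/<run>/<step>, got {path!r}")
    _, date, run, step = parts
    return CheckpointRef(date=date, run=run, step=int(step))


def latest_checkpoint(paths: list[str]) -> str:
    """Pick the newest path, ordering by (date, run, step)."""
    return max(paths, key=parse_checkpoint_path)


ref = parse_checkpoint_path("runs/2025-08-12/13-06-07_gemma_le/020000/")
print(ref)
```

`snapshot_download` also accepts an `allow_patterns` argument, so a pattern such as `runs/2025-08-12/13-06-07_gemma_le/020000/*` should restrict the download to a single step instead of fetching every run.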

## Data

- Format: LeRobotDataset (parquet + mp4 + metadata) with a single RGB view, `observation.images.ego_view`, and `action` as the target.
- During training, the timestamp tolerance is automatically relaxed to `max(tolerance_s, 1/fps + 1e-4)` for robust decoding.
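
The relaxation rule is a one-liner. The sketch below writes it out with a hypothetical function name to show how the effective tolerance changes with the frame rate.

```python
def relaxed_tolerance_s(tolerance_s: float, fps: float) -> float:
    """Return max(tolerance_s, 1/fps + 1e-4), the rule quoted above."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return max(tolerance_s, 1.0 / fps + 1e-4)


# A tight tolerance is widened to slightly more than one frame period,
# while a tolerance that is already larger is left unchanged.
for tol, fps in [(1e-4, 30), (1e-4, 10), (0.5, 30)]:
    print(f"tolerance_s={tol}, fps={fps} -> {relaxed_tolerance_s(tol, fps):.6f}")
```

In effect, a requested timestamp may be off by up to one frame period, plus a small margin, before it is rejected.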

## Notes

- Base model access: `google/gemma-3-4b-it` may require accepting its terms of use on the Hugging Face Hub before download.
- Intended for imitation learning; ThinkAct-style planning can be layered on top.

## Citations

- LeRobot: https://github.com/huggingface/lerobot
- Gemma 3: https://ai.google.dev/gemma
- SigLIP: https://huggingface.co/timm/ViT-SigLIP
- Diffusion Policy: https://arxiv.org/abs/2303.04137