---
license: apache-2.0
library_name: lerobot
pipeline_tag: robotics
tags:
- robotics
- lerobot
- act
- imitation-learning
- so101
model_name: act
datasets: r2owb0/so101-DS1
---

# ACT Model for SO101 Robot

This is an Action Chunking Transformer (ACT) model for the SO101 robot, trained with LeRobot on demonstration data collected during teleoperation sessions.

## Model Details

### Architecture
- **Model Type**: Action Chunking Transformer (ACT)
- **Vision Backbone**: ResNet18 with ImageNet-pretrained weights
- **Transformer Configuration**:
  - Hidden dimension: 512
  - Number of heads: 8
  - Encoder layers: 4
  - Decoder layers: 1
  - Feedforward dimension: 3200
- **VAE**: Enabled, with a 32-dimensional latent space
- **Chunk Size**: 50 steps
- **Action Steps**: 15 steps per inference call (see the configuration sketch below)
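
For reference, here is a minimal sketch of how these hyperparameters map onto LeRobot's `ACTConfig`. The field names follow a recent lerobot release and may differ slightly between versions:

```python
from lerobot.policies.act.configuration_act import ACTConfig

# Hyperparameters from the list above, expressed as an ACTConfig.
config = ACTConfig(
    vision_backbone="resnet18",
    pretrained_backbone_weights="ResNet18_Weights.IMAGENET1K_V1",
    dim_model=512,        # hidden dimension
    n_heads=8,
    n_encoder_layers=4,
    n_decoder_layers=1,
    dim_feedforward=3200,
    use_vae=True,
    latent_dim=32,
    chunk_size=50,        # actions predicted per forward pass
    n_action_steps=15,    # actions executed before the next forward pass
)
```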

### Camera Setup
The model uses a **dual-camera setup** for robust perception (a capture sketch follows the list):

1. **Wrist Camera** (`observation.images.wrist`):
   - Resolution: 240×320 pixels
   - Position: mounted on the robot's wrist
   - Purpose: provides a close-up, detailed view of manipulation tasks
   - Field of view: narrow, focused on the immediate workspace

2. **Top Camera** (`observation.images.top`):
   - Resolution: 480×640 pixels
   - Position: mounted above the workspace
   - Purpose: provides broader context and an overview of the environment
   - Field of view: wide, captures the entire workspace
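
As a rough illustration, frames at these resolutions can be grabbed with OpenCV. The device indices (`0`, `1`) are hypothetical and depend on your machine:

```python
import cv2

# Hypothetical device indices; adjust for your setup.
wrist_cam = cv2.VideoCapture(0)
top_cam = cv2.VideoCapture(1)
wrist_cam.set(cv2.CAP_PROP_FRAME_WIDTH, 320)
wrist_cam.set(cv2.CAP_PROP_FRAME_HEIGHT, 240)
top_cam.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
top_cam.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

ok_w, wrist_bgr = wrist_cam.read()
ok_t, top_bgr = top_cam.read()
assert ok_w and ok_t, "camera read failed"

# OpenCV returns BGR; the policy expects RGB.
wrist_rgb = cv2.cvtColor(wrist_bgr, cv2.COLOR_BGR2RGB)
top_rgb = cv2.cvtColor(top_bgr, cv2.COLOR_BGR2RGB)
```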

### Input/Output Specifications

**Inputs:**
- **Robot State**: 6-dimensional joint positions
  - `shoulder_pan.pos`
  - `shoulder_lift.pos`
  - `elbow_flex.pos`
  - `wrist_flex.pos`
  - `wrist_roll.pos`
  - `gripper.pos`
- **Wrist Camera**: RGB image (240×320×3)
- **Top Camera**: RGB image (480×640×3)

**Outputs:**
- **Actions**: 6-dimensional joint commands (same structure as the state)
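
In practice, LeRobot policies consume batched, channel-first float tensors rather than raw HxWx3 frames. A minimal conversion sketch (the helper name is ours, not part of the library):

```python
import numpy as np
import torch

def to_policy_image(rgb_hwc: np.ndarray) -> torch.Tensor:
    """Convert an HxWx3 uint8 RGB frame into the 1x3xHxW float tensor
    in [0, 1] that LeRobot policies expect; mean/std normalization is
    handled inside the policy itself."""
    chw = torch.from_numpy(rgb_hwc).permute(2, 0, 1).float() / 255.0
    return chw.unsqueeze(0)
```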

## Training Details

### Dataset
- **Source**: `r2owb0/so101-DS1`
- **Episodes**: 10 demonstration episodes
- **Total Frames**: 5,990 frames
- **Frame Rate**: 30 FPS
- **Robot Type**: SO101 follower robot
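
The dataset can be inspected with LeRobot's dataset class. This is a sketch; the import path and attribute names assume a recent lerobot release:

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Downloads from the Hub on first use.
dataset = LeRobotDataset("r2owb0/so101-DS1")
print(dataset.num_episodes, dataset.num_frames, dataset.fps)  # expect 10, 5990, 30
```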

### Training Configuration
- **Training Steps**: 25,000
- **Batch Size**: 4
- **Learning Rate**: 1e-5
- **Optimizer**: AdamW with weight decay 1e-4 (see the sketch below)
- **Validation Split**: 10% of episodes
- **Seed**: 1000
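
A minimal sketch of the optimizer described above, assuming `policy` is the loaded ACT policy:

```python
import torch

# AdamW with the learning rate and weight decay listed above.
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5, weight_decay=1e-4)
```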

### Data Augmentation
The model was trained with image augmentation over the following ranges (a torchvision sketch follows the list):
- Brightness adjustment (0.8-1.2x)
- Contrast adjustment (0.8-1.2x)
- Saturation adjustment (0.5-1.5x)
- Hue adjustment (±0.05)
- Sharpness adjustment (0.5-1.5x)
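
A hedged sketch of these ranges using torchvision's v2 transforms; LeRobot applies equivalent jitter through its own image-transforms configuration:

```python
import torch
from torchvision.transforms import v2
from torchvision.transforms.v2 import functional as F

# Jitter ranges from the list above.
color_jitter = v2.ColorJitter(
    brightness=(0.8, 1.2),
    contrast=(0.8, 1.2),
    saturation=(0.5, 1.5),
    hue=(-0.05, 0.05),
)

def augment(image: torch.Tensor) -> torch.Tensor:
    """Apply color jitter, then a random sharpness factor in [0.5, 1.5]."""
    image = color_jitter(image)
    factor = torch.empty(1).uniform_(0.5, 1.5).item()
    return F.adjust_sharpness(image, factor)
```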

## Usage

### Installation
```bash
pip install lerobot
```

### Loading the Model
```python
from lerobot.policies.act.modeling_act import ACTPolicy

# Load the pretrained policy from the Hugging Face Hub.
policy = ACTPolicy.from_pretrained("r2owb0/act1")
policy.eval()
```

### Evaluation
```bash
lerobot-eval \
    --policy.path=r2owb0/act1 \
    --env.type=your_env_type \
    --eval.n_episodes=10 \
    --eval.batch_size=10
```

### Inference
```python
import torch

# `policy` is the ACTPolicy loaded above.
device = "cuda" if torch.cuda.is_available() else "cpu"
policy.to(device)
policy.reset()  # clear the internal action queue before a new episode

# Batched, channel-first float tensors; image values in [0, 1].
# Zeros stand in for real sensor data here.
observation = {
    "observation.state": torch.zeros(1, 6, device=device),                   # 6-D joint state
    "observation.images.wrist": torch.zeros(1, 3, 240, 320, device=device),  # wrist RGB
    "observation.images.top": torch.zeros(1, 3, 480, 640, device=device),    # top RGB
}

# Get the next action (shape: (1, 6)).
with torch.no_grad():
    action = policy.select_action(observation)
```

## Hardware Requirements

### Robot Setup
- **Robot**: SO101 follower robot
- **Cameras**:
  - Wrist-mounted camera (240×320 resolution)
  - Top-mounted camera (480×640 resolution)
- **Control**: 5-DOF arm plus gripper (6 actuated joints, matching the 6-D state and action)

### Computing Requirements
- **GPU**: CUDA-compatible GPU recommended
- **Memory**: at least 4 GB of GPU memory
- **Storage**: ~200 MB for model weights

## Performance Notes

- The model uses action chunking: it predicts 50 steps ahead but executes 15 steps per inference call (see the sketch below)
- Temporal ensembling is disabled for real-time inference
- The model expects normalized inputs; mean/std normalization is handled by the policy's own statistics
- The VAE objective is enabled for better representation learning
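
A sketch of what this means in a control loop: with `chunk_size=50` and `n_action_steps=15`, the network only runs every 15 calls, and the intermediate actions are served from an internal queue. Here `get_observation()` is a hypothetical stand-in for your own sensor code:

```python
import torch

policy.reset()
for t in range(150):
    observation = get_observation()  # hypothetical: builds the dict shown above
    with torch.no_grad():
        action = policy.select_action(observation)  # network runs at t = 0, 15, 30, ...
    # send `action` to the robot here
```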

## Limitations

- Trained on a specific robot configuration (SO101)
- Requires the exact camera setup described above
- Performance may vary with different lighting conditions
- Limited to the task domain covered in the training dataset

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{r2owb0_act1,
  author = {Robert},
  title = {ACT Model for SO101 Robot},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/r2owb0/act1}
}
```

## License

This model is licensed under the Apache 2.0 License.