Toto-2.0-1B
Toto (Time Series Optimized Transformer for Observability) is a family of time series foundation models for multivariate forecasting developed by Datadog. Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4m to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family.
The family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark.
📊 Performance
Benchmark numbers (CRPS and MASE on BOOM, GIFT-Eval, and TIME) are summarized in the Evaluation results table at the end of this card.
⚡ Quick Start
Inference code is available on GitHub.
Installation
```bash
pip install "toto-2 @ git+https://github.com/DataDog/toto.git#subdirectory=toto2"
```
Inference Example
```python
import torch

from toto2 import Toto2Model

model = Toto2Model.from_pretrained("Datadog/Toto-2.0-1B")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()

# Input shape: (batch, n_variates, time_steps)
target = torch.randn(1, 1, 512, device=device)
target_mask = torch.ones_like(target, dtype=torch.bool)
series_ids = torch.zeros(1, 1, dtype=torch.long, device=device)

# Returns quantiles of shape (9, batch, n_variates, horizon)
# Quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=96,
    decode_block_size=768,
    has_missing_values=False,
)
```
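Because the quantile levels are stacked along the first dimension, point forecasts and prediction intervals can be read out by index. A minimal sketch, using the quantile ordering documented in the comment above:

```python
# Median forecast (0.5 quantile) plus an 80% central interval
# (0.1 and 0.9 quantiles); indices follow the levels listed above.
median = quantiles[4]  # shape: (batch, n_variates, horizon)
lower, upper = quantiles[0], quantiles[8]
```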
For more examples, see the Quick Start notebook and GluonTS integration notebook.
💾 Available Checkpoints
All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.
| Model | Params | Weights (fp32) | Latency | Recommended for |
|---|---|---|---|---|
| Toto‑2.0‑4m | 4m | 16 MB | ~3.8 ms | Edge / CPU deployment; tightest latency or memory budgets. |
| Toto‑2.0‑22m | 22m | 84 MB | ~5.0 ms | Efficient default — matches or beats Toto 1.0 quality with ~7× fewer parameters. |
| Toto‑2.0‑313m | 313m | 1.2 GB | ~15.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
| Toto‑2.0‑1B | 1B | 3.9 GB | ~20.9 ms | Best quality / cost tradeoff for production workloads. |
| Toto‑2.0‑2.5B | 2.5B | 9.1 GB | ~36.2 ms | Highest accuracy; #1 foundation model on every benchmark. |
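Every size shares the same interface, so switching checkpoints should only require changing the repository id. A hedged sketch, assuming the other checkpoints follow the same Datadog/Toto-2.0-&lt;size&gt; naming as this one (see the Toto 2.0 Collection for the exact repo ids):

```python
from toto2 import Toto2Model

# Assumed repo id pattern; check the Toto 2.0 Collection for exact names.
model = Toto2Model.from_pretrained("Datadog/Toto-2.0-313m")
```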
✨ Key Features
- Zero-Shot Forecasting: Forecast without fine-tuning on your specific time series.
- Multi-Variate Support: Efficiently process multiple variables using alternating time/variate attention (see the multivariate sketch after this list).
- Probabilistic Predictions: Generate point forecasts and uncertainty estimates via a quantile output head.
- Decoder-Only Architecture: Support for variable prediction horizons and context lengths.
- u-μP Scaling: A single training recipe transfers cleanly across all five sizes (4m → 2.5B).
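Multivariate forecasting uses the same forecast API as the Quick Start example; only the n_variates dimension of the inputs changes. A minimal sketch, assuming (as in the example above) that all-zero series_ids marks variates belonging to the same series:

```python
import torch

from toto2 import Toto2Model

model = Toto2Model.from_pretrained("Datadog/Toto-2.0-1B").eval()

# Three variates observed over the same 512 steps:
# shape (batch, n_variates, time_steps).
target = torch.randn(1, 3, 512)
target_mask = torch.ones_like(target, dtype=torch.bool)
series_ids = torch.zeros(1, 3, dtype=torch.long)  # one series, three variates

# Output shape (9, batch, n_variates, horizon): one forecast per variate.
quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=96,
    decode_block_size=768,
    has_missing_values=False,
)
```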
🏗️ Architecture
Toto 2.0 is a decoder-only transformer that alternates time-wise and variate-wise attention and emits forecasts through a quantile output head; u-μP lets a single training recipe scale across all five sizes.
🔗 Additional Resources
- Technical Report — (coming soon)
- Blog Post
- GitHub Repository
- Toto 2.0 Collection — all five base checkpoints
- BOOM Dataset — Datadog's observability time-series benchmark
- Toto 1.0 Weights
📖 Citation
(citation coming soon)
Evaluation results

| Benchmark | CRPS | MASE |
|---|---|---|
| BOOM | 0.349 | 0.582 |
| GIFT-Eval | 0.478 | 0.699 |
| TIME | 0.537 | 0.643 |