Model Card for Meow-Omni 1

Meow-Omni 1 is the world’s first Multimodal Large Language Model (MLLM) specifically engineered for Computational Ethology. It natively co-embeds four distinct modalities—Text, Video, Audio, and Biological Time-Series—to decode the latent intentions of non-verbal species.

🐾 Model Summary

Meow-Omni 1 is the fine-tuned, intent-aligned version of the Meow-Omni 1-Base architecture. Trained on the Meow-10K dataset with a novel Next-Behaviour Prediction (NBP) objective, the model moves beyond simple action recognition to resolve "semantic aliasing": distinguishing, for example, contentment-purring from distress-purring by correlating vocalizations with internal physiological markers (ECG/EEG). A toy sketch of the NBP idea follows the list below.

  • Fine-tuned from: Meow-Omni 1-Base
  • Primary Task: Feline Intention Decoding and Behavioural Interpretation
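
To make the NBP idea concrete, here is a purely illustrative toy sketch, not the released training code: it assumes behaviours are discretised into a token vocabulary and trained with a standard shift-by-one causal cross-entropy. BEHAVIOUR_VOCAB and nbp_loss are hypothetical names introduced for this example.

import torch
import torch.nn.functional as F

BEHAVIOUR_VOCAB = 32  # hypothetical size of a discretised behaviour vocabulary

def nbp_loss(logits: torch.Tensor, behaviour_ids: torch.Tensor) -> torch.Tensor:
    # Predict behaviour token t+1 from everything up to t (causal shift)
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = behaviour_ids[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels)

# Smoke test on random tensors: (batch=2, seq=10) behaviour sequences
logits = torch.randn(2, 10, BEHAVIOUR_VOCAB)
labels = torch.randint(0, BEHAVIOUR_VOCAB, (2, 10))
print(nbp_loss(logits, labels).item())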

🚀 Key Features

  • Quad-Modal Reasoning: Simultaneously processes textual instructions, visual cues, acoustic signals, and high-frequency biometrics within a single transformer context.
  • Explainable Ethology: Unlike black-box classifiers, Meow-Omni 1 can articulate the causal relationship between a physiological spike and a behavioural display in natural language.
  • Uncertainty Quantification: Built-in predictive entropy lets the model flag ambiguous or contradictory signals (e.g., when biometrics contradict visual cues), supporting safer clinical use; a minimal entropy sketch follows this list.
  • Lightweight Deployment: Engineered with minimal dependencies to ensure reproducibility and accessibility for researchers in wildlife conservation.
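
A minimal sketch of entropy-based flagging, assuming the per-step scores that Hugging Face's generate returns with output_scores=True and return_dict_in_generate=True; the 1.5-nat threshold is an uncalibrated placeholder, not a value from this model card.

import torch

def mean_predictive_entropy(scores) -> float:
    # scores: tuple of (batch, vocab) logit tensors, one per generated step
    entropies = []
    for step_logits in scores:
        probs = torch.softmax(step_logits.float(), dim=-1)
        entropies.append(-(probs * probs.clamp_min(1e-12).log()).sum(dim=-1))
    return torch.stack(entropies).mean().item()

# Example: flag a generation whose average token entropy exceeds a threshold
# out = model.generate(**inputs, output_scores=True, return_dict_in_generate=True)
# if mean_predictive_entropy(out.scores) > 1.5:  # placeholder threshold (nats)
#     print("⚠️ Ambiguous signals; interpret with caution.")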

📈 Performance: MeowBench

Meow-Omni 1 was evaluated on the MeowBench MCQ suite (586 expert-verified samples) and achieved state-of-the-art results. A detailed leaderboard is coming soon.

🛠️ How to Use

Meow-Omni 1 accepts four inputs:

  1. Video: Behavioural context.
  2. Audio: Vocalization patterns.
  3. Time-Series: Biometric signals (ECG/EEG/IMU), injected via custom control tokens.
  4. Text: Instructions or questions regarding the animal's state.

The snippet below runs end-to-end inference over all four modalities:

import torch
import soundfile as sf
import numpy as np
from PIL import Image
from decord import VideoReader, cpu
from modeling_meow_omni_1 import MeowOmni1ForCausalLM
from processing_meow_omni_1 import MeowOmni1Processor

# 1. Setup Model and Processor
model_path = "smgjch/Meow-Omni-1"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = MeowOmni1Processor.from_pretrained(model_path, trust_remote_code=True)
model = MeowOmni1ForCausalLM.from_pretrained(
    model_path, 
    trust_remote_code=True, 
    torch_dtype=torch.bfloat16
).to(device).eval()

# 2. Prepare Modality Inputs
video_path = "sample_cat_video.mp4"
audio_path = "sample_cat_purr.wav"
ts_path = "sample_biometrics.json"

# Process Video (16 frames)
vr = VideoReader(video_path, ctx=cpu(0))
indices = np.linspace(0, len(vr) - 1, 16, dtype=int)
frames = [Image.fromarray(f).convert("RGB") for f in vr.get_batch(indices).asnumpy()]

# Process Audio (assumes a mono WAV; down-mix stereo recordings before use)
audio_arr, _ = sf.read(audio_path)
# Keep at most 480,000 samples (e.g., 30 s at a 16 kHz sampling rate)
audios = [audio_arr[:480000].astype(np.float32)]

# 3. Construct Prompt with Modal Placeholders
# Note: Placeholders MUST match the number of input items (e.g., 16 image tags for 16 frames)
placeholders = (
    "".join(["<image>./</image>"] * len(frames)) +  # Video frames
    "<audio>./</audio>" +                          # Audio stream
    "<|ts_start|><|ts_unit|><|ts_end|>"            # Time-series block
)

raw_query = "Analyze the provided multi-modal data. What is this cat's intention?"
prompt = f"User: {placeholders}\n{raw_query}\nAssistant:"

# 4. Run Inference
inputs = processor(
    text=[prompt],
    images=frames,
    audios=audios,
    time_series_paths=[ts_path],
    time_series_sampling_rates=[100.0],  # sampling rate of the biometric stream, in Hz
    return_tensors="pt"
).to(device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.95
    )

response = processor.tokenizer.decode(output[0], skip_special_tokens=True)
print(f"\n🔍 Meow-Omni 1 Analysis:\n{response}")
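
The sample media files referenced above are not bundled with the repository. To smoke-test the pipeline without real recordings, you can synthesise placeholder audio and biometrics as below; the JSON schema used here (a sampling rate plus a flat list of values) is an assumption for illustration only, so check the processor source for the actual expected format. A short MP4 clip still has to be supplied for the video path.

import json
import numpy as np
import soundfile as sf

# 10 s of a 220 Hz tone standing in for a purr recording (16 kHz, mono)
sr = 16000
t = np.linspace(0, 10, 10 * sr, endpoint=False)
sf.write("sample_cat_purr.wav",
         (0.1 * np.sin(2 * np.pi * 220 * t)).astype(np.float32), sr)

# Hypothetical biometric time-series: 60 s of synthetic samples at 100 Hz.
# The keys below are illustrative, not a documented schema.
biometrics = {"sampling_rate": 100.0, "values": np.random.randn(6000).tolist()}
with open("sample_biometrics.json", "w") as f:
    json.dump(biometrics, f)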

🔗 The Meow-Omni Ecosystem

  • Base Model: Meow-Omni 1-Base — The raw architectural foundation.
  • Training Dataset: Meow-10K — The synchronized 10k sample training corpus.
  • Evaluation Benchmark: MeowBench — The expert-verified quad-modal benchmark suite.

📝 Citation

Coming Soon.
