Image resizing for vision encoder ONNX export

#75
by Jrd100 - opened

Hi,

We're trying to export Moondream's vision encoder to ONNX but running into a shape mismatch in the patch_emb layer:

```
RuntimeError: mat1 and mat2 shapes cannot be multiplied (672x224 and 588x1152)
```
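For context, here is the patch math we're assuming, inferred only from the 588x1152 weight shape in the error (so the 14-pixel patch size is our guess, not something we've confirmed against the model code):

```python
# Assumed from the 588x1152 patch_emb weight: 588 = 3 * 14 * 14.
channels, patch, embed_dim = 3, 14, 1152

for side in (224, 378, 384):
    n_patches = (side // patch) ** 2       # patches per side, squared
    flat_dim = channels * patch * patch    # flattened dim fed to patch_emb
    print(f"{side}px -> {n_patches} patches of dim {flat_dim}")
```

If that math is right, the flattened dim is 588 regardless of the resize (and 384 isn't even a multiple of 14), which makes us suspect our patchify/flatten step rather than the resize itself.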

We also tried (378, 378) and (384, 384) and got a similar error.

Code snippet:

```python
from torchvision import transforms
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```
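
For reference, this is the patchify step we believe patch_emb expects upstream of the linear layer; a sketch under our 14x14-patch, channel-first-flatten assumptions, not confirmed against the model code:

```python
import torch

# Hypothetical patchify: (B, 3, H, W) -> (B, N, 588), assuming 14x14 patches
# flattened channel-first. 378x378 would give 27*27 = 729 patches of dim 588.
def patchify(x: torch.Tensor, patch: int = 14) -> torch.Tensor:
    b, c, h, w = x.shape
    x = x.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)
    return x

dummy = torch.randn(1, 3, 378, 378)
print(patchify(dummy).shape)  # torch.Size([1, 729, 588]) if our assumptions hold
```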

Can you please clarify:

- The correct input image size for the vision encoder?
- The patch size used?
- The expected flattened patch dimension for patch_emb?
- Any required preprocessing steps we might be missing?

This will help us align our preprocessing with the model's architecture.
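
For completeness, this is roughly how we're invoking the export; the vision_encoder attribute name and the 378x378 input size are our assumptions, not confirmed:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("vikhyatk/moondream2", trust_remote_code=True)
vision_encoder = model.vision_encoder  # attribute name is our assumption
dummy = torch.randn(1, 3, 378, 378)    # size follows the patch math above

torch.onnx.export(
    vision_encoder,
    dummy,
    "moondream_vision.onnx",
    input_names=["pixel_values"],
    output_names=["image_embeds"],
    opset_version=17,
)
```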

Thanks
