Bounding box coordinates

#13
by ljoana - opened

What rescaling is needed so that the bbox coordinates match the original image? I am seeing some mismatches but can't figure out what the issue is.

The bbox_2d coordinates are x1, y1, x2, y2 rather than x, y, w, h. If you resize the image, they are relative to the resized image size, not the original. For example:

from PIL import Image

image = Image.open(image_path)
img_width, img_height = image.size
max_size = 1280
if max(image.size) > max_size:
    # scale the longest side down to max_size, keeping the aspect ratio
    ratio = max_size / max(image.size)
    new_size = tuple(int(dim * ratio) for dim in image.size)
    # set each dimension to be a multiple of 28
    new_size = tuple(int(dim // 28) * 28 for dim in new_size)
    image = image.resize(new_size, Image.LANCZOS)
    img_width, img_height = image.size

Then, in the messages:

{
    "role": "user",
    "content": [
        {
            "type": "image",
            "image": f"file://{image_path}",
            "resized_width": img_width,
            "resized_height": img_height,
        },

.....
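To make the rescaling concrete: if the model saw the resized image, each returned coordinate has to be multiplied by original size / resized size, separately for x and y. The following is only a minimal sketch, assuming the box is given in pixel coordinates of the resized image. `rescale_bbox` and `xyxy_to_xywh` are hypothetical helper names, not part of any library.

```python
def rescale_bbox(bbox, resized_size, original_size):
    """Map [x1, y1, x2, y2] from resized-image pixels to original-image pixels."""
    x1, y1, x2, y2 = bbox
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    # x and y get separate factors: rounding each side down to a multiple
    # of 28 can change the aspect ratio slightly.
    scale_x = original_w / resized_w
    scale_y = original_h / resized_h
    return [round(x1 * scale_x), round(y1 * scale_y),
            round(x2 * scale_x), round(y2 * scale_y)]


def xyxy_to_xywh(bbox):
    """Convert [x1, y1, x2, y2] to [x, y, width, height] for drawing APIs."""
    x1, y1, x2, y2 = bbox
    return [x1, y1, x2 - x1, y2 - y1]
```

For example, a box returned on a 1000x500 resized image maps back to a 2000x1000 original by doubling every coordinate. If the drawing API expects x, y, width, height, convert after rescaling.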

I cannot manage to get the coordinates right ... Please help!

I have an image with traffic signs and would like to detect the stop/bus sign.
The original image is 1920x1080 pixels. With Max_Pixel 1280 I scale it down
to an image size of 1260x700 (multiples of 28 pixels, below 1280; X: 45x28, Y: 25x28).
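These numbers can be checked with a short sketch that applies the same rule as the resizing snippet earlier in the thread (cap the longest side at 1280, then round each side down to a multiple of 28). `resized_dims` is a hypothetical helper name, and this only mirrors that snippet; it says nothing about any resizing the serving backend may do on its own.

```python
def resized_dims(width, height, max_size=1280, block=28):
    """Cap the longest side at max_size, then floor each side to a multiple of block."""
    if max(width, height) > max_size:
        ratio = max_size / max(width, height)
        width, height = int(width * ratio), int(height * ratio)
        width, height = (width // block) * block, (height // block) * block
    return width, height


new_w, new_h = resized_dims(1920, 1080)
print(new_w, new_h)                # 1260 700
print(1920 / new_w, 1080 / new_h)  # per-axis factors back to the original
```

Note that 1260x700 is not exactly 16:9, so the x and y factors back to 1920x1080 differ slightly. Boxes drawn on the original image need both factors applied, not a single shared one.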

The prompt for the 7B model is: "Locate the Stop-sign and return the location in the form of coordinates in the format {'bbox_2d': [x1, y1, x2, y2]}."
I run the detection on the scaled image, sent as Base64.

The result in x always looks good, but y is offset (and I wonder why it is too small for the stop sign and too large for the bus sign ...).
[Attached: two screenshots of the annotated results]

I use Ollama for this model, so the complete Ollama call is:

------------------------------JSON-----------------------------------
[
    {
        "role": "system",
        "content": "You are a knowledgeable, efficient, and direct AI assistant. \r\nProvide concise answers, focusing on the key information needed. \r\nOffer suggestions tactfully when appropriate to improve outcomes. \r\nEngage in productive collaboration with the user."
    },
    {
        "role": "user",
        "content": "Locate the Bus-sign and return the location in the form of coordinates in the format {'bbox_2d': [x1, y1, x2, y2]}.",
        "Images": [
            "iVBORw0KGgoAAAANSUhEUgAABOw ... ly4cOHChQsXLlwMG7gk1oULFy5cuHDhwoULFy5cDBM4zv8HKdhcabzmo1oAAAAASUVORK5CYII="
        ]
    }
]

The annotation code is:

            For Each item In items
                Dim X1 As Integer = CInt(item("bbox_2d")(0))
                Dim Y1 As Integer = CInt(item("bbox_2d")(1))
                Dim X2 As Integer = CInt(item("bbox_2d")(2))
                Dim Y2 As Integer = CInt(item("bbox_2d")(3))
                BMP2.Draw(New Rectangle(X1, Y1, X2 - X1, Y2 - Y1), New Bgra(0, 0, 255, 255), 2)
            Next

It could also be a problem with Ollama, because there is no option (at least I have not found one) to set
"resized_width": img_width,
"resized_height": img_height,

Do you have any suggestions for how to get y to the correct position?

Hi @Phreak87, I struggled with that as well. I attempted to explain it in this Medium post: https://medium.com/@levchevajoana/qwen2-5-vl-with-mlx-vlm-c4329b40ab87 . If you have any questions, I'll try to help.

Thank you so much for your help!

Grounding does not seem to work correctly in the 7B variants. With the
3B-parameter model it worked on the first try (it detected and annotated the horse sign):

[Attached: screenshot of the annotated result]
