Pretrained model

#3
by Stefanvdp - opened

Thank you for adding EfficientLoFTR to Hugging Face! I was wondering whether the pretrained weights for eloftr_outdoor (which are on the GitHub repo) can also be used with the model on the Hugging Face repo?

Thanks in advance for your reply!

Zhejiang University org

Hi @Stefanvdp
zju-community/efficientloftr and the pretrained eloftr_outdoor from the original repo are identical; only the names of the layers have changed to match the EfficientLoFTR implementation in the Transformers library.
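
For reference, a snippet along these lines reproduces the layer printout below (it uses the same Auto class shown later in this thread):

from transformers import AutoModelForKeypointMatching

# Load the converted checkpoint from the Hub and print its module structure
model = AutoModelForKeypointMatching.from_pretrained("zju-community/efficientloftr")
print(model)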

This is the model from Hugging Face:
EfficientLoFTRForKeypointMatching(
  (efficientloftr): EfficientLoFTRModel(
    (backbone): EfficientLoFTRepVGG(
      (stages): ModuleList(
        (0): EfficientLoFTRRepVGGStage(
          (blocks): ModuleList(
            (0): EfficientLoFTRRepVGGBlock(
              (conv1): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(1, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
                (norm): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (conv2): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(1, 64, kernel_size=(1, 1), stride=(2, 2), bias=False)
                (norm): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (activation): ReLU()
            )
          )
        )
        (1): EfficientLoFTRRepVGGStage(
          (blocks): ModuleList(
            (0-1): 2 x EfficientLoFTRRepVGGBlock(
              (conv1): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                (norm): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (conv2): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
                (norm): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (identity): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (activation): ReLU()
            )
          )
        )
        (2): EfficientLoFTRRepVGGStage(
          (blocks): ModuleList(
            (0): EfficientLoFTRRepVGGBlock(
              (conv1): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
                (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (conv2): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
                (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (activation): ReLU()
            )
            (1-3): 3 x EfficientLoFTRRepVGGBlock(
              (conv1): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (conv2): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
                (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (identity): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (activation): ReLU()
            )
          )
        )
        (3): EfficientLoFTRRepVGGStage(
          (blocks): ModuleList(
            (0): EfficientLoFTRRepVGGBlock(
              (conv1): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
                (norm): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (conv2): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
                (norm): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (activation): ReLU()
            )
            (1-13): 13 x EfficientLoFTRRepVGGBlock(
              (conv1): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                (norm): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (conv2): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
                (norm): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (identity): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (activation): ReLU()
            )
          )
        )
      )
    )
    (local_feature_transformer): EfficientLoFTRLocalFeatureTransformer(
      (layers): ModuleList(
        (0-3): 4 x EfficientLoFTRLocalFeatureTransformerLayer(
          (self_attention): EfficientLoFTRAggregatedAttention(
            (aggregation): EfficientLoFTRAggregationLayer(
              (q_aggregation): Conv2d(256, 256, kernel_size=(4, 4), stride=(4, 4), groups=256, bias=False)
              (kv_aggregation): MaxPool2d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
              (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            )
            (attention): EfficientLoFTRAttention(
              (q_proj): Linear(in_features=256, out_features=256, bias=False)
              (k_proj): Linear(in_features=256, out_features=256, bias=False)
              (v_proj): Linear(in_features=256, out_features=256, bias=False)
              (o_proj): Linear(in_features=256, out_features=256, bias=False)
            )
            (mlp): EfficientLoFTRMLP(
              (fc1): Linear(in_features=512, out_features=512, bias=False)
              (activation): LeakyReLU(negative_slope=0.01)
              (fc2): Linear(in_features=512, out_features=256, bias=False)
              (layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            )
          )
          (cross_attention): EfficientLoFTRAggregatedAttention(
            (aggregation): EfficientLoFTRAggregationLayer(
              (q_aggregation): Conv2d(256, 256, kernel_size=(4, 4), stride=(4, 4), groups=256, bias=False)
              (kv_aggregation): MaxPool2d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
              (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            )
            (attention): EfficientLoFTRAttention(
              (q_proj): Linear(in_features=256, out_features=256, bias=False)
              (k_proj): Linear(in_features=256, out_features=256, bias=False)
              (v_proj): Linear(in_features=256, out_features=256, bias=False)
              (o_proj): Linear(in_features=256, out_features=256, bias=False)
            )
            (mlp): EfficientLoFTRMLP(
              (fc1): Linear(in_features=512, out_features=512, bias=False)
              (activation): LeakyReLU(negative_slope=0.01)
              (fc2): Linear(in_features=512, out_features=256, bias=False)
              (layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            )
          )
        )
      )
    )
    (rotary_emb): EfficientLoFTRRotaryEmbedding()
  )
  (refinement_layer): EfficientLoFTRFineFusionLayer(
    (out_conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (out_conv_layers): ModuleList(
      (0): EfficientLoFTROutConvBlock(
        (out_conv1): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (out_conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (batch_norm): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activation): LeakyReLU(negative_slope=0.01)
        (out_conv3): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      )
      (1): EfficientLoFTROutConvBlock(
        (out_conv1): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (out_conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (batch_norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activation): LeakyReLU(negative_slope=0.01)
        (out_conv3): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      )
    )
  )
)

And this is the one from the GitHub repo:
LoFTR(
  (backbone): RepVGG_8_1_align(
    (layer0): RepVGGBlock(
      (nonlinearity): ReLU()
      (se): Identity()
      (rbr_reparam): Conv2d(1, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    )
    (layer1): ModuleList(
      (0-1): 2 x RepVGGBlock(
        (nonlinearity): ReLU()
        (se): Identity()
        (rbr_reparam): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
    )
    (layer2): ModuleList(
      (0): RepVGGBlock(
        (nonlinearity): ReLU()
        (se): Identity()
        (rbr_reparam): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      )
      (1-3): 3 x RepVGGBlock(
        (nonlinearity): ReLU()
        (se): Identity()
        (rbr_reparam): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
    )
    (layer3): ModuleList(
      (0): RepVGGBlock(
        (nonlinearity): ReLU()
        (se): Identity()
        (rbr_reparam): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      )
      (1-13): 13 x RepVGGBlock(
        (nonlinearity): ReLU()
        (se): Identity()
        (rbr_reparam): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
    )
  )
  (loftr_coarse): LocalFeatureTransformer(
    (layers): ModuleList(
      (0-7): 8 x AG_RoPE_EncoderLayer(
        (aggregate): Conv2d(256, 256, kernel_size=(4, 4), stride=(4, 4), groups=256, bias=False)
        (max_pool): MaxPool2d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
        (rope_pos_enc): RoPEPositionEncodingSine()
        (q_proj): Linear(in_features=256, out_features=256, bias=False)
        (k_proj): Linear(in_features=256, out_features=256, bias=False)
        (v_proj): Linear(in_features=256, out_features=256, bias=False)
        (attention): Attention()
        (merge): Linear(in_features=256, out_features=256, bias=False)
        (mlp): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=False)
          (1): LeakyReLU(negative_slope=0.01, inplace=True)
          (2): Linear(in_features=512, out_features=256, bias=False)
        )
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (coarse_matching): CoarseMatching()
  (fine_preprocess): FinePreprocess(
    (layer3_outconv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (layer2_outconv): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (layer2_outconv2): Sequential(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): LeakyReLU(negative_slope=0.01)
      (3): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    )
    (layer1_outconv): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (layer1_outconv2): Sequential(
      (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): LeakyReLU(negative_slope=0.01)
      (3): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    )
  )
  (fine_matching): FineMatching()
)

I cannot seem to load the state_dict (eloftr_outdoor.ckpt) into the Hugging Face model, though.

One other question related to the Hugging Face implementation: is it still possible to use a non-fixed input size for the network (e.g. multiples of 32) instead of 640x480?

Zhejiang University org

You don't need to use the eloftr_outdoor.ckpt file; the following code snippet already loads the weights:

model = AutoModelForKeypointMatching.from_pretrained("zju-community/efficientloftr")
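
If it helps, here is a minimal matching sketch built around that snippet; the image paths are placeholders, and the post-processing call is assumed to follow the same keypoint-matching API as the other matchers in Transformers:

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForKeypointMatching

processor = AutoImageProcessor.from_pretrained("zju-community/efficientloftr")
model = AutoModelForKeypointMatching.from_pretrained("zju-community/efficientloftr")

# Placeholder paths for the image pair to match
images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
inputs = processor(images, return_tensors="pt")
with torch.inference_mode():
    outputs = model(**inputs)

# Assumed post-processing, mirroring the keypoint-matching API used by other models in the library
image_sizes = [[(image.height, image.width) for image in images]]
matches = processor.post_process_keypoint_matching(outputs, image_sizes, threshold=0.2)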

For the size of the image, you can use:

processor = AutoImageProcessor.from_pretrained("zju-community/efficientloftr", size=(480, 320))
model = AutoModelForKeypointMatching.from_pretrained("zju-community/efficientloftr", embedding_size=(10, 15))
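
For context, the embedding size appears to be each spatial dimension of the processed image divided by 32 (the backbone's 1/8 coarse stride times the 4x pooling in the aggregated attention visible in the printout above): 480 // 32 = 15 and 320 // 32 = 10, which matches embedding_size=(10, 15) here. Treat that factor of 32 as an assumption inferred from these numbers rather than a documented rule.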

Thanks for the info!

Stefanvdp changed discussion status to closed
Zhejiang University org

@Stefanvdp for your information, EfficientLoFTR will be usable with different image sizes without having to change the embedding size as mentioned above (refer to this PR).
