Pretrained model
Thank you for adding EfficientLoFTR to Hugging Face. I was wondering whether the pretrained weights for eloftr_outdoor (available in the GitHub repo) can also be used with the model in the Hugging Face repo?
Thanks in advance for your reply!
Hi @Stefanvdp,
zju-community/efficientloftr and the pretrained eloftr_outdoor from the original repo are identical; only the names of the layers have changed to match the EfficientLoFTR implementation in the Transformers library.
This is the model from Hugging Face:
EfficientLoFTRForKeypointMatching(
  (efficientloftr): EfficientLoFTRModel(
    (backbone): EfficientLoFTRepVGG(
      (stages): ModuleList(
        (0): EfficientLoFTRRepVGGStage(
          (blocks): ModuleList(
            (0): EfficientLoFTRRepVGGBlock(
              (conv1): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(1, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
                (norm): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (conv2): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(1, 64, kernel_size=(1, 1), stride=(2, 2), bias=False)
                (norm): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (activation): ReLU()
            )
          )
        )
        (1): EfficientLoFTRRepVGGStage(
          (blocks): ModuleList(
            (0-1): 2 x EfficientLoFTRRepVGGBlock(
              (conv1): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                (norm): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (conv2): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
                (norm): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (identity): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (activation): ReLU()
            )
          )
        )
        (2): EfficientLoFTRRepVGGStage(
          (blocks): ModuleList(
            (0): EfficientLoFTRRepVGGBlock(
              (conv1): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
                (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (conv2): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
                (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (activation): ReLU()
            )
            (1-3): 3 x EfficientLoFTRRepVGGBlock(
              (conv1): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (conv2): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
                (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (identity): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (activation): ReLU()
            )
          )
        )
        (3): EfficientLoFTRRepVGGStage(
          (blocks): ModuleList(
            (0): EfficientLoFTRRepVGGBlock(
              (conv1): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
                (norm): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (conv2): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
                (norm): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (activation): ReLU()
            )
            (1-13): 13 x EfficientLoFTRRepVGGBlock(
              (conv1): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                (norm): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (conv2): EfficientLoFTRConvNormLayer(
                (conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
                (norm): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (activation): Identity()
              )
              (identity): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (activation): ReLU()
            )
          )
        )
      )
    )
    (local_feature_transformer): EfficientLoFTRLocalFeatureTransformer(
      (layers): ModuleList(
        (0-3): 4 x EfficientLoFTRLocalFeatureTransformerLayer(
          (self_attention): EfficientLoFTRAggregatedAttention(
            (aggregation): EfficientLoFTRAggregationLayer(
              (q_aggregation): Conv2d(256, 256, kernel_size=(4, 4), stride=(4, 4), groups=256, bias=False)
              (kv_aggregation): MaxPool2d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
              (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            )
            (attention): EfficientLoFTRAttention(
              (q_proj): Linear(in_features=256, out_features=256, bias=False)
              (k_proj): Linear(in_features=256, out_features=256, bias=False)
              (v_proj): Linear(in_features=256, out_features=256, bias=False)
              (o_proj): Linear(in_features=256, out_features=256, bias=False)
            )
            (mlp): EfficientLoFTRMLP(
              (fc1): Linear(in_features=512, out_features=512, bias=False)
              (activation): LeakyReLU(negative_slope=0.01)
              (fc2): Linear(in_features=512, out_features=256, bias=False)
              (layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            )
          )
          (cross_attention): EfficientLoFTRAggregatedAttention(
            (aggregation): EfficientLoFTRAggregationLayer(
              (q_aggregation): Conv2d(256, 256, kernel_size=(4, 4), stride=(4, 4), groups=256, bias=False)
              (kv_aggregation): MaxPool2d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
              (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            )
            (attention): EfficientLoFTRAttention(
              (q_proj): Linear(in_features=256, out_features=256, bias=False)
              (k_proj): Linear(in_features=256, out_features=256, bias=False)
              (v_proj): Linear(in_features=256, out_features=256, bias=False)
              (o_proj): Linear(in_features=256, out_features=256, bias=False)
            )
            (mlp): EfficientLoFTRMLP(
              (fc1): Linear(in_features=512, out_features=512, bias=False)
              (activation): LeakyReLU(negative_slope=0.01)
              (fc2): Linear(in_features=512, out_features=256, bias=False)
              (layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            )
          )
        )
      )
    )
    (rotary_emb): EfficientLoFTRRotaryEmbedding()
  )
  (refinement_layer): EfficientLoFTRFineFusionLayer(
    (out_conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (out_conv_layers): ModuleList(
      (0): EfficientLoFTROutConvBlock(
        (out_conv1): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (out_conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (batch_norm): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activation): LeakyReLU(negative_slope=0.01)
        (out_conv3): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      )
      (1): EfficientLoFTROutConvBlock(
        (out_conv1): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (out_conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (batch_norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activation): LeakyReLU(negative_slope=0.01)
        (out_conv3): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      )
    )
  )
)
And this is the one from the GitHub repo:
LoFTR(
  (backbone): RepVGG_8_1_align(
    (layer0): RepVGGBlock(
      (nonlinearity): ReLU()
      (se): Identity()
      (rbr_reparam): Conv2d(1, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    )
    (layer1): ModuleList(
      (0-1): 2 x RepVGGBlock(
        (nonlinearity): ReLU()
        (se): Identity()
        (rbr_reparam): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
    )
    (layer2): ModuleList(
      (0): RepVGGBlock(
        (nonlinearity): ReLU()
        (se): Identity()
        (rbr_reparam): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      )
      (1-3): 3 x RepVGGBlock(
        (nonlinearity): ReLU()
        (se): Identity()
        (rbr_reparam): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
    )
    (layer3): ModuleList(
      (0): RepVGGBlock(
        (nonlinearity): ReLU()
        (se): Identity()
        (rbr_reparam): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      )
      (1-13): 13 x RepVGGBlock(
        (nonlinearity): ReLU()
        (se): Identity()
        (rbr_reparam): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
    )
  )
  (loftr_coarse): LocalFeatureTransformer(
    (layers): ModuleList(
      (0-7): 8 x AG_RoPE_EncoderLayer(
        (aggregate): Conv2d(256, 256, kernel_size=(4, 4), stride=(4, 4), groups=256, bias=False)
        (max_pool): MaxPool2d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
        (rope_pos_enc): RoPEPositionEncodingSine()
        (q_proj): Linear(in_features=256, out_features=256, bias=False)
        (k_proj): Linear(in_features=256, out_features=256, bias=False)
        (v_proj): Linear(in_features=256, out_features=256, bias=False)
        (attention): Attention()
        (merge): Linear(in_features=256, out_features=256, bias=False)
        (mlp): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=False)
          (1): LeakyReLU(negative_slope=0.01, inplace=True)
          (2): Linear(in_features=512, out_features=256, bias=False)
        )
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (coarse_matching): CoarseMatching()
  (fine_preprocess): FinePreprocess(
    (layer3_outconv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (layer2_outconv): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (layer2_outconv2): Sequential(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): LeakyReLU(negative_slope=0.01)
      (3): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    )
    (layer1_outconv): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (layer1_outconv2): Sequential(
      (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): LeakyReLU(negative_slope=0.01)
      (3): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    )
  )
  (fine_matching): FineMatching()
)
I cannot seem to load the state_dict (eloftr_outdoor.ckpt) into the Hugging Face model.
One other question related to the Hugging Face implementation: is it still possible to use a non-fixed input size (e.g. multiples of 32) instead of 640x480?
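For anyone hitting the same state_dict problem, a quick way to see why a checkpoint does not load is to compare the parameter names on both sides. The sketch below is only an illustration: `diff_state_dict_keys` is a hypothetical helper, and the key names are derived from the two printouts above, so the keys actually stored in eloftr_outdoor.ckpt may look different. PyTorch's `load_state_dict(strict=False)` reports the same two lists as `missing_keys` and `unexpected_keys`.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, both sorted.

    missing:    keys the model expects but the checkpoint does not provide
    unexpected: keys the checkpoint provides but the model does not know
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Illustrative key names only, built from the module names in the printouts;
# the real checkpoint may use other names or prefixes.
hf_keys = [
    "efficientloftr.backbone.stages.0.blocks.0.conv1.conv.weight",
    "refinement_layer.out_conv.weight",
]
original_keys = [
    "backbone.layer0.rbr_reparam.weight",
    "fine_preprocess.layer3_outconv.weight",
]

missing, unexpected = diff_state_dict_keys(hf_keys, original_keys)
print("missing from checkpoint:", missing)
print("unexpected in checkpoint:", unexpected)
```

With real models, the two lists would come from `model.state_dict().keys()` and from the keys of the loaded checkpoint. If every key lands in one of the two lists, the names simply do not line up and a rename step is needed before loading.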
You don't need to use the eloftr_outdoor.ckpt file; the following code snippet already loads the weights:
from transformers import AutoModelForKeypointMatching
model = AutoModelForKeypointMatching.from_pretrained("zju-community/efficientloftr")
For the image size, you can use:
from transformers import AutoImageProcessor, AutoModelForKeypointMatching
processor = AutoImageProcessor.from_pretrained("zju-community/efficientloftr", size=(480, 320))
model = AutoModelForKeypointMatching.from_pretrained("zju-community/efficientloftr", embedding_size=(10, 15))
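The numbers in that snippet appear to follow from the strides in the printed model: the first block of backbone stages 0, 2 and 3 uses stride 2 (stage 1 uses stride 1), and the aggregation layers use stride 4, which gives an overall factor of 32. Below is a minimal sketch of that arithmetic, assuming `embedding_size` corresponds to the feature grid after aggregation; `coarse_grid` is a hypothetical helper, not part of Transformers, and the order of the two entries should be checked against the model config.

```python
from math import prod

# Strides read off the printed Hugging Face model (first block of each stage).
BACKBONE_STAGE_STRIDES = (2, 1, 2, 2)
# Stride of q_aggregation / kv_aggregation in the attention layers.
AGGREGATION_STRIDE = 4


def total_downsampling():
    """Overall factor between the input image and the aggregated feature grid."""
    return prod(BACKBONE_STAGE_STRIDES) * AGGREGATION_STRIDE


def coarse_grid(height, width):
    """Grid size after backbone and aggregation for an input of height x width.

    Raises ValueError if a side is not a multiple of the downsampling factor.
    """
    factor = total_downsampling()
    if height % factor or width % factor:
        raise ValueError(f"both sides must be multiples of {factor}")
    return height // factor, width // factor


print(total_downsampling())    # 32
print(coarse_grid(480, 640))   # (15, 20)
print(coarse_grid(320, 480))   # (10, 15)
```

This is also why multiples of 32 are the natural choice for the input size: any other size leaves a remainder at the aggregation step.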
Thanks for the info!
@Stefanvdp for your information, EfficientLoFTR will be usable with different image sizes without having to change the embedding size as mentioned above (refer to this PR).