Infrared Defect Detection in Solar Panels via Multi-Path Downsampling and Multi-Scale Feature Fusion

With the ongoing depletion of non-renewable energy sources, the development and utilization of new energy have become a crucial direction in the global energy strategy. Solar photovoltaic (PV) power generation systems, with their advantages of wide resource distribution, cleanliness, and sustainability, show broad development prospects in the new energy sector. However, during long-term operation, solar panels are prone to surface defects, which significantly reduce photoelectric conversion efficiency. These defects typically manifest as abnormal high-temperature areas in infrared thermography. In recent years, with the rapid development of unmanned aerial vehicle (UAV) inspection and infrared thermal imaging technology, image processing-based methods for defect detection in solar panels have become a research hotspot. As PV power stations are often deployed in harsh environments and exposed to complex climatic conditions, the panels are susceptible to damage, seriously affecting the safe and stable operation of the power station. Therefore, establishing an efficient and reliable defect detection system for PV components is of great significance for ensuring operational safety.

A visual example of a solar panel array.

As the installed capacity of PV power stations continues to grow, traditional manual inspection methods can no longer meet the maintenance needs of large-scale plants. Intelligent inspection systems based on UAV platforms are gradually becoming the mainstream industry solution. Infrared thermal imaging technology monitors the temperature field distribution of solar panels during operation, using thermal anomalies or “hotspots” as key diagnostic indicators for fault identification. Furthermore, infrared imaging is independent of natural lighting, insensitive to ambient light changes, and capable of all-weather imaging. Its image quality remains relatively stable across different seasons, climates, and lighting conditions, with strong consistency in representing thermal anomalies. This technology offers advantages for detection in challenging conditions, enabling rapid panoramic scanning of PV arrays from a distance without the need for component disassembly or physical contact, making it particularly suitable for efficient inspection of large-scale PV power plants.

Currently, methods for detecting defects in infrared images of solar panels can be primarily divided into two categories: traditional image processing-based methods and deep learning-based methods. Traditional methods perform defect detection through operations like image binarization and morphological processing. However, these methods generally suffer from inherent drawbacks such as low computational efficiency, poor adaptability, and susceptibility to environmental noise.

In contrast, intelligent detection methods based on deep learning, leveraging their powerful feature extraction and pattern recognition capabilities, provide a new direction for defect detection in PV components. Research in this field has achieved certain progress. Nevertheless, when facing target detection tasks against complex backgrounds, difficulties in feature extraction can easily lead to missed and false detections, subsequently causing a decline in the model’s detection accuracy. To address these issues, this work proposes an object detection method based on an improved RT-DETR-R50 model. The specific improvements are reflected in the following three aspects:

  1. In the backbone network, a structure combining Partial Convolution (PConv) with an EMA attention mechanism, termed EMA-PConv, is constructed. The original BasicBlock modules in the model are replaced with EMA-PConv, effectively reducing the model’s parameter count and computational load.
  2. A Multiscale Feature Adaptive Pyramid Network (MFAPN) is constructed for feature fusion. This fusion structure provides richer feature support for small target detection, thereby enhancing detection performance.
  3. A Multi-path Downsampling Enhancement Module (MDEM) is proposed to replace the max-pooling layers in RT-DETR-R50, addressing the problem of semantic information loss and improving the model’s feature extraction capability.

Methodology

1. Overview of the RT-DETR Framework

The RT-DETR (Real-Time Detection Transformer) model adopts a three-stage cascade structure, consisting of a Backbone network, a Hybrid Encoder, and a Transformer decoder. The model extracts hierarchical features from the last three stages of the backbone network as input to the encoder. During the encoding phase, feature enhancement is achieved through an Attention-based Intra-scale Feature Interaction (AIFI) module, and multi-scale features are converted into a sequential image feature representation combined with cross-scale feature fusion technology. The encoder output employs an IoU-aware query selection mechanism to dynamically filter the most representative image features as the initial set of object queries for the decoder. The decoder iteratively refines the object query vectors over multiple rounds and collaboratively generates bounding box coordinates and corresponding confidence scores with the aid of an Auxiliary Prediction Head.

2. The Proposed MPMA-DETR Model

This paper introduces a series of improvements to the RT-DETR-R50 model, aiming to enhance the accuracy of the object detection algorithm while reducing its parameter count. The improved model, named MPMA-DETR, integrates the EMA-PConv module into the backbone, employs the MFAPN for multi-scale fusion, and utilizes the MDEM for enhanced downsampling.

3. EMA-PConv Module

To address the issues of parameter redundancy and computational inflation caused by standard convolutions in the backbone network, while retaining efficient feature extraction, this work introduces the Partial Convolution (PConv) technique. PConv’s core advantage is that it performs convolution on only a portion of the input feature map’s channels. Because the cost of a convolution scales with the product of its input and output channel counts, convolving one-quarter of the channels requires only 1/16 of the floating-point operations of a standard convolution, significantly reducing overall computational complexity. Building on this, the basic residual structure (BasicBlock) is improved into a lightweight convolution module called EMA-PConv.
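The 1/16 figure can be sanity-checked with a quick FLOP count: a convolution's cost scales with the product of its input and output channel counts, so convolving C/4 channels back to C/4 costs (1/4)² = 1/16 of the full cost. A minimal sketch (the `conv_flops` helper and the tensor sizes are illustrative, not from the paper):

```python
# Rough FLOP count for a k x k convolution over an H x W feature map.
def conv_flops(in_ch, out_ch, k, h, w):
    return in_ch * out_ch * k * k * h * w

C, H, W, k = 64, 80, 80, 3
full = conv_flops(C, C, k, H, W)               # standard conv over all channels
partial = conv_flops(C // 4, C // 4, k, H, W)  # PConv touches only C/4 channels
print(partial / full)  # 0.0625, i.e. 1/16
```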

The EMA-PConv module performs a 3×3 convolution on only 1/4 of the input feature map’s channels, leaving the remaining channels unchanged. This design is based on low-rank approximation theory: image features have high redundancy in the channel dimension. By convolving only a subset of channels, key features can be preserved while reducing redundant computation. Furthermore, the retained unconvolved channels help maintain the integrity of the original semantic structure, facilitating subsequent attention mechanisms to more easily capture important target regions. Therefore, this module incorporates the efficient EMA channel attention mechanism, which dynamically learns channel importance weights to guide the model to focus on more discriminative regions for detection. The ReLU activation function and Layer Normalization (LayerNorm) are applied to alleviate gradient-related issues and enhance the module’s feature extraction and expressive capabilities. To further improve training stability, a Drop Path regularization strategy is introduced in the residual path to prevent overfitting when the model encounters complex samples. The computational process of EMA-PConv can be expressed as:

$$Z = EMA\left(ReLU\left(LayerNorm\left(PConv^{3×3}(Z_{C/4})\right)\right) + Z_{3C/4}\right)$$

where $Z_{C/4}$ represents the convolved one-fourth of the feature vector’s channels, $Z_{3C/4}$ the remaining three-fourths that pass through unchanged, and $PConv^{3×3}(·)$ denotes the 3×3 partial convolution operation.
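The computation above can be sketched in PyTorch as follows. This is a minimal reading of the module, not the authors' implementation: recombining the convolved quarter with the untouched three-quarters is shown as channel concatenation, LayerNorm is approximated by `GroupNorm(1, C)` (per-sample normalization over channels), the EMA attention stage is left pluggable, and Drop Path is omitted.

```python
import torch
import torch.nn as nn

class EMAPConvBlock(nn.Module):
    """Sketch of EMA-PConv: a 3x3 conv over the first C/4 channels only,
    with the remaining 3C/4 channels passing through unchanged; the
    recombined map then goes through an attention stage (pluggable here)."""
    def __init__(self, channels, attention=None):
        super().__init__()
        self.part = channels // 4
        self.pconv = nn.Conv2d(self.part, self.part, 3, padding=1, bias=False)
        # GroupNorm(1, C) normalizes over channels per sample, a LayerNorm stand-in
        self.norm = nn.GroupNorm(1, self.part)
        self.act = nn.ReLU(inplace=True)
        self.attention = attention if attention is not None else nn.Identity()

    def forward(self, x):
        conv_part, rest = x[:, :self.part], x[:, self.part:]
        conv_part = self.act(self.norm(self.pconv(conv_part)))
        # recombine the convolved quarter with the untouched channels
        return self.attention(torch.cat([conv_part, rest], dim=1))

x = torch.randn(2, 64, 32, 32)
y = EMAPConvBlock(64)(x)   # same shape as the input
```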

The EMA attention mechanism works as follows: First, the input feature map is divided into G groups along the channel dimension, and attention computation is performed only within each group to reduce computational overhead. In the spatial branch, global average pooling (AvgPool) is applied to each group’s feature map along the height and width dimensions, respectively. Spatial attention weights are obtained via 1×1 convolution and multiplied element-wise to guide the network to focus on key regions. The local branch employs a 3×3 convolution to extract fine-grained local information. Subsequently, the two branches generate attention weights via global average pooling and Softmax operation, respectively, which are then fused via matrix multiplication (MatMul) to produce the final output. The EMA operation can be summarized as:

$$Y = X \cdot \sigma \left( MatMul \left( Softmax(AvgPool(X_1)), Softmax(AvgPool(X_2)) \right) \right)$$

where $X^{Group}$ is the grouped representation of feature map $X$, with dimensions $B \times G \times H \times W$; $X_1$, $X_2$ are the feature maps processed by the spatial and local branch convolutions, respectively; $AvgPool$ is the global average pooling operation; $Softmax$ generates normalized weights; $MatMul$ combines different feature weights via matrix multiplication; and $\sigma$ is the sigmoid activation function.
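A simplified PyTorch sketch of this attention follows the description above (channel grouping, a pooled-descriptor branch and a 3×3 local branch, softmax-normalized global descriptors fused by MatMul, sigmoid gating). The exact branch widths and the broadcast-sum fusion of the H/W pooled descriptors are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class EMAAttention(nn.Module):
    """Simplified EMA-style grouped attention: a pooled 1x1 branch and a
    3x3 local branch are cross-combined via softmax-normalized global
    descriptors; the resulting spatial weights gate the input via sigmoid."""
    def __init__(self, channels, groups=8):
        super().__init__()
        assert channels % groups == 0
        self.g = groups
        cg = channels // groups
        self.conv1x1 = nn.Conv2d(cg, cg, 1)
        self.conv3x3 = nn.Conv2d(cg, cg, 3, padding=1)
        self.agp = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        b, c, h, w = x.shape
        cg = c // self.g
        xg = x.reshape(b * self.g, cg, h, w)
        # spatial branch: pool along H and W, fuse by broadcast sum, 1x1 conv
        x1 = self.conv1x1(xg.mean(3, keepdim=True) + xg.mean(2, keepdim=True))
        # local branch: fine-grained 3x3 features
        x2 = self.conv3x3(xg)
        # softmax-normalized global descriptor of each branch
        d1 = torch.softmax(self.agp(x1).reshape(b * self.g, 1, cg), dim=-1)
        d2 = torch.softmax(self.agp(x2).reshape(b * self.g, 1, cg), dim=-1)
        # cross MatMul: each descriptor weights the other branch's spatial map
        w12 = torch.matmul(d1, x2.reshape(b * self.g, cg, h * w))
        w21 = torch.matmul(d2, x1.reshape(b * self.g, cg, h * w))
        weights = (w12 + w21).reshape(b * self.g, 1, h, w).sigmoid()
        return (xg * weights).reshape(b, c, h, w)

attn_out = EMAAttention(64)(torch.randn(2, 64, 16, 16))
```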

4. Multiscale Feature Adaptive Pyramid Network (MFAPN)

In infrared solar panel defect detection, interference from varying image resolutions and target defect size differences poses challenges. Traditional object detection methods often struggle to balance the capture of detailed features with the expression of global semantic information, leading to lower target detection accuracy. To address this and enhance the model’s adaptability to multi-scale features, a Multiscale Feature Adaptive Pyramid Network (MFAPN) is proposed. Through the co-design of feature extraction and fusion, it effectively tackles object detection challenges in complex scenes.

First, input feature maps of different resolutions (denoted as L, M, S for large, medium, and small receptive field features) are resized to the same spatial resolution and concatenated along the channel dimension. A semantic strength priority strategy is introduced for the concatenation order: the higher-level, larger-receptive-field feature $X_L$ is concatenated first to emphasize global semantics, followed by the medium-scale $X_M$, and finally the feature $X_S$ containing local structure and edge details. This order leverages a semantic-guided fusion mechanism, using stable high-level features as semantic support before gradually introducing low-level details, mitigating multi-scale semantic inconsistency and improving the discriminative power of feature fusion. Theoretically, $X_L$ with a larger receptive field carries rich contextual information and forms the basis for high-level semantic understanding, while $X_S$ focuses more on local information like texture and boundaries. Prioritizing detail features might cause the model to develop biased responses to non-critical areas during fusion, weakening overall semantic modeling.

Specifically, the larger feature map $X_L$ is compressed to the size of the medium-scale map $X_M$ via both adaptive max pooling (MaxPool) and adaptive average pooling (AdaPool), and the contextual information from the two pooling methods is then fused via summation. The smaller feature map $X_S$ is upsampled to the spatial size of $X_M$ using nearest-neighbor interpolation. The three are then concatenated along the channel dimension to form a multi-scale feature vector with consistent feature dimensions. This processing preserves the spatial semantic information of features at each scale without adding extra computational overhead.

$$X_L’ = MaxPool(X_L) + AdaPool(X_L)$$
$$X_S’ = Upsample(X_S)$$
$$Y = Concat(X_L’, X_M, X_S’)$$
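The alignment step can be sketched as follows, taking $X_L$ as the spatially larger map that is pooled down and $X_S$ as the smaller map that is upsampled, matching the equations above; the tensor sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def align_scales(x_l, x_m, x_s):
    """Align L/M/S feature maps to the medium resolution and concatenate
    in the semantic-priority order L, M, S."""
    h, w = x_m.shape[-2:]
    # large map: adaptive max pooling + adaptive average pooling, fused by sum
    x_l = F.adaptive_max_pool2d(x_l, (h, w)) + F.adaptive_avg_pool2d(x_l, (h, w))
    # small map: nearest-neighbour upsampling to the medium spatial size
    x_s = F.interpolate(x_s, size=(h, w), mode="nearest")
    return torch.cat([x_l, x_m, x_s], dim=1)

x_l = torch.randn(1, 64, 40, 40)   # spatially largest map
x_m = torch.randn(1, 64, 20, 20)
x_s = torch.randn(1, 64, 10, 10)
y = align_scales(x_l, x_m, x_s)    # 1 x 192 x 20 x 20
```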

Inspired by sequence-modeling ideas, the dynamic sequential dependencies between features across scales are modeled to further explore the relationships between feature information at different scales. The multi-scale feature vector is therefore stacked along the scale dimension to construct a 5D tensor $F$ ($B \times C \times 3 \times H \times W$), where the third dimension represents the scale sequence. A 1×1×1 3D convolution kernel then performs feature interaction along the scale dimension, its sliding operation mimicking the temporal-dependency capture mechanism of sequence modeling. After stabilizing the feature distribution via BatchNorm and introducing non-linearity via the ReLU activation function, a 3×1×1 3D max pooling finally compresses the scale dimension for output. This output contains the core information from all scales, preserving details while fusing global semantics, and the design enhances the model’s ability to fuse multi-scale information by increasing inter-scale interaction.

$$F = Reshape(Y) \in \mathbb{R}^{B \times C \times 3 \times H \times W}$$
$$F_{out} = MaxPool_{3 \times 1 \times 1} (ReLU(BN(Conv_{1 \times 1 \times 1}(F))))$$
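A sketch of this scale-interaction step, with one assumption: the three aligned maps are stacked directly rather than reshaped from the concatenated tensor, which yields the same $B \times C \times 3 \times H \times W$ layout. The 1×1×1 Conv3d mixes channels at each scale position as it slides along the scale axis; the 3×1×1 max pooling then keeps the strongest response across the three scales.

```python
import torch
import torch.nn as nn

class ScaleFusion3D(nn.Module):
    """Stack three aligned maps into B x C x 3 x H x W, apply a shared
    1x1x1 channel-mixing Conv3d at each scale position, then collapse the
    scale axis with a 3x1x1 max pooling back to B x C x H x W."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=1)
        self.bn = nn.BatchNorm3d(channels)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool3d(kernel_size=(3, 1, 1))

    def forward(self, x_l, x_m, x_s):
        f = torch.stack([x_l, x_m, x_s], dim=2)   # B x C x 3 x H x W
        f = self.act(self.bn(self.conv(f)))
        return self.pool(f).squeeze(2)            # B x C x H x W

maps = [torch.randn(2, 64, 20, 20) for _ in range(3)]
fused = ScaleFusion3D(64)(*maps)
```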

5. Multi-path Downsampling Enhancement Module (MDEM)

The max-pooling downsampling method in the original RT-DETR model may ignore fine-grained subject features when extracting global contextual information, especially in infrared images of solar panels where subtle defect areas are highly similar to background thermal interference. Traditional methods can easily cause target blurring or even loss. To address this, a Multi-path Downsampling Enhancement Module (MDEM) is designed to enhance the expression of defect target information while suppressing background interference, achieving focused modeling of salient regions in complex thermal images.

This module employs parallel strided convolution, depthwise separable convolution, and pooling paths during downsampling to extract feature responses at different scales and orientations. Strided convolution reinforces spatial gradient changes at target edges, depthwise convolution focuses on local differential structures, and max pooling suppresses high-frequency noise in the background. The fused output feature map not only preserves target structural information but also effectively weakens background interference responses, improving the model’s discriminative capability for key target regions.

The MDEM uses a two-stage process for efficient feature extraction and dimensionality reduction of the input image. A 7×7 convolution is used for initial feature extraction of the input feature vector ($C \times H \times W$), outputting an initial feature vector $X_{init}$ with channels compressed to $C/4$ and maintaining the original resolution ($C/4 \times H \times W$).

In the first stage, the module executes two processing paths in parallel: the Stride Path performs 2× downsampling via strided convolution and expands channels to $C/2$ ($C/2 \times H/2 \times W/2$), while the Depth Path uses depthwise separable convolution to complete local feature extraction and channel expansion ($C/2 \times H/2 \times W/2$). Both are then concatenated, processed by a 1×1 convolution and batch normalization. Shallow feature maps often contain significant noise; introducing max-pooling downsampling at this stage would retain noise interference, which is detrimental to subsequent feature learning. Therefore, max-pooling is omitted in this stage.

$$X_{init} = Conv^{7 \times 7}(X) \in \mathbb{R}^{C/4 \times H \times W}$$
$$F = BN(Conv^{1 \times 1}(Concat(Conv^{3 \times 3}(X_{init}), DWConv^{3 \times 3}(X_{init}))))$$

The second stage introduces a three-path processing mechanism: the Depth Path continues channel expansion to $C$ ($B \times C \times H/4 \times W/4$) via depthwise separable convolution; the Stride Path performs downsampling while preserving original information ($C \times H/4 \times W/4$); and a newly added Pooling Path extracts high-response region features via max pooling ($C \times H/4 \times W/4$). Finally, the concatenated three-path features are fused via a 1×1 convolution. This operation integrates information across the channel dimension and implicitly models inter-channel weighting: during training, backpropagation automatically optimizes the channel weights of the fusion convolution, adaptively adjusting the importance of features from each path. Compared to introducing explicit path weight coefficients or attention modules, this implicit fusion strategy avoids additional parameter redundancy and model-size growth, balancing computational efficiency and feature completeness through progressive channel expansion ($C/4 \rightarrow C/2 \rightarrow C$).

$$V = BN(Conv^{1 \times 1}(Concat(Conv^{3 \times 3}(F), DWConv^{3 \times 3}(F), MaxPool(F))))$$
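Putting the two stages together, a hedged PyTorch sketch of MDEM. The kernel sizes of the stride paths and the 1×1 channel expansion after the pooling path (needed to reach $C$ channels, as the text does not say how the pooling path expands channels) are assumptions:

```python
import torch
import torch.nn as nn

def dwsep(cin, cout, stride):
    """Depthwise separable conv: strided depthwise 3x3 + pointwise 1x1."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, stride=stride, padding=1, groups=cin, bias=False),
        nn.Conv2d(cin, cout, 1, bias=False),
    )

class MDEM(nn.Module):
    """Sketch of the two-stage multi-path downsampling module: 7x7 stem
    (C -> C/4), stage 1 with stride + depthwise paths (-> C/2, H/2),
    stage 2 adding a max-pooling path (-> C, H/4); each stage is fused
    by a 1x1 convolution with batch normalization."""
    def __init__(self, c):
        super().__init__()
        self.stem = nn.Conv2d(c, c // 4, 7, padding=3, bias=False)
        # stage 1: two parallel 2x-downsampling paths, C/4 -> C/2 each
        self.s1_stride = nn.Conv2d(c // 4, c // 2, 3, stride=2, padding=1, bias=False)
        self.s1_depth = dwsep(c // 4, c // 2, stride=2)
        self.s1_fuse = nn.Sequential(nn.Conv2d(c, c // 2, 1, bias=False),
                                     nn.BatchNorm2d(c // 2))
        # stage 2: stride, depthwise and max-pooling paths, C/2 -> C each
        self.s2_stride = nn.Conv2d(c // 2, c, 3, stride=2, padding=1, bias=False)
        self.s2_depth = dwsep(c // 2, c, stride=2)
        self.s2_pool = nn.Sequential(nn.MaxPool2d(2),
                                     nn.Conv2d(c // 2, c, 1, bias=False))
        self.s2_fuse = nn.Sequential(nn.Conv2d(3 * c, c, 1, bias=False),
                                     nn.BatchNorm2d(c))

    def forward(self, x):
        x = self.stem(x)                                           # C/4, H, W
        f = self.s1_fuse(torch.cat([self.s1_stride(x),
                                    self.s1_depth(x)], dim=1))     # C/2, H/2
        v = self.s2_fuse(torch.cat([self.s2_stride(f),
                                    self.s2_depth(f),
                                    self.s2_pool(f)], dim=1))      # C, H/4
        return v

x = torch.randn(2, 64, 32, 32)
v = MDEM(64)(x)   # spatial size quartered, channels restored to C
```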

Experimental Results and Analysis

1. Experimental Setup

All experiments were conducted on an NVIDIA RTX 4080 GPU using Python 3.10 and the PyTorch 1.12.1 deep learning framework. For a fair comparison, no pre-trained weights were used for any model. The batch size was set to 8, training epochs to 200, early-stopping patience to 50, and the initial learning rate to 0.0001. Other training hyperparameters used default values.

2. Dataset Description

The experimental dataset was collected from a PV power plant. Infrared images of solar panels from different power units were captured by a UAV equipped with an infrared camera, totaling 3,694 infrared images with a resolution of 640×640. The dataset contains four types of PV component defects: diode short circuit (Hotspot), PID effect (Golden-spot), minor micro-cracks (Light-golden-spot), and shading (Shadow). The dataset was annotated using the Labelme tool and divided into training, validation, and test sets with approximately 2,586, 382, and 726 images, respectively, in a ratio of about 7:1:2.

3. Evaluation Metrics

Precision (P), Recall (R), and mean Average Precision (mAP) were selected as metrics to evaluate the improvement in model detection accuracy. Higher values for these three metrics indicate higher detection accuracy and better performance.

  • Precision (P): The proportion of correctly predicted positive samples among all predicted positive samples. $$P = \frac{TP}{TP + FP}$$
  • Recall (R): The proportion of correctly predicted positive samples among all actual positive samples. $$R = \frac{TP}{TP + FN}$$
  • Average Precision (AP): The average precision at different recall levels for a specific class. $$AP = \int_0^1 P(R) dR$$
  • mean Average Precision (mAP): The average of AP over all classes. $$mAP = \frac{1}{n_c} \sum_{i=1}^{n_c} AP_i$$ where $TP$ is the number of true positives, $FP$ is the number of false positives, $FN$ is the number of false negatives, and $n_c$ is the number of classes.
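These four metrics can be computed directly from detection counts; below is a minimal sketch that approximates the AP integral by trapezoidal integration over (recall, precision) points (real detection benchmarks typically use interpolated precision instead):

```python
def precision_recall(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(points):
    """Approximate AP = integral of P(R) dR by trapezoidal integration
    over (recall, precision) points sorted by recall."""
    points = sorted(points)
    ap, prev_r, prev_p = 0.0, 0.0, points[0][1]
    for r, p in points:
        ap += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return ap

def mean_ap(ap_per_class):
    """mAP: mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

p, r = precision_recall(tp=80, fp=20, fn=20)          # 0.8, 0.8
# hypothetical PR curve: perfect precision up to R=0.5, then dropping to 0.5
ap = average_precision([(0.0, 1.0), (0.5, 1.0), (1.0, 0.5)])   # 0.875
```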

Additional metrics include the Number of Parameters, inference speed in Frames Per Second (FPS), computational demand in GFLOPs (Giga Floating Point Operations), and model weight file size (Weight).

4. Ablation Studies

4.1 MPMA-DETR Ablation

A series of improvements were made based on the RT-DETR model, and the effectiveness of the proposed algorithmic modules was verified through ablation experiments. The results are shown in the table below.

| EMA-PConv | MFAPN | MDEM | P(%) | R(%) | mAP50(%) | Parameters(M) | GFLOPs | FPS |
|:---:|:---:|:---:|---|---|---|---|---|---|
|   |   |   | 73.7 | 75.9 | 75.0 | 43.0 | 129.6 | 100 |
| ✓ |   |   | 72.3 | 68.7 | 73.3 | 16.9 | 51.5 | 151 |
|   | ✓ |   | 73.3 | 80.4 | 76.9 | 43.4 | 142.2 | 88 |
|   |   | ✓ | 74.5 | 83.4 | 76.4 | 43.1 | 50.2 | 154 |
| ✓ | ✓ |   | 75.1 | 75.3 | 76.6 | 35.7 | 131.2 | 72 |
| ✓ |   | ✓ | 75.1 | 73.7 | 75.5 | 34.7 | 47.5 | 70 |
|   | ✓ | ✓ | 75.7 | 76.4 | 75.8 | 43.2 | 53.4 | 149 |
| ✓ | ✓ | ✓ | 76.3 | 79.6 | 78.3 | 35.8 | 50.6 | 135 |

The results indicate that the EMA-PConv module reduces parameters by 60.7% when used alone, but at the cost of a drop in detection accuracy. This may be due to: (1) EMA-PConv replacing the strongly representative BasicBlock residual structure, weakening inter-channel information interaction and feature accumulation; (2) channel attention mechanisms relying on stable high-order feature expression, so that under lightweight settings with fewer convolved channels, greater feature distribution variability reduces the efficiency of attention weight learning; (3) the thermal diffusion effect in infrared images blurring the boundaries between defects and background, which calls for long-range context modeling; restricting convolution to 1/4 of the channels may limit spatial modeling capacity and directly impact detection accuracy. The MDEM structure was proposed to partially compensate for the feature loss caused by lightweighting, and on its own improves mAP50 by 1.4 percentage points. Finally, introducing MFAPN further enhanced performance: compared to the original RT-DETR-R50 model, the full model improves precision by 2.6 percentage points, recall by 3.7 percentage points, and mAP50 by 3.3 percentage points, while reducing the parameter count by 16.8%, confirming the effectiveness of the improvements.

4.2 EMA-PConv Channel Ratio Ablation

To explore the specific impact of the channel partitioning strategy in the EMA-PConv module on model performance, ablation experiments were designed with different channel convolution ratios (1/8, 1/4, 1/2). The results are shown below.

| Channel Ratio | P(%) | R(%) | mAP50(%) | Parameters(M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|
| 1/2 | 72.1 | 67.6 | 73.1 | 18.1 | 54.3 | 149 |
| 1/4 | 72.3 | 68.7 | 73.3 | 16.9 | 51.5 | 151 |
| 1/8 | 69.2 | 67.4 | 72.9 | 16.6 | 50.7 | 154 |

As the proportion of channels involved in convolution decreases, model parameters and GFLOPs drop significantly and FPS increases, indicating that the partial-channel convolution strategy improves operational efficiency. In terms of accuracy, the 1/4 ratio achieved the best mAP50 (73.3%) while maintaining good precision and recall, showing strong overall detection capability. Intuitively, a larger channel proportion helps extract high-frequency details but introduces redundancy and extra computation, while too small a proportion leads to insufficient feature expression and hurts the recognition of minor defects. The 1/4 ratio balances detection accuracy and efficiency while keeping the model lightweight, so this configuration is used as the default in all experiments.

4.3 MFAPN Concatenation Order Ablation

To further analyze the impact of concatenation order for different resolution feature maps in the MFAPN module on model performance, comparative experiments with different orders were designed.

| Order | P(%) | R(%) | mAP50(%) | Parameters(M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|
| L,M,S | 73.3 | 80.4 | 76.9 | 43.4 | 142.2 | 88 |
| L,S,M | 73.1 | 75.1 | 76.6 | 43.4 | 142.2 | 88 |
| M,L,S | 72.8 | 76.0 | 74.3 | 43.1 | 142.2 | 87 |
| M,S,L | 72.9 | 73.6 | 75.2 | 43.1 | 142.2 | 87 |
| S,L,M | 71.5 | 78.5 | 73.6 | 43.1 | 142.2 | 88 |
| S,M,L | 68.8 | 78.0 | 74.1 | 43.1 | 142.2 | 88 |

When concatenating in the order large–medium–small (L,M,S), the model achieved the highest mAP50 of 76.9%, with precision and recall of 73.3% and 80.4%, respectively. In contrast, performance declined with other orders (e.g., S-L-M or S-M-L), dropping to as low as 73.6% or 74.1% mAP50. This shows that a reasonable concatenation order helps build a semantic hierarchy from global to local, enhancing the model’s ability to identify targets in complex backgrounds. Since all experimental groups maintained consistent parameter scale, computational complexity, and inference speed, it further verifies that performance differences mainly stem from the feature concatenation strategy itself. Therefore, the L,M,S order is used as the default setting.

5. Comparative Experiments

To further verify the superiority of the proposed algorithm for solar panel defect detection, it was compared with currently popular algorithms in terms of detection accuracy, model complexity, and computational efficiency. The comparison algorithms included YOLOv10x, YOLOv11, YOLOv12, Faster-RCNN, RT-DETR-R18, RT-DETR-R101, and DE-DETR. All algorithms were trained on the same experimental platform using identical training and validation sets. The results are shown in the table below.

| Model | P(%) | R(%) | mAP50(%) | mAP50-95(%) | Params(M) | GFLOPs | Weight (MB) | FPS |
|---|---|---|---|---|---|---|---|---|
| RT-DETR-R50 | 73.7 | 75.9 | 75.0 | 44.3 | 43.0 | 129.6 | 86.1 | 100 |
| RT-DETR-R18 | 72.4 | 71.9 | 71.5 | 41.6 | 20.0 | 57.0 | 40.5 | 192 |
| RT-DETR-R101 | 73.4 | 82.8 | 74.4 | 44.4 | 74.0 | 247.1 | 153.8 | 71 |
| Faster-RCNN | 68.3 | 71.3 | 71.9 | 40.3 | 41.3 | 133.9 | 161.9 | 42 |
| YOLOv10x | 75.3 | 68.4 | 75.3 | 44.8 | 31.6 | 169.8 | 64.1 | 161 |
| YOLOv11 | 75.2 | 79.4 | 76.4 | 45.1 | 2.6 | 6.3 | 5.5 | 435 |
| YOLOv12 | 73.6 | 74.7 | 73.4 | 42.0 | 2.7 | 6.4 | 5.6 | 357 |
| DE-DETR | 71.2 | 74.3 | 72.2 | 41.9 | 40.3 | 86.0 | 175.0 | 167 |
| OURS (MPMA-DETR) | 76.3 | 79.6 | 78.3 | 45.5 | 35.8 | 50.6 | 76.1 | 135 |

In terms of precision, the improved model achieved 76.3%, the highest among all compared models and superior to Faster-RCNN (68.3%), RT-DETR-R101 (73.4%), DE-DETR (71.2%), and YOLOv10x (75.3%), indicating an advantage in reducing false detections. On the mAP50 metric, the improved model achieved 78.3%, improvements of 3.9 and 3.0 percentage points over RT-DETR-R101 (74.4%) and YOLOv10x (75.3%), respectively; it outperforms even these more computationally demanding models, demonstrating its effectiveness in accurate target detection.

For recall, the improved model reached 79.6%, showing good positive-instance identification, though still behind RT-DETR-R101 (82.8%). This is primarily because RT-DETR-R101's greater network depth provides stronger semantic representation, its larger parameter count enables more thorough context learning, and it uses deformable attention for dynamic spatial perception.

In terms of model complexity, the improved model has 35.8M parameters, significantly fewer than RT-DETR-R101's 74.0M and 16.8% fewer than the original RT-DETR-R50's 43.0M, giving it better adaptability in hardware-constrained environments. Its computational cost of 50.6 GFLOPs is lower than DE-DETR (86.0 GFLOPs), YOLOv10x (169.8 GFLOPs), and RT-DETR-R101 (247.1 GFLOPs), enabling fast inference. Models such as YOLOv11 and YOLOv12 have far fewer parameters (2.6M, 2.7M), but their overall detection performance, especially precision and mAP50, is markedly lower than the improved model's. From an operational-efficiency perspective, the improved model maintains high detection accuracy while reaching an inference speed of 135 FPS, faster than RT-DETR-R101 and Faster-RCNN. This verifies that the model achieves high frame-rate output despite its greatly reduced theoretical complexity (50.6 GFLOPs), reflecting good structural lightness and inference parallelism. Overall, with the best mAP50 (78.3%), good recall (79.6%), and a lower parameter count and computational cost, the proposed model performs outstandingly in the comparative experiments.

6. Visualization Analysis

To intuitively demonstrate the detection effectiveness of the proposed algorithm on solar panel defects, visualization examples of detection results are provided, combined with qualitative analysis using heatmaps generated by Grad-CAM. The Grad-CAM method visualizes the output layer of the last multi-head attention module in the Transformer decoder, which is close to the detection head and can fully reflect the model’s attention distribution during the target discrimination stage. In the experiment, gradient information for the activation maps of this layer is obtained via backpropagation and combined with its feature maps via weighted summation to generate the final heatmap.

The analysis uses heatmaps without bounding boxes (first three rows) and heatmaps with predicted bounding boxes (last three rows). Heatmaps without boxes reveal the model’s response level to different image regions during the feature extraction stage, helping analyze its receptive field and contextual dependency capabilities. Heatmaps with boxes allow observation of the model’s focus on target areas during the prediction process, validating the rationality of detection results. Experimental results show that the improved model can more accurately focus on defect regions while effectively suppressing background noise. Specifically, in heatmaps without boxes, the improved model shows highly concentrated responses in key defect areas, whereas RT-DETR-R50’s responses are more dispersed, increasing the risk of false detection. When comparing with RT-DETR using only the MDEM module (first three rows): the original RT-DETR-R50 model without MDEM shows heatmap high-response areas often diffusing into non-target regions with obvious background interference. After adding the MDEM module, the heatmap highlight areas concentrate on target areas containing defects, with clear response boundaries, and background area responses are significantly weakened.

In heatmaps with bounding boxes, it can be observed that the improved model’s high-response areas are mostly concentrated inside the targets, indicating good spatial perception. For example, although both detect the defect area in the fourth row, RT-DETR-R50 pays more attention to background noise; in the fifth and sixth rows, RT-DETR-R50 shows missed defects. Comparing RT-DETR-MDEM with RT-DETR-R50 in the fourth and sixth rows, the model’s attention response range is more concentrated on the real defect areas, and background noise response is effectively weakened. However, in the fifth row, the MDEM module did not effectively suppress background noise, possibly because features extracted by each path were not effectively aligned in semantic space, leading to retention of redundant or non-discriminative information, affecting the model’s ability to suppress background noise. Combined with heatmap comparative analysis, it can be concluded that the improved model’s heatmap highlight areas are more accurately localized, capture more critical features, and significantly outperform the initial RT-DETR-R50 model in feature extraction concentration, background suppression capability, robustness in complex scenes, and small target detection.

Conclusion

This paper addresses the difficulty of detecting defects in infrared images of solar panels against complex backgrounds by proposing an MPMA-DETR detection algorithm. This method introduces an EMA attention mechanism into the backbone network, constructing an efficient EMA-PConv module that effectively reduces the model’s parameter count and computational overhead. Simultaneously, a Multi-path Downsampling Enhancement Module (MDEM) is designed to improve feature extraction capability. Furthermore, a Multiscale Feature Adaptive Pyramid Network (MFAPN) is constructed, enabling fast and efficient multi-scale feature fusion, further enhancing the model’s feature expression ability. Experimental results show that this method not only improves detection accuracy but also optimizes model parameters and computational efficiency, verifying its effectiveness and practicality in infrared image defect detection tasks. In summary, the algorithm proposed in this paper to some extent ameliorates the difficulty of detecting target defects in infrared images of solar panels against complex backgrounds, providing a new solution for intelligent inspection in photovoltaics.
