Improved Infrared Defect Detection for Solar Panels Using Multi-Path Downsampling and Multi-Scale Feature Fusion

Solar panels are critical components in photovoltaic power stations, and regular inspection is essential to ensure safe and efficient operation. With the increasing deployment of solar energy systems, the need for reliable defect detection methods has become paramount. Infrared imaging technology offers a robust solution for identifying anomalies in solar panels, as defects often manifest as thermal irregularities. However, detecting these defects in aerial infrared images with complex backgrounds presents significant challenges, including low contrast, small target sizes, and environmental noise. Traditional methods often struggle with these issues, leading to high rates of missed detections and false positives. In this paper, we propose an enhanced approach based on the RT-DETR-R50 model to address these limitations and improve the accuracy of defect detection in solar panels.

The proposed method incorporates three key changes to feature extraction and fusion. First, we introduce the EMA-PConv module, which combines partial convolution (PConv) with an efficient multi-scale attention (EMA) mechanism. PConv convolves only a subset of the input channels and passes the rest through unchanged, exploiting redundancy across channels to reduce computational complexity and parameter count while retaining strong feature representation. The EMA mechanism then reweights the features so that defect regions of the solar panel receive more emphasis. The operation can be expressed as:

$$Z = \text{EMA}(\text{ReLU}(\text{LayerNorm}(\text{PConv}_{3 \times 3}(Z_{C/4}) + Z_{3C/4})))$$

where $Z_{C/4}$ denotes the one-fourth of the input channels that is convolved, $Z_{3C/4}$ denotes the remaining three-fourths, and $\text{PConv}_{3 \times 3}$ denotes the partial convolution operation. The EMA mechanism groups the features and applies spatial and local branches to generate attention weights, formulated as:

$$Y = \sigma \left( \text{MatMul} \left( \text{Softmax}(\text{AvgPool}(X_1)), \text{Softmax}(\text{AvgPool}(X_2)) \right) \right) + X$$

Here, $X_1$ and $X_2$ are features from the spatial and local branches, respectively, and $\sigma$ is the sigmoid function.
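To make the efficiency argument concrete, the sketch below counts weights and multiply-accumulates for a regular 3×3 convolution and for a partial convolution that touches only one-fourth of the channels. The channel width (256) and feature-map size (40×40) are illustrative assumptions rather than values from the paper, and bias terms, normalization, and the EMA branch are ignored.

```python
def conv_cost(c_in, c_out, k, h, w):
    """Return (weights, multiply-accumulates) for a k x k convolution
    producing an h x w output map, ignoring bias terms."""
    weights = c_in * c_out * k * k
    macs = weights * h * w
    return weights, macs


def pconv_cost(channels, ratio, k, h, w):
    """Cost of a partial convolution that convolves only `channels * ratio`
    channels and passes the remaining channels through untouched."""
    c_part = int(channels * ratio)
    return conv_cost(c_part, c_part, k, h, w)


# Hypothetical layer size, chosen only for illustration.
C, H, W = 256, 40, 40

full_weights, full_macs = conv_cost(C, C, 3, H, W)
part_weights, part_macs = pconv_cost(C, 0.25, 3, H, W)

print(full_weights, part_weights)  # 589824 36864
print(part_macs / full_macs)       # 0.0625, i.e. 1/16 of the full cost
```

Because both the input and output widths of the convolved slice shrink by a factor of four, the cost of that layer falls by a factor of sixteen. The overall savings of a full model depend on how many layers are replaced.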

Second, we propose the Multi-path Downsampling Enhancement Module (MDEM) to replace standard max-pooling layers. MDEM runs three parallel paths (strided convolution, depthwise separable convolution, and pooling) and fuses their outputs, capturing multi-scale features while suppressing background noise. By integrating global and local information, the module helps the model focus on defect regions in solar panels. The output of MDEM is computed as:

$$V = \text{BN} \left( \text{Conv}_{1 \times 1} \left( \text{Concat} \left( \text{Conv}_{3 \times 3}(F), \text{DWConv}_{3 \times 3}(F), \text{MaxPool}(F) \right) \right) \right)$$

where $F$ is the input feature, BN is batch normalization, and DWConv denotes depthwise separable convolution.
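The three paths can only be concatenated if they produce the same spatial size. The shape check below is a minimal sketch under assumed hyperparameters: stride 2 and padding 1 for both 3×3 convolutions, a 2×2 max pool with stride 2, and a 1×1 convolution whose output width `c_out` is a free parameter. None of these values are stated in the text above, and the function names are hypothetical.

```python
def conv_out_size(size, kernel, stride, padding):
    """Standard output-size formula for convolution and pooling layers."""
    return (size + 2 * padding - kernel) // stride + 1


def mdem_shapes(c_in, h, w, c_out):
    """Trace (channels, height, width) through the three assumed MDEM paths."""
    conv = (c_in, conv_out_size(h, 3, 2, 1), conv_out_size(w, 3, 2, 1))
    dwconv = (c_in, conv_out_size(h, 3, 2, 1), conv_out_size(w, 3, 2, 1))
    pool = (c_in, conv_out_size(h, 2, 2, 0), conv_out_size(w, 2, 2, 0))

    # Concatenation along channels requires identical spatial sizes.
    assert conv[1:] == dwconv[1:] == pool[1:], "paths disagree on H x W"

    concat = (conv[0] + dwconv[0] + pool[0], conv[1], conv[2])
    fused = (c_out, concat[1], concat[2])  # 1x1 conv + BN keep H x W
    return concat, fused


concat, fused = mdem_shapes(c_in=64, h=160, w=160, c_out=128)
print(concat)  # (192, 80, 80)
print(fused)   # (128, 80, 80)
```

Under these settings an odd input size would leave the pooling path one pixel smaller than the convolution paths, which is what the assertion guards against.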

Third, we design the Multiscale Feature Adaptive Pyramid Network (MFAPN) to improve feature fusion across different scales. MFAPN combines features from large, medium, and small receptive fields by resizing them to a common resolution and concatenating them in a semantically guided order. This network leverages 3D convolutions to model scale-wise dependencies, enhancing the detection of small defects in solar panels. The process is defined as:

$$Y = \text{Concat} \left( \text{MaxPool}(X_L) + \text{AdaPool}(X_L), X_M, \text{Upsample}(X_S) \right)$$

$$F_{\text{out}} = \text{MaxPool}_{3 \times 1 \times 1} \left( \text{ReLU} \left( \text{BN} \left( \text{Conv}_{1 \times 1 \times 1}(F) \right) \right) \right)$$

where $X_L$, $X_M$, and $X_S$ are the large-, medium-, and small-scale features, and $F$ is the reshaped multi-scale tensor.
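The alignment and scale-wise pooling can be traced on a toy example. The sketch below uses single-channel maps stored as nested lists, with made-up values. It reads AdaPool as 2×2 average pooling and Upsample as nearest-neighbour upsampling, both of which are assumptions. The 1×1×1 convolution, BN, and ReLU are omitted so that only the data flow is shown.

```python
def pool2x2(x, reduce):
    """Apply a 2x2, stride-2 pooling with the given reduction (max or mean)."""
    return [
        [reduce([x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1]])
         for j in range(0, len(x[0]), 2)]
        for i in range(0, len(x), 2)
    ]


def mean(values):
    return sum(values) / len(values)


def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of two."""
    return [[x[i // 2][j // 2] for j in range(2 * len(x[0]))]
            for i in range(2 * len(x))]


def add(a, b):
    """Element-wise sum of two equally sized 2D maps."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy single-channel maps at three resolutions (values are made up).
x_l = [[1, 2, 0, 0],
       [3, 4, 0, 8],
       [0, 0, 1, 1],
       [0, 4, 1, 1]]
x_m = [[5, 1],
       [2, 9]]
x_s = [[6]]

# Align everything to the medium resolution, then stack in L, M, S order.
stack = [
    add(pool2x2(x_l, max), pool2x2(x_l, mean)),  # MaxPool(X_L) + AdaPool(X_L)
    x_m,
    upsample2x(x_s),
]

# Max pooling over the scale axis (kernel 3x1x1) keeps the strongest
# response per pixel across the three scales.
f_out = [[max(level[i][j] for level in stack) for j in range(2)]
         for i in range(2)]

print(stack[0])  # [[6.5, 10.0], [5.0, 2.0]]
print(f_out)     # [[6.5, 10.0], [6, 9]]
```

A practical implementation would operate on multi-channel tensors and learn the 1×1×1 convolution weights; the toy version only illustrates how three resolutions collapse into one fused map.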

To evaluate our method, we conducted experiments on a dataset of aerial infrared images of solar panels. The dataset includes 3694 images with four types of defects: hotspot, golden-spot, light-golden-spot, and shadow. We split the data into training, validation, and test sets with a ratio of 7:1:2. The experiments were performed on an NVIDIA RTX 4080 GPU using PyTorch, with a batch size of 8 and 200 training epochs. We used precision (P), recall (R), mean average precision (mAP), parameter count, GFLOPs, and frames per second (FPS) as evaluation metrics.
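As a rough guide to subset sizes, the sketch below applies the 7:1:2 ratio to 3694 images. The exact counts are not reported in the text, so the rounding rule used here (floor for the first two subsets, remainder to the test set) is an assumption, and `split_counts` is a hypothetical helper.

```python
def split_counts(total, ratios):
    """Split `total` items by integer ratios; the last subset takes the remainder."""
    denom = sum(ratios)
    counts = [total * r // denom for r in ratios[:-1]]
    counts.append(total - sum(counts))
    return counts


train, val, test = split_counts(3694, (7, 1, 2))
print(train, val, test)  # 2585 369 740
```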

Ablation studies were conducted to validate the effectiveness of each component. Used alone, the EMA-PConv module reduces parameters by 60.7% (from 43.0M to 16.9M) but lowers mAP50 from 75.0% to 73.3% and recall from 75.9% to 68.7%. Combined with MDEM and MFAPN, however, the model balances efficiency and accuracy. Among the EMA-PConv configurations, a 1/4 channel ratio gives the best mAP50 (73.3%); among the MFAPN configurations, concatenating features in large-medium-small order gives the highest mAP50 (76.9%). The complete model (MPMA-DETR) improves mAP50 by 3.3 percentage points over the baseline RT-DETR-R50 while reducing parameters by 16.8%.

Table 1: Ablation Study Results
Model	P (%)	R (%)	mAP50 (%)	Parameters (M)	GFLOPs	FPS
Baseline	73.7	75.9	75.0	43.0	129.6	100
+ EMA-PConv	72.3	68.7	73.3	16.9	51.5	151
+ MFAPN	73.3	80.4	76.9	43.4	142.2	88
+ MDEM	74.5	83.4	76.4	43.1	50.2	154
Full Model	76.3	79.6	78.3	35.8	50.6	135
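
The relative changes quoted for the ablation can be recomputed from the rounded entries of Table 1. The short script below is only an arithmetic check on those published numbers; the dictionary layout and the helper name are illustrative choices.

```python
# Rows copied from Table 1: (mAP50 %, parameters in M, GFLOPs).
table = {
    "Baseline":    (75.0, 43.0, 129.6),
    "+ EMA-PConv": (73.3, 16.9, 51.5),
    "Full Model":  (78.3, 35.8, 50.6),
}


def relative_reduction(base, new):
    """Percentage reduction of `new` relative to `base`, one decimal place."""
    return round((base - new) / base * 100, 1)


base_map, base_params, base_flops = table["Baseline"]

for name in ("+ EMA-PConv", "Full Model"):
    m, p, g = table[name]
    print(name,
          "mAP50 change:", round(m - base_map, 1),
          "params:", relative_reduction(base_params, p),
          "GFLOPs:", relative_reduction(base_flops, g))
```

With the rounded table values, the parameter reduction of the full model comes out at about 16.7%, marginally below the 16.8% quoted in the text. The difference is presumably due to rounding of the parameter counts.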

We also compared our method with state-of-the-art models, including YOLOv10x, YOLOv11, YOLOv12, Faster-RCNN, RT-DETR variants, and DE-DETR. Our model achieves the highest precision (76.3%) and mAP50 (78.3%), and its recall (79.6%) is second only to that of RT-DETR-R101 (82.8%). Its parameter count (35.8M) and computational demand (50.6 GFLOPs) are lower than those of the other DETR-based models and Faster-RCNN, although the lightweight YOLOv11 and YOLOv12 remain smaller and faster. At 135 FPS, it is suitable for real-time applications in solar panel inspection. Table 2 summarizes the comparison results.

Table 2: Comparison with State-of-the-Art Models
Model	P (%)	R (%)	mAP50 (%)	Parameters (M)	GFLOPs	FPS
YOLOv10x	75.3	68.4	75.3	31.6	169.8	161
YOLOv11	75.2	79.4	76.4	2.6	6.3	435
YOLOv12	73.6	74.7	73.4	2.7	6.4	357
Faster-RCNN	68.3	71.3	71.9	41.3	133.9	42
RT-DETR-R101	73.4	82.8	74.4	74.0	247.1	71
DE-DETR	71.2	74.3	72.2	40.3	86.0	167
Our Model	76.3	79.6	78.3	35.8	50.6	135

Visualization analysis using Grad-CAM heatmaps demonstrates that our model effectively focuses on defect regions in solar panels while suppressing background noise. Compared to the baseline, our method shows more concentrated responses on target areas, reducing false detections. For example, in complex scenarios, the model accurately identifies small defects without being distracted by environmental factors. This confirms the robustness of our approach for real-world solar panel inspection.

In conclusion, we have developed a defect detection method for solar panels that addresses the challenges of complex backgrounds in aerial infrared images. By integrating the EMA-PConv module, MDEM, and MFAPN, our model achieves higher accuracy at lower computational cost: relative to the RT-DETR-R50 baseline, precision rises from 73.7% to 76.3% and mAP50 from 75.0% to 78.3%, while GFLOPs fall from 129.6 to 50.6. This work contributes to solar energy maintenance by providing a reliable tool for automated inspection. Future research could explore the integration of additional sensor data or adaptive learning techniques to further enhance performance in diverse conditions. Continued progress on such methods is important for the sustainability and efficiency of solar power systems.