The large-scale deployment of solar panels is a cornerstone of the global transition towards sustainable energy. Ensuring their operational efficiency and longevity necessitates robust and automated inspection systems. While solar panels are designed for durability, prolonged exposure to harsh environmental conditions inevitably leads to various defects, such as hot spots caused by cell cracks, faulty bypass diodes, partial shading, and other anomalies. Timely detection of these faults is critical to prevent significant power loss and, in severe cases, fire hazards that can cause substantial economic damage.

Traditional inspection methods for solar panels, such as electrical characterization, require extensive sensor networks, increasing the cost and complexity for large-scale photovoltaic farms. In contrast, infrared (IR) thermography, often conducted via drones, provides a non-contact, efficient means to visualize thermal anomalies indicative of defects in solar panels. However, automating the analysis of these IR images presents significant challenges. The backgrounds are often complex (e.g., grass, desert, water), the signal-to-noise ratio is low, and the defects themselves exhibit large variations in size and shape, posing a classic multi-scale object detection problem. Small, subtle faults in solar panels can be easily missed by conventional algorithms.
Early machine learning approaches relied on handcrafted features and classifiers like Support Vector Machines (SVM) or Random Forests. These methods, while effective for specific cases, suffer from poor generalization and require extensive domain expertise. The advent of deep learning, particularly Convolutional Neural Networks (CNNs), has revolutionized this field by enabling automatic feature learning. Models like Faster R-CNN, YOLO, and Mask R-CNN have been adapted for defect detection in solar panels. However, their performance is often hampered by the inherent difficulties in IR imagery: the standard Feature Pyramid Network (FPN) may lose fine-grained details crucial for small defects, and the complex background can lead to false positives.
To address these limitations, we propose an enhanced deep learning framework based on Faster R-CNN, specifically designed for the robust detection of multi-scale defects in solar panels under complex infrared backgrounds. Our core contribution is a novel network architecture termed the Multi-scale Adaptive Fusion Feature Pyramid Network (MAFPN). This architecture integrates two key modules: a Feature Enhancement Module (FEM) and an Adaptive Feature Fusion Module (AFM). The FEM expands the receptive field and enriches contextual information using dilated convolutions and a local self-attention mechanism. The AFM intelligently fuses features from different scales by employing a polarized self-attention mechanism, effectively suppressing irrelevant background noise while preserving critical semantic and detailed information pertaining to faults in solar panels. Extensive experiments on a real-world infrared dataset of solar panels demonstrate that our method achieves superior detection accuracy compared to existing state-of-the-art approaches.
Methodology: The Proposed MAFPN Framework
Our approach builds upon the two-stage Faster R-CNN detector, renowned for its high accuracy. We replace its backbone and feature fusion neck with our designed components to better handle defects in solar panels. The overall pipeline is as follows: An input infrared image is fed into a ResNet-50 backbone to extract hierarchical feature maps at four different scales, corresponding to stride levels of 4, 8, 16, and 32. These feature maps, denoted as {C2, C3, C4, C5}, carry different information; shallow features (C2, C3) are rich in spatial detail for locating small defects, while deep features (C4, C5) contain high-level semantic information for recognizing faults. These multi-scale features are then processed by our proposed MAFPN to generate a new set of enhanced feature pyramids {P2, P3, P4, P5}. These refined features are used by the Region Proposal Network (RPN) to generate candidate object regions (proposals), which are then classified and precisely localized by the subsequent detection head.
The standard FPN fuses features through a top-down pathway with simple lateral connections (typically a 1×1 convolution followed by element-wise addition). This process can dilute important details and is susceptible to background clutter. Our MAFPN innovates this structure by incorporating the FEM at each lateral connection and the AFM at each fusion node.
Feature Enhancement Module (FEM)
The lateral connection in FPN primarily reduces channel dimensions and can lead to a loss of nuanced defect features. To counteract this, we design the FEM to augment the feature map’s representational power before fusion. The FEM operates on an input feature map $x_{in} \in \mathbb{R}^{H \times W \times C_1}$ and produces an output $x_{out} \in \mathbb{R}^{H \times W \times C_2}$.
The structure of the FEM is pivotal for capturing the diverse manifestations of defects in solar panels. It consists of two parallel branches whose outputs are summed:
$$ x_{out} = M_k(x_{in}) + M_l(x_{in}) $$
Here, $M_k(\cdot)$ represents the output of a multi-branch dilated convolution block, and $M_l(\cdot)$ represents the output of a Local Self-Attention (LSA) block.
1. Dilated Convolution Branch ($M_k$): This branch aims to capture multi-scale contextual information by employing parallel dilated convolutions with different dilation rates. Given an input $x_{in}$, it first applies a 1×1 convolution to adjust channels. Then, three parallel 3×3 convolutions with dilation rates of 1, 2, and 3 are applied. The outputs are summed:
$$ M_k(x_{in}) = f_{3\times3}^{d=1}(f_{1\times1}(x_{in})) + f_{3\times3}^{d=2}(f_{1\times1}(x_{in})) + f_{3\times3}^{d=3}(f_{1\times1}(x_{in})) $$
where $f_{k\times k}^{d=r}$ denotes a convolution with kernel size $k$ and dilation rate $r$. This allows the network to gather context from different receptive fields simultaneously, which is crucial for identifying defects in solar panels that may have varying thermal spreads.
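A minimal PyTorch sketch of this branch, as we read the equation: a shared 1×1 channel-adjusting convolution followed by three parallel 3×3 convolutions with dilation rates 1, 2, and 3, summed element-wise. The channel sizes here are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class DilatedBranch(nn.Module):
    """Sketch of the FEM dilated-convolution branch M_k: a 1x1 reduction
    followed by three parallel 3x3 dilated convs (d = 1, 2, 3), summed."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.reduce = nn.Conv2d(c_in, c_out, kernel_size=1)
        # padding = dilation keeps the spatial size fixed for 3x3 kernels
        self.branches = nn.ModuleList(
            nn.Conv2d(c_out, c_out, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 3)
        )

    def forward(self, x):
        x = self.reduce(x)
        return sum(b(x) for b in self.branches)

mk = DilatedBranch(c_in=512, c_out=256)
out = mk(torch.randn(2, 512, 64, 64))
print(tuple(out.shape))  # (2, 256, 64, 64)
```

Setting `padding` equal to the dilation rate is what lets the three receptive fields (3×3, 5×5, 7×7 effective) be summed at identical spatial resolution.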
2. Local Self-Attention Branch ($M_l$): While convolutions capture local patterns, the LSA branch is designed to model long-range dependencies within a localized window, enhancing the model’s understanding of the global structure of a potential defect area. The input is first normalized using LayerNorm (LN). The core operation is the Multi-Head Self-Attention (MHSA) applied within non-overlapping windows that partition the feature map. For a feature map of size $H \times W$, it is divided into $\frac{H}{m} \times \frac{W}{n}$ windows of size $m \times n$. Within each window, the standard self-attention mechanism is applied:
$$ \text{Attention}(Q, K, V) = \text{Softmax}(\frac{QK^T}{\sqrt{d_k}})V $$
$$ \text{MultiHead}(Q, K, V) = \text{Concat}(head_1, …, head_h)W^O $$
$$ \text{where } head_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$
$$ M_l(x_{in}) = f_{1\times1}(\text{MultiHead}(\text{LN}(x_{in}))) $$
Restricting attention to windows keeps the computational cost manageable: the model can still relate distant pixels within a suspected fault region on the solar panels, but at a fraction of the cost of global attention.
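The window partitioning and per-window attention can be sketched as follows. The window size (8×8), head count, and final 1×1 projection are illustrative assumptions; the sketch requires $H$ and $W$ to be divisible by the window size, as square windows ($m = n$) are used for simplicity.

```python
import torch
import torch.nn as nn

class LocalSelfAttention(nn.Module):
    """Sketch of the LSA branch M_l: LayerNorm, multi-head self-attention
    inside non-overlapping windows, then a 1x1 conv projection."""
    def __init__(self, channels, window=8, heads=4):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        m = self.window
        # partition into (H/m * W/m) windows of m*m tokens each
        t = x.view(B, C, H // m, m, W // m, m)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, m * m, C)
        t = self.norm(t)
        t, _ = self.attn(t, t, t)               # attention within each window
        # undo the window partition
        t = t.view(B, H // m, W // m, m, m, C).permute(0, 5, 1, 3, 2, 4)
        t = t.reshape(B, C, H, W)
        return self.proj(t)

lsa = LocalSelfAttention(channels=256)
print(tuple(lsa(torch.randn(1, 256, 32, 32)).shape))  # (1, 256, 32, 32)
```

Because each window attends only to its own $m^2$ tokens, the attention cost scales linearly in the number of windows rather than quadratically in $HW$.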
Adaptive Feature Fusion Module (AFM)
After enhancement via the FEM, the deep feature $y$ (from a higher pyramid level) is upsampled and added to the shallow feature $x$ (from the current level) to get a preliminary fused feature $z = x + \text{Upsample}(y)$. The standard FPN stops here. Our AFM takes this a step further by performing an adaptive, attention-guided fusion to recalibrate $z$.
The AFM leverages Polarized Self-Attention (PSA) to generate separate channel and spatial attention weights. The process can be summarized as:
$$ x_{out} = M_c + M_s $$
where $M_c = A_{ch} \odot \text{Upsample}(y)$ is the channel-weighted deep feature, $M_s = A_{sp} \odot x$ is the spatially weighted shallow feature, and $\odot$ denotes (broadcast) element-wise multiplication.
1. Channel Attention Weight ($A_{ch}$): This branch focuses on “what” is important. The feature $z$ is transformed via two 1×1 convolutions to produce queries $Q_c$ and values $V_c$. After reshaping, a channel attention map is computed by matrix multiplication followed by a softmax operation along the spatial dimension. This map is then convolved and activated by a sigmoid function to produce the final channel weight $A_{ch} \in \mathbb{R}^{1 \times 1 \times C}$.
$$ A_{ch} = \sigma(f_{1\times1}(\delta_{sm}(\delta_1(f_{1\times1}(z))^T \otimes \delta_2(f_{1\times1}(z))))) $$
where $\delta_1$ and $\delta_2$ denote reshape operations, $\delta_{sm}$ the softmax, $\sigma$ the sigmoid, and $\otimes$ matrix multiplication.
$$ M_c = A_{ch} \odot \text{Upsample}(y) $$
This operation allows the model to emphasize feature channels that are most relevant to defects in solar panels.
2. Spatial Attention Weight ($A_{sp}$): This branch focuses on “where” the important regions are. Similarly, $z$ is transformed to $Q_s$ and $V_s$. $Q_s$ undergoes Global Average Pooling (GAP) and is reshaped. A spatial attention map is computed via matrix multiplication and softmax, then reshaped back and activated by sigmoid to produce $A_{sp} \in \mathbb{R}^{H \times W \times 1}$.
$$ A_{sp} = \sigma(\delta_3(\delta_{sm}(f_{GAP}(f_{1\times1}(z))) \otimes \delta_2(f_{1\times1}(z)))) $$
$$ M_s = A_{sp} \odot x $$
This highlights spatial locations most likely to contain anomalies on the solar panels, suppressing background clutter like grass or sky.
The final output $x_{out}$ is the sum of the channel-weighted deep feature and the spatial-weighted shallow feature. This ensures that rich semantic information from deep layers and precise location details from shallow layers are both preserved and emphasized, leading to more accurate detection of multi-scale faults in solar panels.
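A simplified PyTorch sketch of the AFM, following the polarized self-attention pattern the two branches describe. The internal $C/2$ channel reduction and the exact reshapes are our assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFM(nn.Module):
    """Sketch of the Adaptive Feature Fusion Module: channel and spatial
    attention weights computed from the preliminary fusion z = x + Up(y),
    then applied to the deep and shallow features respectively."""
    def __init__(self, c):
        super().__init__()
        h = c // 2
        # channel branch (A_ch)
        self.cq = nn.Conv2d(c, 1, 1)
        self.cv = nn.Conv2d(c, h, 1)
        self.cup = nn.Conv2d(h, c, 1)
        # spatial branch (A_sp)
        self.sq = nn.Conv2d(c, h, 1)
        self.sv = nn.Conv2d(c, h, 1)

    def forward(self, x, y_up):
        z = x + y_up                            # preliminary fusion
        B, C, H, W = z.shape
        # channel attention: "what" -- softmax over the spatial dimension
        q = F.softmax(self.cq(z).view(B, 1, H * W), dim=-1)      # (B, 1, HW)
        v = self.cv(z).view(B, -1, H * W)                        # (B, C/2, HW)
        a_ch = torch.sigmoid(
            self.cup((v @ q.transpose(1, 2)).unsqueeze(-1)))     # (B, C, 1, 1)
        # spatial attention: "where" -- GAP on queries, softmax over channels
        q = F.softmax(self.sq(z).mean(dim=(2, 3)), dim=-1).unsqueeze(1)  # (B, 1, C/2)
        v = self.sv(z).view(B, -1, H * W)                        # (B, C/2, HW)
        a_sp = torch.sigmoid((q @ v).view(B, 1, H, W))           # (B, 1, H, W)
        # x_out = M_c + M_s
        return a_ch * y_up + a_sp * x

afm = AFM(256)
out = afm(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
print(tuple(out.shape))  # (1, 256, 64, 64)
```

The sigmoid keeps both weight maps in $(0, 1)$, so each branch reweights its feature without ever zeroing it out entirely.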
Experimental Results and Analysis
To validate the effectiveness of our proposed method for inspecting solar panels, we conducted comprehensive experiments on a real-world infrared dataset collected from operational photovoltaic farms via drone. The dataset contains 2,110 images with complex backgrounds and includes four common defect types in solar panels: cell faults, diode faults, shading, and other anomalies. The dataset was split into 1,688 images for training and 422 for validation. We used the standard object-detection evaluation metrics: mean Average Precision (mAP) at Intersection over Union (IoU) thresholds of 0.5 ($AP_{50}$) and 0.75 ($AP_{75}$), and the average mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05 ($AP_{50:95}$). We also report model complexity in terms of parameters (Params), computational load (GFLOPs), and inference speed (FPS).
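The IoU criterion underlying all three metrics, and the threshold sweep behind $AP_{50:95}$, can be illustrated in a few lines of plain Python (box coordinates in the common (x1, y1, x2, y2) convention):

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2) --
    the overlap criterion behind AP_50, AP_75, and AP_50:95."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# AP_50:95 averages AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95
thresholds = [0.5 + 0.05 * i for i in range(10)]

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # two half-overlapping boxes: ~0.333
```

A detection counting as a true positive at $AP_{50}$ (IoU ≥ 0.5) can thus fail at $AP_{75}$, which is why the stricter thresholds reward precise localization of small defects.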
Ablation Study
We performed an ablation study to dissect the contribution of each proposed component. The baseline is Faster R-CNN with a ResNet-50 backbone. The results are summarized in the table below.
| Backbone | AFM | FEM | $AP_{50}$ (%) | $AP_{75}$ (%) | $AP_{50:95}$ (%) | Params (M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|---|
| RN50 | – | – | 65.5 | 23.8 | 30.5 | 40.0 | 116.33 | 62.4 |
| RN50-FPN | – | – | 72.2 | 33.6 | 37.5 | 41.4 | 157.65 | 44.8 |
| RN50-FPN | ✓ | – | 74.5 | 35.8 | 38.9 | 41.8 | 164.06 | 41.7 |
| RN50-FPN | – | ✓ | 75.1 | 36.7 | 38.2 | 49.50 | 292.58 | 28.0 |
| RN50-FPN (MAFPN) | ✓ | ✓ | 76.2 | 39.7 | 41.3 | 49.63 | 299.30 | 26.6 |
The results clearly demonstrate the progressive improvement brought by each module. Adding the standard FPN to the baseline provided a significant boost (+6.7% in $AP_{50}$), confirming the importance of multi-scale feature fusion for detecting defects in solar panels. Incorporating the AFM module alone further improved $AP_{50}$ by 2.3%, showing its efficacy in performing intelligent, attention-guided fusion to suppress background noise. The FEM module alone provided the largest single-module gain in $AP_{50}$ (+2.9%), underscoring the critical role of enhancing feature context and detail before fusion. Finally, the complete MAFPN, integrating both FEM and AFM, achieved the best performance across all mAP metrics, with an $AP_{50}$ of 76.2%. This represents a substantial 4.0% improvement over the standard FPN, validating the synergistic effect of our proposed modules. The increase in GFLOPs and the drop in FPS are expected trade-offs for the gained accuracy, yet the inference speed remains practical for automated inspection of solar panels.
Comparison with State-of-the-Art Methods
We compared our full MAFPN-based model against a range of popular one-stage and two-stage object detectors on the same solar panel defect dataset. The results are presented below.
| Model | Backbone | $AP_{50}$ (%) | $AP_{75}$ (%) | $AP_{50:95}$ (%) | Params (M) | FPS |
|---|---|---|---|---|---|---|
| YOLOv3-spp | Darknet-53 | 52.5 | 16.0 | 22.9 | 62.6 | 92 |
| Sparse R-CNN | R50-FPN | 53.7 | 29.6 | 30.8 | 105.95 | 41.3 |
| SSD | RN50 | 54.6 | 21.5 | 26.4 | 23.75 | 16 |
| YOLOX | Darknet-53 | 66.9 | 25.9 | 33.7 | 8.94 | 89 |
| RetinaNet | R50-FPN | 67.6 | 30.3 | 34.8 | 32.24 | 40 |
| Cascade R-CNN | R101-FPN | 71.1 | 39.5 | 39.4 | 87.93 | 38.5 |
| YOLOv5s | CSPDarknet | 72.2 | – | 38.6 | 7.02 | 101 |
| Our Model (MAFPN) | R50-MAFPN | 76.2 | 39.7 | 41.3 | 49.63 | 26.6 |
Our method achieves the highest $AP_{50}$ and $AP_{50:95}$ among all compared models. It outperforms the efficient one-stage detector YOLOv5s by 4.0% in $AP_{50}$ and 2.7% in $AP_{50:95}$, and surpasses the powerful two-stage Cascade R-CNN (with a larger backbone) by 5.1% in $AP_{50}$ and 1.9% in $AP_{50:95}$. This demonstrates that the architectural innovations in MAFPN are highly effective for the specific challenges posed by infrared-based defect detection in solar panels, offering a better balance between accuracy and model complexity than simply using a larger network or a different detection paradigm.
Conclusion
In this work, we have addressed the critical and challenging problem of automatically detecting defects in solar panels from infrared aerial imagery. The primary obstacles—complex backgrounds, low contrast, and large variation in defect scale—were tackled through a novel neural network design. We proposed the Multi-scale Adaptive Fusion Feature Pyramid Network (MAFPN), which integrates a Feature Enhancement Module (FEM) and an Adaptive Feature Fusion Module (AFM). The FEM strengthens the feature maps by capturing multi-scale context and long-range dependencies, making the network more sensitive to the diverse signatures of faults in solar panels. The AFM refines the feature fusion process itself, using an attention mechanism to selectively emphasize informative channels and spatial regions while suppressing irrelevant background noise.
Extensive experimental evaluations on a real-world dataset confirm the effectiveness of our approach. The ablation study proves the individual and combined value of the proposed modules, and the comparison with state-of-the-art detectors shows that our method achieves superior accuracy. While the model incurs a moderate increase in computational cost, the achieved gain in detection precision is significant for ensuring the reliability and safety of photovoltaic systems. This work provides a robust and effective solution for the automated inspection of solar panels, contributing to the maintenance and optimization of large-scale solar energy installations.
