With the continuous growth in global energy demand and the increasing focus on environmental sustainability, photovoltaic power generation, as a crucial source of clean energy, is steadily becoming a key technology in the global energy transition. As the core component of photovoltaic power generation systems, solar panels must operate stably and reliably for the entire system to remain efficient and safe. However, during long-term operation, solar panels are exposed to external factors such as climate change, dust accumulation, humidity, temperature fluctuations, and mechanical stress, which can lead to issues such as soiling, physical damage, loose connectors, and cell aging. These defects not only shorten the lifespan of the panels but can also disrupt local power supply, leading to significant economic losses. Regular inspection of solar panels is therefore essential to ensure the stability and efficiency of photovoltaic power generation systems.
Traditional inspection of solar panels primarily relies on manual patrols and visual checks, which present several shortcomings. Firstly, manual inspection requires substantial human resources and time, making it impractical for large-scale solar farms. Secondly, the results are prone to human error and subjective bias, leading to uncertainties and inconsistencies in defect identification. In contrast, Unmanned Aerial Vehicle (UAV) based inspection offers significant advantages, including high speed, efficiency, precision, safety, and cost-effectiveness. Integrating UAV technology with advanced computer vision and deep learning algorithms for target detection presents a powerful solution to enhance the quality and efficiency of solar panel inspection.
Current mainstream target detection algorithms can be broadly categorized into two-stage detectors, such as the Faster R-CNN series, and single-stage detectors such as SSD and the YOLO family. While Faster R-CNN offers high accuracy and stability through its Region Proposal Network (RPN), its computational demands are substantial, resulting in inference speeds too slow for real-time applications. SSD, as a single-stage detector, generates multiple default anchor boxes at every feature-map location, which often leads to heavily overlapping boxes and redundant computation. YOLOv5, renowned for its speed and efficiency, adopts an end-to-end, grid-based anchor design for rapid detection, making it well suited to real-time scenarios. However, its detection accuracy can be insufficient for the small and complex targets frequently encountered in aerial imagery of solar panels. To address these limitations, this work proposes a comprehensive and efficient method for solar panel inspection that combines deep learning, computer vision, and UAV technology. The core innovations are a refined detection head for small targets and the integration of attention mechanisms to broaden the network's receptive field, thereby enhancing feature representation, improving model interpretability, and ultimately enabling comprehensive, regular defect detection and fault diagnosis for solar panels.
1. UAV-Based Inspection System for Solar Panels
The UAV inspection system for solar panels consists of three main components: a UAV equipped with a high-resolution camera, an onboard AI computing terminal, and a backend visualization platform. The inspection workflow is illustrated in the conceptual diagram.
On-site at the photovoltaic plant, the UAV, carrying a high-resolution camera, autonomously navigates and captures images of the solar panel arrays. These images are subsequently transmitted in real-time to the onboard AI computing terminal. Here, a pre-trained deep learning model performs rapid and accurate defect identification and fault detection. This integration of UAV technology, high-resolution imaging, and deep learning algorithms enables effective, automated inspection of solar panels, allowing for timely maintenance actions. The detection results, along with relevant metadata, are transmitted via a 4G/5G network to a cloud-based visualization platform. This platform enables online monitoring and remote observation by technicians, facilitating data analysis and report generation. This approach drastically improves inspection efficiency by eliminating the need for manual, panel-by-panel checks. Furthermore, the use of high-resolution cameras ensures the precise capture of subtle defects and faults on the solar panels, guaranteeing the optimal performance and longevity of the photovoltaic system.
2. Target Detection Algorithm with Attention Mechanism
The overall network architecture of the proposed attention mechanism-based target detection algorithm is composed of three parts: a Backbone network, a Neck network, and a Detection Head, responsible for shared feature extraction, feature fusion, and target detection, respectively.
Extracting shared features serves multiple purposes: it reduces storage and computational costs by representing the original data with less information; it captures the core information of the data while ignoring unimportant variations or noise; and it aids in better model generalization to unseen data by focusing on essential patterns rather than sample-specific peculiarities. To mitigate the negative impact of scale variation in targets, the backbone network extracts multi-scale features at four different resolutions (e.g., 20×20, 40×40, 80×80, 160×160) and feeds them into the Neck network. This multi-scale approach allows for the fusion of edge and texture information from shallow layers with semantic information from deeper layers, achieving a more comprehensive extraction of shared features.
To fully fuse the extracted shared features, the Neck network employs a combined architecture of Attention mechanisms, Feature Pyramid Network (FPN), and Path Aggregation Network (PAN). The attention mechanism enables the model to focus on the most critical features for the specific task, thereby improving performance. The FPN captures multi-scale information from the image, which is particularly effective for detecting objects of varying sizes. The PAN enhances the flow of information between features at different scales, facilitating interaction between detailed spatial information and high-level semantic context.
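To make the direction of information flow concrete, the following is a minimal PyTorch sketch of top-down (FPN) plus bottom-up (PAN) fusion over two adjacent scales. It omits the C2f, GAM, and Trans blocks listed later, and the class name, layer layout, and channel widths are illustrative assumptions, not the exact Neck used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNPANNeck(nn.Module):
    """Simplified FPN (top-down) + PAN (bottom-up) fusion over two adjacent scales."""
    def __init__(self, c_high=1024, c_low=512):
        super().__init__()
        self.lateral = nn.Conv2d(c_high, c_low, 1)                    # shrink deep features
        self.fuse_td = nn.Conv2d(2 * c_low, c_low, 3, padding=1)      # fuse after top-down concat
        self.down = nn.Conv2d(c_low, c_low, 3, stride=2, padding=1)   # bottom-up downsampling
        self.fuse_bu = nn.Conv2d(2 * c_low, c_low, 3, padding=1)      # fuse after bottom-up concat

    def forward(self, p_low, p_high):
        # Top-down (FPN): upsample deep semantics and merge with shallow spatial detail.
        td = torch.cat([F.interpolate(self.lateral(p_high), scale_factor=2), p_low], dim=1)
        p_low_out = self.fuse_td(td)
        # Bottom-up (PAN): push the refined shallow features back toward the deep scale.
        bu = torch.cat([self.down(p_low_out), self.lateral(p_high)], dim=1)
        p_high_out = self.fuse_bu(bu)
        return p_low_out, p_high_out

# Example: fuse the 40x40x512 and 20x20x1024 backbone outputs.
p_low, p_high = torch.randn(1, 512, 40, 40), torch.randn(1, 1024, 20, 20)
out_low, out_high = FPNPANNeck()(p_low, p_high)
```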
To accurately identify small targets captured during UAV flight, a dedicated detection head module for small objects is designed. This module corresponds to four different receptive fields (associated with the 20×20, 40×40, 80×80, 160×160 feature maps), ensuring robust detection of tiny targets and effectively alleviating the adverse effects caused by significant scale variations.
2.1 Backbone Network
The design of the backbone network is based on the CSPDarknet53 framework from YOLOv5, incorporating the Cross Stage Partial (CSP) concept for efficient gradient flow. It is enhanced with the C2f module (a faster CSP bottleneck built around two convolutions) and an attention mechanism module. The C2f module introduces additional branches and residual connections. It employs a split operation to divide the feature map into multiple sub-feature maps, each containing a subset of the channels. By capturing correlations between channel features, it allows for collaborative processing across channels. This design ensures that the main branch extracts low-level, high-resolution features while the additional branches extract higher-level semantic features, leading to more accurate and robust detection.
The operation of the C2f module can be summarized as follows. Let $F_{in}$ be the input feature map. The module first applies a convolution to adjust the channel count, then splits the result into two parts. One part passes through $n$ bottleneck blocks (each containing convolutions and a residual connection), while the other is kept as an identity path. Finally, the branches are concatenated and fused by another convolution. This process enhances feature diversity and gradient flow.
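For concreteness, the following is a minimal PyTorch sketch of a C2f-style block following this description. The `ConvBNAct`/`Bottleneck` helpers, the depth `n`, and the choice to keep every intermediate bottleneck output (as in common C2f implementations) are assumptions, not the exact module used in this work.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Convolution + BatchNorm + SiLU, the basic unit assumed throughout."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with a residual connection."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNAct(c, c, 3)
        self.cv2 = ConvBNAct(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C2f(nn.Module):
    """C2f-style block: split the channels, pass one part through n bottlenecks
    (keeping each intermediate output), then concatenate all branches and fuse
    them with a final 1x1 convolution."""
    def __init__(self, c_in, c_out, n=3):
        super().__init__()
        self.c_hidden = c_out // 2
        self.cv1 = ConvBNAct(c_in, 2 * self.c_hidden, 1)         # adjust channels before the split
        self.blocks = nn.ModuleList(Bottleneck(self.c_hidden) for _ in range(n))
        self.cv2 = ConvBNAct((n + 2) * self.c_hidden, c_out, 1)  # fuse all branches

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # identity part + processed part
        for block in self.blocks:
            y.append(block(y[-1]))              # keep every intermediate feature map
        return self.cv2(torch.cat(y, dim=1))

# Example: the index-2 backbone stage operating on the 160x160x128 feature map.
out = C2f(128, 128, n=3)(torch.randn(1, 128, 160, 160))
```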
Furthermore, attention encoding modules (Trans) are incorporated at the end of the backbone network and the beginning of the detection head. Each Trans module contains two sub-layers: a Multi-Head Attention layer and a Multi-Layer Perceptron (MLP) feed-forward layer, with residual connections around each. The multi-head attention layer assigns weights to each feature based on its importance, allowing the network to focus on crucial features. This is particularly beneficial for processing intricate details in complex scenes, thereby improving the accuracy and robustness of solar panel defect detection. The multi-head attention operation for a single head is calculated as:
$$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$
where $Q$, $K$, $V$ are the Query, Key, and Value matrices derived from the input features, and $d_k$ is the dimensionality of the key vectors.
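As a minimal illustration of this computation (not the full Trans module, which adds learned per-head projections, an MLP sub-layer, and residual connections), scaled dot-product attention over flattened feature-map tokens can be sketched as:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head.

    q, k, v: (batch, tokens, d_k) tensors, where the tokens are the flattened
    spatial positions of a feature map (e.g. 20 * 20 = 400 at the deepest scale).
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, tokens, tokens) similarity scores
    weights = F.softmax(scores, dim=-1)             # attention weights for each query token
    return weights @ v                              # weighted sum of value vectors

# Self-attention over one head's 128-dim slice of a flattened 20x20 feature map
# (identity projections for brevity; real heads use learned Q/K/V projections).
x = torch.randn(1, 400, 128)
out = scaled_dot_product_attention(x, x, x)
```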
The composition of the backbone network modules, including the number of repeats, output dimensions, kernel size, and stride, is detailed in the table below. The C2f modules at indices 2, 4, and 6 and the final Trans module at index 9 produce the four-scale features (160×160, 80×80, 40×40, and 20×20), which are passed to the Neck network for fusion.
| Index | Module | Repeats | Output (W×H×C) | Kernel | Stride |
|---|---|---|---|---|---|
| 0 | Focus | 1 | 320×320×64 | – | – |
| 1 | Conv | 1 | 160×160×128 | 3 | 2 |
| 2 | C2f | 3 | 160×160×128 | – | – |
| 3 | Conv | 1 | 80×80×256 | 3 | 2 |
| 4 | C2f | 6 | 80×80×256 | – | – |
| 5 | Conv | 1 | 40×40×512 | 3 | 2 |
| 6 | C2f | 3 | 40×40×512 | – | – |
| 7 | Conv | 1 | 20×20×1024 | 3 | 2 |
| 8 | SPP | 1 | 20×20×1024 | 5,9,13 | 1 |
| 9 | Trans | 3 | 20×20×1024 | – | – |
2.2 Neck Network
The Neck network adopts an Attention + FPN + PAN structure. It facilitates multi-level feature fusion and context information extraction through top-down feature propagation (FPN) and shortens the information path between low-level and top-level features via bottom-up connections (PAN). A Global Attention Mechanism (GAM) module is embedded at the bottom of each layer. The GAM module amplifies global cross-dimensional channel-spatial interactions, enabling deep fusion of important object features captured at different levels and scales. As a lightweight module, GAM also helps reduce redundant features and optimizes computational efficiency.
The GAM module consists of two sequential sub-modules: Channel Attention and Spatial Attention. In the channel attention sub-module, information across channel, width, and height dimensions is first rearranged. A two-layer MLP is then used to amplify cross-channel dependencies, followed by a transformation back to the original dimensions and a Sigmoid activation to produce the channel attention map. In the spatial attention sub-module, two convolutional layers (with a kernel size of 7) are used for spatial information fusion—first to reduce channels for computational efficiency, then to restore the channel count—followed by a Sigmoid activation to produce the spatial attention map.
The input-output relationship for the GAM module is given by:
$$F_2 = M_c(F_1) \otimes F_1$$
$$F_3 = M_s(F_2) \otimes F_2$$
where $F_1$ is the input feature map, $F_2$ and $F_3$ are intermediate and output feature maps, $M_c$ and $M_s$ are the channel and spatial attention maps respectively, and $\otimes$ denotes element-wise multiplication.
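A hedged PyTorch sketch of a GAM-style block following this description is shown below; the reduction ratio `r` and the use of BatchNorm in the spatial branch are assumptions rather than details taken from the original module.

```python
import torch
import torch.nn as nn

class GAM(nn.Module):
    """GAM-style block: MLP channel attention over permuted features,
    then 7x7 convolutional spatial attention, each gated by a sigmoid."""
    def __init__(self, channels, r=4):
        super().__init__()
        hidden = channels // r
        # Channel attention: a two-layer MLP applied along the channel dimension.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )
        # Spatial attention: reduce channels, then restore them, with 7x7 kernels.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=7, padding=3),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, f1):
        # M_c: rearrange to (B, H, W, C), apply the MLP, rearrange back, sigmoid gate.
        mc = self.channel_mlp(f1.permute(0, 2, 3, 1)).permute(0, 3, 1, 2).sigmoid()
        f2 = mc * f1                      # F2 = M_c(F1) ⊗ F1
        # M_s: convolutional spatial attention map with a sigmoid gate.
        ms = self.spatial(f2).sigmoid()
        return ms * f2                    # F3 = M_s(F2) ⊗ F2

# Example: apply GAM to the fused 40x40x512 Neck feature map.
out = GAM(512)(torch.randn(1, 512, 40, 40))
```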
The structure of the Neck network, showing how features from the backbone are integrated and processed through Concat, C2f, GAM, and Trans modules, is summarized in the following table.
| Index | Input From | Module | Repeats | Output (W×H×C) | Kernel | Stride |
|---|---|---|---|---|---|---|
| 10 | 9 | Conv | 1 | 20×20×512 | 1 | 1 |
| 11 | 10 | Upsample | 1 | 40×40×512 | – | – |
| 12 | 11,6 | Concat | 1 | 40×40×1024 | – | – |
| 13 | 12 | C2f | 3 | 40×40×512 | – | – |
| 14 | 13 | GAM | 1 | 40×40×512 | – | – |
| … | … | … | … | … | … | … |
| 23 | 22 | Trans | 1 | 160×160×128 | – | – |
| … | … | … | … | … | … | … |
| 35 | 34 | Trans | 3 | 20×20×1024 | – | – |
2.3 Detection Head Network
To effectively detect small targets in images captured from UAVs, a dedicated prediction head for small objects is added to the detection head. Together with the three original heads, it forms a structure covering four feature scales (the 160×160, 80×80, 40×40, and 20×20 feature maps, with 128, 256, 512, and 1024 channels respectively), which helps mitigate the negative impact of target scale variation. Although this introduces additional computational and storage overhead, it brings a significant improvement in the precision of detecting tiny objects in aerial solar panel inspection.
Considering that the classification task focuses on texture features while the regression (localization) task focuses on edge features, the detection head is designed with a decoupled structure. Separate branches handle classification and regression, allowing each to capture more refined representations tailored to their specific objectives, thereby improving the network’s representational capacity and classification performance.
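A simplified sketch of such a decoupled head for one scale is given below; the channel widths follow the backbone table, but the anchor count and exact layer layout are assumptions rather than the configuration used in this work.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Separate classification and regression branches for one feature scale."""
    def __init__(self, in_channels, num_classes, num_anchors=3):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        # Classification branch: geared toward texture features.
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 1), nn.SiLU(),
            nn.Conv2d(in_channels, num_anchors * num_classes, 1),
        )
        # Regression branch: geared toward edge features; 4 box offsets + 1 objectness score.
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 1), nn.SiLU(),
            nn.Conv2d(in_channels, num_anchors * 5, 1),
        )

    def forward(self, x):
        x = self.stem(x)
        return self.cls_branch(x), self.reg_branch(x)

# One head per scale, matching the 128/256/512/1024-channel feature maps and two defect classes.
heads = nn.ModuleList(DecoupledHead(c, num_classes=2) for c in (128, 256, 512, 1024))
```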
The detection head configuration, showing the output channels corresponding to the four different detection scales, is shown below.
| Index | Module | Repeats | Output Channels | Kernel | Stride |
|---|---|---|---|---|---|
| 36 | Conv | 1 | 128, 256, 512, 1024 | 3 | 1 |
| 37 | Conv | 2 | 128, 256, 512, 1024 | 1 | 1 |
| 38 | Conv | 2 | 128, 256, 512, 1024 | 1 | 1 |
2.4 Loss Function
The loss function for training the model comprises three components: classification loss, bounding box regression loss, and object confidence loss. Both the classification loss and the object confidence loss employ the Binary Cross-Entropy (BCE) loss function, which optimizes the model’s ability to correctly identify and classify objects and their presence. For a single sample, the BCE loss is defined as:
$$L_{cls/conf} = -[y \cdot \log(p) + (1 - y) \cdot \log(1 - p)]$$
where $y$ is the ground truth label (1 for positive, 0 for negative) and $p$ is the predicted probability.
The bounding box regression loss is responsible for optimizing the model’s ability to localize objects accurately. It measures the difference between the predicted bounding box and the ground truth box. This work adopts the Complete Intersection over Union (CIoU) loss, which considers overlap area, center point distance, and aspect ratio. Let $B^{gt} = (x_1^{gt}, y_1^{gt}, x_2^{gt}, y_2^{gt})$ be the ground truth box and $B^{pred} = (x_1^{p}, y_1^{p}, x_2^{p}, y_2^{p})$ be the predicted box. The CIoU is calculated as:
$$CIoU = IoU - \frac{\rho^2(b^{gt}, b^{p})}{c^2} - \alpha v$$
where
$$IoU = \frac{|B^{gt} \cap B^{pred}|}{|B^{gt} \cup B^{pred}|},$$
$\rho^2(b^{gt}, b^{p})$ is the squared Euclidean distance between the centers of the two boxes, $c$ is the diagonal length of the smallest enclosing box covering both $B^{gt}$ and $B^{pred}$, and the aspect-ratio consistency term $v$ and its weight $\alpha$ are defined as
$$v = \frac{4}{\pi^2} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w^{p}}{h^{p}} \right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v}$$
The regression loss is then $L_{reg} = 1 - CIoU$.
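For reference, a self-contained sketch of the CIoU regression loss defined above (boxes given as $(x_1, y_1, x_2, y_2)$ tensors) might look like the following; it is illustrative, not the training code used in the experiments.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU regression loss for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    # Intersection and union areas.
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distance between box centers (rho^2) and enclosing-box diagonal (c^2).
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    enc_w = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    enc_h = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = enc_w ** 2 + enc_h ** 2 + eps

    # Aspect-ratio consistency term v and its weight alpha.
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)

    ciou = iou - rho2 / c2 - alpha * v
    return 1 - ciou   # L_reg = 1 - CIoU, one value per box
```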
The total loss is a weighted sum of these three components:
$$L_{total} = \lambda_1 L_{cls} + \lambda_2 L_{reg} + \lambda_3 L_{conf}$$
where typically $\lambda_1:\lambda_2:\lambda_3$ is set to a ratio like 2:7:1, emphasizing accurate localization.
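Assembled together, the three terms might be combined as in the following sketch; the logits-based BCE and the 2:7:1 weighting are illustrative assumptions, not the exact training objective.

```python
import torch
import torch.nn.functional as F

def total_loss(cls_logits, cls_targets, conf_logits, conf_targets, reg_loss,
               weights=(2.0, 7.0, 1.0)):
    """Weighted sum of BCE classification, CIoU regression, and BCE confidence losses.

    `reg_loss` is a per-box regression loss, e.g. the output of the ciou_loss sketch above.
    """
    w_cls, w_reg, w_conf = weights
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
    l_conf = F.binary_cross_entropy_with_logits(conf_logits, conf_targets)
    return w_cls * l_cls + w_reg * reg_loss.mean() + w_conf * l_conf

# Dummy example: 8 matched boxes, 2 defect classes (dust, damage).
loss = total_loss(torch.randn(8, 2), torch.randint(0, 2, (8, 2)).float(),
                  torch.randn(8, 1), torch.ones(8, 1), torch.rand(8))
```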
3. Experimental Results and Analysis
3.1 Experimental Setup and Parameters
The dataset used in the experiments consists of 2,000 images of solar panel defects (soiling/dust and physical damage) collected by UAVs. The dataset is split into training, validation, and test sets in a 6:2:2 ratio (1,200, 400, and 400 images respectively). This split aims to maintain a balanced data distribution, prevent overfitting, and provide a reliable assessment of the model’s generalization performance. The training set is used for model learning, the validation set for hyperparameter tuning and model selection, and the test set for the final evaluation.
The experimental environment configuration is detailed below.
| Component | Specification |
|---|---|
| Operating System | Ubuntu 18.04 |
| GPU | NVIDIA GeForce RTX 3090, 24GB Memory |
| Memory | 32 GB RAM |
| CUDA Version | 11.3 |
| cuDNN Version | 8.2.1 |
During training, an initial learning rate of 0.01 is used. A learning rate scheduler reduces the rate by a factor of 0.9 every 300 iterations. The batch size is set to 16, and the Stochastic Gradient Descent (SGD) optimizer is employed to update the network weights.
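A minimal sketch of this training configuration (SGD, initial learning rate 0.01, a ×0.9 step decay every 300 iterations, batch size 16) is shown below; the momentum value, input size, and dummy model and loss are placeholders, not the actual training script.

```python
import torch
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

model = nn.Conv2d(3, 16, 3, padding=1)   # placeholder standing in for the detector

optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)  # momentum value is an assumption
scheduler = StepLR(optimizer, step_size=300, gamma=0.9)     # multiply lr by 0.9 every 300 iterations

for step in range(900):                        # stand-in for iterating over the training loader
    images = torch.randn(16, 3, 64, 64)        # batch size 16 (64x64 here; real inputs are larger)
    loss = model(images).abs().mean()          # dummy loss; the real objective is L_total above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                           # per-iteration schedule, as described in the text
```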
3.2 Evaluation Metrics
The experiment employs widely used evaluation metrics in object detection: Precision, Recall, mean Average Precision (mAP), and F1 Score.
Precision measures the accuracy of positive predictions, while Recall measures the ability to find all positive instances:
$$Precision = \frac{TP}{TP + FP} = \frac{TP}{\text{Predicted Positive}}$$
$$Recall = \frac{TP}{TP + FN} = \frac{TP}{\text{Actual Positive}}$$
where $TP$ (True Positive) is the number of defects correctly detected, $FP$ (False Positive) is the number of non-defects incorrectly detected as defects, and $FN$ (False Negative) is the number of actual defects missed by the detector.
The Precision-Recall (P-R) curve is plotted with Recall on the x-axis and Precision on the y-axis. The Average Precision (AP) is the area under this curve for a specific class. A higher AP indicates better performance for that class. The mean Average Precision (mAP) is the average of AP over all classes. For metrics, mAP@0.5 uses an IoU threshold of 0.5, while mAP@0.5:0.95 averages mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
$$AP = \int_{0}^{1} P(r) dr$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where $N$ is the number of classes (e.g., dust, damage).
The F1 Score is the harmonic mean of Precision and Recall, providing a single metric that balances both:
$$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$
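The sketch below computes these metrics from raw detection counts and per-class precision-recall points; it is illustrative and not the evaluation code used in the experiments, and the example recall/precision values are made up.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1 from detection counts at a fixed IoU threshold."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(recalls, precisions):
    """Area under the P-R curve for one class (all-point interpolation)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]    # make precision monotonically non-increasing
    idx = np.where(r[1:] != r[:-1])[0]          # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Example: mAP@0.5 is the mean AP over the defect classes (e.g. dust and damage).
ap_dust = average_precision(np.array([0.2, 0.5, 0.8]), np.array([0.9, 0.85, 0.7]))
ap_damage = average_precision(np.array([0.3, 0.6, 0.9]), np.array([0.95, 0.8, 0.65]))
map50 = (ap_dust + ap_damage) / 2
```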
3.3 Model Comparison and Analysis
For the task of detecting defects like dust and damage on solar panels, the original YOLOv5 network and several of its variants were used as baselines for comparison with the proposed method. The performance results are analyzed and presented in the table below.
| Algorithm | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 | F1 Score |
|---|---|---|---|---|---|
| YOLOv5 (Baseline) | 0.859 | 0.768 | 0.809 | 0.419 | 0.811 |
| + Detection Head | 0.899 | 0.774 | 0.821 | 0.435 | 0.831 |
| + Trans (Attention) | 0.866 | 0.795 | 0.815 | 0.428 | 0.829 |
| + CBAM | 0.877 | 0.789 | 0.817 | 0.430 | 0.830 |
| + GAM | 0.878 | 0.792 | 0.823 | 0.437 | 0.832 |
| Proposed Method (OURs) | 0.891 | 0.803 | 0.832 | 0.441 | 0.845 |
Analysis of the table leads to the following conclusions. Compared to the original YOLOv5, the newly added detection head and the various attention mechanism modules all improve performance to varying degrees, demonstrating that the proposed modules effectively enhance the model. Compared to the Convolutional Block Attention Module (CBAM), the Global Attention Mechanism (GAM) module increases Precision and Recall by 0.001 and 0.003, respectively, and improves mAP@0.5 by about 0.7% (relative). Although adding only the detection head achieves the highest Precision (0.899), the proposed comprehensive method performs best on the other key evaluation metrics. Compared to the best values achieved by the other methods, the proposed method delivers relative improvements of 1.1% in mAP@0.5, 0.9% in mAP@0.5:0.95, and 1.5% in the F1 Score.
The loss curves for the baseline methods and the proposed method plotted against training iterations show that all methods reduce loss as training progresses. However, the proposed method demonstrates the fastest convergence rate and achieves the lowest error loss at comparable stages, indicating superior overall performance and learning efficiency.
To further validate the advantages of the proposed method, a comparison with other mainstream detectors, Faster R-CNN and SSD, was conducted. The results are summarized in the table below.
| Algorithm | mAP@0.5 | mAP@0.5:0.95 | F1 Score | FPS |
|---|---|---|---|---|
| Faster R-CNN | 0.827 | 0.438 | 0.840 | 4 |
| SSD512 | 0.820 | 0.431 | 0.830 | 10 |
| Proposed Method (OURs) | 0.832 | 0.441 | 0.845 | 73 |
The results indicate that, in terms of accuracy, Faster R-CNN is slightly more precise than SSD512. The proposed method achieves mAP@0.5 and mAP@0.5:0.95 scores of 0.832 and 0.441, respectively, relative improvements of 0.6% and 0.7% over Faster R-CNN, further demonstrating its superior detection capability. More strikingly, in terms of detection speed (frames per second, FPS), the proposed attention mechanism-based target detection algorithm reaches 73 FPS on an RTX 3090 GPU, approximately 18 times faster than Faster R-CNN and 7 times faster than SSD512, a decisive advantage for real-time UAV-based inspection of solar panels.
To validate the robustness and generalization ability of the proposed algorithm, in addition to testing on the custom solar panel defect dataset, we further evaluated it on the publicly available COCO2017 dataset. The performance comparison against the baseline YOLOv5m model on the 80-class COCO dataset is shown below.
| Model | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|
| YOLOv5m (Baseline) | 64.1% | 45.4% |
| Proposed Method (OURs) | 66.2% | 46.7% |
The results on COCO2017 show that the proposed method achieves mAP@0.5 and mAP@0.5:0.95 scores of 66.2% and 46.7%, relative improvements of 3.3% and 2.9% over the baseline YOLOv5m. Although the gains are less pronounced than on the specialized solar panel dataset (owing to the greater diversity and complexity of the 80-class COCO dataset), this experiment demonstrates the strong generalization capability of the proposed method and its consistent improvement in detection performance across domains. Visual comparisons further indicate advantages in detecting small objects and in reducing missed detections.
4. Conclusion
This work presents an attention mechanism-based target detection algorithm designed for UAV-based inspection of solar panels, targeting defects such as dust accumulation and physical damage. The algorithm is deployable on edge devices such as onboard AI computing terminals, enabling real-time use in UAV inspection tasks and offering high practical value. The proposed method incorporates four detection heads to cover targets of widely varying sizes, and integrates a global attention mechanism and multi-head self-attention within the feature extraction and fusion stages, effectively combining shallow spatial features with deep semantic information. Experimental comparison shows that the proposed method achieves an mAP@0.5 of 83.2% and an F1 score of 84.5% for detecting dust and damage on solar panels, outperforming YOLOv5 and the other improved detection methods and confirming its stronger feature representation capability and higher accuracy. Future research will explore multi-spectral fusion combining infrared and visible-light imagery to further enhance detection performance and robustness under various environmental conditions.

