In the context of the global push toward carbon peak and carbon neutrality goals, the transformation of energy systems has entered a substantive phase. The photovoltaic industry, as a key pillar of the low-carbon energy transition, has experienced rapid development in recent years. As of mid-2024, the total installed capacity of photovoltaic power generation in China has exceeded 712.93 GW. However, photovoltaic modules are often deployed in harsh environments, making them susceptible to defects such as cracks, dirt spots, and broken grids, which severely impact power generation efficiency and system stability. Accurate and efficient defect detection in solar panels is therefore crucial: it eliminates safety hazards, optimizes system performance, improves power generation efficiency, and reduces operation and maintenance costs.
Traditional methods for photovoltaic module defect detection, such as K-means clustering and SVM-based multi-feature fusion, have made progress in specific scenarios but generally suffer from high computational complexity and insufficient scene adaptability, making it difficult to meet the demands of large-scale inspection. In recent years, deep learning-based techniques, particularly convolutional neural networks (CNNs), have shown significant advancements in photovoltaic panel defect detection. These methods can be broadly categorized into two-stage and single-stage detectors. Two-stage detectors, like the R-CNN series, offer high accuracy but are computationally intensive and challenging to deploy on embedded devices. Single-stage detectors, such as SSD and the YOLO series, provide a better balance between accuracy and real-time performance through end-to-end training optimization, making them more suitable for real-time inspection scenarios.
YOLO11 is among the most efficient object detection algorithms currently available, known for its speed and accuracy across various scenarios. Its lightweight variant, YOLO11n, is particularly suitable for real-time detection on UAV terminals. However, there has been limited research on adapting YOLO11n specifically for solar panel defect detection. Existing improvements on other lightweight models, such as modifications to YOLOv8n or YOLOv5s, have shown that lightweight algorithms can effectively enhance both detection accuracy and real-time performance on embedded systems. Therefore, to address the insufficient accuracy and suboptimal real-time performance of existing methods in detecting small-target defects within infrared images of photovoltaic panels captured by unmanned aerial vehicles (UAVs), this study proposes an improved YOLO11n-based object detection algorithm named HBGF-YOLO.
The proposed HBGF-YOLO algorithm incorporates several key innovations to overcome the limitations of the baseline model. Firstly, an efficient backbone network, Rep-HGNetV2, is designed by integrating hierarchical gradient feature extraction with re-parameterized convolution techniques. This enhances feature representation while reducing model complexity. Secondly, a collaborative architecture combining a Bidirectional Feature Pyramid Network (BiFPN) and a Global-Local Self-Attention (GLSA) mechanism is constructed to effectively capture global thermal distribution patterns and local detailed features. Finally, a Feature Enhancement Fusion Module (FEFM) is adopted, employing a dynamic weight adaptation mechanism to achieve deep cross-level feature fusion and strengthen semantic information transmission in minute defect regions. These improvements aim to enhance the detection precision for small targets while compressing model parameters to boost inference efficiency, effectively meeting the real-time detection requirements of UAV platforms.
In this paper, we present a comprehensive study on the development and evaluation of the HBGF-YOLO algorithm. We begin by discussing the challenges in UAV-based infrared inspection of solar panels and reviewing related work. Then, we detail the architectural improvements of HBGF-YOLO, supported by mathematical formulations and comparative tables. Subsequently, we describe the experimental setup, including a custom dataset of infrared images for photovoltaic panel defects, and present extensive experimental results. These results include ablation studies, comparisons with state-of-the-art models, and validation on a public dataset to demonstrate generalizability. The paper concludes with a summary of contributions and potential future work.

Related Work
Object detection for photovoltaic panel inspection has evolved significantly with the advent of deep learning. Two-stage detectors, such as Faster R-CNN, have been applied for hotspot detection with high accuracy, often enhanced through transfer learning and network optimization. For instance, some studies achieved a hotspot detection accuracy of 97.34% using improved Faster R-CNN models. Further enhancements involved integrating BiFPN and attention mechanisms to boost detection efficacy. However, these models typically have large parameter counts and high computational complexity, limiting their deployment on resource-constrained devices like UAVs.
Single-stage detectors offer a more efficient alternative. The SSD algorithm has been improved with deep residual structures and feature pyramid networks to enhance detection capability. The YOLO series, particularly from YOLOv3 to the latest YOLO11, has been widely adopted due to its speed-accuracy trade-off. For solar panel defect detection, various modifications have been proposed. For example, improved YOLOv8 algorithms through structural optimization and loss function adjustments have shown increased accuracy. Lightweight versions like YOLOv7-tiny and YOLOv5n have been optimized for real-time performance, with techniques such as depthwise separable convolutions, attention modules, and small anchor strategies to better detect small targets in complex infrared backgrounds. Recent work on YOLOv8n introduced small-target detection layers, large separable kernel attention (LSKA), and re-parameterized generalized feature pyramid networks (RepGFPN) to improve defect detection. Similarly, YOLOv5s-based methods incorporated triplet attention mechanisms and bidirectional feature pyramid networks for better precision and speed.
These studies indicate that deep learning algorithms excel in detecting defects in complex backgrounds. Lightweight models, in particular, are crucial for UAV-based real-time inspection. However, YOLO11n has not been extensively tailored for photovoltaic panel defect detection. Our work builds upon YOLO11n, introducing a hybrid backbone with re-parameterization, advanced feature fusion, and attention mechanisms to address the specific challenges of small-target defect detection in infrared images captured by UAVs.
Methodology
The overall architecture of the proposed HBGF-YOLO algorithm is illustrated in a network structure diagram. It retains the basic framework of YOLO11n but replaces key components to enhance performance. The main improvements include: 1) Replacing the original backbone with Rep-HGNetV2 for efficient feature extraction; 2) Incorporating BiFPN for multi-scale feature fusion; 3) Integrating GLSA for global-local attention; and 4) Using FEFM for feature enhancement fusion. Below, we detail each component.
Efficient Backbone Network: Rep-HGNetV2
To address computational constraints and real-time detection latency on UAV-embedded devices, we adopt HGNetV2 as the backbone network, originally used in RT-DETR models. HGNetV2 employs a lightweight design with hierarchical gradient feature extraction, significantly reducing parameters and computational complexity. It consists of an HGStem preprocessing layer and HGBlock data processing modules. The HGStem layer uses efficient convolution structures for initial feature extraction with low parameters. The HGBlock module processes data hierarchically, capturing features at different levels to enhance sensitivity to defect details.
Moreover, depthwise separable convolution (DWConv) is utilized in HGNetV2. Compared to standard convolution, DWConv offers advantages in parameter count and computational load. Let the input feature map have height $H$, width $W$, input channels $C_{in}$, kernel size $K \times K$, and output channels $C_{out}$. The parameter count $P$ and computational load $F$ for standard convolution are:
$$P = C_{in} \cdot C_{out} \cdot K \cdot K$$
$$F = C_{in} \cdot C_{out} \cdot W \cdot H \cdot K \cdot K$$
For DWConv, the parameter count $P_{DW}$ and computational load $F_{DW}$ are:
$$P_{DW} = C_{in} \cdot K \cdot K + C_{in} \cdot C_{out}$$
$$F_{DW} = C_{in} \cdot W \cdot H \cdot K \cdot K + C_{in} \cdot C_{out} \cdot W \cdot H$$
From this, we derive the ratio:
$$\frac{P_{DW}}{P} = \frac{F_{DW}}{F} = \frac{1}{C_{out}} + \frac{1}{K^2}$$
This shows that DWConv’s parameters and computations are only $\frac{1}{C_{out}} + \frac{1}{K^2}$ of standard convolution, contributing to HGNetV2’s lightweight nature.
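To make the savings concrete, the two cost formulas above can be checked numerically. The sketch below uses illustrative values (64 input channels, 128 output channels, a 3×3 kernel, an 80×80 feature map) that are not taken from the paper:

```python
def conv_cost(c_in, c_out, k, h, w):
    """Parameters and FLOPs (multiply-accumulates) of a standard convolution."""
    params = c_in * c_out * k * k
    flops = params * h * w
    return params, flops

def dwconv_cost(c_in, c_out, k, h, w):
    """Depthwise separable convolution: depthwise k*k conv + pointwise 1*1 conv."""
    params = c_in * k * k + c_in * c_out
    flops = c_in * k * k * h * w + c_in * c_out * h * w
    return params, flops

# Example: 64 -> 128 channels, 3x3 kernel, 80x80 feature map (illustrative only)
p_std, f_std = conv_cost(64, 128, 3, 80, 80)
p_dw, f_dw = dwconv_cost(64, 128, 3, 80, 80)

# Both ratios match the closed form 1/C_out + 1/K^2
assert abs(p_dw / p_std - (1 / 128 + 1 / 9)) < 1e-12
assert abs(f_dw / f_std - (1 / 128 + 1 / 9)) < 1e-12
```

For this configuration the depthwise separable variant needs roughly 12% of the parameters and computation of the standard convolution.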
However, initial experiments showed that using HGNetV2 alone reduced parameters but also slightly decreased detection accuracy. To compensate, we introduce re-parameterized convolution (RepConv). During training, RepConv fuses multiple branches: a $3 \times 3$ convolution for local details, a $1 \times 1$ convolution for channel interaction, and a BN branch to preserve feature distribution. During inference, these branches are merged into a single $3 \times 3$ kernel (with $1 \times 1$ convolution zero-padded and BN parameters integrated), maintaining the parameter count of standard convolution while retaining the multi-branch training benefits. Integrating RepConv into HGBlock results in Rep-HGBlock, which enhances fine-grained feature extraction for complex targets without increasing computational load during inference. The structure of RepConv and Rep-HGBlock is depicted in diagrams.
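The branch-merging idea behind RepConv follows from the linearity of convolution: a $1 \times 1$ kernel zero-padded into the centre of a $3 \times 3$ kernel produces the same output, so the branches can be summed in kernel space. The following minimal sketch (naive NumPy convolution, BN branch omitted for brevity; not the paper's implementation) verifies this equivalence:

```python
import numpy as np

def conv2d(x, w):
    """Naive 'same' convolution: x (C_in,H,W), w (C_out,C_in,k,k), zero padding."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for co in range(c_out):
        for i in range(h):
            for j in range(wd):
                out[co, i, j] = np.sum(xp[:, i:i + k, j:j + k] * w[co])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w3 = rng.standard_normal((6, 4, 3, 3))   # 3x3 branch (local details)
w1 = rng.standard_normal((6, 4, 1, 1))   # 1x1 branch (channel interaction)

# Training-time output: the two branches are computed separately and summed
y_train = conv2d(x, w3) + conv2d(x, w1)

# Inference-time merge: zero-pad the 1x1 kernel into the centre of the 3x3 kernel
w_merged = w3.copy()
w_merged[:, :, 1:2, 1:2] += w1

y_infer = conv2d(x, w_merged)
assert np.allclose(y_train, y_infer)  # single-kernel inference matches training
```

Folding the BN branch works the same way: its scale and shift are absorbed into the merged kernel's weights and bias before deployment.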
Bidirectional Feature Pyramid Network (BiFPN)
Photovoltaic panel defects exhibit complex and multi-scale features. The original model’s feature fusion method is relatively simple, struggling to integrate features from different depth levels, which limits detection accuracy by failing to balance overall panel features and local thermal anomaly details. Therefore, we adopt the BiFPN from EfficientDet to optimize YOLO11n’s feature fusion mechanism. BiFPN facilitates bidirectional cross-scale connections and weighted feature fusion, promoting top-down flow of high-level semantic features and bottom-up integration of low-level spatial details. Compared to the original PANet, BiFPN strengthens same-layer and cross-layer feature fusion through cross-connections, improving detection accuracy for multi-scale defects in solar panels. A comparison of different feature fusion network structures is shown in a diagram.
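The weighted fusion that distinguishes BiFPN from plain concatenation is EfficientDet's "fast normalized fusion", which can be sketched as follows (a minimal illustration with made-up inputs, not the trained weights):

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shape feature maps with learnable non-negative weights.

    O = sum_i (w_i * I_i) / (eps + sum_j w_j), with w_i = ReLU(raw weight),
    so each input contributes a normalised share in [0, 1].
    """
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # ReLU keeps weights >= 0
    num = sum(wi * f for wi, f in zip(w, features))
    return num / (eps + w.sum())

# Two same-scale inputs, e.g. a top-down path and a lateral skip connection
a = np.ones((8, 40, 40))
b = np.full((8, 40, 40), 3.0)
fused = fast_normalized_fusion([a, b], weights=[1.0, 1.0])
# With equal weights the result is (approximately) the mean of the inputs
assert np.allclose(fused, 2.0, atol=1e-3)
```

During training the raw weights are learned per fusion node, letting the network emphasise whichever scale carries more defect evidence.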
Global-Local Self-Attention (GLSA) Mechanism
In infrared image feature fusion, small targets have low distinction from the background, and the lack of a dynamic feature weighting mechanism can weaken small target features, limiting detection accuracy. Thus, we introduce the GLSA mechanism between Rep-HGNetV2 and BiFPN. GLSA processes features extracted by Rep-HGNetV2, performing semantic-level discrimination and enhancement. It reallocates weights based on global semantic importance, highlighting target features and suppressing background interference. This preprocessing allows BiFPN to operate on more precise and discriminative features during fusion, better integrating multi-scale information to grasp large-scale features overall and preserve small-scale details.
Specifically, GLSA consists of two branches: Local Spatial Attention (LSA) focuses on pixel-level details of defect regions, enhancing sensitivity to minute defects; Global Spatial Attention (GSA) models the overall structural semantics of the solar panel, suppressing repetitive textures and noise. By fusing LSA and GSA across scales, GLSA retains local fine features while using global context to strengthen feature discriminability, effectively balancing detail capture and background suppression for small defects in photovoltaic panel infrared images.
The GLSA module first splits the input feature $X$ (with dimensions $C, H, W$) into $X_0$ and $X_1$ along the channel dimension to reduce computational complexity while maintaining feature expressiveness. Then, $X_0$ and $X_1$ are fed into GSA and LSA branches, respectively. GSA captures long-range dependencies between pixels to supplement global context missing in local features, enhancing the model’s ability to distinguish defects from complex backgrounds. LSA focuses on fine feature extraction in the spatial dimension, strengthening the capture of local detail information to mitigate the overlooking of small-scale defect features.
To enhance feature representation, GLSA concatenates the outputs $GSA(X_0)$ and $LSA(X_1)$ along the channel dimension, producing a fused result with both local details and global semantics (dimensions $C \times 2, H, W$). A $1 \times 1$ convolution then compresses the dimension to output the optimized feature $Y$ (dimensions $C, H, W$). The formulas are as follows:
$$X_0, X_1 = \text{Split}(X)$$
$$\text{Att}_G(X_0) = \text{Softmax}(\text{Transpose}(\text{Conv}_{1 \times 1}(X_0)))$$
$$GSA(X_0) = \text{MLP}(\text{Att}_G(X_0) \otimes X_0) + X_0$$
$$\text{Att}_L(X_1) = \text{Sigmoid}(\text{Conv}_{1 \times 1}(\text{DWConv}_{3 \times 3}(\text{Conv}_{1 \times 1}(X_1))))$$
$$LSA(X_1) = \text{Att}_L(X_1) \otimes X_1 + X_1$$
$$Y = \text{Conv}_{1 \times 1}(\text{Concat}(GSA(X_0), LSA(X_1)))$$
Here, $\otimes$ denotes element-wise multiplication, and $\text{DWConv}_{3 \times 3}$ is a depthwise convolution with a $3 \times 3$ kernel.
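The six equations above can be traced end to end with a small NumPy sketch. This is a shape-level illustration under simplifying assumptions (random weights, 1×1 convolutions as channel matmuls, a single-channel global attention map), not the authors' implementation:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution as a channel-mixing matmul: x (C_in,H,W), w (C_out,C_in)."""
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(-1, h, wd)

def dwconv3x3(x, k):
    """Depthwise 3x3 'same' convolution: x (C,H,W), k (C,3,3)."""
    c, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for ch in range(c):
        for i in range(h):
            for j in range(wd):
                out[ch, i, j] = np.sum(xp[ch, i:i + 3, j:j + 3] * k[ch])
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glsa(x, p):
    """GLSA forward pass following the equations above; p holds random weights."""
    c = x.shape[0]
    x0, x1 = x[: c // 2], x[c // 2:]                 # Split along channels
    # GSA branch: spatial softmax attention from a 1x1 conv, MLP, residual
    att_g = conv1x1(x0, p["wg"])                     # (1,H,W) attention logits
    att_g = np.exp(att_g) / np.exp(att_g).sum()      # softmax over positions
    gsa = conv1x1(att_g * x0, p["mlp"]) + x0
    # LSA branch: conv1x1 -> DWConv3x3 -> conv1x1 -> sigmoid gate, residual
    att_l = sigmoid(conv1x1(dwconv3x3(conv1x1(x1, p["w1"]), p["dw"]), p["w2"]))
    lsa = att_l * x1 + x1
    # Concatenate both branches (2C channels total here: C/2 + C/2 doubled back)
    return conv1x1(np.concatenate([gsa, lsa], axis=0), p["wo"])

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
params = {
    "wg": rng.standard_normal((1, C // 2)),
    "mlp": rng.standard_normal((C // 2, C // 2)),
    "w1": rng.standard_normal((C // 2, C // 2)),
    "dw": rng.standard_normal((C // 2, 3, 3)),
    "w2": rng.standard_normal((C // 2, C // 2)),
    "wo": rng.standard_normal((C, C)),
}
y = glsa(rng.standard_normal((C, H, W)), params)
assert y.shape == (C, H, W)  # output dimensions match the input, as in Eq. for Y
```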
Feature Enhancement Fusion Module (FEFM)
YOLO11n’s feature fusion primarily relies on simple concatenation (Concat), which averages information weights across different feature maps without dynamically adapting to their importance and relevance. This limits the model’s ability to highlight key features, affecting detection accuracy and generalization. Therefore, we employ the FEFM to perform adaptive weighted fusion of different feature maps, dynamically learning the importance and relevance of each feature map to integrate multi-source feature information more effectively. FEFM uses a two-stage cascade optimization mechanism of edge-semantic enhancement to improve feature discriminability: first, it uses shallow high-resolution edge features to enhance local detail representation, highlighting boundary responses of small defects; then, it introduces deep semantic features to dynamically calibrate fusion weights, suppressing background noise and strengthening defect region features.
Given a low-level feature map $F_L$ and a higher-level input (a deep semantic feature $F_H$ or an edge feature $F_E$; the equations below use $F_E$), FEFM first enhances the spatial information of $F_L$ via Coordinate Attention (CA), producing the optimized feature $F'_L$. Then, it extracts channel weights $M_{CM}$ from the edge feature $F_E$ using Context Modeling (CM) to guide channel optimization of the low-level features. Finally, it concatenates the optimized feature $F''_L$ with the edge feature $F_E$ to output $F_O$. This process is expressed as:
$$F'_L = \text{CA}(F_L)$$
$$M_{CM} = \text{CM}(F_E)$$
$$F''_L = M_{CM} \otimes F'_L$$
$$F_O = \text{Concat}(F''_L, F_E)$$
where $\otimes$ denotes element-wise multiplication.
In the multi-feature cascade enhancement framework, the first FEFM module uses edge features to strengthen low-level features, highlighting defect-edge-sensitive channels. The second module leverages high-level semantics for secondary optimization of fused features, improving feature expression capability and enhancing detection accuracy and environmental adaptability for UAV inspection.
The CA module enhances defect region localization through direction-aware and position-sensitive feature modeling. It operates in two stages: coordinate feature encoding and attention weight calculation. For an input feature map $x$, average pooling is performed along the width ($W$) and along the height ($H$) to generate the vertical response $z_h(c, h)$ and horizontal response $z_w(c, w)$:
$$z_h(c, h) = \frac{1}{W} \sum_{0 \leq i < W} x(c, h, i)$$
$$z_w(c, w) = \frac{1}{H} \sum_{0 \leq j < H} x(c, j, w)$$
These are concatenated and processed through a $1 \times 1$ convolution and activation to produce intermediate feature $f$, which is then split into spatial attention weights $g_h$ and $g_w$ via two separate convolutional branches. The output is:
$$f = \delta(\text{Conv}_{1 \times 1}([z_h, z_w]))$$
$$g_h = \sigma(\text{Conv}_{h}(f))$$
$$g_w = \sigma(\text{Conv}_{w}(f))$$
$$\text{out} = x \otimes g_h \otimes g_w$$
where $\delta$ is an activation function (e.g., ReLU), $\sigma$ is sigmoid, and $\otimes$ is element-wise multiplication.
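The directional pooling and gating structure of CA can be sketched as below. For brevity the $1 \times 1$ convolutions are replaced by identity maps, so this only illustrates the coordinate encoding and broadcast gating, not the full module:

```python
import numpy as np

def coordinate_attention(x):
    """Simplified Coordinate Attention sketch: x has shape (C, H, W).

    The 1x1 convolutions of the full module are omitted (identity) here;
    only the direction-aware pooling and two-axis gating are shown.
    """
    c, h, w = x.shape
    z_h = x.mean(axis=2)                              # (C,H): pool along width
    z_w = x.mean(axis=1)                              # (C,W): pool along height
    f = np.maximum(np.concatenate([z_h, z_w], axis=1), 0.0)  # ReLU on [z_h, z_w]
    g_h = 1.0 / (1.0 + np.exp(-f[:, :h]))             # sigmoid: vertical weights
    g_w = 1.0 / (1.0 + np.exp(-f[:, h:]))             # sigmoid: horizontal weights
    # Broadcast gating: out[c,i,j] = x[c,i,j] * g_h[c,i] * g_w[c,j]
    return x * g_h[:, :, None] * g_w[:, None, :]

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8, 10))
out = coordinate_attention(x)
assert out.shape == x.shape
```

Because both gates lie in (0, 1), the module can only attenuate responses, re-weighting positions along each axis rather than amplifying them.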
The CM module is an efficient context modeling structure. It generates an attention map via a $1 \times 1$ convolution and a softmax function to capture contextual features in solar panel images; matrix multiplication then weights and aggregates the features across spatial positions. The formula is:
$$\text{out} = \text{input} * \text{softmax}(F(\text{input}))$$
where $F$ denotes a $1 \times 1$ convolution and $*$ denotes matrix multiplication.
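In GCNet-style context modeling, the matrix multiplication collapses the spatial dimension into a single per-channel context vector. A minimal sketch (random weights, illustrative shapes only):

```python
import numpy as np

def context_modeling(x, w_att):
    """Global context pooling sketch: x (C,H,W), w_att (C,) for the 1x1 conv.

    A single-channel attention map is softmax-normalised over all positions,
    then matrix multiplication aggregates features into one (C,) context
    vector summarising the whole image.
    """
    c, h, w = x.shape
    flat = x.reshape(c, -1)                  # (C, HW)
    logits = w_att @ flat                    # 1x1 conv -> (HW,) attention logits
    att = np.exp(logits - logits.max())
    att = att / att.sum()                    # softmax over spatial positions
    return flat @ att                        # (C,) weighted sum over positions

rng = np.random.default_rng(2)
x = rng.standard_normal((6, 8, 8))
ctx = context_modeling(x, rng.standard_normal(6))
assert ctx.shape == (6,)
```

In FEFM this context vector would be mapped to channel weights $M_{CM}$ that rescale the CA-optimized low-level feature.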
Experimental Setup
Dataset
We constructed a specialized dataset for photovoltaic panel defect detection from samples collected using a DJI M300 RTK UAV equipped with a FLIR XT2 infrared camera. The dataset contains 3225 thermal images covering five categories: large-area hot spots, single hot spots, abnormal low temperature, diode short circuits, and normal state. Each image includes multiple solar panels, and each independent photovoltaic panel component was manually annotated using the LabelImg tool to ensure accurate recording of defect states. To highlight defects, only defect labels are shown in the figures, but the dataset fully covers all five categories. Examples of annotated images are provided.
To ensure reliability, the dataset was split into training, validation, and test sets in a 7:2:1 ratio. To improve model generalization, data augmentation was applied to the training set, including random rotation, mirror flipping, and noise addition, expanding it to 6773 images. This dataset considers real-world UAV inspection scenarios and includes common defect types in photovoltaic power plants, providing a reliable foundation for algorithm training and evaluation. The label categories and counts are summarized in Table 1.
| Labels | Description | Count |
|---|---|---|
| dmjrb | Large-area hot spot | 1373 |
| dyrb | Single hot spot | 1678 |
| ycdw | Abnormal low temperature | 1125 |
| ejgdl | Diode short circuit | 1049 |
| zc | Normal state | 4870 |
Table 1 lists the total count of annotated solar panel components per category. The "normal" category accounts for a larger proportion, reflecting the real operational state of photovoltaic arrays; artificially balancing the classes would risk overfitting to idealized scenarios.
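The 7:2:1 split described above can be reproduced with a simple seeded shuffle. The file names below are placeholders, not the dataset's actual paths:

```python
import random

def split_dataset(paths, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle file paths and split into train/val/test by the given ratios."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)     # deterministic shuffle for reproducibility
    n = len(paths)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]

# 3225 images as in the dataset described above (hypothetical file names)
images = [f"ir_{i:04d}.jpg" for i in range(3225)]
train, val, test = split_dataset(images)
assert (len(train), len(val), len(test)) == (2257, 645, 323)
```

Note that augmentation (rotation, flipping, noise) is applied only to the training portion, after the split, so the validation and test sets remain unmodified.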
Experimental Configuration
Experiments were conducted on a high-performance computing platform. The hardware, software, and training parameters are detailed in Tables 2 and 3, optimized to ensure efficient model convergence.
| Environment | Specification |
|---|---|
| Operating system | Windows 11 |
| Processor | Intel Core i5-12400F |
| Programming language | Python 3.9 |
| Deep learning framework | PyTorch 1.12 |
| GPU | RTX 4060Ti |

| Parameters | Value |
|---|---|
| Input size | 640×640 |
| Batch Size | 16 |
| Epochs | 300 |
| Optimizer | SGD |
| Initial learning rate | 0.01 |
Evaluation Metrics
We use precision (P), recall (R), mean average precision (mAP), parameter count (Params), and computational load (FLOPs) as evaluation metrics. The formulas are:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$AP = \int_0^1 P(R) \, dR$$
$$mAP = \frac{1}{n} \sum_{i=1}^n AP_i$$
where $TP$ is true positive, $FP$ is false positive, $FN$ is false negative, and $n$ is the number of classes. We report mAP at IoU threshold 0.5 (mAP@0.5) and averaged over IoU thresholds from 0.5 to 0.95 with step 0.05 (mAP@0.5:0.95).
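The metric definitions above can be exercised with toy counts and a toy P-R curve. The AP integral is approximated here with the trapezoidal rule; evaluation toolkits typically use interpolated variants, so this is illustrative only:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """AP as the area under a sampled P-R curve (trapezoidal approximation)."""
    area = 0.0
    for i in range(1, len(recalls)):
        area += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return area

# Toy counts: 80 correct detections, 20 false alarms, 20 misses
p, r = precision_recall(tp=80, fp=20, fn=20)
assert p == 0.8 and r == 0.8

# Toy 3-point P-R curve: perfect precision up to recall 0.5, then decaying
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
assert abs(ap - 0.875) < 1e-9

# mAP is then the mean of the per-class AP values
map50 = sum([ap, 0.9, 0.8]) / 3  # hypothetical APs for three classes
```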
Results and Analysis
Backbone Network Performance Analysis
To validate the rationality and effectiveness of introducing HGNetV2 into YOLO11n for UAV-based infrared solar panel defect detection, we conducted comparative experiments with other lightweight backbone networks. Results are shown in Table 4. HGNetV2 achieved mAP@0.5 of 82.9% and mAP@0.5:0.95 of 70.8%, which, while slightly lower than the original model, significantly reduced parameters and FLOPs. Compared to other lightweight backbones such as ShuffleNetV2, EfficientViT, MobileNetV4, StarNet, and FasterNet, HGNetV2 showed superior defect detection performance. Furthermore, introducing RepConv into HGNetV2 to form Rep-HGNetV2 increased mAP@0.5 and mAP@0.5:0.95 by 1.2% and 1.6%, respectively, validating that RepConv improves target defect recognition and detection accuracy. Compared to the original model, Rep-HGNetV2 increased mAP@0.5 and mAP@0.5:0.95 by 1.0% and 1.4%, while reducing parameters and FLOPs by 19% and 9.5%, respectively, making it suitable for deployment on resource-constrained UAV devices.
| Base Model | Backbone network | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params (10^6) | FLOPs (G) |
|---|---|---|---|---|---|---|---|
| YOLO11n | BottleNeck (original) | 84.7 | 78.3 | 83.4 | 70.9 | 2.6 | 6.3 |
| YOLO11n | ShuffleNetV2 | 80.3 | 73.8 | 78.7 | 66.9 | 1.6 | 3.6 |
| YOLO11n | EfficientViT | 84.0 | 77.1 | 80.7 | 68.7 | 3.8 | 8.1 |
| YOLO11n | MobileNetV4 | 84.1 | 77.3 | 80.9 | 68.7 | 5.4 | 21.0 |
| YOLO11n | StarNet | 85.6 | 76.5 | 81.2 | 68.9 | 1.9 | 5.0 |
| YOLO11n | FasterNet | 85.8 | 80.2 | 83.1 | 70.5 | 3.9 | 9.2 |
| YOLO11n | HGNetV2 | 84.6 | 77.5 | 82.9 | 70.8 | 2.2 | 5.9 |
| YOLO11n | Rep-HGNetV2 | 84.5 | 81.2 | 84.2 | 71.9 | 2.1 | 5.7 |
Feature Fusion Performance Analysis
Despite backbone improvements, detection accuracy needed further enhancement. Therefore, we introduced BiFPN for efficient multi-scale feature fusion to boost small-target defect detection. To validate the effectiveness of our feature fusion network, we compared different feature fusion networks on the Rep-HGNetV2 backbone, as shown in Table 5. BiFPN achieved mAP@0.5 of 84.7% and mAP@0.5:0.95 of 72.8%, outperforming PANet, indicating its advantage in multi-scale feature fusion for accurately detecting subtle defects in solar panels. Although BiFPN has slightly higher computational complexity, its accuracy improvement justifies its use for photovoltaic panel defect detection. Future work could incorporate lightweight designs or model pruning to further reduce computational burden while maintaining high accuracy.
| Base Model | Neck network | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params (10^6) | FLOPs (G) |
|---|---|---|---|---|---|---|---|
| YOLO11n + Rep-HGNetV2 | PANet | 84.5 | 81.2 | 84.2 | 71.9 | 2.1 | 5.7 |
| YOLO11n + Rep-HGNetV2 | BiFPN | 85.1 | 81.5 | 84.7 | 72.8 | 1.8 | 6.1 |
Attention Mechanism Comparison Experiment
To address the issue of small-size defect features being overlooked and missed detections, we integrated the GLSA mechanism to capture both global context and local details. We compared GLSA with other attention mechanisms, as shown in Table 6. Results indicate that GLSA outperforms others in precision, recall, and detection accuracy. While its parameters and FLOPs are slightly higher than MLCA, GLSA offers a better balance of accuracy and efficiency, making it the best overall performer. Thus, GLSA effectively enhances global feature perception and local detail focus, improving recognition of subtle defects in complex backgrounds, proving its value for solar panel defect detection.
| Base Model | Attention | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params (10^6) | FLOPs (G) |
|---|---|---|---|---|---|---|---|
| YOLO11n + Rep-HGNetV2 + BiFPN | CAFM | 82.4 | 80.3 | 83.7 | 71.0 | 2.3 | 8.4 |
| YOLO11n + Rep-HGNetV2 + BiFPN | CPCA | 83.5 | 79.2 | 84.5 | 71.7 | 1.8 | 7.3 |
| YOLO11n + Rep-HGNetV2 + BiFPN | MLCA | 85.9 | 79.9 | 84.9 | 72.3 | 1.5 | 5.6 |
| YOLO11n + Rep-HGNetV2 + BiFPN | SegNext | 84.2 | 82.2 | 85.1 | 72.3 | 1.7 | 6.4 |
| YOLO11n + Rep-HGNetV2 + BiFPN | GLSA | 86.2 | 82.6 | 85.2 | 72.6 | 1.7 | 6.2 |
Ablation Study
To comprehensively evaluate the effectiveness of our improvements, we designed an ablation study by incrementally adding modules and analyzing their impact on detection performance. Results are shown in Table 7. The modules Rep-HGNetV2 (RH), BiFPN (BF), GLSA (GL), and FEFM (FE) each contribute to performance gains. Introducing Rep-HGNetV2 alone increased mAP@0.5 and mAP@0.5:0.95 by 1.0% and 1.4%, while reducing parameters and FLOPs by 19% and 9.5%. Adding BiFPN further improved accuracy, though with a slight increase in FLOPs. Incorporating GLSA boosted mAP@0.5 by 0.6%. Finally, adding FEFM enhanced feature fusion, increasing mAP@0.5 and mAP@0.5:0.95 by 0.4% and 0.8%, while reducing FLOPs by 6.5%. Overall, compared to the original model, the improved model increased precision and recall by 3.1% and 4.3%, mAP@0.5 and mAP@0.5:0.95 by 2.5% and 3.2%, and reduced parameters and FLOPs by 38.5% and 7.9%, validating the effectiveness of our proposed improvements for photovoltaic panel defect detection.
| RH | BF | GL | FE | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params (10^6) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|
| | | | | 84.7 | 78.3 | 83.4 | 70.9 | 2.6 | 6.3 |
| √ | | | | 84.5 | 81.2 | 84.2 | 71.9 | 2.1 | 5.7 |
| | √ | | | 85.0 | 80.6 | 83.9 | 72.1 | 2.0 | 6.6 |
| √ | √ | | | 85.1 | 81.5 | 84.7 | 72.8 | 1.8 | 6.1 |
| √ | | √ | | 86.3 | 79.9 | 84.4 | 71.6 | 2.1 | 7.1 |
| √ | | | √ | 85.4 | 81.0 | 84.3 | 71.9 | 1.9 | 6.4 |
| √ | | √ | √ | 84.9 | 82.0 | 84.8 | 72.4 | 2.0 | 6.6 |
| √ | √ | √ | | 86.2 | 82.6 | 85.2 | 72.6 | 1.7 | 6.2 |
| √ | √ | | √ | 86.5 | 81.2 | 85.0 | 72.3 | 1.5 | 5.6 |
| √ | √ | √ | √ | 87.3 | 81.7 | 85.5 | 73.2 | 1.6 | 5.8 |
Comparison on Custom Dataset
To verify the performance of our improved model, we compared it with other mainstream models on our custom photovoltaic panel dataset. Quantitative results are in Table 8. Our model achieved mAP@0.5 of 85.5% and mAP@0.5:0.95 of 73.2%, improving over the original by 2.5% and 3.2%, while significantly reducing parameters and FLOPs by 38.5% and 7.9%. Compared to other lightweight models, our model offers better detection accuracy and computational efficiency, making it more suitable for real-time operation on resource-constrained UAV platforms. To demonstrate superiority in infrared small-target detection, we compared with related improved algorithms. Results show that our model outperforms YOLOv3-tiny+, YOLOv5n+, YOLOv5sm+, and LS-YOLO in precision and recall, effectively reducing missed and false detections. Specifically, mAP@0.5 is higher by 6.8, 6.4, 4.2, and 1.8 percentage points, and mAP@0.5:0.95 by 4.6, 6.5, 3.5, and 1.1 percentage points, respectively. Additionally, our algorithm has lower parameters and computational complexity, offering high efficiency for real-time applications.
| Model | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params (10^6) | FLOPs (G) |
|---|---|---|---|---|---|---|
| Faster R-CNN | 71.2 | 70.9 | 74.3 | 61.8 | 43.6 | 207.0 |
| SSD | 65.8 | 68.0 | 68.7 | 59.5 | 25.2 | 34.3 |
| ShuffleNet v2 | 80.3 | 73.8 | 78.7 | 66.9 | 1.6 | 3.6 |
| MobileNet v4 | 84.1 | 77.3 | 80.9 | 68.7 | 5.4 | 21.0 |
| YOLOv3-tiny+ | 80.5 | 77.2 | 78.7 | 68.6 | 9.4 | 15.7 |
| YOLOv5n | 77.7 | 76.2 | 78.5 | 65.9 | 2.1 | 5.8 |
| YOLOv5n+ | 78.2 | 76.0 | 79.1 | 66.7 | 13.2 | 24.3 |
| YOLOv5sm+ | 81.6 | 79.8 | 81.3 | 69.7 | 17.9 | 38.6 |
| YOLOv8n | 79.1 | 76.8 | 79.3 | 67.1 | 2.7 | 6.8 |
| LS-YOLO | 85.4 | 80.2 | 83.7 | 72.1 | 1.8 | 23.8 |
| YOLOv10n | 80.1 | 79.9 | 81.4 | 70.1 | 2.3 | 6.5 |
| YOLO11n | 84.7 | 78.3 | 83.4 | 70.9 | 2.6 | 6.3 |
| HBGF-YOLO (Ours) | 87.3 | 81.7 | 85.5 | 73.2 | 1.6 | 5.8 |
For visual evaluation, we show Precision-Recall (P-R) curves comparing the YOLO11n and HBGF-YOLO algorithms. The curve for HBGF-YOLO is closer to the top-right corner, indicating higher precision at the same recall. To intuitively understand the enhancement effect of GLSA on small targets, we conducted heatmap analysis. After introducing GLSA, the number of detected small targets increased, and the red regions for defects expanded and aligned more closely with the actual defect locations. This shows that under the fusion mechanism, global attention effectively locates small targets by integrating scene context, while local attention enhances the contrast between targets and background through detail reinforcement, making small targets more prominent in the heatmaps. Thus, GLSA improves small-target detection accuracy through a coordinated multi-scale attention mechanism.
Comparison on Public Dataset
To validate the adaptability and generalizability of our improved model in different scenarios, we conducted comparative experiments on the MS COCO dataset. All models were trained with the same parameters. Results are in Table 9. Our algorithm demonstrates clear advantages in complex detection tasks, maintaining detection accuracy while significantly optimizing model structure and computational efficiency. Its lightweight design effectively reduces hardware resource requirements, supporting practical deployment on embedded devices.
| Model | mAP@0.5:0.95/% | Params (10^6) | FLOPs (G) |
|---|---|---|---|
| Faster R-CNN | 36.4 | 42.1 | 207.0 |
| SSD | 25.1 | 27.4 | 33.6 |
| YOLOv5n | 28.0 | 1.9 | 4.5 |
| YOLOv7-tiny | 37.4 | 6.2 | 13.7 |
| YOLOv8n | 37.3 | 3.2 | 8.7 |
| YOLOv9t | 27.5 | 4.5 | 10.5 |
| YOLOv10n | 29.8 | 3.5 | 9.2 |
| YOLO11n | 39.5 | 2.6 | 6.5 |
| HBGF-YOLO (Ours) | 40.3 | 1.7 | 5.9 |
Conclusion
In this paper, we proposed HBGF-YOLO, an improved algorithm based on the YOLO11n framework, to effectively address the challenges of insufficient accuracy and real-time performance in detecting small-target defects in infrared images of solar panels captured by UAVs. Our approach involves replacing the original backbone with Rep-HGNetV2 to enhance accuracy while reducing parameters and computational complexity; incorporating BiFPN and GLSA to improve small-target defect detection precision; and introducing FEFM to strengthen detail feature capture. We constructed a dataset containing five common types of photovoltaic panel defects and conducted extensive experiments. Results show that our algorithm outperforms the original model in accuracy and processing speed, while significantly compressing model size and computational overhead. Compared to mainstream methods, it offers higher stability in complex environments, meeting the dual requirements of real-time processing and high-precision detection for UAV platforms. Validation on the MS COCO dataset further demonstrates its advantages in varied detection tasks. Future work will focus on optimizing the network for higher resolution processing, stronger environmental adaptability, and lower resource consumption, promoting the large-scale application of UAV-based photovoltaic inspection.
