In recent years, the rapid expansion of photovoltaic energy systems has highlighted the critical need for efficient and reliable inspection methods to maintain optimal performance and safety. Solar panels are often deployed in harsh environments, making them susceptible to various defects such as hot spots, cracks, and shading, which can significantly reduce energy output and lead to system failures. Traditional inspection techniques, including manual checks and conventional image processing, struggle with scalability, accuracy, and real-time performance, especially in large-scale solar farms. With the advent of unmanned aerial vehicles (UAVs) equipped with infrared cameras, aerial photography has emerged as a promising solution for large-area inspection. However, detecting small target defects in infrared images remains challenging due to factors like low resolution, complex backgrounds, and the subtle nature of thermal anomalies. These challenges often result in missed detections and false alarms, limiting the effectiveness of automated inspection systems.
To address these issues, I propose HBGF-YOLO, an enhanced object detection algorithm based on YOLO11n, specifically designed for infrared small target defect detection in photovoltaic panels. This approach integrates several innovative components to improve accuracy, reduce computational complexity, and ensure real-time performance on resource-constrained UAV platforms. The core improvements include a lightweight backbone network called Rep-HGNetV2, which combines hierarchical gradient feature extraction with re-parameterized convolution techniques. This design enhances feature representation while minimizing parameters and floating-point operations (FLOPs). Additionally, a bidirectional feature pyramid network (BiFPN) is incorporated to facilitate multi-scale feature fusion, enabling the model to capture both global thermal distribution patterns and local defect details. To further refine feature discrimination, a global-local self-attention (GLSA) mechanism is introduced, which dynamically weights features based on their semantic importance, emphasizing defect regions and suppressing background noise. Finally, a feature enhancement fusion module (FEFM) is employed to achieve deep cross-level feature integration through dynamic weight adaptation, strengthening the propagation of semantic information for tiny defects.
The performance of HBGF-YOLO is evaluated on a custom dataset comprising infrared images of photovoltaic panels, captured by UAVs and annotated with five common defect types: large-area hot spots, single hot spots, abnormal low-temperature areas, diode short circuits, and normal conditions. Experimental results demonstrate that HBGF-YOLO outperforms the baseline YOLO11n and other state-of-the-art methods in terms of precision, recall, and mean average precision (mAP), while significantly reducing model size and computational cost. For instance, compared to YOLO11n, HBGF-YOLO achieves a 3.1% increase in precision, a 4.3% increase in recall, and improvements of 2.5% and 3.2% in mAP@0.5 and mAP@0.5:0.95, respectively. Moreover, the parameter count and FLOPs are reduced by 38.5% and 7.9%, respectively, making it suitable for real-time deployment on UAVs. The algorithm also shows strong generalization on the MS COCO dataset, confirming its adaptability to diverse detection tasks.

In the context of photovoltaic panel inspection, the infrared imagery captured by UAVs often contains small targets that are difficult to detect due to their limited pixel size and low contrast against the background. The proposed HBGF-YOLO algorithm tackles this by optimizing the feature extraction and fusion processes. The Rep-HGNetV2 backbone leverages depthwise separable convolutions to reduce computational overhead. The number of parameters and FLOPs for a standard convolution can be expressed as:
$$P = C_{\text{in}} \cdot C_{\text{out}} \cdot K \cdot K$$
$$F = C_{\text{in}} \cdot C_{\text{out}} \cdot W \cdot H \cdot K \cdot K$$
where \(C_{\text{in}}\) and \(C_{\text{out}}\) are the input and output channels, \(K\) is the kernel size, and \(W\) and \(H\) are the width and height of the feature map. For depthwise separable convolution, the equations become:
$$P_{\text{DW}} = C_{\text{in}} \cdot K \cdot K + C_{\text{in}} \cdot C_{\text{out}}$$
$$F_{\text{DW}} = C_{\text{in}} \cdot W \cdot H \cdot K \cdot K + C_{\text{in}} \cdot C_{\text{out}} \cdot W \cdot H$$
This results in a reduction ratio of approximately \(\frac{1}{C_{\text{out}}} + \frac{1}{K^2}\), making the network more efficient. The RepConv module further enhances this by training with multiple parallel branches, which are merged into a single convolution at inference time, preserving the multi-branch representational capacity at no additional inference cost.
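The savings implied by these formulas can be checked numerically. The following Python sketch (the helper names are illustrative, not from the paper) evaluates both cost models for a hypothetical 3×3 convolution mapping 64 to 128 channels on a 40×40 feature map:

```python
def conv_params_flops(c_in, c_out, k, w, h):
    """Parameters and FLOPs of a standard convolution (bias omitted)."""
    params = c_in * c_out * k * k
    flops = c_in * c_out * w * h * k * k
    return params, flops

def dwsep_params_flops(c_in, c_out, k, w, h):
    """Parameters and FLOPs of a depthwise separable convolution:
    a depthwise k x k stage followed by a pointwise 1 x 1 stage."""
    params = c_in * k * k + c_in * c_out
    flops = c_in * w * h * k * k + c_in * c_out * w * h
    return params, flops

# Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 40x40 feature map.
p_std, f_std = conv_params_flops(64, 128, 3, 40, 40)
p_dw, f_dw = dwsep_params_flops(64, 128, 3, 40, 40)
# 73728 vs 8768 parameters; the FLOP ratio equals 1/128 + 1/9 ~ 0.119.
print(p_std, p_dw, round(f_dw / f_std, 4))
```

For these dimensions the depthwise separable form uses roughly 12% of the compute of the standard convolution, matching the \(\frac{1}{C_{\text{out}}} + \frac{1}{K^2}\) ratio exactly.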
The BiFPN component enables bidirectional cross-scale connections, allowing features to flow both top-down and bottom-up. This is crucial for handling the multi-scale nature of defects in photovoltaic panels, as it combines high-level semantic information with low-level spatial details. The feature fusion in BiFPN can be summarized with weighted addition, where each input feature is assigned a learnable weight. For features \(P_i\) at different scales, the output \(O\) is computed as:
$$O = \sum_i w_i \cdot P_i$$
where \(w_i\) are weights normalized through softmax, ensuring that more important features contribute more significantly. This process enhances the detection of small targets by effectively aggregating context from multiple resolutions.
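The weighted fusion above can be sketched in a few lines of numpy. Note that the original BiFPN paper also proposes a fast normalization \(w_i / (\sum_j w_j + \epsilon)\); this sketch follows the softmax form stated here, and the function names and toy inputs are illustrative only:

```python
import numpy as np

def softmax(w):
    """Numerically stable softmax over a 1-D weight vector."""
    e = np.exp(w - w.max())
    return e / e.sum()

def weighted_fusion(features, raw_weights):
    """O = sum_i w_i * P_i, with learnable raw weights normalized by
    softmax so that more important scales contribute more."""
    w = softmax(np.asarray(raw_weights, dtype=np.float64))
    out = np.zeros_like(features[0], dtype=np.float64)
    for wi, p in zip(w, features):
        out += wi * p
    return out

# Toy example: fuse two 4x4 maps (assumed already resized to a common scale).
p1 = np.ones((4, 4))
p2 = 3 * np.ones((4, 4))
fused = weighted_fusion([p1, p2], raw_weights=[0.0, 0.0])  # equal weights -> mean
```

With equal raw weights the softmax assigns 0.5 to each input, so the fused map is simply the element-wise mean; during training the weights shift to favor the scales most informative for small defects.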
The GLSA mechanism divides the input feature map into two parts for global and local attention. Global attention captures long-range dependencies, while local attention focuses on fine-grained details. Mathematically, for an input feature \(X\), it is split into \(X_0\) and \(X_1\). The global attention output is given by:
$$G_{\text{att}}(X_0) = \text{Softmax}\left(\text{Transpose}(\text{Conv}_{1\times1}(X_0))\right)$$
$$\text{GSA}(X_0) = \text{MLP}(G_{\text{att}}(X_0) \otimes X_0) + X_0$$
where \(\otimes\) denotes element-wise multiplication. The local attention is computed as:
$$L_{\text{att}}(X_1) = \text{Sigmoid}\left(\text{Conv}_{1\times1}(\text{DWconv}_{3\times3}(\text{Conv}_{1\times1}(X_1))) + \epsilon\right)$$
$$\text{LSA}(X_1) = L_{\text{att}}(X_1) \otimes X_1 + X_1$$
The outputs are then concatenated and processed through a 1×1 convolution to produce the final feature map. This dual attention approach ensures that both broad contextual information and intricate local patterns are utilized for defect detection in solar panels.
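The split-attend-concatenate flow can be illustrated with a minimal numpy sketch. This is not the paper's exact module: the convolution stacks are replaced by placeholders (a zero-initialized gate for the local branch, a per-channel spatial softmax for the global branch), and the trailing 1×1 convolution is omitted. It only demonstrates the channel split, the two attention forms, and the residual connections:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_attention(x1, gate_logits):
    """Simplified local branch: a learned gate in [0, 1] applied per
    position, with a residual connection (L_att(X1) * X1 + X1).
    `gate_logits` stands in for the Conv1x1 -> DWConv3x3 -> Conv1x1 stack."""
    gate = sigmoid(gate_logits)
    return gate * x1 + x1

def global_attention(x0):
    """Simplified global branch: a softmax over all spatial positions per
    channel yields long-range weights; the re-weighted map is added back."""
    c, h, w = x0.shape
    flat = x0.reshape(c, h * w)
    att = np.exp(flat - flat.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)   # softmax over positions
    return (att * flat).reshape(c, h, w) + x0    # modulate + residual

def glsa_sketch(x):
    """Split channels in half, apply global/local attention, concatenate."""
    c = x.shape[0] // 2
    x0, x1 = x[:c], x[c:]
    gate_logits = np.zeros_like(x1)  # placeholder for the learned gate stack
    return np.concatenate(
        [global_attention(x0), local_attention(x1, gate_logits)], axis=0
    )
```

In a trained network the gate logits come from the convolution stack in \(L_{\text{att}}\), so the sigmoid gate suppresses background positions while the global branch keeps the broad thermal context.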
The FEFM module further refines feature fusion by incorporating coordinate attention (CA) and context modeling (CM). For low-level features \(F_L\) and edge features \(F_E\), the process involves:
$$F_L' = \text{CA}(F_L)$$
$$M_{\text{CM}} = \text{CM}(F_E)$$
$$F_L'' = M_{\text{CM}} \otimes F_L'$$
$$F_O = \text{concat}(F_L'', F_E)$$
This module enhances the representation of defect boundaries and semantic details, which is vital for accurately identifying small anomalies in photovoltaic systems.
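The four-step pipeline above can be traced with a simplified numpy sketch. The coordinate attention and context modeling blocks here are stand-ins (pooling plus a sigmoid gate) rather than the full modules, so this only shows the data flow: gate the low-level features, modulate them with a mask derived from the edge features, then concatenate along channels:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(f):
    """Stand-in for CA: pool along H and W separately, form per-row and
    per-column gates, and modulate the input with both."""
    gh = sigmoid(f.mean(axis=2, keepdims=True))  # (C, H, 1)
    gw = sigmoid(f.mean(axis=1, keepdims=True))  # (C, 1, W)
    return f * gh * gw

def context_mask(f_e):
    """Stand-in for CM: a spatial mask in [0, 1] from the edge features."""
    return sigmoid(f_e.mean(axis=0, keepdims=True))  # (1, H, W)

def fefm_sketch(f_l, f_e):
    """F_L' = CA(F_L);  F_L'' = CM(F_E) * F_L';  F_O = concat(F_L'', F_E)."""
    f_l1 = coordinate_attention(f_l)
    f_l2 = context_mask(f_e) * f_l1
    return np.concatenate([f_l2, f_e], axis=0)
```

The concatenated output doubles the channel count, carrying both the context-modulated low-level features and the raw edge features forward to the detection head.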
To validate the effectiveness of HBGF-YOLO, I conducted extensive experiments on a dataset of 3,225 infrared images, expanded to 6,773 through data augmentation techniques such as random rotation and noise addition. The dataset includes five categories of defects, with annotations for each photovoltaic panel component. The models were trained using stochastic gradient descent (SGD) with an initial learning rate of 0.01, a batch size of 16, and 300 epochs on a system with an NVIDIA RTX 4060 Ti GPU. Evaluation metrics included precision, recall, mAP, parameter count, and FLOPs.
The following table summarizes the comparison of HBGF-YOLO with other backbone networks on the photovoltaic panel dataset:
| Backbone Network | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (10^6) | FLOPs (G) |
|---|---|---|---|---|---|---|
| BottleNeck | 84.7 | 78.3 | 83.4 | 70.9 | 2.6 | 6.3 |
| ShuffleNetV2 | 80.3 | 73.8 | 78.7 | 66.9 | 1.6 | 3.6 |
| EfficientViT | 84.0 | 77.1 | 80.7 | 68.7 | 3.8 | 8.1 |
| MobileNetV4 | 84.1 | 77.3 | 80.9 | 68.7 | 5.4 | 21.0 |
| StarNet | 85.6 | 76.5 | 81.2 | 68.9 | 1.9 | 5.0 |
| FasterNet | 85.8 | 80.2 | 83.1 | 70.5 | 3.9 | 9.2 |
| HGNetV2 | 84.6 | 77.5 | 82.9 | 70.8 | 2.2 | 5.9 |
| Rep-HGNetV2 | 84.5 | 81.2 | 84.2 | 71.9 | 2.1 | 5.7 |
As shown, Rep-HGNetV2 achieves a balance between accuracy and efficiency, with higher mAP scores and reduced computational costs compared to other backbones. This makes it ideal for integration into HBGF-YOLO for photovoltaic panel inspection.
Next, I evaluated the impact of different feature fusion networks and attention mechanisms. The table below presents the results when combining Rep-HGNetV2 with BiFPN and GLSA:
| Feature Fusion Network | Attention Mechanism | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (10^6) | FLOPs (G) |
|---|---|---|---|---|---|---|---|
| PANet | None | 84.5 | 81.2 | 84.2 | 71.9 | 2.1 | 5.7 |
| BiFPN | None | 85.1 | 81.5 | 84.7 | 72.8 | 1.8 | 6.1 |
| BiFPN | CAFM | 82.4 | 80.3 | 83.7 | 71.0 | 2.3 | 8.4 |
| BiFPN | CPCA | 83.5 | 79.2 | 84.5 | 71.7 | 1.8 | 7.3 |
| BiFPN | MLCA | 85.9 | 79.9 | 84.9 | 72.3 | 1.5 | 5.6 |
| BiFPN | SegNext | 84.2 | 82.2 | 85.1 | 72.3 | 1.7 | 6.4 |
| BiFPN | GLSA | 86.2 | 82.6 | 85.2 | 72.6 | 1.7 | 6.2 |
GLSA demonstrates superior performance in precision and recall, with manageable parameter and FLOP counts, confirming its effectiveness in highlighting small target defects in photovoltaic panels.
Ablation studies were conducted to isolate the contributions of each component in HBGF-YOLO. The results are summarized in the following table, where “RH” denotes Rep-HGNetV2, “BF” for BiFPN, “GL” for GLSA, and “FE” for FEFM:
| Components | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (10^6) | FLOPs (G) |
|---|---|---|---|---|---|---|
| Baseline (YOLO11n) | 84.7 | 78.3 | 83.4 | 70.9 | 2.6 | 6.3 |
| RH | 84.5 | 81.2 | 84.2 | 71.9 | 2.1 | 5.7 |
| RH + BF | 85.1 | 81.5 | 84.7 | 72.8 | 1.8 | 6.1 |
| RH + BF + GL | 86.2 | 82.6 | 85.2 | 72.6 | 1.7 | 6.2 |
| RH + BF + GL + FE | 87.3 | 81.7 | 85.5 | 73.2 | 1.6 | 5.8 |
The incremental improvements validate the synergy between the modules, with the full HBGF-YOLO model achieving the highest accuracy and efficiency. The precision-recall curve also shows that HBGF-YOLO maintains higher precision across recall levels, reducing false positives in defect detection for solar panels.
Furthermore, I tested HBGF-YOLO on the MS COCO dataset to assess its generalization capability. The results, compared to other models, are shown below:
| Model | mAP@0.5:0.95 (%) | Params (10^6) | FLOPs (G) |
|---|---|---|---|
| Faster R-CNN | 36.4 | 42.1 | 207.0 |
| SSD | 25.1 | 27.4 | 33.6 |
| YOLOv5n | 28.0 | 1.9 | 4.5 |
| YOLOv7-tiny | 37.4 | 6.2 | 13.7 |
| YOLOv8n | 37.3 | 3.2 | 8.7 |
| YOLOv9t | 27.5 | 4.5 | 10.5 |
| YOLOv10n | 29.8 | 3.5 | 9.2 |
| YOLO11n | 39.5 | 2.6 | 6.5 |
| HBGF-YOLO | 40.3 | 1.7 | 5.9 |
HBGF-YOLO achieves competitive mAP scores with lower computational demands, underscoring its versatility for various object detection tasks beyond photovoltaic panel inspection.
In conclusion, HBGF-YOLO presents a robust solution for infrared small target defect detection in photovoltaic panels using UAV aerial photography. By integrating Rep-HGNetV2, BiFPN, GLSA, and FEFM, the algorithm enhances detection accuracy while maintaining real-time performance on embedded devices. The experimental results on both custom and public datasets confirm its superiority over existing methods, making it a practical tool for large-scale solar farm inspections. Future work will focus on optimizing the network for higher resolution processing and adapting to more diverse environmental conditions to further improve the reliability of photovoltaic energy systems.
