HBGF-YOLO: Infrared Small Target Defect Detection for Solar Panels in UAV Aerial Imagery

With the rapid advancement of renewable energy technologies, solar panels have become a critical component in global efforts to achieve carbon neutrality. The efficient operation and maintenance of solar panels are essential for maximizing energy output and ensuring system reliability. However, defects such as hot spots, cracks, and dirt accumulation on solar panels can significantly reduce their performance and lifespan. Traditional inspection methods for solar panels often involve manual checks or ground-based systems, which are time-consuming, labor-intensive, and prone to human error. In recent years, unmanned aerial vehicles (UAVs) equipped with infrared cameras have emerged as a promising solution for large-scale inspection of solar panels. These systems can capture thermal images that reveal defects not visible to the naked eye, such as abnormal heat distributions indicative of underlying issues. Despite these advantages, detecting small target defects in infrared images remains challenging due to factors like low resolution, complex backgrounds, and the subtle nature of these anomalies. This paper addresses these challenges by proposing an improved object detection algorithm, HBGF-YOLO, based on the YOLO11n framework, specifically designed for infrared small target defect detection in solar panels.

Existing object detection algorithms, particularly those based on deep learning, have shown remarkable success in various applications. For instance, the YOLO series has been widely adopted for real-time detection tasks due to its balance between speed and accuracy. However, when applied to infrared images of solar panels, standard models like YOLO11n often struggle with small targets and complex environmental conditions. The limitations include insufficient feature extraction for tiny defects, high computational complexity that hinders deployment on resource-constrained UAV platforms, and inadequate fusion of multi-scale features. To overcome these issues, HBGF-YOLO incorporates several key innovations: a lightweight backbone network (Rep-HGNetV2) that enhances feature representation while reducing parameters, a bidirectional feature pyramid network (BiFPN) combined with a global-local self-attention mechanism (GLSA) for effective multi-scale feature integration, and a feature enhancement fusion module (FEFM) that dynamically weights features to emphasize critical defect regions. These improvements collectively enhance the detection precision and recall for small targets in solar panels, making the algorithm suitable for real-time UAV-based inspections.

The backbone network plays a crucial role in feature extraction, and in HBGF-YOLO, we replace the standard backbone with Rep-HGNetV2. This network combines hierarchical gradient feature extraction with re-parameterized convolution techniques, leading to a more efficient model. The HGNetV2 architecture consists of an HGStem preprocessing layer and multiple HGBlock modules. The HGStem layer uses efficient convolution structures to extract initial features with low computational cost, while the HGBlock modules employ a hierarchical processing mechanism to capture features at different scales. A key component is the use of depthwise separable convolution (DWConv), which significantly reduces the number of parameters and computational load compared to standard convolution. For an input feature map with dimensions H (height), W (width), and Cin (input channels), and a kernel size of K×K with Cout output channels, the parameter count P and computational load F for standard convolution are given by:

$$ P = C_{in} \cdot C_{out} \cdot K \cdot K $$

$$ F = C_{in} \cdot C_{out} \cdot W \cdot H \cdot K \cdot K $$

For DWConv, the parameter count P_DW and computational load F_DW are:

$$ P_{DW} = C_{in} \cdot K \cdot K + C_{in} \cdot C_{out} $$

$$ F_{DW} = C_{in} \cdot W \cdot H \cdot K \cdot K + C_{in} \cdot C_{out} \cdot W \cdot H $$

The ratios can be derived as:

$$ \frac{P_{DW}}{P} = \frac{F_{DW}}{F} = \frac{1}{C_{out}} + \frac{1}{K^2} $$
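As a quick numeric check of this ratio, the following PyTorch sketch (the module choices are illustrative, not taken from the paper's code) counts the parameters of a standard 3×3 convolution and its depthwise-separable counterpart and compares the measured ratio against the formula:

```python
import torch.nn as nn

c_in, c_out, k = 64, 128, 3

# Standard convolution: C_in * C_out * K * K weights (bias omitted for clarity).
std = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)

# Depthwise-separable: a KxK depthwise conv (groups=C_in) followed by a
# 1x1 pointwise conv, giving C_in*K*K + C_in*C_out weights in total.
dws = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False),
    nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),
)

p_std = sum(p.numel() for p in std.parameters())   # 73728
p_dws = sum(p.numel() for p in dws.parameters())   # 576 + 8192 = 8768
print(p_dws / p_std)             # ~0.1189
print(1 / c_out + 1 / k ** 2)    # ~0.1189, matching the derived ratio
```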

This demonstrates the efficiency of DWConv, as it requires only a fraction of the parameters and computations of standard convolution. To further enhance performance, we integrate RepConv into the HGBlock, forming Rep-HGBlock. During training, RepConv uses multiple branches (e.g., 3×3 convolution, 1×1 convolution, and batch normalization) to learn diverse features, while during inference, these branches are re-parameterized into a single 3×3 convolution kernel. This approach retains the benefits of multi-branch training without increasing inference time, improving the network’s ability to detect fine-grained defects in solar panels.
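The re-parameterization step itself follows from the linearity of convolution and batch normalization. The sketch below is a minimal illustration of the standard folding identities, not the authors' implementation; it assumes the 3×3 and 1×1 branches share stride and channel counts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm statistics into the preceding convolution's weights."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sigma
    fused.weight.data = conv.weight.data * scale.view(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = bn.bias.data + (bias - bn.running_mean) * scale
    return fused

def merge_branches(conv3: nn.Conv2d, conv1: nn.Conv2d) -> nn.Conv2d:
    """Absorb a fused 1x1 branch into a fused 3x3 branch: since convolution is
    linear, summing branch outputs equals convolving with the summed kernels,
    so the 1x1 kernel is zero-padded to 3x3 and added."""
    conv3.weight.data += F.pad(conv1.weight.data, [1, 1, 1, 1])
    conv3.bias.data += conv1.bias.data
    return conv3
```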

For feature fusion, HBGF-YOLO employs BiFPN, which facilitates bidirectional cross-scale connections and weighted feature integration. Unlike traditional feature pyramid networks, BiFPN allows both top-down and bottom-up information flow, enabling better fusion of high-level semantic features with low-level spatial details. This is particularly important for solar panels, where defects can vary in size and appearance. The BiFPN structure enhances the model’s ability to detect both large-scale thermal distributions and small-scale anomalies. Additionally, the GLSA mechanism is introduced between the backbone and BiFPN to capture global context and local details simultaneously. GLSA splits the input features into two parts: one processed by a global spatial attention (GSA) branch that models long-range dependencies, and the other by a local spatial attention (LSA) branch that focuses on pixel-level details. The outputs are combined and compressed using a 1×1 convolution, resulting in features that are both context-aware and detail-rich. The GLSA operations can be summarized as follows:

$$ X_0, X_1 = \text{Split}(X) $$

$$ \text{GAtt}(X_0) = \text{Softmax}(\text{Transpose}(\text{Conv}_{1×1}(X_0))) $$

$$ \text{GSA}(X_0) = \text{MLP}(\text{GAtt}(X_0) \otimes X_0) + X_0 $$

$$ \text{LAtt}(X_1) = \text{Sigmoid}(\text{Conv}_{1×1}(\text{DWConv}_{3×3}(\text{Conv}_{1×1}(X_1))) + \epsilon) $$

$$ \text{LSA}(X_1) = \text{LAtt}(X_1) \otimes X_1 + X_1 $$

$$ Y = \text{Conv}_{1×1}(\text{Concat}(\text{GSA}(X_0), \text{LSA}(X_1))) $$
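A compact PyTorch sketch of this split is given below. It is a hedged reconstruction from the equations above: the MLP width, activation, and the exact form of the global weighting are assumptions, and the input channel count is assumed even so the split is balanced.

```python
import torch
import torch.nn as nn

class GLSA(nn.Module):
    """Minimal sketch of the global-local self-attention split described above."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.g_att = nn.Conv2d(half, 1, kernel_size=1)       # logits for spatial softmax
        self.g_mlp = nn.Sequential(nn.Conv2d(half, half, 1), nn.GELU(),
                                   nn.Conv2d(half, half, 1))
        self.l_att = nn.Sequential(                           # 1x1 -> 3x3 DW -> 1x1 gate
            nn.Conv2d(half, half, 1),
            nn.Conv2d(half, half, 3, padding=1, groups=half),
            nn.Conv2d(half, half, 1),
            nn.Sigmoid(),
        )
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x0, x1 = torch.chunk(x, 2, dim=1)                     # Split(X)
        b, _, h, w = x0.shape
        # Global branch: softmax over all H*W positions yields long-range weights.
        g_w = torch.softmax(self.g_att(x0).view(b, 1, -1), dim=-1).view(b, 1, h, w)
        gsa = self.g_mlp(g_w * x0) + x0                       # GSA(X0)
        # Local branch: per-pixel sigmoid gate preserves fine detail.
        lsa = self.l_att(x1) * x1 + x1                        # LSA(X1)
        return self.out(torch.cat([gsa, lsa], dim=1))         # Y

# Usage: y = GLSA(256)(torch.randn(1, 256, 40, 40))  # output keeps the input shape
```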

To further refine feature fusion, the FEFM module is incorporated. FEFM uses a two-stage process: first, it enhances low-level features with edge information using coordinate attention (CA), and then it applies context modeling (CM) to generate channel weights for adaptive fusion. Given low-level features F_L and high-level features F_H (or edge features F_E), the steps are:

$$ F_L' = \text{CA}(F_L) $$

$$ M_{CM} = \text{CM}(F_E) $$

$$ F_L'' = M_{CM} \otimes F_L' $$

$$ F_O = \text{Concat}(F_L'', F_E) $$

The CA module encodes spatial information by performing average pooling along the height and width dimensions, generating attention weights that highlight important regions. For an input feature map x, the vertical and horizontal responses are computed as:

$$ z_c^h(h) = \frac{1}{W} \sum_{0 \leq i < W} x_c(h, i) $$

$$ z_c^w(w) = \frac{1}{H} \sum_{0 \leq j < H} x_c(j, w) $$

These are then combined and processed through convolutional layers to produce the final attention maps. The CM module uses a 1×1 convolution and softmax to generate attention maps that capture contextual relationships, enhancing the fusion of features across different levels.
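To make the two-stage flow concrete, here is a compact PyTorch sketch of a coordinate-attention module and an FEFM-style fusion built on it. This is a hedged reconstruction from the equations above: the reduction ratio, channel counts, and the exact form of the CM branch are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention: pool along W and H separately, encode jointly,
    then split into two directional attention maps (reduction r is assumed)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        mid = max(8, channels // r)
        self.encode = nn.Sequential(nn.Conv2d(channels, mid, 1),
                                    nn.BatchNorm2d(mid), nn.ReLU())
        self.att_h = nn.Conv2d(mid, channels, 1)
        self.att_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                 # (B, C, H, 1): average over W
        z_w = x.mean(dim=2, keepdim=True)                 # (B, C, 1, W): average over H
        y = self.encode(torch.cat([z_h, z_w.transpose(2, 3)], dim=2))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.att_h(y_h))              # height-wise weights
        a_w = torch.sigmoid(self.att_w(y_w.transpose(2, 3)))  # width-wise weights
        return x * a_h * a_w

class FEFM(nn.Module):
    """Two-stage fusion: CA refines the low-level features, a 1x1-conv + softmax
    context-modeling branch derives channel weights from the edge features, and
    the gated result is concatenated with the edge features."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = CoordAtt(channels)
        self.cm = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Softmax(dim=1))

    def forward(self, f_low: torch.Tensor, f_edge: torch.Tensor) -> torch.Tensor:
        f_l = self.ca(f_low)                              # F_L' = CA(F_L)
        m = self.cm(f_edge)                               # M_CM = CM(F_E)
        return torch.cat([m * f_l, f_edge], dim=1)        # Concat(F_L'', F_E)
```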

To evaluate HBGF-YOLO, we conducted experiments on a custom dataset of infrared images of solar panels, collected using a DJI M300 RTK UAV equipped with a FLIR XT2 camera. The dataset includes 3225 images annotated with five types of defects: large-area hot spots, single hot spots, abnormal low temperature, diode short circuits, and normal states. Each image contains multiple solar panels, and annotations were done manually using LabelImg. To address data scarcity, we applied data augmentation techniques such as random rotation, horizontal flipping, and noise addition, expanding the training set to 6773 images. The dataset was split into training (70%), validation (20%), and test (10%) sets. Experimental setup included a Windows 11 OS, Intel i5-12400F processor, Python 3.9, PyTorch 1.12, and an RTX 4060Ti GPU. Training parameters were set to 300 epochs, batch size of 16, input size of 640×640, SGD optimizer, and initial learning rate of 0.01.
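For reproducibility, a training run with these hyperparameters would look roughly like the following Ultralytics-style call; the model and dataset YAML names are hypothetical placeholders, since the paper does not specify its configuration files:

```python
from ultralytics import YOLO

# "hbgf-yolo.yaml" and "solar_ir.yaml" are placeholder names for the custom
# model definition and dataset config; substitute the actual files.
model = YOLO("hbgf-yolo.yaml")
model.train(
    data="solar_ir.yaml",
    epochs=300, batch=16, imgsz=640,   # training schedule from the paper
    optimizer="SGD", lr0=0.01,         # SGD with initial learning rate 0.01
)
```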

Performance was measured using precision (P), recall (R), mean average precision (mAP) at IoU thresholds of 0.5 and 0.5:0.95, parameter count (Params), and computational load (FLOPs). The formulas are:

$$ P = \frac{TP}{TP + FP} $$

$$ R = \frac{TP}{TP + FN} $$

$$ AP = \int_0^1 P(R) dR $$

$$ mAP = \frac{1}{n} \sum_{i=1}^n AP_i $$

where TP, FP, and FN represent true positives, false positives, and false negatives, respectively.
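A minimal NumPy sketch of these metrics (using the common all-point interpolation of the P-R curve, which simplifies the full COCO protocol) might look like:

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the P-R curve with precision made monotonically decreasing,
    as in common YOLO-style evaluators (points must be sorted by recall)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # precision envelope
    return float(np.trapz(p, r))

# mAP averages AP over classes (and, for mAP@0.5:0.95, over IoU thresholds).
```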

We first compared different backbone networks integrated into YOLO11n. As shown in Table 1, Rep-HGNetV2 achieved the best balance between accuracy and efficiency, with mAP@0.5 of 84.2% and mAP@0.5:0.95 of 71.9%, while reducing parameters by 19.2% (2.6 M to 2.1 M) and FLOPs by 9.5% (6.3 G to 5.7 G) compared to the original YOLO11n. This makes it suitable for deployment on UAV platforms. Next, we evaluated feature fusion networks, and BiFPN outperformed PANet with higher mAP scores, as detailed in Table 2. For attention mechanisms, GLSA showed superior precision and recall compared to alternatives such as CAFM and MLCA, as summarized in Table 3. Ablation studies in Table 4 demonstrated that each component of HBGF-YOLO contributes to the overall improvement, with the full model achieving P=87.3%, R=81.7%, mAP@0.5=85.5%, and mAP@0.5:0.95=73.2%, while reducing parameters by 38.5% and FLOPs by 7.9% relative to YOLO11n.

Table 1: Comparison of Backbone Networks Integrated into YOLO11n

| Backbone Network | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (10^6) | FLOPs (G) |
|---|---|---|---|---|---|---|
| BottleNeck | 84.7 | 78.3 | 83.4 | 70.9 | 2.6 | 6.3 |
| ShuffleNetV2 | 80.3 | 73.8 | 78.7 | 66.9 | 1.6 | 3.6 |
| EfficientViT | 84.0 | 77.1 | 80.7 | 68.7 | 3.8 | 8.1 |
| MobileNetV4 | 84.1 | 77.3 | 80.9 | 68.7 | 5.4 | 21.0 |
| Rep-HGNetV2 | 84.5 | 81.2 | 84.2 | 71.9 | 2.1 | 5.7 |

Table 2: Comparison of Feature Fusion Networks

| Neck Network | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (10^6) | FLOPs (G) |
|---|---|---|---|---|---|---|
| PANet | 84.5 | 81.2 | 84.2 | 71.9 | 2.1 | 5.7 |
| BiFPN | 85.1 | 81.5 | 84.7 | 72.8 | 1.8 | 6.1 |

Table 3: Comparison of Attention Mechanisms

| Attention Mechanism | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (10^6) | FLOPs (G) |
|---|---|---|---|---|---|---|
| CAFM | 82.4 | 80.3 | 83.7 | 71.0 | 2.3 | 8.4 |
| CPCA | 83.5 | 79.2 | 84.5 | 71.7 | 1.8 | 7.3 |
| MLCA | 85.9 | 79.9 | 84.9 | 72.3 | 1.5 | 5.6 |
| GLSA | 86.2 | 82.6 | 85.2 | 72.6 | 1.7 | 6.2 |

Table 4: Ablation Study of HBGF-YOLO Components

| Components | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (10^6) | FLOPs (G) |
|---|---|---|---|---|---|---|
| YOLO11n (Baseline) | 84.7 | 78.3 | 83.4 | 70.9 | 2.6 | 6.3 |
| + Rep-HGNetV2 | 84.5 | 81.2 | 84.2 | 71.9 | 2.1 | 5.7 |
| + Rep-HGNetV2 + BiFPN | 85.1 | 81.5 | 84.7 | 72.8 | 1.8 | 6.1 |
| + Rep-HGNetV2 + BiFPN + GLSA | 86.2 | 82.6 | 85.2 | 72.6 | 1.7 | 6.2 |
| + Rep-HGNetV2 + BiFPN + GLSA + FEFM | 87.3 | 81.7 | 85.5 | 73.2 | 1.6 | 5.8 |

Comparative experiments with other state-of-the-art models on the solar panel dataset are presented in Table 5. HBGF-YOLO outperformed methods such as Faster R-CNN, SSD, and lightweight YOLO variants in precision, recall, and mAP while maintaining lower computational cost. For instance, its mAP@0.5 is 2.1 percentage points higher than YOLO11n's and 6.8 points higher than YOLOv3-tiny+'s. Visualizations such as P-R curves and heatmaps further confirmed the effectiveness of HBGF-YOLO in detecting small targets, with GLSA sharpening the focus on defect regions. Additionally, tests on the MS COCO dataset, shown in Table 6, demonstrated the algorithm's generalization capability: HBGF-YOLO achieved competitive mAP@0.5:0.95 scores while using fewer parameters and FLOPs than the other models.

Table 5: Comparison with Other Models on Solar Panel Dataset

| Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (10^6) | FLOPs (G) |
|---|---|---|---|---|---|---|
| Faster R-CNN | 71.2 | 70.9 | 74.3 | 61.8 | 43.6 | 207.0 |
| SSD | 65.8 | 68.0 | 68.7 | 59.5 | 25.2 | 34.3 |
| YOLOv3-tiny+ | 80.5 | 77.2 | 78.7 | 68.6 | 9.4 | 15.7 |
| YOLOv5n | 77.7 | 76.2 | 78.5 | 65.9 | 2.1 | 5.8 |
| YOLO11n | 84.7 | 78.3 | 83.4 | 70.9 | 2.6 | 6.3 |
| HBGF-YOLO | 87.3 | 81.7 | 85.5 | 73.2 | 1.6 | 5.8 |

Table 6: Performance on MS COCO Dataset

| Model | mAP@0.5:0.95 (%) | Params (10^6) | FLOPs (G) |
|---|---|---|---|
| Faster R-CNN | 36.4 | 42.1 | 207.0 |
| SSD | 25.1 | 27.4 | 33.6 |
| YOLOv5n | 28.0 | 1.9 | 4.5 |
| YOLOv8n | 37.3 | 3.2 | 8.7 |
| HBGF-YOLO | 40.3 | 1.7 | 5.9 |

In conclusion, HBGF-YOLO presents an effective solution for infrared small target defect detection in solar panels using UAV aerial imagery. By integrating Rep-HGNetV2, BiFPN, GLSA, and FEFM, the algorithm achieves higher precision and recall while reducing computational complexity, making it suitable for real-time applications on embedded devices. Future work will focus on further optimizing the network for higher resolution processing, better environmental adaptability, and reduced resource consumption, ultimately facilitating the widespread adoption of UAV-based inspection systems for solar panels. The continuous improvement of such technologies is vital for maintaining the efficiency and longevity of solar energy systems, contributing to global sustainability goals.
