Deep Learning-Based Hot Spot Detection in Aerial Infrared Images of Photovoltaic Panels

With worsening environmental pollution and energy shortages, renewable energy generation technologies have developed rapidly, and photovoltaic systems are now widely integrated into power grids. In recent years, costs across the photovoltaic industry have fallen steadily while production capacity has grown rapidly. However, photovoltaic panels in solar power stations are highly susceptible to obstructions such as dust, bird droppings, and fallen leaves. If these are not cleaned promptly, the shaded cells can turn into loads that consume rather than generate energy, producing hot spot faults. Hot spots significantly reduce the power generation efficiency of solar panels and, in severe cases, damage entire panels or even pose fire hazards. Detecting hot spots in photovoltaic panels is therefore crucial for the routine maintenance of solar power stations.

Current methods for inspecting photovoltaic panel defects primarily involve image processing, thermal imaging, and neural networks. Traditional approaches, such as those based on grayscale-histogram processing, struggle with reflective noise in aerial infrared images. Other techniques, including infrared image analysis under different working conditions, are strongly influenced by environmental factors. Studies that employ deep convolutional autoencoder networks or transfer learning for small-sample hot spot recognition are limited by small sample sizes and incomplete coverage of panel conditions. Support vector machine (SVM)-based methods suffer from long training times and are unsuitable for real-time drone inspection. To address these challenges, we propose a rapid hot spot detection method for photovoltaic panels that combines deep convolutional neural networks with unmanned aerial vehicle (UAV) inspection technology.

Our approach involves two main stages: photovoltaic panel recognition and hot spot segmentation. For panel recognition, we design a model based on YOLOv4, replacing its backbone feature extraction network with the lightweight MobileNetV2 and substituting standard 3×3 convolutions in the PANet with depthwise separable convolutions. This enables rapid identification of photovoltaic panels from infrared images. For hot spot segmentation, we integrate MobileNetV2 into the DeepLabV3+ model, modify the downsampling factor to mitigate target loss, and change the loss function from cross-entropy to Dice loss to enhance segmentation accuracy. Experimental results demonstrate that our method accurately identifies hot spots in solar panels, with high precision and speed suitable for real-time fault detection.

Data Acquisition and Preprocessing

We collected infrared image data from a photovoltaic power station in Liaoning Province, China, using a DJI Matrice 300 UAV equipped with an XT2 thermal camera. Images were captured at a height of 30 meters, resulting in 2,188 infrared images. After filtering, 1,557 images containing photovoltaic panels were selected for training and testing. To address the limited sample size and lack of diversity, we applied data augmentation techniques, including random rotation (0° to 120°), image compression to simulate varying altitudes and resolutions, and contrast adjustments to mimic different lighting conditions. This expanded the dataset to 7,785 images for panel recognition model training. For hot spot segmentation, 1,610 images with annotated hot spots were used, after distinguishing real hot spots from reflective noise through manual screening.

The data augmentation process improves model robustness by introducing variability in orientation, scale, and illumination. Random rotation compensates for the uniform panel orientation produced by fixed UAV flight paths, while compression and contrast changes simulate real-world operating conditions. The augmented dataset helps the deep learning models generalize to unseen data and reduces the risk of overfitting, as sketched below.
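
For reference, a minimal sketch of such an augmentation pipeline using torchvision; the parameter values shown are illustrative, not our exact settings.

```python
# Sketch of the augmentation pipeline described above (torchvision).
from torchvision import transforms

augment = transforms.Compose([
    # Random rotation in [0°, 120°] breaks the uniform panel orientation
    # imposed by fixed UAV flight paths.
    transforms.RandomRotation(degrees=(0, 120)),
    # Downscale then upscale to simulate lower-resolution captures taken
    # from higher flight altitudes.
    transforms.Resize(208),
    transforms.Resize(416),
    # Contrast jitter mimics different illumination conditions.
    transforms.ColorJitter(contrast=0.4),
])
```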

Photovoltaic Panel Recognition Model

We base our panel recognition on an improved YOLOv4 algorithm. YOLOv4 comprises a backbone feature extraction network (CSPDarknet53), a spatial pyramid pooling (SPP) network, a path aggregation network (PANet), and a prediction head (YOLO-Head). CSPDarknet53 uses the Mish activation function to enhance feature extraction and network stability. The Mish activation function is defined as:

$$ \text{Mish}(x) = x \times \tanh(\ln(1 + e^x)) $$
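
For reference, this activation is a one-liner in PyTorch (recent PyTorch releases also ship it as `torch.nn.Mish`):

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish(x) = x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x).
    return x * torch.tanh(F.softplus(x))
```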

The loss function employed is CIoU-Loss, which considers overlapping area, center point distance, and aspect ratio consistency. The CIoU loss is expressed as:

$$ \text{Loss}_{\text{CIoU}} = 1 - \text{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v $$

where:

$$ \alpha = \frac{v}{(1 - \text{IoU}) + v} $$

and:

$$ v = \frac{4}{\pi^2} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^2 $$

Here, \(b\) and \(b^{gt}\) are the center points of the predicted and ground truth boxes, \(\rho\) is the Euclidean distance, \(c\) is the diagonal distance of the smallest enclosing box, and \(v\) measures the consistency of the aspect ratio.
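
For concreteness, the sketch below computes this loss for batches of corner-format boxes; it is a simplified reference implementation, not our exact training code.

```python
# Reference sketch of the CIoU loss for boxes in (x1, y1, x2, y2) format,
# shape (N, 4).
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    # Intersection over union.
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # rho^2 / c^2: squared center distance over squared diagonal of the
    # smallest enclosing box.
    rho2 = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])) ** 2 / 4 \
         + ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])) ** 2 / 4
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # v: aspect-ratio consistency term; alpha: its trade-off weight.
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps))
                              - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```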

To reduce computational complexity and model size, we replace the CSPDarknet53 backbone with MobileNetV2, a lightweight network built on depthwise separable convolutions. MobileNetV2 employs inverted residual blocks, which first expand the channel dimension with a 1×1 convolution, apply a 3×3 depthwise convolution, and then project back with a 1×1 convolution. This structure sharply reduces the parameter count while maintaining performance. Additionally, we replace the standard 3×3 convolutions in PANet with depthwise separable convolutions, further cutting computational cost. The modified network, termed MobileNetV2-YOLOv4-lite, balances accuracy and efficiency.
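
The block structure can be sketched as follows; this is a minimal PyTorch version of an inverted residual block, omitting the full MobileNetV2 configuration details.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    # MobileNetV2 inverted residual block: 1x1 expansion, 3x3 depthwise
    # convolution, 1x1 linear projection (no activation after projection).
    def __init__(self, c_in, c_out, stride=1, expand=6):
        super().__init__()
        hidden = c_in * expand
        self.use_res = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),            # expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),              # depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),           # project
            nn.BatchNorm2d(c_out),                             # linear bottleneck
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out
```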

The backbone feature extraction network processes input images of size 416×416×3, producing preliminary feature maps at scales of 13×13, 26×26, and 52×52. These maps contain semantic information at different levels and are fed into the SPP and PANet for enhanced feature fusion. The SPP performs multi-scale pooling on the 13×13 feature map, followed by concatenation and 3×3 convolution. The PANet then fuses these features with those from 26×26 and 52×52 maps to generate refined outputs. By using depthwise separable convolutions, we minimize the parameter count in these steps.
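
A minimal sketch of the SPP step follows; pool sizes 5, 9, and 13 are the values commonly used in YOLOv4 (an assumption here), with stride 1 and matching padding so the 13×13 resolution is preserved.

```python
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    # Max-pool the 13x13 feature map at several kernel sizes and
    # concatenate the results with the input along the channel axis.
    def __init__(self, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in pool_sizes]
        )

    def forward(self, x):
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```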

Finally, the model outputs bounding box coordinates (top, left, bottom, right) to identify and crop photovoltaic panels from the background. This step isolates the panels for subsequent hot spot analysis, eliminating interference from ground clutter.

Hot Spot Segmentation Model

For hot spot segmentation, we adopt an improved DeepLabV3+ model. DeepLabV3+ consists of an encoder-decoder structure, where the encoder extracts features using atrous spatial pyramid pooling (ASPP) and the decoder refines the segmentation by combining low- and high-level features. The original DeepLabV3+ uses Xception as the backbone, but we replace it with MobileNetV2 to reduce parameters and accelerate computation. This modified model is referred to as DeepLabV3+_MobileNetV2.

The encoder utilizes MobileNetV2 with atrous convolutions at rates of 6, 12, and 18 to capture multi-scale contextual information. The features are then merged and compressed via 1×1 convolution to produce high-level features. In the decoder, low-level features from the encoder are reduced in dimension using 1×1 convolution and fused with the high-level features. This fusion helps recover object boundaries, followed by 3×3 convolution and 4× upsampling to generate the final prediction map.
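
The ASPP stage can be sketched as parallel dilated convolutions whose outputs are concatenated and compressed; batch normalization and the global-pooling branch of the full module are omitted here for brevity.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    # Parallel atrous (dilated) 3x3 convolutions at rates 6, 12, and 18,
    # plus a 1x1 branch; the concatenated features are compressed back to
    # `c_out` channels by a 1x1 convolution.
    def __init__(self, c_in, c_out=256, rates=(6, 12, 18)):
        super().__init__()
        branches = [nn.Conv2d(c_in, c_out, 1, bias=False)]
        branches += [
            nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r, bias=False)
            for r in rates
        ]
        self.branches = nn.ModuleList(branches)
        self.project = nn.Conv2d(c_out * len(branches), c_out, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```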

To address the issue of small target loss due to downsampling, we reduce the downsampling factor from 16 to 8. This change preserves finer details of hot spots, which are critical for accurate segmentation. Moreover, since hot spots constitute a small portion of the image, leading to class imbalance, we replace the standard cross-entropy loss with Dice loss. The Dice loss function is defined as:

$$ \text{DL} = 1 - \frac{2 \sum_{i=1}^{I} \sum_{j=1}^{N} p_{ij} g_{ij}}{\sum_{i=1}^{I} \sum_{j=1}^{N} \left( p_{ij} + g_{ij} \right)} $$

where \(N\) is the number of pixels, \(I\) is the number of classes (set to 2 for hot spot and background), \(p_{ij}\) is the predicted probability of pixel \(j\) belonging to class \(i\), and \(g_{ij}\) is the ground truth label.

The Dice loss is particularly effective for imbalanced datasets as it focuses on the overlap between predictions and ground truth, reducing the influence of dominant background pixels.
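
A minimal implementation of the loss above, with a small smoothing constant added to avoid division by zero:

```python
import torch

def dice_loss(probs: torch.Tensor, one_hot: torch.Tensor, eps: float = 1e-6):
    # probs: predicted class probabilities (after softmax), shape (B, I, H, W);
    # one_hot: one-hot ground truth of the same shape. The sums run over
    # all classes and pixels, matching the formula above.
    intersection = (probs * one_hot).sum()
    total = probs.sum() + one_hot.sum()
    return 1.0 - (2.0 * intersection + eps) / (total + eps)
```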

Experimental Setup

We conducted our experiments on a workstation with an Intel Xeon Bronze 3204 CPU @ 1.90 GHz, 32 GB RAM, and an NVIDIA Quadro P5000 GPU. The operating system was Windows 10 64-bit, and we used the PyTorch framework for model development and training.

For both photovoltaic panel recognition and hot spot segmentation, we employed transfer learning to leverage pre-trained weights, accelerating convergence and improving performance given the limited dataset size. The datasets were split into training and validation sets with ratios of 8:2 for panel recognition and 9:1 for hot spot segmentation.

In the panel recognition model training, we divided the process into frozen and unfrozen stages. During the frozen stage, the backbone network was fixed, and the initial learning rate was set to \(1 \times 10^{-3}\) for 50 epochs. In the unfrozen stage, the backbone was trainable, with a learning rate of \(1 \times 10^{-4}\) for another 50 epochs. We repeated each experiment three times and averaged the results.

For hot spot segmentation, the frozen stage used a learning rate of \(1 \times 10^{-3}\) for 35 epochs, and the unfrozen stage used \(1 \times 10^{-4}\) for 35 epochs. Similarly, we conducted three repetitions and reported average metrics.
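
In code, the staged schedule reduces to toggling `requires_grad` on the backbone and rebuilding the optimizer between stages. In the sketch below, `model.backbone` is an assumed attribute name and Adam is an illustrative optimizer choice; neither is fixed by the setup above.

```python
import torch

def configure_stage(model, frozen: bool, lr: float):
    # Freeze or unfreeze the backbone, then rebuild the optimizer over
    # the currently trainable parameters.
    for p in model.backbone.parameters():
        p.requires_grad = not frozen
    return torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)

# Stage 1: backbone frozen, lr = 1e-3 (50 epochs; 35 for segmentation).
# opt = configure_stage(model, frozen=True, lr=1e-3)
# Stage 2: backbone unfrozen, lr = 1e-4 (another 50 or 35 epochs).
# opt = configure_stage(model, frozen=False, lr=1e-4)
```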

Evaluation metrics for panel recognition include average precision (AP), recall, frames per second (FPS), and model size. The intersection over union (IoU) threshold was set to 0.5 for correct predictions. Precision and recall are calculated as:

$$ \text{Precision} = \frac{tp}{tp + fp} $$

$$ \text{Recall} = \frac{tp}{tp + fn} $$

where \(tp\) is true positive, \(fp\) is false positive, and \(fn\) is false negative. AP is the integral of the precision-recall curve:

$$ \text{AP} = \int_{0}^{1} \text{precision}(r) \, dr $$

FPS is computed as the number of images processed per second:

$$ \text{FPS} = \frac{s}{t} $$

with \(s\) being the number of images and \(t\) the processing time.
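
A simplified AP computation over sorted detections is sketched below; real evaluators typically also interpolate the precision envelope, whereas plain trapezoidal integration is used here for brevity.

```python
import numpy as np

def average_precision(confidences, is_true_positive, num_ground_truth):
    # Sort detections by descending confidence, accumulate TP/FP counts,
    # and integrate precision over recall.
    order = np.argsort(-np.asarray(confidences))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)
    recall = cum_tp / num_ground_truth
    return float(np.trapz(precision, recall))
```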

For hot spot segmentation, we use mean pixel accuracy (MPA) and mean intersection over union (mIoU) to assess performance. MPA and mIoU are defined as:

$$ \text{MPA} = \frac{1}{n+1} \sum_{i=0}^{n} \frac{R_{ii}}{\sum_{j=0}^{n} R_{ij}} $$

$$ \text{mIoU} = \frac{1}{n+1} \sum_{i=0}^{n} \frac{R_{ii}}{\sum_{j=0}^{n} R_{ij} + \sum_{j=0}^{n} R_{ji} - R_{ii}} $$

where \(R_{ij}\) is the number of pixels of class \(i\) predicted as class \(j\), and \(n+1\) is the number of classes, including the background.
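
Both metrics follow directly from the confusion matrix \(R\); a minimal NumPy version:

```python
import numpy as np

def mpa_miou(conf_matrix: np.ndarray):
    # conf_matrix[i, j] = number of pixels of class i predicted as class j
    # (the matrix R above), shape (n+1, n+1).
    diag = np.diag(conf_matrix).astype(float)
    per_class_acc = diag / conf_matrix.sum(axis=1)
    per_class_iou = diag / (conf_matrix.sum(axis=1)
                            + conf_matrix.sum(axis=0) - diag)
    return per_class_acc.mean(), per_class_iou.mean()
```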

Results and Analysis

Photovoltaic Panel Detection

We compared our MobileNetV2-YOLOv4-lite model with several baseline detectors: Faster R-CNN, YOLOv4-Tiny, YOLOv5s, and the original YOLOv4. The results are summarized in the table below.

| Model | AP (%) | Recall (%) | FPS | Model Size (MB) |
|---|---|---|---|---|
| Faster R-CNN | 97.9 | 97.63 | 7.7 | 108 |
| YOLOv4-Tiny | 96.94 | 95.42 | 42 | 22.4 |
| YOLOv5s | 95.58 | 96.58 | 20.25 | 27 |
| YOLOv4 | 99.66 | 98.57 | 13.7 | 244 |

As observed, Faster R-CNN has low speed and moderate accuracy, while YOLOv4-Tiny is fast but less accurate. YOLOv5s offers a balance but lags in AP. YOLOv4 achieves the highest AP, making it a suitable base for improvements.

We then replaced the YOLOv4 backbone with MobileNetV1, MobileNetV2, and MobileNetV3, denoted as YOLOv4-V1, YOLOv4-V2, and YOLOv4-V3, respectively. The results are shown in the following table.

| Model | AP (%) | Recall (%) | FPS | Model Size (MB) |
|---|---|---|---|---|
| YOLOv4 | 99.66 | 98.57 | 13.7 | 244 |
| YOLOv4-V1 | 99.62 | 98.03 | 18.2 | 51 |
| YOLOv4-V2 | 99.56 | 98.91 | 22.1 | 46.4 |
| YOLOv4-V3 | 99.61 | 97.89 | 15.9 | 53.6 |

Our YOLOv4-V2 model (MobileNetV2-YOLOv4-lite) achieves an AP of 99.56%, a recall of 98.91%, 22.1 FPS, and a model size of 46.4 MB. It outperforms the other variants in speed and compactness while maintaining high accuracy. The training loss converges steadily, indicating stable learning, and the model reliably identifies and crops photovoltaic panels from complex backgrounds, providing clean inputs for hot spot segmentation.

Hot Spot Segmentation

We evaluated candidate semantic segmentation models for hot spot detection, comparing the original DeepLabv3+ with PSPNet. The results are presented below.

| Model | MPA (%) | mIoU (%) | FPS | Model Size (MB) |
|---|---|---|---|---|
| DeepLabv3+ | 73.26 | 71.34 | 13.7 | 209 |
| PSPNet | 70.88 | 68.54 | 13.99 | 188 |

Both models show similar speed and size, but DeepLabv3+ has higher MPA and mIoU, so we select it for further improvements.

We implemented several modifications to DeepLabv3+, as outlined in the table below.

| Model | MPA (%) | mIoU (%) | FPS | Model Size (MB) |
|---|---|---|---|---|
| A: DeepLabv3+ (original) | 73.26 | 71.34 | 13.7 | 209 |
| B: A + MobileNetV2 | 92.03 | 80.73 | 26.6 | 22.3 |
| C: B + downsampling 8× | 94.95 | 84.31 | 24.1 | 22.3 |
| D: B + Dice loss | 93.21 | 81.66 | 26.8 | 22.3 |
| E: C + Dice loss | 95.99 | 85.58 | 24.5 | 22.3 |

Replacing the backbone with MobileNetV2 (model B) raises MPA by 18.77 percentage points and mIoU by 9.39 percentage points, nearly doubles the FPS, and reduces the model size by 186.7 MB. Reducing the downsampling factor to 8× (model C) further improves MPA and mIoU by 2.92 and 3.58 percentage points, respectively, at a slight FPS cost. Substituting Dice loss (model D) adds 1.18 points of MPA and 0.93 points of mIoU over model B. The combined model E performs best, with an MPA of 95.99%, an mIoU of 85.58%, 24.5 FPS, and a compact 22.3 MB footprint. The training loss curve converges faster and to lower values with Dice loss, confirming its effectiveness for imbalanced segmentation tasks.

Visual results demonstrate that our method accurately segments hot spots, even in the presence of reflective noise. The segmented images clearly highlight hot spot regions, enabling precise fault localization. This capability is vital for maintaining the efficiency and safety of photovoltaic systems.

Discussion

Our proposed method addresses key challenges in photovoltaic panel inspection, such as background clutter and reflective noise. By combining efficient panel recognition with precise hot spot segmentation, we achieve a robust solution for UAV-based real-time monitoring. The use of lightweight networks like MobileNetV2 ensures that the models are deployable on resource-constrained devices, without compromising accuracy.

The improvement in downsampling factor and loss function selection proves critical for small target segmentation. The Dice loss effectively handles class imbalance, while reduced downsampling preserves spatial details. These modifications are particularly beneficial for hot spot detection, where targets are small and irregular.

Compared to existing methods, our approach offers superior speed and accuracy, making it suitable for large-scale photovoltaic farms. The integration of deep learning with UAV technology streamlines the inspection process, reducing manual effort and time.

Conclusion

We have developed a comprehensive deep learning-based framework for detecting hot spots in aerial infrared images of photovoltaic panels. Our method involves a two-stage process: first, rapidly identifying and cropping solar panels using an improved YOLOv4 model with MobileNetV2 backbone and depthwise separable convolutions; second, accurately segmenting hot spots using an enhanced DeepLabV3+ model with MobileNetV2, modified downsampling, and Dice loss. Experimental results validate the effectiveness of our approach, achieving high precision and speed in both stages. This method meets the demands of real-time fault detection in photovoltaic power stations, contributing to improved maintenance and operational efficiency. Future work will focus on classifying hot spots by their causes and further optimizing model performance for diverse environmental conditions.
