In recent years, the rapid advancement of renewable energy technologies, particularly solar power, has led to a significant increase in the deployment of solar panels across distributed power grids. However, solar panels are prone to various faults, with hot spots being one of the most critical issues. Hot spots occur when certain cells within a solar panel become shaded by debris, dust, or bird droppings, causing them to act as loads and dissipate energy as heat. This not only reduces the efficiency of solar panels but also poses fire hazards and can lead to permanent damage. Traditional inspection methods for solar panels are often labor-intensive, time-consuming, and inefficient, especially for large-scale solar farms. With the rise of unmanned aerial vehicle (UAV) technology, aerial infrared imaging has emerged as a powerful tool for monitoring solar panels, as it can capture thermal signatures that reveal temperature anomalies indicative of hot spots. Nevertheless, analyzing these infrared images manually is challenging due to the vast amounts of data and the presence of noise, such as reflections from the solar panels. To address these challenges, we propose a deep learning-based approach for fast and accurate hot spot detection in aerial infrared images of solar panels. Our method leverages convolutional neural networks to first identify solar panels from complex backgrounds and then segment hot spots within the identified panels, enabling real-time fault detection for UAV-based inspections.
The core of our approach lies in combining two deep learning models: an improved YOLOv4 algorithm for solar panel recognition and an enhanced DeepLabV3+ model for hot spot segmentation. We focus on optimizing these models for speed and accuracy, making them suitable for deployment on resource-constrained devices like UAVs. By using lightweight networks and modifying key components, such as the backbone feature extraction network and loss functions, we achieve a balance between computational efficiency and detection performance. In this study, we detail the data acquisition and preprocessing steps, describe the architectural improvements to the models, and present experimental results that demonstrate the effectiveness of our method. Through this work, we aim to contribute to the maintenance and reliability of solar energy systems by providing an automated solution for hot spot detection in solar panels.

To begin, we discuss the data collection process for aerial infrared images of solar panels. The images were captured using a DJI Matrice 300 UAV equipped with an XT2 thermal camera at a height of 30 meters over a solar power station in China. A total of 2,188 infrared images were initially collected, from which 1,557 images containing solar panels were selected for model training and testing. The diversity of the dataset is crucial for training robust deep learning models, as it must account for various environmental conditions, angles, and panel configurations. However, the original dataset was limited in size and variety, which could lead to overfitting. To mitigate this, we applied data augmentation techniques, including random rotation (0° to 120°), image compression to simulate different altitudes, and contrast adjustments to emulate varying lighting conditions. After augmentation, the dataset for solar panel recognition expanded to 7,785 images, while 1,610 images with hot spots were annotated for segmentation tasks. This preprocessing step ensures that our models can generalize well to real-world scenarios, where solar panels may appear in different orientations or under different thermal conditions.
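To make the augmentation step concrete, the following sketch implements two of the transforms in NumPy. The function names and the striding-based downsampling are illustrative stand-ins for the augmentation pipeline, not the exact code used in our experiments:

```python
import numpy as np

def adjust_contrast(image, factor):
    """Scale pixel deviations from the mean to emulate different
    thermal contrast conditions; values are clipped to [0, 255]."""
    mean = image.mean()
    out = (image.astype(np.float64) - mean) * factor + mean
    return np.clip(out, 0, 255).astype(np.uint8)

def simulate_altitude(image, scale):
    """Downsample by integer striding as a crude stand-in for the
    compression used to mimic capture from a higher altitude."""
    return image[::scale, ::scale]
```

Random rotation would be handled analogously with an image library; the key point is that every augmented sample preserves its annotation under the same transform.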
The first stage of our method involves recognizing solar panels in the aerial infrared images. We base this on the YOLOv4 object detection algorithm, which is known for its high accuracy and speed. However, the standard YOLOv4 uses a CSPDarknet53 backbone network, which has a large number of parameters and a high computational cost, making it less suitable for real-time applications on UAVs. To address this, we replace the backbone with lightweight networks, specifically MobileNetV1, MobileNetV2, and MobileNetV3, and compare their performance. MobileNet networks utilize depthwise separable convolutions, which significantly reduce parameters and computations. A depthwise separable convolution decomposes a standard convolution into two steps: a depthwise convolution that filters each input channel separately, and a pointwise convolution that combines the outputs. Mathematically, for an input feature map of size $H \times W \times C$, a standard convolution with kernel size $K \times K$ and $C'$ output channels has a computational cost of $H \times W \times C \times C' \times K^2$. In contrast, a depthwise separable convolution reduces this to $H \times W \times C \times K^2 + H \times W \times C \times C'$, leading to substantial savings. We experiment with different backbones and find that MobileNetV2 offers the best speed-accuracy trade-off on our dataset.
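The savings can be checked numerically. The snippet below evaluates both cost formulas for an example feature map (the sizes chosen here are arbitrary illustrations); the ratio reduces algebraically to $1/C' + 1/K^2$:

```python
def conv_cost(H, W, C, C_out, K):
    """Multiply-accumulate count of a standard K x K convolution."""
    return H * W * C * C_out * K * K

def dw_separable_cost(H, W, C, C_out, K):
    """Depthwise (per-channel K x K) plus pointwise (1 x 1) convolution."""
    return H * W * C * K * K + H * W * C * C_out

# Example: a 52 x 52 x 128 feature map mapped to 256 channels with 3 x 3 kernels.
std = conv_cost(52, 52, 128, 256, 3)
sep = dw_separable_cost(52, 52, 128, 256, 3)
ratio = sep / std  # equals 1/C_out + 1/K^2, here about 0.115
```

With a 3 x 3 kernel and 256 output channels, the separable variant needs roughly 11.5% of the standard cost, which is where most of the backbone's speedup comes from.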
MobileNetV2 introduces inverted residual blocks, which consist of expansion and projection layers. The block first uses a 1×1 convolution to expand the channel dimension, followed by a 3×3 depthwise convolution, and then a 1×1 convolution to project the channels back. This structure enhances feature extraction while maintaining efficiency. In our improved YOLOv4 model, which we call MobileNetV2-YOLOv4-lite, we also replace the standard 3×3 convolutions in the PANet (Path Aggregation Network) with depthwise separable convolutions to further reduce parameters. The overall architecture includes the backbone for feature extraction, SPP (Spatial Pyramid Pooling) for multi-scale feature fusion, PANet for feature aggregation, and the YOLO head for prediction. The loss function used is CIoU Loss, which improves bounding box regression by considering overlap area, center distance, and aspect ratio. The CIoU Loss is defined as:
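To make the inverted residual structure concrete, the following sketch counts the weights in one block (biases and batch-norm parameters ignored; the channel sizes and expansion factor are illustrative, not taken from our trained model):

```python
def inverted_residual_params(c_in, c_out, expand=6, k=3):
    """Weight count of a MobileNetV2-style inverted residual block:
    1x1 expand -> k x k depthwise -> 1x1 project."""
    c_mid = c_in * expand
    return (c_in * c_mid          # 1x1 pointwise expansion
            + c_mid * k * k       # depthwise: one k x k filter per channel
            + c_mid * c_out)      # 1x1 pointwise projection

# For 64 -> 64 channels with expansion 6, the depthwise step costs only
# 384 * 9 = 3,456 weights; a full 3x3 convolution at the expanded width
# of 384 channels would cost 384 * 384 * 9 = 1,327,104.
total = inverted_residual_params(64, 64)  # 24576 + 3456 + 24576 = 52608
```

The expansion is affordable precisely because the expensive spatial filtering is depthwise: the block can widen its internal representation six-fold while the $K \times K$ stage stays linear in the channel count.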
$$ \text{Loss}_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v $$
where $IoU$ is the intersection over union, $\rho$ is the Euclidean distance between the centers of the predicted box $b$ and ground truth box $b^{gt}$, $c$ is the diagonal length of the smallest enclosing box, $\alpha$ is a weight parameter, and $v$ measures aspect ratio consistency, given by:
$$ v = \frac{4}{\pi^2} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^2 $$
Here, $w$ and $h$ are the width and height of the predicted box, and $w^{gt}$ and $h^{gt}$ are those of the ground truth. This loss function helps in accurately localizing solar panels in the infrared images.
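A minimal NumPy implementation of the CIoU loss above, for a single pair of axis-aligned boxes in (x1, y1, x2, y2) format, might look as follows. This is an illustrative sketch, not our training code, which operates on batched tensors:

```python
import numpy as np

def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2):
    1 - IoU + rho^2 / c^2 + alpha * v."""
    # intersection-over-union
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_b = (box[2] - box[0]) * (box[3] - box[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_b + area_g - inter + eps)

    # squared center distance rho^2 and enclosing-box diagonal c^2
    rho2 = (((box[0] + box[2]) - (gt[0] + gt[2])) / 2) ** 2 \
         + (((box[1] + box[3]) - (gt[1] + gt[3])) / 2) ** 2
    c2 = (max(box[2], gt[2]) - min(box[0], gt[0])) ** 2 \
       + (max(box[3], gt[3]) - min(box[1], gt[1])) ** 2 + eps

    # aspect-ratio consistency term v and its weight alpha
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / np.pi ** 2) * (np.arctan(wg / hg) - np.arctan(w / h)) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```

For a perfect prediction all three penalty terms vanish and the loss is zero; for disjoint boxes the center-distance term keeps a useful gradient even though the IoU term saturates at 1, which is exactly why CIoU localizes better than plain IoU loss.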
After recognizing the solar panels, we crop them from the background to focus on the regions of interest. This step is crucial because it eliminates irrelevant background noise and simplifies the subsequent hot spot segmentation. The cropping is done using the bounding box coordinates obtained from the detection model, and the extracted solar panel images are placed on a black canvas for uniformity. This preprocessing ensures that the segmentation model only processes the relevant areas, improving both accuracy and efficiency.
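The cropping step can be sketched as follows, assuming boxes are given in pixel coordinates as (x1, y1, x2, y2); everything outside the detected panels is zeroed out on a black canvas:

```python
import numpy as np

def crop_to_canvas(image, boxes):
    """Keep only detected panel regions: copy each bounding box onto a
    black canvas the same size as the original infrared image."""
    canvas = np.zeros_like(image)
    for (x1, y1, x2, y2) in boxes:
        canvas[y1:y2, x1:x2] = image[y1:y2, x1:x2]
    return canvas
```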
The second stage of our method involves segmenting hot spots within the cropped solar panel images. We use the DeepLabV3+ semantic segmentation model, which is effective for pixel-level classification. However, the standard DeepLabV3+ employs an Xception backbone, which is computationally heavy. To make it suitable for real-time detection, we replace the backbone with MobileNetV2, resulting in a model we refer to as DeepLabV3+_MobileNetV2. This replacement drastically reduces the model size while retaining strong feature extraction capabilities. Additionally, we modify the decoder part to address the issue of small target loss caused by downsampling. In the original DeepLabV3+, the feature maps are downsampled by a factor of 16, which can lead to the loss of fine details for small hot spots. We change the downsampling factor to 8 to preserve more spatial information, enhancing the segmentation of hot spots in solar panels.
Furthermore, we modify the loss function to handle class imbalance, as hot spots typically occupy a small portion of the image compared to the background. Instead of the standard cross-entropy loss, we use Dice Loss, which is better suited for imbalanced datasets. The Dice Loss is defined as:
$$ DL(p) = 1 - \frac{2 \sum_{i=1}^{I} \sum_{j=1}^{N} p_{ij} g_{ij}}{\sum_{i=1}^{I} \sum_{j=1}^{N} p_{ij} + \sum_{i=1}^{I} \sum_{j=1}^{N} g_{ij}} $$
where $p_{ij}$ is the predicted probability that pixel $j$ belongs to class $i$, $g_{ij}$ is the ground truth label (1 if pixel $j$ belongs to class $i$, 0 otherwise), $N$ is the total number of pixels, and $I$ is the number of classes (2 for our case: hot spot and background). This loss function maximizes the overlap between predicted and ground truth masks, making it effective for segmenting small hot spots in solar panels.
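A direct NumPy translation of the Dice Loss formula, for a predicted probability map $p$ and one-hot ground truth $g$ of shape $(I, N)$, is shown below; the small $\epsilon$ is a common numerical-stability addition, not part of the formula above:

```python
import numpy as np

def dice_loss(p, g, eps=1e-7):
    """Dice loss for probability maps p and one-hot labels g,
    both of shape (I, N): I classes, N pixels."""
    intersection = (p * g).sum()
    return 1 - (2 * intersection + eps) / (p.sum() + g.sum() + eps)
```

Because the loss is driven by the overlap ratio rather than a per-pixel average, a small hot spot contributes as strongly as the large background region, which is what makes Dice Loss effective under class imbalance.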
The encoder part of our DeepLabV3+_MobileNetV2 model uses atrous convolutions with rates of 6, 12, and 18 to capture multi-scale context without losing resolution. The decoder combines low-level features from the backbone with high-level features from the encoder through concatenation and upsampling. The final output is a segmentation map that highlights hot spot regions. To validate our improvements, we conduct experiments comparing different variants of the model, as summarized in the following tables.
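The effect of the atrous rates can be quantified: a $k \times k$ kernel with dilation rate $r$ spans $k + (k-1)(r-1)$ input positions, so the three ASPP branches see progressively larger contexts with no extra parameters:

```python
def atrous_effective_size(k, rate):
    """Effective extent of a k x k convolution with dilation `rate`:
    the k taps span k + (k - 1) * (rate - 1) input positions."""
    return k + (k - 1) * (rate - 1)

# The three 3x3 ASPP branches with rates 6, 12, 18 cover
# 13, 25, and 37 input positions respectively.
sizes = [atrous_effective_size(3, r) for r in (6, 12, 18)]
```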
For the solar panel recognition model, we evaluate several algorithms, including Faster R-CNN, YOLOv4-Tiny, YOLOv5s, and our improved versions. The metrics used are Average Precision (AP), Recall, Frames Per Second (FPS), and model size. AP is calculated as the area under the precision-recall curve, with precision and recall defined as:
$$ \text{Precision} = \frac{TP}{TP + FP} $$
$$ \text{Recall} = \frac{TP}{TP + FN} $$
where $TP$ is true positives, $FP$ is false positives, and $FN$ is false negatives. FPS measures the detection speed, and model size indicates the parameter count. The results are shown in Table 1.
| Model | AP (%) | Recall (%) | FPS (frames/s) | Model Size (MB) |
|---|---|---|---|---|
| Faster R-CNN | 97.9 | 97.63 | 7.7 | 108 |
| YOLOv4-Tiny | 96.94 | 95.42 | 42 | 22.4 |
| YOLOv5s | 95.58 | 96.58 | 20.25 | 27 |
| YOLOv4 | 99.66 | 98.57 | 13.7 | 244 |
| YOLOv4-V1 (MobileNetV1) | 99.62 | 98.03 | 18.2 | 51 |
| YOLOv4-V2 (MobileNetV2) | 99.56 | 98.91 | 22.1 | 46.4 |
| YOLOv4-V3 (MobileNetV3) | 99.61 | 97.89 | 15.9 | 53.6 |
From Table 1, we observe that our MobileNetV2-based YOLOv4 model (YOLOv4-V2) achieves a balance of high AP (99.56%), high recall (98.91%), fast FPS (22.1 frames/s), and small model size (46.4 MB). This makes it ideal for real-time detection of solar panels in aerial infrared images. The reduction in parameters is due to the depthwise separable convolutions, which we also apply in the PANet. The training process involves two phases: a frozen phase where the backbone is fixed, and an unfrozen phase where all layers are trainable. We use transfer learning with pre-trained weights to accelerate convergence. The loss curve during training shows a steady decrease, indicating effective learning.
For the hot spot segmentation model, we compare different versions of DeepLabV3+ with modifications. The evaluation metrics include Mean Pixel Accuracy (MPA), Mean Intersection over Union (mIoU), FPS, and model size. MPA and mIoU are defined as:
$$ MPA = \frac{1}{n} \sum_{i=1}^{n} \frac{R_{ii}}{\sum_{j=1}^{n} R_{ij}} $$
$$ mIoU = \frac{1}{n} \sum_{i=1}^{n} \frac{R_{ii}}{\sum_{j=1}^{n} R_{ij} + \sum_{j=1}^{n} R_{ji} - R_{ii}} $$
where $R_{ij}$ is the number of pixels of class $i$ predicted as class $j$, and $n$ is the number of classes. The results are presented in Table 2, where we test five models: A (standard DeepLabV3+), B (MobileNetV2 backbone), C (MobileNetV2 backbone with 8x downsampling), D (MobileNetV2 backbone with Dice Loss), and E (MobileNetV2 backbone with 8x downsampling and Dice Loss).
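Both metrics follow directly from the confusion matrix; a compact NumPy implementation is:

```python
import numpy as np

def mpa_miou(conf):
    """MPA and mIoU from an n x n confusion matrix where conf[i, j]
    counts pixels of class i predicted as class j."""
    conf = conf.astype(np.float64)
    diag = np.diag(conf)                      # correctly classified pixels
    mpa = np.mean(diag / conf.sum(axis=1))    # per-class pixel accuracy
    miou = np.mean(diag / (conf.sum(axis=1) + conf.sum(axis=0) - diag))
    return mpa, miou
```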
| Model | MPA (%) | mIoU (%) | FPS (frames/s) | Model Size (MB) |
|---|---|---|---|---|
| A: DeepLabV3+ (Xception) | 73.26 | 71.34 | 13.7 | 209 |
| B: DeepLabV3+_MobileNetV2 | 92.03 | 80.73 | 26.6 | 22.3 |
| C: B with 8x downsampling | 94.95 | 84.31 | 24.1 | 22.3 |
| D: B with Dice Loss | 93.21 | 81.66 | 26.8 | 22.3 |
| E: B with 8x downsampling and Dice Loss | 95.99 | 85.58 | 24.5 | 22.3 |
Table 2 demonstrates that model E, which incorporates all our improvements, achieves the best performance with an MPA of 95.99%, mIoU of 85.58%, FPS of 24.5 frames/s, and a compact size of 22.3 MB. The use of Dice Loss significantly enhances segmentation accuracy by addressing class imbalance, while the reduced downsampling factor preserves details of hot spots. The training involves a similar two-phase approach, and the loss curves show that Dice Loss converges faster and to a lower value compared to cross-entropy loss. These results confirm that our method can effectively segment hot spots in solar panels, even in the presence of reflective noise, which often appears as false hot spots in infrared images.
The integration of the two models forms a complete pipeline for hot spot detection. First, the MobileNetV2-YOLOv4-lite model identifies and crops solar panels from the aerial infrared images. Then, the DeepLabV3+_MobileNetV2 model segments hot spots within the cropped images. This two-step process ensures that only relevant regions are analyzed, reducing computational load and improving accuracy. We test the pipeline on a validation set of infrared images, and the results show that solar panels are recognized with high precision, and hot spots are accurately segmented. For instance, in sample images, the model successfully distinguishes between real hot spots and reflective artifacts, which is critical for reliable fault detection in solar panels.
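Schematically, the two-stage pipeline composes the detector and the segmenter as below; `detect_panels` and `segment_hotspots` are placeholders for the two trained models, passed in as callables:

```python
import numpy as np

def hotspot_pipeline(image, detect_panels, segment_hotspots):
    """Two-stage pipeline: `detect_panels` returns (x1, y1, x2, y2)
    boxes, `segment_hotspots` returns a binary mask for one crop.
    Both are stand-ins for the trained detection and segmentation
    models; hot spot masks are pasted back into full-image coordinates."""
    full_mask = np.zeros(image.shape[:2], dtype=bool)
    for (x1, y1, x2, y2) in detect_panels(image):
        crop = image[y1:y2, x1:x2]
        full_mask[y1:y2, x1:x2] |= segment_hotspots(crop)
    return full_mask
```

Because the segmenter only ever sees cropped panel regions, background clutter never reaches the second stage, which is the source of both the accuracy and the efficiency gains described above.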
To further illustrate the effectiveness of our approach, we provide additional analysis of the computational benefits. The use of depthwise separable convolutions in both models dramatically reduces the total number of parameters. For example, the standard YOLOv4 with its CSPDarknet53 backbone has a model size of approximately 244 MB, while our MobileNetV2-based version occupies only 46.4 MB, a reduction of over 80%. Similarly, in the segmentation model, replacing Xception with MobileNetV2 cuts the model size from 209 MB to 22.3 MB. This parameter reduction translates to faster inference times, as shown by the FPS metrics. Such efficiency is essential for UAV-based real-time monitoring, where processing power is limited.
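The stated reductions follow directly from the model sizes reported in Tables 1 and 2:

```python
def size_reduction(before_mb, after_mb):
    """Fractional reduction in model size after the backbone swap."""
    return 1 - after_mb / before_mb

yolo = size_reduction(244, 46.4)  # about 0.81, i.e. over 80%
seg = size_reduction(209, 22.3)   # about 0.89
```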
Moreover, we discuss the robustness of our method to various challenges. Solar panels in aerial images can appear at different scales, orientations, and under varying thermal conditions due to weather or time of day. Our data augmentation strategy, including rotation and contrast adjustment, helps the models generalize to these variations. Additionally, the modification of the loss function to Dice Loss improves the handling of imbalanced data, where hot spots occupy only a small fraction of the image. This is particularly important for solar panels, as hot spots are often tiny compared to the overall panel area. The mIoU score of 85.58% indicates that our model achieves high overlap with ground truth masks, validating its segmentation capability.
In terms of practical application, our method can be deployed on UAVs equipped with thermal cameras for automated inspection of solar farms. The fast detection speed (22.1 FPS for panel recognition and 24.5 FPS for hot spot segmentation) allows for real-time analysis during flight, enabling immediate identification of faulty solar panels. This can significantly reduce maintenance costs and prevent potential damages. Furthermore, the lightweight nature of the models means they can run on embedded systems, making the solution scalable for large-scale deployments.
We also consider potential limitations and future work. While our method performs well on the collected dataset, it may need adaptation for different types of solar panels or environmental conditions. For example, the thermal signatures of hot spots can vary based on panel material or age. In future research, we plan to expand the dataset to include more diverse solar panels and explore multi-task learning for simultaneous detection and classification of hot spots. Classifying hot spots by severity or cause (e.g., shading vs. cell damage) could provide deeper insights for maintenance. Additionally, we aim to integrate our models with UAV navigation systems for autonomous inspection paths, optimizing coverage and efficiency.
In conclusion, we have presented a deep learning-based approach for hot spot detection in aerial infrared images of solar panels. Our method combines an improved YOLOv4 model for solar panel recognition and an enhanced DeepLabV3+ model for hot spot segmentation, both optimized for speed and accuracy using lightweight networks and modified components. The experimental results demonstrate that our approach achieves high detection rates, with a solar panel recognition AP of 99.56% and a hot spot segmentation MPA of 95.99%, at speeds suitable for real-time UAV applications. By addressing challenges such as background noise and reflective artifacts, our method provides a reliable solution for monitoring solar panels, contributing to the sustainability and efficiency of solar energy systems. As the demand for renewable energy grows, automated inspection techniques like ours will play a crucial role in ensuring the health and performance of solar panels across the globe.
To summarize the key equations used in our models, we list them below for reference. The CIoU Loss for object detection is given by:
$$ \text{Loss}_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v $$
with $v$ defined as:
$$ v = \frac{4}{\pi^2} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^2 $$
The Dice Loss for segmentation is:
$$ DL(p) = 1 - \frac{2 \sum_{i=1}^{I} \sum_{j=1}^{N} p_{ij} g_{ij}}{\sum_{i=1}^{I} \sum_{j=1}^{N} p_{ij} + \sum_{i=1}^{I} \sum_{j=1}^{N} g_{ij}} $$
These loss functions, along with the architectural innovations, enable effective learning for detecting and segmenting hot spots in solar panels. Through continuous refinement and adaptation, we believe that deep learning methods will become indispensable tools in the maintenance of solar energy infrastructure, ensuring that solar panels operate at peak efficiency and reliability.
