Advancing Solar Panel Inspection: An Improved Semantic Segmentation Framework

The global shift towards renewable energy sources is no longer a future aspiration but a present necessity. Among these, solar energy, harnessed through photovoltaic (PV) systems, stands as a cornerstone of sustainable power generation. The widespread deployment of solar panel arrays across diverse terrains, from vast deserts to residential rooftops, underscores their critical role. However, the efficient operation of these systems is paramount to realizing their economic and environmental promise. Solar panels are susceptible to various defects during their operational lifetime, including hot spots, micro-cracks, diode failures, and soiling. These faults not only degrade the power output of individual panels but can also lead to cascading failures, safety hazards, and significant financial losses. Therefore, developing robust, automated, and accurate methods for solar panel inspection is a crucial technological challenge.

Traditional inspection methods, often relying on manual visual checks or electrical characterization, are labor-intensive, time-consuming, and subject to human error, making them unsuitable for large-scale solar farms. The advent of unmanned aerial vehicles (UAVs) equipped with high-resolution visible-light and, more importantly, thermal infrared cameras has revolutionized this field. Infrared thermography can visually reveal temperature anomalies caused by electrical faults (like hot spots) or physical defects. The core task then becomes the automated analysis of these captured images to first locate the solar panels within the scene and then identify any anomalies on them. This is where the field of computer vision, particularly deep learning, offers transformative potential.

While object detection algorithms like the YOLO or R-CNN families can directly locate fault regions, they often struggle in complex real-world scenarios. A significant challenge arises from cluttered backgrounds—metal roof supports, HVAC units, skylights, or even other sun-heated surfaces in infrared imagery can exhibit thermal signatures misleadingly similar to a faulty solar panel. This leads to false positives, reducing the reliability of automated inspection systems. A more sophisticated approach is to decompose the problem: first, perform precise solar panel semantic segmentation to extract pixel-perfect masks of every panel, completely stripping away the background; second, apply fault detection algorithms solely within these masked regions. This two-stage pipeline drastically reduces false alarms by eliminating background interference.
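The benefit of the two-stage decomposition can be sketched in a few lines. Below is a minimal, NumPy-only toy illustration (not the paper's actual fault detector) of the second stage: hot-spot thresholding is restricted to the segmented panel mask, so a hot background object can never trigger a false alarm. The `delta` threshold and the synthetic temperatures are illustrative assumptions.

```python
import numpy as np

def detect_hot_spots(thermal, panel_mask, delta=10.0):
    """Toy second-stage fault check: flag pixels much hotter than the
    mean panel temperature, but only inside the segmentation mask.

    thermal:    2-D array of temperature readings (e.g. an IR frame)
    panel_mask: boolean mask from the segmentation stage (True = panel)
    delta:      temperature excess counting as a hot-spot candidate
                (an illustrative value, not a tuned one)
    """
    panel_pixels = thermal[panel_mask]
    if panel_pixels.size == 0:
        return np.zeros_like(panel_mask, dtype=bool)
    baseline = panel_pixels.mean()
    # Background pixels are excluded up front, so a sun-heated roof
    # section or HVAC unit can never be reported as a panel fault.
    return (thermal > baseline + delta) & panel_mask

# Synthetic 6x6 frame: hot background object top-left, panel in the
# bottom-right quadrant with one genuine hot spot.
thermal = np.full((6, 6), 30.0)
thermal[0, 0] = 80.0          # hot HVAC unit in the background
mask = np.zeros((6, 6), dtype=bool)
mask[3:, 3:] = True           # segmented panel region
thermal[4, 4] = 55.0          # genuine hot spot on the panel

spots = detect_hot_spots(thermal, mask)  # only (4, 4) is flagged
```

The hot background pixel at (0, 0) is hotter than the real fault, yet it is ignored entirely, which is precisely the false-positive suppression the two-stage pipeline provides.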

Semantic segmentation assigns a class label (e.g., “solar panel” or “background”) to every pixel in an image. Among the state-of-the-art models, DeepLabV3+ has established itself as a powerful architecture due to its encoder-decoder structure and Atrous Spatial Pyramid Pooling (ASPP) module, which effectively captures multi-scale contextual information. However, in the specific context of segmenting solar panels from infrared imagery, standard DeepLabV3+ exhibits notable shortcomings. Its boundary segmentation is often imprecise, resulting in wavy, irregular edges or adhesion between adjacent panels. Furthermore, it can misclassify background objects with similar textures or thermal profiles as part of a solar panel. These inaccuracies at the segmentation stage directly propagate errors into the subsequent fault detection phase.

To address these critical limitations, this work proposes a comprehensively improved DeepLabV3+ model specifically tailored for high-precision solar panel semantic segmentation. Our contributions are multifaceted, targeting enhanced feature extraction, boundary refinement, and computational efficiency to create a model suitable for real-world, drone-based inspection systems.

Architectural Innovations for Enhanced Segmentation

The standard DeepLabV3+ architecture, while powerful, uses a computationally heavy backbone like Xception. For UAV applications where processing power and battery life are constrained, a lighter yet effective model is essential. Therefore, our first major modification replaces the backbone network with MobileNetV2. MobileNetV2 employs depthwise separable convolutions and inverted residual blocks with linear bottlenecks, dramatically reducing the number of parameters and computational cost (FLOPs) while maintaining strong representational capacity. This shift makes the model significantly more deployable on edge computing devices.
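The parameter savings from depthwise separable convolutions follow from simple arithmetic: a standard convolution mixes space and channels in one step, while the separable version factorizes it into a per-channel spatial filter plus a 1×1 channel mixer. The sketch below uses an illustrative 32-to-64-channel 3×3 layer, not an actual MobileNetV2 layer shape.

```python
def standard_conv_params(c_in, c_out, k):
    # A standard k x k convolution mixes space and channels at once.
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    # Depthwise: one k x k filter per input channel (spatial mixing),
    # pointwise: a 1 x 1 convolution across channels (channel mixing).
    return k * k * c_in + c_in * c_out

c_in, c_out, k = 32, 64, 3   # illustrative sizes, not MobileNetV2's actual layers
std = standard_conv_params(c_in, c_out, k)        # 18432
sep = depthwise_separable_params(c_in, c_out, k)  # 2336
print(std, sep, round(std / sep, 1))              # roughly a 7.9x reduction
```

The same factorization applies at every layer, which is where MobileNetV2's overall reduction in parameters and FLOPs comes from.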

A key insight driving our improvements is the critical importance of low-level features for accurate boundary delineation. The shallow layers of a convolutional neural network capture fine-grained details like edges, textures, and gradients—precisely the information needed to separate one solar panel from another or from a complex background. In the standard flow, these features are simply passed through a 1×1 convolution in the decoder. We enhance this pathway in two significant ways.

First, we explicitly reinforce edge information by integrating a classical Canny edge detection algorithm. We extract the feature map from the fourth layer of MobileNetV2 (Bottleneck 3, output depth 32) and process it through the Canny algorithm. The Canny detector is a multi-stage process: it applies Gaussian filtering to reduce noise, computes intensity gradients, performs non-maximum suppression to thin edges, and uses double thresholding with hysteresis to finalize the edge map. The gradient magnitude \(G\) and direction \(\theta\) at each pixel are computed using Sobel operators \(G_x\) and \(G_y\):

$$G = \sqrt{G_x^2 + G_y^2}$$

$$\theta = \arctan\left(\frac{G_y}{G_x}\right)$$
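The two gradient formulas above can be sketched directly in NumPy. This is a minimal, self-contained illustration of the Sobel step of Canny (not the full detector, and not the paper's implementation); `arctan2` is used in place of the plain arctangent so that a zero \(G_x\) is handled safely.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Plain 'valid' 2-D cross-correlation (no padding), NumPy only."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel operators for horizontal (Gx) and vertical (Gy) gradients.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def sobel_gradients(img):
    gx = conv2d_valid(img, SOBEL_X)
    gy = conv2d_valid(img, SOBEL_Y)
    magnitude = np.sqrt(gx**2 + gy**2)   # G = sqrt(Gx^2 + Gy^2)
    direction = np.arctan2(gy, gx)       # theta; atan2 handles Gx = 0
    return magnitude, direction

# Synthetic image with a sharp vertical edge (dark left, bright right),
# e.g. a panel border against a cooler background.
img = np.hstack([np.zeros((5, 3)), np.ones((5, 3))])
mag, theta = sobel_gradients(img)
# mag peaks along the edge columns; theta is 0 there (purely horizontal gradient)
```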

The resulting edge map, rich in boundary semantics, is treated as our enhanced shallow feature. This map is then fed forward through two parallel 1×1 convolutional channels instead of one. This “parallel dual-channel” design increases the richness and diversity of the low-level information reaching the decoder, allowing the network to learn more complex transformations of the boundary features, thereby improving its ability to reconstruct sharp and accurate solar panel contours.
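The parallel dual-channel pathway can be sketched as two independent 1×1 projections of the same shallow feature map, concatenated before the decoder. The 32 input channels match the stated Bottleneck 3 depth, but the 24-channel branch width and the BN+ReLU arrangement are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DualChannelLowLevel(nn.Module):
    """Two parallel 1x1 convolution paths over the edge-enhanced
    shallow features, concatenated before entering the decoder.
    Branch width (24) is illustrative, not taken from the paper."""
    def __init__(self, in_ch=32, branch_ch=24):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, kernel_size=1, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
        self.branch_a = branch()
        self.branch_b = branch()

    def forward(self, x):
        # Each branch learns a different 1x1 projection of the same
        # boundary-rich features; concatenation keeps both views.
        return torch.cat([self.branch_a(x), self.branch_b(x)], dim=1)

x = torch.randn(1, 32, 128, 128)   # shallow feature map (Bottleneck 3 depth)
y = DualChannelLowLevel()(x)       # shape (1, 48, 128, 128)
```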

Second, we redesign the core ASPP module to be more sensitive to channel-wise dependencies and finer spatial details. The original ASPP uses parallel atrous convolutions with large dilation rates (e.g., 6, 12, 18) to capture long-range context, but this can come at the cost of losing finer local patterns, especially detrimental for boundary precision. We propose the SE-ASPP module. We modify the dilation rates to a smaller set (3, 6, 9), which provides a better balance between capturing context and preserving local detail relevant for solar panel edges. Crucially, after each atrous convolution branch, we integrate a Squeeze-and-Excitation (SE) attention block.

The SE mechanism operates by explicitly modeling inter-channel relationships. It first “squeezes” global spatial information from a feature map \(U\) with dimensions \(H \times W \times C\) via global average pooling, producing a channel descriptor \(z \in \mathbb{R}^C\) where the \(c\)-th element is:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$

This descriptor is then “excited” through a simple gating mechanism with sigmoid activation \(\sigma\):

$$s = F_{ex}(z, W) = \sigma(W_2 \delta(W_1 z))$$

where \(\delta\) is the ReLU function, and \(W_1\) and \(W_2\) are weights of two fully-connected layers that first reduce and then restore dimensionality. The resulting vector \(s\) contains channel-wise activation weights. Finally, these weights are used to recalibrate the original feature map \(U\):

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$$

By embedding SE blocks into each ASPP branch, the SE-ASPP module allows the network to adaptively emphasize informative feature channels and suppress less useful ones within each receptive field scale. This enhances the module’s ability to focus on features crucial for distinguishing solar panel regions from challenging backgrounds and for defining clear boundaries.
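The three SE equations above translate almost line for line into PyTorch. The sketch below shows one SE-recalibrated atrous branch of the SE-ASPP (dilation rate 3, per the modified rate set); the 320 input channels and 256-channel branch width are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global-average-pool 'squeeze', two FC
    layers with reduction ratio r, sigmoid 'excite', channel rescale."""
    def __init__(self, channels, r=16):
        super().__init__()
        hidden = max(channels // r, 1)
        self.fc1 = nn.Linear(channels, hidden)   # W1: reduce dimensionality
        self.fc2 = nn.Linear(hidden, channels)   # W2: restore dimensionality

    def forward(self, u):
        n, c, h, w = u.shape
        z = u.mean(dim=(2, 3))                                  # z_c: squeeze
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))    # s = sigma(W2 delta(W1 z))
        return u * s.view(n, c, 1, 1)                           # x~_c = s_c * u_c

# One atrous branch of the SE-ASPP with rate 3; channel widths are
# illustrative assumptions, not the paper's exact values.
branch = nn.Sequential(
    nn.Conv2d(320, 256, kernel_size=3, padding=3, dilation=3, bias=False),
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    SEBlock(256),
)
out = branch(torch.randn(1, 320, 32, 32))   # shape preserved: (1, 256, 32, 32)
```

In the full SE-ASPP, one such block would follow each of the rate-3, rate-6, and rate-9 branches before the branch outputs are concatenated.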

The overall architecture of our improved model integrates these components cohesively. The MobileNetV2 backbone extracts hierarchical features. The enhanced shallow features from the Canny-edge-augmented fourth layer are routed through dual channels. The deep features from the encoder output are fused with a downsampled version of these shallow features and then processed by the SE-ASPP module. In the decoder, features from the SE-ASPP and the dual shallow-feature channels are progressively fused and upsampled to produce the final high-resolution segmentation mask for the solar panel.

Experimental Methodology and Performance Analysis

To validate the effectiveness of our proposed model, we conducted comprehensive experiments using a dedicated dataset of solar panel infrared images. The dataset was constructed by capturing imagery from a DJI M300 RTK UAV equipped with an HT20 thermal camera, surveying solar installations in varied terrain and under different weather conditions. The original 1,820 images were augmented via rotation to create a robust dataset of 3,640 images, containing over 11,000 distinct solar panel instances. Each image was meticulously annotated at the pixel level using LabelMe to generate ground truth masks. The dataset was split into training (70%), validation (20%), and test (10%) sets.

All models were trained and evaluated under identical conditions: an input image size of 512×512, an Adam optimizer with an initial learning rate of 0.0001, a batch size of 16, and 100 training epochs. The hardware platform consisted of an NVIDIA RTX 4090 GPU. We employed standard semantic segmentation metrics for evaluation: Precision (P), Recall (R), mean Intersection over Union (mIoU), and the F1-score. Their formulas are given below, where TP, FP, and FN represent True Positives, False Positives, and False Negatives for the “solar panel” class, respectively.

$$P = \frac{TP}{TP + FP} \times 100\%$$

$$R = \frac{TP}{TP + FN} \times 100\%$$

$$mIoU = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i + FN_i} \times 100\% \quad (N=2)$$

$$F1 = 2 \times \frac{P \times R}{P + R} \times 100\%$$
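The four metrics above can be computed directly from a pair of binary masks. The NumPy sketch below evaluates the "solar panel" class and averages IoU over the two classes for mIoU; the toy masks are fabricated purely to exercise the formulas.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Binary-segmentation metrics for the 'solar panel' class, plus
    mIoU averaged over the two classes (panel / background)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    iou_panel = tp / (tp + fp + fn)
    iou_bg = tn / (tn + fn + fp)   # for the background class, FP and FN swap roles
    miou = (iou_panel + iou_bg) / 2
    return 100 * p, 100 * r, 100 * miou, 100 * f1

# Toy 4x4 masks: 8 panel pixels; the prediction misses one (FN) and
# falsely adds one (FP).
gt = np.zeros((4, 4), dtype=int)
gt[:2, :] = 1
pred = gt.copy()
pred[0, 0] = 0     # missed panel pixel
pred[3, 3] = 1     # background misclassified as panel
p, r, miou, f1 = segmentation_metrics(pred, gt)
# P = R = F1 = 87.5%, mIoU = 77.78%
```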

We benchmarked our improved DeepLabV3+ model against several prominent semantic segmentation architectures: PSPNet, HRNet, U-Net, and the original DeepLabV3+. The quantitative results are compelling and are summarized in the table below.

| Model | Precision (P) % | mIoU % | Recall (R) % | F1-Score % |
| --- | --- | --- | --- | --- |
| PSPNet | 96.31 | 96.83 | 97.37 | 96.84 |
| HRNet | 97.01 | 97.29 | 97.62 | 97.31 |
| U-Net | 96.92 | 97.15 | 97.46 | 97.08 |
| DeepLabV3+ (Original) | 97.26 | 97.63 | 98.04 | 97.83 |
| Our Improved Model | 99.50 | 99.21 | 99.61 | 99.55 |

The results clearly demonstrate the superiority of our approach. Our model achieves leading scores across all metrics, with precision at 99.50%, mIoU at 99.21%, recall at 99.61%, and an F1-score of 99.55%. Compared with the original DeepLabV3+, these are gains of 2.24, 1.58, 1.57, and 1.72 percentage points, respectively. This indicates that our modifications not only improve the accuracy of identifying solar panel pixels (higher precision) but also enhance the completeness of the segmentation (higher recall), yielding a more balanced and robust performance.

Qualitative analysis provides even more striking evidence of the improvements. Visual inspection of segmentation results on challenging test images reveals that baseline models like PSPNet, U-Net, and the original DeepLabV3+ frequently exhibit critical failures: they missegment background objects like metal structures or vehicles as solar panels; they produce adhesive boundaries where gaps between adjacent panels are lost; and they yield irregular, wavy, or corroded edges on the panels themselves. In contrast, our improved model consistently produces clean, precise masks. It successfully ignores distracting background clutter, maintains clear separation between tightly packed panels, and outputs smooth, continuous boundaries that accurately reflect the true geometry of each solar panel. This visual fidelity is paramount for reliable downstream fault analysis.

To dissect the contribution of each proposed component, we conducted a systematic ablation study, starting from the original DeepLabV3+ baseline. The results are presented in the following table.

| Experiment | MobileNetV2 | Canny Edge | SE-ASPP | Dual Channel | P (%) | mIoU (%) | R (%) | F1 (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 (Baseline) | – | – | – | – | 97.26 | 97.63 | 98.04 | 97.83 |
| 2 | ✓ | – | – | – | 98.10 | 98.95 | 99.07 | 98.94 |
| 3 | ✓ | ✓ | – | – | 98.54 | 99.01 | 99.52 | 99.01 |
| 4 | ✓ | ✓ | ✓ | – | 98.84 | 99.03 | 99.52 | 99.02 |
| 5 (Full Model) | ✓ | ✓ | ✓ | ✓ | 99.50 | 99.21 | 99.61 | 99.55 |

The ablation study confirms the positive impact of each innovation. Replacing the backbone with MobileNetV2 (Exp 2) provides a substantial boost across all metrics, proving its effectiveness as a feature extractor for this task. Adding the Canny edge guidance (Exp 3) further improves precision and recall, validating its role in enhancing boundary awareness. Incorporating the SE-ASPP module (Exp 4) yields an additional gain in precision, demonstrating the benefit of channel-wise feature recalibration at multiple scales. Finally, introducing the parallel dual-channel processing for shallow features (Exp 5) pushes the performance to its peak, particularly in precision and F1-score, highlighting the importance of enriching the low-level feature flow to the decoder. The cumulative effect of all components results in the superior performance of the full model.

Conclusion and Future Outlook

In this work, we have presented a significantly improved DeepLabV3+ architecture specifically designed for the precise semantic segmentation of solar panels in infrared inspection imagery. By integrating a lightweight MobileNetV2 backbone, explicit edge reinforcement via the Canny algorithm, a channel-attentive SE-ASPP module, and a dual-channel shallow feature pathway, we have successfully addressed the key limitations of traditional models: inaccurate boundaries, adhesion between panels, and misclassification of background objects.

The experimental results are unequivocal. Our model achieves exceptional quantitative scores, surpassing 99% in precision, recall, mIoU, and F1-score, and delivers qualitatively superior segmentation masks with clean, precise boundaries. This high-fidelity segmentation acts as a perfect pre-processing filter, isolating the solar panel regions and eliminating background noise. When integrated into a two-stage inspection pipeline, this capability will dramatically increase the accuracy and reliability of automated fault detection systems for photovoltaic plants, reducing operational costs and maintenance downtime.

Future work will focus on several avenues. First, we aim to explore real-time optimization of the model for deployment on UAV onboard computers, potentially using techniques like quantization and pruning. Second, extending the model to perform multi-class segmentation—differentiating between panels, mounting structures, and different types of background—could provide even richer context for inspection systems. Finally, developing an end-to-end framework that jointly optimizes the segmentation and fault detection tasks could lead to further performance gains. The continued advancement of such intelligent vision systems is essential for supporting the global scale-up and sustainable management of solar energy infrastructure.
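As one concrete example of the compression techniques mentioned above, magnitude-based pruning is available out of the box in PyTorch. The sketch below prunes a hypothetical stand-in convolution, not a layer of our model; the 30% sparsity target is illustrative, and real deployment would require fine-tuning after pruning.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical stand-in for one convolution of the segmentation model.
conv = nn.Conv2d(32, 64, kernel_size=3, padding=1)

# L1 unstructured pruning zeroes the 30% of weights with the smallest
# magnitude; prune.remove() then makes the sparsity permanent by
# folding the mask into the weight tensor.
prune.l1_unstructured(conv, name="weight", amount=0.3)
prune.remove(conv, "weight")

sparsity = float((conv.weight == 0).float().mean())
print(f"weight sparsity: {sparsity:.2f}")   # approximately 0.30
```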