The widespread adoption of distributed photovoltaic (PV) power generation has led to a massive global installed capacity. Ensuring the operational efficiency of these vast solar farms is paramount for maximizing return on investment and contributing to clean energy goals. A critical factor affecting the performance of solar panels is surface contamination. During long-term outdoor exposure, solar panels are susceptible to various forms of soiling and defects, including bird droppings, dust accumulation, snow cover, and physical or electrical damage. These contaminants cast shadows or create resistive pathways, significantly reducing the energy conversion efficiency of the affected modules. Timely identification and remediation of such issues are therefore essential for maintaining optimal power output from solar installations.
Traditional inspection methods for solar panel arrays are often labor-intensive, costly, and potentially hazardous. The deployment of Unmanned Aerial Vehicles (UAVs) equipped with visible-light or thermal imaging cameras has revolutionized this process, enabling rapid, large-scale inspections. However, this shift presents a new computational challenge. The embedded devices carried by UAVs have limited processing power, memory, and battery life. Consequently, the object detection models used to analyze the captured imagery must be not only accurate but also highly efficient—lightweight in terms of parameters and computational complexity (FLOPs), and fast in inference speed to enable real-time or near-real-time analysis during flight.

Existing deep learning-based approaches for solar panel defect detection have shown promising results. Two-stage detectors like Faster R-CNN offer high accuracy but are often too computationally heavy for edge deployment. Single-stage detectors, such as the You Only Look Once (YOLO) family and the Single Shot MultiBox Detector (SSD), provide a better speed-accuracy trade-off. The SSD algorithm, in particular, performs detection at multiple feature map scales within a single network pass, making it a strong candidate for balanced performance. However, the standard SSD, often paired with backbone networks like VGG or ResNet, still carries a significant parameter and computation burden, hindering its direct application on UAV platforms for comprehensive solar panel inspection.
To address this critical gap for in-situ solar panel assessment, this work proposes a lightweight and efficient object detection method based on an improved SSD architecture, specifically designed for the task of identifying various types of solar panel surface contamination. Our primary objective is to drastically reduce the model’s footprint and computational demands while maintaining, or even enhancing, its detection accuracy. The core improvements involve three key modifications: 1) Replacing the conventional heavy backbone network with MobileNetV3, a network specifically engineered for mobile and embedded vision applications; 2) Integrating a lightweight Coordinate Attention (CA) mechanism into the feature extraction layers to boost discriminative power without substantial overhead; and 3) Employing advanced data augmentation techniques, specifically Mosaic augmentation, on a custom-built dataset to improve the model’s robustness and generalization to diverse real-world scenarios of solar panel soiling.
1. Architectural Design of the Improved SSD-MobileNetV3 Framework
The foundation of our approach is the Single Shot MultiBox Detector (SSD) framework, renowned for predicting bounding boxes and classes directly from multiple feature maps. Our significant modification lies in the backbone network. We discard the computationally expensive backbones typically used with SSD and adopt MobileNetV3-Large as our primary feature extractor. MobileNetV3 is built upon depthwise separable convolutions and incorporates efficient architectural blocks such as inverted residuals with linear bottlenecks and squeeze-and-excitation modules, making it exceptionally parameter- and compute-efficient. This replacement forms the core of our lightweight design, termed SSD-MobileNetV3, tailored for efficient solar panel inspection.
The overall architecture of our proposed model is designed for processing images in the context of solar farm surveillance. The input solar panel image is first resized to a fixed resolution of 300×300 pixels. It is then fed into the MobileNetV3 backbone, which consists of a series of efficient convolutional layers and bottleneck blocks. These layers progressively down-sample the spatial dimensions while expanding the channel depth, hierarchically extracting features that are crucial for identifying defects ranging from small bird droppings to large areas of snow cover on solar panels.
The feature maps from the final layer of the MobileNetV3 backbone serve as the starting point for the SSD detection head. Following the standard SSD design, we append several auxiliary convolutional layers to this backbone. These additional layers produce a set of feature maps at multiple scales, chosen specifically to capture solar panel defects of varying sizes: 19×19, 10×10, 5×5, 3×3, 2×2, and 1×1. The higher-resolution feature maps (e.g., 19×19) are responsible for detecting small contaminants on the panel surface, such as localized bird excrement or minor physical damage. Conversely, the coarser, lower-resolution feature maps (e.g., 1×1) are adept at identifying larger defects, such as extensive snow coverage or major panel discoloration.
At each location on these multi-scale feature maps, the network predicts a set of default bounding boxes (priors) and the offsets for these boxes, along with confidence scores for each object class (e.g., clean solar panel, bird dropping, dirt, electrical damage). This multi-scale prediction strategy is key to handling the large variance in target sizes present in aerial imagery of solar panel arrays. Finally, non-maximum suppression (NMS) is applied to filter out overlapping and low-confidence detections, yielding the final set of predicted bounding boxes and labels for the contaminated solar panels.
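The final filtering step can be sketched as a minimal greedy NMS over `[x1, y1, x2, y2]` boxes in pure NumPy. The 0.45 IoU and 0.5 confidence thresholds below are illustrative defaults, not values tuned in this work:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.45, score_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    idx = np.where(scores >= score_thresh)[0]      # drop low-confidence boxes
    order = idx[np.argsort(scores[idx])[::-1]]     # highest score first
    kept = []
    while order.size > 0:
        best = order[0]
        kept.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return kept
```

For example, two heavily overlapping boxes with scores 0.9 and 0.8 collapse to the higher-scoring one, while a distant third box survives.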
2. Core Components for Efficiency and Accuracy
2.1 Inverted Residual Bottleneck with Linear Bottleneck
The fundamental building block of the MobileNetV3 backbone is the inverted residual bottleneck layer. This structure is pivotal for achieving a lightweight model suitable for inspecting vast fields of solar panels, and its design philosophy stands in contrast to traditional residual blocks. As illustrated in the conceptual diagram, it first applies a 1×1 pointwise convolution to expand the channel dimension. This high-dimensional representation is then processed by a 3×3 depthwise convolution, which applies a single filter per input channel, drastically reducing parameters. Finally, another 1×1 pointwise convolution compresses the channel dimension back down.
The “linear bottleneck” refers to the use of a linear activation function in the final compression layer, as opposed to a non-linear one like ReLU6. This design choice helps prevent information loss that can occur when non-linearities collapse a low-dimensional manifold of interest—a crucial consideration for preserving the subtle textures and patterns indicative of different solar panel defects. The mathematical flow of this block for an input tensor \( \mathbf{X} \) can be summarized as:
$$ \mathbf{Y} = \text{Conv}_{1\times1}^{\,linear}(\,\text{DWConv}_{3\times3}^{\,nonlin}(\,\text{Conv}_{1\times1}^{\,nonlin}(\mathbf{X})\,)\,) $$
where \( \text{DWConv} \) denotes depthwise convolution and \( nonlin \) typically represents the hard-swish or ReLU activation. This sequence of “expand-transform-compress” with depthwise convolution and linear bottlenecks is the primary driver behind the network’s efficiency, making deep feature extraction feasible on embedded hardware tasked with scanning hundreds of solar panels.
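To make the saving concrete, the following quick calculation compares the weight count of a standard 3×3 convolution with its depthwise-separable counterpart at an illustrative width of 240 channels (the channel sizes are assumptions for illustration; biases and batch normalization are ignored):

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (biases/BN ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k=3):
    """Depthwise k x k (one filter per input channel) + 1 x 1 pointwise."""
    return c_in * k * k + c_in * c_out

std = standard_conv_params(240, 240)        # 518,400 weights
sep = depthwise_separable_params(240, 240)  # 2,160 + 57,600 = 59,760 weights
ratio = std / sep                           # roughly 8.7x fewer parameters
```

The same factorization applied throughout the backbone is what makes the overall parameter budget small enough for embedded deployment.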
2.2 Integration of Coordinate Attention Mechanism
While MobileNetV3 provides an efficient feature extraction skeleton, we further enhance its discriminative power specifically for solar panel defect detection by integrating a lightweight attention mechanism. We employ the Coordinate Attention (CA) block due to its ability to capture long-range spatial dependencies with minimal computational cost—a perfect fit for our lightweight design goal. Unlike channel-only attention (e.g., Squeeze-and-Excitation) which ignores location information, CA encodes both channel relationships and precise positional information, which is vital for locating defects on the structured, grid-like surface of a solar panel.
The CA mechanism operates by decomposing global spatial pooling into two one-dimensional, direction-aware pooling operations. For an input feature map \( \mathbf{X} \in \mathbb{R}^{C \times H \times W} \), it first performs average pooling along the horizontal and vertical axes separately, generating two direction-aware feature maps: \( \mathbf{z}^h \in \mathbb{R}^{C \times H \times 1} \) and \( \mathbf{z}^w \in \mathbb{R}^{C \times 1 \times W} \). These aggregated features are then concatenated and transformed via a shared 1×1 convolution, followed by a nonlinear activation \( \delta \) (e.g., hard-swish), to produce an intermediate feature map that encodes spatial information. This map is subsequently split back into two separate tensors for the height and width dimensions. A pair of 1×1 convolutions with sigmoid activations \( \sigma \) then generates the final attention weights \( \mathbf{g}^h \) and \( \mathbf{g}^w \). The output of the CA block is computed by element-wise multiplication of the original input with these spatially-aware weights:
$$
\begin{aligned}
\mathbf{z}^h_c(h) &= \frac{1}{W} \sum_{0 \leq j < W} \mathbf{X}_c(h, j), \\
\mathbf{z}^w_c(w) &= \frac{1}{H} \sum_{0 \leq i < H} \mathbf{X}_c(i, w), \\
\mathbf{f} &= \delta(\, \text{Conv}_{1\times1}(\,[\mathbf{z}^h, \mathbf{z}^w]\,) \,), \\
\mathbf{g}^h &= \sigma(\text{Conv}_{1\times1}^h(\mathbf{f}^h)), \quad \mathbf{g}^w = \sigma(\text{Conv}_{1\times1}^w(\mathbf{f}^w)), \\
\mathbf{Y}_c(i, j) &= \mathbf{X}_c(i, j) \times \mathbf{g}^h_c(i) \times \mathbf{g}^w_c(j).
\end{aligned}
$$
By embedding this CA block into critical stages of the MobileNetV3 backbone, our model learns to “pay attention” to the most relevant regions along the rows and columns of the solar panel image. This is particularly effective for distinguishing defects from regular panel textures, grid lines, or mounting hardware, thereby improving localization accuracy and reducing false positives during the inspection of solar panel arrays.
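The equations above can be sketched directly in NumPy: because the 1×1 convolutions act on 1-D pooled features, they reduce to channel-mixing matrix products. The weights below are random stand-ins for learned parameters, and batch normalization is omitted; this is a minimal sketch, not the trained module:

```python
import numpy as np

def coordinate_attention(x, w_shared, w_h, w_w):
    """Coordinate Attention on a feature map x of shape (C, H, W).

    w_shared: (C_mid, C) shared 1x1 conv; w_h, w_w: (C, C_mid) output convs.
    """
    C, H, W = x.shape
    z_h = x.mean(axis=2)                      # (C, H): pool along the width
    z_w = x.mean(axis=1)                      # (C, W): pool along the height
    f = np.concatenate([z_h, z_w], axis=1)    # (C, H + W): concatenate
    f = np.maximum(w_shared @ f, 0.0)         # shared 1x1 conv + ReLU (delta)
    f_h, f_w = f[:, :H], f[:, H:]             # split back by direction
    g_h = 1.0 / (1.0 + np.exp(-(w_h @ f_h)))  # sigma: per-row weights (C, H)
    g_w = 1.0 / (1.0 + np.exp(-(w_w @ f_w)))  # sigma: per-column weights (C, W)
    return x * g_h[:, :, None] * g_w[:, None, :]
```

Since both attention maps lie in (0, 1), the block can only rescale activations; it never amplifies them, which keeps its effect gentle and stable.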
3. Dataset Curation and Augmentation Strategy for Solar Panels
A significant challenge in applying deep learning to niche industrial inspection tasks like solar panel defect detection is the lack of large, publicly available, annotated datasets. To address this, we constructed a custom dataset focusing on common visible-light defects found on solar panels in operational environments. Our dataset comprises images collected from various sources and annotated according to the PASCAL VOC format. The categories are specifically chosen to cover a range of operational issues: Bird Dropping, Clean Panel (as a reference/negative class), Dirt/Stain, Electrical Damage (e.g., hot spots visible under certain conditions), Physical Damage (cracks, fractures), and Snow Cover.
The dataset was split into training, validation, and test sets of 6136, 638, and 638 images, respectively, to ensure unbiased evaluation. The distribution of images per category is summarized in the following table, which underscores the dataset’s focus on common solar panel contaminants.
| Split | Total Images | Bird Dropping | Clean Panel | Dirt/Stain | Electrical Damage | Physical Damage | Snow Cover |
|---|---|---|---|---|---|---|---|
| Training Set | 6136 | 1459 | 1124 | 1703 | 374 | 297 | 1179 |
| Validation Set | 638 | 167 | 119 | 105 | 34 | 24 | 189 |
| Test Set | 638 | 180 | 135 | 83 | 35 | 33 | 172 |
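As a quick sanity check, the per-category counts in the table sum to the stated split totals:

```python
# Per-category image counts from the dataset table, in column order:
# Bird Dropping, Clean Panel, Dirt/Stain, Electrical Damage,
# Physical Damage, Snow Cover.
splits = {
    "train": [1459, 1124, 1703, 374, 297, 1179],
    "val":   [167, 119, 105, 34, 24, 189],
    "test":  [180, 135, 83, 35, 33, 172],
}
totals = {name: sum(counts) for name, counts in splits.items()}
# totals: train 6136, val 638, test 638 — matching the table
```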
To combat overfitting and enhance the model’s generalization capability—especially given the limited initial dataset size—we employed the Mosaic data augmentation technique during training. This method randomly selects four training images, applies scaling, cropping, and color jittering to each, and then stitches them into a single composite image. This synthetic image, along with its adjusted bounding boxes, is used as a training sample. Mosaic augmentation effectively increases the batch diversity within a single iteration, exposes the model to multiple solar panels and defect contexts simultaneously, and improves its robustness to variations in object scale and spatial arrangement. This is highly beneficial for a UAV-based system that encounters solar panels at different distances (scales) and under various partial views.
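The stitching step of Mosaic augmentation can be sketched as follows. This is a simplified sketch: only nearest-neighbour resizing is shown, the random cropping and colour jittering mentioned above are omitted, and the 600-pixel output size is an illustrative assumption:

```python
import numpy as np

def mosaic(images, boxes_list, out_size=600):
    """Stitch four images into a 2x2 composite and shift their boxes.

    images: four (H, W, 3) uint8 arrays; boxes_list: four (N_i, 4) arrays of
    [x1, y1, x2, y2] in the corresponding image's coordinates.
    """
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]  # (y, x) offsets
    all_boxes = []
    for img, boxes, (oy, ox) in zip(images, boxes_list, corners):
        h, w = img.shape[:2]
        ys = np.arange(half) * h // half                    # nearest-neighbour
        xs = np.arange(half) * w // half                    # resize indices
        canvas[oy:oy + half, ox:ox + half] = img[ys][:, xs]
        sx, sy = half / w, half / h                         # box scale factors
        all_boxes.append(boxes * [sx, sy, sx, sy] + [ox, oy, ox, oy])
    return canvas, np.concatenate(all_boxes)
```

Each composite therefore carries the annotations of four source images at once, which is what raises per-iteration diversity and scale variation during training.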
4. Experimental Evaluation and Comparative Analysis
All experiments were conducted in a controlled environment to ensure fair comparison. The models were trained for 300 epochs with a batch size of 32, using the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and weight decay. An initial learning rate of 0.01 was combined with a cosine annealing scheduler so that training converges smoothly to a robust solution for solar panel defect identification.
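The cosine annealing schedule can be written explicitly in its per-epoch form (any warm-up phase or per-iteration stepping is omitted; the floor `lr_min = 0` is an assumption):

```python
import math

def cosine_lr(epoch, total_epochs=300, lr_max=0.01, lr_min=0.0):
    """Cosine-annealed learning rate for a given (0-indexed) epoch."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)
```

The rate starts at 0.01, passes through 0.005 at the midpoint, and decays smoothly toward `lr_min` at the final epoch.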
We employ several metrics to comprehensively evaluate the performance of our model, keeping in mind the constraints of UAV deployment. Accuracy and mean Average Precision (mAP) are used to gauge detection precision. To measure efficiency, we report the number of parameters, computational complexity in Giga FLOPs (GFLOPS), and the model’s size on disk. Most critically for real-time inspection, we measure the inference speed in Frames Per Second (FPS) on a standard edge-computing GPU.
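One standard way to compute the per-class Average Precision underlying the mAP metric is sketched below. The all-points interpolation and the IoU ≥ 0.5 matching convention are assumptions consistent with PASCAL VOC-style evaluation, not a statement of our exact tooling:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one class from ranked detections (all-points interpolation).

    scores: confidence of each detection; is_tp: 1 if the detection matched
    an unclaimed ground-truth box (IoU >= 0.5), else 0; num_gt: number of
    ground-truth boxes for this class.
    """
    order = np.argsort(np.asarray(scores))[::-1]
    tp = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # Make the precision envelope non-increasing, then integrate over recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate([[0.0], recall])
    precision = np.concatenate([[precision[0]], precision])
    return float(np.sum(np.diff(recall) * precision[1:]))
```

mAP is then the mean of this quantity over the defect classes.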
4.1 Ablation Study: Validating the Improvement Components
To rigorously validate the contribution of each proposed modification to the final performance in detecting solar panel contamination, we conducted a systematic ablation study. The results are presented in the table below.
| Experiment | Backbone | CA Module | Mosaic Aug. | mAP (%) | Accuracy (%) | Params | GFLOPS |
|---|---|---|---|---|---|---|---|
| A | ResNet50 | No | Yes | 72.68 | 78.43 | 44.55M | 30.0 |
| B | MobileNetV3 | No | Yes | 77.50 | 81.03 | 14.11M | 13.7 |
| C | MobileNetV3 | Yes | No | 78.41 | 84.11 | 14.11M | 13.7 |
| D (Ours) | MobileNetV3 | Yes | Yes | 82.71 | 92.28 | 14.11M | 13.7 |
Analysis: Comparing Experiment A (SSD-ResNet50) and Experiment B (SSD-MobileNetV3 without CA) demonstrates the large efficiency gain from the backbone swap: parameters and GFLOPS drop by approximately 68% and 54%, respectively, while mAP actually increases by 4.82 percentage points. This confirms that MobileNetV3 provides a far more efficient foundation for feature extraction from solar panel images without sacrificing accuracy. Because Experiments B and C differ in both the CA module and the augmentation, the clean comparison for Coordinate Attention is Experiment B versus Experiment D, which share Mosaic augmentation: adding CA boosts mAP by 5.21 points and accuracy by a substantial 11.25 points, with a negligible increase in parameters and no measurable change in GFLOPS. This highlights CA’s effectiveness in refining features for precise defect localization on the structured surface of a solar panel. Likewise, comparing Experiment C and our final model (Experiment D) isolates the benefit of Mosaic augmentation, which contributes a 4.30-point mAP and 8.17-point accuracy gain, proving crucial for improving the model’s robustness and ability to generalize across varied scenes of solar panel soiling. The final model achieves an excellent balance, offering high accuracy with a very lightweight profile.
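The efficiency figures quoted for the backbone swap follow directly from the ablation table:

```python
# Experiments A (SSD-ResNet50) and B (SSD-MobileNetV3) from the ablation table.
params_a, params_b = 44.55, 14.11   # parameters, in millions
gflops_a, gflops_b = 30.0, 13.7     # computational cost, in GFLOPS
param_cut = 100 * (1 - params_b / params_a)   # parameter reduction, percent
gflop_cut = 100 * (1 - gflops_b / gflops_a)   # compute reduction, percent
# param_cut is about 68.3, gflop_cut about 54.3
```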
4.2 Comparative Analysis with State-of-the-Art Detectors
We benchmark our final improved SSD-MobileNetV3 model against several representative object detection frameworks, including a two-stage detector (Faster R-CNN) and other single-stage detectors (YOLOv3, SSD-ResNet50). All models were trained and evaluated on our custom solar panel contamination dataset under identical conditions. The results are summarized in the following comparative table.
| Model | mAP (%) | Accuracy (%) | Params | Model Size (MB) | GFLOPS | FPS |
|---|---|---|---|---|---|---|
| Faster R-CNN | 80.39 | 89.6 | 191.39M | 157.5 | 240.0 | 3.2 |
| YOLOv3 | 72.63 | 82.9 | 61.52M | 100.6 | 20.6 | 21.1 |
| SSD-ResNet50 | 72.68 | 78.4 | 44.55M | 52.4 | 30.5 | 23.9 |
| Our Model | 82.71 | 94.2 | 14.11M | 18.3 | 13.8 | 45.6 |
Analysis: The results clearly demonstrate the superiority of our improved model for solar panel inspection under computational constraints. While Faster R-CNN achieves respectable accuracy, its massive parameter count (191M), computational load (240 GFLOPS), and slow speed (3.2 FPS) render it impractical for real-time, on-device analysis on a UAV. Our model outperforms it by 2.3 mAP points and 4.6 accuracy points while being over 13x lighter in parameters and requiring roughly 17x less computation.
Compared to YOLOv3, our model achieves 10.1 points higher mAP and 11.3 points higher accuracy. More importantly, it does so with only 23% of YOLOv3’s parameters and 67% of its GFLOPS, resulting in more than twice the inference speed (45.6 vs. 21.1 FPS). The comparison with the standard SSD-ResNet50 further solidifies our architectural choices: our model delivers a 15.8-point boost in accuracy and nearly doubles the inference speed (45.6 vs. 23.9 FPS), all while reducing the model size by 65%.
5. Conclusion and Outlook
This work successfully addresses the critical need for efficient and accurate vision-based inspection of solar panels, particularly in the context of UAV-mounted edge computing. By fundamentally re-engineering the SSD detector, we have developed a highly optimized framework for detecting surface contamination on solar panels. The strategic replacement of the backbone with MobileNetV3 provided the foundational efficiency, dramatically reducing parameters and computations. The integration of the lightweight Coordinate Attention mechanism then significantly enhanced the feature representation power, enabling precise discrimination and localization of various defects on the solar panel surface without adding computational burden. Furthermore, the use of Mosaic data augmentation proved essential for robust learning from a limited dataset, improving the model’s ability to generalize to the diverse and unpredictable conditions encountered in real-world solar farms.
The experimental results are compelling. Our final model achieves a top-tier mAP of 82.71% and an accuracy of 94.2% on the solar panel contamination test set. More importantly, it attains this high precision with a remarkably low computational footprint of only 13.8 GFLOPS and a small model size of 18.3 MB, culminating in a real-time inference speed of 45.6 FPS. This combination of high accuracy, low resource consumption, and fast processing makes the proposed model exceptionally well-suited for deployment on drones for automated, large-scale solar panel array inspection. It enables the rapid identification of soiling issues like bird droppings, dirt, snow, and physical damage, allowing for timely maintenance that directly contributes to sustaining the peak power generation efficiency of solar energy installations.
Future work may explore further optimization for specific ultra-low-power hardware (e.g., Jetson Nano), investigate knowledge distillation techniques to create even smaller models, or extend the framework to perform simultaneous defect classification and severity assessment on the identified solar panel anomalies.
