In recent years, the rapid expansion of distributed photovoltaic power stations has highlighted the critical need for efficient maintenance and inspection of solar panels. As a key component of renewable energy systems, solar panels are susceptible to surface contamination and defects such as bird droppings, dust accumulation, physical damage, electrical faults, and snow cover, all of which can significantly impair energy conversion efficiency. Traditional inspection methods, including electrical characteristic analysis and visual inspection, often fall short in accuracy and speed, especially in large-scale solar farms. With the advent of unmanned aerial vehicles (UAVs) carrying embedded devices, there is a growing demand for lightweight, real-time detection models that can run under tight computational constraints. This paper addresses these challenges by proposing an enhanced Single Shot MultiBox Detector (SSD) algorithm tailored to solar panel defect detection. By adopting MobileNetV3 as the backbone network, incorporating a coordinate attention mechanism, and employing Mosaic data augmentation, our approach balances high precision with computational efficiency, making it suitable for UAV-based inspection of solar panels.
The core of our methodology lies in optimizing the SSD framework for lightweight operation without compromising detection accuracy. The original SSD algorithm predicts object categories and locations from multiple feature maps at different scales, but its computational demands can be prohibitive for embedded systems. To mitigate this, we replace the conventional backbone with MobileNetV3, which uses depthwise separable convolutions and inverted residual blocks to reduce parameters and floating-point operations. The inverted residual structure, as illustrated in our design, first expands the channel dimension with a 1×1 convolution, applies a 3×3 depthwise convolution for spatial feature extraction, and then reduces the dimension with another 1×1 convolution. The final 1×1 projection uses a linear activation, which keeps computational overhead low while avoiding the information loss that a nonlinearity would cause in the narrow output. The overall architecture processes input images resized to 300×300 pixels and extracts multi-scale feature maps of size 19×19, 10×10, 5×5, 3×3, 2×2, and 1×1 to handle objects of varying sizes on solar panels, from small bird droppings to large panel surfaces.
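To give a feel for why the expand, depthwise, project pattern is cheap, the sketch below counts convolution weights for one inverted residual block and compares them with a dense 3×3 convolution at the expanded width. The layer sizes are hypothetical values chosen for illustration, not taken from our network, and biases, batch normalization, and any squeeze-and-excitation branch are ignored.

```python
def standard_conv_weights(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias and batch norm ignored)."""
    return k * k * c_in * c_out


def inverted_residual_weights(k, c_in, c_exp, c_out):
    """Weights in an expand -> depthwise -> project block (bias and batch norm ignored)."""
    expand = c_in * c_exp        # 1x1 pointwise convolution widens the channels
    depthwise = k * k * c_exp    # one k x k filter per expanded channel
    project = c_exp * c_out      # 1x1 linear projection back to c_out channels
    return expand + depthwise + project


# Hypothetical layer sizes, chosen only for illustration.
k, c_in, c_exp, c_out = 3, 40, 240, 40
dense = standard_conv_weights(k, c_exp, c_exp)            # dense 3x3 conv at the expanded width
block = inverted_residual_weights(k, c_in, c_exp, c_out)  # inverted residual block
print(dense, block, round(dense / block, 1))
```

With these assumed sizes the block needs roughly one twenty-fourth of the weights of the dense convolution, because the spatial filtering is done per channel and channel mixing is left to the 1×1 layers.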
To further enhance feature representation, we integrate a coordinate attention (CA) mechanism into the network. Unlike channel attention modules that rely on global pooling, CA encodes spatial information along the height and width directions, allowing the model to focus adaptively on critical regions. For an input feature map $X \in \mathbb{R}^{H \times W \times C}$, average pooling is performed independently in the horizontal and vertical directions, producing feature vectors that capture positional context. These vectors are concatenated and processed through convolutional layers followed by a sigmoid activation to generate the attention weights $A_h$ and $A_w$. The output is computed as $Y = X \odot A_h \odot A_w$, where $\odot$ denotes element-wise multiplication, enabling the model to emphasize the areas relevant to defect detection on solar panels. This attention mechanism significantly improves the model's ability to discern subtle defects, such as discoloration or cracks, against complex backgrounds.

Data augmentation plays a vital role in improving model generalization, especially given the scarcity of annotated datasets for solar panel defects. We employ Mosaic augmentation, which combines four images through random scaling, cropping, and rotation, creating composite training samples that simulate diverse environmental conditions. This technique enriches the dataset with varied perspectives and occlusion scenarios, helping the model learn robust features for detecting contaminants on solar panels. Our custom dataset, structured similarly to Pascal VOC, includes categories like clean panels, bird droppings, dirt, electrical damage, physical damage, and snow, split into training, validation, and test sets. The Mosaic augmentation not only increases data diversity but also enhances the model’s resilience to positional and scale variations, crucial for real-world applications involving solar panels.
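The annotation side of Mosaic can be sketched as follows. This is a simplified illustration, not our training pipeline: each of the four source images is assumed to be resized to fill one quadrant of the canvas (random cropping and rotation are left out), pixel compositing is omitted, and the function names, the split range, and the class labels in the example are hypothetical.

```python
import random


def mosaic_layout(canvas, rng, low=0.3, high=0.7):
    """Pick a random split point and return four quadrant rectangles (x0, y0, x1, y1)."""
    cx = int(rng.uniform(low, high) * canvas)
    cy = int(rng.uniform(low, high) * canvas)
    return [
        (0, 0, cx, cy),            # top-left
        (cx, 0, canvas, cy),       # top-right
        (0, cy, cx, canvas),       # bottom-left
        (cx, cy, canvas, canvas),  # bottom-right
    ]


def remap_boxes(boxes, src_size, quad):
    """Scale boxes from a src_w x src_h image into the given quadrant."""
    src_w, src_h = src_size
    qx0, qy0, qx1, qy1 = quad
    sx = (qx1 - qx0) / src_w
    sy = (qy1 - qy0) / src_h
    return [
        (label, qx0 + x0 * sx, qy0 + y0 * sy, qx0 + x1 * sx, qy0 + y1 * sy)
        for label, x0, y0, x1, y1 in boxes
    ]


def mosaic_boxes(samples, canvas=300, seed=0):
    """Merge the annotations of four (size, boxes) samples into one composite list."""
    rng = random.Random(seed)
    quads = mosaic_layout(canvas, rng)
    merged = []
    for (size, boxes), quad in zip(samples, quads):
        merged.extend(remap_boxes(boxes, size, quad))
    return merged


# Four hypothetical annotated images: ((width, height), [(label, x0, y0, x1, y1), ...])
samples = [
    ((640, 480), [("bird_droppings", 100, 120, 180, 200)]),
    ((640, 480), [("snow", 0, 0, 640, 480)]),
    ((800, 600), [("dirt", 400, 300, 500, 380)]),
    ((800, 600), []),
]
for box in mosaic_boxes(samples):
    print(box)
```

Because the split point moves from sample to sample, the same defect appears at different positions and scales across composites, which is the source of the positional and scale robustness described above.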
Experiments were conducted on a system with an NVIDIA GeForce RTX 3060 GPU and an Intel Xeon CPU, using PyTorch 1.10 and Python 3.8. The model was trained for 300 epochs using stochastic gradient descent (SGD) with a momentum of 0.9 and a batch size of 32. The initial learning rate was set to 0.01 and adjusted cyclically by a cosine annealing schedule to promote convergence and reduce overfitting. Performance metrics included mean average precision (mAP), accuracy, parameter count, computational cost in GFLOPS, model size, and frames per second (FPS). The results show that our improved SSD algorithm achieves an mAP of 82.71% and an accuracy of 94.2% on the test set, with a computational cost of 13.8 GFLOPS and a model size of 18.3 MB, enabling real-time detection at 45.6 FPS.
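As a rough illustration of the learning-rate schedule, the sketch below evaluates the closed-form cosine annealing curve for the stated initial rate of 0.01 over 300 epochs. A single cycle with no warm restarts and a minimum rate of zero are assumptions made here for simplicity; the exact schedule parameters are not given above.

```python
import math


def cosine_lr(epoch, total_epochs=300, lr_max=0.01, lr_min=0.0):
    """Learning rate at a given epoch under single-cycle cosine annealing."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 75, 150, 225, 300):
    print(epoch, round(cosine_lr(epoch), 5))
```

The rate starts at 0.01, passes through half that value at the midpoint, and decays smoothly toward the minimum at the end of training. In PyTorch the same curve is available through `torch.optim.lr_scheduler.CosineAnnealingLR`.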
Ablation studies were performed to validate the contribution of each component. As summarized in Table 1, replacing ResNet50 with MobileNetV3 as the backbone reduces the parameter count by 68.3% (from 44,553,636 to 14,114,584), adding coordinate attention raises mAP by 5.21 percentage points (experiment B versus D), and Mosaic augmentation contributes a further 4.30 percentage points (experiment C versus D), underscoring its importance in handling diverse defect types on solar panels. Comparison with other models, such as Faster R-CNN and YOLOv3, highlights the advantage of our approach: it outperforms Faster R-CNN in both accuracy and speed, and surpasses YOLOv3 by about 10 percentage points in mAP while using only 22.9% of its parameters. These findings confirm that our lightweight design is well suited to deployment on UAVs for inspecting solar panels.

Table 1. Ablation study of the backbone, CA mechanism, and Mosaic augmentation.

| Experiment | Backbone | CA Mechanism | Mosaic Augmentation | mAP (%) | Accuracy (%) | Parameters | GFLOPS |
|---|---|---|---|---|---|---|---|
| A | ResNet50 | No | Yes | 72.68 | 78.43 | 44,553,636 | 30.0 |
| B | MobileNetV3 | No | Yes | 77.50 | 81.03 | 14,114,584 | 13.7 |
| C | MobileNetV3 | Yes | No | 78.41 | 84.11 | 14,114,890 | 13.7 |
| D | MobileNetV3 | Yes | Yes | 82.71 | 92.28 | 14,114,890 | 13.7 |

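The relative figures derived from the ablation table can be recomputed directly from its rows. The snippet below is only a convenience for checking the arithmetic; the dictionary layout and helper names are illustrative, and the numbers are copied from Table 1.

```python
# mAP (%) and parameter counts copied from the ablation table.
ablation = {
    "A": {"backbone": "ResNet50", "ca": False, "mosaic": True, "map": 72.68, "params": 44_553_636},
    "B": {"backbone": "MobileNetV3", "ca": False, "mosaic": True, "map": 77.50, "params": 14_114_584},
    "C": {"backbone": "MobileNetV3", "ca": True, "mosaic": False, "map": 78.41, "params": 14_114_890},
    "D": {"backbone": "MobileNetV3", "ca": True, "mosaic": True, "map": 82.71, "params": 14_114_890},
}


def param_reduction(base, new):
    """Percentage reduction in parameter count going from experiment base to new."""
    return 100.0 * (1.0 - ablation[new]["params"] / ablation[base]["params"])


def map_gain(base, new):
    """mAP difference in percentage points going from experiment base to new."""
    return ablation[new]["map"] - ablation[base]["map"]


print("backbone swap, parameter reduction (%):", round(param_reduction("A", "B"), 1))
print("adding CA, mAP gain (points):", round(map_gain("B", "D"), 2))
print("adding Mosaic, mAP gain (points):", round(map_gain("C", "D"), 2))
```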
Training minimizes the standard SSD loss, which combines a localization loss and a confidence loss. For a predicted bounding box with coordinates $(c_x, c_y, w, h)$ and class probabilities $p$, the total loss $L$ is given by:
$$ L = \frac{1}{N} (L_{conf} + \alpha L_{loc}) $$
where $N$ is the number of matched default boxes, $L_{conf}$ is the cross-entropy loss for classification, $L_{loc}$ is the Smooth L1 loss for box regression, and $\alpha$ is a weighting parameter. In our implementation, we optimize this loss for the detection of solar panel defects, with the CA mechanism refining the feature maps to reduce false positives. The depthwise separable convolution in MobileNetV3 decomposes a standard convolution into a depthwise and a pointwise operation, reducing the computational complexity per output position from $O(K^2 \cdot C_i \cdot C_o)$ to $O(K^2 \cdot C_i + C_i \cdot C_o)$, where $K$ is the kernel size and $C_i$, $C_o$ are the numbers of input and output channels. This efficiency is critical for processing high-resolution images of solar panels on resource-constrained devices.
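A minimal numerical sketch of this loss is given below, written in plain Python rather than as our training code. It covers only the matched (positive) default boxes; the hard-negative mining that SSD applies to the confidence term is omitted, and `alpha=1.0` is a placeholder because the value used in training is not stated above. The tuple layout of `matches` is an assumption made for this example.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty for one regression residual."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class given class probabilities."""
    return -math.log(probs[target])


def ssd_loss(matches, alpha=1.0):
    """Total loss L = (L_conf + alpha * L_loc) / N over matched default boxes.

    Each match is (class_probs, target_class, predicted_offsets, target_offsets).
    """
    n = len(matches)
    if n == 0:
        return 0.0  # no matched boxes: the loss is defined as zero
    l_conf = sum(cross_entropy(p, t) for p, t, _, _ in matches)
    l_loc = sum(
        smooth_l1(pred - tgt)
        for _, _, preds, tgts in matches
        for pred, tgt in zip(preds, tgts)
    )
    return (l_conf + alpha * l_loc) / n


# Two hypothetical matched boxes with (c_x, c_y, w, h) offsets.
matches = [
    ([0.7, 0.2, 0.1], 0, [0.1, 0.0, 0.2, 0.1], [0.0, 0.0, 0.0, 0.0]),
    ([0.1, 0.8, 0.1], 1, [1.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]),
]
print(round(ssd_loss(matches), 4))
```

The quadratic region of the Smooth L1 term keeps gradients small for nearly correct boxes, while the linear region limits the influence of badly mismatched ones.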
In comparison to existing methods, our model demonstrates significant improvements, as detailed in Table 2. The lightweight architecture not only accelerates inference but also maintains high accuracy across various defect types on solar panels. For instance, in detecting bird droppings and snow cover, the model achieves precision rates above 90%, attributed to the multi-scale feature fusion and attention mechanism. The use of Mosaic augmentation ensures that the model generalizes well to unseen data, reducing overfitting and enhancing robustness in diverse environmental conditions. These advancements make our approach a practical solution for automated inspection systems targeting solar panels.

Table 2. Comparison with existing detection models.

| Model | mAP (%) | Accuracy (%) | Parameters | Model Size (MB) | GFLOPS | FPS |
|---|---|---|---|---|---|---|
| Faster R-CNN | 80.39 | 89.6 | 191,385,860 | 157.5 | 240.0 | 3.2 |
| YOLOv3 | 72.63 | 82.9 | 61,524,355 | 100.6 | 20.6 | 21.1 |
| SSD-ResNet50 | 72.68 | 78.4 | 44,553,636 | 52.4 | 30.5 | 23.9 |
| Our Improved Model | 82.71 | 94.2 | 14,114,890 | 18.3 | 13.8 | 45.6 |

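For deployment planning it can help to restate the comparison in relative terms. The short script below derives the parameter share, speed-up, and per-frame latency of each model from the reported parameter counts and FPS; the helper names are illustrative, and the latency is simply the reciprocal of the FPS measured on the test system, not a measurement on UAV hardware.

```python
# Parameter counts and FPS copied from the model comparison table.
models = {
    "Faster R-CNN": {"params": 191_385_860, "fps": 3.2},
    "YOLOv3": {"params": 61_524_355, "fps": 21.1},
    "SSD-ResNet50": {"params": 44_553_636, "fps": 23.9},
    "Our Improved Model": {"params": 14_114_890, "fps": 45.6},
}
OURS = "Our Improved Model"


def param_share(baseline):
    """Our parameter count as a percentage of the baseline's."""
    return 100.0 * models[OURS]["params"] / models[baseline]["params"]


def speedup(baseline):
    """Ratio of our FPS to the baseline's FPS."""
    return models[OURS]["fps"] / models[baseline]["fps"]


def latency_ms(name):
    """Average time per frame in milliseconds implied by the reported FPS."""
    return 1000.0 / models[name]["fps"]


for name in models:
    print(name, round(param_share(name), 1), round(speedup(name), 2), round(latency_ms(name), 1))
```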
To make the effect of the coordinate attention mechanism concrete, consider the feature enhancement process. Let $F$ be the input feature map from a convolutional layer. Average pooling along the width and height directions yields the direction-aware descriptors $z_h$ and $z_w$. These are concatenated and passed through a shared $1 \times 1$ convolution with a nonlinearity, $f = \delta(W_1 * [z_h, z_w])$, and the result is split along the spatial dimension into $f_h$ and $f_w$. The attention maps are then obtained as $A_h = \sigma(W_h * f_h)$ and $A_w = \sigma(W_w * f_w)$, where $\sigma$ denotes the sigmoid function, $\delta$ a nonlinear activation, and $*$ convolution. The refined feature map is $Y = F \odot A_h \odot A_w$, where $\odot$ indicates element-wise multiplication with $A_h$ and $A_w$ broadcast along the width and height, respectively. This process allows the model to dynamically adjust its focus on defective regions of solar panels, improving detection precision for small objects such as bird droppings or fine cracks.
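The pooling and reweighting steps can be illustrated on a tiny feature map. The sketch below is deliberately reduced: the learned convolutions are replaced by the identity, so the gates are just sigmoids of the directional averages, and nested Python lists stand in for tensors with a channel-first layout. It shows how $z_h$ and $z_w$ are formed and how $A_h$ and $A_w$ are broadcast in the final product, not the trained module.

```python
import math


def sigmoid(x):
    """Logistic function used to squash gate values into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def directional_pool(f):
    """Average a C x H x W map over width (z_h, C x H) and over height (z_w, C x W)."""
    z_h = [[sum(row) / len(row) for row in chan] for chan in f]
    z_w = [[sum(col) / len(col) for col in zip(*chan)] for chan in f]
    return z_h, z_w


def coordinate_gate(f):
    """Reweight f by sigmoid gates built from its directional averages.

    The learned convolutions are replaced by the identity, so this only
    illustrates the pooling and the broadcast product Y = F * A_h * A_w.
    """
    z_h, z_w = directional_pool(f)
    a_h = [[sigmoid(v) for v in chan] for chan in z_h]
    a_w = [[sigmoid(v) for v in chan] for chan in z_w]
    return [
        [
            [f[c][i][j] * a_h[c][i] * a_w[c][j] for j in range(len(f[c][i]))]
            for i in range(len(f[c]))
        ]
        for c in range(len(f))
    ]


feature = [[[1.0, 3.0], [5.0, 7.0]]]  # one channel, 2 x 2
z_h, z_w = directional_pool(feature)
print(z_h, z_w)
print(coordinate_gate(feature))
```

Each output value is scaled by one gate indexed by its row and one indexed by its column, which is what lets the mechanism localize a response along both axes.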
In conclusion, our improved SSD algorithm offers a lightweight and efficient solution for detecting surface contaminants on solar panels. By leveraging MobileNetV3, coordinate attention, and Mosaic data augmentation, we achieve a balance between computational efficiency and high accuracy, making the model ideal for UAV-based inspections. The reduction in parameters and GFLOPS, coupled with a significant boost in FPS, ensures real-time performance on embedded systems. Future work could explore further optimizations, such as neural architecture search or hybrid models, to enhance detection capabilities for emerging defect types in solar panels. This research contributes to the maintenance of photovoltaic systems, promoting sustainable energy generation through reliable and automated inspection technologies.
