In solar panel manufacturing, the production process can introduce various defects such as cracks, broken grids, incomplete cells, and black spots. These defects not only reduce power generation efficiency but can also pose fire hazards. Traditional detection methods include manual inspection and machine vision. However, manual inspection is subjective and error-prone, and conventional machine vision struggles to extract the limited defect information present in large volumes of imagery, so neither meets the demand for efficient multi-category defect identification on solar panels. Deep learning-based defect detection has emerged as a viable alternative, with object detection algorithms such as YOLO and Faster R-CNN directly outputting category probabilities and location coordinates. More recently, Transformer-based models have gained traction in visual tasks because their self-attention mechanisms capture global relationships within images. Swin-Transformer, for instance, has been applied to surface defect detection, but it suffers from high computational cost and missed detections of small targets. To address these issues in multi-scale, multi-object solar panel defect detection, we propose a novel network based on a multi-scale extended attention mechanism. The network enhances defect detection by expanding the receptive field, improves efficiency through a lightweight design, and incorporates a loss function better suited to small targets.
The core of our approach modifies the Swin-Transformer backbone by replacing Swin-Transformer Blocks with MEAN-Transformer Blocks, which integrate a multi-scale extended attention mechanism to better capture defects of varying sizes. We also enhance the feature fusion network by incorporating a 128×128 shallow-scale feature map and by using partial convolutions (PConv) to reduce computational load. To tackle the challenge of small defect detection, we replace the traditional IoU loss with a loss based on the Normalized Gaussian Wasserstein Distance (NWD). Experimental results show that our method achieves a detection accuracy (mAP) of 85.7%, a model size of 124.7 MB, and a processing speed of 42.1 FPS, meeting real-time requirements while improving multi-scale and multi-category recognition.

Defect detection in solar panels is critical for ensuring product quality and safety. Common defects include fine lines, black spots, scratches, broken grids, and incomplete cells, which can arise from material imperfections or manufacturing errors. Traditional computer vision methods often rely on handcrafted features and threshold-based segmentation, but they lack the adaptability to handle diverse defect types and scales. Deep learning models, particularly convolutional neural networks (CNNs), have shown promise in automating this process. One-stage detectors such as the YOLO series map input images directly to bounding boxes and class probabilities, offering high speed, while two-stage detectors such as Faster R-CNN first generate region proposals, providing higher accuracy at the cost of speed. Transformer-based models such as Swin-Transformer leverage self-attention to model long-range dependencies, but they can be computationally intensive and inefficient for the small objects found on solar panels. Our work builds on these foundations to develop a lightweight, efficient network tailored for solar panel inspection.
The Swin-Transformer backbone consists of multiple Swin-Transformer Blocks, each applying window-based self-attention followed by an MLP, with the window partition shifted between successive blocks. This architecture partitions the image into windows and computes attention within them, reducing computational complexity compared with global attention, but it can miss fine-grained details in multi-scale defects. The path aggregation network (PANet) integrates features from different layers, combining the top-down semantic pathway of an FPN with an additional bottom-up pathway that propagates localization features, forming a multi-scale feature pyramid. This enhances feature representation, but for solar panels it can be further optimized by incorporating additional shallow layers and lightweight convolutions.
In our proposed MEAN-Transformer network, we introduce several key innovations. First, the MEAN-Transformer Block replaces the standard Swin-Transformer Block to capture multi-scale defect information more effectively. The multi-scale extended attention mechanism (MEAM) groups the feature channels and applies sparse attention within sliding windows centered on each query patch, using a dilation rate to control the spacing of the attended positions. This allows the network to aggregate local and global context efficiently. Mathematically, for a query at position (i, j), MEAM selects keys and values from a window of size w × w with dilation rate r, computed as:
$$M = \text{MEAM}(Q, K, V, r)$$
where Q, K, V are the query, key, and value matrices. Each head’s output is refined progressively, and the final representation is obtained by concatenating the heads and applying a linear projection. The MEAN-Transformer Block combines a depthwise convolution for local feature extraction, MEAM for multi-scale attention, and an MLP with GELU activation for non-linearity, formulated as:
$$M = \text{DwConv}(\hat{M}) + \hat{M}$$
$$Y = \text{MEAM}(\text{Norm}(M)) + M$$
$$Z = \text{MLP}(\text{Norm}(Y)) + Y$$
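As a concrete illustration, the three residual stages above can be sketched in PyTorch roughly as follows. The class name `MEANBlockSketch` and its arguments are ours, and a standard multi-head attention is used only as a stand-in for MEAM (a per-head sketch of the dilated attention itself appears later in this section):

```python
import torch
import torch.nn as nn


class MEANBlockSketch(nn.Module):
    """Sketch of one MEAN-Transformer Block: DwConv -> attention -> MLP,
    each stage wrapped in a residual connection, mirroring the three equations above."""

    def __init__(self, dim: int, num_heads: int = 4, mlp_ratio: float = 4.0):
        super().__init__()
        # Depthwise 3x3 convolution for local feature extraction
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm1 = nn.LayerNorm(dim)
        # Stand-in for the multi-scale extended attention (MEAM)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from the previous stage
        m = self.dwconv(x) + x                      # M = DwConv(M_hat) + M_hat
        b, c, h, w = m.shape
        tokens = m.flatten(2).transpose(1, 2)       # (B, H*W, C)
        n = self.norm1(tokens)
        y = self.attn(n, n, n)[0] + tokens          # Y = MEAM(Norm(M)) + M
        z = self.mlp(self.norm2(y)) + y             # Z = MLP(Norm(Y)) + Y
        return z.transpose(1, 2).reshape(b, c, h, w)


# Example: a 56x56 feature map with 96 channels
block = MEANBlockSketch(dim=96)
print(block(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```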
Second, we redesign the feature fusion network by adding a 128×128 feature map to the PANet structure, improving the detection of small defects in solar panels. We replace standard convolutions with PConv, which applies filters only to a subset of input channels, reducing computational cost to approximately one-fourth. This lightweight approach maintains spatial feature extraction while minimizing parameters. The PConv operation can be summarized as applying convolution to a fraction of channels, leaving others unchanged, thus enhancing efficiency without significant performance loss.
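A minimal sketch of the partial convolution, assuming the convolved fraction of channels is one-fourth as stated above (the class and argument names are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn


class PConvSketch(nn.Module):
    """Sketch of a partial convolution (PConv): a regular 3x3 convolution is applied
    to the first `part_ratio` fraction of channels, while the remaining channels pass
    through untouched, cutting FLOPs and parameters relative to a full convolution."""

    def __init__(self, channels: int, part_ratio: float = 0.25, kernel_size: int = 3):
        super().__init__()
        self.conv_channels = max(1, int(channels * part_ratio))
        self.conv = nn.Conv2d(
            self.conv_channels, self.conv_channels,
            kernel_size, padding=kernel_size // 2, bias=False,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split along the channel dimension: convolve one part, keep the rest as identity
        x_conv, x_id = torch.split(
            x, [self.conv_channels, x.size(1) - self.conv_channels], dim=1
        )
        return torch.cat([self.conv(x_conv), x_id], dim=1)


# Example: 256-channel feature map; only 64 channels are convolved
pconv = PConvSketch(channels=256)
print(pconv(torch.randn(1, 256, 32, 32)).shape)  # torch.Size([1, 256, 32, 32])
```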
Third, to address small target detection, we adopt the NWD loss function. Unlike IoU-based losses, which are highly sensitive to minor positional deviations for small objects, NWD models bounding boxes as 2D Gaussian distributions and measures their similarity with the Wasserstein distance. For a predicted box P = (cx_p, cy_p, w_p, h_p) and a ground-truth box G = (cx_g, cy_g, w_g, h_g), the corresponding Gaussians N_p and N_g are constructed, and the NWD is computed as:
$$W_2^2(N_p, N_g) = \left\| \left[ cx_p, cy_p, \frac{w_p}{2}, \frac{h_p}{2} \right]^{\mathrm{T}} - \left[ cx_g, cy_g, \frac{w_g}{2}, \frac{h_g}{2} \right]^{\mathrm{T}} \right\|_2^2$$
$$\text{NWD}(N_p, N_g) = \exp\left( -\frac{\sqrt{W_2^2(N_p, N_g)}}{C} \right)$$
$$L_{\text{NWD}} = 1 - \text{NWD}(N_p, N_g)$$
where C is a dataset-dependent constant. This loss function improves robustness to small object localization errors in solar panels.
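For illustration, a minimal PyTorch version of this loss, following the three equations above, might look like the sketch below; the function name, box layout, and default value of C are assumptions rather than the authors' implementation:

```python
import torch


def nwd_loss(pred: torch.Tensor, target: torch.Tensor, c: float = 12.8) -> torch.Tensor:
    """Sketch of an NWD loss for boxes in (cx, cy, w, h) format.

    Each box is modelled as a 2D Gaussian centred at (cx, cy) with scale (w/2, h/2);
    the squared Wasserstein distance then reduces to the squared L2 distance between
    the vectors [cx, cy, w/2, h/2]. `c` is a dataset-dependent constant (placeholder
    default; it should be tuned to the dataset at hand).
    """
    p = torch.stack([pred[..., 0], pred[..., 1], pred[..., 2] / 2, pred[..., 3] / 2], dim=-1)
    g = torch.stack([target[..., 0], target[..., 1], target[..., 2] / 2, target[..., 3] / 2], dim=-1)
    w2 = ((p - g) ** 2).sum(dim=-1)            # W_2^2(N_p, N_g)
    nwd = torch.exp(-torch.sqrt(w2) / c)       # NWD(N_p, N_g)
    return (1.0 - nwd).mean()                  # L_NWD


# Example: one predicted box vs. a slightly shifted ground-truth box
pred = torch.tensor([[50.0, 50.0, 10.0, 8.0]])
gt = torch.tensor([[52.0, 49.0, 11.0, 8.0]])
print(nwd_loss(pred, gt))
```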
We conducted experiments on a dataset of approximately 3000 solar panel images, categorized into five defect types: fine lines, black spots, scratches, broken grids, and incomplete cells. The dataset was split into training, validation, and test sets in a 6:2:2 ratio. The experimental setup used a Windows 10 system with an NVIDIA RTX 3060 Ti GPU, the PyTorch 1.6 framework, and Python. Training parameters included a batch size of 32, 50 epochs, an initial learning rate of 0.01 reduced to 0.001 after 20 epochs, and a weight decay of 0.0002. Evaluation metrics included mean average precision (mAP), frames per second (FPS), and parameter count (Params).
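As a rough sketch, the learning-rate schedule described above could be set up in PyTorch as follows; the choice of SGD and the placeholder model are assumptions, and only the hyperparameters are taken from the text:

```python
import torch

# Hypothetical training-schedule sketch: initial learning rate 0.01 dropped to 0.001
# after 20 epochs, weight decay 0.0002, 50 epochs in total (batch size 32 in practice).
model = torch.nn.Linear(8, 5)  # placeholder module so the snippet runs standalone
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.0002)  # optimizer choice assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)

for epoch in range(50):
    # ... one training epoch over the 6:2:2 train/val/test split would run here ...
    optimizer.step()             # placeholder for the per-batch updates
    scheduler.step()             # lr: 0.01 for epochs 0-19, 0.001 afterwards
```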
Ablation studies were performed to validate each component of the network. Starting from the baseline Swin-Transformer, we incrementally added MEAM, the enhanced PANet with the 128×128 feature map, PConv, and the NWD loss. Results are summarized in Table 1, showing that each modification contributes to accuracy or efficiency. For instance, MEAM alone increased mAP by 0.3% at the cost of a small increase in parameters, and adding PConv further improved speed with only a minor accuracy trade-off. The full model with the NWD loss achieved the best balance, highlighting the importance of each component for solar panel defect detection.
| Group | MEAM | Enhanced PANet | PConv | NWD | mAP (%) | Params (MB) | FPS |
|---|---|---|---|---|---|---|---|
| Baseline | – | – | – | – | 84.6 | 133.6 | 39.5 |
| A | ✓ | – | – | – | 84.9 | 135.5 | 38.7 |
| B | ✓ | ✓ | – | – | 85.2 | 133.4 | 39.9 |
| C | ✓ | ✓ | ✓ | – | 84.8 | 124.7 | 42.1 |
| D | ✓ | ✓ | ✓ | ✓ | 85.7 | 124.7 | 42.1 |
Comparative experiments with state-of-the-art methods demonstrate the superiority of our approach. As shown in Table 2, our MEAN-Transformer network achieves a mAP of 85.7%, outperforming Faster R-CNN, Swin-Transformer, and YOLOv5. While YOLOv5 offers competitive speed, our method provides higher accuracy and better handling of multi-scale defects in solar panels. Faster R-CNN, with its two-stage design, has higher computational costs and lower FPS, making it less suitable for real-time applications. Visual results confirm that our network reduces missed detections for small targets, such as fine cracks and black spots, ensuring comprehensive defect coverage in solar panels.
| Method | mAP (%) | Params (MB) | FPS |
|---|---|---|---|
| Faster R-CNN | 79.7 | 180.4 | 16.4 |
| Swin-Transformer | 84.6 | 133.6 | 39.5 |
| YOLOv5 | 84.5 | 107.7 | 42.7 |
| Our Method | 85.7 | 124.7 | 42.1 |
The multi-scale attention mechanism in MEAM allows the network to adapt dynamically to defects of different sizes in solar panels. By applying dilated sliding-window attention over grouped feature channels, it captures both fine details and broader context. The lightweight PConv implementation removes redundant computation, making the network feasible for deployment in resource-constrained environments. The NWD loss function mitigates the challenges of small object detection, as it is less sensitive to positional errors than IoU variants. In solar panel inspection, where defects range from large cracks to tiny spots, these innovations collectively enhance performance.
In conclusion, we have developed a MEAN-Transformer-based defect detection algorithm for solar panels that addresses multi-scale and small target challenges. By integrating multi-scale extended attention, lightweight feature fusion, and advanced loss metrics, our method achieves high accuracy and real-time speed. Future work could explore further optimizations, such as knowledge distillation or adversarial training, to enhance robustness across diverse solar panel types and conditions. This approach contributes to the automation of quality control in solar panel manufacturing, ensuring reliability and efficiency in renewable energy systems.
The effectiveness of our network is rooted in its ability to balance computational efficiency with detection performance. For solar panels, which often exhibit complex defect patterns, the MEAN-Transformer Block provides a scalable solution. The attention mechanism can be expressed in terms of query, key, and value transformations, where for each head, the output is computed as:
$$X_i^r = \text{Attention}(q_{ij}, K_r, V_r) = \text{Softmax}\left( \frac{q_{ij} K_r^T}{\sqrt{d_k}} \right) V_r$$
$$X_i^{\prime r} = X_i^r + \Delta X_i^{(r-1)}$$
$$X_{i+1} = \text{Concat}\left( \left[ \Delta X_i^r \right]_{r=1:4} \right)$$
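A minimal sketch of one such head is given below, using `F.unfold` to gather the dilated w × w neighbourhood around every query position; the function name and tensor layout are illustrative, and the progressive refinement and final linear projection from the equations are omitted for brevity:

```python
import torch
import torch.nn.functional as F


def dilated_window_attention(q, k, v, window: int = 3, dilation: int = 2):
    """Sketch of one MEAM head: each query position attends only to a `window` x `window`
    neighbourhood of keys/values sampled with the given dilation rate.

    q, k, v: (B, C, H, W) per-head feature maps. Returns (B, C, H, W) attended features.
    """
    b, c, h, w = q.shape
    pad = dilation * (window - 1) // 2
    # Gather the dilated w x w neighbourhood around every spatial position: (B, C*window^2, H*W)
    k_win = F.unfold(k, kernel_size=window, dilation=dilation, padding=pad).view(b, c, window * window, h * w)
    v_win = F.unfold(v, kernel_size=window, dilation=dilation, padding=pad).view(b, c, window * window, h * w)

    q_flat = q.view(b, c, 1, h * w)                                 # one query per position
    attn = (q_flat * k_win).sum(dim=1, keepdim=True) / (c ** 0.5)   # (B, 1, window^2, H*W)
    attn = attn.softmax(dim=2)                                      # softmax over the window positions
    out = (attn * v_win).sum(dim=2)                                 # (B, C, H*W)
    return out.view(b, c, h, w)


# Example: channel groups with different dilation rates capture different scales, then concatenate
x = torch.randn(1, 64, 32, 32)
groups = x.chunk(4, dim=1)                                          # one group per dilation rate
heads = [dilated_window_attention(g, g, g, window=3, dilation=r)
         for g, r in zip(groups, (1, 2, 3, 4))]
print(torch.cat(heads, dim=1).shape)                                # torch.Size([1, 64, 32, 32])
```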
This progressive refinement of features is crucial for identifying subtle defects in solar panels. In addition, the use of PConv in the feature fusion network significantly cuts parameters, since it operates on only a subset of channels. If the input has C channels and a fraction k of them is convolved, the computational cost scales with k relative to a standard convolution over all channels, with k typically set to 1/4. This makes the network more practical for real-world solar panel applications.
In practice, the deployment of such models can lead to substantial improvements in solar panel production lines. By automating defect detection, manufacturers can reduce reliance on manual labor, minimize errors, and increase throughput. Our experiments validate that the proposed network not only meets but exceeds existing methods in key metrics, paving the way for wider adoption in industrial settings. As solar energy continues to grow, ensuring the quality of solar panels through advanced AI-driven methods will be paramount for sustainable development.
