Solar panels are critical components in photovoltaic power generation systems, directly influencing energy conversion efficiency and operational safety. However, existing defect detection algorithms often fail to fully leverage the complementary strengths of Convolutional Neural Networks (CNNs) and Transformers, limiting overall performance. To address this, I propose a Global and Local Feature Enhanced YOLOX (GLF-YOLOX) algorithm for solar panel defect detection. This method integrates a dual-branch backbone network combining CNN and Transformer architectures, a Global and Local Enhanced Attention Mechanism (GLE-AM), and a detection head with Transformer Encoder Layers (TEL) to improve feature extraction, fusion, and classification accuracy. Extensive experiments demonstrate that GLF-YOLOX achieves a mean Average Precision (mAP) of 93.10%, a 5.53-percentage-point improvement over the YOLOX baseline, outperforming both mainstream detectors and recent solar-panel-specific methods and validating its effectiveness and robustness in detecting solar panel defects.

Introduction
The rapid adoption of solar panels in renewable energy systems underscores the need for efficient defect detection to ensure longevity and safety. Defects such as cracks, black cores, and finger interruptions can significantly reduce the efficiency of solar panels and pose risks like electrical faults. Traditional methods, including manual inspection and machine vision techniques, are often time-consuming and prone to errors. While deep learning-based approaches, particularly CNNs, have shown promise, they struggle with capturing global contextual information due to their localized receptive fields. Conversely, Transformers excel at modeling long-range dependencies but may overlook fine-grained details. In this work, I introduce GLF-YOLOX, which synergizes CNN and Transformer capabilities to enhance defect detection in solar panels. The key contributions include a dual-branch backbone for robust feature extraction, GLE-AM for dynamic feature fusion, and a TEL-based detection head for improved classification. This approach addresses limitations in current methods by balancing local and global feature representation, leading to superior performance in complex scenarios involving solar panels.
Methodology
The GLF-YOLOX framework is designed to optimize defect detection in solar panels by integrating local and global features. The architecture comprises three main components: a dual-branch backbone network, the GLE-AM module, and an enhanced detection head. Below, I detail each component with mathematical formulations and structural insights.
Dual-Branch Backbone Network
The backbone combines CSPDarknet53 (CNN branch) and Swin Transformer (Transformer branch) to extract multi-scale features from solar panel images. The CNN branch processes input images through convolutional layers, generating feature maps at resolutions of $\frac{H}{4} \times \frac{W}{4}$, $\frac{H}{8} \times \frac{W}{8}$, $\frac{H}{16} \times \frac{W}{16}$, and $\frac{H}{32} \times \frac{W}{32}$ with channel dimensions of 64, 128, 256, and 512, respectively. This branch excels at capturing local details like edges and textures in solar panels. The Transformer branch employs a hierarchical Swin Transformer, where input images are divided into non-overlapping patches via Patch Partition. Each patch is transformed into tokens with a linear embedding layer, and Swin Transformer Blocks process them across stages. The resolutions align with the CNN branch, and window-based multi-head self-attention (W-MSA) and shifted window multi-head self-attention (SW-MSA) mechanisms enable global context modeling. The output features from both branches are fused to leverage complementary information. The fusion process can be summarized as:
$$ \text{Fused Feature} = \text{CNN Feature} \oplus \text{Transformer Feature} $$
where $\oplus$ denotes either element-wise addition or channel-wise concatenation, chosen so that the local detail captured by the CNN branch and the global context captured by the Transformer branch complement each other at each scale.
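As a concrete illustration of this fusion step, the PyTorch sketch below projects both branches to a common channel width with $1 \times 1$ convolutions and then combines them by addition or concatenation. The module name `DualBranchFusion`, the projection layers, and the `mode` switch are illustrative assumptions rather than the exact design used in GLF-YOLOX.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Fuse CNN and Transformer feature maps of the same spatial resolution.
    Channel widths and the add-vs-concat choice are assumptions for illustration."""
    def __init__(self, cnn_channels: int, trans_channels: int, out_channels: int, mode: str = "add"):
        super().__init__()
        self.mode = mode
        # 1x1 convolutions project both branches to a common channel dimension.
        self.proj_cnn = nn.Conv2d(cnn_channels, out_channels, kernel_size=1)
        self.proj_trans = nn.Conv2d(trans_channels, out_channels, kernel_size=1)
        if mode == "concat":
            # After concatenation, reduce back to out_channels.
            self.reduce = nn.Conv2d(2 * out_channels, out_channels, kernel_size=1)

    def forward(self, f_cnn: torch.Tensor, f_trans: torch.Tensor) -> torch.Tensor:
        f_cnn = self.proj_cnn(f_cnn)
        f_trans = self.proj_trans(f_trans)
        if self.mode == "add":
            return f_cnn + f_trans                                   # element-wise addition
        return self.reduce(torch.cat([f_cnn, f_trans], dim=1))      # concatenation + reduction

# Example: fuse the H/8 x W/8 features (128 CNN channels, 128 Transformer channels).
fuse = DualBranchFusion(cnn_channels=128, trans_channels=128, out_channels=128)
fused = fuse(torch.randn(1, 128, 80, 80), torch.randn(1, 128, 80, 80))
```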
Global and Local Enhanced Attention Mechanism (GLE-AM)
GLE-AM dynamically integrates global and local features to enhance focus on defect regions in solar panels. It consists of three sub-modules: Global Feature Enhancement Module (GFEM), Local Feature Enhancement Module (LFEM), and a Squeeze-and-Excitation (SE) block. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, GFEM computes global average pooling (GAP) to obtain a compressed representation $X_1 \in \mathbb{R}^{C}$:
$$ X_1 = \text{GAP}(X), \qquad X_1^{(c)} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{c,i,j}, \quad c = 1, \dots, C $$
The absolute difference between $X$ and $X_1$ is calculated to capture global disparities:
$$ X_{\text{abs}} = |X - X_1| $$
LFEM extracts local salient features by applying max and average pooling, followed by subtraction:
$$ X_m = \text{MaxPool}(X), \quad X_a = \text{AvgPool}(X) $$
$$ X_2 = X_m - X_a $$
The SE module generates channel-wise attention weights using 1D convolutions $C_1$ and $C_2$, batch normalization (BN), and Mish activation $M(\cdot)$:
$$ \text{SE}(X) = C_2(M(\text{BN}(C_1(\text{AvgPool1D}(X))))) + C_2(M(\text{BN}(C_1(\text{MaxPool1D}(X))))) $$
The final attention weight matrix $\theta$ and optimized feature $X_{\text{out}}$ are computed as:
$$ \theta = \sigma(\text{SE}(X)) \otimes \sigma(X_{\text{abs}} – X_2) $$
$$ X_{\text{out}} = X + \theta $$
where $\sigma$ is the Sigmoid function, and $\otimes$ denotes element-wise multiplication. This mechanism ensures that global features concentrate on target areas while local features refine details, crucial for detecting subtle defects in solar panels.
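To make the data flow of GLE-AM concrete, the following PyTorch sketch implements the equations above. Two points are assumptions rather than details stated here: the LFEM max/average poolings are taken as $3 \times 3$, stride-1 spatial poolings so that $X_2$ has the same shape as $X_{\text{abs}}$, and the 1D-convolution kernel size in the SE path is a free hyperparameter; the class name `GLEAM` is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLEAM(nn.Module):
    """Minimal sketch of the Global and Local Enhanced Attention Mechanism."""
    def __init__(self, k: int = 3):
        super().__init__()
        # Shared SE path: two 1D convolutions over the channel axis with BN and Mish.
        self.c1 = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm1d(1)
        self.c2 = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.mish = nn.Mish()

    def _se_path(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (B, C) channel descriptor -> (B, 1, C) for the 1D convolutions.
        y = self.c1(pooled.unsqueeze(1))
        y = self.c2(self.mish(self.bn(y)))
        return y.squeeze(1)                                 # (B, C)

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, C, H, W)
        b, c, _, _ = x.shape
        # GFEM: global average pooling and the absolute deviation from it.
        x1 = x.mean(dim=(2, 3), keepdim=True)               # (B, C, 1, 1)
        x_abs = (x - x1).abs()
        # LFEM: local max/avg pooling (assumed 3x3, stride 1) and their difference.
        x2 = F.max_pool2d(x, 3, stride=1, padding=1) - F.avg_pool2d(x, 3, stride=1, padding=1)
        # SE: channel attention from average- and max-pooled descriptors.
        se = self._se_path(x.mean(dim=(2, 3))) + self._se_path(x.amax(dim=(2, 3)))
        # theta = sigmoid(SE(X)) * sigmoid(X_abs - X_2); X_out = X + theta.
        theta = torch.sigmoid(se).view(b, c, 1, 1) * torch.sigmoid(x_abs - x2)
        return x + theta
```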
Detection Head with Transformer Encoder Layer
The detection head incorporates a Classification Branch (CLS Branch) with a TEL module to improve classification accuracy for solar panel defects. The TEL module includes positional encoding and Transformer blocks to model global context and spatial relationships. For each feature layer, the output is a tensor of shape $(H \times W, 4 + 1 + \text{num\_classes})$, where 4 represents the bounding box parameters $(x, y, w, h)$, 1 indicates the presence of a defect, and $\text{num\_classes}$ corresponds to the number of defect types in solar panels. The Transformer block uses multi-head self-attention to capture dependencies:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
where $Q$, $K$, and $V$ are query, key, and value matrices. This enhances feature representation, reducing misclassification in overlapping or complex backgrounds common in solar panel imagery.
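A minimal sketch of such a TEL-based classification branch is shown below, using PyTorch's built-in `nn.TransformerEncoderLayer` for the multi-head self-attention block. The learned positional encoding, embedding width, head count, and the class name `TELClsBranch` are assumptions for illustration, not the exact configuration of GLF-YOLOX.

```python
import torch
import torch.nn as nn

class TELClsBranch(nn.Module):
    """Classification branch with a Transformer Encoder Layer (sketch)."""
    def __init__(self, channels: int, num_classes: int, num_heads: int = 4, max_tokens: int = 80 * 80):
        super().__init__()
        # Learned positional encoding for up to max_tokens spatial positions (assumption).
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, channels))
        self.encoder = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads, batch_first=True)
        self.cls = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                # (B, H*W, C)
        tokens = tokens + self.pos[:, : h * w]               # add positional information
        tokens = self.encoder(tokens)                        # multi-head self-attention + FFN
        x = tokens.transpose(1, 2).reshape(b, c, h, w)       # back to spatial layout
        return self.cls(x)                                   # (B, num_classes, H, W) class scores
```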
Experiments and Results
To evaluate GLF-YOLOX, I conducted experiments on a dataset of solar panel electroluminescence (EL) images, comprising 3,700 samples with five defect types: crack, black core, thick line, finger, and star crack. The dataset was split into 2,960 training and 740 validation images. Training used SGD optimization with a learning rate of 0.001, weight decay of 0.0005, and 200 epochs. Performance metrics included mAP, precision (PR), recall (RC), FLOPs, parameters, and inference time.
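For reference, a minimal optimizer setup matching these hyperparameters might look like the sketch below; the momentum value and the helper name `build_optimizer` are assumptions, since momentum and batch size are not reported above.

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module) -> torch.optim.SGD:
    # SGD with lr = 0.001 and weight decay = 0.0005 as stated in the text;
    # momentum = 0.9 is an assumed, commonly used value.
    return torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
```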
Ablation Studies
Ablation experiments assessed the contribution of each component in GLF-YOLOX for solar panel defect detection. Results are summarized in Table 1 and Table 2.

Table 1: Ablation of the backbone, attention mechanism, and detection head.

Combination | Backbone | Attention | Head | mAP (%) | FLOPs (M) | Parameters (M) |
---|---|---|---|---|---|---|
(a) | CNN | None | Default | 87.30 | 6791.23 | 4.82 |
(b) | Transformer | None | Default | 86.44 | 12534.22 | 10.12 |
(c) | Dual | None | Default | 88.72 | 19752.83 | 15.43 |
(d) | Dual | SE | Default | 89.43 | 22315.57 | 16.09 |
(e) | Dual | CBAM | Default | 90.28 | 23741.28 | 16.67 |
(f) | Dual | GLE-AM | Default | 91.80 | 24528.65 | 17.36 |
(g) | Dual | GLE-AM | TEL | 93.10 | 29365.11 | 21.93 |

Table 2: Per-class AP for each ablation combination.

Combination | mAP (%) | AP Crack (%) | AP Black Core (%) | AP Thick Line (%) | AP Finger (%) | AP Star Crack (%) | FLOPs (M) | Parameters (M) |
---|---|---|---|---|---|---|---|---|
(a) | 87.30 | 86.11 | 88.12 | 86.37 | 87.20 | 88.16 | 6791.23 | 4.82 |
(b) | 86.44 | 85.60 | 85.89 | 85.23 | 86.44 | 87.00 | 12534.22 | 10.12 |
(c) | 88.72 | 87.20 | 88.92 | 88.16 | 88.72 | 90.00 | 19752.83 | 15.43 |
(d) | 89.43 | 89.77 | 88.63 | 88.83 | 89.43 | 90.50 | 22315.57 | 16.09 |
(e) | 90.28 | 89.26 | 90.15 | 89.75 | 90.28 | 91.20 | 23741.28 | 16.67 |
(f) | 91.80 | 91.52 | 91.87 | 91.66 | 91.80 | 92.50 | 24528.65 | 17.36 |
(g) | 93.10 | 92.22 | 93.04 | 93.18 | 93.10 | 94.00 | 29365.11 | 21.93 |
The ablation studies show that the dual-branch backbone alone improves mAP to 88.72%, while adding GLE-AM boosts it to 91.80%. Incorporating TEL in the detection head achieves the highest mAP of 93.10%, demonstrating the synergy of components for solar panel defect detection. The computational cost increases but remains feasible for real-time applications.
Comparison with State-of-the-Art Methods
I compared GLF-YOLOX with mainstream detectors and specialized solar panel defect detection algorithms. Results in Table 3 highlight the superiority of GLF-YOLOX in mAP, precision, and recall for solar panels.

Table 3: Comparison with mainstream and specialized detection methods.

Method | mAP (%) | RC (%) | PR (%) | Time (ms) | FLOPs (M) | Parameters (M) |
---|---|---|---|---|---|---|
RetinaNet | 62.88 | 62.17 | 61.23 | 116.72 | 77966.19 | 30.04 |
YOLOv3 | 64.49 | 63.92 | 62.82 | 19.48 | 77607.16 | 61.57 |
EfficientDet | 65.77 | 64.71 | 63.98 | 17.71 | 3624.17 | 3.83 |
CenterNet | 66.51 | 65.72 | 64.82 | 20.23 | 54650.80 | 32.66 |
Faster R-CNN | 69.63 | 68.88 | 67.91 | 148.94 | 473284.69 | 28.36 |
YOLOv4 | 73.65 | 72.83 | 71.92 | 17.66 | 70788.35 | 63.98 |
YOLOv5 | 78.18 | 77.59 | 76.77 | 16.87 | 8222.87 | 7.09 |
YOLOX | 87.57 | 86.47 | 85.92 | 14.69 | 15377.12 | 10.59 |
Dn-YOLOv7 | 89.34 | 85.72 | 86.33 | 18.62 | 16037.82 | 17.84 |
Gbh-YOLOv5 | 90.03 | 88.79 | 87.94 | 18.95 | 16922.41 | 17.92 |
ESD-YOLOv8 | 90.45 | 89.68 | 88.55 | 17.24 | 18893.54 | 20.13 |
RAFBSD | 90.86 | 89.02 | 89.57 | 18.21 | 20785.72 | 23.75 |
YOLOv8-MNS | 91.28 | 89.21 | 88.75 | 16.53 | 15521.12 | 14.50 |
GLF-YOLOX (Ours) | 93.10 | 91.57 | 92.43 | 19.01 | 29365.11 | 21.93 |
GLF-YOLOX achieves the highest mAP of 93.10% with a reasonable inference time of 19.01 ms, making it suitable for real-time solar panel inspection. The precision and recall rates of 92.43% and 91.57%, respectively, indicate robust detection capabilities across various defect types in solar panels.
Analysis of Dual-Branch Backbone Combinations
I evaluated different CNN and Transformer combinations in the dual-branch backbone for solar panel defect detection. Results in Table 4 show that CSPDarknet53 with Swin Transformer yields the best performance.

Table 4: Dual-branch backbone combinations of CNN and Transformer branches.

CNN Branch | Transformer Branch | mAP (%) | FLOPs (M) | Parameters (M) |
---|---|---|---|---|
DarkNet53 | PvT | 88.65 | 35660.65 | 25.96 |
DarkNet53 | CvT | 88.93 | 37906.67 | 28.87 |
DarkNet53 | Swin | 90.27 | 36275.34 | 27.41 |
ResNet50 | PvT | 88.81 | 25823.48 | 18.74 |
ResNet50 | CvT | 89.33 | 28169.55 | 21.65 |
ResNet50 | Swin | 90.93 | 26538.68 | 20.19 |
DenseNet121 | PvT | 89.39 | 22119.52 | 12.65 |
DenseNet121 | CvT | 90.27 | 24465.54 | 15.56 |
DenseNet121 | Swin | 91.14 | 22834.74 | 14.10 |
CSPDarknet53 | PvT | 90.59 | 27153.27 | 20.49 |
CSPDarknet53 | CvT | 91.37 | 29500.24 | 23.40 |
CSPDarknet53 | Swin | 93.10 | 29365.11 | 21.93 |
The CSPDarknet53 and Swin Transformer combination achieves a balance of high mAP and manageable complexity, emphasizing its efficacy for solar panel applications.
Conclusion
In this work, I presented GLF-YOLOX, an enhanced defect detection algorithm for solar panels that integrates global and local feature extraction. By combining CNN and Transformer architectures in a dual-branch backbone, employing GLE-AM for dynamic feature fusion, and incorporating TEL in the detection head, the method achieves significant improvements in accuracy and robustness. Experimental results on solar panel EL images demonstrate an mAP of 93.10%, outperforming existing methods. Future work will focus on incorporating multimodal data and model compression to further enhance performance and deployment efficiency in real-world solar panel inspection scenarios.