Enhanced Solar Panel Defect Detection via Global and Local Feature Fusion

1. Introduction

Solar panels, as critical components of photovoltaic power generation systems, directly influence energy conversion efficiency and operational safety. However, existing defect detection algorithms often struggle to balance local feature extraction and global context modeling, limiting their performance in complex scenarios such as low-contrast images, blurred boundaries, or cluttered backgrounds. Traditional convolutional neural networks (CNNs) excel at capturing local details like edges and textures but lack the ability to model long-range dependencies. Conversely, Transformers leverage self-attention mechanisms to capture global relationships but underperform in low-level feature extraction. To address these limitations, we propose Global and Local Feature Enhanced YOLOX (GLF-YOLOX), a novel solar panel defect detection framework that synergizes the strengths of CNNs and Transformers. Our method achieves a mean Average Precision (mAP) of 93.10%, surpassing state-of-the-art approaches by 4.5%, while maintaining real-time inference capabilities.


2. Methodology

2.1 Dual-Branch Backbone Network

The backbone integrates CSPDarknet53 (CNN-based) and Swin Transformer to extract complementary features:

  • CNN Branch: Captures fine-grained local features (e.g., cracks, texture irregularities) through hierarchical convolutions. Output resolutions are reduced progressively to H/4 × W/4, H/8 × W/8, H/16 × W/16, and H/32 × W/32, with channel dimensions increasing to 64, 128, 256, and 512.
  • Transformer Branch: Processes input via shifted window multi-head self-attention (SW-MSA) to model global dependencies. Token dimensions are scaled from C = 64 to 512 across four stages, matching the CNN's resolution hierarchy.

The fusion of these branches ensures robust multi-scale feature representation, critical for detecting solar panel defects in diverse scenarios.
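
As a concrete illustration, the sketch below shows one way the same-scale outputs of the two branches could be combined. It is a minimal PyTorch sketch assuming channel-aligned feature maps and a 1×1-convolution fusion block; the class name DualBranchFusion and the BatchNorm/SiLU choices are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Fuse same-scale feature maps from the CNN and Transformer branches.

    Assumes both branches emit (B, C, H/k, W/k) tensors at each of the four
    stages described above; the 1x1-conv fusion is an illustrative choice.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, cnn_feat: torch.Tensor, swin_feat: torch.Tensor) -> torch.Tensor:
        # Concatenate along channels, then project back to C channels.
        return self.fuse(torch.cat([cnn_feat, swin_feat], dim=1))

# Example: stage-3 features at H/16 x W/16 with 256 channels.
cnn_feat = torch.randn(1, 256, 40, 40)
swin_feat = torch.randn(1, 256, 40, 40)
fused = DualBranchFusion(256)(cnn_feat, swin_feat)   # -> (1, 256, 40, 40)
```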

2.2 Global and Local Enhanced Attention Mechanism (GLE-AM)

GLE-AM dynamically calibrates feature discrepancies between global and local representations:

  1. Global Feature Extraction Module (GFEM):
    • Computes absolute differences between the input features $X \in \mathbb{R}^{C \times H \times W}$ and their global average-pooled values: $x_{\mathrm{abs}} = |X - \mathrm{GAP}(X)|$.
  2. Local Feature Extraction Module (LFEM):
    • Derives local saliency via the max-average pooling difference: $x_2 = \mathrm{MaxPool}(X) - \mathrm{AvgPool}(X)$.
  3. Squeeze-and-Excitation (SE):
    • Optimizes channel-wise weights using the Mish activation: $\mathrm{SE}(X) = C_2(C_1(\mathrm{AvgPool}(X))) + C_2(C_1(\mathrm{MaxPool}(X)))$.
    • Final attention weights: $\theta = \sigma(\mathrm{SE}(X)) \otimes (\sigma(x_{\mathrm{abs}}) \times \sigma(x_2))$.
    • Enhanced output: $x_{\mathrm{out}} = X + \theta$.

This mechanism enhances defect region focus while suppressing background noise.
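
A minimal PyTorch sketch of this attention block, following the formulas above, is given below; the 3×3 local pooling window and the SE reduction ratio of 16 are assumptions made for illustration rather than values taken from the paper.

```python
import torch
import torch.nn as nn

class GLEAM(nn.Module):
    """Sketch of the GLE-AM block following the formulas above.

    The 3x3 local pooling window and the SE reduction ratio of 16 are
    illustrative assumptions.
    """
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Local pooling with stride 1 keeps the spatial size so x2 aligns with X.
        self.max_pool = nn.MaxPool2d(kernel_size, stride=1, padding=pad)
        self.avg_pool = nn.AvgPool2d(kernel_size, stride=1, padding=pad)
        # SE branch: C1 squeezes channels, C2 restores them, with Mish in between.
        self.c1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.c2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)
        self.act = nn.Mish()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GFEM: deviation of every position from the per-channel global average.
        x_abs = (x - x.mean(dim=(2, 3), keepdim=True)).abs()
        # LFEM: local saliency as the max-average pooling difference.
        x2 = self.max_pool(x) - self.avg_pool(x)
        # SE: channel weights from globally average- and max-pooled statistics.
        se = self.c2(self.act(self.c1(x.mean(dim=(2, 3), keepdim=True)))) + \
             self.c2(self.act(self.c1(x.amax(dim=(2, 3), keepdim=True))))
        # theta = sigma(SE(X)) (x) (sigma(x_abs) * sigma(x2)); x_out = X + theta.
        theta = self.sigmoid(se) * (self.sigmoid(x_abs) * self.sigmoid(x2))
        return x + theta

out = GLEAM(256)(torch.randn(1, 256, 40, 40))   # same shape as the input
```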

2.3 Transformer-Enhanced Detection Head

To address misclassification of similar defects, we design a Class-Attention Transformer Encoder Layer (CLs TEL):

  • Positional Encoding: Embeds spatial context for precise defect localization.
  • Transformer Block: Optimizes global feature relationships via multi-head attention.
  • Output dimensions: (H, W, 4 + 1 + num_classes), where 4 denotes the bounding box coordinates, 1 is the defect confidence score, and num_classes is the number of defect categories (see the sketch following this list).
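
The sketch below illustrates the idea behind the Transformer-enhanced head: flattened head features pass through a standard Transformer encoder layer before a 1×1 convolution produces the (4 + 1 + num_classes) prediction per location. It is a simplified, hypothetical stand-in for CLs TEL; the positional encoding is omitted and noted in the comments.

```python
import torch
import torch.nn as nn

class ClassAttentionHead(nn.Module):
    """Simplified, hypothetical stand-in for the CLs TEL detection head."""
    def __init__(self, channels: int, num_classes: int, nhead: int = 8):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=channels, nhead=nhead,
                                                  batch_first=True)
        # Per-location prediction: 4 box offsets + 1 objectness + class scores.
        self.pred = nn.Conv2d(channels, 4 + 1 + num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C) token sequence
        # A positional encoding would normally be added to `tokens` here; it is
        # omitted in this sketch for brevity.
        tokens = self.encoder(tokens)              # global token mixing via MHSA
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.pred(x)                        # (B, 4+1+num_classes, H, W)

head = ClassAttentionHead(channels=256, num_classes=5)
out = head(torch.randn(1, 256, 20, 20))            # -> (1, 10, 20, 20)
```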

3. Experiments

3.1 Dataset and Training

  • Dataset: 3,700 electroluminescence (EL) images of solar panels, annotated with five defect types: crack, black core, thick line, finger, and star crack.
  • Split: 2,960 training and 740 validation images.
  • Hardware: NVIDIA RTX 3090 GPU, Intel i7-11700K CPU.
  • Hyperparameters: SGD optimizer, learning rate 0.001, weight decay 0.0005, batch size 8 (see the optimizer sketch after this list).
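
For reference, the listed settings map directly onto a standard PyTorch SGD optimizer, as in the sketch below; the momentum value of 0.9 is an assumed default not stated above, and the placeholder model stands in for the assembled GLF-YOLOX network.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder for the assembled GLF-YOLOX network
# lr and weight decay follow the values listed above; momentum 0.9 is an assumed
# default and is not specified in the text.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
```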

3.2 Ablation Study

Table 1 validates the incremental contributions of GLF-YOLOX components:

| Configuration        | mAP (%) | FLOPs (M) | Params (M) |
|----------------------|---------|-----------|------------|
| CNN Only             | 87.30   | 6,791.23  | 4.82       |
| Transformer Only     | 86.44   | 12,534.22 | 10.12      |
| Dual-Branch          | 88.72   | 19,752.83 | 15.43      |
| Dual-Branch + SE     | 89.43   | 22,315.56 | 16.09      |
| Dual-Branch + CBAM   | 90.28   | 23,741.28 | 16.67      |
| Dual-Branch + GLE-AM | 91.80   | 24,528.65 | 17.36      |
| Full GLF-YOLOX       | 93.10   | 29,365.11 | 21.93      |

GLE-AM and CLs TEL collectively improve mAP by 4.38%, demonstrating their efficacy in solar panel defect detection.

3.3 Comparative Analysis

GLF-YOLOX outperforms existing methods across critical metrics (Table 2):

| Method     | mAP (%) | Recall (%) | Precision (%) | Inference Time (ms) |
|------------|---------|------------|---------------|---------------------|
| RetinaNet  | 62.88   | 62.17      | 61.23         | 116.72              |
| YOLOv5     | 78.18   | 77.59      | 76.77         | 16.87               |
| YOLOX      | 87.57   | 86.47      | 85.92         | 14.69               |
| Gbh-YOLOv5 | 90.03   | 88.79      | 87.94         | 18.95               |
| ESD-YOLOv8 | 90.45   | 89.68      | 88.55         | 17.24               |
| GLF-YOLOX  | 93.10   | 91.57      | 92.43         | 19.01               |

Our method achieves superior accuracy while maintaining real-time performance, crucial for industrial solar panel inspections.


4. Mathematical Formulation

4.1 Swin Transformer Window Attention

The shifted window self-attention in the Swin Transformer block is defined as:

$$
\begin{aligned}
z' &= \text{W-MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1}, \\
z^{l} &= \mathrm{MLP}(\mathrm{LN}(z')) + z', \\
z'' &= \text{SW-MSA}(\mathrm{LN}(z^{l})) + z^{l}, \\
z^{l+1} &= \mathrm{MLP}(\mathrm{LN}(z'')) + z'',
\end{aligned}
$$

where W-MSA and SW-MSA denote window-based and shifted window multi-head self-attention, and LN denotes layer normalization.
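
The cyclic shift that distinguishes SW-MSA from W-MSA can be illustrated in a few lines of PyTorch, as sketched below. Only the shift and window-partition steps are shown, with toy tensor sizes chosen for illustration; the attention computation inside each window is omitted.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) map into (num_windows * B, window_size**2, C) tokens."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

window_size = 4
x = torch.randn(1, 8, 8, 96)                       # toy (B, H, W, C) feature map
# Cyclic shift by window_size // 2 so neighbouring windows exchange information
# before SW-MSA; W-MSA would partition the unshifted map instead.
shifted = torch.roll(x, shifts=(-window_size // 2, -window_size // 2), dims=(1, 2))
windows = window_partition(shifted, window_size)   # -> (4, 16, 96)
# Multi-head self-attention would then run independently inside each window.
```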

4.2 Loss Function

The total loss $L_{\mathrm{total}}$ combines classification ($L_{\mathrm{cls}}$), regression ($L_{\mathrm{reg}}$), and objectness ($L_{\mathrm{obj}}$) losses:

$$L_{\mathrm{total}} = \lambda_1 L_{\mathrm{cls}} + \lambda_2 L_{\mathrm{reg}} + \lambda_3 L_{\mathrm{obj}},$$

with $\lambda_1$, $\lambda_2$, and $\lambda_3$ as balancing coefficients.
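
A minimal sketch of assembling this weighted sum is shown below; the concrete loss functions used for each term and the λ values are placeholders, since they are not specified here.

```python
import torch
import torch.nn as nn

# Placeholder loss terms: BCE for classification/objectness and SmoothL1 for the
# box regression; the actual choices and lambda values are assumptions.
bce = nn.BCEWithLogitsLoss()
reg = nn.SmoothL1Loss()

def total_loss(cls_pred, cls_tgt, box_pred, box_tgt, obj_pred, obj_tgt,
               lambdas=(1.0, 5.0, 1.0)):
    l_cls = bce(cls_pred, cls_tgt)       # classification term L_cls
    l_reg = reg(box_pred, box_tgt)       # regression term L_reg
    l_obj = bce(obj_pred, obj_tgt)       # objectness term L_obj
    return lambdas[0] * l_cls + lambdas[1] * l_reg + lambdas[2] * l_obj
```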


5. Conclusion

We present GLF-YOLOX, a hybrid framework that synergizes CNNs and Transformers for solar panel defect detection. By integrating dual-branch feature extraction, GLE-AM, and a Transformer-enhanced detection head, our method achieves state-of-the-art performance (93.10% mAP) with real-time inference. Future work will focus on lightweight deployment and multi-modal data fusion to further enhance robustness in challenging environments.
