In recent years, growing energy shortages and environmental concerns have highlighted the importance of renewable energy sources. Solar panels, the key component of photovoltaic power generation, have been widely deployed around the world. Because they operate outdoors for long periods, however, solar panels are prone to defects such as cracks, dirt spots, and broken grids. These defects not only reduce efficiency and lifespan but also pose safety risks to entire power stations. Defect detection in solar panels is therefore crucial to ensure optimal energy conversion, extend service life, and lower maintenance costs.

With advancements in computer hardware and algorithms, deep learning-based methods for solar panel defect detection have emerged. These models use convolutional neural networks (CNNs) to learn and extract features automatically, enabling defect localization and classification with good robustness and generalization. However, challenges remain, particularly in aerial inspections: small defect targets are difficult to detect, and detection is often slow because of computational complexity. To address these issues, I propose an improved Detection Transformer (DETR) method specifically tailored for solar panel defect detection. This approach integrates relative position encoding, dynamic sparse attention, and Focal Loss to enhance performance on small targets, reduce computational overhead, and improve classification of challenging samples.

The core of my method builds upon the DETR framework, which formulates object detection as a set prediction problem using a Transformer architecture. Traditional DETR employs absolute position encoding in its self-attention mechanisms, which may not adequately capture relative spatial relationships, especially for small defect targets in aerial images of solar panels. Additionally, the self-attention computation in Transformers scales quadratically with sequence length, leading to slow detection speeds. Furthermore, difficult-to-classify samples, such as blurred or occluded defects, are often mishandled due to class imbalance. My improvements tackle these limitations systematically, resulting in a more efficient and accurate detector for solar panel inspections.
First, to enhance detection of small defects in solar panels, I introduce relative position encoding (RPE) into the Transformer. In standard self-attention, the attention score between elements at positions \(i\) and \(j\) is computed with absolute position vectors. Let \(E_{x_i}\) and \(E_{x_j}\) be token embeddings, and \(U_i\) and \(U_j\) be absolute position vectors. The attention score \(A^{\text{abs}}_{i,j}\) is given by:
$$A^{\text{abs}}_{i,j} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j$$
where \(W_q\) and \(W_k\) are query and key weight matrices. To incorporate relative position, I replace \(U_j\) with a relative position vector \(R_{i-j}\), which depends on the offset between positions \(i\) and \(j\). This yields a modified attention score \(A^{\text{rel}}_{i,j}\):
$$A^{\text{rel}}_{i,j} = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i-j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i-j}$$
Here, \(W_{k,E}\) and \(W_{k,R}\) are separate key weight matrices for content embeddings and relative positions, respectively, and \(u\) and \(v\) are learnable vectors that replace the query-side position terms \(U_i^T W_q^T\) in the third and fourth terms, so that these two biases no longer depend on the absolute position \(i\). This formulation allows the model to better perceive spatial relationships among small defect features in solar panel images, improving detection accuracy for targets like diode failures that appear as tiny hot spots.
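
As a minimal sketch of the score \(A^{\text{rel}}_{i,j}\), the following pure-Python snippet evaluates its four terms for a toy sequence. The helper names (`dot`, `matvec`, `rel_attention_terms`), the dimensions, and the random weights are illustrative assumptions, not the trained model.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in m]


def rel_attention_terms(e_i, e_j, r_offset, w_q, w_ke, w_kr, u, v):
    """Return the four terms of A^rel_{i,j} in the order of the equation."""
    q = matvec(w_q, e_i)                # W_q E_{x_i}
    k_content = matvec(w_ke, e_j)       # W_{k,E} E_{x_j}
    k_position = matvec(w_kr, r_offset)  # W_{k,R} R_{i-j}
    return (
        dot(q, k_content),   # content-content term
        dot(q, k_position),  # content-position term
        dot(u, k_content),   # global content bias (u replaces U_i^T W_q^T)
        dot(v, k_position),  # global position bias (v replaces U_i^T W_q^T)
    )


# Toy setup (assumed values): 6 tokens, feature dimension 4, random weights.
random.seed(0)
D, N = 4, 6


def rand_vec():
    return [random.uniform(-1.0, 1.0) for _ in range(D)]


def rand_mat():
    return [rand_vec() for _ in range(D)]


w_q, w_ke, w_kr = rand_mat(), rand_mat(), rand_mat()
u, v = rand_vec(), rand_vec()
embeddings = [rand_vec() for _ in range(N)]
# One relative position vector per offset i - j, shared by all token pairs.
rel_table = {offset: rand_vec() for offset in range(-(N - 1), N)}


def terms(i, j):
    return rel_attention_terms(embeddings[i], embeddings[j], rel_table[i - j],
                               w_q, w_ke, w_kr, u, v)


def score(i, j):
    """A^rel_{i,j} as the sum of its four terms."""
    return sum(terms(i, j))


print(score(2, 1), score(5, 4))
```

The position-only term `terms(i, j)[3]` is identical for every pair with the same offset \(i - j\), which is the property that makes the encoding relative rather than absolute.
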
Second, to accelerate detection speed, I integrate a Dynamic Sparse Attention (DSA) module. The self-attention mechanism in DETR requires computing pairwise interactions across all tokens, leading to \(O(n^2)\) complexity for sequence length \(n\). The DSA module dynamically sparsifies attention patterns based on input features, reducing computations while preserving essential information. Specifically, a lightweight predictor learns to select a subset of key tokens for each query, converting dense attention into sparse matrix multiplications. The sparse output is then processed through a dense multiplication step to generate final attention outputs. This adaptive sparsity lowers computational costs significantly, enabling faster inference during aerial inspections of solar panels without substantial accuracy loss.
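
The select-then-attend idea can be sketched as follows. This is only an illustration under stated assumptions: the "predictor" here scores keys using the first `pred_dim` feature channels as a stand-in for a learned lightweight projection, and the function name `sparse_attention` and its parameters are hypothetical rather than the actual DSA module.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_attention(queries, keys, values, top_k, pred_dim=2):
    """Top-k sparse attention with a cheap, hypothetical key predictor."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        # 1) Cheap approximate scores decide which keys to keep (assumption:
        #    a truncated dot product stands in for the learned predictor).
        approx = [dot(q[:pred_dim], key[:pred_dim]) for key in keys]
        keep = sorted(range(len(keys)), key=lambda j: approx[j],
                      reverse=True)[:top_k]
        # 2) Exact scaled dot-product scores, computed only for kept keys.
        scores = [dot(q, keys[j]) / math.sqrt(d) for j in keep]
        weights = softmax(scores)
        # 3) Weighted sum of the kept value vectors.
        out = [sum(w * values[j][c] for w, j in zip(weights, keep))
               for c in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy data: one query, three keys, two-dimensional values.
toy_q = [[1.0, 0.0, 0.0]]
toy_k = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
toy_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(sparse_attention(toy_q, toy_k, toy_v, top_k=1))
```

Each query touches only `top_k` keys in steps 2 and 3, so the exact attention cost grows with \(n \cdot k\) instead of \(n^2\); the predictor in step 1 must be much cheaper than full attention for the saving to materialize.
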
Third, to address difficult-to-classify defect samples in solar panels, I replace the standard cross-entropy loss in DETR with Focal Loss. In aerial images, some defects, such as cracks or dirt spots, may be ambiguous or partially obscured, making them hard to classify. Focal Loss down-weights easy samples and focuses training on hard ones by scaling the cross-entropy term with a factor that shrinks as the predicted probability of the true class grows. For the binary classification case, Focal Loss is defined as:
$$\text{FL}(p_t) = -(1 - p_t)^\gamma \log(p_t)$$
where \(p_t\) is the model’s estimated probability for the true class, and \(\gamma\) is a focusing parameter (set to 2 in my experiments). For multi-class defect detection in solar panels, I extend this to a weighted sum over classes, emphasizing challenging samples and improving overall classification performance.
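
To make the down-weighting concrete, here is a small sketch of the formula with \(\gamma = 2\). The multi-class helper and its optional `class_weights` argument are one assumed reading of "weighted sum over classes"; the exact weighting used in the experiments is not specified here.

```python
import math


def focal_loss(p_t, gamma=2.0):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t) for true-class probability p_t."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def multiclass_focal_loss(probs, target, gamma=2.0, class_weights=None):
    """Focal loss for one sample given a class distribution and a target index.

    class_weights is a hypothetical per-class weighting (defaults to 1.0).
    """
    weight = 1.0 if class_weights is None else class_weights[target]
    return weight * focal_loss(probs[target], gamma)


# Compare an easy sample (p_t = 0.9) with a hard one (p_t = 0.3).
for p_t in (0.9, 0.3):
    ce = -math.log(p_t)
    print(f"p_t={p_t}: cross-entropy={ce:.4f}, focal={focal_loss(p_t):.4f}")
```

With \(\gamma = 2\), the easy sample's loss is scaled by \((1 - 0.9)^2 = 0.01\) while the hard sample's is scaled by \((1 - 0.3)^2 = 0.49\), so hard samples dominate the gradient. Setting `gamma=0` recovers plain cross-entropy.
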
The overall loss function for my improved DETR combines Focal Loss for classification and a smooth L1 loss for bounding box regression. The total loss \(L\) is:
$$L = \lambda_{\text{cls}} \sum_{i=1}^{N} \text{FL}(p_i) + \lambda_{\text{box}} \sum_{i=1}^{N} \mathbb{1}_{\{\text{object}\}} L_{\text{box}}(b_i, \hat{b}_i)$$
where \(N\) is the number of predictions, \(p_i\) is the predicted probability of the true class for prediction \(i\), \(b_i\) and \(\hat{b}_i\) are predicted and ground-truth boxes, the indicator \(\mathbb{1}_{\{\text{object}\}}\) restricts the box term to predictions matched to an actual defect, and \(\lambda_{\text{cls}}\) and \(\lambda_{\text{box}}\) are weighting factors. This combination balances the optimization of defect localization and categorization in solar panels.
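
The combined objective can be sketched as below. The weights `lambda_cls=1.0` and `lambda_box=5.0`, the `beta` threshold of the smooth L1 term, and the dictionary layout of each prediction are assumptions chosen for illustration; the values used in the experiments are not stated here.

```python
import math


def focal_loss(p_t, gamma=2.0):
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def smooth_l1(pred_box, gt_box, beta=1.0):
    """Smooth L1 loss summed over the box coordinates."""
    total = 0.0
    for p, g in zip(pred_box, gt_box):
        diff = abs(p - g)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total


def total_loss(predictions, lambda_cls=1.0, lambda_box=5.0):
    """predictions: dicts with keys p_t, is_object, pred_box, gt_box."""
    cls_term = sum(focal_loss(p["p_t"]) for p in predictions)
    # The indicator: only predictions matched to an object get a box loss.
    box_term = sum(smooth_l1(p["pred_box"], p["gt_box"])
                   for p in predictions if p["is_object"])
    return lambda_cls * cls_term + lambda_box * box_term


# One matched defect and one background prediction (toy values).
preds = [
    {"p_t": 0.5, "is_object": True,
     "pred_box": (0.5, 0.5, 0.2, 0.2), "gt_box": (0.5, 0.5, 0.2, 0.4)},
    {"p_t": 0.5, "is_object": False, "pred_box": None, "gt_box": None},
]
print(total_loss(preds))
```
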
To validate my method, I conducted extensive experiments using a custom dataset of solar panel aerial images. The dataset was collected via drone flights over solar installations, capturing infrared images that highlight thermal anomalies indicative of defects. After preprocessing and augmentation, the dataset comprised 1,200 images, split into 960 for training and 240 for validation. Defects were categorized into three types: diode failures, junction box faults, and surface cracks. Data augmentation techniques, including rotation, flipping, cropping, and blurring, were applied to increase diversity and robustness. The following table summarizes the dataset statistics:

| Defect Type | Description | Number of Instances | Typical Size (pixels) |
|---|---|---|---|
| Diode Failure | Small circular hot spots from faulty diodes | Approx. 300 | 10×10 to 20×20 |
| Junction Box Fault | Larger irregular hot areas from connection issues | Approx. 280 | 50×50 to 100×100 |
| Surface Crack | Linear or branched hot patterns from physical damage | Approx. 250 | 30×30 to 80×80 |

The experimental environment included a Windows 10 system with an Intel i9 processor, 32 GB RAM, and an NVIDIA RTX 3080 GPU (16 GB VRAM). The software stack used Python 3.9 and PyTorch. Training parameters were set as follows: input image size of 640×512 pixels, initial learning rate of \(10^{-4}\), weight decay of \(10^{-5}\), batch size of 2, and 300 training epochs. Evaluation metrics included precision \(P\) and mean average precision \(\text{mAP}\), defined as:
$$P = \frac{TP}{TP + FP}$$
$$\text{mAP} = \frac{1}{\beta} \sum_{i=1}^{\beta} AP_i$$
where \(TP\) and \(FP\) are true and false positives, \(\beta\) is the number of defect classes (here, \(\beta=3\)), and \(AP_i\) is the average precision for class \(i\). Higher mAP indicates better overall detection accuracy for solar panel defects.
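
Both metrics are simple to compute once detections have been matched to ground truth. The sketch below uses made-up counts and AP values purely for illustration, and the function names are my own.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); returns 0.0 when there are no detections."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def mean_average_precision(ap_per_class):
    """mAP = (1 / beta) * sum of AP_i over the beta defect classes."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical example: 45 correct detections and 5 false alarms.
print(precision(tp=45, fp=5))
# Hypothetical per-class AP values for beta = 3 classes.
print(mean_average_precision([0.9, 0.8, 0.7]))
```
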
During training, the loss curves showed stable convergence. The improved DETR achieved lower loss values compared to the baseline, particularly after 40 epochs, with stabilization around 200 epochs. The following table presents the results of ablation studies and comparisons with other methods, including YOLOv5 and the original DETR. Each variant adds components incrementally to demonstrate their impact:

| Method | Precision for Diode Fault (\(P_2\)) | Precision for Crack Fault (\(P_1\)) | Precision for Junction Box Fault (\(P_0\)) | Mean Average Precision (mAP) |
|---|---|---|---|---|
| YOLOv5 | 89.9% | 90.4% | 90.3% | 90.2% |
| Original DETR | 89.2% | 89.8% | 89.8% | 89.6% |
| DETR + RPE | 93.1% | 93.1% | 92.5% | 92.9% |
| DETR + RPE + DSA | 91.9% | 92.4% | 92.0% | 92.1% |
| DETR + RPE + DSA + Focal Loss (Proposed) | 94.3% | 95.0% | 94.8% | 94.7% |

The results indicate that relative position encoding (RPE) raised mAP by 3.3 percentage points over the original DETR, with a notable gain for small diode faults (a 3.9-point increase in \(P_2\)). Adding the DSA module lowered mAP slightly, by 0.8 points, but cut per-image inference time by approximately 20% thanks to sparse computation, a worthwhile trade-off for real-time solar panel inspections. Incorporating Focal Loss then raised mAP by a further 2.6 points, demonstrating its effectiveness on hard-to-classify samples. Overall, my improved DETR achieved an mAP of 94.7%, surpassing the original DETR by 5.1 points and YOLOv5 by 4.5 points.
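
The differences quoted above follow directly from the results table, and can be re-derived with a few lines of arithmetic. The dictionary below simply copies the table rows; the helper name `delta` is my own.

```python
# Columns: P_2 (diode), P_1 (crack), P_0 (junction box), mAP, all in percent.
table = {
    "YOLOv5": (89.9, 90.4, 90.3, 90.2),
    "Original DETR": (89.2, 89.8, 89.8, 89.6),
    "DETR + RPE": (93.1, 93.1, 92.5, 92.9),
    "DETR + RPE + DSA": (91.9, 92.4, 92.0, 92.1),
    "Proposed": (94.3, 95.0, 94.8, 94.7),
}
P2, MAP = 0, 3  # column indices


def delta(column, newer, older):
    """Difference in percentage points between two rows for one column."""
    return round(table[newer][column] - table[older][column], 1)


rpe_gain = delta(MAP, "DETR + RPE", "Original DETR")
rpe_gain_diode = delta(P2, "DETR + RPE", "Original DETR")
dsa_change = delta(MAP, "DETR + RPE + DSA", "DETR + RPE")
focal_gain = delta(MAP, "Proposed", "DETR + RPE + DSA")
overall_gain = delta(MAP, "Proposed", "Original DETR")

print(rpe_gain, rpe_gain_diode, dsa_change, focal_gain, overall_gain)
```
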
To analyze computational efficiency, I measured the inference time per image on the test set. The original DETR required about 120 ms per image, while my improved version with DSA reduced this to 95 ms (roughly 21% less time per image) without a significant drop in accuracy. This speedup is crucial for large-scale aerial surveys of solar farms, where thousands of images must be processed rapidly. The complexity reduction from DSA can be estimated as follows. Let \(n\) be the sequence length and \(k\) the average number of attended tokens per query after sparsification. Standard self-attention has complexity \(O(n^2 d)\), where \(d\) is the feature dimension. With DSA, complexity becomes \(O(n k d)\), where \(k \ll n\) in practice. For typical solar panel images, \(n\) is around 1600 (from feature maps) and DSA reduces \(k\) to about 200, yielding roughly \(n/k = 8\times\) fewer operations in the attention layers. The end-to-end gain is smaller than 8× because attention is only one part of the total inference cost.
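
The figures in this paragraph can be checked with a short calculation. The feature dimension `d = 256` is an assumption (it is DETR's default hidden size and cancels out of the ratio anyway), and `attention_ops` counts only the query-key score products, ignoring constants.

```python
def attention_ops(n, d, k=None):
    """Approximate multiply-accumulates for attention scores.

    Dense attention compares every query with all n keys; sparse attention
    compares each query with only k selected keys.
    """
    keys_per_query = n if k is None else k
    return n * keys_per_query * d


n, k, d = 1600, 200, 256  # d is an assumed value; n and k come from the text

dense_ops = attention_ops(n, d)
sparse_ops = attention_ops(n, d, k)
ops_ratio = dense_ops / sparse_ops

baseline_ms, improved_ms = 120.0, 95.0
latency_reduction = (baseline_ms - improved_ms) / baseline_ms
throughput_gain = baseline_ms / improved_ms

print(f"attention ops ratio: {ops_ratio:.1f}x")
print(f"latency reduction:   {latency_reduction:.1%}")
print(f"throughput gain:     {throughput_gain:.2f}x")
```
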
In conclusion, I have developed an enhanced DETR-based method for defect detection in solar panels that addresses key challenges in aerial imagery. By integrating relative position encoding, dynamic sparse attention, and Focal Loss, the model achieves higher accuracy for small targets, faster inference, and better handling of difficult samples. Experimental results on a custom drone-captured dataset confirm its effectiveness, with a mean average precision of 94.7%. This approach can facilitate automated inspection of solar panels, promoting efficient maintenance and sustainable energy production. Future work may explore multi-scale feature fusion and Transformer adaptations for other renewable energy infrastructure, further advancing the role of deep learning in solar panel monitoring.
