With the increasing global demand for electricity and the urgent need to mitigate environmental pollution caused by fossil fuels, solar energy has gained significant attention as a renewable and clean power source. In many regions, large-scale ground-mounted grid-connected solar power plants have been established, alongside numerous distributed generation systems installed on rooftops in rural areas. The efficiency of solar power generation systems is directly influenced by the condition of the solar panels. Harsh outdoor environments, such as desert and Gobi regions, expose panels to dust accumulation, bird droppings, and physical damage, which in turn lead to defects such as cracks and hot spots. These defects not only reduce energy conversion efficiency but also pose potential circuit safety risks. Therefore, timely and efficient fault diagnosis of solar panels is crucial for extending their lifespan and improving photoelectric conversion efficiency.
Current methods for solar panel defect detection include manual inspection, electrical characteristic-based detection, and computer vision-based approaches. Due to site limitations, manual inspection is often inefficient and struggles to promptly identify and address defects. Electrical characteristic-based detection involves installing sensors on solar panels to monitor operational parameters like current and voltage, enabling fault diagnosis through analysis. However, this method incurs high costs, lacks flexibility, and requires ongoing maintenance of sensors, adding to the overall expense. In contrast, computer vision-based detection has emerged as a mainstream approach due to its timeliness and cost-effectiveness.
Since 2012, the advent of convolutional neural networks (CNNs) has driven advancements in computer vision. Networks such as the YOLO series and Faster R-CNN have demonstrated strong performance in extracting local features through hierarchical feature aggregation. However, these methods generate a large number of redundant detection boxes, necessitating threshold filtering and non-maximum suppression (NMS) processing, which deviates from the ideal end-to-end detection paradigm. With the success of Transformer models in natural language processing, researchers have adapted them to computer vision, leading to the development of end-to-end object detection algorithms like DETR (Detection Transformer). This approach frames object detection as a set prediction problem, eliminating the need for excessive prediction boxes and post-processing steps like NMS. Despite its advantages, DETR suffers from slow training convergence and limited feature spatial resolution, prompting the development of numerous variants. For instance, Deformable DETR accelerates convergence by focusing on key sampling points, while Sparse DETR addresses computational complexity in the encoder through sparse queries. DINO, introduced as a state-of-the-art end-to-end object detector, incorporates contrastive denoising, mixed query selection, and a look-forward-twice scheme, establishing a new framework. The Baidu team further enhanced DINO to create RT-DETR (Real-Time Detection Transformer), which removes threshold filtering and NMS, achieving true end-to-end detection with higher training accuracy in fewer iterations.
In this study, we address challenges in solar panel defect detection, such as low accuracy, high model parameter counts, and issues with missed and false detections in complex backgrounds, by proposing an efficient algorithm based on the RT-DETR model. Our improvements focus on three key aspects: First, we enhance the backbone network by integrating attention mechanisms and efficient detection techniques to streamline the FasterNet architecture, achieving lightweight design. Using RepConv technology, we reparameterize partial convolutions (PConv) in the backbone and introduce the Efficient Multi-Scale Attention (EMA) mechanism to construct a lightweight multi-branch feature extraction module, thereby improving feature extraction capability and generalization. Second, we design a CRDFP (Contextual Reconstruction and Dynamic Fusion Pyramid) structure for feature fusion, which utilizes horizontal and vertical pooling to capture global context and combines pyramid context extraction modules to enhance multi-scale feature representation, thereby improving target recognition in complex backgrounds. Finally, we incorporate a deformable attention mechanism (Deformable Attention Transformer, DAT) in the neck encoder network, enabling dynamic selection of sampling points to concentrate on important regions, reduce computational load, and adapt to various visual tasks. These optimizations significantly enhance the model’s efficiency, accuracy, and generalization ability.

The RT-DETR model is an efficient real-time end-to-end object detector based on the Transformer architecture that simplifies the detection pipeline. It consists of three main components: a backbone (e.g., ResNet18), an efficient hybrid encoder comprising AIFI and CCFM, and a Transformer decoder with auxiliary prediction heads. As a significant innovation in object detection, RT-DETR eliminates the NMS post-processing step used in traditional detectors by optimizing the Transformer architecture, maintaining high precision and speed in real-time applications.
To tackle issues such as small detection targets, background interference, and light reflections in solar panel images, we improve the original network in three ways. First, in the backbone section, we combine RepConv with FasterNet’s partial convolutions and integrate the EMA attention mechanism to replace the BasicBlock, enhancing computational efficiency. Second, we design the CRDFP multi-scale fusion structure in the encoder network, which improves model recognition capability through dynamic interpolation and multi-feature fusion. Finally, we enhance the AIFI module using a deformable attention mechanism, enabling deeper understanding of target details and contextual information to boost accuracy. The improved network model, termed FCD-DETR, is illustrated in the network diagram.
In convolutional neural networks (CNNs), the BasicBlock is a common building unit widely used in deep network designs, particularly in residual networks (ResNet), where residual connections help mitigate gradient vanishing issues during deep network training. However, despite its effectiveness in many visual tasks, BasicBlock has limitations in fine-grained feature extraction, especially for solar panel defect detection. Traditional BasicBlock modules struggle to capture subtle local features and spatial information, particularly when dealing with fine cracks or local defects on solar panels. The core issue lies in the convolution design, where BasicBlock fails to effectively capture complex local features, leading to reduced accuracy in fine-grained detection.
To address this, we propose an improved FREBlock module. This module optimizes the convolution design in BasicBlock, allowing the network to more accurately capture fine defect features. FREBlock uses FasterNet as the backbone, applies structural reparameterization (Rep) to fuse partial convolutions, and incorporates EMA, resulting in FREConv. This enhances the network’s sensitivity to local features and reduces interference from global background noise. This design significantly improves the model’s performance in fine-grained defect detection. Additionally, FREBlock boosts computational efficiency by reducing redundant calculations and effectively compressing model parameters, ensuring efficient operation in real-time applications. These improvements not only enhance detection accuracy but also guarantee practicality and responsiveness in real-world deployments, providing a more efficient and precise solution for solar panel defect detection tasks.
In solar panel images, environmental and background constraints lead to substantial redundant computations during feature extraction, resulting in low computational efficiency and high floating-point operations. Traditional convolutional networks with fixed kernels have clear limitations. To solve this, various approaches have been explored, such as group convolution (GConv) and MicroNet. However, the former often faces increased memory access demands, requiring higher hardware performance when reducing FLOPs, while the latter further decomposes the network to reduce computations but suffers from low computational efficiency and performance degradation. Both methods introduce additional data operations, further prolonging runtime.
The FasterNet architecture introduces a simple, fast, and effective operator, PConv (Partial Convolution), designed to reduce computational redundancy and memory access. Partial convolution applies a standard convolution only to a subset of the input channels to extract features, leaving the remaining channels unchanged. Specifically, it takes the first or last consecutive $$c_p$$ channels as representative of the whole feature map for computation. Without loss of generality, we assume the input and output feature maps have the same number of channels. The FLOPs of PConv are given by:
$$FLOPs = h \times w \times k^2 \times c_p^2$$
With a typical partial ratio of $$c_p / c = 1/4$$, the FLOPs of PConv are only 1/16 of those of a standard convolution, and the memory access of PConv is approximately:
$$h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p$$
where $$h$$ and $$w$$ are the spatial dimensions of the feature map, $$k$$ is the convolution kernel size, and $$c_p$$ is the number of channels processed by the partial convolution.
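For illustration, a PConv layer can be sketched in PyTorch as follows; the module name, the channel ratio, and the choice of splitting off the first channels are assumptions made for this example rather than the exact FasterNet code.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Minimal sketch of a FasterNet-style partial convolution (PConv).

    Only the first c_p = channels // ratio channels pass through a k x k
    convolution; the remaining channels are forwarded untouched.
    """

    def __init__(self, channels: int, kernel_size: int = 3, ratio: int = 4):
        super().__init__()
        self.c_p = channels // ratio  # channels actually convolved
        self.conv = nn.Conv2d(self.c_p, self.c_p, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.c_p, x.shape[1] - self.c_p], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)

# FLOPs scale with c_p^2, so a ratio of 4 gives roughly (1/4)^2 = 1/16 of a full conv.
x = torch.randn(1, 64, 32, 32)
print(PartialConv(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```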
However, PConv has limitations in complex natural scene image classification. Therefore, we employ structural reparameterization (Rep) to enhance convolutional network performance, resulting in the RPConv module. Reparameterization primarily fuses 1×1 and 3×3 convolutions (Conv) with batch normalization (BN) layers to reduce computational load. The BN formula is expressed as:
$$\hat{x}_i = \gamma \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta = \frac{\gamma}{\sqrt{\sigma^2 + \varepsilon}} \cdot x_i + \left( \beta - \frac{\gamma \cdot \mu}{\sqrt{\sigma^2 + \varepsilon}} \right)$$
Here, $$\varepsilon$$ is a stability term, $$\mu$$ is the mean statistic of input features, $$\sigma^2$$ is the variance statistic, $$\gamma$$ is the BN layer’s scaling parameter controlling feature variance, and $$\beta$$ is the BN layer’s shift parameter controlling feature mean position. This can be viewed as $$y = w x + b$$, where $$w_{BN} = \frac{\gamma}{\sqrt{\sigma^2 + \varepsilon}}$$ and $$b_{BN} = \beta - \frac{\gamma \cdot \mu}{\sqrt{\sigma^2 + \varepsilon}}$$. The fusion of Conv and BN is then:
$$\hat{x} = w_{BN} \cdot (w_{conv} \cdot x + b_{conv}) + b_{BN} = (w_{BN} \cdot w_{conv}) \cdot x + (w_{BN} \cdot b_{conv} + b_{BN})$$
Thus, the merged convolution parameters are:
$$\begin{cases} w = w_{BN} \cdot w_{conv} \\ b = w_{BN} \cdot b_{conv} + b_{BN} \end{cases}$$
Here, $$w$$ and $$b$$ are the merged convolution weight and bias, respectively. During training, multi-branch convolutional layers are used, and during inference, branch parameters are reparameterized into the main branch to reduce computation and memory consumption.
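As an illustration of this merging step, the following PyTorch sketch folds a BatchNorm layer into its preceding convolution using the weight and bias expressions above; it is a generic reparameterization helper, not the exact RepConv implementation used in our network.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding convolution.

    Implements w = w_BN * w_conv and b = w_BN * b_conv + b_BN from the text.
    """
    w_bn = bn.weight / torch.sqrt(bn.running_var + bn.eps)                      # gamma / sqrt(var + eps)
    b_bn = bn.bias - bn.weight * bn.running_mean / torch.sqrt(bn.running_var + bn.eps)

    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    fused.weight.copy_(conv.weight * w_bn.reshape(-1, 1, 1, 1))
    b_conv = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels,
                                                                 device=conv.weight.device)
    fused.bias.copy_(w_bn * b_conv + b_bn)
    return fused
```

At inference time, each auxiliary branch can be fused this way and its parameters folded into the main branch, which is what allows the multi-branch training structure to collapse into a single convolution.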
Due to variations in environment, altitude, and other factors, target objects in images may differ in size and level of detail. We therefore introduce the Efficient Multi-Scale Attention (EMA) mechanism, which preserves channel information while keeping computational cost low. EMA first performs cross-spatial learning: two parallel sub-networks (X Avg Pool and Y Avg Pool) apply one-dimensional global pooling along the horizontal and vertical directions, and the resulting features are concatenated and fused. The outputs of the branches are then aggregated through matrix dot-product operations to generate spatial attention maps, and the feature maps within each group are reweighted by Sigmoid functions applied to the two attention weights. These dot-product operations capture pixel-level pairwise relationships and highlight global context across all pixels, emphasizing regions containing target objects while suppressing irrelevant background information. Grouping the features also allows computation to be distributed more evenly across GPU resources: for any input feature map, EMA divides the channels into G subgroups, each learning different semantics, reducing the computational complexity from $$O(C^2 \times H \times W)$$ to $$O(C^2 / G \times H \times W)$$, where G is the number of groups. This grouping strengthens feature learning within semantic regions and suppresses noise. Integrating the EMA module into the RPConv workflow allows the model to allocate computational resources rationally, distributing attention according to importance rather than spreading it uniformly over all feature regions, so that the added attention does not noticeably burden the model. This ensures model performance without significantly increasing computational and storage demands. Through its multi-branch structure during training, the module learns rich image features such as the textures and shapes of different objects and fuses them during inference, improving classification accuracy across object categories.
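The following simplified PyTorch sketch illustrates only the grouping and directional-pooling idea behind EMA; the full published module additionally contains a 3×3 branch and the cross-spatial matrix dot-product step, and the module name and group count here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DirectionalGroupAttention(nn.Module):
    """Simplified sketch of EMA's grouping + 1D pooling step: split channels into
    G groups, pool along H and W separately to obtain two axis descriptors, and
    use them to re-weight the group (not the complete EMA module).
    """

    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0
        self.g = groups
        self.conv1x1 = nn.Conv2d(channels // groups, channels // groups, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        xg = x.reshape(b * self.g, c // self.g, h, w)           # process each group independently
        pool_h = xg.mean(dim=3, keepdim=True)                   # X Avg Pool: (b*g, c/g, h, 1)
        pool_w = xg.mean(dim=2, keepdim=True).transpose(2, 3)   # Y Avg Pool: (b*g, c/g, w, 1)
        y = self.conv1x1(torch.cat([pool_h, pool_w], dim=2))    # fuse the two axis descriptors
        a_h, a_w = torch.split(y, [h, w], dim=2)
        out = xg * torch.sigmoid(a_h) * torch.sigmoid(a_w.transpose(2, 3))
        return out.reshape(b, c, h, w)
```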
Since most solar panel defect images are captured outdoors with complex and variable environments, factors like background and lighting pose significant challenges for defect detection. To enhance the model’s adaptability to these defect targets, especially small object detection, more efficient encoder modules are needed to fuse features from different levels. The original CCFM module in the network fuses three layers of semantic information extracted by the backbone network through top-down and bottom-up approaches. While this somewhat improves the model’s adaptability to multi-scale features, issues like missed detections and misalignments persist in complex environments. This is because different levels of semantic features have varying receptive fields, and simple fusion may cause conflicts between positional and boundary details in shallow features and deep feature information.
To address this, we propose an improved feature fusion strategy inspired by the Contextual Guided Spatial Feature Reconstruction Network (CGRSeg), incorporating the Rectangular Self-Calibration Module (RCM). RCM uses addition to model key regions and designs shape self-calibration to calibrate rectangular attention, enabling the model to focus more on foreground features. Building on this, we introduce the Pyramid Context Extraction Module (PCE), which captures rich semantic information through multi-scale pyramid pooling aggregation (Parallelized Patch-Aware Attention Module, PPA), enhancing the model’s contextual awareness. Additionally, we combine Dynamic Interpolation Fusion (DIF) and Multi-Fusion Block (FBM) strategies to form the CRDFP (Contextual Reconstruction and Dynamic Fusion Pyramid) structure. The DIF module dynamically adjusts feature maps at different scales, using bilinear interpolation to fuse multi-scale information and adaptively handle targets of varying sizes and shapes. The FBM module further improves the model’s ability to recognize targets in complex backgrounds through multiple feature processing and fusion steps.
To better extract image features, the RCM module we adopt consists of Rectangular Self-Calibration Attention (RCA), batch normalization (BN), and a multi-layer perceptron (MLP). RCA employs horizontal and vertical pooling to capture global context along the two axial directions, generating two axis vectors. By applying broadcast addition to these vectors, RCA effectively models rectangular regions of interest. Because RCA uses large-kernel strip convolutions, which have high parameter counts and can lead to unstable gradients, we incorporate BN layers to normalize the features, ensuring stable feature distributions and better learning and convergence. The MLP refines the features, and the residual connection at the end of the module further enhances feature reuse. Compared with traditional fully connected layers, the MLP is more lightweight, meeting the needs of efficient detection. This module improves the model’s contextual awareness, thereby enhancing feature extraction accuracy and robustness, allowing it to better handle challenges such as varying lighting, viewing angles, and background changes, with strong adaptability and expressive power.
The RCM module is integrated through PPA to fuse feature maps at different scales, resulting in the PCE section. RCA within RCM captures global contextual information of the image and models regions of interest via a broadcast addition mechanism, enhancing spatial contextual awareness. This process aims to optimize regions of interest in the image, bringing them closer to foreground objects. It decouples the calibration process using two types of strip convolution kernels, improving adaptability to features of different shapes. First, horizontal strip convolution calibrates shapes in the horizontal direction, adjusting elements in each row to align more closely with foreground regions. Then, batch normalization normalizes the extracted feature maps, stabilizing training and enhancing network robustness. The ReLU activation function is introduced to increase non-linear mapping capability, strengthening the model’s ability to express complex features. Next, vertical strip convolution adjusts shapes in the vertical direction, enabling the model to handle various shape transformations. Through this decoupled convolution approach, the module adaptively processes features in different directions and shapes, improving detection and recognition of objects with diverse forms. Finally, the calibrated feature maps are further enhanced via MLP to optimize image feature representation, boosting model performance in complex backgrounds.
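A minimal sketch of the rectangular self-calibration idea described above, assuming depthwise strip convolutions, a kernel size of 11, and a final sigmoid gating (details not specified in the text), could look like this:

```python
import torch
import torch.nn as nn

class RectangularSelfCalibration(nn.Module):
    """Sketch of the RCA idea: axial pooling plus broadcast addition models a
    rectangular region of interest; decoupled strip convolutions (horizontal 1xk,
    then vertical kx1) calibrate its shape before gating the input.
    """

    def __init__(self, channels: int, k: int = 11):
        super().__init__()
        self.h_strip = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels)
        self.v_strip = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Axial global pooling gives one vector per direction; broadcast addition
        # recombines them into a coarse rectangular attention map.
        attn = x.mean(dim=3, keepdim=True) + x.mean(dim=2, keepdim=True)
        attn = self.act(self.bn(self.h_strip(attn)))   # calibrate rows
        attn = self.v_strip(attn)                      # calibrate columns
        return x * torch.sigmoid(attn)
```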
In solar panel defect detection, feature maps at different levels contain information at various scales. High-level feature maps typically carry more semantic information, while low-level feature maps contain finer details. To effectively extract contextual information from images, especially for multi-scale feature fusion, the primary fusion modules in our structure are DIF and FBM.
The DIF module further fuses feature maps from different scales, addressing challenges in cross-scale feature fusion. By dynamically adjusting the channel numbers and spatial dimensions of feature maps at different scales, it achieves efficient feature fusion. In this module, after resizing $$X_2$$ to the same spatial dimensions as $$X_1$$ via bilinear interpolation, convolutional layers process the features, yielding fused feature maps. This method effectively integrates information from different scales, enhancing multi-scale feature representation.
The FBM module uses convolutional layers to process low-frequency and high-frequency feature maps separately, adjusts high-frequency features via activation functions, and finally combines the two feature maps. Low-frequency features provide global semantic information, while high-frequency features enhance detail and local feature recognition. This module adjusts the spatial dimensions of high-frequency features through bilinear interpolation, enabling effective fusion with low-frequency features, further improving the accuracy and robustness of solar panel defect detection.
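The DIF resizing-and-fusion step described above can be sketched as follows; the channel-alignment 1×1 convolution and fusion by addition are illustrative assumptions rather than the exact CRDFP implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicInterpolationFusion(nn.Module):
    """Sketch of the DIF step: resize the coarser map x2 to x1's spatial size with
    bilinear interpolation, align channels with a 1x1 convolution, and fuse.
    """

    def __init__(self, c1: int, c2: int):
        super().__init__()
        self.align = nn.Conv2d(c2, c1, kernel_size=1)
        self.fuse = nn.Conv2d(c1, c1, kernel_size=3, padding=1)

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        x2 = F.interpolate(x2, size=x1.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(x1 + self.align(x2))
```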
This multi-module feature fusion approach not only enhances defect detection accuracy under varying lighting and background conditions but also maintains high computational efficiency. By working synergistically, these modules strengthen the semantic expression and detail capture capabilities of feature maps, providing a more robust and precise solution for solar panel defect detection.
In the original Transformer-based AIFI module, the standard self-attention mechanism fails to effectively handle internal scale interactions of low-level features, severely limiting model performance in complex scenarios of solar panel defect detection, such as background interference and light reflections. To address background interference issues, we employ an improved D-AIFI module that introduces a deformable attention mechanism to overcome this limitation. Traditional Transformer models use basic multi-headed self-attention (MHSA), which processes all pixels in the image, leading to high computational load and low efficiency. For MHSA, the formulation is:
$$q = x W_q, \quad k = x W_k, \quad v = x W_v$$
$$z^{(m)} = \sigma \left( \frac{q^{(m)} k^{(m)\top}}{\sqrt{d}} \right) v^{(m)}, \quad m = 1, \dots, M$$
$$z = \text{Concat}(z^{(1)}, \dots, z^{(M)}) W_o$$
where $$\sigma$$ denotes the softmax function, $$z^{(m)}$$ represents the embedded output of the m-th attention head, and $$q^{(m)}$$, $$k^{(m)}$$, $$v^{(m)}$$ are the query, key, and value embeddings, respectively, with $$W_q$$, $$W_k$$, $$W_v$$ as projection matrices.
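For reference, a compact implementation of these MHSA formulas might look like the following sketch; the tensor shapes and head-splitting convention are generic choices and not tied to the RT-DETR code.

```python
import torch

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Plain multi-head self-attention matching the q/k/v formulas above.
    x: (B, N, C); the four projection matrices are (C, C) tensors.
    """
    B, N, C = x.shape
    d = C // num_heads
    q = (x @ w_q).view(B, N, num_heads, d).transpose(1, 2)     # (B, M, N, d)
    k = (x @ w_k).view(B, N, num_heads, d).transpose(1, 2)
    v = (x @ w_v).view(B, N, num_heads, d).transpose(1, 2)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    z = (attn @ v).transpose(1, 2).reshape(B, N, C)             # concatenate heads
    return z @ w_o
```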
By introducing DAT, which attends only to key regions of the image, the computational load is reduced while high detection performance is maintained. As shown in the deformable attention model, the input feature map is used to generate dynamically selected important sampling points instead of processing the entire image with a fixed pattern, allowing the model to concentrate on the regions most critical to the task. For each reference point in the image, the feature map is first linearly projected to obtain the query features q, which are fed into a lightweight sub-network to generate the offsets $$\theta_{\text{offset}}$$. In this sub-network, the input features pass through a depthwise convolution to capture local features, followed by a GELU activation function and a 1×1 convolution that produces the 2D offsets.
When the feature maps are input, a predefined scaling factor bounds the offset amplitude to prevent excessive offsets and stabilize training. Features are then sampled at the deformed point positions to serve as keys and values, and projected via:
$$q = x W_q, \quad \tilde{k} = \tilde{x} W_k, \quad \tilde{v} = \tilde{x} W_v$$
with
$$\Delta p = \theta_{\text{offset}}(q), \quad \tilde{x} = \phi(x; p + \Delta p)$$
Here, $$\tilde{k}$$ and $$\tilde{v}$$ represent the deformed key and value embeddings. Specifically, the sampling function $$\phi$$ is set to bilinear interpolation to ensure differentiability:
$$\phi(z; (p_x, p_y)) = \sum_{(r_x, r_y)} g(p_x, r_x) g(p_y, r_y) z[r_y, r_x, :]$$
The summation in this formula nominally runs over all integer positions $$(r_x, r_y)$$, but the bilinear weight $$g(a, b) = \max(0, 1 - |a - b|)$$ is non-zero only at the four integer points nearest to the sampling location, so the result reduces to a weighted average of those four values. Multi-head attention is then applied to q, k, and v, with relative position offsets incorporated. The attention head formula is:
$$z^{(m)} = \sigma \left( \frac{q^{(m)} \tilde{k}^{(m)\top}}{\sqrt{d}} + \phi(\hat{p}; R) \right) \tilde{v}^{(m)}$$
Here, $$\phi(\hat{p}; R)$$ acts as a relative position embedding adapted to the deformed sampling points. The outputs of all heads are concatenated and projected to yield the final output $$z$$. The relative position bias computed from the deformed points enhances the feature transformation capability of the multi-head attention mechanism, making it focus more closely on the key information in the feature maps.
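To make the offset path concrete, the following PyTorch sketch shows how offsets could be predicted and used for deformed bilinear sampling as described above; the depthwise kernel size, offset scale, and grid construction are illustrative assumptions rather than the exact DAT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformedSampling(nn.Module):
    """Sketch of the DAT-style offset path: a depthwise conv, GELU, and a 1x1 conv
    predict 2D offsets; offsets are bounded by a scale factor and used to sample
    the feature map bilinearly (the differentiable phi(.) above) via grid_sample.
    """

    def __init__(self, channels: int, offset_scale: float = 2.0):
        super().__init__()
        self.offset_net = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2, groups=channels),  # depthwise, local context
            nn.GELU(),
            nn.Conv2d(channels, 2, 1),                                     # 2D offset (dx, dy) per point
        )
        self.scale = offset_scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: feature map (B, C, H, W); returns features sampled at deformed points,
        which would then be projected into the deformed keys and values."""
        B, C, H, W = x.shape
        offsets = torch.tanh(self.offset_net(x)) * self.scale              # bounded offsets, (B, 2, H, W)
        # Reference grid in the normalized [-1, 1] coordinates expected by grid_sample.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
        ref = torch.stack((xs, ys), dim=-1).to(x)                          # (H, W, 2)
        step = torch.tensor([W, H], dtype=x.dtype, device=x.device)
        grid = ref + offsets.permute(0, 2, 3, 1) * 2 / step                # pixel offsets -> normalized
        # Bilinear sampling keeps the operation differentiable.
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```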
In the current field of solar panel defect detection, due to the scarcity of professional and publicly available datasets, our dataset comprises images collected from public sources such as PaddlePaddle datasets, Roboflow, and PV-HSD-2025, totaling 8312 original image samples. After filtering, 4271 images containing three types of defects—hot spots, dirt, and damage—were selected, meeting the classification requirements for practical applications.
During dataset partitioning, the images were split into training, validation, and test sets in an 8:1:1 ratio. In the training phase, input images were resized to 640×640 pixels, with a total of 200 training epochs, a batch size of 8, and 4 worker threads. Other parameters were set to default values.
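For reference, the reported training settings can be summarized as a plain configuration dictionary; only the listed values come from the text, and the remaining hyperparameters are left at the framework's defaults.

```python
# Sketch of the training configuration used in our experiments (values from the text).
train_cfg = {
    "imgsz": 640,        # input images resized to 640 x 640
    "epochs": 200,       # total training epochs
    "batch_size": 8,
    "workers": 4,        # data-loading worker threads
    # all other hyperparameters kept at default values
}
```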
The experimental environment configuration is as follows:
Parameter | Configuration |
---|---|
Operating System | Windows 10 |
GPU | NVIDIA GeForce RTX 4070 Super |
CPU | Intel(R) Core(TM) i5-13400F |
Memory | 12 GB |
Python Version | 3.9 |
Software Framework | PyTorch |
In this experiment, we used several key performance metrics for analysis, including precision (P), recall (R), giga floating-point operations (GFLOPs), number of parameters (Params), and mean average precision (mAP).
Precision P is the proportion of actual positive samples among all positive samples predicted by the model, i.e., “precision rate”. Recall R measures the proportion of actual positive samples that are correctly predicted by the model, i.e., “recall rate”. Higher values indicate better prediction accuracy and coverage of positive samples. The formulas are:
$$P = \frac{N_{TP}}{N_{TP} + N_{FP}}$$
$$R = \frac{N_{TP}}{N_{TP} + N_{FN}}$$
The number of model parameters (Params) is the total parameter count, with higher values indicating higher spatial complexity. Giga floating-point operations (GFLOPs) measure the model's computational complexity, with higher values indicating greater computational cost.
mAP is the average of average precision (AP) across multiple categories, with higher values indicating better model performance. AP is the area under the precision-recall curve, defined as:
$$AP = \int_0^1 P(R) dR$$
$$mAP = \frac{1}{n} \sum_{j=1}^n AP_j$$
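As a concrete reference, AP can be computed from a sampled precision-recall curve with the standard all-point interpolation, as in the following sketch (not the exact evaluation code used in our experiments):

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve (the AP integral above),
    using the standard all-point interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # monotonically decreasing precision envelope
    idx = np.where(r[1:] != r[:-1])[0]            # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is then the mean of the per-class AP values: np.mean([ap_1, ap_2, ..., ap_n])
```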
We conducted ablation experiments to validate the impact of each improvement on the model. The results are shown in the table below.
Exp. | Basic | FREBlock | CRDFP | DAIFI | P/% | R/% | GFLOPs | Param/M | mAP/% |
---|---|---|---|---|---|---|---|---|---|
1 | √ | | | | 76.6 | 68.3 | 58.3 | 20.8 | 75.6 |
2 | √ | √ | | | 78.1 | 70.6 | 52.0 | 17.1 | 76.7 |
3 | √ | | √ | | 78.4 | 71.2 | 48.6 | 19.3 | 77.3 |
4 | √ | | | √ | 79.8 | 71.9 | 58.5 | 20.0 | 77.5 |
5 | √ | √ | √ | | 81.6 | 73.4 | 43.3 | 16.4 | 78.5 |
6 | √ | √ | | √ | 81.5 | 73.7 | 49.1 | 18.6 | 78.7 |
7 | √ | √ | √ | √ | 82.3 | 74.2 | 43.2 | 16.1 | 79.2 |
From the experimental results, Experiment 1 used the unmodified baseline RT-DETR model, achieving an mAP of 75.6%. Experiment 2 replaced the original ResNet structure with the FREBlock backbone, reducing parameters and computations by 17.8% and 10.8%, respectively, and increasing mAP to 76.7%, a 1.1 percentage point improvement, achieving higher detection accuracy with fewer parameters. Experiments 3 and 4 introduced CRDFP and DAIFI separately, showing slight parameter reductions compared to the baseline but improving detection accuracy by 1.7 and 1.9 percentage points, respectively. Experiments 5 and 6 combined the improved backbone with CRDFP and DAIFI, respectively, resulting in parameter reductions and accuracy improvements of 2.9 and 3.1 percentage points. The final Experiment 7, incorporating FREBlock backbone, CRDFP hybrid encoder, and DAIFI, demonstrated cumulative effects, raising mAP to 79.2%, a total increase of 3.6 percentage points. These results indicate that the improved model exhibits efficient detection performance and potential for practical application.
To objectively verify the algorithm’s performance improvement and demonstrate its superiority over other mainstream algorithms, we compared it with several state-of-the-art models, including Faster R-CNN, YOLOv5m, YOLOv5l, YOLOv8m, YOLOv8l, YOLOv10m, YOLOv11m, YOLOv12m, Deformable DETR, and RT-DETR. The results are summarized in the table below.
Model | P/% | R/% | Param/M | GFLOPs | mAP@0.5/% |
---|---|---|---|---|---|
Faster R-CNN | 63.9 | 58.4 | 137.1 | 303.1 | 63.6 |
YOLOv5m | 71.9 | 63.1 | 21.3 | 64.1 | 71.2 |
YOLOv5l | 72.8 | 64.5 | 46.2 | 78.7 | 72.0 |
YOLOv8m | 74.5 | 67.2 | 25.8 | 59.0 | 74.3 |
YOLOv8l | 75.2 | 68.9 | 43.4 | 91.6 | 75.2 |
YOLOv10m | 75.4 | 69.3 | 17.2 | 63.6 | 75.3 |
YOLOv11m | 75.7 | 69.6 | 20.6 | 67.7 | 75.5 |
YOLOv12m | 75.9 | 69.6 | 20.1 | 68.2 | 74.9 |
Deformable DETR | 74.6 | 68.4 | 39.8 | 196.1 | 75.1 |
RT-DETR | 75.9 | 69.8 | 20.8 | 58.2 | 75.6 |
FCD-DETR | 82.3 | 74.2 | 16.1 | 43.2 | 79.2 |
The table shows significant differences in performance and efficiency among the object detection models, with our improved algorithm achieving the highest accuracy. Its superior precision (82.3%) and recall (74.2%) indicate strong detection accuracy and fewer missed detections. Traditional methods like Faster R-CNN lag significantly in all metrics, highlighting the advances made by modern architectures. In terms of computational efficiency, our model achieves these results with lower computational cost and fewer parameters, balancing detection speed and accuracy. In contrast, while the latest YOLO series and RT-DETR show decent mAP, their computational complexity limits their applicability in hardware-constrained scenarios. Leveraging the Transformer architecture for efficient feature extraction and global modeling, our model achieves excellent performance without excessive resource consumption. Visual comparisons show that the improved model significantly enhances accuracy, while the original model exhibits missed and false detections, confirming that FCD-DETR performs well for solar panel defect detection tasks.
To validate the generalization capability of our proposed FCD-DETR algorithm, we used the publicly available target detection dataset PVEL-AD, jointly released by Hebei University of Technology and Beihang University. This dataset contains 36,543 near-infrared images, including one class of non-defective images and 12 different classes of defective images. We used 3,812 images from this dataset, covering seven defect types: black core, crack, horizontal dislocation, finger, short circuit, star crack, and thick line.
We compared the improved FCD-DETR algorithm with the original RT-DETR algorithm on this dataset. The performance comparison is shown in the table below.
Model | P/% | R/% | mAP@0.5/% | Param/M |
---|---|---|---|---|
RT-DETR | 66.6 | 69.7 | 74.4 | 20.8 |
FCD-DETR | 76.7 | 72.9 | 80.8 | 16.1 |
The experimental results show that FCD-DETR effectively reduces model parameters and computation while substantially improving accuracy across the defect categories. This not only validates the effectiveness of the proposed algorithm but also demonstrates its generalization capability in practical applications. To further substantiate the generalization experiment, we visualized detection results for typical defect categories in the dataset. The comparisons clearly show that, compared with the original model, our model focuses better on key regions in complex backgrounds and detects multiple overlapping samples more reliably, with fewer missed detections and false alarms, verifying its effectiveness in detail capture.
In conclusion, addressing challenges in solar panel defect detection such as complex background interference, low accuracy in small object detection, and the low computational efficiency of traditional models, we propose an efficient detection algorithm, FCD-DETR, based on the RT-DETR model that eliminates the need for NMS post-processing. By integrating the RepConv module to simplify the backbone network and incorporating the EMA attention mechanism to enhance feature extraction, designing the CRDFP multi-scale feature fusion strategy to improve contextual awareness, and introducing the deformable attention mechanism (DAT) to optimize local detail capture, we significantly improve the model’s detection performance in complex scenarios. Experimental results show that the improved model achieves a mean average precision (mAP) of 79.2% on our custom dataset, while reducing the number of parameters by 22.6% and the computational load by 25.9%, giving it dual advantages in accuracy and efficiency. Generalization experiments on the public dataset PVEL-AD further demonstrate its robustness and practicality, providing reliable technical support for the engineering application of solar panel defect detection. The results of this study offer effective technical means for the efficient maintenance of solar panels, demonstrating engineering application value and potential for wider adoption. Future work will explore more lightweight models to reduce the hardware requirements of DETR in practical engineering.