A Lightweight and Efficient Detection Model for Foreign Objects and Defects on Solar Panels

The transition towards sustainable energy sources is a paramount objective in the global effort to address climate change. Solar energy, harnessed through photovoltaic technology, stands as a cornerstone of this transition. Solar panels, the fundamental components of photovoltaic systems, directly determine the efficiency, longevity, and safety of power generation. However, throughout their lifecycle—encompassing manufacturing, transportation, installation, and long-term operation in diverse environments—solar panels are susceptible to the accumulation of foreign objects (e.g., dust, snow, bird droppings) and the development of intrinsic defects (e.g., cracks, hot spots, electrical damage). These issues can severely degrade performance; for instance, even thin layers of dust can reduce power output by 10-20%, while defects like hot spots pose significant fire risks, leading to substantial economic losses and safety hazards for large-scale solar farms. Therefore, developing automated, accurate, and efficient inspection systems for solar panels is critically important for ensuring the health and profitability of solar energy assets.

Traditional inspection of solar panels often relies on manual visual checks, which are labor-intensive and subjective, or on basic image processing techniques that do not scale to large deployments. While initial approaches using clustering in color spaces or handcrafted feature extraction showed promise, they struggle with the complex and varied appearance of defects on solar panels under different lighting and weather conditions. The advent of deep learning, particularly Convolutional Neural Networks (CNNs), has revolutionized visual inspection by providing models capable of learning robust feature representations directly from data. Among deep learning-based object detectors, the YOLO (You Only Look Once) family is renowned for its excellent balance between speed and accuracy, making it highly suitable for real-time or near-real-time industrial inspection tasks.

YOLOv11n, as a recent and lightweight member of this series, offers a solid baseline. However, our investigation and preliminary experiments revealed that its application to the specific domain of solar panel inspection faces several challenges when aiming for optimal deployment on edge devices or in resource-constrained environments. The primary issues we identified include: (1) Suboptimal feature extraction for diverse defects: The standard convolutional modules may not efficiently capture the multi-scale and intricate texture patterns characteristic of small cracks, bird droppings, or discolored patches on solar panels. (2) High computational footprint: While designed to be lightweight, the baseline model’s parameter count and floating-point operations (FLOPs) could be further reduced to enable faster inference on low-power hardware commonly used in drone-based or fixed-camera inspection systems. (3) Limited feature refinement before fusion: The neck network, responsible for fusing features from different depths, could benefit from mechanisms that emphasize the most informative channels, especially for distinguishing subtle defects from complex backgrounds like soiled or patterned solar panels. (4) Inefficient bounding box regression: The default Complete Intersection over Union (CIoU) loss, while effective, can have slow convergence and suboptimal performance for objects with extreme aspect ratios or small sizes, which are common in solar panel defect imagery.

To overcome these limitations, we propose a comprehensive and efficient enhancement to the YOLOv11n architecture, termed FESI-YOLOv11n. The name encapsulates our four core innovations: F (Faster_Block_EMA integration), E (Efficient detection head), S (SEAttention mechanism), and I (Inner_DIoU loss). Our goal is to simultaneously boost detection accuracy for foreign objects and defects on solar panels while significantly reducing the model’s computational complexity, creating a solution that is both high-performing and practical for field deployment.

1. The FESI-YOLOv11n Architecture: A Detailed Breakdown

The overall architecture of our proposed FESI-YOLOv11n model follows the proven backbone-neck-head paradigm but incorporates targeted modifications at each stage. We systematically replaced key components to enhance feature extraction, reduce redundancy, and improve optimization.

1.1 Backbone Enhancement with C3k2_Faster_EMA

The backbone is responsible for extracting hierarchical features from the input image. The original YOLOv11n uses C3k2 modules, which employ a bottleneck structure with two 3×3 convolutions. While effective, we sought a module that better balances representational power and efficiency for the textures found on solar panels.

Our solution is the C3k2_Faster_EMA module. It synergistically combines the principles of the lightweight FasterNet Block and the multi-scale Efficient Multi-scale Attention (EMA) mechanism. The core innovation lies in the Faster_Block_EMA unit, which replaces the standard bottleneck convolutions. This unit first applies a Partial Convolution (PConv). Unlike standard convolution, PConv performs spatial feature extraction only on a subset of input channels, leaving the others unchanged. This dramatically reduces computational cost and memory access. The processed features then pass through two pointwise (1×1) convolutions for channel interaction and dimension restoration, with Batch Normalization and activation functions in between. Crucially, the EMA attention module is integrated in parallel. EMA groups channels and uses a combination of global average pooling and small-kernel depthwise convolutions to capture cross-dimensional interactions across both channel and spatial axes, efficiently generating attention weights without significant overhead.

The integration of Faster_Block_EMA into the C3k2 skeleton creates a powerful residual block. The use of dual 3×3 convolutions (approximating a 5×5 receptive field) within the residual path is retained, ensuring strong feature extraction capability for the diverse shapes and textures of solar panel anomalies. The overall effect is a module that performs more efficient multi-scale feature extraction with heightened sensitivity to critical regions, all while reducing parameters and FLOPs. The structure of this module is summarized in the following processing flow:

| Step | Operation | Purpose |
|---|---|---|
| 1 | Input feature map split | Divides channels for parallel processing. |
| 2 | Path A: 1×1 convolution | Maintains the identity/shortcut path. |
| 3 | Path B: sequence of [CBS, N × Faster_Block_EMA, CBS] | Core feature extraction with PConv and EMA attention. |
| 4 | Concatenation | Merges features from Paths A and B. |
| 5 | Final 1×1 convolution | Fuses concatenated features and adjusts channel dimensions. |

Mathematically, if we denote the input to a Faster_Block_EMA as \( X \in \mathbb{R}^{C \times H \times W} \), the PConv operates on the first \( \frac{C}{r} \) channels (where \( r \) is a reduction ratio, often 4 or 8). The output of the full C3k2_Faster_EMA module can be seen as a refined feature map \( Y \) with enhanced multi-scale contextual information:
$$ Y = \mathcal{F}_{\text{C3k2\_Faster\_EMA}}(X) = \text{Conv}_{1\times1}\big( [X_{\text{pathA}}, \mathcal{F}_{\text{Faster\_EMA}}(X_{\text{pathB}})] \big) $$
where \( \mathcal{F}_{\text{Faster\_EMA}} \) represents the sequence of operations in the Faster_Block_EMA chain, and \( [\cdot] \) denotes concatenation.
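For readers who prefer code, the following is a minimal PyTorch sketch of the Partial Convolution and a Faster_Block_EMA-style unit. It is illustrative rather than the authors' released implementation: the class names (`PartialConv`, `FasterBlockEMA`), the expansion factor, and the use of a simple global-pooling channel gate as a stand-in for the full EMA attention are assumptions made for clarity.

```python
# Illustrative sketch only; not the authors' code. EMA is approximated by a
# lightweight global-context channel gate to keep the example self-contained.
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Applies a 3x3 convolution to only the first C/r channels; the rest pass through unchanged."""
    def __init__(self, channels: int, ratio: int = 4):
        super().__init__()
        self.conv_channels = channels // ratio          # only this subset is convolved
        self.conv = nn.Conv2d(self.conv_channels, self.conv_channels, 3, 1, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.conv_channels, x.shape[1] - self.conv_channels], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)     # untouched channels are concatenated back

class FasterBlockEMA(nn.Module):
    """PConv -> 1x1 expand -> BN/act -> 1x1 project, with a parallel attention branch and residual add."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PartialConv(channels)
        self.pw1 = nn.Sequential(nn.Conv2d(channels, hidden, 1, bias=False),
                                 nn.BatchNorm2d(hidden), nn.SiLU())
        self.pw2 = nn.Conv2d(hidden, channels, 1, bias=False)
        # Stand-in for EMA attention: a channel gate built from global average pooling.
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.pw2(self.pw1(self.pconv(x)))
        return x + y * self.attn(x)                      # residual connection gated by attention
```

The efficiency property described above is visible in `PartialConv`: only \( \frac{C}{r} \) of the channels are spatially convolved, so the cost of the 3×3 convolution drops roughly by the factor \( r \).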

1.2 Feature Refinement with SEAttention in the Neck

The neck of the detector, typically a Feature Pyramid Network (FPN) or Path Aggregation Network (PAN), combines features from different backbone levels to create a multi-scale representation. Before these features are fused, we introduce a lightweight but effective channel-wise attention mechanism to recalibrate them.

We embed a Squeeze-and-Excitation (SEAttention) module after the C2PSA module in the backbone, just before the features are passed to the neck for upsampling and concatenation. The SEAttention module operates through a simple yet powerful two-step process: Squeeze and Excitation.

  1. Squeeze: Global spatial information is compressed using Global Average Pooling (GAP). For an input feature map \( U \) with \( C \) channels, this produces a channel descriptor \( z \in \mathbb{R}^C \), where the \( c \)-th element is:
    $$ z_c = \mathcal{F}_{sq}(U_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j) $$
  2. Excitation: A simple gating mechanism with sigmoid activation learns the interdependencies between channels. It typically involves two fully connected (FC) layers with a non-linearity (ReLU) in between:
    $$ s = \mathcal{F}_{ex}(z, W) = \sigma(W_2 \delta(W_1 z)) $$
    Here, \( W_1 \in \mathbb{R}^{\frac{C}{r} \times C} \) and \( W_2 \in \mathbb{R}^{C \times \frac{C}{r}} \) are the FC layer weights, \( \delta \) is the ReLU function, \( \sigma \) is the sigmoid function, and \( r \) is a reduction ratio (e.g., 16). The output \( s \) is a vector of channel weights between 0 and 1.

The final output of the SEAttention module is obtained by rescaling the input feature map \( U \) with the learned weights:
$$ \tilde{U}_c = s_c \cdot U_c $$
This operation allows the model to adaptively emphasize informative features (e.g., those corresponding to defect edges or texture anomalies on the solar panels) and suppress less useful ones, leading to more discriminative feature maps for subsequent fusion and detection, with minimal added computation.
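A minimal PyTorch version of this squeeze-excitation-rescale sequence is shown below. It follows the standard Squeeze-and-Excitation formulation given in the equations above; the class name and the default reduction ratio of 16 mirror common implementations, while the exact insertion point in the network is as described in the text.

```python
# Minimal SE attention block matching the squeeze/excitation equations above.
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze (GAP) -> FC down to C/r -> ReLU -> FC back to C -> sigmoid -> channel rescale."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global average pooling per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = self.fc(self.pool(x).view(b, c))             # excitation: channel weights in (0, 1)
        return x * s.view(b, c, 1, 1)                    # rescale each channel of the input
```

The only parameters added are the two FC layers (about \( 2C^2/r \) weights), which is negligible relative to the backbone, consistent with the "minimal added computation" claim above.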

1.3 Efficient Multi-Scale Detection Head

The detection head is responsible for predicting both the class and the bounding box coordinates from the fused multi-scale features. The original YOLOv11n head is effective but can be streamlined. Inspired by the efficient design philosophy of EfficientDet, we reconstruct the detection head to be more parameter-efficient.

Our Detect_Efficient head adopts a streamlined flow. First, the input features from the BiFPN (or PAN) are processed by two parallel branches of grouped convolutions. Grouped convolutions reduce computation by dividing channels into groups and performing convolution independently within each group. The outputs of these branches are then fed into two separate prediction sub-networks: one for class confidence and one for bounding box regression (BoxNet). These sub-networks use standard convolutions. Finally, the predictions from all feature scales (P3, P4, P5) are concatenated to form the final output.

This design decouples the processing streams for classification and localization early on, allowing each to specialize. The use of grouped convolutions at the initial stage significantly cuts down parameters and FLOPs compared to using standard convolutions throughout. The structure can be conceptualized as follows for a single scale:

| Component | Layers | Key Purpose |
|---|---|---|
| Stem | Two parallel 3×3 grouped convolutions | Initial lightweight feature transformation. |
| Class Predictor | Standard 1×1 and 3×3 convolutions | Predicts per-anchor class probabilities. |
| Box Predictor (BoxNet) | Standard 1×1 and 3×3 convolutions | Predicts bounding box offsets (dx, dy, dh, dw). |
| Output | Concatenation of class and box outputs | Forms the final detection tensor for the scale. |
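To make the decoupled, grouped-convolution design concrete, here is a minimal single-scale sketch in PyTorch. The class name `DetectEfficientScale`, the group count of 4, and the layer counts are illustrative assumptions rather than the exact head used in the paper.

```python
# Illustrative single-scale decoupled head with grouped-convolution stems.
# Assumes `channels` is divisible by `groups`.
import torch
import torch.nn as nn

class DetectEfficientScale(nn.Module):
    def __init__(self, channels: int, num_classes: int, groups: int = 4, reg_out: int = 4):
        super().__init__()
        # Stem: parallel grouped 3x3 convolutions (cheap initial transformation per branch).
        self.stem_cls = nn.Sequential(nn.Conv2d(channels, channels, 3, 1, 1, groups=groups, bias=False),
                                      nn.BatchNorm2d(channels), nn.SiLU())
        self.stem_box = nn.Sequential(nn.Conv2d(channels, channels, 3, 1, 1, groups=groups, bias=False),
                                      nn.BatchNorm2d(channels), nn.SiLU())
        # Prediction sub-networks with standard convolutions.
        self.cls_pred = nn.Conv2d(channels, num_classes, 1)   # per-location class logits
        self.box_pred = nn.Conv2d(channels, reg_out, 1)       # per-location box offsets (dx, dy, dw, dh)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        cls = self.cls_pred(self.stem_cls(x))
        box = self.box_pred(self.stem_box(x))
        return torch.cat([box, cls], dim=1)                   # detection tensor for this scale
```

A grouped 3×3 convolution with `groups=4` uses roughly a quarter of the parameters and FLOPs of its dense counterpart, which is where the bulk of the head's savings come from.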

1.4 Optimized Regression with Inner_DIoU Loss

Accurate localization of defects on solar panels is crucial, as the size and extent of a crack or hot spot are important metrics. The choice of bounding box regression loss function significantly impacts localization precision. We replace the commonly used CIoU loss with the more advanced Inner_DIoU loss.

The fundamental metric is Intersection over Union (IoU):
$$ \text{IoU} = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|} $$
where \( B \) is the predicted box and \( B^{gt} \) is the ground-truth box.

Distance-IoU (DIoU) loss introduces a penalty term based on the normalized center-point distance:
$$ \mathcal{L}_{DIoU} = 1 - \text{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} $$
Here, \( \rho(\cdot) \) is the Euclidean distance, \( b \) and \( b^{gt} \) are the box centers, and \( c \) is the diagonal length of the smallest enclosing box covering both predicted and ground-truth boxes.

Inner_DIoU builds upon DIoU but reformulates it to focus on accelerating convergence, especially for high-IoU samples. Its definition is:
$$ \mathcal{L}_{\text{Inner\_DIoU}} = 1 - \text{Inner\_DIoU} = 1 - \Big( \text{IoU} - \frac{d^2}{c^2} \Big) $$
The term \( \frac{d^2}{c^2} \), with \( d = \rho(b, b^{gt}) \), is the same normalized center-distance penalty as in DIoU. By subtracting it from the IoU, the Inner_DIoU metric itself decreases when either the overlap is low or the centers are misaligned. Minimizing the loss \( 1 - \text{Inner\_DIoU} \) therefore simultaneously maximizes overlap and minimizes center distance. This formulation provides more balanced gradients during training, leading to faster convergence and more precise box regression for the often small and irregularly shaped targets on solar panels, without adding any computational cost at inference time.
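A minimal sketch of this regression loss, implemented directly from the formula above for axis-aligned boxes in (x1, y1, x2, y2) format, is given below. The function name is illustrative, and the per-box losses would typically be averaged and weighted inside the overall training objective.

```python
# Sketch of L = 1 - (IoU - d^2 / c^2) for batches of boxes in xyxy format.
import torch

def inner_diou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Intersection area
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    # Union and IoU
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Squared distance d^2 between box centers
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    ct = (target[:, :2] + target[:, 2:]) / 2
    d2 = ((cp - ct) ** 2).sum(dim=1)
    # Squared diagonal c^2 of the smallest enclosing box
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + eps
    return 1.0 - (iou - d2 / c2)   # smaller when overlap is high and centers align
```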

2. Experimental Setup and Comprehensive Analysis

To rigorously evaluate the performance of FESI-YOLOv11n, we conducted extensive experiments on a dedicated dataset of solar panel images.

2.1 Dataset and Implementation Details

We curated a dataset comprising 12,887 images of solar panels, each resized to 640×640 pixels. The dataset covers six categories relevant to solar panel health monitoring: bird-drop, clean, dusty, snow-covered, electrical-damage, and physical-damage. It was split into training (10,309 images), validation (1,289 images), and test (1,289 images) sets. All models were trained from scratch for 300 epochs with the SGD optimizer and a batch size of 16 on a system with an NVIDIA RTX 4060 Ti GPU.
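For reference, a hedged sketch of this training configuration expressed with the Ultralytics Python API might look as follows. The configuration file names are placeholders, and the custom FESI modules would need to be registered with the framework before such a model YAML could be loaded.

```python
# Sketch of the training setup described above (placeholder file names).
from ultralytics import YOLO

# Hypothetical model config describing the modified FESI-YOLOv11n architecture.
model = YOLO("fesi-yolo11n.yaml")

model.train(
    data="solar_panels.yaml",   # placeholder dataset description (6 classes, 640x640 images)
    epochs=300,
    imgsz=640,
    batch=16,
    optimizer="SGD",
)
```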

2.2 Ablation Studies and Component Analysis

We performed systematic ablation studies to validate the contribution of each proposed component. The baseline is the original YOLOv11n model.

A. Impact of Convolutional Module: We first evaluated the effect of replacing the standard C3k2 module with various advanced alternatives. The results clearly demonstrate the superiority of our proposed C3k2_Faster_EMA module.

| Model Variant | mAP@50 (%) | mAP@50:95 (%) | Parameters | GFLOPs |
|---|---|---|---|---|
| YOLOv11n (Baseline) | 68.0 | 65.7 | 2,583,322 | 6.3 |
| + C3k2-RVB | 66.3 | 62.3 | 2,289,650 | 5.9 |
| + iRMB-Cascaded | 67.4 | 63.7 | 2,445,242 | 6.3 |
| + ContextGuided | 69.2 | 66.2 | 2,188,355 | 5.6 |
| + RFAConv | 69.5 | 66.9 | 2,636,786 | 6.6 |
| + C3k2-DRB | 70.1 | 67.5 | 2,443,410 | 6.3 |
| + C3k2_Faster_EMA (Ours) | 70.4 | 68.1 | 2,301,746 | 6.0 |

The C3k2_Faster_EMA module achieved the highest mAP scores (70.4% and 68.1%) while maintaining low parameter count and computational cost, proving its efficiency and effectiveness for solar panel feature extraction.

B. Impact of Attention Mechanism: Next, we integrated various attention mechanisms into the model enhanced with C3k2_Faster_EMA. The SEAttention module provided the best performance boost.

| Attention Mechanism | mAP@50 (%) | mAP@50:95 (%) | Parameters | GFLOPs |
|---|---|---|---|---|
| AFGCAttention | 69.9 | 67.3 | 2,359,752 | 5.9 |
| MLCA | 70.4 | 68.0 | 2,301,756 | 5.9 |
| Dattention | 70.5 | 67.8 | 2,560,578 | 6.1 |
| SegNext_Attention | 70.7 | 68.4 | 2,387,906 | 6.0 |
| CAFM | 70.8 | 67.9 | 2,639,531 | 6.2 |
| TripletAttention | 70.9 | 68.5 | 2,294,154 | 5.9 |
| SEAttention (Ours) | 71.2 | 68.7 | 2,302,146 | 5.9 |

C. Impact of Detection Head: We then compared different detection head designs on the model with C3k2_Faster_EMA and SEAttention. Our Detect_Efficient head achieved an excellent balance.

| Detection Head | mAP@50 (%) | mAP@50:95 (%) | Parameters | GFLOPs |
|---|---|---|---|---|
| v10Detect | 65.8 | 63.4 | 2,302,146 | 5.9 |
| Detect_RSCD | 69.4 | 66.5 | 2,551,657 | 6.2 |
| MultiSEAMHead | 70.8 | 68.1 | 4,314,306 | 5.6 |
| Detect_SEAM | 70.9 | 68.4 | 2,210,370 | 5.3 |
| Detect_Efficient (Ours) | 71.1 | 68.5 | 2,033,218 | 4.7 |

Note the significant drop in parameters and GFLOPs with our head, while accuracy is essentially preserved (71.1% vs. 71.2% mAP@50 relative to the previous configuration) and remains the highest among the compared heads, highlighting its efficiency.

D. Impact of Loss Function: Finally, we evaluated different regression losses on the nearly complete model. Inner_DIoU delivered the best final performance.

| Loss Function | mAP@50 (%) | mAP@50:95 (%) |
|---|---|---|
| focaler_GIoU | 68.9 | 66.1 |
| EIoU | 69.4 | 66.9 |
| focaler_DIoU | 70.3 | 67.9 |
| mpdIoU | 70.7 | 68.1 |
| SIoU | 70.8 | 68.4 |
| DIoU | 71.2 | 68.7 |
| Inner_DIoU (Ours) | 71.6 | 69.1 |

E. Comprehensive Ablation Study: The progressive integration of our components is summarized below, showing the cumulative benefit.

| Components Integrated | mAP@50 (%) | mAP@50:95 (%) | Parameters | GFLOPs |
|---|---|---|---|---|
| Baseline (YOLOv11n) | 68.0 | 65.7 | 2,583,322 | 6.3 |
| + C3k2_Faster_EMA | 70.4 (+2.4) | 68.1 (+2.4) | 2,301,746 | 6.0 |
| + SEAttention | 71.2 (+3.2) | 68.7 (+3.0) | 2,302,146 | 5.9 |
| + Detect_Efficient | 71.1 (+3.1) | 68.5 (+2.8) | 2,033,218 | 4.7 |
| + Inner_DIoU (FESI-YOLOv11n) | 71.6 (+3.6) | 69.1 (+3.4) | 2,033,218 | 4.7 |

The final FESI-YOLOv11n model achieves a 3.6 percentage point increase in mAP@50 and a 3.4 percentage point increase in the more stringent mAP@50:95 metric over the baseline. Remarkably, this is accomplished alongside a 21.29% reduction in parameters (from 2.58M to 2.03M) and a 25.4% reduction in computation (from 6.3 GFLOPs to 4.7 GFLOPs).

2.3 Comparison with State-of-the-Art Models

We compared FESI-YOLOv11n against other recent and lightweight YOLO variants on our solar panel test set. The results underscore the superior efficiency and accuracy of our approach.

| Model | mAP@50 (%) | mAP@50:95 (%) | Parameters | GFLOPs |
|---|---|---|---|---|
| YOLOv6n | 69.3 | 67.1 | 4,155,618 | 11.5 |
| YOLOv8n | 70.9 | 68.4 | 2,685,538 | 6.9 |
| YOLOv9s | 71.5 | 69.0 | 21,362,066 | 84.1 |
| YOLOv10n | 67.2 | 63.0 | 2,696,756 | 8.2 |
| YOLOv11n (Baseline) | 68.0 | 65.7 | 2,583,322 | 6.3 |
| FESI-YOLOv11n (Ours) | 71.6 | 69.1 | 2,033,218 | 4.7 |

Our model outperforms YOLOv8n and YOLOv10n in accuracy while being significantly lighter. It achieves comparable accuracy to the larger YOLOv9s but with less than 10% of its parameters and 5.6% of its computational cost. Compared to its direct baseline, YOLOv11n, our model is unequivocally more accurate and efficient.

3. Conclusion and Future Work

In this work, we have presented FESI-YOLOv11n, a highly efficient and accurate deep learning model tailored for the detection of foreign objects and defects on solar panels. The core of our contribution lies in four synergistic improvements: (1) The C3k2_Faster_EMA module, which enhances multi-scale feature extraction from solar panel imagery while reducing computational burden; (2) The strategic placement of the SEAttention mechanism to recalibrate channel features before multi-scale fusion, boosting discriminative power; (3) The redesigned Detect_Efficient head that employs grouped convolutions to drastically cut parameters and FLOPs without sacrificing accuracy; and (4) The adoption of the Inner_DIoU loss function for faster-converging and more precise bounding box regression.

Extensive experimental results on a comprehensive solar panel dataset validate the effectiveness of each component. Our final model achieves a significant boost in detection precision (mAP@50:95 of 69.1%) while simultaneously reducing the model size by 21.29% and computational load by 25.4% compared to the baseline YOLOv11n. This makes FESI-YOLOv11n not only more accurate but also more suitable for deployment on resource-constrained edge devices, such as inspection drones or embedded systems at solar farms.

This work provides a robust technical solution for the automated inspection of solar panels, contributing to the maintenance efficiency, safety, and economic viability of solar power generation. Future research directions include extending the model to perform pixel-wise segmentation of defects for more detailed analysis, adapting it to real-time video streams from autonomous inspection vehicles, and exploring knowledge distillation techniques to create even smaller variants for ultra-low-power hardware, further broadening the applicability of AI-driven maintenance for solar energy infrastructure.
