In the field of renewable energy, solar power generation has developed rapidly due to its wide availability, cleanliness, and abundance. Solar panels are the core components of photovoltaic systems, and their operational status directly impacts energy conversion efficiency. When defects occur on the surface of solar panels, the affected areas exhibit reduced efficiency and often appear as high-brightness regions in infrared images. With the advancement of unmanned aerial vehicles (UAVs) and thermal imaging technology, image-based methods have become the primary approach to defect detection in solar panels. To accurately locate defective components without interference from complex backgrounds, it is essential to first segment the complete solar panel regions from the images. This segmentation task is challenging due to low contrast, blurred boundaries, and cluttered backgrounds in real-world environments. Traditional image processing methods, which rely on color, edge, or texture features, often suffer from poor generalization and inadequate segmentation accuracy under varying conditions. Therefore, I propose an improved U-Net network for solar panel image segmentation that enhances performance while keeping the model lightweight, making it suitable for practical engineering applications.

Deep convolutional neural networks have shown superior performance in image segmentation tasks. The U-Net architecture, with its symmetric encoder-decoder structure and skip connections, is widely used in fields like medical imaging and remote sensing. However, the standard U-Net may lack sufficient feature extraction capability for diverse solar panel morphologies and complex backgrounds. To address this, I introduce modifications including depthwise separable convolutions, an efficient channel attention (ECA) module, and a composite loss function. These improvements aim to enhance segmentation accuracy, reduce computational costs, and accelerate convergence. In this article, I will detail the methodology, experimental setup, and results, demonstrating the effectiveness of the proposed approach for solar panel image segmentation.
The segmentation of solar panels is crucial for subsequent defect detection and component localization. In complex environments, solar panel images often exhibit low contrast, blurred boundaries, and varied backgrounds, such as vegetation, roads, buildings, and water bodies. Traditional methods, like those based on local statistical features or color space transformations, are sensitive to environmental changes and may fail under different lighting or terrain conditions. For instance, segmentation based on HSV color space thresholds can be ineffective when solar panels have hues similar to the background. Similarly, methods relying on handcrafted features lack robustness. Deep learning approaches, particularly semantic segmentation networks, offer a solution by learning hierarchical features directly from data. However, existing networks like U-Net, Res-U-Net, and MobileNetV2 have limitations: U-Net may underperform on complex solar panel structures, Res-U-Net has a high parameter count, and MobileNetV2 sacrifices accuracy for its lightweight design. Therefore, I developed an improved U-Net network that balances accuracy, efficiency, and generalization for solar panel segmentation.
The core idea of my method is to retain the U-Net’s encoder-decoder backbone while replacing standard convolutions with depthwise separable convolutions to reduce parameters. Additionally, I insert ECA attention modules between convolution blocks to enhance feature representation without significant computational overhead. The loss function combines cross-entropy, Dice, and Focal losses to handle class imbalance and improve convergence. I evaluated the proposed network on a dataset of 3,200 infrared images of solar panels, comparing it with MobileNetV2, U-Net, and Res-U-Net. Results show that my improved U-Net achieves higher pixel accuracy (PA) and mean intersection over union (MIoU), with fewer parameters than U-Net and Res-U-Net, making it suitable for deployment on resource-constrained platforms like embedded systems or UAVs.
In the following sections, I will first describe the network architecture in detail, including the depthwise separable convolutions, ECA module, and loss function. Then, I will explain the experimental setup, including dataset generation, training details, and evaluation metrics. After that, I will present results from ablation studies and comparative analyses, supported by tables and formulas. Finally, I will conclude with the implications of this work for solar panel inspection and future research directions.
Network Architecture
The proposed network is based on the U-Net framework, which consists of an encoder for feature extraction and a decoder for spatial resolution recovery, connected via skip concatenations. I modified this structure to improve its efficiency and performance for solar panel segmentation. The overall architecture is described below. The input image size is 640×512 pixels. The encoder has four downsampling stages, each followed by a max-pooling layer with a 3×3 kernel. The decoder has four upsampling stages, using linear interpolation to restore the image size, and skip connections from the encoder to preserve fine spatial detail.
In the standard U-Net, each stage in the encoder and decoder uses two consecutive 3×3 convolutions followed by ReLU activation and batch normalization. I replace these standard convolutions with depthwise separable convolutions to reduce computational cost. Furthermore, I introduce a block structure: each block comprises two sets of depthwise separable convolutions with an ECA attention module in between. The number of blocks increases with network depth to capture more abstract features. Specifically, the encoder uses 1, 2, 3, and 4 blocks in its four stages, respectively, while the decoder uses 1 block per stage. The final layer is a 1×1 convolution that outputs the probability map for solar panel regions.
Depthwise separable convolution decomposes a standard convolution into two steps: depthwise convolution and pointwise convolution. For an input feature map of size $H \times W \times C$, where $H$ and $W$ are height and width, and $C$ is the number of channels, the depthwise convolution applies $C$ separate $K \times K \times 1$ kernels to each channel, producing $C$ feature maps. The computational cost for this step is $K^2 \times C \times H \times W$. Then, pointwise convolution uses $N$ kernels of size $1 \times 1 \times C$ to combine the channels, yielding an output of size $H \times W \times N$ with a cost of $C \times H \times W \times N$. The total cost of depthwise separable convolution is $C \times H \times W \times (K^2 + N)$, compared to $K^2 \times H \times W \times C \times N$ for standard convolution. The parameter count is reduced from $K^2 \times C \times N$ to $K^2 \times C + C \times N$. Typically, $K=3$, so the reduction factor is approximately $\frac{1}{N} + \frac{1}{9}$. For example, if $N=64$, the computation is about $\frac{1}{64} + \frac{1}{9} \approx 0.0156 + 0.1111 = 0.1267$ times that of standard convolution, meaning an 87.33% reduction. This lightweight design is crucial for processing high-resolution images of solar panels efficiently.
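To make this concrete, below is a minimal PyTorch sketch of a depthwise separable convolution block. The class name and the placement of batch normalization and ReLU after each of the two steps are illustrative assumptions rather than the exact implementation.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable 3x3 convolution: a per-channel depthwise conv
    followed by a 1x1 pointwise conv, each with BN + ReLU (sketch)."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # Depthwise step: one K x K filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=padding, groups=in_channels, bias=False)
        # Pointwise step: 1x1 convolution mixes channels and sets the output width.
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))
```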
The ECA (Efficient Channel Attention) module enhances feature representation by modeling channel-wise dependencies without dimensionality reduction. Given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, the module first performs global average pooling to obtain a channel descriptor $z \in \mathbb{R}^{1 \times 1 \times C}$, where each element $z_c$ is computed as:
$$ z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i,j) $$
Then, a one-dimensional convolution with kernel size $k$ is applied to $z$ to capture local cross-channel interactions. The kernel size $k$ is adaptively determined from the channel dimension $C$ as $k = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{odd}$, i.e., the result rounded to the nearest odd integer, where $\gamma$ and $b$ are constants set to 2 and 1, respectively. In my implementation, I use $k=5$ for simplicity. The convolved output is passed through a Sigmoid activation to generate channel weights $\alpha \in \mathbb{R}^{1 \times 1 \times C}$. Finally, the input feature map is scaled by these weights: $Y = X \otimes \alpha$, where $\otimes$ denotes channel-wise multiplication. This process highlights informative channels relevant to solar panel segmentation while suppressing less useful ones, improving feature discrimination.
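A minimal PyTorch sketch of such an ECA module follows. The adaptive kernel-size computation is included for completeness, while the constructor also accepts a fixed `k` (the text uses $k=5$); the exact class interface is an assumption.

```python
import math
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling, a 1-D convolution
    across channels, and a Sigmoid gate that rescales the input channels."""
    def __init__(self, channels, gamma=2, b=1, k=None):
        super().__init__()
        if k is None:
            # k = log2(C)/gamma + b/gamma, rounded to the nearest odd integer.
            t = int(abs(math.log2(channels) / gamma + b / gamma))
            k = t if t % 2 else t + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                         # x: (B, C, H, W)
        z = x.mean(dim=(2, 3))                    # global average pooling -> (B, C)
        w = self.conv(z.unsqueeze(1))             # 1-D conv over channels -> (B, 1, C)
        w = self.sigmoid(w).squeeze(1)            # channel weights -> (B, C)
        return x * w.unsqueeze(-1).unsqueeze(-1)  # channel-wise rescaling
```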
The loss function is a linear combination of three terms: cross-entropy loss ($L_{CE}$), Dice loss ($L_{Dice}$), and Focal loss ($L_{FL}$). This composite loss addresses class imbalance, as solar panel regions may occupy a small portion of the image in some cases, and accelerates convergence. The total loss $L$ is defined as:
$$ L = \lambda L_{Dice} + \mu L_{CE} + \eta L_{FL} $$
where $\lambda$, $\mu$, and $\eta$ are weighting coefficients. Through empirical testing, I set $\lambda = 0.2$, $\mu = 0.5$, and $\eta = 0.3$ to balance the contributions. The cross-entropy loss for binary segmentation is given by:
$$ L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)] $$
where $N$ is the number of pixels, $y_i \in \{0,1\}$ is the ground truth label (0 for background, 1 for solar panel), and $p_i$ is the predicted probability of pixel $i$ belonging to the solar panel class. The Dice loss measures the overlap between prediction and ground truth:
$$ L_{Dice} = 1 - \frac{2 \sum_{i=1}^{N} y_i p_i + \epsilon}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i + \epsilon} $$
where $\epsilon$ is a small constant to avoid division by zero. The Focal loss addresses hard-to-classify examples by down-weighting easy samples:
$$ L_{FL} = -\frac{1}{N} \sum_{i=1}^{N} [\alpha (1 - p_i)^{\gamma} y_i \log(p_i) + (1 - \alpha) p_i^{\gamma} (1 - y_i) \log(1 - p_i)] $$
I use $\alpha = 0.25$ and $\gamma = 2$ as suggested in prior work. The combination of these losses ensures robust training for diverse solar panel images.
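The composite loss can be sketched in PyTorch as follows, assuming network logits and a binary float mask of the same shape; the clamping constants and the use of soft predictions in the Dice term are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def composite_loss(logits, target, lam=0.2, mu=0.5, eta=0.3,
                   alpha=0.25, gamma=2.0, eps=1e-6):
    """L = lam * Dice + mu * CE + eta * Focal for binary segmentation.
    logits and target have shape (B, 1, H, W); target holds 0/1 floats."""
    p = torch.sigmoid(logits)
    # Pixel-averaged binary cross-entropy.
    ce = F.binary_cross_entropy_with_logits(logits, target)
    # Dice loss on soft predictions.
    inter = (p * target).sum()
    dice = 1 - (2 * inter + eps) / (p.sum() + target.sum() + eps)
    # Focal loss with alpha balancing and focusing parameter gamma.
    logp = torch.log(p.clamp(min=eps))
    log1p = torch.log((1 - p).clamp(min=eps))
    focal = -(alpha * (1 - p) ** gamma * target * logp
              + (1 - alpha) * p ** gamma * (1 - target) * log1p).mean()
    return lam * dice + mu * ce + eta * focal
```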
To summarize the network components, Table 1 lists the key operations in each stage of the encoder and decoder. Note that all convolutions are depthwise separable unless specified otherwise.
| Stage | Operation | Output Size | Blocks |
|---|---|---|---|
| Encoder 1 | Depthwise Separable Conv (3×3), BN, ReLU, ECA, Depthwise Separable Conv (3×3), BN, ReLU, MaxPool (3×3) | 320×256×64 | 1 |
| Encoder 2 | Two sets of Depthwise Separable Convs with ECA, MaxPool | 160×128×128 | 2 |
| Encoder 3 | Three sets of Depthwise Separable Convs with ECA, MaxPool | 80×64×256 | 3 |
| Encoder 4 | Four sets of Depthwise Separable Convs with ECA, MaxPool | 40×32×512 | 4 |
| Decoder 1 | Upsample (2×), Concatenate with Encoder 4, Depthwise Separable Conv Block | 80×64×256 | 1 |
| Decoder 2 | Upsample, Concatenate with Encoder 3, Depthwise Separable Conv Block | 160×128×128 | 1 |
| Decoder 3 | Upsample, Concatenate with Encoder 2, Depthwise Separable Conv Block | 320×256×64 | 1 |
| Decoder 4 | Upsample, Concatenate with Encoder 1, Depthwise Separable Conv Block, 1×1 Conv | 640×512×1 | 1 |
This architecture balances depth and efficiency, making it suitable for segmenting solar panels in various environments.
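As a sketch of the block structure summarized in Table 1, the snippet below composes the DepthwiseSeparableConv and ECA classes sketched earlier: two depthwise separable convolutions with an ECA module applied between them. The exact ordering of operations inside a block in the original implementation is an assumption.

```python
import torch.nn as nn

class DSConvBlock(nn.Module):
    """One encoder/decoder block: depthwise separable conv -> ECA ->
    depthwise separable conv, matching the block description in the text."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = DepthwiseSeparableConv(in_channels, out_channels)
        self.eca = ECA(out_channels)
        self.conv2 = DepthwiseSeparableConv(out_channels, out_channels)

    def forward(self, x):
        return self.conv2(self.eca(self.conv1(x)))
```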
Experimental Setup
To validate the proposed method, I conducted experiments on a custom dataset of solar panel infrared images. The dataset was collected using a UAV equipped with an infrared camera with a resolution of 640×512 pixels. The images cover diverse scenarios, including mountainous areas, plains, rooftops, water surfaces, and plateaus, with varying solar panel arrangements, brightness levels, and background clutter. This diversity ensures that the model can generalize to real-world conditions. I manually annotated the images using LabelMe software to generate ground truth masks, where solar panel regions are marked as foreground. To augment the dataset, I applied random flipping, cropping, and noise addition to the original images and corresponding masks, resulting in 3,200 image-mask pairs. The dataset was split into training (2,240 images), validation (640 images), and test (320 images) sets, maintaining a 7:2:1 ratio.
The experiments were performed on a Windows system with an NVIDIA GeForce GTX 1080 Ti GPU, using the PyTorch deep learning framework. The network was trained for 100 epochs with a batch size of 2. The optimizer was SGD with an initial learning rate of 0.05, decayed using a polynomial schedule: $lr = base\_lr \times (1 - \frac{iter}{total\_iter})^{0.9}$, where $iter$ is the current iteration and $total\_iter$ is the total number of iterations. Input images were resized to 640×512 and normalized. Data augmentation included random horizontal and vertical flips, rotations up to 10 degrees, and Gaussian noise with standard deviation 0.01.
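The polynomial decay can be expressed with PyTorch's LambdaLR, as in the sketch below. Only the base learning rate of 0.05 and the decay exponent 0.9 come from the training setup; the momentum value and the helper's name are illustrative assumptions.

```python
import torch

def build_optimizer_and_scheduler(model, base_lr=0.05, total_iter=100_000):
    """SGD with polynomial decay: lr = base_lr * (1 - iter / total_iter) ** 0.9."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda it: (1.0 - it / total_iter) ** 0.9)
    return optimizer, scheduler

# Call scheduler.step() once per training iteration so `it` tracks the
# iteration counter used in the decay formula.
```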
I compared the improved U-Net with three baseline networks: MobileNetV2, U-Net, and Res-U-Net. MobileNetV2 was chosen for its lightweight design, U-Net for its segmentation prowess, and Res-U-Net for its residual connections. All networks were trained under the same conditions for fairness. The evaluation metrics were Pixel Accuracy (PA) and Mean Intersection over Union (MIoU), defined as:
$$ PA = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}} $$
$$ MIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} $$
where $k=1$ for binary segmentation (solar panel vs. background) and $p_{ij}$ denotes the number of pixels of class $i$ predicted as class $j$; thus $p_{ii}$ are true positives, $p_{ij}$ ($j \neq i$) are false negatives, and $p_{ji}$ ($j \neq i$) are false positives. Higher values indicate better segmentation. I also reported parameter counts and inference times on both GPU and CPU to assess efficiency.
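For reference, a minimal NumPy sketch of how PA and MIoU can be computed from a confusion matrix is given below; the function name and the use of np.bincount are implementation choices, not the exact evaluation code.

```python
import numpy as np

def pa_miou(pred, gt, num_classes=2):
    """Pixel Accuracy and Mean IoU for integer label maps of the same shape
    (0 = background, 1 = solar panel)."""
    # Confusion matrix: rows = ground-truth class i, columns = predicted class j.
    cm = np.bincount(num_classes * gt.flatten() + pred.flatten(),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    pa = np.diag(cm).sum() / cm.sum()
    iou = np.diag(cm) / (cm.sum(axis=1) + cm.sum(axis=0) - np.diag(cm))
    return pa, iou.mean()
```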
For ablation studies, I examined the impact of the ECA module and the composite loss function. I replaced the ECA module with the SE (Squeeze-and-Excitation) module and tested individual loss functions to justify my design choices. The weighting coefficients for the loss function were determined through grid search, as detailed in the next section.
Results and Analysis
The proposed improved U-Net network achieved superior performance on the solar panel segmentation task. Table 2 summarizes the quantitative results on the test set, comparing PA, MIoU, parameter counts, and inference times. All values are averaged over the test set, with standard deviations indicating consistency.
| Network | Parameters | PA (Mean ± Std) | MIoU (Mean ± Std) | GPU Inference Time (s) | CPU Inference Time (s) |
|---|---|---|---|---|---|
| MobileNetV2 | 7.3M | 0.9656 ± 0.043 | 0.9553 ± 0.039 | 0.12 ± 0.051 | 1.17 ± 0.074 |
| U-Net | 51M | 0.9814 ± 0.037 | 0.9701 ± 0.033 | 0.12 ± 0.044 | 2.51 ± 0.128 |
| Res-U-Net | 56M | 0.9867 ± 0.024 | 0.9736 ± 0.030 | 1.11 ± 0.046 | 2.49 ± 0.141 |
| Improved U-Net (Ours) | 17M | 0.9931 ± 0.019 | 0.9802 ± 0.026 | 0.12 ± 0.040 | 1.58 ± 0.145 |
As shown, my improved U-Net attains the highest PA (0.9931) and MIoU (0.9802), with significantly lower standard deviations, indicating robust segmentation across diverse solar panel images. The parameter count is only 17 million, which is 33.3% of U-Net’s 51M and 30.4% of Res-U-Net’s 56M, while being slightly higher than MobileNetV2’s 7.3M. However, MobileNetV2 suffers in accuracy, making my network a better trade-off. Inference times on GPU are comparable across networks (around 0.12 seconds), but on CPU, my network is faster than U-Net and Res-U-Net, though slower than MobileNetV2. This efficiency makes the improved U-Net suitable for real-time applications on embedded devices, such as UAV-based inspection of solar panels.
To analyze the contribution of the ECA module, I conducted an ablation study by replacing it with the SE module. Table 3 presents the results. Both modules improve over a baseline without attention, but ECA yields better performance with fewer parameters, as it avoids dimensionality reduction.
| Attention Module | PA (Mean ± Std) | MIoU (Mean ± Std) | Parameter Increase |
|---|---|---|---|
| None | 0.9820 ± 0.030 | 0.9680 ± 0.035 | 0% |
| SE | 0.9866 ± 0.033 | 0.9703 ± 0.036 | ~5% |
| ECA | 0.9931 ± 0.019 | 0.9802 ± 0.026 | ~2% |
The ECA module enhances feature focus on solar panel regions, leading to more precise boundaries. For example, in images with complex backgrounds like trees or buildings, ECA helps suppress irrelevant features, reducing false positives. The local cross-channel interaction in ECA is computationally efficient, adding minimal overhead while boosting accuracy.
Next, I evaluated the loss function. Table 4 compares the composite loss with individual loss terms. The convergence speed was measured by the number of epochs to reach a loss below 0.05, and segmentation performance was assessed on the test set.
| Loss Function | Convergence Epochs | PA (Mean) | MIoU (Mean) |
|---|---|---|---|
| Dice Loss | 63 | 0.9673 | 0.9634 |
| Cross-Entropy Loss | 36 | 0.9821 | 0.9702 |
| Focal Loss | 39 | 0.9764 | 0.9659 |
| Composite Loss (Ours) | 25 | 0.9931 | 0.9802 |
The composite loss converges in 25 epochs, faster than any single loss, and achieves the highest PA and MIoU. This is because it combines the strengths of each component: cross-entropy provides stable gradients, Dice loss handles class imbalance by focusing on overlap, and Focal loss emphasizes hard examples. The weighting coefficients $\lambda = 0.2$, $\mu = 0.5$, and $\eta = 0.3$ were determined through grid search. Formally, the search minimized the validation loss over $\lambda, \mu, \eta \in [0,1]$ subject to $\lambda + \mu + \eta = 1$, using a step size of 0.1; the combination $(0.2, 0.5, 0.3)$ gave the best trade-off between convergence speed and accuracy.
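A sketch of such a constrained grid search is shown below; `evaluate` is a hypothetical helper that trains briefly with a given weighting and returns the validation loss, and the selection criterion is an assumption.

```python
import itertools

# Candidate weightings with step 0.1 under the constraint lam + mu + eta = 1.
candidates = [(l / 10, m / 10, (10 - l - m) / 10)
              for l, m in itertools.product(range(11), repeat=2)
              if l + m <= 10]

# `evaluate` is a hypothetical helper that trains for a few epochs with the
# given weighting and returns the validation loss.
best = min(candidates, key=lambda w: evaluate(*w))   # e.g. (0.2, 0.5, 0.3)
```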
Visual results further demonstrate the effectiveness of my method. For instance, in images with small solar panels against a cluttered background, the improved U-Net accurately segments the panels with sharp boundaries, while other networks may miss parts or include background noise. In cases where solar panels have low contrast due to weather conditions, my network maintains high accuracy thanks to the ECA module’s ability to enhance relevant features. These qualitative observations align with the quantitative metrics.
To provide deeper insight, I analyze the computational complexity of the network. Let $S$ denote the spatial size of a feature map, $K$ the kernel size, and $C$ and $N$ the numbers of input and output channels, as before. The total floating-point operations (FLOPs) for a depthwise separable convolution can be approximated as:
$$ FLOPs_{DS} = S^2 \times (K^2 \times C + C \times N) $$
For standard convolution, it is $S^2 \times K^2 \times C \times N$. In my network, with input size 640×512, $C=3$ initially, and increasing channels in the encoder, the total FLOPs are reduced by about 70% compared to standard U-Net. This reduction enables faster training and inference, which is crucial for large-scale solar panel inspections where thousands of images may be processed.
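The per-layer reduction can be checked with a short calculation; the layer dimensions below (a 320×256 map with $C = N = 64$) are illustrative and not taken from a specific stage of the network.

```python
def conv_flops(H, W, K, C, N, separable):
    """Approximate multiply-accumulate count of one convolution layer on an
    H x W feature map with C input and N output channels."""
    if separable:
        return H * W * (K * K * C + C * N)   # depthwise + pointwise
    return H * W * K * K * C * N             # standard convolution

# Illustrative layer: 3x3 kernels with C = N = 64 on a 320 x 256 map.
ratio = conv_flops(320, 256, 3, 64, 64, True) / conv_flops(320, 256, 3, 64, 64, False)
print(f"{ratio:.4f}")   # ~0.1267, i.e. roughly an 87% per-layer reduction
```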
Moreover, I tested the generalization ability by applying the trained model to unseen solar panel images from different geographic locations. The improved U-Net maintained high accuracy, whereas traditional methods like threshold-based segmentation failed due to variations in illumination and panel appearance. This robustness stems from the deep learning approach’s ability to learn invariant features from diverse training data.
In terms of practical application, the segmentation output can be used for downstream tasks such as defect detection. For example, once solar panel regions are identified, anomaly detection algorithms can focus on these areas to identify hot spots or cracks. This pipeline enhances the efficiency of solar farm maintenance. My network’s lightweight design allows it to be deployed on edge devices, enabling real-time analysis during UAV flights, reducing the need for data transmission and storage.
Conclusion
In this article, I proposed an improved U-Net network for solar panel image segmentation. The method addresses challenges like low contrast, blurred boundaries, and complex backgrounds by incorporating depthwise separable convolutions, an ECA attention module, and a composite loss function. Depthwise separable convolutions reduce parameter counts and computational costs, making the network efficient for deployment. The ECA module enhances feature representation by modeling channel dependencies without dimensionality reduction, improving segmentation accuracy for solar panels. The composite loss combines cross-entropy, Dice, and Focal losses to handle class imbalance and accelerate convergence.
Experiments on a dataset of 3,200 infrared images of solar panels show that the improved U-Net outperforms MobileNetV2, U-Net, and Res-U-Net in terms of pixel accuracy and MIoU, while having fewer parameters than U-Net and Res-U-Net. Ablation studies confirm the contributions of each component. The network is robust across diverse solar panel scenarios, including mountainous terrain, rooftops, and water surfaces, demonstrating strong generalization.
Future work could explore further lightweight techniques, such as neural architecture search, to optimize the network for specific solar panel types or environments. Additionally, integrating temporal information from video sequences could improve segmentation in dynamic conditions. The proposed method provides a reliable foundation for automated solar panel inspection, contributing to the sustainable development of photovoltaic energy systems. By enabling accurate and efficient segmentation of solar panels, this technology supports the maintenance and optimization of solar farms, ultimately enhancing energy production and reducing operational costs.
In summary, the improved U-Net offers a balanced solution for solar panel image segmentation, combining high accuracy with practical efficiency. As solar energy continues to expand globally, such advanced image processing tools will play a vital role in ensuring the reliability and performance of photovoltaic installations.
