Improved U-Net for Solar Photovoltaic Panel Extraction

In the context of global carbon reduction efforts, solar photovoltaic (PV) technology has developed rapidly. PV installations cover large areas, have irregular shapes, and appear in diverse scenarios, so efficient methods are needed to detect and monitor them. Traditional supervised classification approaches often perform poorly because training regions chosen in one image rarely carry over to the many other PV images, which lowers accuracy, raises manual cost, and prolongs detection cycles. To address these challenges, we employ deep learning models based on convolutional neural networks (CNNs) to identify and extract PV panels from high-resolution remote sensing imagery, enabling high-precision intelligent detection and assessment. Our work focuses on enhancing the U-Net architecture to improve feature extraction and fusion, specifically tailored for solar energy storage applications.

The proliferation of solar energy storage systems underscores the need for accurate and efficient monitoring techniques. Solar energy storage is integral to maximizing the utilization of renewable resources, and precise extraction of PV panels facilitates better management and planning. In this study, we propose a feature-enhanced fusion network that integrates attention mechanisms and residual modules into the U-Net framework. This approach aims to enhance the model’s focus on PV regions, mitigate interference from complex backgrounds, and improve robustness across various installation scenarios. By leveraging advanced deep learning techniques, we aim to contribute to the optimization of solar energy storage systems through reliable and automated detection methods.

Our research begins with a review of traditional supervised classification methods, which have been widely used in remote sensing image analysis. These include maximum likelihood classification, minimum distance classification based on Euclidean distance, and Mahalanobis distance classification. The maximum likelihood method operates under the assumption that features follow a normal distribution, with the discriminant function defined as:

$$ f_i(x) = p(q_i | x) = \frac{p(x | q_i) p(q_i)}{p(x)} $$

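where ( p(x | q_i) ) is the conditional probability of observing ( x ) given class ( q_i ), ( p(q_i) ) is the prior probability of class ( q_i ), and ( p(x) ) is the evidence, which is the same for every class and therefore does not affect the decision. For multispectral data with normally distributed classes, dropping ( p(x) ) gives: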

$$ f_i(x) = \frac{p(q_i)}{(2\pi)^{K/2} |\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(x - u_i)^T \Sigma_i^{-1} (x - u_i)\right) $$

where ( K ) is the number of bands, ( \Sigma_i ) is the covariance matrix for class ( i ), and ( u_i ) is the mean vector. Minimum distance classification using Euclidean distance computes the distance between a pixel ( x ) and each class center ( M_i ) (the class mean) and assigns the pixel to the nearest class:

$$ d_{E_i}(x) = \sqrt{(x - M_i)^T (x - M_i)} $$

while the Mahalanobis distance also weights each difference by the inverse class covariance:

$$ d_{M_i}(x) = \sqrt{(x - M_i)^T \Sigma_i^{-1} (x - M_i)} $$

However, these methods struggle with the high variability and complex backgrounds in PV imagery, necessitating more advanced approaches like deep learning for solar energy storage applications.
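The three classical classifiers above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the study: the function names, the two-band toy statistics, and the choice to evaluate the maximum likelihood discriminant in log form (for numerical stability) are all assumptions. The Mahalanobis distance is computed in its usual square-root form.

```python
import numpy as np

def euclidean_distance(x, mean):
    """Distance from pixel x to a class center."""
    d = x - mean
    return float(np.sqrt(d @ d))

def mahalanobis_distance(x, mean, cov):
    """Distance from pixel x to a class center, weighted by the inverse covariance."""
    d = x - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def gaussian_log_discriminant(x, mean, cov, prior):
    """Logarithm of the Gaussian maximum likelihood discriminant f_i(x)."""
    k = x.shape[0]                       # number of bands
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)   # log |Sigma_i|
    quad = d @ np.linalg.inv(cov) @ d
    return float(np.log(prior) - 0.5 * (k * np.log(2 * np.pi) + logdet + quad))

def classify(x, means, covs, priors):
    """Assign x to the class with the largest discriminant value."""
    scores = [gaussian_log_discriminant(x, m, c, p)
              for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(scores))

# Hypothetical two-band statistics for two classes (illustrative values only).
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
covs = [np.eye(2), 4.0 * np.eye(2)]
priors = [0.5, 0.5]

x = np.array([1.0, 1.0])
label = classify(x, means, covs, priors)
d_euclid = euclidean_distance(x, means[1])
d_mahal = mahalanobis_distance(x, means[1], covs[1])
print(label, d_euclid, d_mahal)
```

Because all three rules rely on a single mean and covariance per class, they cannot represent the multi-modal appearance of PV panels over water, cropland, or saline soil, which is the limitation noted above.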

Convolutional neural networks (CNNs) have revolutionized image analysis by automatically learning hierarchical features. A typical CNN consists of convolutional layers, activation functions, pooling layers, and fully connected layers. The convolution operation for a 2D image can be represented as:

$$ (f \ast g)(m, n) = \sum_{i} \sum_{j} f(i, j)\, g(m - i, n - j) $$

where ( f ) is the input image and ( g ) is the kernel. Activation functions like ReLU introduce non-linearity:

$$ \text{ReLU}(x) = \max(0, x) $$

Pooling layers, such as max pooling, reduce spatial dimensions while retaining important features. Fully connected layers integrate features for classification, and the output layer often uses Softmax for probability estimation:

$$ \text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} $$

These components form the foundation of semantic segmentation models like U-Net, which we enhance for PV extraction.
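To make these building blocks concrete, the following is a minimal NumPy sketch of a convolution, ReLU, max pooling, and Softmax. It is written for readability rather than speed, the helper names and the toy input are assumptions, and deep learning frameworks typically implement the unflipped (cross-correlation) variant of convolution.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution; the kernel is flipped, as in the convolution definition."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            out[m, n] = np.sum(image[m:m + kh, n:n + kw] * flipped)
    return out

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0.0, x)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h = (x.shape[0] // size) * size
    w = (x.shape[1] // size) * size
    x = x[:h, :w]
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def softmax(logits):
    """Softmax with the maximum subtracted first for numerical stability."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Toy input: a 4x4 ramp image and a 2x2 difference kernel (illustrative values).
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])

features = relu(conv2d(image, kernel))
pooled = max_pool2d(features)
probs = softmax(np.array([1.0, 2.0, 3.0]))
print(features.shape, pooled.shape, probs)
```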

We selected publicly available high-resolution PV datasets comprising four scenarios: WaterSurface, Grassland, SalineAlkali, and Cropland. To address data scarcity and improve model generalization, we applied data augmentation techniques including rotation, noise addition, brightness adjustment, and Gaussian filtering. The datasets were split into training (80%), validation (10%), and test (10%) sets. The table below summarizes the dataset details after augmentation.

Scenario	Resolution (m)	Augmentation Ratio	Original Samples	Augmented Samples
WaterSurface	0.3	1:5	599	2995
Cropland	0.3	1:5	951	4755
SalineAlkali	0.3	1:5	422	2110
Grassland	0.3	1:5	140	700
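The arithmetic behind the table and the 80/10/10 split can be reproduced as follows. This is a sketch under stated assumptions: the text does not say how fractional counts are rounded or whether the split is applied per scenario, so the helper `split_counts` (a hypothetical name) floors the training and validation counts, assigns the remainder to the test set, and is applied to each scenario separately.

```python
# Original sample counts per scenario, taken from the table above.
ORIGINAL_COUNTS = {
    "WaterSurface": 599,
    "Cropland": 951,
    "SalineAlkali": 422,
    "Grassland": 140,
}
AUGMENTATION_FACTOR = 5  # the 1:5 augmentation ratio

def split_counts(total, train_pct=80, val_pct=10):
    """Return (train, val, test) counts; the test set takes the remainder (assumption)."""
    n_train = total * train_pct // 100
    n_val = total * val_pct // 100
    return n_train, n_val, total - n_train - n_val

augmented_counts = {name: n * AUGMENTATION_FACTOR for name, n in ORIGINAL_COUNTS.items()}

for name, total in augmented_counts.items():
    print(name, total, split_counts(total))
```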

For model validation, we used an independent Shrubwood dataset, similarly augmented to ensure robustness. Additionally, we applied our method to the Panda Power Station in Datong City, China, using 1 m resolution imagery from 2014, 2017, and 2020 to demonstrate practical utility in solar energy storage monitoring.

Our improved U-Net model incorporates the Convolutional Block Attention Module (CBAM) and a coordinate attention mechanism to enhance feature extraction. The CBAM module applies channel attention followed by spatial attention. Given an input feature map ( F \in \mathbb{R}^{C \times H \times W} ), channel attention ( M_c \in \mathbb{R}^{C \times 1 \times 1} ) is computed as:

$$ M_c(F) = \sigma(\text{MLP}(\text{MaxPool}(F)) + \text{MLP}(\text{AvgPool}(F))) $$

where ( \sigma ) is the sigmoid function, and MLP denotes a shared multilayer perceptron. Spatial attention ( M_s \in \mathbb{R}^{1 \times H \times W} ) is then applied:

$$ M_s(F) = \sigma(f^{7 \times 7}([\text{MaxPool}(F); \text{AvgPool}(F)])) $$

where ( f^{7 \times 7} ) is a convolution with a ( 7 \times 7 ) kernel, and the two pooling operations are taken along the channel axis and concatenated. The refined feature map is obtained as:

$$ F' = M_c(F) \otimes F, \quad F'' = M_s(F') \otimes F' $$

where ( \otimes ) denotes element-wise multiplication. Additionally, we embed coordinate attention into residual modules to capture precise location information. The coordinate attention mechanism aggregates features along horizontal and vertical directions, enhancing the model’s ability to locate PV panels accurately. The residual module with coordinate attention helps mitigate gradient vanishing and network degradation, expressed as:

$$ W(x) = F(x) + x $$

where ( F(x) ) is the residual mapping. These improvements enable the model to focus on relevant features for solar energy storage applications, reducing false positives and improving boundary detection.
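The CBAM equations above can be traced step by step with a small NumPy forward pass. This is a sketch, not the trained module from the study: the weights are random, the reduction ratio of the shared MLP and the feature map size are hypothetical, and the ( 7 \times 7 ) filter is applied as a zero-padded cross-correlation, as deep learning frameworks usually do.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """M_c(F): shared two-layer MLP over spatially max- and average-pooled descriptors."""
    def shared_mlp(v):
        return w2 @ np.maximum(0.0, w1 @ v)
    max_desc = feat.max(axis=(1, 2))     # shape (C,)
    avg_desc = feat.mean(axis=(1, 2))    # shape (C,)
    return sigmoid(shared_mlp(max_desc) + shared_mlp(avg_desc))[:, None, None]

def spatial_attention(feat, kernel):
    """M_s(F): k x k filter over the channel-wise max and mean maps."""
    stacked = np.stack([feat.max(axis=0), feat.mean(axis=0)])   # shape (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    h, w = feat.shape[1:]
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)[None, :, :]

def cbam(feat, w1, w2, kernel):
    """Channel attention first, then spatial attention on the channel-refined map."""
    f1 = channel_attention(feat, w1, w2) * feat
    return spatial_attention(f1, kernel) * f1

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, H, W, reduction = 8, 16, 16, 2
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // reduction, C))
w2 = rng.standard_normal((C, C // reduction))
kernel = rng.standard_normal((2, 7, 7))

refined = cbam(feat, w1, w2, kernel)
print(refined.shape)
```

Both attention maps lie in (0, 1), so the module only rescales the input features and never changes their shape, which is what allows it to be inserted into an existing U-Net stage.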

We conducted experiments on an Ubuntu 18.04 system with an NVIDIA GeForce RTX 2080 Ti GPU, PyTorch 1.6.1, and Python 3.6.0. The model was trained with a batch size of 4 during the frozen phase and 1 during the unfrozen phase, using the Adam optimizer with learning rates of ( 10^{-4} ) and ( 10^{-5} ), respectively. The loss function was cross-entropy:

$$ \text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(p_{ij}) $$

where ( N ) is the number of samples, ( C ) is the number of classes, ( y_{ij} ) is the ground truth, and ( p_{ij} ) is the predicted probability. Evaluation metrics included precision, recall, ( F_1 )-score, and Intersection over Union (IoU):

$$ \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN} $$

$$ F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \quad \text{IoU} = \frac{TP}{TP + FP + FN} $$

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. We compared our method with traditional supervised classification and state-of-the-art models like PSPNet, SegNet, DeepLabV3+, and HRNet.
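The loss and the evaluation metrics defined above can be checked on a toy example. The sketch below uses only the standard library; the function names and the tiny label sequences are illustrative assumptions, and the metrics are computed for a single class (PV), whereas the reported values are means over classes.

```python
import math

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy; y_true rows are one-hot, y_prob rows are predicted probabilities."""
    total = 0.0
    for truth, prob in zip(y_true, y_prob):
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(truth, prob))
    return total / len(y_true)

def segmentation_metrics(pred, truth):
    """Precision, recall, F1, and IoU for flat sequences of 0/1 labels (1 = PV)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}

# Eight pixels: 3 true positives, 1 false positive, 1 false negative.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = segmentation_metrics(pred, truth)

# Two pixels, two classes (background, PV).
loss = cross_entropy([[1, 0], [0, 1]], [[0.9, 0.1], [0.2, 0.8]])
print(metrics, loss)
```

Note that IoU is never larger than the ( F_1 )-score for the same counts, since ( \text{IoU} = F_1 / (2 - F_1) ).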

Ablation studies demonstrated the effectiveness of our modifications. For instance, on the WaterSurface dataset, adding CBAM alone improved the mean ( F_1 )-score (( mF_1 )) by 4.12% and mean IoU (( mIoU )) by 2.78%, while combining CBAM with coordinate attention-enhanced residuals increased ( mF_1 ) by 3.12% and ( mIoU ) by 3.94%. The table below shows comparative results on the WaterSurface dataset, highlighting our model’s superiority.

Model	mPrecision	mRecall	$ mF_1 $	mIoU
PSPNet	90.68%	93.76%	92.19%	89.70%
SegNet	89.85%	92.54%	91.18%	90.85%
DeepLabV3+	90.91%	94.08%	92.47%	90.24%
HRNet	91.26%	95.04%	93.11%	91.15%
U-Net	90.53%	94.71%	92.57%	90.62%
Ours	92.47%	96.32%	94.35%	92.67%
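The margins implied by this table can be computed directly from its rows. The snippet below is a small sketch that copies the table values and reports differences in percentage points; the dictionary and the helper name `margin_over` are illustrative.

```python
# (mPrecision, mRecall, mF1, mIoU) in percent, copied from the table above.
RESULTS = {
    "PSPNet": (90.68, 93.76, 92.19, 89.70),
    "SegNet": (89.85, 92.54, 91.18, 90.85),
    "DeepLabV3+": (90.91, 94.08, 92.47, 90.24),
    "HRNet": (91.26, 95.04, 93.11, 91.15),
    "U-Net": (90.53, 94.71, 92.57, 90.62),
    "Ours": (92.47, 96.32, 94.35, 92.67),
}

def margin_over(baseline, model="Ours"):
    """Percentage-point difference between `model` and `baseline` for each metric."""
    return tuple(round(a - b, 2) for a, b in zip(RESULTS[model], RESULTS[baseline]))

for name in RESULTS:
    if name != "Ours":
        print(name, margin_over(name))
```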

Similar trends were observed across other datasets, with our model achieving the highest ( mF_1 ) and ( mIoU ) values. For example, on the SalineAlkali dataset, ( mF_1 ) improved by 4.46% and ( mIoU ) by 2.58% compared to the baseline U-Net. These results validate our approach’s ability to handle diverse PV installation scenarios, which is crucial for solar energy storage infrastructure monitoring.

Validation on the Shrubwood dataset confirmed the model’s generalization, with ( mF_1 ) and ( mIoU ) increasing by 4.05% and 4.07%, respectively, over U-Net. Furthermore, application to the Panda Power Station demonstrated practical utility. Using imagery from 2014, 2017, and 2020, we extracted PV areas and estimated their expansion, supporting the station’s management and contributing to solar energy storage optimization. The estimated areas were 3.082 km² (2014), 4.010 km² (2017), and 4.253 km² (2020), reflecting the growth of solar energy storage capacity.
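The expansion between acquisition dates follows from simple arithmetic on the reported areas. The sketch below also shows how a count of pixels classified as PV converts to an area at 1 m resolution; the helper names are hypothetical, and the conversion assumes square pixels with no reprojection or mixed-pixel correction.

```python
# Estimated PV areas in square kilometres, as reported above.
AREAS_KM2 = {2014: 3.082, 2017: 4.010, 2020: 4.253}

def pixels_to_km2(pixel_count, resolution_m=1.0):
    """Area covered by `pixel_count` square pixels of side `resolution_m` metres."""
    return pixel_count * resolution_m ** 2 / 1e6

def growth(areas):
    """List of (start_year, end_year, added_km2, percent_increase) between dates."""
    years = sorted(areas)
    rows = []
    for prev, curr in zip(years, years[1:]):
        delta = areas[curr] - areas[prev]
        rows.append((prev, curr, round(delta, 3), round(100 * delta / areas[prev], 2)))
    return rows

growth_rows = growth(AREAS_KM2)
for row in growth_rows:
    print(row)
```

Under these assumptions, the 2014 estimate of 3.082 km² corresponds to roughly 3.08 million PV pixels at 1 m resolution, and most of the expansion occurred between 2014 and 2017.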

In conclusion, our improved U-Net model effectively addresses the challenges of PV panel extraction in complex environments. By integrating attention mechanisms and residual learning, we enhance feature representation and model stability, leading to higher accuracy and robustness. This work underscores the importance of advanced deep learning techniques in solar energy storage applications, enabling efficient monitoring and planning. Future research will focus on expanding datasets to include more diverse scenarios and improving computational efficiency for real-time deployment. As solar energy storage continues to evolve, reliable detection methods will play a pivotal role in maximizing the benefits of renewable energy sources.