Semi-Supervised Solar Panel Segmentation

The precise identification and segmentation of solar panels from aerial or satellite imagery is a critical task in the photovoltaic industry, enabling efficient monitoring, planning, and maintenance of renewable energy systems. With the global push toward green, low-carbon energy solutions to achieve sustainable development goals, solar power has emerged as one of the most promising sources of clean electricity. However, accurate segmentation of solar panels faces significant challenges, including low contrast in images, blurred boundaries, complex backgrounds (e.g., urban rooftops, vegetation, and water bodies), and the high cost of obtaining large-scale, precisely annotated datasets. Traditional supervised semantic segmentation models, such as those based on convolutional neural networks (CNNs) or Transformers, often rely heavily on extensive labeled data, which is labor-intensive and time-consuming to acquire. In practical scenarios, annotated data may be scarce, making full supervision impractical. To address this, I propose a novel semi-supervised learning framework for solar panel image segmentation, combining an improved perturbation-based approach with a dual-branch feature aggregation network to leverage both labeled and unlabeled data effectively. This method aims to enhance segmentation accuracy even with limited annotations, making it more applicable to real-world deployments where data labeling is a bottleneck.

The core of my approach lies in two key innovations: first, a unified semi-supervised learning framework that extends FixMatch by incorporating both image-level and feature-level perturbations to explore a broader perturbation space; and second, a dual-branch feature aggregation network that integrates CNN and Transformer architectures to capture multi-level features, including local details and global context. By designing a multi-level spatial attention module and an interlaced fusion decoder, the network effectively aggregates features from both branches, improving the localization and boundary clarity of solar panels. In experiments, this method demonstrates superior performance on publicly available solar panel datasets at different spatial resolutions, achieving high mean intersection over union (MIoU) scores with only a fraction of labeled data. The following sections detail the methodology, experimental setup, and results, highlighting the effectiveness of the proposed techniques for solar panel segmentation tasks.

Semantic segmentation of solar panels is essential for various applications, such as inventory management, energy yield estimation, and environmental impact assessment. Solar panels, often installed in distributed settings like rooftops, fields, or water surfaces, present unique challenges due to their varying sizes, orientations, and surrounding environments. High-resolution remote sensing imagery provides a rich data source, but manual annotation is expensive and prone to errors. Existing methods typically employ deep learning models like U-Net, DeepLabV3+, or Transformer-based networks, which require large labeled datasets for training. However, in many cases, only a small subset of images may be annotated, leading to poor generalization when using fully supervised approaches. Semi-supervised learning offers a solution by utilizing both labeled and unlabeled data during training, often through consistency regularization or pseudo-labeling techniques. For instance, FixMatch enforces consistency between weakly and strongly perturbed views of unlabeled images, but it primarily focuses on image-level perturbations, limiting the exploration of feature-space variations. My work builds on this by introducing a multi-branch perturbation framework that combines image and feature perturbations, along with a dual-branch network architecture tailored for solar panel characteristics. This not only improves robustness but also enhances feature discrimination, crucial for handling the subtle contrasts and complex backgrounds common in solar panel imagery.

The proposed semi-supervised learning framework is designed to maximize the use of unlabeled data while maintaining high accuracy with limited labels. Let $D_l = \{(x_i, y_i)\}_{i=1}^{N_l}$ represent the labeled dataset, where $x_i$ is an input image and $y_i$ is the corresponding segmentation mask, and $D_u = \{x_j\}_{j=1}^{N_u}$ represent the unlabeled dataset. The goal is to train a segmentation model $f_\theta$ parameterized by $\theta$ that minimizes a combined loss function: $L = L_s + \lambda_u L_u$, where $L_s$ is the supervised loss on labeled data, $L_u$ is the unsupervised loss on unlabeled data, and $\lambda_u$ is a weighting factor. For labeled data, the supervised loss is typically the cross-entropy loss between predictions and ground truth masks: $$L_s = -\frac{1}{N_l} \sum_{i=1}^{N_l} \sum_{c=1}^{C} y_i^{(c)} \log(f_\theta(x_i)^{(c)}),$$ where $C$ is the number of classes (e.g., solar panel vs. background). For unlabeled data, my framework extends FixMatch by incorporating additional perturbation branches. Specifically, for an unlabeled image $x_u$, I generate a weakly perturbed view $x_w$ (e.g., via random cropping or flipping) and two strongly perturbed views $x_{s1}$ and $x_{s2}$ (e.g., via RandAugment or CutMix). Additionally, I introduce a feature-level perturbation branch where the features extracted from $x_w$ are perturbed before decoding. The predictions are denoted as $p_w = f_\theta(x_w)$, $p_{s1} = f_\theta(x_{s1})$, $p_{s2} = f_\theta(x_{s2})$, and $p_{fp} = f_\theta(\text{PerturbFeatures}(x_w))$. 
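The weak, strong, and feature-level perturbations above can be sketched as follows. This is a minimal NumPy illustration: the concrete choices here (horizontal flip, Gaussian intensity jitter, channel dropout) are simple stand-ins for the random crop/flip, RandAugment/CutMix, and feature-perturbation operations, not the exact augmentations used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def weak_view(x):
    """Weak perturbation: random horizontal flip of an (H, W, C) image."""
    return x[:, ::-1].copy() if rng.random() < 0.5 else x.copy()

def strong_view(x):
    """Strong perturbation: weak view plus intensity jitter, a simple
    stand-in for RandAugment/CutMix-style augmentation."""
    jittered = weak_view(x) + rng.normal(0.0, 0.1, x.shape)
    return np.clip(jittered, 0.0, 1.0)

def perturb_features(feat, drop_prob=0.5):
    """Feature-level perturbation: channel dropout on a (C, H, W) feature
    map, rescaled so the expected activation magnitude is preserved."""
    keep = (rng.random(feat.shape[0]) >= drop_prob).astype(feat.dtype)
    return feat * keep[:, None, None] / max(keep.mean(), 1e-8)
```

In the full framework, `strong_view` would be called twice to produce $x_{s1}$ and $x_{s2}$, while `perturb_features` acts on the encoder output of the weak view before decoding.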
The unsupervised loss enforces consistency between these predictions using pseudo-labels derived from $p_w$ with a confidence threshold $\tau$: $$L_u = \frac{1}{B_u} \sum \mathbb{1}(\max(p_w) \geq \tau) \left( \lambda H(p_w, p_{fp}) + \frac{\mu}{2} (H(p_w, p_{s1}) + H(p_w, p_{s2})) \right),$$ where $B_u$ is the batch size for unlabeled data, $\mathbb{1}(\cdot)$ is an indicator function, $H(\cdot, \cdot)$ is the cross-entropy function, and $\lambda$ and $\mu$ are weights set to 0.5. This formulation encourages the model to be invariant to various perturbations, improving generalization. The confidence threshold $\tau$ is set to 0.95 to filter out noisy pseudo-labels, ensuring reliable supervision from unlabeled data.
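The unsupervised loss above can be sketched in NumPy as follows, assuming each prediction is a softmax probability map of shape (B, C, H, W); the function name and array layout are illustrative, not the actual implementation.

```python
import numpy as np

def unsupervised_loss(p_w, p_s1, p_s2, p_fp, tau=0.95, lam=0.5, mu=0.5):
    """Confidence-masked consistency loss over per-pixel class probabilities.

    Pseudo-labels come from the weak view p_w; pixels whose maximum
    confidence falls below tau are masked out of the loss.
    """
    eps = 1e-8
    pseudo = p_w.argmax(axis=1)                      # (B, H, W) hard pseudo-labels
    mask = (p_w.max(axis=1) >= tau).astype(float)    # confidence filter

    def ce(pred):
        # per-pixel cross-entropy against the pseudo-labels
        picked = np.take_along_axis(pred, pseudo[:, None], axis=1)[:, 0]
        return -np.log(picked + eps)

    per_pixel = lam * ce(p_fp) + 0.5 * mu * (ce(p_s1) + ce(p_s2))
    return (mask * per_pixel).sum() / max(mask.sum(), 1.0)
```

Note how a low-confidence weak prediction zeroes out the entire pixel's contribution, so unreliable pseudo-labels never propagate into the gradient.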

The dual-branch feature aggregation network is the backbone of the segmentation model, designed to capture both local and global features essential for solar panel segmentation. Solar panels often exhibit fine edges and homogeneous textures, requiring detailed spatial information, while the global context helps distinguish them from similar objects like rooftops or water bodies. The network consists of two parallel branches: a CNN branch and a Transformer branch. The CNN branch, based on a lightweight DCNet architecture, extracts multi-scale local features through dense connection modules (DCModules). Each DCModule comprises multiple convolutional blocks with skip connections to preserve spatial details. The output of a DCModule can be expressed as: $$x_{\text{output}} = E(H(x_1, x_2, x_3, x_4)),$$ where $x_1, x_2, x_3, x_4$ are feature maps from different blocks, $H$ denotes concatenation, and $E$ represents an efficient channel attention (ECA) module that enhances feature selectivity. The Transformer branch, based on the Convolutional Vision Transformer (CvT), captures long-range dependencies through self-attention mechanisms. Given an input image, it is divided into patches, and transformer blocks apply multi-head attention to model global relationships: $$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V,$$ where $Q, K, V$ are query, key, and value matrices derived from patch embeddings. This allows the network to integrate contextual information across the entire image, which is beneficial for identifying solar panels in cluttered scenes.
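The scaled dot-product attention used by the Transformer branch can be written directly from the formula above; a minimal NumPy sketch over a single head:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention over patch embeddings.

    Q, K, V: (num_patches, d_k) matrices derived from the patch embeddings.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # pairwise patch similarities
    return softmax(scores, axis=-1) @ V       # attention-weighted values
```

Each output row is a convex combination of all value rows, which is exactly why the branch can relate a solar panel patch to context anywhere else in the image.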

To effectively combine features from both branches, I design a multi-level spatial attention module (MSAM) and an interlaced fusion decoder. The MSAM is applied to the CNN branch to refine spatial features by emphasizing important regions. It uses global pooling and max pooling operations at multiple scales (e.g., $1 \times n$ and $n \times 1$ kernels with $n = 3, 5, 7$) to generate attention weights. The process can be summarized as: $$X'_2 = t(F_1(\text{ReLU}(F_1(F_{g1}(\phi(X_1)))))),$$ $$W = \text{Sigmoid}(X'_2 \cdot X'_1),$$ $$f_{\text{out}} = \psi(W \cdot X'_1) + X,$$ where $X$ is the input feature map, $X_1$ is the fused multi-scale feature, $F_{g1}$ is global average pooling, $F_1$ is a 1D convolution, $t$ is transpose, and $\psi$ reshapes the tensor. The output $f_{\text{out}}$ is then aligned with the Transformer branch features via convolution for concatenation. The interlaced fusion decoder progressively upsamples and merges features from both branches. Let $x_i^c$ and $x_i^t$ denote features from the CNN and Transformer branches at stage $i$, respectively. The decoder computes: $$D_i = \sigma(\text{BN}(\text{DWConv}(M_i))), \quad i=1,2,3,4,$$ $$M_i = \begin{cases} \text{Concat}(D_{i+1}, x_i^c), & i=1,3 \\ \text{Concat}(D_{i+1}, x_i^t), & i=2 \\ \text{Concat}(x_i^t, x_{i+1}^c), & i=4 \end{cases},$$ where $\sigma$ is the GELU activation, BN is batch normalization, DWConv is depthwise separable convolution, and Concat denotes concatenation. The final output $D'_i$ is obtained via bilinear upsampling: $D'_i = \text{Up}(D_i)$. This design ensures that local details from the CNN branch and global semantics from the Transformer branch are synergistically integrated, leading to precise segmentation masks for solar panels.
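The fusion order of the interlaced decoder can be traced with a small bookkeeping sketch. Two caveats: `block` here is any callable standing in for the real DWConv → BN → GELU unit, and I read the deepest stage ($i=4$) as fusing the deepest maps of both branches, which is an assumption about how the $x_{i+1}^c$ index resolves at the last stage.

```python
def interlaced_fusion(x_c, x_t, block):
    """Trace the interlaced decoder's fusion order (bookkeeping sketch).

    x_c, x_t: per-stage features from the CNN / Transformer branches,
    dicts keyed 1..4 with 4 the deepest stage. `block` stands for the
    DWConv -> BN -> GELU unit applied after each concatenation.
    Assumption: the deepest stage fuses the deepest maps of both branches,
    then stages 3 and 1 mix in CNN features and stage 2 mixes in
    Transformer features, matching the case split for M_i.
    """
    D4 = block(("Concat", x_t[4], x_c[4]))   # i = 4 (deepest stage)
    D3 = block(("Concat", D4, x_c[3]))       # i = 3: CNN branch feature
    D2 = block(("Concat", D3, x_t[2]))       # i = 2: Transformer branch feature
    D1 = block(("Concat", D2, x_c[1]))       # i = 1: CNN branch feature
    return D1  # bilinearly upsampled to full resolution in the real model
```

Alternating which branch feeds each stage is the "interlacing": local CNN detail and global Transformer context are injected at different depths rather than concatenated once.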

Experiments were conducted on three publicly available solar panel datasets with varying spatial resolutions: PV01 (0.1 m), PV03 (0.3 m), and PV08 (0.8 m). These datasets contain images of solar panels in diverse environments, such as rooftops, fields, and water bodies, making them suitable for evaluating robustness. The images were resized to $256 \times 256$ pixels, and the datasets were split into training, validation, and test sets in a 7:2:1 ratio. To simulate limited annotation scenarios, I used only subsets of the labeled training data (e.g., 1/64, 1/32, 1/16, 1/8 of the full set) while leveraging the remaining unlabeled data. The model was implemented in PyTorch and trained on an NVIDIA RTX 3090 GPU using stochastic gradient descent with momentum 0.9, weight decay 0.01, and a polynomial learning rate schedule: $$\text{LR} = 0.001 \times \left(1 - \frac{E}{300}\right)^2,$$ where $E$ is the epoch number. Training ran for 200 epochs, and performance was evaluated using standard metrics: pixel accuracy (PA), mean intersection over union (MIoU), recall (RC), and precision (PR). These are defined as: $$\text{PA} = \frac{\sum_{i=0}^k p_{ii}}{\sum_{i=0}^k \sum_{j=0}^k p_{ij}}, \quad \text{MIoU} = \frac{1}{k+1} \sum_{i=0}^k \frac{p_{ii}}{\sum_{j=0}^k p_{ij} + \sum_{j=0}^k p_{ji} - p_{ii}},$$ $$\text{RC} = \frac{TP}{TP + FN}, \quad \text{PR} = \frac{TP}{TP + FP},$$ where $k+1$ is the number of classes, $p_{ij}$ is the number of pixels of class $i$ predicted as class $j$, $TP$ is true positives, $FP$ is false positives, and $FN$ is false negatives. Higher values indicate better segmentation accuracy, particularly for solar panels.
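The PA and MIoU definitions above follow directly from a pixel confusion matrix; a minimal NumPy sketch:

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes=2):
    """k-by-k pixel confusion matrix; entry (i, j) counts pixels of
    ground-truth class i predicted as class j."""
    m = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(m, (gt.ravel(), pred.ravel()), 1)
    return m

def pixel_accuracy(m):
    """PA: correctly classified pixels over all pixels (trace / total)."""
    return np.trace(m) / m.sum()

def mean_iou(m):
    """MIoU: per-class intersection over union, averaged over classes."""
    inter = np.diag(m)
    union = m.sum(axis=1) + m.sum(axis=0) - inter
    return (inter / np.maximum(union, 1)).mean()
```

For the binary panel-vs-background case, recall and precision fall out of the same matrix as `m[1, 1] / m[1].sum()` and `m[1, 1] / m[:, 1].sum()`, respectively.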

The proposed method was compared against several state-of-the-art supervised and semi-supervised segmentation models. For supervised baselines, I used U-Net, U-Net++, DeepLabV3+, BiSeNet, SegFormer, and HRViT, trained with the same limited labeled data. For semi-supervised baselines, I adapted Mean-Teacher, FixMatch, and ST++ to work with the dual-branch network for fairness. Table 1 summarizes the MIoU results on the PV01, PV03, and PV08 datasets with different labeled data ratios. The proposed method consistently outperforms all baselines, especially when labeled data is scarce. For instance, with only 1/32 labeled data, it achieves MIoU scores of 83.74%, 82.77%, and 80.73% on PV01, PV03, and PV08, respectively, representing improvements of 1.4 to 1.8 percentage points over the best semi-supervised competitor, ST++. This demonstrates the effectiveness of the unified perturbation framework and dual-branch architecture in leveraging unlabeled data for solar panel segmentation.

Table 1: Comparison of MIoU (%) on Solar Panel Datasets with Varying Labeled Data Ratios
Method Supervision PV01 (1/32) PV03 (1/32) PV08 (1/32) PV01 (1/64) PV03 (1/64) PV08 (1/64)
U-Net Full 73.01 71.58 70.19 71.79 70.42 69.21
DeepLabV3+ Full 70.39 72.45 72.83 69.58 71.11 71.49
SegFormer Full 75.13 74.23 72.75 73.50 72.50 71.37
Mean-Teacher Semi 77.20 78.13 76.90 76.37 76.82 75.27
FixMatch Semi 81.36 79.45 76.21 79.72 78.30 75.32
ST++ Semi 81.98 81.40 79.04 80.66 80.24 77.40
Proposed Semi 83.74 82.77 80.73 82.53 81.58 79.83

To further validate the contributions of individual components, ablation studies were conducted on the PV03 dataset. Table 2 shows the impact of the perturbation strategies in the semi-supervised framework. The baseline FixMatch uses only image-level strong perturbations. Adding feature-level perturbations (strategy a) improves MIoU by about 1.5 percentage points with 1/32 labeled data, while adding dual-stream image perturbations (strategy b) yields a similar gain. Combining both strategies (the full framework) achieves the best performance, highlighting the importance of exploring diverse perturbation spaces for solar panel segmentation. This aligns with the intuition that solar panels exhibit varied appearances, and robust feature learning benefits from multi-faceted augmentations.

Table 2: Ablation Study on Semi-Supervised Perturbation Strategies (MIoU % on PV03)
Perturbation Strategy 1/64 Labels 1/32 Labels 1/16 Labels 1/8 Labels
FixMatch (baseline) 78.30 79.45 80.71 81.28
+ Feature-level (a) 79.83 80.94 81.15 82.53
+ Dual-stream image (b) 80.04 80.77 81.24 82.39
Full framework (a+b) 81.58 82.77 83.61 84.67

Table 3 presents an ablation study on the dual-branch feature aggregation network. Using only the CNN branch results in lower MIoU due to limited global context, while only the Transformer branch struggles with fine details. Combining both branches without feature aggregation improves performance, but adding the MSAM and interlaced decoder leads to significant gains. For example, with 1/32 labeled data, the full network achieves 82.77% MIoU compared to 80.75% without aggregation modules. This underscores the value of integrating local and global features for solar panel segmentation, where precise boundaries and contextual understanding are both critical. The multi-level spatial attention module effectively highlights solar panel regions, reducing false positives from background clutter, while the decoder recovers spatial details lost during downsampling.

Table 3: Ablation Study on Dual-Branch Network Components (MIoU % on PV03)
Network Configuration 1/64 Labels 1/32 Labels 1/16 Labels 1/8 Labels
CNN branch only 76.54 78.81 79.99 80.60
Transformer branch only 77.32 79.54 80.70 81.23
CNN + Transformer 78.94 80.75 81.41 82.49
+ Feature aggregation (MSAM) 80.09 81.63 82.13 83.22
+ Interlaced decoder (full) 81.58 82.77 83.61 84.67

The proposed method also excels in handling challenging cases common in solar panel imagery. For instance, in low-contrast scenarios where solar panels blend with rooftops, the dual-branch network leverages global attention to distinguish them, while the CNN branch refines edges. In complex backgrounds with vegetation or water, the perturbation framework enhances robustness by exposing the model to varied augmentations during training. Qualitative results show reduced false positives and smoother boundaries compared to baselines. The semi-supervised approach effectively utilizes unlabeled data to learn invariant features, making it suitable for large-scale solar panel mapping where annotations are limited. Future work could explore adapting the framework to other renewable energy infrastructure, such as wind turbines, or integrating multi-temporal data for dynamic monitoring. Additionally, extending the perturbation strategies to include domain-specific augmentations, like simulating different lighting conditions or panel orientations, may further improve performance for solar panel segmentation in diverse environments.

In conclusion, I have presented a novel semi-supervised learning algorithm for solar panel image segmentation that addresses the challenges of limited labeled data and complex image characteristics. The unified perturbation framework combines image-level and feature-level perturbations to exploit a broader augmentation space, improving consistency regularization for unlabeled data. The dual-branch feature aggregation network, incorporating CNN and Transformer architectures, effectively captures multi-level features essential for accurate solar panel localization and boundary delineation. Experimental results on multiple datasets demonstrate state-of-the-art performance, with significant improvements in MIoU even when using only a small fraction of labeled data. This work contributes to the photovoltaic industry by offering a practical solution for automated solar panel monitoring, supporting the global transition to sustainable energy. The methods are generalizable and can be extended to other semantic segmentation tasks where data annotation is costly. As solar energy adoption grows, efficient and accurate segmentation tools will play a crucial role in optimizing energy systems and advancing green initiatives.
