Monocular Camera-Based Visual Localization for Solar Panels

The operation and maintenance of large-scale photovoltaic (PV) power stations present significant logistical challenges. Among these, the cleaning of solar panels is a critical but labor-intensive and costly task. The accumulation of dust, pollen, bird droppings, and other debris on the surface of solar panels drastically reduces their energy conversion efficiency. To address this, automated cleaning solutions have emerged. A prominent example is the PV cleaning shuttle system, a mobile platform equipped with a robotic arm and a detachable cleaning end-effector. This system navigates between solar panel arrays, positions itself precisely, and uses the robotic arm to deploy the cleaning module onto the panel surface. The core requirement for the successful and safe operation of such a system is the accurate and real-time perception of the solar panel’s pose relative to the vehicle chassis. Specifically, the system must determine the tilt angle of the solar panel and the distance to its lower edge to guide the robotic arm’s motion trajectory. This paper presents a monocular vision-based method for solar panel localization, combining an enhanced deep learning-based keypoint detector with geometric reasoning using prior knowledge of the solar panel’s dimensions.

Existing methods for solar panel detection and localization can be broadly categorized by sensing modality. Thermal imaging techniques exploit the heat signature of solar panels. However, they are sensitive to environmental conditions like ambient temperature and sunlight intensity, often require expensive equipment, and may not provide the geometric precision needed for robotic manipulation. Methods based on visible-light camera imagery are more versatile. Traditional computer vision approaches, such as color space conversion with thresholding or handcrafted feature-based classifiers (e.g., Support Vector Machines, Random Forests), are often computationally efficient but lack robustness in complex, variable outdoor environments. Their performance degrades under changing lighting, shadows, or when the target color blends with the background.

The advent of deep learning has significantly advanced the field. Object detection networks like YOLO and Faster R-CNN, or segmentation networks like FCN and U-Net, can automatically learn robust features from data. While these methods offer superior accuracy compared to traditional ones, they often come with increased computational complexity. For a real-time system deployed on an embedded platform on a vehicle, a balance between accuracy and inference speed is paramount. Furthermore, standard object detection provides a bounding box, which is insufficient for precise pose estimation. Semantic segmentation provides a pixel-wise mask, offering detailed contour information, but extracting precise corner locations from a mask adds an extra processing step and may be sensitive to segmentation noise. Therefore, we frame the problem as a keypoint detection task, directly regressing the image coordinates of the four corner points of a solar panel. This provides the most direct and compact representation for subsequent geometric calculations.

We base our solution on the YOLOv8-pose architecture. The YOLO (You Only Look Once) family is renowned for its excellent speed-accuracy trade-off, making it suitable for real-time applications. YOLOv8-pose extends the object detection framework to include a pose estimation head that regresses keypoint coordinates alongside the bounding box. We select the YOLOv8n (nano) variant as our baseline due to its minimal parameter count and fast inference speed, ideal for deployment on resource-constrained hardware. However, to meet the high precision demands of robotic manipulation, we introduce several architectural improvements to enhance the feature extraction and fusion capabilities of the baseline model without excessively inflating its computational cost.

The overall architecture of our improved model retains the backbone-neck-head design. The Backbone (CSPDarknet) is responsible for extracting hierarchical features from the input image. The Neck (FPN+PAN) performs multi-scale feature fusion, combining deep, semantically rich features with shallow, high-resolution features to maintain accuracy for objects of various sizes. The Head performs the final detection and keypoint regression. Our modifications are focused on the Backbone and Neck components.

In the Backbone, after the final SPPF (Spatial Pyramid Pooling Fast) module, we integrate a PSA (Polarized Self-Attention) module. The SPPF module captures multi-scale contextual information through parallel pooling operations at different kernel sizes. The PSA module enhances this by applying a polarized filtering mechanism to the feature maps. It splits the channel dimension into two parts. One part undergoes multi-head self-attention (with BatchNorm for faster inference) to model long-range spatial dependencies, while the other part is preserved. The outputs are then concatenated. This allows the network to adaptively emphasize informative features and suppress less useful ones, improving the model’s focus on the structural details of the solar panel, especially its corners, against cluttered backgrounds. Placing it after SPPF avoids the high computational cost of self-attention on high-resolution feature maps from earlier stages.
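
The sketch below illustrates the split-channel attention block described above in PyTorch. The head count, the BatchNorm placement, and the output projection are assumptions made for illustration rather than the exact configuration of the deployed module.

```python
import torch
import torch.nn as nn

class SplitChannelAttention(nn.Module):
    """Sketch of the attention block placed after SPPF: half of the channels
    pass through multi-head self-attention, the other half is preserved, and
    the two parts are concatenated and re-fused."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        assert channels % 2 == 0 and (channels // 2) % num_heads == 0
        self.half = channels // 2
        # BatchNorm before attention (cheaper at inference than LayerNorm).
        self.norm = nn.BatchNorm2d(self.half)
        self.attn = nn.MultiheadAttention(self.half, num_heads, batch_first=True)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.split(self.half, dim=1)                    # split channels in half
        n, c, h, w = a.shape
        tokens = self.norm(a).flatten(2).transpose(1, 2)    # (N, H*W, C/2)
        attn_out, _ = self.attn(tokens, tokens, tokens)     # long-range spatial mixing
        a = a + attn_out.transpose(1, 2).reshape(n, c, h, w)  # residual on the attended half
        return self.proj(torch.cat([a, b], dim=1))          # concatenate and re-fuse
```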

For the Neck, we enhance both the upsampling and downsampling paths. The original YOLOv8 uses simple nearest-neighbor interpolation for upsampling, which can lead to blocky artifacts and loss of the fine detail crucial for precise keypoint localization. We replace it with the DySample (Dynamic Sampling) module. DySample learns a dynamic sampling map for upsampling. Given an input feature map \(X\) of size \(C \times H \times W\) and an upscaling factor \(s\), it generates an offset map \(O\) by applying a lightweight linear layer to \(X\) and rearranging the result to the target resolution via pixel shuffle: \(O = \text{pixel\_shuffle}(\text{linear}(X))\). This offset \(O\) is added to a standard coordinate grid \(G\) to form a dynamic sampling grid \(S = G + O\). The final upsampled feature map \(X'\) of size \(C \times sH \times sW\) is obtained via a differentiable grid sampling operation: \(X' = \text{grid\_sample}(X, S)\). This data-dependent approach allows adaptive and smoother feature reconstruction during upscaling, leading to more accurate feature fusion in the FPN+PAN structure.
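
A minimal sketch of this dynamic-sampling idea is given below. The offset scope factor of 0.25, the pixel-unit-to-normalized-coordinate conversion, and the absence of grouping are assumptions following common DySample implementations, not the exact configuration used in the model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Sketch of dynamic upsampling: a 1x1 'linear' layer predicts offsets,
    pixel shuffle lifts them to the target resolution, and grid_sample
    resamples the input at the offset-perturbed grid positions."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Predict 2 offset values (dx, dy) for each of the scale^2 sub-positions.
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, _, h, w = x.shape
        sh, sw = h * self.scale, w * self.scale
        # O = pixel_shuffle(linear(X)): offsets at the upsampled resolution,
        # damped by a 0.25 scope factor (assumed here).
        offsets = F.pixel_shuffle(self.offset(x), self.scale) * 0.25   # (N, 2, sH, sW)
        # Base sampling grid G in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, sh, device=x.device)
        xs = torch.linspace(-1, 1, sw, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).expand(n, -1, -1, -1)     # (N, sH, sW, 2)
        # S = G + O: offsets are in input-pixel units and converted to
        # normalized coordinates (approximate conversion in this sketch).
        sample_grid = grid + offsets.permute(0, 2, 3, 1) * torch.tensor(
            [2.0 / w, 2.0 / h], device=x.device)
        # X' = grid_sample(X, S): differentiable resampling of the input.
        return F.grid_sample(x, sample_grid, mode="bilinear", align_corners=True)
```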

Similarly, we improve the downsampling path. Standard strided convolutions or pooling layers used for downsampling can lead to information loss. We propose an enhanced version of the ADown module, termed ADown*. The original ADown module uses a parallel structure: the input is average-pooled and then split; one branch undergoes max-pooling followed by a convolution, while the other branch undergoes a convolution directly; the results are concatenated. Our ADown* modification replaces the standard convolutions in both branches with Depthwise Separable Convolutions. A depthwise separable convolution factorizes a standard convolution into a depthwise convolution (applying a single filter per input channel) followed by a pointwise convolution (a 1×1 convolution). This significantly reduces the number of parameters and computations. Additionally, we add a residual connection that incorporates the original pooled features before the convolutions, ensuring better gradient flow and feature preservation. The structure of ADown* is summarized below:

1. Apply Average Pooling to the input.

2. Split the feature map.

3. Branch 1: Apply Max Pooling, then a Depthwise Separable Convolution (DSConv).

4. Branch 2: Apply a Depthwise Separable Convolution (DSConv).

5. Concatenate the outputs of Branch 1 and Branch 2.

6. Add a residual connection from the initial split point (before convolutions) to the concatenated output.

This improved downsampler is used in the Neck to replace conventional downsampling convolutions, reducing parameters while enhancing the model’s perception of the target structure.
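
A minimal PyTorch sketch of the ADown* structure described above follows. The pooling strides and the 1×1 projection used to match the residual's channel count are assumptions, since the text does not fix these details.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) 3x3 conv
    followed by a 1x1 pointwise conv that mixes channels."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class ADownStar(nn.Module):
    """Sketch of ADown*: average pooling, channel split, a max-pool + DSConv
    branch and a plain DSConv branch, concatenation, and a residual from the
    pooled features (1x1-projected if the channel count changes)."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        half_in, half_out = c_in // 2, c_out // 2
        self.pool = nn.AvgPool2d(2, stride=2)                        # step 1 (stride-2 assumed)
        self.branch1 = nn.Sequential(                                # step 3
            nn.MaxPool2d(3, stride=1, padding=1), DSConv(half_in, half_out))
        self.branch2 = DSConv(half_in, half_out)                     # step 4
        self.residual = (nn.Identity() if c_in == c_out              # step 6 projection
                         else nn.Conv2d(c_in, c_out, 1, bias=False))

    def forward(self, x):
        x = self.pool(x)                                             # downsample
        a, b = x.chunk(2, dim=1)                                     # step 2: split
        out = torch.cat([self.branch1(a), self.branch2(b)], dim=1)   # step 5: concatenate
        return out + self.residual(x)                                # step 6: residual add
```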

The keypoint detector provides the 2D pixel coordinates of the four solar panel corners: top-left \(P_{lu}\), top-right \(P_{ru}\), bottom-left \(P_{ld}\), and bottom-right \(P_{rd}\). To estimate 3D pose (distance \(S\) and tilt angle \(\theta\)) from these 2D points, we employ a geometric method based on prior knowledge and the perspective camera model. We assume the physical width \(W\) and length \(L\) of the solar panel are known, as they are standard for a given installation. We also assume the solar panel is installed at a fixed tilt \(\alpha\) relative to the horizontal plane (often equal to the site’s latitude for optimal annual yield). The camera is assumed to be calibrated, providing its intrinsic matrix \(K\) with focal lengths \(f_x, f_y\) and principal point \((c_x, c_y)\).

We define a camera coordinate system: origin \(O\) at the camera’s optical center, \(Z\)-axis along the optical axis, \(Y\)-axis pointing downward (aligned with the vertical direction in the image), and \(X\)-axis completing the right-handed system. The goal is to find the distance \(S\) from the camera to the center of the solar panel’s lower edge and the in-plane rotation angle \(\theta\) between the panel’s lower edge and the camera’s \(X\)-axis.

Let a 3D point \(P=(X, Y, Z)^T\) in the camera frame project to an image point \(p=(u, v)^T\). The perspective projection is:
$$ Z \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} $$
Expanding the second row gives \(vZ = Y f_y + c_y Z\), from which we obtain the useful relation:
$$ Z = \frac{Y f_y}{v - c_y} $$

Consider the spatial relationships between the corners. The bottom-left corner \(P_{ld}\) is offset from the top-left corner \(P_{lu}\) along the solar panel's side of length \(W\), whose orientation is determined by the angles \(\alpha\) and \(\theta\):
$$ \begin{bmatrix} X_{ld} \\ Y_{ld} \\ Z_{ld} \end{bmatrix} = \begin{bmatrix} X_{lu} \\ Y_{lu} \\ Z_{lu} \end{bmatrix} + \begin{bmatrix} W \cos\alpha \sin\theta \\ W \sin\alpha \\ W \cos\alpha \cos\theta \end{bmatrix} $$
The vector from the top-left corner \(P_{lu}\) to the top-right corner \(P_{ru}\) runs along the panel's horizontal edge of length \(L\) (parallel to its lower edge):
$$ \begin{bmatrix} X_{ru} \\ Y_{ru} \\ Z_{ru} \end{bmatrix} = \begin{bmatrix} X_{lu} \\ Y_{lu} \\ Z_{lu} \end{bmatrix} + \begin{bmatrix} L \cos\theta \\ 0 \\ -L \sin\theta \end{bmatrix} $$
Note the negative sign for the \(Z\)-component, assuming the panel extends in the negative \(Z\) direction for positive \(\theta\).

Applying the projection relation \(Z = Y f_y / (v - c_y)\) to points \(P_{lu}\) and \(P_{ru}\), and using the second spatial relation (which gives \(Y_{ru} = Y_{lu}\) and \(Z_{ru} = Z_{lu} - L \sin\theta\)), we can derive an expression for \(Z_{lu}\):
$$ Z_{lu} = L' \sin\theta \quad \text{where} \quad L' = \frac{L (v_{ru} - c_y)}{v_{ru} - v_{lu}} $$
Here, \(L'\) is a scale factor derived from the known length \(L\) and the vertical image coordinates of the two upper corners.

Next, applying the projection relation to points \(P_{lu}\) and \(P_{ld}\), and using the first spatial relation together with the expression for \(Z_{lu}\), we obtain an equation of the form \(A \sin\theta + B \cos\theta = C\), which can be solved for the tilt angle \(\theta\):
$$ \theta = \arcsin\left( \frac{C}{\sqrt{A^2 + B^2}} \right) - \arctan\left( \frac{B}{A} \right) $$
where
$$ A = L' (v_{ld} - v_{lu}), \quad B = W \cos\alpha \, (v_{ld} - c_y), \quad C = W f_y \sin\alpha $$
Finally, the distance \(S\) from the camera to the midpoint of the lower edge of the solar panel (its depth along the \(Z\)-axis) follows from the two spatial relations:
$$ S = Z_{lu} + W \cos\alpha \cos\theta - \frac{L}{2} \sin\theta $$
This formulation uses three of the four detected corners. In practice, the calculation can be performed using any combination of three non-collinear corners, and an average can be taken to reduce noise, provided at least three corners are visible and accurately detected.
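
For concreteness, the following Python sketch implements the closed-form computation above for one corner triple (\(P_{lu}\), \(P_{ru}\), \(P_{ld}\)). Angles are in radians, and the averaging over additional corner triples mentioned above is omitted.

```python
import math

def solar_panel_pose(v_lu, v_ru, v_ld, c_y, f_y, W, L, alpha):
    """Recover the tilt angle theta (rad) and the depth S to the midpoint of
    the panel's lower edge from the vertical image coordinates of three
    detected corners (top-left, top-right, bottom-left).

    v_*   : pixel row coordinates of the detected corners
    c_y, f_y : principal-point row and vertical focal length from K
    W, L  : physical side lengths of the panel (same unit as the returned S)
    alpha : known installation tilt of the panel relative to horizontal (rad)
    """
    # Scale factor L' from the two upper corners.
    L_prime = L * (v_ru - c_y) / (v_ru - v_lu)

    # Coefficients of A*sin(theta) + B*cos(theta) = C.
    A = L_prime * (v_ld - v_lu)
    B = W * math.cos(alpha) * (v_ld - c_y)
    C = W * f_y * math.sin(alpha)

    # theta = arcsin(C / sqrt(A^2 + B^2)) - arctan(B / A), as in the text.
    theta = math.asin(C / math.hypot(A, B)) - math.atan(B / A)

    # Depth of the top-left corner, then of the lower-edge midpoint.
    Z_lu = L_prime * math.sin(theta)
    S = Z_lu + W * math.cos(alpha) * math.cos(theta) - 0.5 * L * math.sin(theta)
    return theta, S
```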

To train and evaluate our model, we constructed a dedicated solar panel dataset. Using a monocular RGB camera (1920×1080 resolution), we captured approximately 40 minutes of video footage of solar panels from a moving vehicle under various lighting and viewing angles. After frame extraction and manual filtering, we obtained 1,780 high-quality images. Each image was annotated using the Labelme tool, marking both the bounding box and the four corner keypoints for every visible solar panel. The dataset was split into 80% for training, 10% for validation, and 10% for testing.
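
Annotations exported by Labelme have to be converted to the pose-label format consumed by the detector. The sketch below shows one way to do this; it assumes one panel per file, annotated as a rectangle plus four point shapes with the hypothetical corner labels "lu", "ru", "ld", and "rd".

```python
import json

def labelme_to_yolo_pose(json_path, img_w=1920, img_h=1080):
    """Convert one Labelme JSON file into a YOLO pose label line:
    'class cx cy w h kpt1_x kpt1_y kpt1_v ...', all normalized to [0, 1].
    Assumes a single panel annotated as one rectangle + four named points."""
    with open(json_path) as f:
        ann = json.load(f)
    rect = next(s for s in ann["shapes"] if s["shape_type"] == "rectangle")
    (x1, y1), (x2, y2) = rect["points"]
    cx, cy = (x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h
    w, h = abs(x2 - x1) / img_w, abs(y2 - y1) / img_h
    fields = [0, cx, cy, w, h]                        # class id 0 = solar panel
    for name in ("lu", "ru", "ld", "rd"):             # fixed corner order (assumed labels)
        pt = next(s for s in ann["shapes"]
                  if s["shape_type"] == "point" and s["label"] == name)
        px, py = pt["points"][0]
        fields += [px / img_w, py / img_h, 2]         # visibility flag 2 = visible
    return " ".join(f"{v:.6f}" if isinstance(v, float) else str(v) for v in fields)
```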

We use standard keypoint detection metrics for evaluation, primarily the mean Average Precision based on Object Keypoint Similarity (OKS-based mAP). OKS is the standard metric for pose estimation, analogous to IoU for object detection. It measures the similarity between predicted and ground-truth keypoints, normalized by the scale of the object and a per-keypoint constant:
$$ \text{OKS} = \frac{\sum_i \exp(-d_i^2 / (2s^2 \kappa_i^2)) \cdot \delta(v_i > 0)}{\sum_i \delta(v_i > 0)} $$
where \(d_i\) is the Euclidean distance between the i-th predicted and ground-truth keypoint, \(s\) is the scale factor (square root of the object’s bounding box area), \(\kappa_i\) is a per-keypoint constant that controls falloff, and \(v_i\) is the visibility flag for the ground-truth keypoint. We report \(AP^{kp}_{50}\) (AP at OKS=0.50), \(AP^{kp}_{75}\) (AP at OKS=0.75), and \(AP^{kp}\) (the average AP over OKS thresholds from 0.50 to 0.95 with a step of 0.05). We also track model parameters and inference speed (Frames Per Second, FPS).

For the final pose estimation performance, we evaluate the accuracy of the calculated angle \(\theta\) and distance \(S\) against ground truth values obtained from manual measurement or simulation. We report the average error (\(\theta_{ave}\), \(S_{ave}\)) and the maximum error (\(\theta_{max}\), \(S_{max}\)) over the test set:
$$ \theta_{ave} = \frac{1}{K} \sum_{k=1}^{K} |\theta_k^{pred} - \theta_k^{gt}|, \quad \theta_{max} = \max_{k}\left(|\theta_k^{pred} - \theta_k^{gt}|\right) $$
$$ S_{ave} = \frac{1}{K} \sum_{k=1}^{K} |S_k^{pred} - S_k^{gt}|, \quad S_{max} = \max_{k}\left(|S_k^{pred} - S_k^{gt}|\right) $$
where \(K\) is the number of test samples.

All models were trained from scratch (no pre-trained weights) for 300 epochs with a batch size of 32 and an input image size of 640×640. We used an initial learning rate of 0.01, decaying to 0.001. Experiments were conducted on a system with an NVIDIA GeForce RTX 3090 GPU.
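
For reference, a baseline training run with these settings could be launched through the Ultralytics API roughly as follows. The YAML file names are placeholders, and the modified modules described earlier would need to be registered in a custom model definition before such a configuration could build the full model.

```python
from ultralytics import YOLO

# Build the pose model from an architecture YAML (no pre-trained weights).
model = YOLO("yolov8n-pose.yaml")

model.train(
    data="solar_panel_pose.yaml",   # placeholder dataset definition (4 keypoints per panel)
    epochs=300,
    batch=32,
    imgsz=640,
    lr0=0.01,     # initial learning rate
    lrf=0.1,      # final LR = lr0 * lrf = 0.001
    pretrained=False,
)
```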

The performance of our proposed model is compared against several state-of-the-art keypoint detection models on our solar panel test dataset. The results are summarized in the table below.

| Model | \(AP^{kp}_{50}\) (%) | \(AP^{kp}_{75}\) (%) | \(AP^{kp}\) (%) | FPS |
| --- | --- | --- | --- | --- |
| Mask R-CNN | 65.2 | 62.4 | 58.7 | 75.3 |
| YOLOv7-w6-pose | 79.1 | 70.9 | 65.5 | 138.9 |
| YOLOv8n-pose (Baseline) | 93.0 | 73.6 | 74.3 | 211.0 |
| RTMO-s | 82.6 | 67.4 | 67.3 | 154.4 |
| Our Proposed Model | 96.3 | 90.2 | 84.4 | 172.5 |

Our model achieves the highest accuracy across all AP metrics, with a significant 10.1% absolute improvement in \(AP^{kp}\) over the baseline YOLOv8n-pose. While there is a trade-off in inference speed (172.5 FPS vs. 211.0 FPS), the speed remains well within the requirements for real-time operation on a moving vehicle, and the accuracy gain is substantial for precise manipulation.

To validate the contribution of each proposed module, we conducted an ablation study. The results are presented in the following table. ‘+PSA’ denotes adding the PSA module to the backbone, ‘+ADown*’ and ‘+DySample’ denote using these modules in the neck.

| Model Configuration | \(AP^{kp}_{50}\) (%) | \(AP^{kp}_{75}\) (%) | \(AP^{kp}\) (%) | Params (M) |
| --- | --- | --- | --- | --- |
| Baseline (YOLOv8n-pose) | 93.0 | 73.6 | 74.3 | 2.8 |
| + PSA | 95.7 | 77.7 | 76.3 | 3.5 |
| + ADown* | 91.3 | 75.0 | 75.7 | 2.5 |
| + DySample | 91.9 | 74.2 | 75.8 | 2.8 |
| + PSA, + ADown* | 92.3 | 76.4 | 76.0 | 3.1 |
| + PSA, + DySample | 92.9 | 76.5 | 77.3 | 3.3 |
| + ADown*, + DySample | 94.3 | 83.1 | 81.2 | 2.5 |
| Full Model (all three) | 96.3 | 90.2 | 84.4 | 3.1 |

The ablation study shows that each module contributes to the final performance. The PSA module provides a consistent boost, especially in \(AP^{kp}_{50}\). The DySample and ADown* modules together offer a very strong improvement in \(AP^{kp}_{75}\) and \(AP^{kp}\), indicating they are particularly effective at achieving more precise keypoint localization (higher OKS thresholds). Notably, the combination of ADown* and DySample achieves excellent performance (81.2% \(AP^{kp}\)) with fewer parameters than the baseline. Our full model integrates all three enhancements, achieving the best overall accuracy with a moderate parameter increase to 3.1M.

The ultimate test is the accuracy of the derived pose parameters. We evaluated the angle and distance calculation using keypoints from both the baseline model and our improved model. The following tables show the average and maximum errors for a range of true angles and distances.

| True Angle \(\theta\) (deg) | Baseline \(\theta_{ave}\) (deg) | Baseline \(\theta_{max}\) (deg) | Proposed \(\theta_{ave}\) (deg) | Proposed \(\theta_{max}\) (deg) |
| --- | --- | --- | --- | --- |
| -50 | 10.3 | 14.1 | 8.2 | 10.4 |
| -40 | 8.9 | 14.4 | 7.8 | 13.7 |
| -30 | 9.6 | 13.9 | 5.6 | 11.8 |
| -20 | 12.5 | 14.5 | 9.2 | 10.3 |
| -10 | 9.1 | 10.0 | 5.0 | 7.8 |
| 0 | 8.3 | 9.7 | 7.4 | 8.9 |
| 10 | 9.4 | 15.0 | 6.0 | 9.9 |
| 20 | 11.9 | 13.6 | 8.7 | 12.4 |
| 30 | 6.7 | 10.1 | 5.2 | 8.6 |
| 40 | 8.4 | 12.5 | 5.5 | 9.0 |
| 50 | 8.2 | 13.1 | 7.6 | 12.5 |

Overall, the proposed model reduces the average angle error by 26.2% relative to the baseline.

| True Distance \(S\) (mm) | Baseline \(S_{ave}\) (mm) | Baseline \(S_{max}\) (mm) | Proposed \(S_{ave}\) (mm) | Proposed \(S_{max}\) (mm) |
| --- | --- | --- | --- | --- |
| 1600 | 121 | 133 | 98 | 108 |
| 1700 | 129 | 143 | 96 | 111 |
| 1800 | 116 | 129 | 101 | 113 |
| 1900 | 134 | 152 | 105 | 118 |
| 2000 | 119 | 138 | 98 | 113 |
| 2100 | 133 | 147 | 102 | 117 |
| 2200 | 141 | 156 | 112 | 136 |
| 2300 | 147 | 163 | 110 | 128 |
| 2400 | 143 | 158 | 109 | 135 |
| 2500 | 150 | 156 | 119 | 133 |
| 2600 | 146 | 159 | 107 | 126 |
| 2700 | 139 | 154 | 117 | 136 |
| 2800 | 148 | 163 | 121 | 136 |
| 2900 | 153 | 168 | 133 | 145 |
| 3000 | 162 | 174 | 135 | 154 |

Overall, the proposed model reduces the average distance error by 20.1% relative to the baseline.

The results clearly demonstrate the impact of improved keypoint detection on the final pose estimation. Using our enhanced model, the average error in calculating the solar panel tilt angle was reduced by 26.2%, and the average error in distance calculation was reduced by 20.1%, compared to using the baseline detector. The distance errors are within approximately 110 mm on average, and angle errors within 7 degrees on average, which is a significant precision level for guiding a robotic arm for deployment tasks. The maximum errors are also consistently lower with our model, indicating greater robustness.

This work presents a complete visual localization pipeline for solar panels using a single monocular camera. The core of the method is a highly accurate and efficient keypoint detection network, built upon an improved YOLOv8-pose architecture. The integration of the PSA attention mechanism enhances feature discriminability, while the DySample upsampler and our proposed ADown* downsampler work in tandem to enable more precise multi-scale feature fusion essential for corner localization. The geometric calculation module effectively translates the 2D keypoint measurements into 3D pose parameters (tilt and distance) by leveraging known physical dimensions of the solar panel and basic camera geometry. Experimental results on a custom dataset confirm that our approach achieves state-of-the-art keypoint detection accuracy for solar panels and, consequently, provides significantly more reliable pose estimates than the baseline method. The system maintains real-time performance, making it a practical and effective solution for enabling autonomous operations like robotic cleaning in photovoltaic power stations. Future work could explore temporal filtering of poses across video frames to further smooth estimates, extend the method to handle partially occluded solar panels using more advanced keypoint association models, or adapt it to different panel geometries and mounting configurations.
