In recent years, the rapid development of perovskite solar cells has positioned them as a leading technology in the photovoltaic field due to their high efficiency and low-cost fabrication potential. However, the complexity of multi-scale factors—ranging from material composition and device architecture to fabrication processes and environmental conditions—poses significant challenges for traditional optimization methods. In this work, we introduce a full-process artificial intelligence framework designed to address these challenges by leveraging a large-scale dataset and advanced machine learning techniques. This framework enables accurate performance prediction and targeted optimization for perovskite solar cells, facilitating the design of high-efficiency devices.
The foundation of our AI framework is a comprehensive dataset comprising over 20,000 experimentally measured perovskite solar cell samples, each characterized by approximately 260 multi-scale features. These features span the entire lifecycle of a device, including material selection, deposition methods, layer configurations, and testing environments.

[Figure: structure of a typical perovskite solar cell]

The dataset was curated from diverse sources to ensure broad coverage of perovskite solar cell variants. Key features include electron-transport layer (ETL) and hole-transport layer (HTL) sequences, perovskite composition, annealing conditions, and environmental factors during testing. We applied rigorous feature engineering to handle the high dimensionality and heterogeneity of the data. This involved target encoding for categorical variables and Pearson correlation analysis to eliminate redundant or irrelevant features. The correlation matrix for top features is summarized in Table 1, highlighting the relationships between critical parameters and device performance.

| Feature | PCE | VOC | JSC | FF |
|---|---|---|---|---|
| ETL Stack Sequence | 0.78 | 0.65 | 0.72 | 0.68 |
| HTL Stack Sequence | 0.75 | 0.62 | 0.70 | 0.66 |
| Annealing Temperature | 0.70 | 0.58 | 0.67 | 0.63 |
| Deposition Solvent | 0.68 | 0.55 | 0.65 | 0.60 |
| Perovskite Bandgap | 0.65 | 0.52 | 0.62 | 0.58 |

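
The feature-engineering steps described above (target encoding of categorical variables, then a Pearson-correlation filter) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the toy values, the feature names, and the 0.95 redundancy threshold are all hypothetical.

```python
import math
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (sx * sy)

def drop_redundant(features, threshold=0.95):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: ETL material, annealing temperature, measured PCE (%).
etl = ["SnO2", "TiO2", "SnO2", "TiO2", "SnO2", "ZnO"]
pce = [20.0, 17.0, 21.0, 18.0, 19.0, 15.0]
anneal_c = [100.0, 150.0, 100.0, 150.0, 110.0, 120.0]
anneal_k = [t + 273.15 for t in anneal_c]  # same information in kelvin, so redundant

features = {
    "etl_encoded": target_encode(etl, pce),
    "anneal_c": anneal_c,
    "anneal_k": anneal_k,
}
kept = drop_redundant(features)  # "anneal_k" is filtered out
```

In practice, the category means for target encoding are usually computed on the training split only, so that test-set targets do not leak into the features.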
To model the performance parameters of perovskite solar cells, namely power conversion efficiency (PCE), open-circuit voltage (VOC), short-circuit current density (JSC), and fill factor (FF), we employed multiple machine learning algorithms. The general form of the predictive model can be expressed as:
$$ y = f(X) + \epsilon $$
where \( y \) represents the target performance parameter (e.g., PCE), \( X \) is the feature matrix, and \( \epsilon \) denotes the error term. We evaluated models including ridge regression, random forest, gradient boosting, and XGBoost. The XGBoost algorithm demonstrated superior performance, with its objective function given by:
$$ \mathcal{L}(\phi) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) $$
Here, \( l \) is the loss function, \( \hat{y}_i \) is the predicted value for sample \( i \), \( f_k \) is the \( k \)-th tree in the ensemble, and \( \Omega \) penalizes model complexity (in XGBoost, through the number of leaves and the magnitude of the leaf weights). The learning curves for VOC prediction, shown in Figure 1, give R² values of 0.93 and 0.80 for the training and test sets, respectively, with a root mean square error (RMSE) of 0.02 for training and 0.04 for testing. The model therefore generalizes reasonably well, although the gap between training and test scores indicates some overfitting. Similar results were obtained for the other parameters, as summarized in Table 2.

| Parameter | R² (Training) | R² (Test) | RMSE (Training) | RMSE (Test) |
|---|---|---|---|---|
| PCE | 0.91 | 0.78 | 0.03 | 0.05 |
| VOC | 0.93 | 0.80 | 0.02 | 0.04 |
| JSC | 0.89 | 0.76 | 0.04 | 0.06 |
| FF | 0.87 | 0.74 | 0.05 | 0.07 |

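
For reference, the two regression metrics reported in Table 2 can be computed as in the following sketch. The measured and predicted VOC values are hypothetical and serve only to show the calculation; they are not taken from the dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical VOC values in volts.
voc_measured = [1.00, 1.05, 1.10, 1.15, 1.20]
voc_predicted = [1.02, 1.04, 1.11, 1.13, 1.21]

voc_r2 = r_squared(voc_measured, voc_predicted)
voc_rmse = rmse(voc_measured, voc_predicted)
```

Evaluating both metrics separately on the training and test splits, as in Table 2, is what exposes the generalization gap.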
In addition to regression, we developed classification models to categorize perovskite solar cells into performance tiers (e.g., low, medium, and high PCE). The PCE classifier achieved an overall accuracy of 82.63%, with per-class precision, recall, and F1-score detailed in Table 3. This dual approach of regression and classification enhances the reliability of the AI framework for performance simulation.

| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Low PCE (10-15%) | 0.869 | 0.854 | 0.861 |
| Medium PCE (15-20%) | 0.812 | 0.798 | 0.805 |
| High PCE (>20%) | 0.798 | 0.815 | 0.806 |

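
The F1-scores in Table 3 are the harmonic mean of precision and recall, which the short sketch below recomputes from the tabulated values. The confusion matrix at the end uses hypothetical counts, included only to show how per-class precision and recall, and the overall accuracy, follow from such a matrix.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in Table 3.
table3 = {
    "Low PCE": (0.869, 0.854),
    "Medium PCE": (0.812, 0.798),
    "High PCE": (0.798, 0.815),
}
f1 = {name: round(f1_score(p, r), 3) for name, (p, r) in table3.items()}

def per_class_metrics(confusion):
    """confusion[i][j] = number of samples with true class i predicted as class j."""
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        predicted_c = sum(confusion[i][c] for i in range(n))  # column sum
        actual_c = sum(confusion[c])                          # row sum
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        metrics.append((precision, recall))
    correct = sum(confusion[i][i] for i in range(n))
    total = sum(sum(row) for row in confusion)
    return metrics, correct / total

# Hypothetical counts for the (low, medium, high) tiers.
toy_confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
toy_metrics, toy_accuracy = per_class_metrics(toy_confusion)
```

The recomputed F1 values agree with the last column of Table 3 to three decimal places.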
To interpret the model decisions, we applied SHapley Additive exPlanations (SHAP) analysis. The SHAP values for the PCE regression model revealed that the ETL stack sequence is the most influential feature, followed by HTL stack sequence and annealing conditions. The SHAP value for a feature \( j \) is computed as:
$$ \phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|! \, (|N| - |S| - 1)!}{|N|!} \left[ f(S \cup \{j\}) - f(S) \right] $$
where \( N \) is the set of all features, and \( f(S) \) is the model prediction using feature subset \( S \). This analysis highlights the critical role of interface engineering in perovskite solar cells. For instance, optimizing the ETL and HTL sequences can reduce recombination losses, as described by the diode equation:
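$$ J = J_{\text{ph}} - J_0 \left( \exp\left(\frac{qV}{n k_{\text{B}} T}\right) - 1 \right) $$
where \( J \) is the current density at applied voltage \( V \), \( J_{\text{ph}} \) is the photocurrent density, \( J_0 \) is the saturation current density, \( n \) is the ideality factor, \( q \) is the elementary charge, \( k_{\text{B}} \) is the Boltzmann constant, and \( T \) is the absolute temperature. A lower \( J_0 \) corresponds to weaker recombination and therefore a higher open-circuit voltage. By prioritizing features with high SHAP importance, the framework guides material selection and process optimization.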
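To make the Shapley formula above concrete, the following sketch evaluates it exactly by enumerating every feature subset for a small toy model. The `toy_model` function, its feature names, and its numeric gains are hypothetical and are not the trained PCE regressor. Exact enumeration scales exponentially with the number of features, so it is practical only for a handful of them; SHAP implementations rely on faster algorithms, such as those specialized to tree ensembles.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values by enumerating all subsets S of N without j."""
    n = len(features)
    phi = {}
    for j in features:
        others = [f for f in features if f != j]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|N| - |S| - 1)! / |N|! from the formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {j}) - value_fn(s))
        phi[j] = total
    return phi

def toy_model(subset):
    """Hypothetical PCE gain (percentage points) when the listed features are optimized."""
    gain = 0.0
    if "etl" in subset:
        gain += 2.0
    if "htl" in subset:
        gain += 1.0
    if "etl" in subset and "htl" in subset:
        gain += 1.0  # interaction term, shared equally between the two features
    if "anneal" in subset:
        gain += 0.5
    return gain

features = ["etl", "htl", "anneal"]
phi = shapley_values(features, toy_model)
```

The attributions sum to the difference between the prediction with all features and the prediction with none, which is the additivity property that makes SHAP values directly comparable across features.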
To address the multi-scale complexity of perovskite solar cells, we implemented a clustering strategy using t-distributed stochastic neighbor embedding (t-SNE) and K-means algorithms. The t-SNE algorithm minimizes the Kullback-Leibler divergence between the pairwise-similarity distributions in the high- and low-dimensional spaces:
$$ \text{Cost} = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}} $$
where \( p_{ij} \) and \( q_{ij} \) are the joint pairwise similarities between samples \( i \) and \( j \) in the original and embedded spaces, respectively. This approach enabled the identification of distinct perovskite solar cell clusters based on feature combinations, such as annealing conditions or transport layer materials. For each cluster, statistical heatmaps and radar charts were generated to recommend optimal parameter values. For example, the annealing optimization for a specific cluster suggested a two-step process: ramping from 65 °C to 100 °C over 1 minute, then holding at 100 °C for 9 minutes. Such a temperature dependence is consistent with thermally activated kinetics, as described by the Arrhenius equation:
$$ k = A \exp\left(-\frac{E_a}{RT}\right) $$
where \( k \) is the rate constant, \( A \) is the pre-exponential factor, \( E_a \) is the activation energy, \( R \) is the gas constant, and \( T \) is the absolute temperature. Such insights facilitate targeted experimental design for perovskite solar cells.
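As a rough illustration of the clustering step described above, the sketch below runs K-means on a few two-dimensional points standing in for a t-SNE embedding, then summarizes PCE per cluster. The coordinates, PCE values, and initial centroids are hypothetical, and the embedding itself is assumed to have been computed beforehand.

```python
import math

def kmeans(points, init_centroids, max_iter=100):
    """Plain K-means (Lloyd's algorithm) on tuples of coordinates."""
    centroids = [tuple(c) for c in init_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D embedding coordinates and the measured PCE (%) of each device.
embedded = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
pce = [18.2, 18.9, 18.5, 21.0, 21.4, 20.8]

labels, centroids = kmeans(embedded, init_centroids=[embedded[0], embedded[3]])

# Per-cluster summary of the kind that could feed a heatmap or radar chart.
cluster_mean_pce = {}
for k in sorted(set(labels)):
    values = [y for y, lab in zip(pce, labels) if lab == k]
    cluster_mean_pce[k] = sum(values) / len(values)
```

Fixing the initial centroids keeps the example deterministic; library implementations typically use randomized initialization with several restarts.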
The expansion module of the AI framework was validated experimentally using 64 fabricated perovskite solar cells with novel feature values. The framework identified key optimization features, such as HTL material replacement, and recommended changes that improved PCE by 0.92 to 2.43 percentage points (absolute). The improvement can be interpreted through the efficiency formula:
$$ \text{PCE} = \frac{J_{\text{SC}} \times V_{\text{OC}} \times \text{FF}}{P_{\text{in}}} \times 100\% $$
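where \( J_{\text{SC}} \) is the short-circuit current density and \( P_{\text{in}} \) is the incident power density (typically 100 mW/cm² under standard AM1.5G illumination). This demonstration underscores the framework's adaptability to new perovskite solar cell designs and its potential to accelerate innovation in photovoltaic technology.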
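The efficiency formula can be tied back to the ideal diode equation with a short numerical sketch. The parameter values (photocurrent density, saturation current density, ideality factor) are hypothetical, and the model ignores series and shunt resistance, so the resulting fill factor is an upper bound rather than a realistic device value.

```python
import math

Q = 1.602176634e-19  # elementary charge, C
K_B = 1.380649e-23   # Boltzmann constant, J/K

def diode_current(v, j_ph, j_0, n, temp_k):
    """Ideal single-diode current density J(V), in the same units as j_ph."""
    return j_ph - j_0 * (math.exp(Q * v / (n * K_B * temp_k)) - 1.0)

def cell_metrics(j_ph, j_0, n, temp_k=298.15, p_in=100.0, steps=2000):
    """Return (J_SC, V_OC, FF, PCE in %); currents in mA/cm^2, p_in in mW/cm^2."""
    j_sc = diode_current(0.0, j_ph, j_0, n, temp_k)
    # V_OC is the voltage at which J = 0 in the diode equation.
    v_oc = (n * K_B * temp_k / Q) * math.log(j_ph / j_0 + 1.0)
    # Scan the J-V curve on a grid to locate the maximum power point.
    voltages = [v_oc * i / steps for i in range(steps + 1)]
    p_max = max(v * diode_current(v, j_ph, j_0, n, temp_k) for v in voltages)
    ff = p_max / (j_sc * v_oc)
    pce = j_sc * v_oc * ff / p_in * 100.0
    return j_sc, v_oc, ff, pce

# Hypothetical device parameters.
j_sc, v_oc, ff, pce = cell_metrics(j_ph=24.0, j_0=1e-11, n=1.5)

# Lower saturation current density (less recombination) raises V_OC and PCE.
_, v_oc_low_j0, _, pce_low_j0 = cell_metrics(j_ph=24.0, j_0=1e-12, n=1.5)
```

The second call shows the mechanism mentioned earlier: reducing recombination, for instance through better transport-layer interfaces, lowers \( J_0 \) and raises both \( V_{\text{OC}} \) and PCE.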
In conclusion, our AI framework provides a robust platform for the full-process design and optimization of perovskite solar cells. By integrating large-scale data, advanced machine learning models, and interpretable analytics, it addresses the multi-scale challenges inherent in perovskite solar cell development. Future work will focus on incorporating real-time data and expanding the framework to other photovoltaic materials, further solidifying the role of artificial intelligence in advancing renewable energy solutions.
