Predicting Remaining Useful Life of Sodium-ion Batteries Using Feature Selection-Based LightGBM Algorithm

The accurate prediction of the Remaining Useful Life (RUL) of batteries is a cornerstone for ensuring the reliability, safety, and economic viability of large-scale energy storage systems. Among emerging technologies, the sodium-ion battery has garnered significant attention as an alternative to lithium-ion batteries due to its advantages in cost, resource abundance, and low-temperature performance. Effective RUL prediction for sodium-ion batteries allows for proactive maintenance, maximizes utilization, prevents unexpected failures, and reduces lifecycle costs. This study proposes a robust data-driven framework that integrates a novel feature selection procedure with an optimized Light Gradient Boosting Machine (LightGBM) model to achieve high-accuracy RUL prediction for sodium-ion batteries.

Introduction and Motivation

The proliferation of renewable energy sources and electric vehicles has escalated the demand for efficient, safe, and cost-effective energy storage solutions. In this landscape, the sodium-ion battery presents a compelling proposition. Its core materials, such as sodium, iron, and manganese, are more abundant and geographically widespread than the lithium, cobalt, and nickel critical to lithium-ion batteries, leading to potentially lower material costs and enhanced supply chain security. Performance-wise, sodium-ion batteries exhibit excellent rate capability and superior performance in low-temperature environments. These attributes make the sodium-ion battery a strong candidate for applications in grid storage, low-speed electric vehicles, and backup power systems.

However, like all electrochemical energy storage devices, sodium-ion batteries degrade over time and usage. This degradation manifests as a gradual loss of capacity and an increase in internal resistance, ultimately leading to the end of the battery’s useful life when its State of Health (SOH) falls below a predefined threshold, typically 80% of its initial capacity. Predicting the RUL—the number of remaining charge-discharge cycles before this threshold is reached—is therefore critical. Accurate RUL estimation enables condition-based maintenance, prevents system downtime, optimizes replacement schedules, and enhances overall system safety for sodium-ion battery packs.

Existing RUL prediction methodologies can be broadly categorized into model-based and data-driven approaches. Model-based methods rely on constructing explicit electrochemical or empirical models of the sodium-ion battery’s degradation physics. While potentially insightful, these models often require profound domain expertise, are difficult to parameterize accurately under varying operational conditions, and may not generalize well. In contrast, data-driven methods leverage statistical and machine learning algorithms to learn the complex, non-linear relationship between measurable battery parameters and its degradation trend directly from historical or real-time operational data. These methods, including Support Vector Regression (SVR), Gaussian Process Regression (GPR), and Long Short-Term Memory (LSTM) networks, have shown great promise due to their adaptability and ability to handle complex patterns without requiring an explicit physical model.

Nevertheless, a common challenge in data-driven prognostics for sodium-ion batteries is the “curse of dimensionality” and feature redundancy. During battery aging tests, multiple parameters (e.g., voltage, current, temperature, internal resistance) are monitored, leading to a high-dimensional feature space. Not all features contribute equally to predicting the degradation of the sodium-ion battery; some are highly correlated with each other, introducing redundancy and noise that can impair model training efficiency and prediction accuracy. Therefore, an effective feature selection process is paramount to identify the most informative and non-redundant health indicators.

This work addresses these challenges by introducing a comprehensive framework for sodium-ion battery RUL prediction. The core contributions are twofold: First, we propose a hybrid feature selection procedure that combines Pearson Correlation Coefficient (PCC) and Grey Relational Grade (GRG) analysis to systematically identify an optimal, non-redundant feature set strongly correlated with the sodium-ion battery’s capacity fade. Second, we employ a LightGBM model, renowned for its high efficiency and accuracy, and further enhance its performance through hyperparameter optimization using Grid Search Cross-Validation (GridSearchCV). The efficacy of the proposed method is rigorously validated using cycle aging data from sodium-ion batteries, demonstrating superior prediction accuracy and robustness compared to other benchmark algorithms like Gradient Boosting Decision Tree (GBDT) and Random Forest (RF).

Sodium-Ion Battery Aging Experiment and Data Acquisition

To generate a realistic dataset for model development and validation, a long-term cycle aging experiment was conducted on commercially available sodium-ion battery cells. The cathode material was Na₄Fe₃(PO₄)₂P₂O₇. Prior to testing, the assembled cells were allowed to rest at room temperature for 10 hours to facilitate sufficient contact between the electrodes and electrolyte and stabilize the formation of the solid electrolyte interphase (SEI) layer.

The entire aging experiment was performed in a controlled thermal chamber maintaining a constant temperature of 30 °C. A standard cycling protocol was applied, as illustrated in the schematic below. The initial step involved a low-rate formation cycle at 0.1C to activate the cell. Following activation, the cells underwent continuous charge-discharge cycling at three different constant-current rates: 1C, 1.5C, and 2C. The charging protocol consisted of a Constant Current (CC) phase until the voltage reached 3.8 V, followed by a Constant Voltage (CV) phase at 3.8 V until the current dropped to a near-zero cutoff. After charging, a rest period of 5 minutes was instituted. The discharge phase was a CC discharge at the specified C-rate until the voltage fell to a cutoff of 1.8 V, followed by another 5-minute rest period. This sequence constituted one full cycle, repeated until the cell’s capacity faded to 80% of its initial rated capacity, defining its End of Life (EOL).

Throughout the cycling of the sodium-ion battery, a suite of nine key parameters was recorded at each cycle, forming the raw feature space for subsequent analysis:

F1: Discharge Specific Capacity (mAh/g)
F2: Discharge Specific Energy (Wh/kg)
F3: Discharge Capacitance (F)
F4: Median Voltage (V)
F5: Discharge Time (s)
F6: Discharge DC Internal Resistance (Ω)
F7: Net Discharge Capacity (mAh)
F8: Net Discharge Energy (Wh)
F9: Energy Efficiency (%)

The primary indicator of degradation for the sodium-ion battery, the State of Health (SOH), is defined as the ratio of its current maximum discharge capacity to its initial capacity:
$$ \text{SOH}(k) = \frac{C_k}{C_0} \times 100\% $$
where $C_k$ is the capacity at cycle $k$ and $C_0$ is the initial capacity. Consequently, the Remaining Useful Life (RUL) at cycle $k_t$ is calculated as:
$$ \text{RUL}(k_t) = N_{\text{EOL}} - k_t $$
where $N_{\text{EOL}}$ is the total cycle count when SOH reaches 80%.
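
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming cycles are indexed from 0, that $C_0$ is the first recorded capacity, and that end of life is the first cycle whose SOH is at or below 80%. The function names and the capacity trace are hypothetical and are not taken from the experiment.

```python
from typing import List, Optional


def soh_percent(capacities: List[float]) -> List[float]:
    """SOH(k) = C_k / C_0 * 100, with C_0 taken as the first recorded capacity."""
    c0 = capacities[0]
    return [100.0 * c / c0 for c in capacities]


def end_of_life_cycle(soh: List[float], threshold: float = 80.0) -> Optional[int]:
    """Index of the first cycle whose SOH is at or below the threshold, else None."""
    for k, value in enumerate(soh):
        if value <= threshold:
            return k
    return None


def rul_per_cycle(soh: List[float], threshold: float = 80.0) -> List[int]:
    """RUL(k_t) = N_EOL - k_t for every cycle up to and including end of life."""
    n_eol = end_of_life_cycle(soh, threshold)
    if n_eol is None:
        raise ValueError("SOH never reaches the end-of-life threshold")
    return [n_eol - k for k in range(n_eol + 1)]


# Hypothetical capacity trace (mAh), not measured data.
capacities = [100.0, 97.0, 93.0, 88.0, 84.0, 79.5, 76.0]
soh = soh_percent(capacities)
print(end_of_life_cycle(soh))  # 5
print(rul_per_cycle(soh))      # [5, 4, 3, 2, 1, 0]
```

Because RUL is only defined once $N_{\text{EOL}}$ is known, labels of this kind can be built only for cells that have been cycled all the way to end of life.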

The experimental results confirmed the expected degradation behavior of the sodium-ion battery. The SOH decreased monotonically with increasing cycle numbers across all tested C-rates. The total cycle life ($N_{\text{EOL}}$) was found to be dependent on the discharge rate: approximately 1955 cycles at 1C, 1717 cycles at 1.5C, and 1156 cycles at 2C, highlighting the impact of stress factors on the longevity of the sodium-ion battery. The trends of the measured parameters, such as decreasing median voltage and increasing internal resistance, provided the foundational data for the prognostic model.

Feature Engineering for Sodium-Ion Battery Prognostics

The raw data from the sodium-ion battery aging test contains nine time-series features. Using all features directly can lead to model overfitting, increased computational cost, and reduced interpretability due to multicollinearity. Therefore, a principled feature engineering step is essential to extract the most informative and compact set of health indicators (HIs).

Correlation Analysis

To quantify the relationship between each potential feature and the target RUL, two complementary correlation metrics were employed: the linear Pearson Correlation Coefficient (PCC) and the nonlinear Grey Relational Grade (GRG).

The PCC between a feature $F_i$ and the RUL sequence is given by:
$$ \rho_i = \frac{\sum_{k=1}^{n} (F_i(k) - \bar{F}_i)(\text{RUL}(k) - \overline{\text{RUL}})}{\sqrt{\sum_{k=1}^{n} (F_i(k) - \bar{F}_i)^2 \sum_{k=1}^{n} (\text{RUL}(k) - \overline{\text{RUL}})^2}} $$
where $n$ is the total number of cycles, and $\bar{F}_i$ and $\overline{\text{RUL}}$ are the mean values of the feature and RUL, respectively. $|\rho_i|$ close to 1 indicates a strong linear relationship.
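
As a minimal sketch of the PCC formula, the function below evaluates it term by term with the standard library only. The function name and the toy sequences are illustrative assumptions and are not values from the aging data.

```python
import math
from typing import Sequence


def pearson(feature: Sequence[float], rul: Sequence[float]) -> float:
    """Pearson correlation coefficient between one feature and the RUL sequence."""
    if len(feature) != len(rul) or len(feature) < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_f = sum(feature) / len(feature)
    mean_r = sum(rul) / len(rul)
    # Numerator: sum of products of the centred values.
    cov = sum((f - mean_f) * (r - mean_r) for f, r in zip(feature, rul))
    # Denominator: square root of the product of the two sums of squares.
    ss_f = sum((f - mean_f) ** 2 for f in feature)
    ss_r = sum((r - mean_r) ** 2 for r in rul)
    if ss_f == 0 or ss_r == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return cov / math.sqrt(ss_f * ss_r)


# Toy sequences: RUL counts down while one feature rises and the other falls.
rul = [4, 3, 2, 1, 0]
print(pearson([10, 11, 12, 13, 14], rul))   # -1.0
print(pearson([100, 98, 96, 94, 92], rul))  # 1.0
```

The sign only records the direction of the relationship, which is why the magnitude $|\rho_i|$ is what matters for ranking features.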

The GRG is particularly effective for analyzing dynamic processes with nonlinear relationships. The grey relational coefficient $\xi_i(k)$ at each point is calculated first:
$$ \xi_i(k) = \frac{\min_i \min_k |\text{RUL}(k)-F_i(k)| + \rho \cdot \max_i \max_k |\text{RUL}(k)-F_i(k)|}{|\text{RUL}(k)-F_i(k)| + \rho \cdot \max_i \max_k |\text{RUL}(k)-F_i(k)|} $$
where $\rho$ is the distinguishing coefficient, typically set to 0.5 (not to be confused with the Pearson coefficient $\rho_i$ defined above). The overall GRG $r_i$ for feature $i$ is then the average of all grey relational coefficients:
$$ r_i = \frac{1}{n} \sum_{k=1}^{n} \xi_i(k) $$
A GRG value closer to 1 signifies a stronger dynamic relational degree with the target.
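
The GRG computation can be sketched as follows. One assumption is added that the text does not state: each sequence is first rescaled to [0, 1], a common preprocessing step in grey relational analysis that makes differences between quantities with different units comparable. The function names and toy sequences are hypothetical.

```python
from typing import Dict, List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale a sequence to [0, 1] (assumed preprocessing step)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant sequence")
    return [(v - lo) / (hi - lo) for v in values]


def grey_relational_grades(
    rul: Sequence[float],
    features: Dict[str, Sequence[float]],
    rho: float = 0.5,  # distinguishing coefficient
) -> Dict[str, float]:
    """Average grey relational coefficient of each feature against RUL."""
    ref = min_max(rul)
    # Absolute differences |RUL(k) - F_i(k)| for every feature i and cycle k.
    deltas = {
        name: [abs(r - f) for r, f in zip(ref, min_max(seq))]
        for name, seq in features.items()
    }
    # The min and max are taken over all features and all cycles at once.
    flat = [d for series in deltas.values() for d in series]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:
        return {name: 1.0 for name in deltas}
    return {
        name: sum((d_min + rho * d_max) / (d + rho * d_max) for d in series) / len(series)
        for name, series in deltas.items()
    }


# Toy sequences, not measured data.
grades = grey_relational_grades(
    rul=[4, 3, 2, 1, 0],
    features={"same_shape": [8, 6, 4, 2, 0], "opposite": [0, 1, 2, 3, 4]},
)
print({name: round(value, 3) for name, value in grades.items()})
```

Unlike $|\rho_i|$, the grade computed this way depends on the direction of the trend: a feature that mirrors the shape of the RUL curve scores 1, while one that moves in the opposite direction scores lower.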

The calculated PCC and GRG values for all nine features from a representative sodium-ion battery test are listed in the tables below. The two methods produce the same ranking of the features, identifying a subset with high correlation to RUL.

Table 1: Pearson Correlation Coefficients (PCC) with RUL
Feature	Description	PCC ($\rho_i$)
F1	Discharge Specific Capacity	-0.992
F6	Discharge DC Internal Resistance	0.985
F4	Median Voltage	-0.978
F3	Discharge Capacitance	-0.975
F7	Net Discharge Capacity	-0.974
F2	Discharge Specific Energy	-0.970
F8	Net Discharge Energy	-0.966
F5	Discharge Time	-0.942
F9	Energy Efficiency	-0.512

Table 2: Grey Relational Grades (GRG) with RUL
Feature	Description	GRG ($r_i$)
F1	Discharge Specific Capacity	0.943
F6	Discharge DC Internal Resistance	0.937
F4	Median Voltage	0.914
F3	Discharge Capacitance	0.913
F7	Net Discharge Capacity	0.907
F2	Discharge Specific Energy	0.903
F8	Net Discharge Energy	0.891
F5	Discharge Time	0.851
F9	Energy Efficiency	0.649

Optimal Feature Selection Procedure

The analysis reveals that many features are highly correlated not only with RUL but also with each other (e.g., F1, F2, F7, F8 are all strongly related to capacity). To eliminate redundancy and select an optimal feature subset, the following algorithmic procedure was designed and implemented:

Procedure: Hybrid Feature Selection

Calculate the PCC and GRG of every feature against RUL, together with the pairwise inter-feature PCC and GRG matrices.
Identify candidate features where $|\rho_i| \geq 0.5$ OR $r_i \geq 0.8$ (high correlation with target).
Sort the candidate list $F = [f_1, f_2, \ldots, f_m]$ in descending order of their average correlation score.
Iterate through the sorted list. For each feature $f_i$, check its correlation with every other feature $f_j (j > i)$ in the candidate set.
If the inter-feature correlation $\rho_{ij} > 0.98$ OR $r_{ij} > 0.98$ (indicating high redundancy), remove the feature with the lower average correlation to the target from the set $F$.
Output the final filtered feature set $F$.
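
The procedure above can be sketched as a greedy filter. Two details are assumptions rather than statements from the text: the "average correlation score" is taken as the mean of $|\rho_i|$ and $r_i$, and the inter-feature PCC is compared by absolute value. The function name, the feature labels A to D, and all numbers are hypothetical.

```python
from typing import Dict, FrozenSet, List

Pairwise = Dict[FrozenSet[str], float]


def select_features(
    pcc_target: Dict[str, float],
    grg_target: Dict[str, float],
    pcc_inter: Pairwise,
    grg_inter: Pairwise,
    min_pcc: float = 0.5,
    min_grg: float = 0.8,
    redundancy: float = 0.98,
) -> List[str]:
    """Keep relevant features, dropping the weaker member of each redundant pair."""
    # Assumed scoring rule: mean of |PCC| and GRG against the target.
    score = {f: (abs(pcc_target[f]) + grg_target[f]) / 2 for f in pcc_target}
    candidates = [
        f for f in pcc_target
        if abs(pcc_target[f]) >= min_pcc or grg_target[f] >= min_grg
    ]
    candidates.sort(key=lambda f: score[f], reverse=True)

    selected: List[str] = []
    for f in candidates:
        # f ranks below everything already selected, so f is the one removed
        # whenever it is redundant with a surviving feature.
        is_redundant = any(
            abs(pcc_inter.get(frozenset((f, g)), 0.0)) > redundancy
            or grg_inter.get(frozenset((f, g)), 0.0) > redundancy
            for g in selected
        )
        if not is_redundant:
            selected.append(f)
    return selected


# Hypothetical scores for four features A-D (not the values of Tables 1 and 2).
chosen = select_features(
    pcc_target={"A": -0.99, "B": -0.97, "C": 0.98, "D": -0.30},
    grg_target={"A": 0.94, "B": 0.90, "C": 0.93, "D": 0.60},
    pcc_inter={frozenset(("A", "B")): 0.995, frozenset(("A", "C")): -0.96,
               frozenset(("B", "C")): -0.95},
    grg_inter={frozenset(("A", "B")): 0.97, frozenset(("A", "C")): 0.90,
               frozenset(("B", "C")): 0.89},
)
print(chosen)  # ['A', 'C']
```

In this toy run D fails both relevance thresholds, and B is discarded because it is nearly collinear with the higher-ranked A.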

Applying this procedure to the sodium-ion battery aging data resulted in the selection of four optimal features: Discharge Specific Capacity (F1), Discharge Capacitance (F3), Median Voltage (F4), and Discharge DC Internal Resistance (F6). This set maintains very high predictive relevance for the sodium-ion battery’s RUL while minimizing information redundancy. The original 9-dimensional feature space was thus reduced to a more efficient 4-dimensional space.

Finally, to ensure stable and fast model convergence, the selected feature set was normalized to a [0, 1] range using Min-Max scaling:
$$ X_{\text{norm}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}} $$
where $X$ is the original feature value, and $X_{\text{min}}$ and $X_{\text{max}}$ are the minimum and maximum values of that feature over the entire cycle life of the sodium-ion battery.
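
A minimal sketch of the scaling step for a single feature column is shown below. The function name and the sample values are illustrative only.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Map one feature column onto [0, 1] using its own minimum and maximum."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("feature is constant; min-max scaling is undefined")
    return [(x - x_min) / (x_max - x_min) for x in values]


print(min_max_scale([1.0, 1.5, 3.0]))  # [0.0, 0.25, 1.0]
```

Scaling each of the four selected features separately puts them on a common range regardless of their original units.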

RUL Prediction Model: LightGBM with Hyperparameter Optimization

Model Architecture: LightGBM

For the regression task of predicting the continuous RUL value of the sodium-ion battery, the Light Gradient Boosting Machine (LightGBM) algorithm was chosen. LightGBM is a high-performance gradient boosting framework that constructs an ensemble of decision trees sequentially. Its core advantages, which are particularly beneficial for processing time-series data from sodium-ion battery tests, include:

High Efficiency & Speed: It can use Gradient-based One-Side Sampling (GOSS), which keeps all instances with large gradients and randomly subsamples those with small gradients, speeding up training with little loss of accuracy.
Low Memory Usage: It employs Exclusive Feature Bundling (EFB) to bundle mutually exclusive sparse features, reducing the effective number of features.
Superior Accuracy: It grows trees leaf-wise (best-first) rather than level-wise, often leading to lower loss and better model performance.
Native Handling of Non-linearity: It can effectively model the complex, non-linear degradation patterns inherent in sodium-ion battery aging.
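
To make the GOSS idea in the list above concrete, here is a minimal sketch of the sampling step as it is usually described: keep the fraction of instances with the largest absolute gradients, draw a random sample from the remainder, and up-weight the sampled instances so that the gradient statistics stay approximately unbiased. The function name, argument names, and gradient values are illustrative; this is not LightGBM's internal code.

```python
import random
from typing import List, Sequence, Tuple


def goss_sample(
    gradients: Sequence[float],
    top_rate: float,
    other_rate: float,
    seed: int = 0,
) -> List[Tuple[int, float]]:
    """Return (instance index, weight) pairs selected by one-side sampling."""
    if not 0 < other_rate <= 1 or not 0 <= top_rate < 1:
        raise ValueError("rates must lie in (0, 1)")
    n = len(gradients)
    # Rank instances by the magnitude of their gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    # Small-gradient instances are subsampled uniformly at random.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    # Sampled instances are amplified by (1 - a) / b to compensate.
    amplify = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, amplify) for i in sampled]


# Hypothetical per-instance gradients.
grads = [0.1, -2.5, 0.3, 1.8, -0.2, 0.05, 0.4, -0.15, 0.25, 0.02]
picked = goss_sample(grads, top_rate=0.2, other_rate=0.3)
print(len(picked), "of", len(grads), "instances used for this tree")
```

With these rates only half of the instances take part in split finding for the tree, and the two instances with the largest gradients are always among them.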

The objective function at the $t$-th iteration of LightGBM combines a differentiable loss function $l$ (e.g., Mean Squared Error) and a regularization term $\Omega$:
$$ \text{obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, H_{t-1}(x_i) + h_t(x_i)\right) + \sum_{j=1}^{t} \Omega(h_j) $$
where $y_i$ is the true RUL, $H_{t-1}$ is the combined model from previous $t-1$ trees, $h_t$ is the new tree at iteration $t$, and $\Omega(h_j)$ penalizes model complexity to prevent overfitting.

Hyperparameter Tuning via GridSearchCV

The performance of LightGBM is highly sensitive to its hyperparameters. To maximize the predictive capability for the sodium-ion battery RUL, a comprehensive hyperparameter optimization was conducted using Grid Search with Cross-Validation (GridSearchCV). This method exhaustively searches through a manually specified subset of the hyperparameter space, evaluating each combination using a cross-validation strategy to guard against overfitting. For time-series data from the sodium-ion battery, a TimeSeriesSplit cross-validator was used, which respects the temporal order of cycles, ensuring that the model is always trained on past data and validated on future data.

The key hyperparameters tuned for the LightGBM model and their search grid were:

num_leaves: The maximum number of leaves in one tree. Controls model complexity. Grid: [15, 31, 63].
learning_rate: The shrinkage rate applied to each tree’s contribution. Grid: [0.01, 0.05, 0.1].
n_estimators: The number of boosting iterations (trees). Grid: [50, 100, 150].

The GridSearchCV process identifies the combination that minimizes the chosen evaluation metric (e.g., Root Mean Squared Error – RMSE) on the validation folds. This rigorous optimization ensures the LightGBM model is tailored specifically to the degradation characteristics of the sodium-ion battery data.
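
The size of the search and the shape of the time-ordered folds can be illustrated without any machine-learning library. The sketch below enumerates the grid listed above and builds expanding-window folds in the manner of scikit-learn's `TimeSeriesSplit`. The number of folds and the sample count are assumptions, since neither is stated in the text, and the helper names are hypothetical.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple

PARAM_GRID: Dict[str, list] = {
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [50, 100, 150],
}


def grid_candidates(grid: Dict[str, list]) -> List[dict]:
    """Every combination of the listed hyperparameter values."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def expanding_window_splits(
    n_samples: int, n_splits: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists where training always precedes validation."""
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    first_start = n_samples - n_splits * fold_size
    for start in range(first_start, n_samples, fold_size):
        yield list(range(start)), list(range(start, start + fold_size))


candidates = grid_candidates(PARAM_GRID)
folds = list(expanding_window_splits(n_samples=12, n_splits=3))  # assumed sizes
print(len(candidates), "combinations x", len(folds), "folds =",
      len(candidates) * len(folds), "model fits")
for train_idx, val_idx in folds:
    print("train up to cycle", train_idx[-1], "-> validate on", val_idx)
```

With scikit-learn and LightGBM installed, the same dictionary could be passed as `param_grid` to `GridSearchCV`, with an `LGBMRegressor` estimator, `cv=TimeSeriesSplit(n_splits=...)`, and `scoring="neg_root_mean_squared_error"`.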

Table 3: Optimal Hyperparameters for LightGBM Model
Hyperparameter	Optimal Value	Description
num_leaves	31	Balances complexity and overfitting risk.
learning_rate	0.1	Provides a good trade-off between convergence speed and stability.
n_estimators	100	Sufficient number of trees to capture degradation patterns.

Experimental Results and Performance Analysis

The proposed framework—feature selection followed by GridSearchCV-optimized LightGBM (FS-GS-LightGBM)—was evaluated on the aging datasets of the sodium-ion battery at three different discharge rates (1C, 1.5C, 2C). To demonstrate its superiority, it was compared against two other powerful tree-based ensemble models: a GridSearchCV-optimized Gradient Boosting Decision Tree (GS-GBDT) and a GridSearchCV-optimized Random Forest (GS-RF). All models were trained and tested under identical conditions, using the same 4 optimal features selected by our procedure and the same TimeSeriesSplit for data partitioning.

Performance Metrics

Three standard regression metrics were used to quantitatively assess the prediction accuracy for the sodium-ion battery RUL:

Mean Absolute Error (MAE): $ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| $
Mean Squared Error (MSE): $ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $
Root Mean Squared Error (RMSE): $ \text{RMSE} = \sqrt{\text{MSE}} $

where $y_i$ is the true RUL and $\hat{y}_i$ is the predicted RUL. Lower values for all metrics indicate better predictive performance.
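
A minimal sketch of the three metrics follows. The function names and the toy RUL values are illustrative and are not predictions from any of the models.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error, in the same unit as the target (cycles)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error, in squared target units (cycles squared)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, back in the unit of the target."""
    return math.sqrt(mse(y_true, y_pred))


# Toy true and predicted RUL values in cycles.
y_true = [10, 8, 6, 4]
y_pred = [11, 8, 5, 2]
print(mae(y_true, y_pred))   # 1.0
print(mse(y_true, y_pred))   # 1.5
print(round(rmse(y_true, y_pred), 3))
```

Because the errors are squared before averaging, MSE and RMSE penalize a single large miss more heavily than MAE does, which is why the RMSE here exceeds the MAE.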

Prediction Results and Comparative Analysis

The visual comparison of RUL prediction curves for the 2C rate sodium-ion battery data is highly illustrative. The FS-GS-LightGBM model’s predictions closely track the true RUL trajectory throughout the battery’s entire lifespan, with a very narrow error band. In contrast, the GS-GBDT and GS-RF models show significantly larger deviations and more unstable predictions, especially in the later stages of life where accurate prognosis is most critical. This trend is consistent across all three tested C-rates, underscoring the robustness of the proposed method.

The quantitative results, consolidated in the table below, provide definitive evidence of the FS-GS-LightGBM model’s excellence in predicting the remaining useful life of the sodium-ion battery.

Table 4: Comparative Performance of Prediction Models for Sodium-Ion Battery RUL
Discharge Rate	Model	MAE (cycles)	MSE (cycles²)	RMSE (cycles)
2C	FS-GS-LightGBM (Proposed)	1.32	3.34	1.83
	GS-GBDT	7.86	144.23	12.01
	GS-RF	18.04	749.10	27.37
1.5C	FS-GS-LightGBM (Proposed)	1.48	4.32	2.08
	GS-GBDT	4.21	44.61	6.68
	GS-RF	5.47	59.50	7.71
1C	FS-GS-LightGBM (Proposed)	2.97	17.68	4.20
	GS-GBDT	5.41	66.77	8.17
	GS-RF	9.21	180.02	13.42

The proposed model achieves remarkably low errors. Across all test scenarios for the sodium-ion battery, its MAE never exceeds 3.0 cycles, its MSE stays below 17.7 cycles², and its RMSE does not exceed 4.20 cycles. This level of accuracy is substantially superior to both GS-GBDT and GS-RF. For instance, at the 2C rate, the RMSE of the proposed model is about 85% lower than that of GS-GBDT and about 93% lower than that of GS-RF.
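
The relative improvements quoted for the 2C case follow from the RMSE column of Table 4. A short check, with the helper name chosen here for illustration:

```python
def relative_reduction(proposed: float, baseline: float) -> float:
    """Percentage by which the proposed model's error is below the baseline's."""
    return 100.0 * (1.0 - proposed / baseline)


# RMSE values at 2C, copied from Table 4 (cycles).
rmse_2c = {"FS-GS-LightGBM": 1.83, "GS-GBDT": 12.01, "GS-RF": 27.37}

vs_gbdt = relative_reduction(rmse_2c["FS-GS-LightGBM"], rmse_2c["GS-GBDT"])
vs_rf = relative_reduction(rmse_2c["FS-GS-LightGBM"], rmse_2c["GS-RF"])
print(f"vs GS-GBDT: {vs_gbdt:.1f}%  vs GS-RF: {vs_rf:.1f}%")
```

The two values come out at roughly 84.8% and 93.3%, matching the rounded figures in the text.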

Impact of Feature Selection on Computational Efficiency

Beyond accuracy, the feature selection procedure delivers a significant practical benefit: a drastic reduction in model training time. By eliminating 5 redundant features from the sodium-ion battery dataset, the dimensionality of the input space is more than halved. The table below compares the training times (in seconds) for the optimized models with and without the feature selection step.

Table 5: Model Training Time Comparison (with/without Feature Selection)
Model	2C (With FS / Without FS)	1.5C (With FS / Without FS)	1C (With FS / Without FS)
GS-LightGBM	4.47s / 18.11s	4.98s / 19.10s	5.54s / 19.21s
GS-GBDT	29.10s / 58.00s	27.83s / 63.23s	46.50s / 122.96s
GS-RF	9.98s / 14.05s	11.50s / 16.55s	21.91s / 30.72s

The results are clear: feature selection accelerates the training process for all models. The benefit is most pronounced for the LightGBM model, where training time is reduced by roughly 71% to 75% depending on the discharge rate, compared with about 50% to 62% for GS-GBDT and about 29% to 31% for GS-RF. This makes the overall framework not only more accurate but also significantly more efficient, a crucial factor for potential real-time or online prognostic applications for sodium-ion battery management systems.
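
The per-model savings can be recomputed directly from Table 5. The sketch below uses only the tabulated times; the variable and function names are illustrative.

```python
from typing import Dict, Tuple

# (with FS, without FS) training times in seconds, copied from Table 5.
TRAIN_TIMES: Dict[str, Dict[str, Tuple[float, float]]] = {
    "GS-LightGBM": {"2C": (4.47, 18.11), "1.5C": (4.98, 19.10), "1C": (5.54, 19.21)},
    "GS-GBDT": {"2C": (29.10, 58.00), "1.5C": (27.83, 63.23), "1C": (46.50, 122.96)},
    "GS-RF": {"2C": (9.98, 14.05), "1.5C": (11.50, 16.55), "1C": (21.91, 30.72)},
}


def time_saving_percent(with_fs: float, without_fs: float) -> float:
    """Share of training time removed by feature selection, in percent."""
    return 100.0 * (without_fs - with_fs) / without_fs


savings = {
    model: {rate: time_saving_percent(*pair) for rate, pair in by_rate.items()}
    for model, by_rate in TRAIN_TIMES.items()
}
for model, by_rate in savings.items():
    print(model, {rate: round(value, 1) for rate, value in by_rate.items()})
```

For LightGBM the saving is about 75% at 2C, 74% at 1.5C, and 71% at 1C, and it is larger than the saving for GBDT or Random Forest at every rate.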

Conclusion

This study successfully developed and validated a high-performance data-driven framework for predicting the Remaining Useful Life (RUL) of sodium-ion batteries. The framework’s strength lies in its two-stage design: First, a rigorous feature selection procedure that hybridizes Pearson Correlation Coefficient and Grey Relational Grade analyses to distill the aging data from the sodium-ion battery into a compact, informative, and non-redundant set of four optimal health indicators. Second, the application of a powerful Light Gradient Boosting Machine (LightGBM) model, whose hyperparameters are meticulously optimized via Grid Search Cross-Validation to tailor it specifically to the degradation dynamics of the sodium-ion battery.

Experimental validation on cycle aging data under multiple discharge rates (1C, 1.5C, 2C) demonstrated the framework’s exceptional accuracy and robustness. The proposed FS-GS-LightGBM model consistently outperformed optimized GBDT and Random Forest models, achieving significantly lower prediction errors (MAE < 3.0 cycles, RMSE ≤ 4.20 cycles). Furthermore, the feature selection step provided a substantial computational advantage, reducing model training time by up to 75%, thereby enhancing the framework’s practicality for implementation.

The outcomes of this research affirm that the combination of intelligent feature engineering and advanced, optimized machine learning algorithms like LightGBM provides a reliable and effective pathway for sodium-ion battery prognostics. This work contributes a valuable tool towards improving the management, safety, and economic performance of energy storage systems based on sodium-ion battery technology, supporting its broader adoption in sustainable energy applications.