Optimization of Battery Energy Storage System Scheduling Based on Predictive-Guided Deep Deterministic Policy Gradient

In recent years, the integration of distributed energy resources has accelerated, and the battery energy storage system (BESS) has emerged as a critical means of addressing intermittency and volatility in energy systems. Combining BESS with photovoltaic (PV) systems enables surplus solar energy to be stored efficiently and discharged during the most valuable periods, which is vital for cost savings, emission reductions, and load stabilization in microgrid scenarios such as residential communities and industrial parks. However, making optimal charging and discharging decisions for BESS in real time, based on fluctuating PV generation and load demand, remains a complex challenge. This complexity is exacerbated in coordinated multi-building scenarios with multiple BESS units, where uncertainties on both the generation and load sides, along with the distributed nature of the buildings, amplify the difficulty. Developing intelligent, transferable, and flexible BESS scheduling strategies is therefore a key research focus in energy management systems.

To tackle these issues, various approaches have been proposed. Some studies combine rule-based control with robust optimization to mitigate environmental uncertainties, but these methods tend to be overly conservative, reducing economic benefit and flexibility. Others formulate economic scheduling models solved with dynamic programming or genetic algorithms, which improve economic efficiency and battery protection but converge slowly on multi-objective constrained problems and cannot guarantee optimal solutions. Model predictive control (MPC) techniques leverage forecasts of future generation and load to achieve efficient energy scheduling, but their performance depends heavily on the accuracy of prediction models, which are difficult to construct. More recently, deep reinforcement learning (DRL) has gained traction in BESS scheduling because it learns control strategies through interaction with the environment, without requiring a complex optimization model. However, existing DRL methods often overlook inter-building communication and action constraints, limiting their ability to achieve collective benefits and safe operation in multi-agent settings.

In this study, I propose a Predictive-Guided, Attention-based Deep Deterministic Policy Gradient (PGADDPG) approach to optimize BESS scheduling. The method builds on the Deep Deterministic Policy Gradient (DDPG) framework and incorporates a hybrid reward function, attention mechanisms for communication between BESS units, and rolling prediction-guided control to enhance policy learning. Specifically, I design a multi-objective reward function that encourages efficient BESS utilization while balancing individual and collective goals. I integrate an attention mechanism into the policy network so that BESS agents can share partial information, such as net electricity consumption and state of charge (SOC), fostering coordinated control. Furthermore, I employ a Self-Attention Bidirectional Long Short-Term Memory (SA-BiLSTM) network for 24-hour rolling predictions of PV output and an adaptive cyclic average model for load forecasting. Based on these predictions, a hierarchical control strategy is formulated to guide DRL training, with adaptive weight adjustment and entropy regularization used to smooth policy improvement and avoid local optima. The overall framework is designed to handle the uncertainties of microgrid environments effectively.

The environmental model for the microgrid involves multiple residential buildings equipped with BESS and PV systems. At any time step \( t \), the net exchange power \( P^G_t \) is determined by the difference between power purchased from the grid \( P^{G2MG}_t \) and power sold to the grid \( P^{MG2G}_t \), with the constraint that buying and selling cannot occur simultaneously. The PV generation \( P^{PV}_t \) prioritizes meeting the load demand \( P^{Load}_t \), and any surplus is used to update the BESS power \( P^{BESS}_t \) through charging, discharging, or selling, ensuring energy balance across system components. The SOC constraints and energy update equations for the BESS are defined as follows:

$$ SOC_t = E_t / Q $$
$$ Q \cdot SOC^{min} \leq E_t \leq Q \cdot SOC^{max} $$
$$ E_{t+1} = (1 - \eta_{sd}) E_t + \left( \eta_{c,t} P_{c,t} + P_{d,t} / \eta_{d,t} \right) \Delta t $$

Here, \( SOC_t \) represents the battery state of charge within the range \( [SOC^{min}, SOC^{max}] \), \( E_t \) is the battery energy, \( Q \) is the maximum battery capacity, and \( P_{c,t} \) and \( P_{d,t} \) are the charging and discharging powers, respectively, with the constraint that charging and discharging cannot occur simultaneously. The efficiencies \( \eta_{c,t} \), \( \eta_{d,t} \), and \( \eta_{sd} \) denote charging efficiency, discharging efficiency, and self-discharge rate, respectively, which are dynamically adjusted using piecewise linear functions and interpolation intervals to update the battery energy \( E_{t+1} \) at the next time step.
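To make the update concrete, the following is a minimal Python sketch of the SOC and energy-update equations above. The constant names, the one-hour step length, and the sign convention (discharging power \( P_{d,t} \le 0 \)) are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

# Minimal sketch of the SOC and energy update described above; the sign
# convention (p_d <= 0 for discharging) and the step length are assumptions.
Q_KWH = 11.0                       # battery capacity Q
SOC_MIN, SOC_MAX = 0.05, 0.91
ETA_SD = 1e-5                      # self-discharge rate per step

def update_energy(e_t: float, p_c: float, p_d: float,
                  eta_c: float, eta_d: float, dt_h: float = 1.0) -> float:
    """E_{t+1} = (1 - eta_sd) E_t + (eta_c p_c + p_d / eta_d) dt, clipped to the SOC window."""
    assert p_c == 0.0 or p_d == 0.0, "charging and discharging cannot occur simultaneously"
    e_next = (1.0 - ETA_SD) * e_t + (eta_c * p_c + p_d / eta_d) * dt_h
    return float(np.clip(e_next, SOC_MIN * Q_KWH, SOC_MAX * Q_KWH))

soc_next = update_energy(e_t=5.0, p_c=3.0, p_d=0.0, eta_c=0.94, eta_d=0.94) / Q_KWH  # SOC_t = E_t / Q
```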

The control network architecture of PGADDPG is illustrated in the following diagram, which shows the integration of DRL agents, attention mechanisms, and prediction-guided control. In this framework, each DRL agent observes environmental states, including date, temperature, humidity, solar irradiance, load, PV output, SOC, net electricity consumption, and electricity prices. Based on these observations and rewards, the agents generate control actions, which are stored in an experience replay buffer for training. The attention mechanism facilitates communication between agents by sharing selected information, such as net consumption and SOC, allowing each agent to account for collective goals. Meanwhile, rolling predictions of load and PV output are produced by machine learning models, and a hierarchical control strategy derived from these predictions guides DRL training. This strategy is combined with the DDPG policy through adaptive weighting, and entropy regularization is applied to encourage exploration and prevent overfitting to the guided policy.

The multi-objective hybrid reward function is designed to promote effective BESS utilization while maintaining numerical balance and aligning individual and collective objectives. For a building \( b \) at time \( t \), the base reward \( R_c(b,t) \) and penalty coefficient \( X_0(b,t) \) are computed as:

$$ X_0(b,t) = -\left( 1.4 + 1.2 \times \text{sign}(C_e(b,t)) \times SOC(b,t) \right) $$
$$ R_c(b,t) = X_0(b,t) \times \left( C_e(b,t) + C_c(b,t) + \left( \sum_{b'=1}^N C_e(b',t) \right)^2 + \left( \sum_{b'=1}^N C_c(b',t) \right)^2 \right) $$
$$ R(b,t) = R_c(b,t) + \sum_{i=1}^4 w_i X_i $$

Here, \( C_e(b,t) \) and \( C_c(b,t) \) are the electricity cost and carbon emission cost of building \( b \), obtained by multiplying the net electricity consumption by the electricity price and the carbon intensity, respectively. \( N \) is the number of buildings, and \( w_i \) are weighting coefficients for additional penalty terms \( X_i \), which penalize, for example, failing to charge before peak hours or failing to discharge during peak periods; these penalties curb unreasonable exploration by the agents. The reward function emphasizes energy saving and emission reduction: the sign function and the SOC factor in \( X_0(b,t) \) encourage active BESS control, and squaring the collective cost terms keeps the individual and collective components on comparable numerical scales.
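As an illustration, the sketch below evaluates \( X_0(b,t) \) and \( R_c(b,t) \) for all buildings at once with NumPy; the vectorized form and the omission of the additional penalty terms \( X_i \) are simplifications, not the exact implementation.

```python
import numpy as np

# Illustrative sketch of the base reward defined above for N buildings;
# the weighted penalty terms X_i are omitted for brevity.
def base_reward(c_e: np.ndarray, c_c: np.ndarray, soc: np.ndarray) -> np.ndarray:
    """c_e, c_c, soc: arrays of shape (N,) with electricity cost, carbon cost, and SOC."""
    x0 = -(1.4 + 1.2 * np.sign(c_e) * soc)            # penalty coefficient X_0(b, t)
    collective = np.sum(c_e) ** 2 + np.sum(c_c) ** 2  # squared collective costs
    return x0 * (c_e + c_c + collective)              # base reward R_c(b, t)

rewards = base_reward(np.array([0.8, 1.1]), np.array([0.3, 0.4]), np.array([0.5, 0.2]))
```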

For rolling predictions, I address the high volatility and limited correlation in load data using a weighted linear regression hybrid model. This model incorporates daily, weekly, and monthly cyclic averages to capture periodic patterns in load data. Historical data is normalized and filtered for outliers, and predictions are made for 24-hour horizons with weights assigned based on recency. The combined prediction \( \hat{y}_t \) is given by:

$$ \hat{y}_t = w_0 \cdot \hat{y}_t^0 + w_1 \cdot \hat{y}_t^1 + w_2 \cdot \hat{y}_t^2 + b $$

where \( \hat{y}_t^0 \), \( \hat{y}_t^1 \), and \( \hat{y}_t^2 \) are the predictions from the daily, weekly, and monthly cyclic average models, respectively, \( w_0 \), \( w_1 \), and \( w_2 \) are weights updated every 72 hours using gradient descent, and \( b \) is a bias term.

For PV prediction, I employ an SA-BiLSTM model that processes features such as date information, historical PV data, and solar irradiance. The Bi-LSTM component captures temporal dependencies, while the self-attention mechanism assigns weights to different sequence elements. The attention calculation for an output sequence \( X = (x_1, x_2, \dots, x_n) \) is as follows:

$$ e_{ij} = \frac{q_i^T k_j}{\sqrt{d_k}} $$
$$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^n \exp(e_{ik})} $$
$$ c_i = \sum_{j=1}^n \alpha_{ij} v_j $$
$$ \text{out}_i = W_o [q_i, c_i] + b_o $$

In these equations, \( e_{ij} \) and \( \alpha_{ij} \) are the attention energy and weight between elements \( x_i \) and \( x_j \); \( q_i \), \( k_j \), and \( v_j \) are the query, key, and value vectors (all set to the Bi-LSTM outputs); \( d_k \) is the dimension of the key vectors; \( n \) is the number of sequence elements (e.g., 24 hours); \( c_i \) is the weighted feature representation; and \( W_o \) and \( b_o \) are network parameters. The output is combined with the Bi-LSTM results via residual connections to enhance feature reuse and training efficiency.
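The following is a minimal PyTorch sketch of this scaled dot-product self-attention applied to a Bi-LSTM output sequence; the batch layout, hidden size, and single output projection are assumptions rather than the exact SA-BiLSTM architecture.

```python
import torch

# Minimal sketch of the self-attention step above, with q, k, v all set to the
# Bi-LSTM output sequence; shapes and the single projection layer are assumptions.
def self_attention(h: torch.Tensor, w_o: torch.nn.Linear) -> torch.Tensor:
    """h: (batch, n, d) Bi-LSTM outputs; returns (batch, n, d_out)."""
    d_k = h.size(-1)
    scores = torch.matmul(h, h.transpose(1, 2)) / d_k ** 0.5   # attention energies e_ij
    alpha = torch.softmax(scores, dim=-1)                      # attention weights alpha_ij
    context = torch.matmul(alpha, h)                           # weighted features c_i
    return w_o(torch.cat([h, context], dim=-1))                # out_i = W_o [q_i, c_i] + b_o

n, d = 24, 64                       # 24-hour sequence, assumed hidden size
h = torch.randn(1, n, 2 * d)        # Bi-LSTM output (forward + backward directions)
proj = torch.nn.Linear(4 * d, 2 * d)
out = self_attention(h, proj) + h   # residual connection with the Bi-LSTM output
```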

The prediction-driven hierarchical control strategy uses these forecasts to guide BESS actions. At each time step \( t \), the strategy computes the minimum required SOC during peak hours (\( SOC^{min}_{peak} \)) and the surplus PV energy before peak periods (\( SOC^{pre}_{peak} \)). If \( SOC^{pre}_{peak} \) is less than \( SOC^{min}_{peak} \), the battery is charged to meet the requirement; otherwise, lower-level control maximizes the use of surplus energy based on the next time step’s predictions:

$$ R^{pv}_{t+1} = E^{pv}_{t+1} - E^{load}_{t+1} $$
$$ a_t = R^{pv}_{t+1} / Q $$

Here, \( E^{pv}_{t+1} \) and \( E^{load}_{t+1} \) are the predicted PV and load values at \( t+1 \), \( R^{pv}_{t+1} \) is the surplus energy, and \( a_t \) is the BESS action. If \( R^{pv}_{t+1} \leq 0 \), discharging is limited to ensure priority for upper-level conditions.
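The lower-level rule could be sketched as follows; the discharge cap applied when no surplus is predicted is an assumed example value, since the actual limit is governed by the upper-level SOC requirement.

```python
# Sketch of the lower-level control rule above; e_pv_next and e_load_next are the
# one-step-ahead forecasts and q_kwh is the battery capacity Q.
def guided_action(e_pv_next: float, e_load_next: float, q_kwh: float = 11.0) -> float:
    """Return the prediction-guided BESS action a_t as a fraction of capacity."""
    surplus = e_pv_next - e_load_next          # R^pv_{t+1}
    if surplus <= 0.0:
        # no surplus: limit discharging so the upper-level SOC requirement keeps priority
        return max(surplus / q_kwh, -0.2)      # the 20% discharge cap is an assumed example
    return min(surplus / q_kwh, 1.0)           # charge with the predicted surplus

a_t = guided_action(e_pv_next=4.5, e_load_next=2.0)   # charge roughly 23% of capacity
```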

In the reinforcement learning implementation, I model the problem as a Markov Decision Process (MDP) with state space \( S \), action space \( A \), reward function \( R \), transition probability \( P \), and discount factor \( \gamma \). The state includes environmental variables, and actions are continuous within \( [SOC^{min}, SOC^{max}] \). The DDPG algorithm uses Actor and Critic networks to approximate the policy \( \mu_\theta(s) \) and Q-function \( Q_\phi(s,a) \), respectively. The policy loss \( J(\theta) \) and Critic loss \( L(\phi) \) are optimized as:

$$ \nabla_\theta J(\theta) \approx \mathbb{E}_{s \sim \rho_\mu} \left[ \nabla_\theta \mu_\theta(s) \nabla_a Q_\phi(s,a) \big|_{a=\mu_\theta(s)} \right] $$
$$ L(\phi) = \mathbb{E}_{(s,a,r,s') \sim D} \left[ (Q_\phi(s,a) - y)^2 \right] $$
$$ y = r + \gamma Q_{\phi'}(s', \mu_{\theta'}(s')) $$

where \( \theta \) and \( \phi \) are the network parameters, \( \theta' \) and \( \phi' \) are target network parameters updated via soft updates, and \( D \) is the experience replay buffer. To integrate prediction guidance, I combine the DDPG policy action \( a_{rl} \) with the hierarchical control action \( a_{bf} \) using adaptive weights:

$$ \text{diff} = |a_{rl} - a_{bf}| $$
$$ w_{bf} = \frac{1}{1 + e^{-(\text{diff} - t) \cdot \lambda(n)}} $$
$$ \lambda(n) = \max\left( 0,\; 1 - \left( \text{eps} / \text{tot\_eps} - n \right) \right) $$
$$ a = w_{rl} \cdot a_{rl} + w_{bf} \cdot a_{bf} $$

Here, \( w_{rl} = 1 – w_{bf} \), \( t \) is a threshold updated based on historical differences, \( \text{eps} \) and \( \text{tot\_eps} \) are current and total training episodes, and \( n \) is the number of self-exploration episodes. Early in training, \( w_{bf} \) dominates to provide guidance, but as training progresses, \( w_{rl} \) increases to promote exploration. After guidance ends, \( w_{bf} \) is set to zero. Importance sampling and entropy regularization are applied to focus on actions deviating from \( a_{bf} \), with the policy loss adjusted as:

$$ J(\theta) \propto -\sum (\text{normw} \cdot \text{Critic}) + \varepsilon \cdot \text{entropy} $$

where \( \text{normw} \) is the normalized importance weight, \( \text{entropy} \) is the action entropy, and \( \varepsilon \) is a regularization coefficient.
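A rough sketch of this adaptive blending is shown below; the threshold handling and the flag that marks the end of the guidance phase are simplified assumptions rather than the exact scheme.

```python
import numpy as np

# Rough sketch of the adaptive action blending described above; the threshold
# update and the end-of-guidance flag are simplified assumptions.
def blend_actions(a_rl: float, a_bf: float, threshold: float,
                  eps: int, tot_eps: int, n_selfexplore: int,
                  guidance_over: bool = False) -> float:
    if guidance_over:
        return a_rl                                            # w_bf is set to zero
    diff = abs(a_rl - a_bf)                                    # deviation between the two actions
    lam = max(0.0, 1.0 - (eps / tot_eps - n_selfexplore))      # lambda(n)
    w_bf = 1.0 / (1.0 + np.exp(-(diff - threshold) * lam))     # sigmoid weight on the guidance
    return (1.0 - w_bf) * a_rl + w_bf * a_bf                   # a = w_rl * a_rl + w_bf * a_bf

a = blend_actions(a_rl=0.4, a_bf=-0.1, threshold=0.3, eps=2, tot_eps=20, n_selfexplore=0)
```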

For experimental validation, I use data from the Pecan Street Dataport, which includes one year of load, PV, weather, and time-of-use electricity price data for five residential buildings in the same region. The BESS parameters are set as follows:

| Battery Parameter | Value |
| --- | --- |
| Capacity (kWh) | 11.0 |
| Power (kW) | 6 |
| Maximum SOC | 91% |
| Minimum SOC | 5% |
| Initial charge/discharge efficiency | 94% |
| Self-discharge rate | 0.001% |
| Energy power interpolation intervals | [[0, 0.85], [0.6, 0.94], [1, 0.85]] |
| Energy efficiency interpolation intervals | [[0.0, 1], [0.7, 1], [1.0, 0.2]] |
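The interval lists in the table can be read as piecewise-linear curves and evaluated by interpolation, as described in the battery model above. The sketch below shows one plausible reading; which list corresponds to which physical quantity is an assumption, not confirmed by the parameter names.

```python
import numpy as np

# One plausible reading of the interval lists above as piecewise-linear curves,
# evaluated with np.interp; the mapping of each list to a quantity (efficiency
# vs. power derating) is an assumption.
CURVE_A = np.array([[0.0, 0.85], [0.6, 0.94], [1.0, 0.85]])   # e.g. power ratio -> efficiency
CURVE_B = np.array([[0.0, 1.0], [0.7, 1.0], [1.0, 0.2]])      # e.g. energy ratio -> derating

def piecewise(curve: np.ndarray, x: float) -> float:
    """Linearly interpolate a [[x, y], ...] interval list at position x."""
    return float(np.interp(x, curve[:, 0], curve[:, 1]))

eta = piecewise(CURVE_A, 0.5)      # efficiency at 50% of rated power, about 0.925
derate = piecewise(CURVE_B, 0.9)   # derating factor near a full battery, about 0.47
```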

The time-of-use electricity prices vary by season and time of day, as summarized below:

| Time Period | June-September (Weekday) | June-September (Weekend) | October-May (Weekday) | October-May (Weekend) |
| --- | --- | --- | --- | --- |
| 8 AM-4 PM | $0.22/kWh | $0.22/kWh | $0.21/kWh | $0.21/kWh |
| 4 PM-9 PM | $0.54/kWh | $0.40/kWh | $0.50/kWh | $0.50/kWh |
| 9 PM-8 AM | $0.22/kWh | $0.22/kWh | $0.21/kWh | $0.21/kWh |
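For reference, a simple lookup that reproduces this tariff table might look as follows; the function name and argument conventions are assumptions.

```python
# Illustrative lookup reproducing the time-of-use tariff table above; the
# season and peak-hour boundaries are taken directly from the table.
def tou_price(month: int, hour: int, weekday: bool) -> float:
    """Return the electricity price in $/kWh for a given month (1-12) and hour (0-23)."""
    summer = 6 <= month <= 9                    # June-September
    if 16 <= hour < 21:                         # 4 PM - 9 PM peak window
        if summer:
            return 0.54 if weekday else 0.40
        return 0.50                             # October-May peak, weekday or weekend
    return 0.22 if summer else 0.21             # off-peak: 8 AM-4 PM and 9 PM-8 AM

price = tou_price(month=7, hour=18, weekday=True)   # summer weekday peak -> 0.54
```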

I evaluate the PGADDPG approach under four scenarios: independent DDPG (IDDPG), PGADDPG (with prediction guidance and attention), PGDDPG (prediction guidance without attention), and AIDDPG (attention without guidance). Each scenario is trained for 20 episodes, with the first episode involving random exploration and the last used for evaluation. The models are then tested on two additional buildings to assess transferability. Key hyperparameters include a replay buffer size of 200,000, batch size of 256, learning rates of 0.0004 for both Q and policy networks, discount factor \( \gamma = 0.992 \), target update rate \( \tau = 0.0003 \), and noise decay rate of 0.99. The network architecture comprises an input dimension of 15, an attention hidden layer of 300 with Softmax activation, and Actor/Critic networks with two hidden layers of dimensions 300 and 400 using ReLU activation.
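A rough PyTorch sketch of Actor and Critic networks with these dimensions is given below; how the attention hidden layer feeds the Actor and the choice of output activation are simplified assumptions, not the exact PGADDPG design.

```python
import torch
import torch.nn as nn

# Sketch of the Actor/Critic dimensions described above (input 15, hidden layers
# of 300 and 400 with ReLU). The single softmax-scored message standing in for
# the attention hidden layer and the tanh output are simplifying assumptions.
class Actor(nn.Module):
    def __init__(self, obs_dim: int = 15, act_dim: int = 1):
        super().__init__()
        self.attn_score = nn.Linear(obs_dim, 300)          # attention hidden layer (Softmax)
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 300, 300), nn.ReLU(),
            nn.Linear(300, 400), nn.ReLU(),
            nn.Linear(400, act_dim), nn.Tanh(),            # continuous action output
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        msg = torch.softmax(self.attn_score(obs), dim=-1)  # weighted shared information
        return self.net(torch.cat([obs, msg], dim=-1))

class Critic(nn.Module):
    def __init__(self, obs_dim: int = 15, act_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 300), nn.ReLU(),
            nn.Linear(300, 400), nn.ReLU(),
            nn.Linear(400, 1),                             # Q-value estimate
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))
```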

The rolling prediction results demonstrate the effectiveness of the forecasting models. For PV prediction, the SA-BiLSTM model achieves lower mean squared error (MSE) and mean absolute error (MAE) compared to standard LSTM, as shown below:

| Prediction Error | SA-BiLSTM (Building 1) | SA-BiLSTM (Building 2) | LSTM (Building 1) | LSTM (Building 2) |
| --- | --- | --- | --- | --- |
| 1-hour MSE | 0.034 | 0.018 | 0.217 | 0.270 |
| 1-hour MAE | 0.041 | 0.064 | 0.200 | 0.215 |
| 24-hour MSE | 0.046 | 0.041 | 0.207 | 0.267 |
| 24-hour MAE | 0.072 | 0.049 | 0.193 | 0.212 |

For load prediction, the adaptive cyclic average model outperforms day-ahead forecasting, with errors increasing over longer horizons but remaining acceptable for guiding control:

| Prediction Error | Adaptive Cyclic Average (Building 1) | Adaptive Cyclic Average (Building 2) | Day-Ahead Prediction (Building 1) | Day-Ahead Prediction (Building 2) |
| --- | --- | --- | --- | --- |
| 1-hour MSE | 0.332 | 0.464 | 0.624 | 0.983 |
| 1-hour MAE | 0.449 | 0.583 | 0.529 | 0.619 |
| 24-hour MSE | 0.477 | 0.829 | 0.654 | 1.159 |
| 24-hour MAE | 0.504 | 0.630 | 0.530 | 0.620 |

In terms of reinforcement learning performance, the reward curves during training show that PGADDPG and PGDDPG start with higher rewards thanks to prediction guidance but fluctuate as entropy regularization encourages exploration. After stabilization, PGADDPG achieves the highest evaluation reward of -0.59, compared to -0.76 for IDDPG, confirming the benefits of guidance and attention. The attention mechanism lets agents share information, leading to better collective outcomes. For example, on a typical summer day, all strategies learn to store surplus PV energy and discharge during peak hours, but PGADDPG and PGDDPG produce smoother scheduling plans with reduced load fluctuations.

The evaluation results for Buildings 1 and 2 over one year are summarized below (values are ratios relative to the no-BESS baseline), showing that PGADDPG outperforms the other methods in reducing electricity consumption, economic cost, and carbon emissions:

| Algorithm | Electricity Consumption (Building 1) | Electricity Consumption (Building 2) | Economic Cost (Building 1) | Economic Cost (Building 2) | Carbon Emissions (Building 1) | Carbon Emissions (Building 2) |
| --- | --- | --- | --- | --- | --- | --- |
| PGADDPG | 0.771 | 0.675 | 0.659 | 0.606 | 0.743 | 0.666 |
| PGDDPG | 0.788 | 0.750 | 0.662 | 0.664 | 0.760 | 0.733 |
| AIDDPG | 0.813 | 0.777 | 0.692 | 0.697 | 0.786 | 0.762 |
| IDDPG | 0.879 | 0.870 | 0.770 | 0.784 | 0.855 | 0.854 |

Collectively, PGADDPG reduces overall economic costs, carbon emissions, and electricity consumption to 61.5%, 70.5%, and 72.4% of the baseline (no BESS control), respectively, and achieves improvements of 10.3%, 15.1%, and 15.2% over IDDPG. Additionally, PGADDPG results in lower daily peak electricity usage and reduced load fluctuations, enhancing grid stability. The transferability test on Buildings 3 and 4 shows similar trends, with PGADDPG reducing collective indicators by 8.8%, 10.7%, and 11.2% compared to IDDPG, confirming the method’s adaptability to new environments.

In conclusion, the PGADDPG approach effectively optimizes BESS scheduling by integrating prediction guidance, attention-based communication, and deep reinforcement learning. The method delivers significant improvements in economic efficiency, emission reduction, and load stabilization while remaining transferable across buildings. However, this study focuses on small-scale microgrids with BESS and PV systems; future work should explore larger-scale scenarios, integrate additional energy sources, and develop more efficient prediction and guidance techniques to further enhance BESS performance.
