1. INTRODUCTION

The effective utilization of distributed photovoltaics (PV) is a crucial means to drive the energy production and consumption revolution in China. However, due to the randomness and variability of its power output, the impact of distributed PV integration on distribution grid safety, power quality, and system stability cannot be underestimated [1]. Accurate prediction of distributed PV power is therefore of paramount importance for optimizing the operation of multi-level electric grids. Research on centralized PV forecasting is relatively mature and can be divided into two main approaches according to modeling logic. The first is the mechanism-driven approach, in which predictive models are established from physical principles using meteorological information and PV system parameters [2]. The second is the data-driven approach, in which predictive models are built by analyzing historical output, meteorological data, and other pertinent information to establish relationships among them; commonly used methods include statistical methods, machine learning techniques, and hybrid methods [3]. In contrast to large-scale PV stations, which are geographically concentrated, distributed PV generation systems are scattered across different locations. In addition, communication and monitoring equipment at distributed PV installations is often incomplete, leading to the loss of crucial power and meteorological data. As a result, achieving high short-term forecasting accuracy for distributed PV systems is more challenging than for centralized PV stations. This paper introduces a data-driven fusion model designed to improve the accuracy of power predictions for distributed PV stations.
2. DATA PROCESSING OF DISTRIBUTED PHOTOVOLTAIC GENERATION

Data preprocessing is a crucial step in addressing forecasting challenges. However, the current literature often lacks general data preprocessing methods tailored specifically to distributed PV generation. This section discusses data processing methodologies for distributed PV generation.

2.1 Initial Data Identification Based on iForest

The isolation forest (iForest) proposed by Liu et al. [4] is an unsupervised anomaly detection algorithm suited to continuous data and used to detect and mine outlier points. The algorithm offers high computational efficiency and accuracy, is sensitive to globally sparse points, and is suitable for high-dimensional data and large datasets, which makes it applicable to handling anomalous points in wind and photovoltaic power measurement data. The iForest algorithm comprises two key stages: the first stage constructs an isolation forest composed of individual trees, referred to as iTrees, and the second stage assesses the degree of anomaly.

2.1.1 Constructing an iTree: A sample set x is formed by randomly selecting v data points from the training dataset, which are placed into the root node of the tree. The current tree height is initialized and a tree-height limit l is set. Within the current node's data, a dimension g is chosen and a cut-off point k is randomly generated between the maximum and minimum values of dimension g, creating a hyperplane. The data in the current node are divided into two subspaces by the hyperplane: points whose value in dimension g is smaller than k are placed on the left side, and points greater than or equal to k are placed on the right side, so the current node's data space is split into two child subspaces.

2.1.2 Degree-of-anomaly judgment: After t iTrees are obtained, the iForest is assembled. Within each tree, the algorithm searches for a sample point x from the sample set, and the degree of anomaly is then determined by the anomaly index in Equation (1):

$S(x) = 2^{-E(h(x))/c(v)}$ (1)

where S(x) is the anomaly index assigned to the identified sample, ranging from 0 to 1, E(h(x)) denotes the average path length required to isolate x on the iTrees, and c(v) is the average search path length of a binary tree constructed from v training points, computed using the harmonic number H(i) = ln(i) + ξ, where ξ is the Euler constant. Based on Equation (1): S(x) approaching 1 indicates that the sample is abnormal; S(x) approaching 0 indicates that the sample is not abnormal; S(x) in the middle of the interval (0, 1) indicates that no conspicuous abnormal points are found among the samples. The flowchart of the iForest algorithm is shown in Figure 1.
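As an illustration of this screening stage, the following is a minimal sketch that applies an isolation forest to raw distributed-PV measurements using scikit-learn's IsolationForest; the feature columns, sub-sampling size, and contamination ratio are assumptions chosen for the example rather than settings reported in this study.

```python
# Minimal sketch: iForest-based screening of distributed PV measurements.
# Column names, contamination ratio, and tree count are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import IsolationForest

def flag_anomalies(df: pd.DataFrame, feature_cols=("power", "irradiance", "temperature")):
    """Return a boolean mask that is True for samples judged anomalous."""
    model = IsolationForest(
        n_estimators=100,   # number of iTrees t
        max_samples=256,    # sub-sampling size v for each iTree
        contamination=0.02, # assumed share of outliers in the raw data
        random_state=0,
    )
    model.fit(df[list(feature_cols)])
    # predict() returns -1 for anomalous samples and +1 for normal ones
    return model.predict(df[list(feature_cols)]) == -1

# Usage: drop (or mark for repair) the flagged rows before model training
# raw = pd.read_csv("pv_measurements.csv")
# cleaned = raw[~flag_anomalies(raw)]
```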
2.2 Data Reconstruction Based on Cubic Spline Interpolation

The prediction model of distributed photovoltaic systems relies on historical data for predictive analysis. However, in practical applications, missing data in historical records can seriously influence the precision of the prediction model [5]. Therefore, an effective method is needed to repair missing historical data and enhance the precision of the prediction model. Cubic spline interpolation is a classic interpolation method that provides high accuracy and smoothness. The basic idea is to divide the entire interpolation interval into several small subintervals and to approximate the original function in each subinterval by a cubic polynomial, while ensuring continuity of the first and second derivatives of the polynomials at the interpolation nodes shared by adjacent subintervals. Let the nodes of x be x0, x1, …, xn, with corresponding values y0, y1, …, yn, so that yi = f(xi), i = 0, 1, …, n. Cubic spline interpolation describes the function on each subinterval [xi, xi+1] by f(x) = fi(x), and the resulting curve is smooth over the whole interval with continuous derivatives. The function on the n subintervals is represented by n cubic polynomials,

$f_i(x) = a_i + b_i(x - x_i) + c_i(x - x_i)^2 + d_i(x - x_i)^3, \quad x \in [x_i, x_{i+1}],$

with a total of 4n unknown parameters ai, bi, ci, di. The continuity of the interpolant and of its first and second derivatives at the interior nodes provides 4n − 2 conditions; the remaining two conditions are boundary conditions at x0 and xn. In distributed photovoltaic prediction, the cubic spline interpolation algorithm can be used to repair missing historical data; the repair step is illustrated by the sketch below.
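As an illustration of the repair step, the sketch below fits a cubic spline to the valid samples of a PV power series with SciPy's CubicSpline and evaluates it at the missing timestamps; the natural boundary condition and the example series are assumptions of this sketch, not choices specified in the study.

```python
# Minimal sketch: repairing missing PV power samples with cubic spline interpolation.
# The natural boundary condition and the example series are assumptions.
import numpy as np
from scipy.interpolate import CubicSpline

def repair_missing(t, y):
    """Fill NaN entries of the power series y observed at times t using a cubic spline."""
    t, y = np.asarray(t, dtype=float), np.asarray(y, dtype=float)
    valid = ~np.isnan(y)
    # Fit the piecewise cubic polynomial on the valid nodes only;
    # 'natural' sets the second derivative to zero at both boundary nodes.
    spline = CubicSpline(t[valid], y[valid], bc_type="natural")
    repaired = y.copy()
    repaired[~valid] = spline(t[~valid])
    return repaired

# Usage: timestamps is the time index, power_with_gaps contains NaN where data are missing
# power_repaired = repair_missing(timestamps, power_with_gaps)
```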
2.3 Correlation Analysis

Similar to numerous other renewable energy sources, PV power is heavily reliant on weather conditions. Correlations between the main meteorological factors in the meteorological data and the photovoltaic output are analyzed via the Pearson correlation coefficient (PCC) [6]:

$r_{XY} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - E(X))(Y - E(Y))]}{\sigma_X \sigma_Y}$

where rXY and cov(X, Y) are the PCC value and the covariance of the time-series variables X and Y, respectively, σX and σY are the standard deviations of X and Y, and E(·) denotes the mathematical expectation. A large absolute value of the PCC indicates a strong correlation between the variables. The correlation coefficients between the NWP variables and the photovoltaic power are illustrated in Figure 2.

3. STACKING MODEL

The stacking ensemble learning scheme is designed to integrate the data-driven sub-models [7]. With two layers, it exploits the advantages of the different models to the greatest extent and obtains the best prediction effect. The data, after generalized preprocessing, are partitioned into k training sets. The first layer of the stacking scheme consists of the data-driven sub-models acting as base models; the partitioned training sets are processed with k-fold testing and k-fold cross-validation. The second layer contains another machine learning algorithm acting as the meta-learner; XGBoost is selected because of its strong optimization ability when the base models differ considerably. In our previous experiments, 6-fold testing and 6-fold cross-validation proved a good choice for achieving strong generalization across the multiple sub-models. The proposed method comprises three stages; a configuration sketch of the stacking scheme is given below.
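To make the two-layer scheme concrete, the following is a minimal sketch of a stacking configuration with a 6-fold first layer and an XGBoost meta-learner, built on scikit-learn's StackingRegressor; the choice of base learners and all hyperparameter values are illustrative assumptions rather than the tuned models used in this study.

```python
# Minimal sketch of the two-layer stacking scheme: first-layer base learners
# combined through 6-fold cross-validation, second-layer XGBoost meta-learner.
# Base-learner choices and hyperparameters are illustrative, not tuned values.
from sklearn.ensemble import StackingRegressor
from sklearn.svm import SVR
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

base_learners = [
    ("svr", SVR(C=10.0, epsilon=0.01)),
    ("lgbm", LGBMRegressor(n_estimators=300, learning_rate=0.05)),
]

stacking_model = StackingRegressor(
    estimators=base_learners,   # first layer: data-driven sub-models
    final_estimator=XGBRegressor(n_estimators=200, learning_rate=0.05),  # second layer
    cv=6,                       # 6-fold cross-validation to generate meta-features
    passthrough=False,          # the meta-learner sees only the base-model predictions
)

# X: NWP features retained after the PCC screening; y: measured PV power
# stacking_model.fit(X_train, y_train)
# y_pred = stacking_model.predict(X_test)
```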
4. CASE STUDY

To validate the efficacy of the proposed method, comparisons are drawn between the proposed method and classical approaches, including the data-driven models used as first-level base learners in the proposed method. Model performance can be greatly influenced by the choice of hyperparameters, so to make the experimental comparisons more persuasive, the hyperparameters of the individual models are optimized using the whale optimization algorithm. Data were collected from a distributed photovoltaic power station for three representative days, each corresponding to a different weather condition (sunny, cloudy, rainy); the predicted results and indicators are illustrated in Fig. 3.

Figure 3(a) presents the single-day forecasting results under clear-sky conditions. The solid lines represent the predictions of the various models, while the dashed lines depict the absolute error rates relative to the ground-truth values. Under sunny conditions the photovoltaic curve varies smoothly and the data trends remain relatively stable, so all models demonstrate good learning and forecasting capability. However, when fluctuations occur in the photovoltaic power curve around 10:00, the Stacking ensemble model and LSTM track the power variation trends better. According to the absolute error rate curves, the accuracy of all models is higher from 9:00 to 17:00. At the endpoints, the error rates of the proposed model are below 0.1, indicating superior predictive performance through model fusion.

Figure 3(b) displays the single-day forecasting results for rainy weather. During rainy weather the fluctuations in photovoltaic output increase noticeably. For periods with relatively stable trends, all models predict well; however, for segments of photovoltaic power with strong and unpredictable fluctuations, the models struggle to track the variations accurately. The mechanistic and SVM models show larger deviations at peaks and inflection points, while the data-driven models are relatively accurate. In addition, because of the low irradiance during rainy weather and its non-uniform impact on photovoltaic power, the mechanistic model exhibits higher prediction error rates at the endpoints. Based on the error rate curves, the Stacking model maintains a consistent level of prediction accuracy, showing advantages in both accuracy and stability.

Figure 3(c) illustrates photovoltaic forecasting under partly cloudy conditions. The thickness of the cloud cover indirectly affects the absorption of solar radiation by the photovoltaic panels, leading to frequent fluctuations in the photovoltaic curve, albeit less intense than during rainy weather. During the rising phase of photovoltaic power, the predictions of Stacking, LSTM, and LGBM closely align with the actual curve. In the middle of the curve, where fluctuations are frequent, Stacking outperforms the other models in capturing values close to the actual ones, except at the peak points and their immediate vicinity. During the declining phase of photovoltaic power, Stacking and LSTM exhibit relatively good tracking performance. Examining the relative error curves, the models other than Stacking and LSTM show increased error rates at power discontinuity points, while LSTM shows larger prediction errors on both sides.
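For reference, the absolute error rate curves and day-level accuracy indicators discussed above can be computed along the lines of the short sketch below; normalizing the point-wise error by the station's rated capacity is an assumption of this example, not a definition taken from the study.

```python
# Small sketch: point-wise absolute error rate and day-level indicators for
# comparing forecasts. Normalizing by rated capacity is an assumption of this example.
import numpy as np

def absolute_error_rate(p_true, p_pred, capacity):
    """Point-wise |forecast - measurement| normalized by the rated capacity."""
    return np.abs(np.asarray(p_pred) - np.asarray(p_true)) / capacity

def day_indicators(p_true, p_pred):
    """Mean absolute error and root-mean-square error over one day."""
    err = np.asarray(p_pred) - np.asarray(p_true)
    return {"MAE": float(np.mean(np.abs(err))), "RMSE": float(np.sqrt(np.mean(err ** 2)))}

# Example usage for one model on one representative day (rated capacity in kW assumed):
# rate_curve = absolute_error_rate(p_measured, p_stacking, capacity=500.0)
# print(day_indicators(p_measured, p_stacking))
```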
5. SUMMARY

This study addresses the issue of insufficient meteorological data for power prediction at distributed photovoltaic (PV) sites. To overcome this challenge, a hybrid data-driven approach is proposed. Experimental validation using real-world data yields the following findings:
6. ACKNOWLEDGMENT

This work is supported by the Science and Technology Project of State Grid Corporation of China (Grant No. 5108-202218280A-2-226-XG).

REFERENCES

[1] Zhang Tinghui, Xie Mingcheng, et al., "System dynamics simulation of shared value of distributed photovoltaic and its impact on distribution network," Automation of Electric Power Systems, 45(18), 35–44 (2021).

[2] Barbieri, F., Rajakaruna, S. and Ghosh, A., "Very short-term photovoltaic power forecasting with cloud modeling: a review," Renewable and Sustainable Energy Reviews, 75, 242–263 (2017).

[3] Sheng, H. M., Xiao, J., Cheng, Y. H., et al., "Short-term solar power forecasting based on weighted Gaussian process regression," IEEE Transactions on Industrial Electronics, 65(1), 300–308 (2018). https://doi.org/10.1109/TIE.2017.2714127

[4] Liu, F. T., Ting, K. M. and Zhou, Z., "Isolation-based anomaly detection," ACM Transactions on Knowledge Discovery from Data, 6(1), 1–39 (2012). https://doi.org/10.1145/2133360.2133363

[5] Jiao Tianli, Zhang Jianmin, Li Xiong, et al., "Spatial clustering method for large-scale distributed user photovoltaics based on spatial correlation," Automation of Electric Power Systems, 43(21), 97–102 (2019).

[6] Lin, S. and Feng, Y., "Research on stock price prediction based on orthogonal Gaussian basis function expansion and Pearson correlation coefficient weighted LSTM neural network," Advances in Computer Signals and Systems, 6(5) (2022).

[7] Shi Jiaqi and Zhang Jianhua, "Load forecasting based on multi-model by Stacking ensemble learning," Proceedings of the CSEE, 39, 4032–4042 (2019).