Open Access Paper
A fusion model for predicting distributed photovoltaic power
Published 15 January 2025
Rui Li, Hui Hui, Ming Wang, Yang Zhao, Miao Zhao, Qianfan Zhou
Proceedings Volume 13513, The International Conference Optoelectronic Information and Optical Engineering (OIOE2024); 135133Z (2025) https://doi.org/10.1117/12.3056624
Event: The International Conference Optoelectronic Information and Optical Engineering (OIOE2024), 2024, Wuhan, China
Abstract
This paper proposes a prediction method based on a fused model, with separate data processing, to address the strong stochasticity and variability of distributed photovoltaic (PV) power plants, which traditional approaches cannot adequately accommodate. The data from the distributed PV plant are first processed to meet the precision requirements of prediction: the isolation forest (iForest) algorithm is employed for data cleaning, and the cleaned data with missing values are then reconstructed through cubic spline interpolation. Finally, prediction is carried out using a stacking fused model whose input data have undergone feature engineering through correlation analysis. The effectiveness of the proposed method is confirmed through validation on real-world data.

1.

INTRODUCTION

The effective utilization of distributed photovoltaics (PV) is a crucial means of driving the energy production and consumption revolution in China. However, due to the randomness and variability of its power output, the impact of distributed PV integration on distribution grid safety, power quality, and system stability cannot be underestimated [1]. Therefore, accurate prediction of distributed PV power is of paramount importance for optimizing the operation of multi-level electric grids.

Currently, research on centralized PV forecasting is relatively mature and can be categorized into two main approaches based on modeling logic. One approach is the mechanism-driven method, in which predictive models are established from physical principles using meteorological information and PV system parameters [2]. The other approach is data-driven: predictive models are developed by analyzing historical output, meteorological data, and other pertinent information to establish the relationships among them. Commonly used methods include statistical methods, machine learning techniques, and hybrid methods [3].

In contrast to large-scale PV stations, which are geographically concentrated, distributed PV generation systems are scattered across different locations. Additionally, the communication and monitoring equipment of distributed PV installations is often incomplete, leading to the loss of crucial power and meteorological data. As a result, achieving high short-term forecasting accuracy is more challenging for distributed PV systems than for centralized PV stations.

This paper introduces a data-driven fusion model designed to improve the accuracy of power predictions for distributed PV stations:

  • 1) The meteorological data obtained from nearby public weather stations, or from photovoltaic stations equipped with numerical weather prediction (NWP), are processed independently, using isolation forest (iForest) for data cleansing and cubic spline interpolation for handling missing values.

  • 2) Feature extraction via correlation analysis is employed to enhance the predictive accuracy of the fusion model.

  • 3) Finally, a stacking ensemble learning approach is employed to combine the strengths of the data-driven models. This approach contributes to better model generalization performance and predictive accuracy.

2.

DATA PROCESSING OF DISTRIBUTED PHOTOVOLTAIC GENERATION

Data preprocessing is a crucial step in addressing forecasting challenges. However, the current literature often lacks general data preprocessing methods that are specifically tailored to distributed PV generation. This section discusses data processing methods designed for distributed PV generation.

2.1

Initial Data Identification Based on iForest

The iForest algorithm proposed by Liu et al. [4] is an unsupervised anomaly detection algorithm that is suitable for continuous data and is used to detect and mine outlier points. Moreover, it has high computational efficiency and accuracy, is sensitive to globally sparse points, and is suitable for high-dimensional data and large datasets. It is therefore applicable to handling anomalous points in wind and photovoltaic power measurement data.

The iForest algorithm comprises two key stages: the first stage constructs an isolation forest, which is composed of individual trees referred to as iTrees, and the second stage assesses the degree of anomaly.

2.1.1

Constructing an iTree:

The sample set x is formed by randomly selecting v data points from the training dataset, which are then placed into the root node of the tree. The current tree height is initialized, and the tree height is limited to a maximum value l (Liu et al. [4] use l = ⌈log2 v⌉).

Within the current node's data, a dimension g is chosen at random, and a cut-off point k is randomly generated between the minimum and maximum values of dimension g, creating a hyperplane.

The data within the current node are divided into two subspaces by the hyperplane: data points smaller than k in dimension g are placed on the left side, and data points greater than or equal to k are placed on the right side, so that the current node's data space is divided into two child subspaces. The splitting is repeated recursively on each child node until a node can no longer be split or the height limit is reached.
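The construction steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`build_itree`, `build_forest`, `path_length`), the dictionary-based tree layout, the toy data, and the height limit ⌈log2 v⌉ (taken from the original iForest formulation) are all choices made here for illustration. The path length simply counts edges and omits the leaf-size correction used in the original algorithm.

```python
import math
import random

def build_itree(points, height, height_limit, rng):
    """Recursively isolate points (tuples of floats) with random axis-parallel cuts."""
    # A node becomes a leaf when it cannot be split or the height limit is reached.
    if height >= height_limit or len(points) <= 1:
        return {"size": len(points)}
    g = rng.randrange(len(points[0]))            # randomly chosen dimension g
    lo = min(p[g] for p in points)
    hi = max(p[g] for p in points)
    if lo == hi:                                 # identical values in g: nothing to cut
        return {"size": len(points)}
    k = rng.uniform(lo, hi)                      # random cut-off point k in [lo, hi]
    return {
        "g": g,
        "k": k,
        "left": build_itree([p for p in points if p[g] < k], height + 1, height_limit, rng),
        "right": build_itree([p for p in points if p[g] >= k], height + 1, height_limit, rng),
    }

def build_forest(data, t, v, seed=0):
    """Build t iTrees, each from v points sampled without replacement."""
    rng = random.Random(seed)
    limit = math.ceil(math.log2(v))              # assumed height limit, see lead-in
    return [build_itree(rng.sample(data, v), 0, limit, rng) for _ in range(t)]

def path_length(tree, x, depth=0):
    """Number of edges traversed before x reaches a leaf (no leaf-size correction)."""
    if "g" not in tree:
        return depth
    child = tree["left"] if x[tree["g"]] < tree["k"] else tree["right"]
    return path_length(child, x, depth + 1)

# Hypothetical data: 20 clustered readings plus one obvious outlier.
data = [(0.1 * i, 0.1 * i) for i in range(20)] + [(100.0, 100.0)]
forest = build_forest(data, t=50, v=len(data))

def mean_depth(x):
    """Average path length of x over all trees in the forest."""
    return sum(path_length(tree, x) for tree in forest) / len(forest)

print(mean_depth((100.0, 100.0)), mean_depth((1.0, 1.0)))
```

The isolated reading should reach a leaf after far fewer cuts than a reading inside the cluster, which is the property the anomaly index builds on. In practice a library implementation would normally be preferred over a hand-written forest.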

2.1.2

Judging the degree of anomaly:

Once t iTrees have been built, they are assembled into the iForest. Each sample point x in the sample set is then passed through every tree, and its degree of anomaly is determined by calculating the anomaly index using Equation (1).

$$S(x) = 2^{-\frac{E(h(x))}{c(v)}} \qquad (1)$$

In Equation (1), S(x) represents the anomaly index assigned to the sample x, ranging from 0 to 1. E(h(x)) denotes the average path length h(x) needed to isolate x over the iTrees, and c(v) denotes the average search path length in a binary tree built from v points of the training data.

$$c(v) = 2H(v-1) - \frac{2(v-1)}{v}$$

In the equation, H(i) = ln(i) + ξ, where ξ is the Euler constant.

Based on Equation (1), we can conclude that:

S(x) close to 1 indicates that the sample x is very likely abnormal.

S(x) well below 0.5 indicates that the sample x can be regarded as normal.

S(x) close to 0.5 for all samples indicates that no conspicuous abnormal points are found in the data set.
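As a worked illustration of the anomaly index, the sketch below evaluates S(x) = 2^(-E(h(x))/c(v)) with c(v) = 2H(v-1) - 2(v-1)/v and H(i) ≈ ln(i) + ξ, following the standard iForest definitions of Liu et al. [4]. The function names, the special cases for v ≤ 2, and the sub-sample size of 256 are assumptions made here, not values reported in this paper.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant (ξ in the text)

def c(v):
    """Average path length of an unsuccessful search in a binary tree of v points."""
    if v <= 1:
        return 0.0
    if v == 2:
        return 1.0
    harmonic = math.log(v - 1) + EULER_GAMMA     # H(v - 1) approximated by ln(v - 1) + ξ
    return 2.0 * harmonic - 2.0 * (v - 1) / v

def anomaly_score(mean_path_length, v):
    """S(x) = 2 ** (-E(h(x)) / c(v)); requires v >= 2 so that c(v) > 0."""
    return 2.0 ** (-mean_path_length / c(v))

v = 256                                          # hypothetical sub-sample size
for mean_path in (1.0, c(v), 3.0 * c(v)):
    print(round(mean_path, 3), round(anomaly_score(mean_path, v), 3))
```

A sample that is isolated after very few cuts gets a score near 1, a sample whose average path equals c(v) gets exactly 0.5, and longer paths push the score toward 0.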

The flowchart of the iForest algorithm is shown in Figure 1.

Figure 1.

The general process flow of the iForest algorithm.


2.2

Data Reconstruction Based on Cubic Spline Interpolation

The prediction model of a distributed photovoltaic system relies on historical data for predictive analysis. However, in practical applications, missing data in the historical records can seriously affect the precision of the prediction model [5].

Therefore, an effective method is needed to repair missing historical data and enhance the precision of the prediction model. Cubic spline interpolation is a classic interpolation method that provides high accuracy and smoothness. The basic idea is to divide the entire interpolation interval into several small intervals and to approximate the original function with a cubic polynomial in each small interval, while ensuring that the first and second derivatives of the polynomials are continuous at the interpolation points shared by adjacent intervals.

Let the interpolation nodes be x0, x1, …, xn, with corresponding values y0, y1, …, yn related by yi = f(xi), i = 0, 1, …, n. Cubic spline interpolation describes the function on each subinterval [xi, xi+1] between two data nodes by f(x) = fi(x), where the curve is smooth within the interval and has continuous derivatives. The function on the n subintervals can be represented by n cubic polynomials:

$$f_i(x) = a_i + b_i (x - x_i) + c_i (x - x_i)^2 + d_i (x - x_i)^3, \quad x \in [x_i, x_{i+1}], \quad i = 0, 1, \ldots, n-1$$

with a total of 4n unknown parameters ai, bi, ci, di. We can obtain 4n–2 conditions from the interpolation requirement and the continuity of the first and second derivatives:

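$$\begin{cases} f_i(x_i) = y_i, \quad f_i(x_{i+1}) = y_{i+1}, & i = 0, 1, \ldots, n-1 \\ f_i'(x_{i+1}) = f_{i+1}'(x_{i+1}), & i = 0, 1, \ldots, n-2 \\ f_i''(x_{i+1}) = f_{i+1}''(x_{i+1}), & i = 0, 1, \ldots, n-2 \end{cases}$$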

The remaining two conditions are boundary conditions at x0 and xn.

In distributed photovoltaic prediction, the cubic spline interpolation algorithm can be used to repair missing historical data. The specific steps are as follows:

  • (1) Collect Known Data Points: Collect the known data points around each gap in the cleaned data; these can be data points from surrounding time periods or other relevant data points.

  • (2) Fit a Cubic Spline Function: Fit a cubic spline function to the known data points. The cubic spline function ensures smooth interpolation, with continuous first and second derivatives at the nodes.

  • (3) Estimate Missing Data Using the Cubic Spline Function: Estimate the missing data points by evaluating the fitted cubic spline function at their time stamps to obtain the interpolated result.
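The three steps above can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the paper does not state which boundary conditions it uses, so the sketch assumes a natural spline (zero second derivative at both end nodes), and the function names and sample values are hypothetical.

```python
from bisect import bisect_right

def natural_cubic_spline(xs, ys):
    """Return a callable spline through (xs, ys); xs must be strictly increasing."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Forward sweep of the tridiagonal system for the quadratic coefficients c_i.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution; c[0] = c[n] = 0 is the assumed natural boundary condition.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        # Locate the subinterval [x_i, x_{i+1}] that contains x.
        i = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[i]
        return ys[i] + b[i] * dx + c[i] * dx ** 2 + d[i] * dx ** 3

    return evaluate

def fill_missing(times, values):
    """Replace None entries with spline estimates fitted to the known samples."""
    known = [(t, v) for t, v in zip(times, values) if v is not None]   # step (1)
    spline = natural_cubic_spline([t for t, _ in known],
                                  [v for _, v in known])               # step (2)
    return [spline(t) if v is None else v
            for t, v in zip(times, values)]                            # step (3)

# Hypothetical hourly PV power readings (kW) with two missing samples.
hours = [8, 9, 10, 11, 12, 13, 14, 15, 16]
power = [1.2, 2.9, None, 5.6, 6.1, None, 4.8, 3.0, 1.4]
print([round(p, 2) for p in fill_missing(hours, power)])
```

Interpolation of this kind is best suited to short gaps inside the record; long gaps or gaps at the ends of the series leave the spline poorly constrained.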

2.3

Correlation Analysis

Similar to numerous other renewable energy sources, PV power is heavily reliant on weather conditions. The correlations between the main meteorological factors in the meteorological data and the photovoltaic output are analyzed via the Pearson correlation coefficient (PCC) [6].

$$r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E\big[(X - E(X))(Y - E(Y))\big]}{\sigma_X \sigma_Y}$$

Here, rXY and cov(X, Y) are the PCC value and the covariance of the time series variables X and Y, respectively. σX and σY represent the standard deviations of X and Y. E(•) denotes the mathematical expectation of a variable. The larger the absolute value of the PCC, the stronger the correlation between the variables. The correlation coefficients between the NWP variables and the photovoltaic power are illustrated in Figure 2.
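A minimal sketch of how the PCC can drive feature selection is given below. The function names, the threshold of 0.5, and the sample values are hypothetical and chosen only for illustration; the paper does not report the threshold it uses.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r_XY = cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    # A constant series has zero variance, so its correlation is undefined.
    return cov / math.sqrt(var_x * var_y)

def select_features(features, power, threshold=0.5):
    """Keep the features whose absolute PCC with PV power reaches the threshold."""
    scores = {name: pearson(column, power) for name, column in features.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}

# Hypothetical NWP variables and measured PV power for six time steps.
nwp = {
    "irradiance": [120.0, 340.0, 610.0, 800.0, 650.0, 300.0],
    "temperature": [14.0, 17.5, 21.0, 24.0, 23.5, 20.0],
    "wind_speed": [3.1, 2.4, 3.0, 2.2, 3.3, 2.6],
}
pv_power = [0.9, 2.8, 5.1, 6.7, 5.3, 2.4]
print({name: round(r, 3) for name, r in select_features(nwp, pv_power).items()})
```

Because the PCC only measures linear dependence, a feature with a low score may still carry nonlinear information, so the threshold is a tuning choice rather than a fixed rule.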

Figure 2.

Heatmap showing correlations between different features in the dataset.


3.

STACKING MODEL

A stacking ensemble learning scheme is designed to integrate the data-driven sub-models [7]. With its two layers, it can combine the advantages of the different models and thereby improve the prediction. The preprocessed data are partitioned into k training sets. The first layer of the stacking scheme is composed of the data-driven sub-models, which serve as the base models. The partitioned training sets are processed with k-fold testing and k-fold cross-validation. The second layer consists of another machine learning algorithm that acts as the meta-learner. Because of its optimization ability when the inputs differ strongly, the XGBoost algorithm is selected for this role. Based on our previous experiments, 6-fold testing and 6-fold cross-validation are adopted as a good choice for achieving strong generalization across the multiple sub-models.

The proposed method comprises three stages.

  • Stage-1: Generalized data preprocessing, such as initial data identification, data reconstruction, and correlation analysis.

  • Stage-2: To improve the power forecasting of distributed photovoltaic stations, the cleaned and reconstructed data are subjected to feature extraction through correlation analysis.

  • Stage-3: Split the processed dataset into 6 equal parts and train the base models, each of which results in 6 instances with different parameters. Use the predictions of the base models as inputs to the meta-learner and train the meta-learner on them, integrating the strengths of the different algorithms to obtain power forecasts for the distributed photovoltaic station.
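The data flow of Stage-3 can be illustrated with a small standard-library sketch. It deliberately swaps in simpler components than those in the paper: the two base models (a least-squares line and a nearest-neighbour average) and the linear least-squares meta-learner stand in for the paper's sub-models and XGBoost, and all function names and data are hypothetical. Only the 6-fold out-of-fold mechanism is meant to mirror the description above.

```python
def fit_line(xs, ys):
    """Base model 1: ordinary least-squares line on a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

def fit_knn(xs, ys, neighbours=3):
    """Base model 2: mean target of the nearest training samples."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:neighbours]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def out_of_fold(fit, xs, ys, k):
    """Train k instances of a base model; each predicts the part it did not see."""
    n = len(xs)
    preds, models = [0.0] * n, []
    for f in range(k):
        held = range(f * n // k, (f + 1) * n // k)       # f-th contiguous part
        train = [i for i in range(n) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held:
            preds[i] = model(xs[i])
        models.append(model)
    return preds, models

def fit_stacking(xs, ys, k=6):
    """Two-layer stacking: base models on k folds, linear meta-learner on top."""
    p1, m1 = out_of_fold(fit_line, xs, ys, k)
    p2, m2 = out_of_fold(fit_knn, xs, ys, k)
    # Meta-learner: least-squares weights for y ~ w1 * p1 + w2 * p2 (2x2 normal equations).
    a11 = sum(u * u for u in p1)
    a12 = sum(u * v for u, v in zip(p1, p2))
    a22 = sum(v * v for v in p2)
    b1 = sum(u * y for u, y in zip(p1, ys))
    b2 = sum(v * y for v, y in zip(p2, ys))
    det = a11 * a22 - a12 * a12
    w1, w2 = (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det
    def predict(x):
        base1 = sum(m(x) for m in m1) / k                # average the k fold instances
        base2 = sum(m(x) for m in m2) / k
        return w1 * base1 + w2 * base2
    return predict

# Hypothetical irradiance (W/m^2) and PV power (kW) samples.
irradiance = [50.0 * i for i in range(24)]
power = [0.2 * g - 0.00005 * g * g for g in irradiance]
model = fit_stacking(irradiance, power)
print(round(model(525.0), 2))
```

Training the meta-learner only on out-of-fold predictions keeps it from seeing base-model outputs on data those models were fitted to, which is the main reason for the k-fold split.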

4.

CASE STUDY

To validate the efficacy of the proposed method, comparisons are drawn between the proposed method and classical approaches, including the data-driven models used as first-level base learners in the proposed method. The performance of the models can be greatly influenced by the selection of model hyperparameters. To make the experimental comparisons more convincing, the hyperparameters of the individual models are optimized using a whale optimization method.

Data are collected from a distributed photovoltaic power station for three representative days, each representing a different weather condition (sunny, cloudy, and rainy). The predicted results and indicators are illustrated in Figure 3.

Figure 3.

Comparison of forecast results: (a) sunny day, (b) rainy day, (c) cloudy day.

Figure 3(a) presents the single-day forecasting results under clear-sky conditions. The solid lines represent the predictions of the various models, while the dashed lines depict the absolute error rates relative to the ground-truth values. During sunny conditions, the photovoltaic curve varies smoothly and the data trends remain relatively stable, so all models demonstrate good learning and forecasting capabilities. However, when fluctuations occur in the photovoltaic power curve around 10:00, the Stacking ensemble model and LSTM track the power variation trends better. According to the absolute error rate curves, the prediction accuracy of all models is higher from 9:00 to 17:00. At the endpoints, the error rates of the proposed model are below 0.1, indicating superior predictive performance through model fusion.

Figure 3(b) displays the single-day forecasting results for rainy weather conditions. During rainy weather, the fluctuations in photovoltaic output increase noticeably. For periods with relatively stable trends, all models perform well in their predictions. However, for segments of photovoltaic power with strong and unpredictable fluctuations, the models struggle to track the variations accurately. The mechanistic and SVM models show larger deviations at peak and inflection points, while the data-driven models perform relatively accurately. Additionally, due to the low irradiance during rainy weather and its non-uniform impact on photovoltaic power, the mechanistic model exhibits higher prediction error rates at the endpoints. Based on the error rate curves, the Stacking model maintains a consistent level of prediction accuracy, showing advantages in both accuracy and stability.

Figure 3(c) illustrates photovoltaic forecasting under partly cloudy weather conditions. The thickness of the cloud cover indirectly affects the absorption of solar radiation by the photovoltaic panels, leading to frequent fluctuations in the photovoltaic curve, albeit less intense than during rainy weather. During the rising phase of photovoltaic power, the predictions of Stacking, LSTM, and LGBM closely align with the actual curve. In the middle of the curve, where fluctuations are frequent, Stacking outperforms the other models in capturing values closer to the actual ones, except at peak points and their immediate vicinity. During the declining phase of photovoltaic power, Stacking and LSTM exhibit relatively good tracking performance. The relative error curves show that models other than Stacking and LSTM experience increased error rates at power discontinuity points, while LSTM shows larger prediction errors at both ends of the curve.

5.

SUMMARY

This study addresses the issue of insufficient meteorological data for power prediction at distributed photovoltaic (PV) sites. To overcome this challenge, a hybrid data-driven approach is proposed. Experimental validation using real-world data yields the following findings:

  • 1) Through data cleaning and data reconstruction, the issues of missing data and strong randomness in distributed photovoltaic power station data have been addressed.

  • 2) Feature engineering is applied to optimize the input and enhance the performance of the individual learners in the data-driven model.

  • 3) By employing the Stacking ensemble learning framework, which combines the strengths of the data-driven models, superior predictive results are achieved, improving the model's generalization performance and prediction accuracy.

6.

ACKNOWLEDGMENT

This work is supported by the Science and Technology Project of State Grid Corporation of China (Grant No. 5108-202218280A-2-226-XG).

REFERENCES

[1] Zhang Tinghui, Xie Mingcheng, et al., “System dynamics simulation of shared value of distributed photovoltaic and its impact on distribution network,” Automation of Electric Power Systems, 45 (18), 35–44 (2021).

[2] Barbieri F., Rajakaruna S., Ghosh A., “Very short-term photovoltaic power forecasting with cloud modeling: a review,” Renewable and Sustainable Energy Reviews, 75, 242–263 (2017).

[3] Sheng H. M., Xiao J., Cheng Y. H., et al., “Short-term solar power forecasting based on weighted Gaussian process regression,” IEEE Transactions on Industrial Electronics, 65 (1), 300–308 (2018). https://doi.org/10.1109/TIE.2017.2714127

[4] Liu F. T., Ting K. M., Zhou Z., “Isolation-based anomaly detection,” ACM Transactions on Knowledge Discovery from Data, 6 (1), 1–39 (2012). https://doi.org/10.1145/2133360.2133363

[5] Jiao Tianli, Zhang Jianmin, Li Xiong, et al., “Spatial clustering method for large-scale distributed user photovoltaics based on spatial correlation,” Automation of Electric Power Systems, 43 (21), 97–102 (2019).

[6] Lin S., Feng Y., “Research on Stock Price Prediction Based on Orthogonal Gaussian Basis Function Expansion and Pearson Correlation Coefficient Weighted LSTM Neural Network,” Advances in Computer Signals and Systems, 6, 5 (2022).

[7] Shi Jiaqi, Zhang Jianhua, “Load forecasting based on multi-model by Stacking ensemble learning,” Proceedings of the CSEE, 39, 4032–4042 (2019).
(2025) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Rui Li, Hui Hui, Ming Wang, Yang Zhao, Miao Zhao, and Qianfan Zhou "A fusion model for predicting distributed photovoltaic power", Proc. SPIE 13513, The International Conference Optoelectronic Information and Optical Engineering (OIOE2024), 135133Z (15 January 2025); https://doi.org/10.1117/12.3056624
KEYWORDS
Photovoltaics

Data modeling

Solar radiation models

Interpolation

Machine learning

Data analysis

Meteorology
