Complementary ARIMA and XGBoost carbon emission timing forecasting methods for different terminal energy consumption industries
26 September 2024
Yi Zhang, Jueheng Wang, Renjie Chen, Shu Zhang, Xingwei Liao, Jinlin Xie
Proceedings Volume 13279, Fifth International Conference on Green Energy, Environment, and Sustainable Development (GEESD 2024); 1327946 (2024) https://doi.org/10.1117/12.3044690
Event: Fifth International Conference on Green Energy, Environment, and Sustainable Development, 2024, Mianyang, China
Abstract
“Carbon peak” and “carbon neutrality” are major strategic decisions put forward by China, the largest developing country, to cope with global temperature rise. Studying and predicting the carbon dioxide emissions of different terminal energy consumption industries is important work for carbon emission reduction. Based on data for different terminal energy consumption industries in a central province, the ARIMA model and the XGBoost model were used to forecast the carbon emissions of these industries in time series, and the outliers were analyzed. The results show that the accuracy of the time series predictions can be optimized by adjusting model parameters, but the ARIMA model is not suitable for the carbon emission time series of the energy production and processing conversion industry, and the XGBoost model is not suitable for the carbon emission time series of residential life. The main reason is that the carbon emission series of the energy production and processing conversion industry is nonlinear, while the ARIMA model mainly handles linear problems; conversely, the carbon emission series of residential life has prominent linear characteristics that cannot be described by a nonlinear model, resulting in large deviations in the prediction results. According to the results of this paper, the carbon emissions of the seven terminal energy consumption Sectors can be predicted in time series by using the ARIMA and XGBoost models in a complementary way, and the MAPE values between the real and predicted values are all well below 20%. Among them, agriculture, industry, construction, transportation, and services and other industries can be predicted accurately by either of the two methods. Overall, this provides more accurate guidance for carbon emission reduction policy across the whole field of terminal energy consumption industries.

1. INTRODUCTION

In order to actively respond to the carbon peak and carbon neutrality development strategy, most countries in the world have committed to carbon peak and carbon neutrality target dates, and China declared in the general debate of the 75th session of the United Nations General Assembly that it will achieve carbon peak by 2030 and carbon neutrality by 2060. Since then, provinces, cities, and counties have responded positively and are gradually carrying out research on implementation paths for carbon peaking, energy conservation, and emission reduction. The formulation and implementation of these paths are inseparable from the accurate prediction of carbon emissions, especially prediction based on different energy consumption industries.

At present, from the perspective of research objects, most studies on carbon dioxide emissions predict carbon emissions for a single terminal energy industry. For example, the LMDI method1,2 has been used to analyze the influencing factors of carbon emissions in terminal energy industries, the STIRPAT model has been used to analyze and predict the influencing factors of carbon emissions from agriculture3, industry4,5 and transportation6, and the grey model7 has been used to predict carbon emissions from agriculture. The construction industry often uses machine learning8,9 to predict carbon emissions, and the carbon emissions of residential life are predicted by the ARIMA model. For example, Xu10 analyzed carbon emissions in Northwest China from the perspective of quantity and spatial pattern based on the ARIMA model and the standard deviation ellipse, and found that carbon emissions in Northwest China showed a slowly rising trend and that the main carbon emission area is moving to the northwest.

There are few studies on carbon emission prediction for the service industry, and especially for all seven terminal energy consumption industries together; existing studies only comprehensively predict the carbon emissions of some industries11. From the perspective of forecasting methods, different terminal energy consumption industries have different hidden relationships in their data, yet a single model is usually used for carbon emission forecasting. At present, the mainstream models for CO2 emission prediction include the STIRPAT model12, the LEAP model13, and the BP neural network model. For example, Pan14 applied the STIRPAT model combined with scenario analysis to estimate the timing of carbon peaking, and Huang et al.15 applied the STIRPAT model to study the positive promoting effect of population, GDP, and energy intensity on carbon emissions from energy consumption. The LEAP forecasting model has a wide range of applications and can be used to analyze the carbon emission trends of various industries in cities, provinces, and countries. Ma et al.16,17 applied the LEAP model to study the provincial and municipal transportation industry, and She et al.18 studied the key influencing factors of China's energy carbon emissions by combining the Markov model and the LEAP model. The advantage of the BP neural network model for carbon emission prediction is its high accuracy, but its disadvantage is that the prediction path is not interpretable and other models need to be combined with it. For example, Ji et al.19 combined grey correlation analysis with a BP neural network model, and Zhang et al.20 optimized a BP neural network model with IPSO and found that the IPSO-optimized BP neural network had a lower error in predicting carbon emissions in Shandong Province.

Some scholars combine ARIMA with other models to forecast carbon emissions and find that it has the advantages of simple operation and few variable requirements. For example, Hu et al.21,22 combined BP and ARIMA models for analysis and found that the prediction error for China's carbon emissions is closer to zero than with BP or ARIMA alone. Han et al.23 combined the recursive least squares method with a forgetting factor and an ARIMA model to make short-term one-step predictions for transportation.

In summary, in view of the current research status of carbon emission prediction and focusing on different end-use energy consumption industries, the XGBoost model is good at handling small samples and highly volatile data24,25, and the linear and nonlinear relationships of different Sectors should be considered comprehensively. Therefore, ARIMA and XGBoost are used in a complementary way to predict the carbon emission time series of the different end-use energy consumption industries (agriculture, industry, construction, services and other industries, energy production and processing conversion, transportation, and residential life), and the MAPE (mean absolute percentage error) is used to evaluate and verify the prediction results and to ensure the accuracy and feasibility of the complementary model. This research provides more accurate guidance for the study of the dual-carbon path.

2. MATHEMATICAL MODELING

2.1 ARIMA model (autoregressive integrated moving average model)

The ARIMA model is composed of three parts: differencing (I), the autoregressive model (AR), and the moving average model (MA). In general, the ARIMA model first makes the time series stationary by differencing and then makes predictions with the AR and MA models.

2.1.1 Differencing

Differencing is the calculation of the difference between adjacent data points in a numerical sequence. In time series analysis, differencing is used to transform a non-stationary series into a stationary one by reducing or eliminating trends and seasonal variation. A new time series is obtained by first-order differencing, in which each value is the difference between two adjacent values in the original series. For a time series Y_t, the first difference is defined as:

$$\Delta Y_t = Y_t - Y_{t-1} \tag{1}$$

where Y_t is the current value and Y_{t-1} is the previous value.

A new sequence is obtained by second-order differencing, in which each value is the difference between two adjacent first-order differences of the original series. The second-order difference is defined as:

$$\Delta^2 Y_t = \Delta Y_t - \Delta Y_{t-1} \tag{2}$$

where ΔY_t is the first-order difference at the current moment and ΔY_{t-1} is the first-order difference at the previous moment.

For the ARIMA model, taking too many differences can lead to overfitting, so the number of differences is limited to two. If the time series is still non-stationary after two differences, the problem studied is not suitable for the ARIMA model.
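As a small illustration (not the paper's data), first- and second-order differencing of an annual emission series can be computed with pandas:

```python
import pandas as pd

# Hypothetical annual carbon-emission series (illustrative values only)
y = pd.Series([120.5, 131.2, 145.8, 150.1, 162.3, 158.9],
              index=range(1995, 2001), name="Y")

d1 = y.diff()          # first-order difference: Y_t - Y_{t-1}
d2 = y.diff().diff()   # second-order difference: dY_t - dY_{t-1}

print(pd.concat({"Y": y, "dY": d1, "d2Y": d2}, axis=1))
```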

2.1.2 ARMA models

The ARMA model is a combination of the autoregressive (AR) model and the moving average (MA) model. The AR model accounts for the effect of past observations on the current value: the current value of the time series is a linear combination of its past p values, where p is the autoregressive lag order and determines how many past observations the model looks back on. The mathematical expression of the AR model is as follows:

$$Y_t = \sum_{i=1}^{p} \varphi_i Y_{t-i} + \varepsilon_t \tag{3}$$

where ε_t is the prediction error, assumed to follow a normal distribution, and φ_i are the regression coefficients, whose optimal values are fitted automatically when the model is trained.

The moving average (MA) model is a linear combination of past residuals, i.e., the prediction errors of the AR model. By modeling the fluctuation of the residuals, the influence of random fluctuations on the prediction can be effectively reduced. The current value of the time series is expressed as a linear combination of the past q lagged errors, where q is the moving average lag order and determines how much white noise the model looks back on. The mathematical expression of the MA model is:

$$Y_t = \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} \tag{4}$$

In equation (4), ε_{t-1}, ε_{t-2}, ..., ε_{t-q} are the random disturbance terms of the past q periods (a zero-mean white noise sequence), and θ_i are the smoothing coefficients, whose optimal values are fitted automatically during model training.

Therefore, the expression of the ARMA model is:

$$Y_t = \sum_{i=1}^{p} \varphi_i Y_{t-i} + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} \tag{5}$$
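For reference, an ARIMA(p, d, q) model of this form can be fitted and used for multi-step forecasting with statsmodels. The sketch below uses a hypothetical series and an assumed order, not the paper's data:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical annual emission series (integer year index kept for simplicity;
# statsmodels will warn that the index carries no frequency information)
y = pd.Series([120.5, 131.2, 145.8, 150.1, 162.3, 158.9, 171.4,
               169.0, 180.2, 188.7, 185.1, 197.6],
              index=range(1995, 2007))

result = ARIMA(y, order=(1, 1, 1)).fit()  # assumed (p, d, q) order for illustration
print(result.forecast(steps=3))           # forecast the next three years
```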

2.2 XGBoost model

XGBoost, short for "Extreme Gradient Boosting", is a gradient boosting model proposed by Chen and Guestrin26. Its operating principle is to keep adding decision trees, each of which learns a new function to fit the residual of all previous trees. The whole process is similar to using gradient descent to continuously optimize an objective function in order to achieve the best fit. Each decision tree acts as a "weak learner", and the trees are combined into a powerful model. The XGBoost model first scores the structure of a tree with the objective function to judge its quality, and then determines the shape and depth of each decision tree through node splitting.

The prediction of the XGBoost model for each sample can be expressed as:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i) \tag{6}$$

where f_k(x_i) is the value assigned to sample x_i by the k-th decision tree and K is the number of decision trees.

2.2.1 Objective function

Each decision tree corresponds to an objective function. The objective function of XGBoost consists of the loss between the predicted and real values plus a regularization term, and is defined as:

$$Obj = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k) \tag{7}$$

where l(y_i, ŷ_i) is the loss of the i-th training sample, n is the number of samples, and Ω(f_k) is a regularization term representing the complexity of the k-th decision tree.

After accounting for the residuals of past predictions, the prediction after the K-th tree is added can be expressed as:

$$\hat{y}_i^{(K)} = \hat{y}_i^{(K-1)} + f_K(x_i) \tag{8}$$

where ŷ_i^{(K-1)} denotes the prediction of the first K-1 decision trees, which are fixed after training, and f_K(x_i) is the value assigned to the sample by the K-th tree.

Therefore, the objective function can be rewritten as:

$$Obj^{(K)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(K-1)} + f_K(x_i)\right) + \Omega(f_K) + \text{constant} \tag{9}$$

XGBoost approximates the loss function by a second-order Taylor expansion, so the objective function is approximated as:

$$Obj^{(K)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(K-1)}\right) + g_i f_K(x_i) + \frac{1}{2} h_i f_K^{2}(x_i) \right] + \Omega(f_K) \tag{10}$$

where g_i and h_i are the first and second derivatives of the loss with respect to the previous prediction:

$$g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(K-1)}\right)}{\partial \hat{y}_i^{(K-1)}}, \qquad h_i = \frac{\partial^{2} l\left(y_i, \hat{y}_i^{(K-1)}\right)}{\partial \left(\hat{y}_i^{(K-1)}\right)^{2}} \tag{11}$$

The regularization term is used to reduce the complexity of the model. When XGBoost trains the model, the complexity can be controlled through the number of leaf nodes, the depth of the tree, and the values of the leaf nodes, and the regularization term can be written explicitly as:

$$\Omega(f_K) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^{2} \tag{12}$$

where T is the number of leaf nodes of the K-th decision tree, ω_j is the output value of leaf node j, and Σ_{j=1}^{T} ω_j² is the squared L2 norm of all leaf-node outputs of the K-th tree. λ and γ are hyperparameters: the larger they are, the more heavily the number of leaf nodes and the leaf weights are penalized.

Since the loss of the first K-1 trees has already been determined, the part of the objective function that does not involve f_K(x_i) is a constant and does not affect the optimal solution for f_K(x_i). The objective function can therefore be converted into the following expression:

$$Obj^{(K)} \approx \sum_{i=1}^{n} \left[ g_i f_K(x_i) + \frac{1}{2} h_i f_K^{2}(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^{2} \tag{13}$$

In the K-th decision tree, each sample falls into exactly one leaf node, so summing over all samples is equivalent to summing over the samples on each leaf node. The sample set on node j is defined as I_j = {x_i | q(x_i) = j}, where q(x_i) is the index function that maps a sample to its leaf node, and the regression value on leaf node j is ω_j = f_K(x_i), i ∈ I_j. The objective function can then be written as follows:

$$Obj^{(K)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) \omega_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) \omega_j^{2} \right] + \gamma T \tag{14}$$

To simplify the expression further, let

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i \tag{15}$$

Then the objective function can be simplified as follows:

$$Obj^{(K)} = \sum_{j=1}^{T} \left[ G_j \omega_j + \frac{1}{2} \left( H_j + \lambda \right) \omega_j^{2} \right] + \gamma T \tag{16}$$

Only after the shape of the tree is given can the minimum of the objective function and the optimal leaf values be determined. Assume therefore that the shape of the tree is fixed, so that the samples on each node, and hence g_i, h_i, and T, are determined; the output ω_j of each leaf node should then minimize the objective function. Taking the extreme point of this quadratic function gives:

$$\omega_j^{*} = -\frac{G_j}{H_j + \lambda} \tag{17}$$

Substituting back, the final value of the objective function is obtained:

$$Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T \tag{18}$$

The value of the objective function represents the score of the decision tree: the smaller the value, the better the tree structure. Obj is the sum of the scores of all leaf nodes.
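As a small numeric illustration of equations (17) and (18) (made-up gradients and hyperparameters, not values from the paper):

```python
import numpy as np

# Hypothetical per-sample gradients/Hessians grouped by leaf node
leaves = {
    0: {"g": np.array([0.8, -0.3, 1.1]), "h": np.array([1.0, 1.0, 1.0])},
    1: {"g": np.array([-1.2, -0.7]),     "h": np.array([1.0, 1.0])},
}
lam, gamma = 1.0, 0.1  # assumed regularization hyperparameters

obj = 0.0
for j, leaf in leaves.items():
    G, H = leaf["g"].sum(), leaf["h"].sum()
    w_star = -G / (H + lam)                  # optimal leaf weight, eq. (17)
    obj += -0.5 * G**2 / (H + lam) + gamma   # leaf contribution to Obj, eq. (18)
    print(f"leaf {j}: G={G:.2f}, H={H:.2f}, w*={w_star:.3f}")

print(f"tree score Obj = {obj:.3f}")
```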

2.2.2 Node splitting

The previous section showed that the optimal leaf values ω_j can only be found once the shape of the tree is fixed; this section discusses how the structure of the tree is determined. The difference between trees lies mainly in how the nodes are divided. XGBoost grows trees by recursively splitting nodes with a greedy algorithm27, and it also supports an approximate algorithm (an approximation of the greedy algorithm) to handle data that exceeds memory or to enable parallel computation.

In each iteration, the greedy algorithm sorts the values of each feature, traverses all candidate split points, takes the best gain (Gain) as the split gain of that feature, and selects the feature with the best split gain as the splitting feature of the current node. Node S is then split at its optimal split point into new nodes L and R.

If Gain > 0, the split is worthwhile. The gain is defined as follows:

$$Gain = \frac{1}{2} \left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{\left(G_L + G_R\right)^{2}}{H_L + H_R + \lambda} \right] - \gamma \tag{19}$$

In the formula, G_L, H_L, G_R, and H_R are defined for the left and right child nodes in the same way as G_j and H_j in Section 2.2.1.

Candidate tree structures are enumerated in this way, added to the model, and the process is repeated. To avoid overfitting, a maximum tree depth is set; when the gain brought by a split is less than a set threshold, splitting stops and the branch is pruned. This yields the shape of each XGBoost decision tree, and combined with the objective function of Section 2.2.1, the value of every term in the XGBoost model can be obtained.

To further counter overfitting, missing values, and sparse features, XGBoost also introduces column sampling and a sparsity-aware strategy. Column sampling means that during node splitting, instead of considering all remaining features, a random subset of them is selected as candidate features, which weakens the joint effect of correlated features and prevents overfitting. Sparsity awareness means that XGBoost treats missing values and sparse feature values together as a whole, and the traversal of split points skips this whole, improving the efficiency of the operation.
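To make the split decision concrete, the sketch below evaluates the gain of equation (19) for one hypothetical candidate split; all numbers are made up for illustration:

```python
lam, gamma = 1.0, 0.1          # assumed regularization hyperparameters

# Hypothetical gradient/Hessian sums for the two children of a candidate split
G_L, H_L = -2.1, 3.0           # left child
G_R, H_R = 1.6, 2.0            # right child

def structure_score(G, H):
    """Term G^2 / (H + lambda) appearing in the gain formula."""
    return G * G / (H + lam)

gain = 0.5 * (structure_score(G_L, H_L) + structure_score(G_R, H_R)
              - structure_score(G_L + G_R, H_L + H_R)) - gamma

print(f"Gain = {gain:.3f} -> {'split' if gain > 0 else 'do not split'}")
```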

3. RESULTS AND OUTLIER ANALYSIS

3.1 Analysis of ARIMA carbon emission time series prediction results

The ARIMA(p, d, q) model is used to predict the carbon emissions of the different terminal energy industries in time series. First, the carbon emission time series of the different terminal energy consumption industries in a central province from 1995 to 2020 are tested for stationarity; if they are not stationary, differencing is applied to determine the difference order d. Then the optimal p and q orders of the ARIMA model are determined, and finally the feasibility of the ARIMA(p, d, q) model is tested.

3.1.1 Stationarity test of the data

The ARIMA model requires the time series to be stationary, i.e., the mean and variance of the data are constant over time. Therefore, the ADF test (unit root test, one of the stationarity tests) is carried out sector by sector. The principle is to test whether the time series contains a unit root; if a unit root is present, the series is not stationary. The ADF test of the original carbon emissions of each Sector is shown in Table 1 (the 1%, 5%, and 10% columns are the critical values at the 1%, 5%, and 10% significance levels).

Table 1. ADF test of original carbon emissions by Sector.

Sector | ADF | p | 1% | 5% | 10%
Agriculture (Sector 1) | -3.9115 | 0.0020 | -3.9240 | -3.0685 | -2.6739
Industry (Sector 2) | -1.0185 | 0.7465 | -3.7239 | -2.9865 | -2.6328
Construction (Sector 3) | 1.7138 | 0.9982 | -3.8591 | -3.0421 | -2.6609
Services and other (Sector 4) | 0.8025 | 0.9917 | -3.8893 | -3.0544 | -2.6670
Energy production and processing conversion (Sector 5) | 0.8112 | 0.9918 | -3.9240 | -3.0685 | -2.6739
Transport (Sector 6) | 1.9011 | 0.9985 | -3.7377 | -2.9922 | -2.6358
Residential life (Sector 7) | -0.1504 | 0.9442 | -3.8092 | -3.0217 | -2.6507

As can be seen from the table, the p-value (the probability corresponding to the ADF statistic) of agriculture is less than 0.05, which indicates that its series is stationary. For the other Sectors, p is greater than 0.05 and the ADF statistic is greater than the 5% critical value, so the ADF test fails and further differencing is needed. The first-difference results are shown in Figure 1.

Figure 1. Results of first-difference processing for Sectors 2-7 from 1995 to 2020.

It can be seen from Figure 1 that the values of the six Sectors differ after differencing: the differenced carbon emissions of Sector 3 are small and fluctuate within 200, those of Sectors 2 and 5 fluctuate within 6000, and those of Sectors 4, 6, and 7 fluctuate within 1000. The differenced values of some Sectors are distributed on both sides of the axis, with obvious periodicity. The ADF test was then applied to the differenced data; the results for the first difference of Sectors 2-7 are shown in Table 2. The p-values of Sectors 2-7 after first differencing are all less than 0.05, indicating that the differenced series are stationary.

Table 2. First-difference ADF test results for Sectors 2-7.

Sector | ADF | p | 1% | 5% | 10%
2 | -5.6929 | 0.0000 | -3.7377 | -2.9922 | -2.6357
3 | -5.0685 | 0.0000 | -3.7377 | -2.9922 | -2.6357
4 | -5.1801 | 0.0000 | -3.7377 | -2.9922 | -2.6357
5 | -3.3377 | 0.0133 | -3.9240 | -3.0685 | -2.6739
6 | -5.6458 | 0.0000 | -3.7377 | -2.9922 | -2.6357
7 | -5.1766 | 0.0000 | -3.7377 | -2.9922 | -2.6357
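For reference, the stationarity checks above can be reproduced with statsmodels' adfuller. This is a minimal sketch with a randomly generated placeholder series standing in for one Sector's 1995-2020 data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def adf_report(series: pd.Series, label: str) -> float:
    """Run the ADF test and print the statistic, p-value, and critical values."""
    stat, pvalue, _, _, crit, _ = adfuller(series.dropna())
    print(f"{label}: ADF={stat:.4f}, p={pvalue:.4f}, "
          f"1%={crit['1%']:.4f}, 5%={crit['5%']:.4f}, 10%={crit['10%']:.4f}")
    return pvalue

# Placeholder random-walk series (replace with a Sector's real emissions)
rng = np.random.default_rng(0)
emissions = pd.Series(100 + np.cumsum(rng.normal(5, 2, 26)), index=range(1995, 2021))

if adf_report(emissions, "level") >= 0.05:          # non-stationary at the 5% level
    adf_report(emissions.diff(), "1st difference")  # re-test after first differencing
```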

3.1.2 Determination of the optimal p and q orders of ARIMA

ARMA is a combination of the AR model (autoregressive model) and the MA model (moving average model), and ARIMA applies differencing to stabilize the data before fitting an ARMA model. The AIC criterion28 (Akaike information criterion, which balances goodness of fit against the number of model parameters) was used for each Sector to determine the p and q parameters of the ARIMA model. To avoid overfitting, the AIC value corresponding to each pair of p and q values was calculated, and the p and q values with the minimum AIC were found iteratively, determining the optimal ARIMA(p, d, q) model. The ARIMA orders screened for each Sector by the AIC criterion are shown in Table 3.

Table 3. ARIMA orders screened by the AIC criterion for each Sector.

Sector | d | p | q
1 | 0 | 1 | 1
2 | 1 | 3 | 2
3 | 1 | 1 | 4
4 | 1 | 1 | 2
5 | 1 | 2 | 5
6 | 1 | 1 | 1
7 | 1 | 5 | 5

Table 3 shows that, under the AIC criterion, the model with the lowest AIC value for Sector 1 is ARIMA(1, 0, 1), and the minimum-AIC model for Sector 2 is ARIMA(3, 1, 2). In the same way, the minimum-AIC model and the applicable ARIMA(p, d, q) can be obtained for Sectors 3-7.
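The AIC-based order search described above can be sketched as a small grid search with statsmodels (the search ranges are assumptions; the BIC used in Section 3.2 is available in the same way via result.bic):

```python
import itertools
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def select_order(y: pd.Series, d: int, max_p: int = 5, max_q: int = 5):
    """Return the (p, d, q) order with the smallest AIC over a small grid."""
    best_order, best_aic = None, float("inf")
    for p, q in itertools.product(range(max_p + 1), range(max_q + 1)):
        try:
            result = ARIMA(y, order=(p, d, q)).fit()
        except Exception:
            continue  # skip orders that fail to estimate
        if result.aic < best_aic:
            best_order, best_aic = (p, d, q), result.aic
    return best_order, best_aic

# Example (hypothetical): order, aic = select_order(y_train, d=1)
```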

3.1.3 Feasibility test of the ARIMA model

The data from 1995 to 2017 were used as the training set, and the data from 2018 to 2020 were used as the test set. MAPE values were used to characterize the accuracy of the model predictions; the MAPE values for 2018-2020 are shown in Figure 2.

Figure 2. MAPE values between the real and predicted carbon emissions of each Sector under the ARIMA model, 2018-2020.

As can be seen from Figure 2, except for Sectors 5 and 7 in 2018 and 2019, the MAPE values of the ARIMA predictions for all Sectors are within 20%, indicating that ARIMA is suitable for the carbon emission time series prediction of most end-use energy consumption industries. The predicted outliers of Sectors 5 and 7 require further analysis.
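As a reference, the train/test evaluation can be sketched as follows, assuming pandas Series y_train (1995-2017) and y_test (2018-2020) for one Sector (hypothetical names) and an order from Table 3:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def evaluate_arima(y_train: pd.Series, y_test: pd.Series, order: tuple) -> float:
    """Fit ARIMA on the training years and report MAPE on the held-out years."""
    forecast = ARIMA(y_train, order=order).fit().forecast(steps=len(y_test))
    return mape(y_test.to_numpy(), forecast.to_numpy())

# Example (hypothetical): evaluate_arima(y_train, y_test, order=(3, 1, 2))
```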

3.2 ARIMA carbon emission prediction outlier analysis

The possible causes of the outliers were analyzed from the distribution of the outlier Sectors' carbon emissions over time. The real carbon emissions of Sector 5 from 1995 to 2020 and the predicted values on the test set are shown in Figure 3.

Figure 3. Distribution of the real carbon emissions of Sector 5 and the predicted values on the test set, 1995-2020.

As shown in Figure 3, the carbon emissions of Sector 5 show an overall downward trend over time, with certain cyclical characteristics. The predicted carbon emissions for 2018 and 2019 are 27.227 million tons and 23.354 million tons respectively, which are larger than the real values. This may be because the p and q values selected under the AIC criterion were inaccurate. The BIC criterion28 (Bayesian information criterion) adds a penalty term on top of the AIC criterion, and the more parameters there are, the more significant the penalty; it was therefore used first to correct the outliers, again selecting the p and q with the smallest BIC values to avoid overfitting.

Starting from the p and q values determined by the AIC criterion, the p and q values with the minimum BIC were searched iteratively. When Sector 5 used the ARIMA(4, 1, 4) model, the MAPE values from 2018 to 2020 were 6.76%, 46.17%, and 6.76% respectively. When Sector 5 used the ARIMA(4, 1, 3) model, the MAPE values from 2018 to 2020 were 31.34%, 17.1%, and 0.47% respectively, which is an improvement over the predictions under the AIC criterion, but the accuracy still needs to be improved further. It can therefore be concluded that ARIMA cannot meet the carbon emission forecasting needs of Sector 5, and another forecasting method is required.

Similarly, the outliers of Sector 7 were analyzed; the changes in the real and predicted carbon emission values over time are shown in Figure 4.

Figure 4. Distribution of real and predicted carbon emissions of Sector 7 from 1995 to 2020.

In Figure 4, the carbon emissions of Sector 7 show an overall upward trend over time, with a plateau in 2018, 2019, and 2020 and no obvious periodicity. The forecast value for Sector 7 in 2018 was 414.6 million tons, which is far too large; the ARIMA(5, 1, 5) model appears to have judged the 2018 growth to be exponential. The BIC criterion was also used to determine the order, and ARIMA(5, 1, 5) remained the best model under BIC, consistent with the AIC result, indicating that the AIC and BIC criteria provide weak guidance for selecting the optimal p and q of the ARIMA(p, d, q) model for Sector 7.

p and q values with a good fitting effect were then searched around the previously determined p and q, and the ARIMA(3, 1, 5) model was selected. The predicted carbon emissions of Sector 7 from 2018 to 2020 were 30.84 million tons, 4.0 million tons, and 34.30 million tons respectively, with MAPE values of 13.93%, 13.82%, and 4.78%. All MAPE values are below 15%, indicating that the ARIMA(3, 1, 5) model fits Sector 7 well and can be used for further prediction.

To sum up, ARIMA is devoted to exploring the internal linear relationships of a time series and fails to consider nonlinear relationships, resulting in the prediction outliers of Sector 5. Machine learning has a strong ability to handle nonlinear relationships, so the XGBoost model, which has been popular in machine learning in recent years, was introduced to predict the carbon emissions of the different end-use energy consumption industries.

3.3 Analysis of XGBoost carbon emission time series prediction results

XGBoost is one of the most widely used machine learning models at present. For the prediction of carbon emissions in the different end-use energy consumption industries, the time series data from 1995 to 2020 were processed first, the periodicity and trend of the time series were then enhanced, and the predicted values of the XGBoost model were tested. Finally, the carbon emissions of the different end-use energy consumption industries from 2018 to 2020 were predicted and compared with the real values to assess the feasibility of the XGBoost model.

3.3.1 Data processing and preprocessing

The data formats need to be unified: all values were rounded to two decimal places, and code was used to check whether the data were read successfully.

3.3.2 Periodicity and trend enhancement of the time series

Using XGBoost for time series forecasting requires the data to have a strong trend and periodicity. If the periodicity and trend of the data are not obvious, past data can be used to supplement the trend component and past averages can be used to supplement the periodic component, so that the periodicity and trend of the time series are better highlighted. The features collected are the original data, the data of the previous year, the data of the previous two years, and the average of the previous two years; for 1996, the previous-year, previous-two-year, and two-year-average features are represented by the 1995 values. A data set was thus obtained whose features are the carbon emissions together with all independent variables, the independent variables of the previous year, the independent variables of the previous two years, and the average of all independent variables over the previous two years. A sketch of this feature construction is given below.
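A minimal sketch of this feature construction with pandas, assuming a DataFrame df of annual values indexed by year (hypothetical names; the back-fill approximates the paper's rule of reusing the 1995 values for 1996):

```python
import pandas as pd

def add_lag_features(df: pd.DataFrame) -> pd.DataFrame:
    """Append previous-year, two-years-ago, and previous-two-year-average features."""
    out = df.copy()
    for col in df.columns:
        out[f"{col}_lag1"] = df[col].shift(1)                     # previous year
        out[f"{col}_lag2"] = df[col].shift(2)                     # two years ago
        out[f"{col}_mean2"] = df[col].rolling(2).mean().shift(1)  # mean of previous two years
    return out.bfill()  # earliest years fall back on the first available values

# Example (hypothetical): features = add_lag_features(df)
```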

3.3.3 Testing of the XGBoost predicted values

This is a small-sample prediction problem, so the training and test sets were obtained by randomly splitting the data in an 8:2 ratio. On the training set, the features of the current year, the previous year, the previous two years, and the two-year average were used to make predictions, and the MAPE between the predictions and the real values was computed. The MAPE values between the real and predicted values on the randomly split test set for each Sector are shown in Figure 5.

Figure 5. MAPE values between the real and predicted values on the randomly split XGBoost test set for each Sector.

In Figure 5, the MAPE between the real and predicted values of one test-set sample in Sector 4 is 21.38%, while the MAPE values of the other Sectors are all below 20%, indicating that the test set data are predicted well.
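The random 8:2 split and MAPE check can be sketched with xgboost and scikit-learn as follows, assuming a feature matrix X and target y built as in Section 3.3.2 (variable names and hyperparameters are assumptions):

```python
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

def fit_and_score(X, y, seed: int = 0) -> float:
    """Random 8:2 split, fit an XGBoost regressor, return the test MAPE in percent."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = XGBRegressor(n_estimators=15, max_depth=3, learning_rate=0.3)  # assumed settings
    model.fit(X_tr, y_tr)
    return mean_absolute_percentage_error(y_te, model.predict(X_te)) * 100

# Example (hypothetical): print(f"test MAPE = {fit_and_score(X, y):.2f}%")
```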

3.3.4 XGBoost prediction of test-set carbon emissions

Similarly, taking the data before 2018 as the training set and 2018, 2019, and 2020 as the test set, the MAPE values between the XGBoost predictions and the real values for each Sector from 2018 to 2020 are shown in Figure 6.

Figure 6. MAPE distribution of the XGBoost predicted and true values for each Sector from 2018 to 2020.

In Figure 6, the MAPE values of Sectors 1 to 6 on the test set are all below 20%; only the accuracy of Sector 7 in 2019 fails to meet the requirement, so the outliers of Sector 7 need to be analyzed.

3.4 XGBoost carbon emission prediction outlier analysis

Two approaches were tried for Sector 7 in XGBoost: the first was to adjust the number of decision trees, testing values from 15 to 1000; the second was to replace the previous two-year average of each feature with the previous three-year and previous five-year averages (enhancing periodicity) while dropping the previous-year and previous-two-year features (weakening trend). With the first approach, the MAPE value for 2018 remained stable at around 63%, and the MAPE values for 2019 and 2020 remained stable at around 11% and 0.3%.

With the second approach, the result was the same as with the first, while the MAPE of the five groups of randomly divided test sets (the random division described in Section 3.3.3) remained stable below 20%. The failure of the first approach is most likely because the XGBoost model trained on Sector 7 data before 2018 already reduced the error between prediction and true value to below 0.1% after 15 decision trees, so the additional trees barely modify the predicted value. The failure of the second approach may be that adding the past features and past-average features did not change the features selected for Sector 7. Therefore, XGBoost cannot meet the carbon emission prediction needs of Sector 7, although its prediction of Sector 5 is accurate.
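The decision-tree-count sweep from the first approach can be sketched as follows (hypothetical variable names, reusing the split from the earlier sketch):

```python
from sklearn.metrics import mean_absolute_percentage_error
from xgboost import XGBRegressor

def sweep_n_estimators(X_train, y_train, X_test, y_test, grid=(15, 50, 100, 500, 1000)):
    """Refit XGBoost with different tree counts and report the test MAPE for each."""
    for n in grid:
        model = XGBRegressor(n_estimators=n, max_depth=3, learning_rate=0.3)
        model.fit(X_train, y_train)
        mape = mean_absolute_percentage_error(y_test, model.predict(X_test)) * 100
        print(f"n_estimators={n:4d}: test MAPE = {mape:.2f}%")

# Example (hypothetical): sweep_n_estimators(X_train, y_train, X_test, y_test)
```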

To sum up, XGBoost also has limitations and cannot meet the carbon emission forecasting needs of Sector 7. The carbon emissions of the different energy industries therefore need to be considered comprehensively, with ARIMA and XGBoost complementing each other.

4. CONCLUSION

According to the results above, the outliers obtained with the ARIMA model occur in Sectors 5 and 7: the outliers in Sector 5 could not be corrected by adjustment, while the prediction accuracy for the outliers in Sector 7 improved significantly after the parameters p and q were changed. The outlier obtained with the XGBoost model occurs in Sector 7 and could not be corrected by adjustment. Based on these results, the main conclusions are as follows:

  • (1) The carbon emissions of the seven end-use consumption Sectors can be predicted with the ARIMA and XGBoost models. The carbon emissions of Sector 5 can only be predicted with the XGBoost model, the carbon emissions of Sector 7 can only be predicted with the ARIMA model, and the carbon emissions of the other end-use consumption Sectors can be predicted with either model.

  • (2) The ARIMA model mainly captures linear relationships, while XGBoost mainly captures nonlinear relationships. The carbon emission prediction of energy production and processing conversion (Sector 5) is a nonlinear problem, whereas the linear characteristics of the carbon emissions of residential life (Sector 7) are prominent and cannot be characterized by a nonlinear model. The linear and nonlinear characteristics of the other Sectors are not significant.

  • (3) When the two models are used in a complementary way to predict the carbon emission time series of the whole field, the predicted MAPE value of each end-use consumption Sector is below 20%, indicating good prediction accuracy; this can provide detailed guidance for developing carbon emission reduction paths in various industries.

ACKNOWLEDGEMENTS

This work was supported by the Hunan Provincial Key Research and Development Program project "Carbon Emissions Monitoring and Accounting Theory, Method and System Research" (No. 2023SK2078).

REFERENCES

[1] Zhang, G. X. and Su, Z. X., "Analysis of influencing factors and scenario prediction of transport carbon emissions in the Yellow River Basin," Management Review, 32(12), 283–294 (2019). https://doi.org/10.14120/j.cnki.cn11-5057/f.2020.12.022
[2] Feng, B., "Calculation and Analysis of Carbon Dioxide Emissions and Energy and Environmental Efficiency of Construction Industry," Tianjin (2015).
[3] Kuang, A. P. and Hu, C., "Influencing factors and trend prediction of agricultural carbon emissions in Guangxi," Journal of Southwest Forestry University (Social Science), 4(02), 5–13 (2019).
[4] Yuan, X. L., Xi, J. H., Li, C. P., et al., "Research on peak carbon emission prediction and emission reduction potential of China's industrial Sectors," Statistics and Information Forum, 35(09), 72–82 (2019).
[5] Wang, Y., Bi, Y. and Wang, E. D., "Scenario prediction and emission reduction potential assessment of China's industrial carbon emissions peaking," China Population, Resources and Environment, 27(10), 131–140 (2017).
[6] Li, X. Y., Tan, X. Y., Wu, R., et al., "Research on carbon peak and carbon neutral path in transportation," Engineering Science of China, 3(06), 15–21 (2019).
[7] Bai, Y. X., Wang, L. J. and Sheng, M. Y., "Empirical study on carbon emissions from agricultural production in the karst region of central Guizhou," China Agricultural Resources and Regional Planning, 42(03), 150–157 (2019).
[8] Gao, S. H., Liu, Y. S., Li, X. T., et al., "Research on influencing factors and prediction of carbon emission from construction industry in China," Henan Science, 37(08), 1344–1350 (2019).
[9] Lv, Y., Song, H. and Nan, X. Y., "Prediction of peak carbon emissions from construction industry in Xinjiang based on scenario analysis," Modern Electronic Technology, 46(15), 121–127 (2023).
[10] Xu, L., Qu, J. S., Li, H. J., et al., "Analysis and prediction of living carbon emissions in Northwest China," Arid Land Geography, 42(5), 1166–1175 (2019).
[11] Zhang, F., Yin, X. Q. and Dong, H. Z., "Combined grey prediction model is applied in predicting carbon emissions in Shandong province," Journal of Environmental Engineering, 33(2), 147–152 (2015).
[12] Wei, L., Feng, X., Liu, P. and Wang, N., Ore Geology Reviews, 159, 105504 (2023).
[13] Zhang, C. and Luo, H., "Research on carbon emission peak prediction and path of China's public buildings: Scenario analysis based on LEAP model," Energy & Buildings, 289, 113053 (2023). https://doi.org/10.1016/j.enbuild.2023.113053
[14] Pan, D., Li, N., Li, F., et al., "Based on the forecast of China's eastern region of the peak energy carbon strategy," Journal of Environmental Science, 25(3) (2021).
[15] Huang, R., Wang, Z., Ding, G. Q., et al., "Analysis of influencing factors and trend prediction of carbon emissions from energy consumption in Jiangsu province based on STIRPAT model," Geographical Research, 35(04), 781–789 (2016).
[16] Ma, H. T. and Kang, L., "Spatial and temporal characteristics and regulation prediction of road passenger transport carbon emissions in Beijing-Tianjin-Hebei region," Resources Science, 39(07), 1361–1370 (2017).
[17] Zhang, J. Z., Wang, X. C., Tai, Q. L., et al., Journal of Hainan University (Natural Science Edition), 28 (2017).
[18] She, L. H., Tang, J. H., Fu, Z. X., et al., "Research and scenario analysis of regional carbon emission prediction under dual-carbon background," Electricity Demand Side Management, 25(03), 62–66 (2023).
[19] Ji, G. Y., "Application of BP neural network model based on grey correlation analysis to carbon emission prediction in China," Mathematical Practice and Understanding, 44(14), 243–249 (2014).
[20] Zhang, D., Wang, T. T. and Zhi, J. H., "Prediction and eco-economic analysis of carbon emission in Shandong Province based on IPSO-BP neural network model," Ecological Sciences, 17(1) (2012).
[21] Hu, J. B., Luo, Z. P. and Li, F., "Prediction of China's carbon emission intensity under the 'carbon peak' target: Analysis based on LSTM and ARIMA-BP models," Science of Finance and Economics, 7(2) (2022).
[22] Xiao, Z. H. and Wang, M. H., Journal of Chongqing Technology and Business University (Natural Science Edition), (2) (2016).
[23] Han, C., Song, S. and Wang, C. H., "Real-time adaptive prediction of short-term traffic flow based on ARIMA model," Journal of System Simulation, 42(7) (2004).
[24] Wang, Y. and Guo, Y. K., Computer Engineering and Applications (2019).
[25] Liu, H. T. and Hu, D. W., Environmental Science, 1–17 (2004).
[26] Chen, T. and Guestrin, C., CoRR, abs/1603.02754 (2016).
[27] Dereventsov, A. and Temlyakov, V., "A unified way of analyzing some greedy algorithms," Journal of Functional Analysis, 277(12), 108286 (2019). https://doi.org/10.1016/j.jfa.2019.108286
[28] "Information Technology—Information Theory; Data from Harvard University Provide New Insights into Information Theory (Bridging AIC and BIC: A New Criterion for Autoregression)," Computers, Networks & Communications (2018).