1. INTRODUCTION

Underground drainage pipes are the “lifeline” on which urban development relies: residents’ daily life and industrial activity are inseparable from them, yet in recent years many cities have suffered urban flooding, road collapse, black and odorous pipes, and other urban diseases. Traditional underground pipeline inspection captures images of the pipe interior and relies on manual, subjective identification of defects, which is inefficient, fails to identify defects accurately, and carries subjectivity and safety risks during the inspection process. The advancement and application of vision-based technologies generate vast quantities of inspection images and videos; analyzing these data manually is not only inefficient but also prone to subjectivity, so automation technology is adopted to assist manual judgement. Since the 1990s, researchers have applied digital image techniques to defect detection in underground pipelines [1], and innovative algorithms have been proposed continually to automate detection and accelerate inspection equipment. In 2001, Chae et al. used Sewer Scanner and Evaluation Technology (SSET) equipment to acquire defect data, pre-processed it with image processing and image segmentation methods, and achieved condition assessment of underground pipelines by fusing multiple networks with a fuzzy logic system [2]. In 2006, Sinha et al. proposed a method based on mathematical morphological segmentation that can effectively handle underground pipeline images with varying backgrounds and non-uniform illumination [3]. In 2008, Yang et al. proposed an automated system that applies wavelet transforms and co-occurrence matrices to extract texture features from images of sewer pipe defects.
These features are then used as inputs to three machine learning classifiers: Backpropagation Neural Networks (BPN), Radial Basis Function Networks (RBN), and Support Vector Machines (SVM). The system classifies defects such as joint dislocation, cracks, and fractures, achieving an accuracy of 60% [4]. In 2014, Su et al. analysed a large number of CCTV images and proposed a method combining threshold segmentation and edge detection that can effectively handle complex image backgrounds and noise [5]. In the past ten years, the rapid advancement of deep learning has spurred significant research across various domains, including automated pipeline inspection. In particular, approaches leveraging deep convolutional neural networks (CNNs) have been actively investigated for their ability to improve the accuracy and efficiency of pipeline defect detection. In 2018, Kumar et al. applied CNNs to CCTV inspection footage to automate the identification and classification of sewer defects, demonstrating improved accuracy and efficiency over traditional manual methods and marking an important step toward automated infrastructure maintenance [6]. In 2019, Kumar et al. introduced a Faster R-CNN approach for detecting root intrusions and sedimentation in CCTV footage; tested on video of a 335-meter sewer tributary, it achieved a defect detection accuracy of 91% [7]. In 2020, Yin et al. used the YOLOv3 network as a target detector, training and testing on 4,056 samples covering six defect types (breakages, holes, deposits, cracks, fractures, and root intrusions) and yielding an accuracy of 85% [8]. In 2023, Zhang et al. enhanced the YOLOv4 detection model by integrating the Spatial Pyramid Pooling (SPP) module, which improved the fusion of different receptive fields.
The accuracy reached 92.3% on a dataset of 2,700 images covering four defect types [9]. In 2024, Zhao et al. introduced YOLOv5-sewer, a lightweight model designed for sewer defect detection; built on the YOLOv5 architecture, it focuses on optimizing computational efficiency while maintaining detection accuracy [10]. The drainage pipe defect detection environment is complex and defect locations vary, so this paper adopts YOLOv8 as the basic framework: its speed, accuracy, lightweight design, and ease of deployment give it a significant advantage in drainage pipe defect detection. Improvements are made on top of YOLOv8 by adding Spatial and Channel Reconstruction Convolution (SCConv) [11] and the Multi-Scale Dilated Attention (MSDA) module of the Multi-Scale Dilated Transformer [12], achieving efficient detection of sixteen common underground pipe defects and improving expressiveness and detection accuracy over the original model.

2. IMPROVEMENTS TO THE YOLOV8 ALGORITHM

2.1 Model framework

This paper proposes two improvements to the YOLOv8 algorithm: 1) the introduction of the SCConv module, which processes both spatial and channel information of the image and better captures subtle defect features without additional computational burden; 2) the introduction of the MSDA module, which integrates dilated convolution with a multi-scale self-attention mechanism to improve the visual model’s feature extraction capability. The improved YOLOv8 framework is illustrated in Figure 1 and is composed of five main parts: input preprocessing, backbone, neck, head, and output. The algorithm uses a series of initial convolution and downsampling operations to extract base image features, and the SCConv module is applied to feature maps ranging from 160×160 to 80×80 to better capture the spatial structure and color depth information at different scales when dealing with defective images of underground pipelines.
The MSDA module generates query, key, and value features by linear projection, processes the features in the head using multi-scale dilated convolution, and then fuses the feature maps at different scales.

2.2 Description of SCConv module

SCConv is a highly efficient convolutional module designed to address spatial and channel redundancy within CNNs. By minimizing unnecessary information in both dimensions, SCConv not only reduces the computational burden but also enhances the overall performance of the network. This is achieved through an optimized feature extraction process that makes more effective use of computational resources. The module is composed of two core components: the Spatial Reconstruction Unit (SRU), which reorganizes spatial information, and the Channel Reconstruction Unit (CRU), which refines feature representation across channels. Together, these units enable more precise and efficient feature extraction, improving network efficiency and accuracy. The SRU is responsible for reducing redundant features in the spatial dimension. It receives the input features and processes them in the following steps. First, the input features are normalized to reduce the scale difference between feature maps. Second, the features are weighted by a series of weights w1, w2, …, wc, computed from the normalized channel features r1, r2, …, rc through a nonlinear activation function. Finally, the weighted features are partitioned into two parts, which are each transformed and then reconstructed by addition to obtain the spatially refined feature Xw. The SRU configuration is shown in Figure 2. The configuration of the Channel Reconstruction Unit (CRU), which is designed to reduce the channel redundancy of convolutional neural network features, is illustrated in Figure 3.
The CRU operates on the features already processed by the SRU: it starts from the spatially refined feature Xw and splits it along the channel dimension into an α portion and a (1 − α) portion, each compressed by its own 1×1 convolution. The two parts are then transformed by group-wise convolution (GWC) and point-wise convolution (PWC), and the two transformed features Y1 and Y2 are pooled and fused with SoftMax weighting to form the final channel-refined feature Y.

2.3 Description of MSDA module

The working principle of multi-scale dilated attention (MSDA) is shown in Figure 4. Within the MSDA module, the feature map channels are first divided into several heads. Each head applies self-attention with a distinct dilation rate, ensuring that information at various scales is effectively processed across the different channel segments. These operations are performed between colored blocks within a window surrounding a red query block; the example in the figure shows three dilation rates corresponding to different receptive field sizes (3×3, 5×5, 7×7). The self-attention operation of each head targets its own dilation rate and receptive field. The features captured at multiple scales are then combined and passed through a linear layer for aggregation. This multi-scale design allows the model to interpret the image at various levels of detail, capturing not only local details but also the contextual information of a wider region, thereby enhancing the expressive power of the model.
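As a rough illustration (not the authors’ implementation), the window sizes quoted above for MSDA follow from the standard effective-kernel formula for dilated convolution, k_eff = k + (k − 1)(r − 1): a 3×3 kernel with dilation rates 1, 2, and 3 yields the 3×3, 5×5, and 7×7 windows shown in Figure 4.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Effective receptive field of a dilated kernel:
    k_eff = k + (k - 1) * (r - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# Three attention heads, each with its own dilation rate over a 3x3
# kernel, reproduce the window sizes in Figure 4.
for rate in (1, 2, 3):
    k = effective_kernel(3, rate)
    print(f"dilation {rate}: {k}x{k} window")
```

Each head attends only to keys sampled on its dilated grid, so the larger windows add context without increasing the number of attended positions per query.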
3. EXPERIMENTAL PREPARATION

3.1 Dataset

The data used in the experiments were collected with CCTV equipment in a city in southern China and classified in accordance with the industry standard for drainage pipe inspection and assessment (CJJ 181-2012); a total of 19,039 images of underground drainage pipe defects were labeled manually with defect locations. To address class imbalance in the dataset, data augmentation was applied to the under-represented categories, and a total of 55,915 images were finally generated after adjustment. The dataset was divided into training, validation, and test sets in the ratio 7:2:1, with 39,140 images used as training samples, 11,183 as validation samples, and 5,592 as test samples. The dataset contains both structural and functional defects. The structural defects are crack, deformation, corrosion, misalignment, fluctuation, disjoint, leakage, penetration, detachment, and side-branch; the functional defects are deposit, encrustation, scum, roots, obstacle, and broken-wall. These sixteen types cover the common drain defect categories.

3.2 Experimental platform

To guarantee the reliability and consistency of the experimental outcomes, all experiments in this study were carried out on a designated computing platform. The specific details of the experimental environment, including hardware and software configurations, are outlined in Table 1. This setup ensures that the results can be accurately replicated under consistent conditions.

Table 1. Hardware and software configurations.
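The 7:2:1 split described in Section 3.1 can be sketched as follows; this is a hypothetical reproduction, not the authors’ preprocessing code, and the fixed seed is an assumption added for reproducibility.

```python
import random

def split_dataset(paths, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle image paths and split them into train/val/test subsets.
    The test set takes the remainder so that no image is dropped."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n_train = int(len(paths) * ratios[0])
    n_val = int(len(paths) * ratios[1])
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]
    return train, val, test

# Splitting the 55,915 augmented images reproduces the reported
# 39,140 / 11,183 / 5,592 partition.
train, val, test = split_dataset(range(55_915))
print(len(train), len(val), len(test))  # 39140 11183 5592
```

Giving the remainder to the test set explains why the three subset sizes sum exactly to 55,915.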
3.3 Metrics

To evaluate the detection performance of the improved YOLOv8 network model, the results were evaluated using Precision (P), Recall (R), Average Precision (AP), and mean Average Precision (mAP). The relevant formulas are:

P = TP / (TP + FP)
R = TP / (TP + FN)
AP = ∫₀¹ P(R) dR
mAP = (1/C) Σᵢ APᵢ

In the above equations, TP represents the number of true positives, instances where the model correctly identifies positive cases. FP stands for false positives, the number of negative samples that the model mistakenly classifies as positive. FN refers to false negatives, positive samples that the model incorrectly classifies as negative. Finally, C denotes the number of defect types considered in the classification.

3.4 Experimental results

All input images in the experiments were uniformly resized to 640×640. Training used an initial learning rate of 0.01 with the Stochastic Gradient Descent (SGD) optimizer; the batch size was set to 8 and the number of training epochs to 350. Figure 6 compares the PR curves of the original model and the improved model. The area under the PR curve of the improved model is clearly larger, indicating that its performance is significantly better than that of the original model.

3.5 Comparison experiment

To provide a more comprehensive evaluation of both the accuracy and efficiency of the improved model, we also conducted comparison experiments with several state-of-the-art object detection algorithms, including YOLOv8, YOLOv10, and RT-DETR. The results of these experiments, which benchmark the improved model against these advanced detection methods, are presented in Table 2 for detailed comparison and analysis.

Table 2. Comparison results.
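The metrics defined in Section 3.3 can be sketched in a framework-independent way; this is a minimal illustration of the definitions, not the evaluation code used in the experiments, and the toy counts are invented for demonstration.

```python
def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP): fraction of predicted defects that are real."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN): fraction of real defects that are detected."""
    return tp / (tp + fn)

def average_precision(pr_points):
    """Approximate AP as the area under a precision-recall curve given
    as (recall, precision) points sorted by increasing recall."""
    ap, prev_r = 0.0, 0.0
    for r, p in pr_points:
        ap += p * (r - prev_r)
        prev_r = r
    return ap

def mean_average_precision(aps):
    """mAP = (1 / C) * sum of per-class APs over C defect classes."""
    return sum(aps) / len(aps)

# Toy counts: 90 true positives, 10 false positives, 15 false negatives.
print(precision(90, 10))  # 0.9
print(recall(90, 15))
```

mAP@0.5 and mAP@0.5:0.95 in Table 2 apply the same mAP definition at an IoU threshold of 0.5 and averaged over thresholds from 0.5 to 0.95, respectively.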
The experimental data indicate that the improved model achieves the best performance on all metrics: a precision of 92.6%, a recall of 89.9%, and 94.4% and 80.2% at mAP@0.5 and mAP@0.5:0.95 respectively, all better than the other compared models.

4. CONCLUSIONS

In this paper, a detection model based on an improved YOLOv8 is proposed to address the low efficiency of traditional underground drainage pipe defect detection methods. By introducing the SCConv and MSDA modules, the model’s ability to capture image features at different scales is improved. The experimental results show that the improved model achieves higher detection accuracy than the original model. In the future, we will further optimize the model structure to reduce the model size and inference time.

5. ACKNOWLEDGMENT

This research was supported by the Fund for Special Projects in Key Areas of General Universities in Guangdong Province, China (New Generation Information Technology) under Grant No. 2021ZDZX1033.

6. REFERENCES

[1] Morad, S., Zayed, T. and Golkhoo, F., “Review on computer aided sewer pipeline defect detection and condition assessment,” Infrastructures, 4(1), 10–25 (2019). https://doi.org/10.3390/infrastructures4010010
[2] Chae, M. and Abraham, D., “Neuro-fuzzy approaches for sanitary sewer pipeline condition assessment,” J. Comput. Civ. Eng., 15(1), 4–14 (2001). https://doi.org/10.1061/(ASCE)0887-3801(2001)15:1(4)
[3] Sinha, S. K. and Fieguth, P. W., “Segmentation of buried concrete pipe images,” Autom. Constr., 15(1), 47–57 (2006). https://doi.org/10.1016/j.autcon.2005.02.007
[4] Yang, M. D. and Su, T. C., “Automated diagnosis of sewer pipe defects based on machine learning approaches,” Expert Syst. Appl., 35(3), 1327–1337 (2008). https://doi.org/10.1016/j.eswa.2007.08.013
[5] Su, T. C. and Yang, M. D., “Application of morphological segmentation to leaking defect detection in sewer pipelines,” Sensors, 14(5), 8686–8704 (2014). https://doi.org/10.3390/s140508686
[6] Kumar, S., Abraham, D. M. and Jahanshahi, M. R., “Automated defect classification in sewer closed circuit television inspections using deep convolutional neural networks,” Autom. Constr., 91, 273–283 (2018). https://doi.org/10.1016/j.autcon.2018.03.028
[7] Kumar, S., Wang, M. and Abraham, D., “Deep learning based automated detection of sewer defects in closed circuit television videos,” J. Comput. Civ. Eng., 34(1), 04019047 (2020). https://doi.org/10.1061/(ASCE)CP.1943-5487.0000866
[8] Yin, X., Chen, Y. and Bouferguene, A., “A deep learning-based framework for an automated defect detection system for sewer pipes,” Autom. Constr., 109, 102967 (2020). https://doi.org/10.1016/j.autcon.2019.102967
[9] Zhang, J., Liu, X. and Zhang, X., “Automatic detection method of sewer pipe defects using deep learning techniques,” Appl. Sci., 13(7), 4589 (2023). https://doi.org/10.3390/app13074589
[10] Zhao, X., Xiao, N. and Cai, Z., “YOLOv5-sewer: lightweight sewer defect detection model,” Appl. Sci., 14(5), 1869 (2024). https://doi.org/10.3390/app14051869
[11] Li, J., Wen, Y. and He, L., “SCConv: spatial and channel reconstruction convolution for feature redundancy,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 6153–6162 (2023).
[12] Jiao, J., Tang, Y. M., Lin, K. Y., Gao, Y., Ma, A. J., Wang, Y. and Zheng, W. S., “DilateFormer: multi-scale dilated transformer for visual recognition,” IEEE Trans. Multimedia, 25, 8906–8919 (2023). https://doi.org/10.1109/TMM.2023.3243616