|
1. INTRODUCTION
Demand in the ceramic tile industry is stabilizing, competition within the industry is intensifying, and higher requirements are being placed on tile quality. At present, the inspection and grading of ceramic tile products is mostly done manually. Manual inspection is affected by subjective factors, long shifts tire the operators and raise the error rate, and the labor cost is high. In recent years, the development of deep learning has rapidly advanced intelligent methods in ceramic tile production.
Li Zehui et al. proposed an improved YOLOv3 algorithm for defect detection on textured tiles, addressing the reliance of earlier algorithms on manually designed features, their debugging difficulty, and their insufficient robustness in practical applications [1]. Ouyang Zhou et al. proposed a salient object detection method based on an attention mechanism to handle surface inspection of tiles with complex textures [2]. Li Junhua and Quan Xiaoxia fused improved SIFT features with color moments, assigning weight coefficients to achieve feature-weighted fusion and improve the recognition rate of complex tile defects [3]. For tile positioning and segmentation, Zhou Xiang et al. proposed extracting the contour from the gray-level gradient and then obtaining the image of the target region through a contour mask [4]. In 2016, Google proposed the DeepLab network model, which has become a classic algorithm in semantic segmentation thanks to its excellent segmentation results.
DeepLabV1 [5] combines a deep convolutional network with fully connected conditional random fields (CRFs) to perform semantic segmentation. However, its max pooling and downsampling operations compress the image resolution and discard image detail. They preserve spatial invariance but limit the accuracy of the model, so the generated probability map becomes blurred. On this basis, DeepLabV2 [6] introduces atrous (dilated) convolution, which enlarges the receptive field of the convolution kernel without increasing the number of parameters or the amount of computation, so that a wider range of context is captured and information about multi-scale targets can be learned during training. DeepLabV3 [7] develops atrous convolution into an atrous spatial pyramid pooling (ASPP) structure, which probes the incoming convolutional features with kernels at multiple sampling rates and effective fields of view, capturing the context of objects and images at multiple scales. In 2018, Google proposed the DeepLabV3+ [8] network, which adopts an encoder-decoder structure and uses the V3 network as the encoder.
To address the large number of parameters and the inaccurate segmentation of edge details in DeepLabV3+, this paper proposes an improved DeepLabV3+ method for tile image preprocessing. Because the tile inspection site demands fast segmentation, the method must keep the amount of computation small while maintaining high segmentation accuracy.
The network is improved as follows:
① In the encoder, the ResNet [9] backbone is replaced by MobileNetV2 [10]. The lightweight MobileNetV2 model effectively reduces model complexity and the amount of computation.
② An ECA attention module is integrated to increase the weight given to tile borders during tile surface image segmentation and to reduce the attention paid to invalid features, further improving segmentation accuracy.
③ To address the insufficient recognition of edges and other detailed features in the original model, the standard convolution in the ASPP module is replaced by depthwise separable convolution (DSConv).
④ In the decoder, depthwise separable convolution replaces the ordinary convolution operations to further improve the real-time performance of the model.
Compared with standard convolution, depthwise separable convolution greatly reduces the number of parameters to be trained and improves the training efficiency of the network without affecting prediction accuracy. Finally, after the feature extraction pyramid has extracted multi-scale information, the ECA attention mechanism is applied to strengthen the network's perception of edge positions and small-scale defect features, reduce the missed and false detections caused by edge defects and small-scale defects, and refine the segmentation results.
2. IMPROVED DEEPLABV3+ NETWORK
Although the DeepLabV3+ network detects defect areas well, tiny defects in the edge area of tiles in actual production are more likely to be missed than those in the middle area, which lowers segmentation accuracy. To improve segmentation accuracy, the DeepLabV3+ network is improved; the improved network structure is shown in Figure 1.
2.1 Introduction of the E-ASPP Attention Multi-Branch Expansion Module
The DeepLabV3+ model obtains multi-level information through ASPP (Atrous Spatial Pyramid Pooling), but it only concatenates the branch outputs. This simple concatenation cannot capture rich contextual information, which is insufficient for extracting large targets such as the tile surface in tile positioning. Therefore, the E-ASPP [11] attention multi-branch cascade module is added to the encoder to strengthen contextual connections and enhance information extraction. The E-ASPP attention multi-branch expansion module is shown in Figure 2. It consists of two parts: the ECA attention module and the ASPP multi-branch receptive field cascade fusion structure.
The ECA attention module is an improved version of SENet [12] (Squeeze-and-Excitation). Its structure is shown in Figure 3. The ECA module uses a local cross-channel interaction strategy without dimensionality reduction: after global average pooling (GAP), each channel is considered together with its k neighbors to capture local cross-channel interaction. The module first compresses the spatial dimensions of the input feature map to obtain a 1×1×C feature map. It then learns channel features by applying a one-dimensional convolution whose kernel size is chosen adaptively; this adaptive kernel size determines the coverage of the local cross-channel interaction. Finally, the channel weight vector produced by the activation function is multiplied channel by channel with the original input feature map, and a feature map with channel attention is output, which improves feature extraction. Because the ECA module avoids channel dimensionality reduction, the model learns channel attention more efficiently.
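The four ECA steps described above (global average pooling, one-dimensional convolution across channels, activation, channel-wise rescaling) can be illustrated with a small pure-Python sketch. It is not the paper's implementation: the function names `eca_kernel_size` and `eca_forward` are hypothetical, the feature map is a nested list rather than a tensor, the activation is assumed to be a sigmoid as in ECA-Net [11], and the kernel-size rule assumes the γ = 2, b = 1 setting given for the ECA module in this section.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size: log2(C)/gamma + b/gamma, truncated and forced odd."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1


def eca_forward(feature_map, weights):
    """Apply ECA-style channel attention.

    feature_map: list of C channels, each an H x W list of lists of floats.
    weights: 1D kernel of odd length k, shared by all channels (assumed given).
    Returns the rescaled feature map and the per-channel gates.
    """
    k = len(weights)
    pad = k // 2
    # 1) Squeeze: global average pooling gives one descriptor per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # 2) Local cross-channel interaction: zero-padded 1D convolution over channels,
    #    so each channel is mixed only with its k nearest neighbours.
    padded = [0.0] * pad + pooled + [0.0] * pad
    mixed = [
        sum(w * padded[c + i] for i, w in enumerate(weights))
        for c in range(len(pooled))
    ]
    # 3) Sigmoid turns each mixed descriptor into a gate in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-m)) for m in mixed]
    # 4) Rescale every value of each channel by that channel's gate.
    scaled = [
        [[g * v for v in row] for row in ch]
        for g, ch in zip(gates, feature_map)
    ]
    return scaled, gates
```

For example, `eca_kernel_size(320)` selects a kernel of size 5 for a 320-channel feature map under these assumptions, so the attention module adds only five learnable weights in this sketch.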
The ECA module has few parameters; their number is determined only by the size k of its convolution kernel. The adaptive kernel size k is calculated as:
$$k = \psi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{odd} \quad (1)$$
In formula (1), k is the convolution kernel size, C is the number of input channels, and $|\cdot|_{odd}$ denotes the nearest odd number. The parameters γ and b adjust the ratio between C and k; generally γ = 2 and b = 1.
In the ASPP module, the semantic information of an image is extracted at different scales through atrous convolution at different sampling rates and pyramid pooling. However, the parallel branches of the ASPP module do not share the feature information they extract, so the spatial features of the data are not fully used. To overcome this limitation, a cascade fusion structure of multi-branch receptive fields is proposed. As shown on the right side of Figure 2, the output of each branch is adjusted to 320 channels by a 1×1 convolution and used as an additional feature for the next branch. Each branch first obtains a coarse feature representation from the backbone feature extraction network. These coarse features are then fused with the additional features from the previous branch, which alleviates the "hole" problem of atrous sampling in the ASPP module, ensures more complete transmission of information, and reduces data loss. The five feature layers output by the branches after this hierarchical fusion are concatenated, that is, the outputs of the four convolutions are stacked with the output of the pooling layer, so that high-level features with richer semantic information are extracted from feature layers at different scales. At the same time, the ordinary convolution in ASPP is replaced by a 3×3 depthwise separable convolution block, that is, a depthwise separable atrous convolution, which makes the network more efficient.
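The efficiency gain from replacing an ordinary 3×3 convolution with a depthwise separable atrous convolution can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, bias and normalization parameters are ignored, and the 320 input and 256 output channels and the dilation rates 6, 12, and 18 are assumptions chosen for the example rather than values confirmed by the paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of an ordinary k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(c_in, c_out, k):
    """Weights of a depthwise k x k convolution followed by a 1x1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


def effective_kernel_size(k, rate):
    """Spatial extent covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (rate - 1)


if __name__ == "__main__":
    c_in, c_out, k = 320, 256, 3  # assumed channel counts for illustration
    std = standard_conv_params(c_in, c_out, k)
    sep = separable_conv_params(c_in, c_out, k)
    print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.4f}")
    for rate in (6, 12, 18):      # assumed dilation rates
        print(f"rate {rate}: covers {effective_kernel_size(k, rate)} pixels per side")
```

Dilation widens the area a kernel covers without adding weights, so the parameter saving comes entirely from the depthwise/pointwise split; under these assumptions the separable block needs roughly 11.5% of the weights of the ordinary convolution.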
2.2 Replacing the Backbone Network with the Lightweight MobileNetV2 Model
Ceramic tiles are produced quickly on site: 60 tiles measuring 800 mm × 800 mm are inspected every minute. The original DeepLabV3+ uses ResNet as its backbone network. Although ResNet extracts features with high segmentation accuracy across many image types, its network complexity is high and its running time is long. For ceramic tile surface images with diverse information and features, the number of network parameters grows as the network deepens and the computing speed drops sharply, so it cannot meet the needs of on-site tile production. To improve the speed of feature extraction, this paper uses MobileNetV2 instead of ResNet as the backbone network.
The inverted residual structure and the linear bottleneck layer in MobileNetV2 together form the linear inverted residual structure, as shown in Figure 4. The inverted residual structure uses pointwise convolution and depthwise convolution. The input is first expanded in the channel dimension by a 1×1 pointwise convolution (expansion layer), then a 3×3 depthwise convolution is applied to each channel, and finally the channel dimension is reduced by a 1×1 pointwise convolution (projection layer). A shortcut connection then adds the block input to the projected output. On the last 1×1 pointwise convolution of the inverted residual structure, a linear bottleneck replaces the ReLU activation function. Applying ReLU after dimensionality reduction would discard feature information, including information at the tile border; the linear bottleneck greatly reduces this loss of low-dimensional feature information, which is important for improving segmentation accuracy. The number of parameters and the computational complexity are shown in Table 1.
Table 1. Comparison of model parameters and calculation amount
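The expansion, depthwise, and projection stages of the inverted residual structure described in Section 2.2 can be traced with a short bookkeeping sketch. It only counts channels and convolution weights; it does not reproduce the values of Table 1. The function name `inverted_residual` is hypothetical, the expansion factor of 6 and the ReLU6 activation follow the MobileNetV2 paper [10] rather than this text, and normalization parameters are ignored.

```python
def inverted_residual(c_in, c_out, stride=1, expand=6, k=3):
    """Describe one inverted residual block.

    Returns (layers, use_shortcut, total_params), where each layer is a tuple
    (name, input_channels, output_channels, weight_count).
    """
    hidden = c_in * expand  # the expansion layer widens the channel dimension
    layers = [
        ("expand 1x1 + ReLU6", c_in, hidden, c_in * hidden),
        ("depthwise 3x3 + ReLU6", hidden, hidden, k * k * hidden),
        # Linear bottleneck: no ReLU after the projection layer.
        ("project 1x1 (linear)", hidden, c_out, hidden * c_out),
    ]
    # The shortcut is only valid when input and output shapes match.
    use_shortcut = stride == 1 and c_in == c_out
    total_params = sum(params for _, _, _, params in layers)
    return layers, use_shortcut, total_params


if __name__ == "__main__":
    layers, shortcut, total = inverted_residual(32, 32)
    for name, cin, cout, params in layers:
        print(f"{name:24s} {cin:4d} -> {cout:4d}  weights: {params}")
    print("shortcut:", shortcut, "total weights:", total)
```

In this sketch the 3×3 depthwise stage, which operates on the widest tensor, contributes the fewest weights of the three stages, which is where the block gets its efficiency.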
2.3 Improvement of the Decoder
The CNN structure of DeepLabV3+ is complex, with 47 convolutional layers, so using standard convolution inevitably leads to a large number of parameters and to network overfitting. To solve this problem, the improved DeepLabV3+ network uses depthwise separable convolution in the decoder [12,13] instead of standard convolution. Formula (2) gives the ratio of the computation of depthwise separable convolution to that of standard convolution, which indicates that depthwise separable convolution is well suited to reducing the number of trainable parameters and the risk of overfitting [14,15]. Depthwise separable convolution splits the ordinary convolution into two steps, removes redundant model parameters, and speeds up both training convergence and inference. At the same time, the lower model complexity reduces the risk of overfitting and better meets the needs of real-time segmentation of tile surface images.
$$\frac{C_i \cdot C_i \cdot X \cdot C_e \cdot C_e + X \cdot Y \cdot C_e \cdot C_e}{C_i \cdot C_i \cdot X \cdot Y \cdot C_e \cdot C_e} = \frac{1}{Y} + \frac{1}{C_i^2} \quad (2)$$
In formula (2), X is the number of input channels, Y is the number of output channels, the size of the convolution kernel is Ci · Ci, and the size of the feature map is Ce · Ce. The ratio therefore depends only on the number of output channels and the kernel size.
3. EXPERIMENTAL RESULTS AND ANALYSIS
3.1 Collection of data sets
The data set used in the experiment comes from a tile production factory and consists of 800 images taken on the tile production line by an industrial line-scan camera. Of these, 200 images are 700×1500 (mm size), 200 images are 800×800 pixels, 200 images are 400×800 pixels, and 200 images are 600×1200 pixels. The objects in each image are divided into two categories, tile image and background, as shown in Figure 4. Pixels outside the tile surface are labeled as the background class. The image annotation tool LabelImg was used to generate the annotation mask of the tile; the resulting json file was converted into an 8-bit grayscale image used as the label, producing a data set in VOC format.
To avoid overfitting and enhance the generalization ability of the model, the samples were augmented by random rotation, flipping, Gaussian noise, random brightness changes, and similar operations. The data set was expanded to 3 times its original size, giving a total of 2400 images with corresponding labels. The data were divided into a training set and a validation set at a ratio of 8:2, giving 1920 training images and 480 validation images.
3.2 Experimental environment configuration and parameter settings
The hardware platform of this experiment is an Intel(R) Xeon(R) Gold 5218R CPU at 2.10 GHz, 64 GB of memory, and an NVIDIA GeForce RTX 3090 GPU. The software environment consists of CUDA 11.5, conda 4.9.2, and Python 3.8.8, and the operating system is Ubuntu 18.04. A transfer learning strategy was adopted during training, using weights pre-trained on the Pascal VOC data set. The initial learning rate was set to 0.009 and the batch size to 4. Stochastic gradient descent (SGD) was used as the optimizer to update the weights, with the momentum parameter set to 0.9 and the cross-entropy loss function. The learning rate was decayed once every 200 iterations, the number of iterations was set to 100, the learning rate was set to 0.001, and the learning rate gradually decayed as the number of iterations increased.
3.3 Evaluation criteria
To evaluate objectively how well the network model segments tile surface images, three evaluation indicators were selected in this paper: mean intersection over union (MIoU), mean pixel accuracy (MPA), and the F1 value. Together they account for the overlap between prediction and ground truth, the per-class pixel accuracy, and the balance between precision and recall for targets and non-targets.
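The three indicators named above can all be derived from a pixel-level confusion matrix. The following is a minimal sketch under stated assumptions: labels are flattened into plain lists of integer class indices, the function names are hypothetical, and the F1 value is computed for a single class treated as positive (class 1, the tile, in the example). It is not the evaluation code used in the paper.

```python
def confusion_matrix(truth, pred, num_classes):
    """cm[i][j] counts pixels whose true class is i and predicted class is j."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        cm[t][p] += 1
    return cm


def mean_iou(cm):
    """Average over classes of TP / (TP + FP + FN)."""
    n = len(cm)
    ious = []
    for i in range(n):
        tp = cm[i][i]
        true_i = sum(cm[i])                       # pixels whose true class is i
        pred_i = sum(cm[j][i] for j in range(n))  # pixels predicted as class i
        union = true_i + pred_i - tp
        ious.append(tp / union if union else 0.0)
    return sum(ious) / n


def mean_pixel_accuracy(cm):
    """Average over classes of correctly classified pixels / true pixels."""
    accs = [cm[i][i] / sum(cm[i]) if sum(cm[i]) else 0.0 for i in range(len(cm))]
    return sum(accs) / len(accs)


def f1_score(cm, positive=1):
    """Harmonic mean of precision and recall for the chosen positive class."""
    tp = cm[positive][positive]
    fp = sum(cm[j][positive] for j in range(len(cm))) - tp
    fn = sum(cm[positive]) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 1, 1]  # 0 = background, 1 = tile (toy example)
    pred = [0, 1, 1, 1, 1, 0]
    cm = confusion_matrix(truth, pred, num_classes=2)
    print(mean_iou(cm), mean_pixel_accuracy(cm), f1_score(cm))
```

Building the confusion matrix once and deriving every indicator from it keeps the three metrics consistent with each other on the same set of pixels.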
If the data set contains k+1 categories, $p_{ij}$ denotes the number of pixels whose true class is i and that are predicted as class j. For class i, $p_{ii}$ is the number of true positive (TP) pixels, $p_{jj}$ is the number of true negative (TN) pixels, $p_{ij}$ is the number of false negative (FN) pixels, and $p_{ji}$ is the number of false positive (FP) pixels.
MIoU is the ratio of the intersection of the true label and the predicted label to their union; the IoU of each class is calculated and then averaged, that is
$$MIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$$
MPA is the proportion of correctly classified pixels in each category, averaged over the categories, that is
$$MPA = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}$$
F1 is an index that measures the accuracy of a binary classification model, taking into account both the precision and the recall of the model:
$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}, \quad Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}$$
Its maximum value is 1 and its minimum value is 0. The larger the value, the higher the accuracy of the model.
The feasibility of the model in practical applications is evaluated by its running speed, its number of parameters, and its accuracy.
3.4 Analysis of experimental results
Experimental analysis of segmentation performance of different backbone networks
The mean intersection over union curve and the accuracy curve during the training iterations of the improved model are shown in Figure 6; both the accuracy and the intersection over union become stable as the number of iterations increases. The improved DeepLabV3+ network is compared with other models in terms of MIoU, MPA, F1 value, running time, and accuracy. The test results on the tile data set are shown in Table 2.
Table 2. Comparison with other segmentation networks
As can be seen from Table 2, the improved DeepLabV3+ algorithm performs significantly better. Compared with the U-Net network, the PSPNet network, and DeepLabV3+ before improvement, MIoU is increased by 4.44%, 12.36% and 5.07% respectively, MPA is increased by 4.7%, 3.55% and 4.44% respectively, and accuracy is improved by 5.71%, 9.53% and 1.02% respectively. The running time is shortened by 139 ms compared with DeepLabV3+ before improvement. The improved algorithm therefore improves both segmentation accuracy and detection speed, and its shorter running time supports efficient deployment on a fast-moving production line.
Ablation results
An ablation experiment comparing the original DeepLabV3+ model with the improved DeepLabV3+ model was designed to verify the effectiveness of each improvement. The basic model before improvement is denoted as A, and the different modifications of the model are shown in Table 3. The same training set and validation set are used for training and testing every model. To ensure the reliability of the experiment, each network is trained twice and the averaged results are compared. The experimental results are shown in Table 4.
Table 3. Different processing methods
Table 4. Influence of different methods on the model
As can be seen from Table 4, the performance of the improved DeepLabV3+ is greatly improved. The proposed modules improve both the inference speed and the segmentation accuracy of the model. When the lightweight MobileNetV2 is used as the backbone network, the model depth is reduced and more low-level features are retained, so the model improves segmentation accuracy while reducing running time. After the ECA attention module is added, the model's ability to focus on relevant features and its accuracy improve, but the attention mechanism increases model complexity and processing time. After the cascade fusion of multi-branch receptive fields is added to the ASPP module, MIoU and MPA increase by 0.75 and 0.31 percentage points respectively, owing to the fusion of multi-level receptive fields. After the convolutional layers of the decoder are replaced with depthwise separable convolution, the internal structure of the decoder is optimized and the overall performance improves again: MIoU and MPA improve by 0.27 and 0.18 percentage points respectively. At the same time, the processing time is shortened as the number of model parameters decreases. The improved model extracts the tile surface image more effectively and runs closer to real time, so it is better suited to tile image segmentation in an industrial environment.
4. CONCLUSION
To address the slow speed and low accuracy of tile surface image segmentation, this paper builds on the DeepLabV3+ model. First, the lightweight network MobileNetV2 replaces ResNet as the backbone network, which reduces the complexity of the model, reduces the amount of computation, and improves the detection speed. Second, the ECA attention mechanism is introduced in combination with the ASPP multi-branch receptive field cascade structure.
This makes the network pay more attention to key information and improves segmentation accuracy. The experimental results show that the improved DeepLabV3+ algorithm improves MIoU by 4.07%, MPA by 4.44%, the F1 value by 2.04, and accuracy by 1.02%, and that it segments tile surface images well.
REFERENCES
Li Zehui, Chen Xindu, Huang Jiasheng, et al., “Texture ceramic tile defect detection based on improved YOLOv3 [J],” Progress in Laser and Optoelectronics, 59(10), 284–292 (2022).
Ouyang Zhou, Zhang Huailiang, Tang Ziyang, et al., “Research on surface defect detection algorithms for complex textured ceramic tiles [J],” Journal of Northwest Polytechnical University, 40(2), 414–421 (2022). https://doi.org/10.1051/jnwpu/20224020414
Li Junhua, Quan Xiaoxia, Wang Yuling, “Research on ceramic tile surface defect detection algorithm based on multi-feature fusion [J],” Computer Engineering and Applications, 56(15), 191–198 (2019).
Zhou Xiang, He Wei, He Tao, et al., “A kind of ceramic tile locating method based on machine vision [J],” Journal of Chinese Ceramics, 52(07), 43–48 (2016). https://doi.org/10.16521/j.carol carroll nki
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, et al., “Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs [Z],” arXiv (2016).
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, et al., “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs [Z],” arXiv (2017).
Liang-Chieh Chen, George Papandreou, Florian Schroff, et al., “Rethinking Atrous Convolution for Semantic Image Segmentation [Z],” arXiv (2017).
Liang-Chieh Chen, Yukun Zhu, George Papandreou, et al., “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation [Z],” arXiv (2018).
Zhang H, Wu C R, Zhang Z Y, et al., “ResNeSt: Split-attention networks,” arXiv preprint arXiv:2004.08955 (2020).
Sandler M, Howard A, Zhu M, et al., “MobileNetV2: Inverted residuals and linear bottlenecks [C],” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 18-23, 2018, Salt Lake City, UT, USA, IEEE, 4510–4520 (2018).
Wang Q, Wu B, Zhu P, et al., “ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks [J],” CoRR, abs/1910.03151 (2019).
Hu J, Shen L, Albanie S, et al., “Squeeze-and-Excitation Networks [J],” IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8), 1–1 (2019).
Zhou Su, Yi Yuqian, Zheng Shoujian, et al., “Vehicle target detection network based on depth-separable convolution [J],” Mechatronics, 2021(3), 3–12.
Mao Yuanhong, He Zhanzhuang, Liu Lulu, et al., “Pruning method based on depthwise separable convolution in target tracking [J],” Journal of Xi’an Jiaotong University, 55(1), 52–59 (2021).
Andrew G Howard, Menglong Zhu, Bo Chen, et al., “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications [Z],” arXiv (2017).
|