11 July 2024 CodeSift: Approach for detecting source code feature redundancy
Ao Xu, Yi Zhu, Qiao Yu, Guosheng Hao
Proceedings Volume 13210, Third International Symposium on Computer Applications and Information Systems (ISCAIS 2024); 132100T (2024) https://doi.org/10.1117/12.3034805
Event: Third International Symposium on Computer Applications and Information Systems (ISCAIS 2024), 2024, Wuhan, China
Abstract
In source-code-based software defect prediction, the source code must first be converted into a form that the model can process, a step known as word embedding. Word2Vec and BERT are the commonly used word embedding models. Accurate feature representation is crucial to the performance of a defect prediction model. However, source code suffers from feature redundancy: for example, code elements that appear in pairs produce highly similar word vectors. As a result, the set of word vectors generated when training a word embedding model may degrade model performance when applied to defect prediction tasks. To alleviate this problem, this paper proposes a method called CodeSift. CodeSift calculates the similarity of each pair of word vectors generated by the word embedding model and interpolates the generated code word vectors, which reduces feature redundancy between word vectors. It generates a new word vector from highly similar word vectors and retains the original information by assigning weights, thus improving the compactness and information richness of the feature representation. Experiments show that applying CodeSift improves the F1 value of the defect prediction model and yields a lower false positive rate than the original model.
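The abstract does not give the exact procedure, so the following is only a minimal sketch of the general idea it describes: compare every pair of word vectors, and where a pair is highly similar, replace it with a weighted interpolation. Cosine similarity, the `threshold` and `weight` parameters, the merged-token naming, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def interpolate(a: Vector, b: Vector, weight: float) -> Vector:
    """Weighted linear interpolation: weight * a + (1 - weight) * b."""
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]


def reduce_redundancy(
    embeddings: Dict[str, Vector],
    threshold: float = 0.95,  # hypothetical similarity cutoff
    weight: float = 0.5,      # hypothetical interpolation weight
) -> Dict[str, Vector]:
    """Merge highly similar word vectors into one interpolated vector.

    Assumed behavior (not from the paper): each token is compared with
    every later token; tokens whose similarity to it reaches the threshold
    are folded into a single vector and dropped from the output.
    """
    tokens = list(embeddings)
    merged = set()
    result: Dict[str, Vector] = {}
    for i, tok_i in enumerate(tokens):
        if tok_i in merged:
            continue
        current = embeddings[tok_i]
        name = tok_i
        for tok_j in tokens[i + 1:]:
            if tok_j in merged:
                continue
            if cosine_similarity(embeddings[tok_i], embeddings[tok_j]) >= threshold:
                current = interpolate(current, embeddings[tok_j], weight)
                name = name + "+" + tok_j
                merged.add(tok_j)
        result[name] = current
    return result


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real embeddings would come from Word2Vec or BERT.
    toy = {"{": [1.0, 0.0], "}": [0.99, 0.05], "return": [0.0, 1.0]}
    for token, vector in reduce_redundancy(toy).items():
        print(token, vector)
```

In this toy input, the paired tokens `{` and `}` have nearly identical vectors and are merged into one entry, while `return` is left unchanged. How the paper actually selects pairs and assigns weights may differ from this sketch.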
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Ao Xu, Yi Zhu, Qiao Yu, and Guosheng Hao "CodeSift: Approach for detecting source code feature redundancy", Proc. SPIE 13210, Third International Symposium on Computer Applications and Information Systems (ISCAIS 2024), 132100T (11 July 2024); https://doi.org/10.1117/12.3034805
KEYWORDS
Performance modeling
Data modeling
Interpolation
Semantics
Education and training
Mathematical optimization
Matrices
