Stylometry for real-world expert coders: a zero-shot approach.

PeerJ Comput Sci

LTCI, Télécom Paris, Institut Polytechnique de Paris, Paris, France.

Published: November 2024

Code stylometry is the application of stylometry techniques to determine the authorship of software source code snippets. It is used in industry for use cases such as plagiarism detection, code audits, and code review assignments. Most works in the code stylometry literature use machine learning techniques and (1) rely on datasets drawn from coding competitions for training, and (2) only attempt to recognize authors present in the training dataset (in-distribution authors). In this work we take a fresh look at code stylometry and challenge both of these assumptions: (1) we recognize expert authors who contribute to real-world open-source projects, and (2) we show how to accurately recognize authors not present in the training set (out-distribution authors). We assemble a novel open dataset for code stylometry tasks consisting of 114,400 code snippets, written by 104 authors who contributed 1,100 snippets each. We develop a k-nearest neighbors (k-NN) classifier for the code stylometry task and train it on the dataset. Our system achieves a top accuracy of 69% among five randomly selected in-distribution authors, improving on the state of the art by more than 20%. We also show that when moving from in-distribution to out-distribution authors, the classification performance of the k-NN classifier remains the same, achieving a top accuracy of 71% among five randomly selected out-distribution authors.
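
To make the k-NN idea concrete, here is a minimal sketch of nearest-neighbor author attribution. It is not the paper's implementation: the abstract does not specify features or a distance measure, so the character trigram counts, the cosine similarity, the helper names (`char_ngrams`, `cosine`, `knn_predict`), and the toy authors and snippets are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(code: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (an assumed, simple style feature)."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query: str, references: list[tuple[str, str]], k: int = 3) -> str:
    """Return the majority author among the k reference snippets closest to query.

    references holds (author, snippet) pairs. New authors can be appended to
    this list at prediction time; nothing has to be refit.
    """
    q = char_ngrams(query)
    ranked = sorted(
        references,
        key=lambda ref: cosine(q, char_ngrams(ref[1])),
        reverse=True,
    )
    votes = Counter(author for author, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy reference set: two hypothetical authors with contrasting formatting habits.
REFERENCES = [
    ("alice", "for (int i = 0; i < n; i++) { total += values[i]; }"),
    ("alice", "for (int j = 0; j < m; j++) { count += items[j]; }"),
    ("alice", "if (total > 0) { return total; }"),
    ("bob", "for(int idx=0;idx<n;++idx)\n{\n\tsum+=arr[idx];\n}"),
    ("bob", "for(int pos=0;pos<m;++pos)\n{\n\tacc+=buf[pos];\n}"),
    ("bob", "if(sum>0)\n{\n\treturn sum;\n}"),
]

# A query written with the spacing and brace habits of the first author.
predicted = knn_predict(
    "for (int k = 0; k < len; k++) { result += data[k]; }", REFERENCES
)
print(predicted)
```

One reason a nearest-neighbor design suits the out-distribution setting is that a previously unseen author can be handled by adding a few of their snippets to the reference list, with no retraining step in this sketch. Whether the paper's system works exactly this way is not stated in the abstract.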


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11623162
DOI: http://dx.doi.org/10.7717/peerj-cs.2429

Publication Analysis

Top Keywords

code stylometry: 20
code snippets: 12
out-distribution authors: 12
code: 10
authors: 9
recognize authors: 8
authors training: 8
in-distribution authors: 8
k-nn classifier: 8
top accuracy: 8

