An Empirical Study on the Usage of Transformer Models for Code Completion

This study extends our previous work: "An Empirical Study on the Usage of BERT Models for Code Completion".

In this work, we present a large-scale study exploring the capabilities of state-of-the-art Transformer-based models in supporting code completion at different granularity levels, from single tokens to one or multiple entire statements, up to entire code blocks.

T5 model

Preliminary step

The model is trained on a Colab TPU instance. A GCS Bucket is mandatory. To set up a new GCS Bucket for training and fine-tuning a T5 model, please follow the guide provided by Google here.
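As a minimal sketch of the Colab-side setup we assume before running the notebooks (authenticating to GCS and checking the TPU), where BUCKET is a placeholder for the Bucket created above:

    # Minimal sketch (not part of the replication scripts): authenticate to GCS
    # and connect to the TPU assigned to the Colab runtime.
    import tensorflow as tf
    from google.colab import auth  # available only inside a Colab runtime

    auth.authenticate_user()  # gives the notebook access to your GCS Bucket

    BUCKET = "gs://your-bucket-name"  # placeholder: the Bucket created above

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
    print("TPU devices:", tf.config.list_logical_devices("TPU"))
    print("Artifacts will be written under:", BUCKET)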

Pipeline

  • Dataset

    You can find the datasets used for pretraining and fine-tuning the models here and here. For training the tokenizer, we also share the key.csv file with information about each method.

  • Tokenizer

    We trained the tokenizer using the script in the Tokenizer folder.

    python3 tokenizer.py --input=<input_file> --model_prefix=code --vocab_size=32000 --bos_id=-1  --eos_id=1 --unk_id=2 --pad_id=0
    

    Where:

    • input: the path to the txt file containing the code to tokenize
    • model_prefix: the prefix for the tokenizer (e.g., code generates code.vocab and code.model)
    • vocab_size: the size of the vocabulary
    • bos_id: begin-of-sentence id (this changes only the order of the tokens stored in the vocabulary)
    • eos_id: end-of-sentence id (this changes only the order of the tokens stored in the vocabulary)
    • unk_id: unknown-token id (this changes only the order of the tokens stored in the vocabulary)
    • pad_id: padding id (this changes only the order of the tokens stored in the vocabulary)

    You can find the trained tokenizer in the Pretraining/tokenizer_model folder; a short sketch of how to load it is shown below.
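    As a minimal sketch, assuming the tokenizer is a SentencePiece model (the code.vocab / code.model pair produced above follows the SentencePiece naming convention), you can load it and inspect how it splits a snippet; the model path is a placeholder for your local checkout:

        # Minimal sketch: load the trained tokenizer and tokenize a Java snippet.
        # Assumes a SentencePiece model; adjust the path to your checkout.
        import sentencepiece as spm

        sp = spm.SentencePieceProcessor(model_file="Pretraining/tokenizer_model/code.model")

        snippet = "public int sum(int a, int b) { return a + b; }"
        print(sp.encode(snippet, out_type=str))  # subword pieces
        print(sp.encode(snippet, out_type=int))  # vocabulary ids (pad_id=0, unk_id=2)
        print("vocab size:", sp.get_piece_size())  # should be 32000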

  • Pretraining

    For pretraining the model, you can find the notebook pretrain.ipynb in the Pretraining folder. The notebook contains comments that explain how to run it. You can also find the gin configuration file in the configuration_file folder and the trained tokenizer in the tokenizer_model folder. The pretrained model is available here.

  • Hyper Parameter tuning

    We performed hyper-parameter tuning to find the best model for the fine-tuning. We tested 4 configurations, training the model for 100k steps in each case. The configurations are the following (a sketch of these schedules is given after the list):

    • constant learning rate (lr = 0.001)
    • Inverse Square Root (warmup_steps = 10000)
    • slanted (cut_fraction=0.1, ratio=32, max_learning_rate=0.01, start_step=0)
    • polynomial learning rate (starter_learning_rate=0.01, end_learning_rate=1e-6, decay_step=10000, power=0.5)
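    As a rough sketch, assuming the standard formulations of these schedules (T5's inverse square root, ULMFiT's slanted triangular, TensorFlow-style polynomial decay), the learning rate at a given step can be computed as follows; this is illustrative, not the exact gin-configured implementation:

        # Illustrative implementations of the four learning-rate schedules,
        # using the parameter values listed above (100k total steps in HP tuning).
        import math

        def constant_lr(step, lr=0.001):
            return lr

        def inverse_square_root(step, warmup_steps=10_000):
            # Flat during warmup, then decays as 1/sqrt(step).
            return 1.0 / math.sqrt(max(step, warmup_steps))

        def slanted_triangular(step, total_steps=100_000, cut_fraction=0.1, ratio=32,
                               max_learning_rate=0.01, start_step=0):
            # ULMFiT-style: short linear warmup, then a long linear decay.
            t = max(step - start_step, 0)
            cut = math.floor(total_steps * cut_fraction)
            p = t / cut if t < cut else 1 - (t - cut) / (cut * (1 / cut_fraction - 1))
            return max_learning_rate * (1 + p * (ratio - 1)) / ratio

        def polynomial_decay(step, starter_learning_rate=0.01, end_learning_rate=1e-6,
                             decay_steps=10_000, power=0.5):
            frac = min(step, decay_steps) / decay_steps
            return (starter_learning_rate - end_learning_rate) * (1 - frac) ** power + end_learning_rate

        for s in (0, 5_000, 10_000, 50_000, 100_000):
            print(s, inverse_square_root(s), slanted_triangular(s), polynomial_decay(s))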

    You can find the commented notebooks in HP_Tuning/pretraining_script. The configuration files for each HP-tuning run are in HP_Tuning/configuration_files. You can find the script to evaluate the performance in the HP_Tuning/evaluation folder.

    python3 perfect_predictions.py --folder <folder_with_prediction> 
    

    In --folder you have to save all the files generated by TensorFlow during the evaluation. You can find the HP-tuning models and the prediction files here.

    We then evaluated the performance; the best configuration was the slanted one. Here is the percentage of perfect predictions for each model:

    DATASET             CONSTANT   SLANTED   ISR     POLYNOMIAL
    java construct      50.51      52.11     50.77   31.36
    java block          24.85      26.92     25.52    7.46
    java token          65.42      66.45     65.43   44.75
    android construct   48.20      49.98     48.59   27.98
    android block       25.97      27.96     26.46    7.99
    android token       68.23      69.37     68.38   46.70
    overall             57.62      58.97     57.81   37.28
  • Finetuning

    To evaluate the performance of each model, we used a beam size of 1. We performed 3 different fine-tunings:

    • Multi-task finetuning (Finetuning/multitask folder)
    • Single-task finetuning for each dataset (6 models) starting from pretrained model (Finetuning/single_task_from_pretrained folder)
    • Single-task finetuning for each dataset (6 models) starting from scratch (Finetuning/single_task_no_pretrained folder)

    We fine-tuned the multi-task model for 400k steps (around 29 epochs). We chose the number of steps for the other models so that the number of training epochs is also 29 (a hypothetical example of this arithmetic is sketched after the table). The following table reports the number of training steps for each model:

    DATASET             STEPS
    java construct      85000
    java block          34000
    java token          85000
    android construct   85000
    android block       24000
    android token       85000
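    As a purely hypothetical example of that arithmetic (the real dataset sizes and batch size are defined in the notebooks and gin configurations, not here), the number of fine-tuning steps for a target number of epochs boils down to:

        # Hypothetical numbers only: placeholders, not the values used in the paper.
        import math

        train_instances = 750_000   # placeholder: size of one fine-tuning training set
        batch_size = 256            # placeholder: examples per training batch
        target_epochs = 29

        steps_per_epoch = math.ceil(train_instances / batch_size)
        finetune_steps = steps_per_epoch * target_epochs
        print(f"{steps_per_epoch} steps/epoch -> {finetune_steps} steps for {target_epochs} epochs")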

    You can fine-tune and evaluate the models by running the Fine_tuning.ipynb and evaluate.ipynb notebooks (check the comments in the notebooks). For the evaluation, you have to upload to the Bucket the input file containing the methods you want to predict, and use the path of this file as the input_file in the predict method. For the multi-task fine-tuning, you have to merge the input files of all tasks so that all methods are predicted in one single step.
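    As a rough sketch, assuming the notebooks rely on the t5 library's MtfModel (the usual setup for gin-configured TPU training), the prediction step looks roughly like the snippet below; all paths, sizes, and addresses are placeholders, and the notebooks should be checked for the exact arguments (e.g., the custom vocabulary):

        # Illustrative only: how a predict call with beam size 1 could look.
        import t5

        model = t5.models.MtfModel(
            model_dir="gs://your-bucket-name/finetuning/multitask",  # placeholder
            tpu="grpc://your-tpu-address",                           # Colab TPU address
            model_parallelism=1,
            batch_size=128,                                          # placeholder
            sequence_length={"inputs": 512, "targets": 512},         # placeholder
        )

        model.predict(
            input_file="gs://your-bucket-name/eval/inputs.txt",        # methods to predict
            output_file="gs://your-bucket-name/eval/predictions.txt",  # predictions, one per line
            checkpoint_steps=-1,   # -1 = latest checkpoint
            beam_size=1,           # beam size used in the evaluation
            temperature=0.0,       # greedy decoding
        )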

    Multi-task finetuning

    You can evaluate the number of perfect predictions by running the command below (a sketch of the underlying computation follows the parameter list):

    python3 perfect_predictions.py --input_path <path_to_input_file>  --target_path <path_to_target_file> --prediction_path <path_to_prediction_file>
    

    Where:

    • input_path contains the file you want to predict
    • target_path contains the file with the correct value that the model should predict
    • prediction_path contains the file with the T5 predictions
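    As a sketch of the underlying idea, a prediction counts as perfect when it matches the target exactly, with the two files aligned line by line (the multi-task script additionally uses the input file to break down the results per task; see perfect_predictions.py for the exact normalization):

        # Sketch: overall percentage of exact-match ("perfect") predictions.
        def perfect_prediction_rate(target_path: str, prediction_path: str) -> float:
            with open(target_path) as t, open(prediction_path) as p:
                targets = [line.strip() for line in t]
                predictions = [line.strip() for line in p]
            assert len(targets) == len(predictions), "files must be aligned line by line"
            perfect = sum(tgt == pred for tgt, pred in zip(targets, predictions))
            return 100.0 * perfect / len(targets)

        # Example usage (paths are placeholders):
        # print(perfect_prediction_rate("targets.txt", "predictions.txt"))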

    Single-task finetuning from pretrained model

    You can evaluate the number of perfect predictions by running:

    python3 perfect_predictions.py --target_path <path_to_target_file> --prediction_path <path_to_prediction_file>
    

    Where:

    • target_path contains the file with the correct value that the model should predict
    • prediction_path contains the file with the T5 predictions

    Single-task finetuning from scratch

    You can evaluate the number of perfect predictions by running:

    python3 perfect_predictions.py --target_path <path_to_target_file> --prediction_path <path_to_prediction_file>
    

    Where:

    • target_path contains the file with the correct value that the model should predict
    • prediction_path contains the file with the T5 predictions

    A recap table with the percentage of perfect predictions is reported below:

    DATASET             MULTI-TASK PRETRAINED   SINGLE-TASK PRETRAINED   SINGLE-TASK FROM SCRATCH
    java construct      52.29                   51.20                    48.41
    java block          28.84                   27.18                    22.90
    java token          66.31                   62.86                    60.92
    android construct   50.78                   49.30                    46.75
    android block       29.74                   27.48                    22.76
    android token       69.26                   64.83                    63.77
    overall             59.29                   56.24                    54.10

    You can find the models and the predictions here.

Analysis of the Prediction Confidence

We chose the best model (i.e., the multi-task model with pre-training) to further analyze its performance, and in particular the reliability of the prediction confidence as a proxy for prediction quality. T5 returns a score for each prediction, ranging from minus infinity to 0: this score is the natural-log likelihood of the prediction itself. If the score is 0, the log likelihood of the prediction is 0, so the likelihood is 1 and the model is 100% confident that the prediction is correct (high confidence). If the score is -2, the likelihood is e^-2 ≈ 0.14, so the model has a confidence of about 14% that the prediction is correct (low confidence).
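The mapping from score to confidence described above is just an exponentiation of the log likelihood; a minimal sketch:

    # Convert a T5 log-likelihood score (-inf, 0] into a confidence in [0, 1].
    import math

    def confidence(score: float) -> float:
        return math.exp(score)

    print(confidence(0.0))   # 1.0   -> 100% confidence (high)
    print(confidence(-2.0))  # ~0.14 ->  14% confidence (low)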

To compute the score file, you can run the notebook evaluate_score.ipynb in the Score folder.

To see the percentage of perfect predictions for each confidence interval and the analysis of the Levenshtein distance across all confidence intervals, you can run the command below (a sketch of this analysis follows the parameter list):

python3 score_analysis.py --input_path <folder_with_input_files> --score_path <path_to_score_file> --score

where

  • input_path is the folder that contains the inputs.txt, targets.txt, and predictions.txt files (respectively, the inputs given to the T5 model, the targets you want T5 to predict, and the predictions)
  • score_path is the path to the file with the scores
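As a rough sketch of this analysis, assuming ten equally sized confidence intervals (the exact boundaries used by score_analysis.py may differ) and aligned targets, predictions, and scores:

    # Sketch: perfect-prediction rate and average Levenshtein distance per
    # confidence interval, using a plain dynamic-programming edit distance.
    import math

    def levenshtein(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def per_interval_stats(targets, predictions, scores, bucket_width=0.1):
        buckets = {}
        for tgt, pred, score in zip(targets, predictions, scores):
            conf = math.exp(score)  # confidence in [0, 1]
            key = min(int(conf / bucket_width), int(round(1 / bucket_width)) - 1)
            buckets.setdefault(key, []).append((tgt, pred))
        for key in sorted(buckets):
            pairs = buckets[key]
            perfect = 100.0 * sum(t == p for t, p in pairs) / len(pairs)
            avg_lev = sum(levenshtein(t, p) for t, p in pairs) / len(pairs)
            lo, hi = key * bucket_width, (key + 1) * bucket_width
            print(f"confidence [{lo:.1f}, {hi:.1f}): {len(pairs)} predictions, "
                  f"{perfect:.2f}% perfect, avg Levenshtein {avg_lev:.2f}")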

To evaluate the average length of the predictions for each confidence interval you can run:

python3 score_analysis.py  --input_path <folder_with_input_files> --score_path <path_to_score_file> --length

where

  • input_path is the folder that contains the files cited above (plus a new file, lengths.txt, with the number of tokens of each prediction)
  • score_path is the path to the file with the scores

You can find all the results (and the scores.txt files) in the Scores/Results folder.

Models comparison

You can find here two csv files that compare the T5, RoBERTa, and n-gram models. Some analyses were performed using Excel filters.

Comparison with n-gram with cloning

You can run:

python3 prediction_analysis.py --input_folder <input_for_n_gram> --output_folder  <path_to_output_folder> --result_csv <path_to_csv_file>

where:

  • input_folder is the path containing the output of the n-gram model trained in the MSR paper (useful to retrieve the ids of the trained methods)
  • output_folder is the path where you want to save the results
  • result_csv is the path to the csv file with the results of the T5 model

You can find all files and results in the Cloning_Comparison folder.

Comparison with RoBERTa model

We compared T5 single-task without pretraining with the RoBERTa model (Fig. 1 and Table 4 in the paper). You can find the script in the No_Pretrain_Comparison folder.

To compute the number of perfect predictions as a function of the number of tokens to predict, you can run:

python3 metrics_analysis.py --input_path <folder_with_input_files> --perfect

where

  • input_path is the folder that contains the files cited in the confidence section (you need to use the predictions of the model without pretraining)

To compute metrics such as the BLEU score and the Levenshtein distance, you can run:

python3 metrics_analysis.py --input_path <folder_with_input_files>  --metrics
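For the BLEU part, a minimal sketch assuming sentence-level BLEU-4 with smoothing over whitespace-tokenized code (the exact tokenization and smoothing used by metrics_analysis.py may differ):

    # Sketch: sentence-level BLEU-4 between a target and a prediction (NLTK).
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def bleu4(target: str, prediction: str) -> float:
        smooth = SmoothingFunction().method1
        return sentence_bleu([target.split()], prediction.split(),
                             smoothing_function=smooth)

    print(bleu4("return a + b ;", "return a + b ;"))  # 1.0 for a perfect prediction
    print(bleu4("return a + b ;", "return a - b ;"))  # lower score for a near miss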

RoBERTa Model

You can find details about the implementation of the RoBERTa model here.

Additional Results

BLEU and Levenshtein distance for T5 model with pretraining

Here are the results for the T5 single-task (Single column) and multi-task (Multi column) models with pretraining:

Semantical Equivalence

In the Semantical_Equivalence folder you can find the methods we analyzed to check whether non-perfect predictions have the same behaviour as the target code (i.e., semantically equivalent predictions).

Statistical Tests

In the Statistical Tests folder you can find details about the statistical tests we performed.

Qualitative analysis

We report some examples of the predictions of each model.

T5

RoBERTa

n-gram

All these files can be found in the Perfect_Prediction_Examples folder.

How to cite

For the original study based on the RoBERTa model, use the following BibTeX

@inproceedings{Ciniselli2021,
	author = {Ciniselli, Matteo and Cooper, Nathan and Pascarella, Luca and Poshyvanyk, Denys and Di Penta, Massimiliano and Bavota, Gabriele},
	title = {An Empirical Study on the Usage of BERT Models for Code Completion},
	booktitle = {Proceedings of the 18th Working Conference on Mining Software Repositories},
	series = {MSR '21},
	year = {2021},
	pages = {To Appear}
}

For this extension based on the T5 model, use the following BibTeX

@article{ciniselli2021empirical,
  title={An Empirical Study on the Usage of Transformer Models for Code Completion},
  author={Ciniselli, Matteo and Cooper, Nathan and Pascarella, Luca and Mastropaolo, Antonio and Aghajani, Emad and Poshyvanyk, Denys and Di Penta, Massimiliano and Bavota, Gabriele},
  journal={IEEE Transactions on Software Engineering},
  year={2021},
  publisher={IEEE}
}

Contributors

Matteo Ciniselli, Nathan Cooper, Luca Pascarella, Antonio Mastropaolo, Emad Aghajani, Denys Poshyvanyk, Massimiliano Di Penta, Gabriele Bavota.

License

This software is licensed under the MIT License.

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 851720). W&M team was supported in part by the NSF CCF-1955853, CCF-2007246 and CCF-1815186 grants. Any opinions, findings, and conclusions expressed herein are the authors’ and do not necessarily reflect those of the sponsors.