Top Pass: improve code generation by pass@k-maximized code ranking
Zhicun LYU 1,2, Xinye LI 1,2, Zheng XIE 1, Ming LI 1,2
1. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
2. School of Artificial Intelligence, Nanjing University, Nanjing 210023, China
Code generation has recently been greatly enhanced by advances in Large Language Models (LLMs). Nevertheless, LLM-based code generation approaches still struggle to produce error-free code within a few tries when faced with complex problems. The prevailing remedy is to sample a large number of candidate programs in the hope that at least one of them works. However, users of code generation systems usually expect to find a correct program by reviewing or testing only a small number of candidates; otherwise, the system is of little practical help. In this paper, we propose Top Pass, a code ranking approach that identifies potentially correct solutions among a large number of candidates. Top Pass directly optimizes the pass@k loss function, which improves the quality at the top of the candidate list and enables the user to find a correct solution within as few tries as possible. Experimental results on four benchmarks show that Top Pass improves the usability of code generation models by producing better rankings, in particular achieving a 32.9% relative improvement in pass@1 on CodeContests compared with the state-of-the-art ranking method.
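As a point of reference for the results reported below, pass@k for ranking-based methods is typically computed as follows: a problem counts as solved at k if at least one of the k highest-ranked candidates passes all hidden test cases. A minimal sketch of this computation (the function and variable names are illustrative and not taken from the paper; the sampling-based pass@k estimator of Chen et al. [7] is the analogous metric for unranked candidates):

from typing import Sequence

def ranking_pass_at_k(ranked_correct: Sequence[Sequence[bool]], k: int) -> float:
    # ranked_correct[i][j] is True iff the j-th highest-ranked candidate
    # for problem i passes all hidden test cases.
    solved = sum(1 for flags in ranked_correct if any(flags[:k]))
    return 100.0 * solved / len(ranked_correct)

# Toy example with three problems, candidates already sorted by ranker score:
ranked_correct = [
    [False, True, False],   # solved from k = 2 onwards
    [True, False, False],   # solved already at k = 1
    [False, False, False],  # no sampled candidate is correct
]
print(ranking_pass_at_k(ranked_correct, 1))   # 33.33...
print(ranking_pass_at_k(ranked_correct, 2))   # 66.66...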
Fig.1 Code generation system with or without Top Pass. The user can only afford testing or reviewing a few code candidates, thus Top Pass enhances the practical value of code generation systems significantly
Fig.2 Top Pass minimizes a novel pass@k loss function that enhances the ranking quality at the top of the code candidate list, so that the user can solve the programming task with fewer attempts
Fig.3 Examples of (a) top positive, (b) top negative, (c) bottom positive, and (d) bottom negative codes. The pass@k loss gives more significance to the top positive/negative codes, directing the ranking model towards identifying high-quality solutions rather than indistinguishable wrong codes
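The exact pass@k loss optimized by Top Pass is defined in the paper and is not reproduced here; the sketch below only illustrates the general idea described in Fig.2 and Fig.3, namely a pairwise ranking loss in which pairs involving the highest-scored failing candidates are weighted most heavily, so that errors at the top of the candidate list dominate the training signal. The names (top_weighted_pair_loss, margin) and the softmax weighting are assumptions of this sketch, not the paper's formulation:

import torch

def top_weighted_pair_loss(scores: torch.Tensor, labels: torch.Tensor,
                           margin: float = 0.5) -> torch.Tensor:
    # scores: ranker outputs for all candidates of one programming task
    # labels: 1 for candidates that pass the available tests, 0 otherwise
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())
    # Pairwise hinge: every passing candidate should score at least
    # `margin` above every failing candidate.
    hinge = torch.relu(margin - (pos.unsqueeze(1) - neg.unsqueeze(0)))  # [P, N]
    # Weight each pair by how high the failing candidate scores, so the
    # "top negatives" (hard false positives near the top of the list)
    # dominate the loss, mirroring the emphasis described in Fig.3.
    neg_weight = torch.softmax(neg, dim=0)                              # [N]
    return (hinge * neg_weight.unsqueeze(0)).sum(dim=1).mean()

# Example with 5 candidates for one task: two pass, three fail.
scores = torch.tensor([2.1, -0.3, 1.7, 0.4, -1.2], requires_grad=True)
labels = torch.tensor([1, 0, 0, 1, 0])
loss = top_weighted_pair_loss(scores, labels)
loss.backward()   # gradients flow back into the ranking model

In this sketch the softmax over the failing candidates' scores plays the role of "giving more significance to the top negatives"; Top Pass itself has hyper-parameters p and q (see Fig.6 and Table A4) that this sketch does not model.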
Category            Method                        Pass@1        Pass@2         Pass@3         Pass@5         Pass@10
Standalone LLM      PG-TD                         0.7           1.1            ?              ?              2.5
                    Codex                         0.7           1.2            ?              ?              3.0
                    WizardCoder                   2.0           ?              ?              3.3            ?
                    WizardCoder + CodeChain       2.5           ?              ?              3.3            ?
                    ChatGPT                       2.9           4.8            6.3            8.6            12.1
                    DeepSeek-Coder                5.2           7.9            9.8            12.5           16.4
LLM with testcases  CodeT                         2.1           2.3            ?              ?              5.3
                    ALGO                          5.6           5.6            ?              ?              7.7
LLM with ranker     CodeRanker w/ ChatGPT         6.1           9.1            9.7            10.3           12.7
                    Top Pass w/ ChatGPT           7.3 (+19.7%)  10.3 (+13.2%)  12.7 (+30.9%)  13.3 (+29.1%)  15.2 (+19.7%)
                    CodeRanker w/ DeepSeek-Coder  7.3           9.1            10.3           13.3           17.0
                    Top Pass w/ DeepSeek-Coder    9.7 (+32.9%)  12.7 (+39.6%)  13.3 (+29.1%)  14.5 (+9.0%)   18.2 (+7.1%)
Tab.1 Experiment results on the CodeContests dataset. The best results of each metric are presented in bold, and the percentages in parentheses indicate the relative improvements in pass@k compared to CodeRanker
Method            Pass@1                                                  Pass@5
                  Intro         Inter         Comp          Total         Intro         Inter         Comp          Total
Codex             4.1           0.1           0.0           0.9           9.7           0.5           0.1           2.3
AlphaCode         ?             ?             ?             ?             14.4          5.6           4.6           7.2
Code-LLAMA 34B    ?             ?             ?             ?             32.8          8.8           2.9           12.4
StarCoder         7.3           6.9           4.1           6.4           ?             ?             ?             ?
WizardCoder       26.0          4.2           0.8           7.9           ?             ?             ?             ?
code-davinci-002  19.1          4.3           1.0           6.6           42.4          13.1          4.0           17.1
CodeT             34.6          8.1           2.2           12.2          ?             ?             ?             ?
DeepSeek-Coder    40.6          13.8          4.3           17.3          60.0          27.3          11.4          30.7
CodeRanker        44.6          15.0          6.0           19.1          60.1          27.6          13.0          31.2
Top Pass          46.6 (+4.5%)  16.3 (+8.7%)  6.6 (+10.0%)  20.4 (+6.8%)  60.4 (+0.5%)  28.8 (+4.3%)  13.3 (+2.3%)  32.0 (+2.6%)
Tab.2 Experiment results on the APPS dataset. "Intro", "Inter", and "Comp" denote introductory-, interview-, and competition-level tasks, respectively, while "Total" covers the whole dataset; the percentages in parentheses indicate the relative improvements compared to CodeRanker
Method            MBPP                HumanEval
                  Pass@1    Pass@5    Pass@1    Pass@5
CodeLLaMa         47.0      ?         36.0      ?
WizardCoder       51.8      ?         57.3      ?
code-davinci-002  58.1      ?         47.0      ?
CodeT             61.9      ?         50.2      ?
DeepSeek-Coder    68.7      81.0      60.0      74.5
CodeRanker        67.6      77.0      62.8      73.2
Coder-Reviewer    66.2      ?         62.5      ?
Top Pass          69.2      79.6      64.6      76.2
Tab.3 Experiment results on MBPP and HumanEval
Method                Pass@1    Pass@2    Pass@3    Pass@5
Top Pass              9.7       12.7      13.3      14.5
Top Pass w/o Lpass@k  7.3       9.1       10.3      13.3
Tab.4 Ablation study on CodeContests
Fig.4 The influence of the false positive rate in the training dataset on various methods, measured by pass@k
Fig.5 The impact of the number of sampled candidates at test time on pass@1
Fig.6 The impact of different values of the hyper-parameters p and q on pass@1
Listing A1 Task description of CodeForces 1619B
Top Pass (CORRECT)
Program:
import math

def count_special_numbers(n):
    # Collect every perfect square and perfect cube up to n in a set,
    # so numbers that are both (sixth powers) are counted only once.
    special_numbers = set()
    for i in range(1, math.isqrt(n) + 1):
        sq = i * i
        special_numbers.add(sq)
        cb = sq * i
        if cb <= n:
            special_numbers.add(cb)
    return len(special_numbers)

t = int(input())
for _ in range(t):
    n = int(input())
    print(count_special_numbers(n))
Table A1 The program ranked as the top choice by Top Pass
CodeRanker (WRONG)
Program:
import math

def count_likes(n):
    cnt = 0
    i = 1
    while i * i <= n:
        cnt += 1
        j = i * i
        while j <= n:
            cnt += 1
            j *= i
            if j <= n:
                cnt -= 1
        i += 1
    return cnt

t = int(input())
for _ in range(t):
    n = int(input())
    print(count_likes(n))
Table A2 The program ranked as the top choice by CodeRanker
        Pass@1    Pass@2    Pass@3    Pass@5    Pass@10
0.0     7.9       8.5       10.9      13.9      18.8
0.1     7.3       9.7       11.5      13.3      17.0
0.2     7.9       9.7       13.3      15.8      18.8
0.3     9.7       12.7      13.3      14.5      18.2
0.4     8.5       9.7       13.3      15.8      18.2
Table A3 Influence of different values on pass@k
Fig.A1 Influence of different sampling temperatures on pass@k. T04, T06, T08, and T10 denote sampling temperatures of 0.4, 0.6, 0.8, and 1.0, respectively
p      q      Pass@1    Pass@2    Pass@3    Pass@5    Pass@10
0.6    0.4    7.9       13.9      15.2      15.8      18.2
       0.5    7.3       12.7      15.2      15.8      17.6
       0.6    7.3       9.7       11.5      13.3      15.8
       0.7    7.3       10.9      12.1      15.2      17.0
0.7    0.4    7.9       12.1      13.3      16.4      19.4
       0.5    7.9       10.9      12.1      13.9      17.6
       0.6    7.3       8.5       12.1      15.8      17.6
       0.7    7.3       10.9      12.1      12.7      17.0
0.8    0.4    7.9       8.5       11.5      16.4      17.0
       0.5    8.5       10.3      12.7      16.4      19.4
       0.6    7.3       9.1       10.9      12.7      16.4
       0.7    7.9       9.1       10.9      14.5      19.4
0.9    0.4    8.5       10.3      12.7      15.8      17.6
       0.5    9.7       12.7      13.3      14.5      18.2
       0.6    8.5       10.3      12.1      15.2      17.0
       0.7    7.9       9.1       10.9      13.3      16.4
1.0    0.4    7.9       9.7       10.3      13.9      17.6
       0.5    8.5       9.1       10.3      12.1      13.9
       0.6    7.9       9.1       10.3      11.5      17.0
       0.7    7.9       9.1       10.3      13.9      17.0
Table A4 Influence of different values of the hyper-parameters p and q on pass@k
References
1. Li R, Allal L B, Zi Y, Muennighoff N, Kocetkov D, et al. StarCoder: may the source be with you! 2023, arXiv preprint arXiv: 2305.06161
2. Nijkamp E, Pang B, Hayashi H, Tu L, Wang H, Zhou Y, Savarese S, Xiong C. CodeGen: an open large language model for code with multi-turn program synthesis. In: Proceedings of the 11th International Conference on Learning Representations. 2023
3. Rozière B, Gehring J, Gloeckle F, Sootla S, Gat I, Tan X E, Adi Y, Liu J, Sauvestre R, Remez T, Rapin J, Kozhevnikov A, Evtimov I, Bitton J, Bhatt M, Ferrer C C, Grattafiori A, Xiong W, Défossez A, Copet J, Azhar F, Touvron H, Martin L, Usunier N, Scialom T, Synnaeve G. Code LLaMa: open foundation models for code. 2024, arXiv preprint arXiv: 2308.12950
4. Hu Y, Jiang H, Hu Z. Measuring code maintainability with deep neural networks. Frontiers of Computer Science, 2023, 17(6): 176214
5. Chen B, Zhang F, Nguyen A, Zan D, Lin Z, Lou J G, Chen W. CodeT: code generation with generated tests. In: Proceedings of the 11th International Conference on Learning Representations. 2023
6. Zhang K, Wang D, Xia J, Wang W Y, Li L. ALGO: synthesizing algorithmic programs with LLM-generated oracle verifiers. In: Proceedings of the 37th Conference on Neural Information Processing Systems. 2023, 54769−54784
7. Chen M, Tworek J, Jun H, Yuan Q, de Oliveira Pinto H P, et al. Evaluating large language models trained on code. 2021, arXiv preprint arXiv: 2107.03374
8. Inala J P, Wang C, Yang M, Codas A, Encarnación M, Lahiri S K, Musuvathi M, Gao J. Fault-aware neural code rankers. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022, 13419−13432
9. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 6000−6010
10. Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. 2018
11. Black S, Gao L, Wang P, Leahy C, Biderman S. GPT-Neo: large scale autoregressive language modeling with mesh-tensorflow. 2021
12. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, et al. GPT-4 technical report. 2024, arXiv preprint arXiv: 2303.08774
13. Anil R, Dai A M, Firat O, Johnson M, Lepikhin D, et al. PaLM 2 technical report. 2023, arXiv preprint arXiv: 2305.10403
14. Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, et al. PaLM: scaling language modeling with pathways. The Journal of Machine Learning Research, 2024, 24(1): 240
15. Wang Y, Wang W, Joty S, Hoi S C H. CodeT5: identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021, 8696−8708
16. Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu P J. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020, 21(1): 140
17. Li Y, Choi D, Chung J, Kushman N, Schrittwieser J, Leblond R, Eccles T, Keeling J, Gimeno F, Dal Lago A, Hubert T, Choy P, de Masson d'Autume C, Babuschkin I, Chen X, Huang P S, Welbl J, Gowal S, Cherepanov A, Molloy J, Mankowitz D J, Sutherland Robson E, Kohli P, de Freitas N, Kavukcuoglu K, Vinyals O. Competition-level code generation with AlphaCode. Science, 2022, 378(6624): 1092–1097
18. Luo Z, Xu C, Zhao P, Sun Q, Geng X, Hu W, Tao C, Ma J, Lin Q, Jiang D. WizardCoder: empowering code large language models with Evol-Instruct. In: Proceedings of the 12th International Conference on Learning Representations. 2024
19. Gunasekar S, Zhang Y, Aneja J, Mendes C C T, Del Giorno A, Gopi S, Javaheripi M, Kauffmann P, de Rosa G, Saarikivi O, Salim A, Shah S, Behl H S, Wang X, Bubeck S, Eldan R, Kalai A T, Lee Y T, Li Y. Textbooks are all you need. 2023, arXiv preprint arXiv: 2306.11644
20. Bi X, Chen D, Chen G, Chen S, Dai D, et al. DeepSeek LLM: scaling open-source language models with longtermism. 2024, arXiv preprint arXiv: 2401.02954
21. Zheng Q, Xia X, Zou X, Dong Y, Wang S, Xue Y, Shen L, Wang Z, Wang A, Li Y, Su T, Yang Z, Tang J. CodeGeeX: a pre-trained model for code generation with multilingual benchmarking on HumanEval-X. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023, 5673−5684
22. Fried D, Aghajanyan A, Lin J, Wang S, Wallace E, Shi F, Zhong R, Yih S, Zettlemoyer L, Lewis M. InCoder: a generative model for code infilling and synthesis. In: Proceedings of the 11th International Conference on Learning Representations. 2023
23. Chen X, Lin M, Schärli N, Zhou D. Teaching large language models to self-debug. In: Proceedings of the 12th International Conference on Learning Representations. 2024
24. Liu J, Xia C S, Wang Y, Zhang L. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2024, 943
25. Deng Y, Xia C S, Peng H, Yang C, Zhang L. Large language models are zero-shot fuzzers: fuzzing deep-learning libraries via large language models. In: Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 2023, 423−435
26. Wang W, Li G, Ma B, Xia X, Jin Z. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In: Proceedings of the 27th IEEE International Conference on Software Analysis, Evolution and Reengineering. 2020, 261−271
27. Gu J, Chen Z, Monperrus M. Multimodal representation for neural code search. In: Proceedings of the 2021 IEEE International Conference on Software Maintenance and Evolution. 2021, 483−494
28. Arakelyan S, Hakhverdyan A, Allamanis M, Garcia L, Hauser C, Ren X. NS3: neuro-symbolic semantic code search. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2024, 761
29. Li Z, Pan M, Pei Y, Zhang T, Wang L, Li X. Empirically revisiting and enhancing automatic classification of bug and non-bug issues. Frontiers of Computer Science, 2024, 18(5): 185207
30. Kanade A, Maniatis P, Balakrishnan G, Shi K. Learning and evaluating contextual embedding of source code. In: Proceedings of the 37th International Conference on Machine Learning. 2020, 474
31. Feng Z, Guo D, Tang D, Duan N, Feng X, Gong M, Shou L, Qin B, Liu T, Jiang D, Zhou M. CodeBERT: a pre-trained model for programming and natural languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020. 2020, 1536−1547
32. Guo D, Ren S, Lu S, Feng Z, Tang D, Liu S, Zhou L, Duan N, Svyatkovskiy A, Fu S, Tufano M, Deng S K, Clement C B, Drain D, Sundaresan N, Yin J, Jiang D, Zhou M. GraphCodeBERT: pre-training code representations with data flow. In: Proceedings of the 9th International Conference on Learning Representations. 2021
33. Guo D, Lu S, Duan N, Wang Y, Zhou M, Yin J. UniXcoder: unified cross-modal pre-training for code representation. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022, 7212−7225
34. Ahmad W, Chakraborty S, Ray B, Chang K W. Unified pre-training for program understanding and generation. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021, 2655−2668
35. Wang X, Wang Y, Mi F, Zhou P, Wan Y, Liu X, Li L, Wu H, Liu J, Jiang X. SynCoBERT: syntax-guided multi-modal contrastive pre-training for code representation. 2021, arXiv preprint arXiv: 2108.04556
36. Clark K, Luong M T, Le Q V, Manning C D. ELECTRA: pre-training text encoders as discriminators rather than generators. In: Proceedings of the 8th International Conference on Learning Representations. 2020
37. Hendrycks D, Basart S, Kadavath S, Mazeika M, Arora A, Guo E, Burns C, Puranik S, He H, Song D, Steinhardt J. Measuring coding challenge competence with APPS. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021
38. Austin J, Odena A, Nye M, Bosma M, Michalewski H, Dohan D, Jiang E, Cai C, Terry M, Le Q, Sutton C. Program synthesis with large language models. 2021, arXiv preprint arXiv: 2108.07732
39. OpenAI. ChatGPT: optimizing language models for dialogue. 2022
40. Zhang T, Yu T, Hashimoto T B, Lewis M, Yih W T, Fried D, Wang S I. Coder reviewer reranking for code generation. In: Proceedings of the 40th International Conference on Machine Learning. 2023, 41832−41846