Frontiers of Computer Science

Front. Comput. Sci.    2025, Vol. 19 Issue (8) : 198341    https://doi.org/10.1007/s11704-024-40415-9
Artificial Intelligence
Top Pass: improve code generation by pass@k-maximized code ranking
Zhicun LYU1,2, Xinye LI1,2, Zheng XIE1, Ming LI1,2
1. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
2. School of Artificial Intelligence, Nanjing University, Nanjing 210023, China
Abstract

Code generation has been greatly enhanced by recent advances in Large Language Models (LLMs). Nevertheless, LLM-based code generation approaches still struggle to produce error-free code within a few tries when faced with complex problems. The prevailing remedy is to sample a large number of candidate programs, in the hope that at least one of them works. However, users of code generation systems usually expect to find a correct program after reviewing or testing only a small number of candidates; otherwise, the system is of little help. In this paper, we propose Top Pass, a code ranking approach that identifies potentially correct solutions from a large pool of candidates. Top Pass directly optimizes the pass@k loss function, improving the quality at the top of the candidate list so that the user can find a correct solution within as few tries as possible. Experimental results on four benchmarks show that Top Pass enhances the usability of code generation models by producing better rankings, in particular achieving a 32.9% relative improvement in pass@1 on CodeContests compared with the state-of-the-art ranking method.
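For context, pass@k in this ranking setting is the fraction of programming tasks for which at least one of the top-k candidates selected by the ranker passes all test cases; results for standalone LLMs are commonly reported with the usual unbiased sampling estimator instead. The following Python sketch is purely illustrative (the function names and the example flags are ours, not taken from the paper) and shows both computations:

    from math import comb

    def ranked_pass_at_k(ranked_correct, k):
        # 1 if any of the top-k candidates in the ranked list is correct, else 0.
        # ranked_correct: 0/1 flags for each candidate, best-ranked first.
        return int(any(ranked_correct[:k]))

    def sampled_pass_at_k(n, c, k):
        # Standard unbiased estimator for n sampled programs of which c are correct.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Hypothetical example: the benchmark-level score averages these values over all tasks.
    flags = [0, 1, 0, 0, 0]            # only the second-ranked candidate passes
    print(ranked_pass_at_k(flags, 1))  # 0: the top-1 candidate fails
    print(ranked_pass_at_k(flags, 2))  # 1: a correct program appears in the top 2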

Keywords: machine learning; data mining; software engineering
Corresponding Author(s): Ming LI   
Issue Date: 21 November 2024
 Cite this article:   
Zhicun LYU, Xinye LI, Zheng XIE, et al. Top Pass: improve code generation by pass@k-maximized code ranking[J]. Front. Comput. Sci., 2025, 19(8): 198341.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-024-40415-9
https://academic.hep.com.cn/fcs/EN/Y2025/V19/I8/198341
Fig.1  Code generation system with and without Top Pass. The user can afford to test or review only a few code candidates, so Top Pass significantly enhances the practical value of code generation systems
Fig.2  Top Pass minimizes a novel pass@k loss function that enhances the ranking quality at the top of the code candidate list, so that the user can solve the programming task with fewer attempts
Fig.3  Examples of (a) a top positive, (b) a top negative, (c) a bottom positive, and (d) a bottom negative. The pass@k loss gives more weight to the top positive/negative codes, directing the ranking model towards identifying high-quality solutions rather than indistinguishable wrong ones
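The exact form of the pass@k loss is given in the full paper; as a rough, hypothetical illustration of the idea conveyed by Fig.2 and Fig.3, a pairwise ranking loss that up-weights pairs involving candidates near the top of the current ranking might look like the sketch below (the weighting scheme, the decay factor gamma, and all names are our assumptions, not the paper's definition):

    import torch

    def top_weighted_ranking_loss(scores, labels, gamma=0.8, margin=1.0):
        # scores: (n,) predicted ranking scores for the n candidate programs of one task
        # labels: (n,) 1.0 for programs that pass all test cases, 0.0 otherwise
        order = torch.argsort(scores, descending=True)
        ranks = torch.empty_like(order)
        ranks[order] = torch.arange(len(scores))   # rank of each candidate in the current list
        weight = gamma ** ranks.float()            # candidates near the top get larger weight

        pos, neg = labels > 0.5, labels < 0.5
        if pos.sum() == 0 or neg.sum() == 0:
            return scores.sum() * 0.0              # nothing to rank for this task

        # Hinge loss over all positive/negative pairs, weighted by how close each
        # member of the pair is to the top, so mistakes at the top dominate the gradient.
        diff = margin - (scores[pos].unsqueeze(1) - scores[neg].unsqueeze(0))
        pair_w = weight[pos].unsqueeze(1) * weight[neg].unsqueeze(0)
        return (pair_w * torch.clamp(diff, min=0.0)).sum() / pair_w.sum()

A geometric decay over ranks is only one possible way to emphasize the top of the list; the paper instead derives its weighting directly from the pass@k objective.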
Category | Method | Pass@1 | Pass@2 | Pass@3 | Pass@5 | Pass@10
Standalone LLM | PG-TD | 0.7 | 1.1 | ? | ? | 2.5
Standalone LLM | Codex | 0.7 | 1.2 | ? | ? | 3.0
Standalone LLM | WizardCoder | 2.0 | ? | ? | 3.3 | ?
Standalone LLM | WizardCoder + CodeChain | 2.5 | ? | ? | 3.3 | ?
Standalone LLM | ChatGPT | 2.9 | 4.8 | 6.3 | 8.6 | 12.1
Standalone LLM | DeepSeek-Coder | 5.2 | 7.9 | 9.8 | 12.5 | 16.4
LLM with test cases | CodeT | 2.1 | 2.3 | ? | ? | 5.3
LLM with test cases | ALGO | 5.6 | 5.6 | ? | ? | 7.7
LLM with ranker | CodeRanker w/ ChatGPT | 6.1 | 9.1 | 9.7 | 10.3 | 12.7
LLM with ranker | Top Pass w/ ChatGPT | 7.3 (+19.7%) | 10.3 (+13.2%) | 12.7 (+30.9%) | 13.3 (+29.1%) | 15.2 (+19.7%)
LLM with ranker | CodeRanker w/ DeepSeek-Coder | 7.3 | 9.1 | 10.3 | 13.3 | 17.0
LLM with ranker | Top Pass w/ DeepSeek-Coder | 9.7 (+32.9%) | 12.7 (+39.6%) | 13.3 (+29.1%) | 14.5 (+9.0%) | 18.2 (+7.1%)
Tab.1  Experiment results on the CodeContests dataset. The best result for each metric is highlighted in bold in the original table, and the numbers in parentheses indicate the relative improvement in pass@k over CodeRanker; "?" marks values not reported
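The parenthesized percentages are relative improvements over the corresponding CodeRanker row; for example, with ChatGPT as the generator the pass@1 gain is computed as follows (a trivial check, shown only to make the table easier to read):

    code_ranker_pass1, top_pass_pass1 = 6.1, 7.3   # pass@1 with ChatGPT, from Tab.1
    relative_gain = (top_pass_pass1 - code_ranker_pass1) / code_ranker_pass1
    print(f"{relative_gain:.1%}")                  # 19.7%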
Method | Pass@1 Intro | Pass@1 Inter | Pass@1 Comp | Pass@1 Total | Pass@5 Intro | Pass@5 Inter | Pass@5 Comp | Pass@5 Total
Codex | 4.1 | 0.1 | 0.0 | 0.9 | 9.7 | 0.5 | 0.1 | 2.3
AlphaCode | ? | ? | ? | ? | 14.4 | 5.6 | 4.6 | 7.2
Code-LLAMA 34B | ? | ? | ? | ? | 32.8 | 8.8 | 2.9 | 12.4
StarCoder | 7.3 | 6.9 | 4.1 | 6.4 | ? | ? | ? | ?
WizardCoder | 26.0 | 4.2 | 0.8 | 7.9 | ? | ? | ? | ?
code-davinci-002 | 19.1 | 4.3 | 1.0 | 6.6 | 42.4 | 13.1 | 4.0 | 17.1
CodeT | 34.6 | 8.1 | 2.2 | 12.2 | ? | ? | ? | ?
DeepSeek-Coder | 40.6 | 13.8 | 4.3 | 17.3 | 60.0 | 27.3 | 11.4 | 30.7
CodeRanker | 44.6 | 15.0 | 6.0 | 19.1 | 60.1 | 27.6 | 13.0 | 31.2
Top Pass | 46.6 (+4.5%) | 16.3 (+8.7%) | 6.6 (+10.0%) | 20.4 (+6.8%) | 60.4 (+0.5%) | 28.8 (+4.3%) | 13.3 (+2.3%) | 32.0 (+2.6%)
Tab.2  Experiment results on the APPS dataset. "Intro", "Inter", and "Comp" denote introductory, interview, and competition-level tasks, respectively, while "Total" covers the whole dataset; numbers in parentheses indicate the relative improvement over CodeRanker
Method | MBPP Pass@1 | MBPP Pass@5 | HumanEval Pass@1 | HumanEval Pass@5
CodeLLaMa | 47.0 | ? | 36.0 | ?
WizardCoder | 51.8 | ? | 57.3 | ?
code-davinci-002 | 58.1 | ? | 47.0 | ?
CodeT | 61.9 | ? | 50.2 | ?
DeepSeek-Coder | 68.7 | 81.0 | 60.0 | 74.5
CodeRanker | 67.6 | 77.0 | 62.8 | 73.2
Coder-Reviewer | 66.2 | ? | 62.5 | ?
Top Pass | 69.2 | 79.6 | 64.6 | 76.2
Tab.3  Experiment results on MBPP and HumanEval
Method | Pass@1 | Pass@2 | Pass@3 | Pass@5
Top Pass | 9.7 | 12.7 | 13.3 | 14.5
Top Pass w/o L_pass@k | 7.3 | 9.1 | 10.3 | 13.3
Tab.4  Ablation study on the L_pass@k loss on CodeContests
Fig.4  The influence of the false positive rate in the training dataset on different methods, measured by pass@k for k = 1 and 3
Fig.5  The impact of the number of sampled candidates at test time on pass@1
Fig.6  The impact of the hyperparameters p and q on pass@1

Listing A1  Task description of CodeForces 1619B
Top Pass (CORRECT)
Program:
import math

def count_special_numbers(n):
    # Count distinct integers in [1, n] that are perfect squares or perfect cubes.
    special_numbers = set()
    for i in range(1, math.isqrt(n) + 1):
        sq = i * i
        special_numbers.add(sq)
        cb = sq * i
        if cb <= n:
            special_numbers.add(cb)
    return len(special_numbers)

t = int(input())
for _ in range(t):
    n = int(input())
    print(count_special_numbers(n))
  Table A1 The program ranked as the top choice by Top Pass
CodeRanker (WRONG)
Program:
import math

def count_likes(n):
    cnt = 0
    i = 1
    while i * i <= n:
        cnt += 1
        j = i * i
        while j <= n:
            cnt += 1
            j *= i
            if j <= n:
                cnt -= 1
        i += 1
    return cnt

t = int(input())
for _ in range(t):
    n = int(input())
    print(count_likes(n))
  Table A2 The program ranked as the top choice by CodeRanker
λ | Pass@1 | Pass@2 | Pass@3 | Pass@5 | Pass@10
0.0 | 7.9 | 8.5 | 10.9 | 13.9 | 18.8
0.1 | 7.3 | 9.7 | 11.5 | 13.3 | 17.0
0.2 | 7.9 | 9.7 | 13.3 | 15.8 | 18.8
0.3 | 9.7 | 12.7 | 13.3 | 14.5 | 18.2
0.4 | 8.5 | 9.7 | 13.3 | 15.8 | 18.2
Table A3  Influence of different λ on pass@k
Fig.A1  Influence of different sampling temperatures on pass@k. T04, T06, T08, and T10 denote sampling temperatures of 0.4, 0.6, 0.8, and 1.0, respectively
p | q | Pass@1 | Pass@2 | Pass@3 | Pass@5 | Pass@10
0.6 | 0.4 | 7.9 | 13.9 | 15.2 | 15.8 | 18.2
0.6 | 0.5 | 7.3 | 12.7 | 15.2 | 15.8 | 17.6
0.6 | 0.6 | 7.3 | 9.7 | 11.5 | 13.3 | 15.8
0.6 | 0.7 | 7.3 | 10.9 | 12.1 | 15.2 | 17.0
0.7 | 0.4 | 7.9 | 12.1 | 13.3 | 16.4 | 19.4
0.7 | 0.5 | 7.9 | 10.9 | 12.1 | 13.9 | 17.6
0.7 | 0.6 | 7.3 | 8.5 | 12.1 | 15.8 | 17.6
0.7 | 0.7 | 7.3 | 10.9 | 12.1 | 12.7 | 17.0
0.8 | 0.4 | 7.9 | 8.5 | 11.5 | 16.4 | 17.0
0.8 | 0.5 | 8.5 | 10.3 | 12.7 | 16.4 | 19.4
0.8 | 0.6 | 7.3 | 9.1 | 10.9 | 12.7 | 16.4
0.8 | 0.7 | 7.9 | 9.1 | 10.9 | 14.5 | 19.4
0.9 | 0.4 | 8.5 | 10.3 | 12.7 | 15.8 | 17.6
0.9 | 0.5 | 9.7 | 12.7 | 13.3 | 14.5 | 18.2
0.9 | 0.6 | 8.5 | 10.3 | 12.1 | 15.2 | 17.0
0.9 | 0.7 | 7.9 | 9.1 | 10.9 | 13.3 | 16.4
1.0 | 0.4 | 7.9 | 9.7 | 10.3 | 13.9 | 17.6
1.0 | 0.5 | 8.5 | 9.1 | 10.3 | 12.1 | 13.9
1.0 | 0.6 | 7.9 | 9.1 | 10.3 | 11.5 | 17.0
1.0 | 0.7 | 7.9 | 9.1 | 10.3 | 13.9 | 17.0
Table A4  Influence of different values of the hyperparameters p and q on pass@k