Introducing Compiler Semantics into Large Language Models as Programming Language Translators: A Case Study of C to x86 Assembly

Oct 21, 2024
Shuoming Zhang, Jiacheng Zhao, Chunwei Xia, Zheng Wang, Yunji Chen, Huimin Cui
Tensor Compiler
Abstract
Compilers are complex software containing millions of lines of code, taking years to develop. This paper investigates to what extent Large Language Models (LLMs) can replace hand-crafted compilers in translating high-level programming languages to machine instructions, using C to x86 assembly as a case study. We identify two challenges of using LLMs for code translation and introduce two novel data pre-processing techniques to address the challenges: numerical value conversion and training data resampling. While only using a 13B model, our approach achieves a behavioral accuracy of over 91%, outperforming the much larger GPT-4 Turbo model by over 50%. Our results are encouraging, showing that LLMs have the potential to transform how compilation tools are constructed.
Type
Publication
In The 2024 Conference on Empirical Methods in Natural Language Processing