Introducing Compiler Semantics into Large Language Models as Programming Language Translators: A Case Study of C to x86 Assembly

Oct 21, 2024

Shuoming Zhang

Jiacheng Zhao

Chunwei Xia

Zheng Wang

Yunji Chen

Huimin Cui

Tensor Compiler

Abstract

Compilers are complex software containing millions of lines of code, taking years to develop. This paper investigates to what extent Large Language Models (LLMs) can replace hand-crafted compilers in translating high-level programming languages to machine instructions, using C to x86 assembly as a case study. We identify two challenges of using LLMs for code translation and introduce two novel data pre-processing techniques to address the challenges: numerical value conversion and training data resampling. While only using a 13B model, our approach achieves a behavioral accuracy of over 91%, outperforming the much larger GPT-4 Turbo model by over 50%. Our results are encouraging, showing that LLMs have the potential to transform how compilation tools are constructed.

Type

Conference paper

Publication

In The 2024 Conference on Empirical Methods in Natural Language Processing

Last updated on Oct 21, 2024

Large Language Model for Compiler

Authors

Chunwei Xia

Lecturer (Assistant Professor)