Anyone who has taken a basic computer science class has undoubtedly spent time devising a sorting algorithm—code that will take an unordered list of items and put them in ascending or descending order. It’s an interesting challenge because there are so many ways of doing it and because people have spent a lot of time figuring out how to do this sorting as efficiently as possible.
Sorting is so basic that algorithms are built into most standard libraries for programming languages. And, in the case of the C++ library used with the LLVM compiler, the code hasn’t been touched in over a decade.
But Google’s DeepMind AI group has now developed a reinforcement learning tool that can develop extremely optimized algorithms without first being trained on human code examples. The trick was to set it up to treat programming as a game.
It’s all a game
DeepMind, among other things, is notable for having developed software that teaches itself how to play games. That approach has proven highly effective, conquering games as varied as chess, Go, and StarCraft. While the details vary depending on which game it’s tackling, the software learns by playing itself and discovers options that allow it to maximize a score.
Because it isn’t trained on games humans play, the DeepMind system can discover approaches to the games that humans haven’t thought of. Of course, since it’s always playing against itself, there are cases where it has developed blind spots that humans can exploit.
This approach is very relevant to programming. Large language models write effective code because they have seen plenty of human examples. But because of that, they’re unlikely to develop something that humans haven’t done previously. If we’re looking to optimize well-understood algorithms, like sorting functions, then basing something on existing human code is, at best, going to get us equivalent performance. But how do we get an AI to identify a truly new approach?
The people at DeepMind took the same approach as they had with chess and Go: They turned code optimization into a game. The AlphaDev system developed x86 assembly algorithms, treating the latency of the code as a score and trying to minimize that score while ensuring that the code ran to completion without errors. Through reinforcement learning, AlphaDev gradually developed the ability to write tight, highly efficient code.
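To make the scoring idea concrete, here is a minimal Python sketch; it is not AlphaDev's actual code. It assumes a toy instruction set in which each instruction is a compare-and-swap on two positions, and it uses instruction count as a crude stand-in for measured latency. The names `run` and `score` are hypothetical, and the score is negated so that higher is better.

```python
from itertools import permutations

def run(program, values):
    """Apply each compare-and-swap (i, j) in order and return the resulting list."""
    out = list(values)
    for i, j in program:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out

def score(program, n=3):
    """Higher is better: incorrect programs get -inf, correct ones get the
    negated instruction count (an assumed proxy for latency)."""
    for perm in permutations(range(n)):
        if run(program, perm) != sorted(perm):
            return float("-inf")
    return -len(program)

correct = [(0, 1), (1, 2), (0, 1)]          # sorts any three values
too_short = [(0, 1), (1, 2)]                # fails on some inputs
redundant = correct + [(0, 1)]              # still correct, but one step longer
```

Under this toy scoring, the redundant program is valid but loses to the shorter one, which is the pressure that pushes a search toward tighter code.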
Saying that the system optimizes for latency is very different from explaining how it operates. Like most other complex AI systems, AlphaDev consists of several distinct components. One of them is a representation function, which tracks the overall performance of the code as it’s developed. This includes the general structure of the algorithm, as well as the use of x86 registers and memory.
The system adds assembly instructions individually, chosen by a Monte Carlo tree search—again, an approach borrowed from game-playing systems. The “tree” aspect of this approach allows the system to quickly narrow in on a limited area of the large range of potential instructions, while the Monte Carlo adds a degree of randomness to the precise instruction that gets chosen from that branch. (Note that “instruction” in this context includes things like the specific registers chosen to create a valid and complete assembly.)
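For readers unfamiliar with Monte Carlo tree search, the sketch below shows two pieces of the textbook version: a UCB1 rule that decides which branch to explore next, and a random playout that fills in the rest of a program. This is a generic illustration under stated assumptions. The article does not say AlphaDev uses this exact rule, and the instruction names and statistics here are made up.

```python
import math
import random

def ucb1(total_reward, visits, parent_visits, c=1.4):
    """Textbook UCB1: average reward plus a bonus for rarely tried branches."""
    if visits == 0:
        return float("inf")  # always try an unvisited branch first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits):
    """children maps a candidate instruction to (total_reward, visits)."""
    return max(children, key=lambda name: ucb1(*children[name], parent_visits))

def random_playout(program, instructions, max_len, rng):
    """Complete a partial program by appending randomly chosen instructions."""
    program = list(program)
    while len(program) < max_len:
        program.append(rng.choice(instructions))
    return program

# Hypothetical statistics gathered so far for three candidate next instructions.
children = {"mov": (3.0, 5), "cmp": (1.0, 1), "cmovg": (0.5, 1)}
best = select_child(children, parent_visits=7)

rng = random.Random(0)
completed = random_playout(["mov"], list(children), max_len=4, rng=rng)
```

With these numbers, `cmp` is chosen even though `mov` has been tried more often: it has the higher average reward and, having been visited only once, a larger exploration bonus.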
The system then evaluates the state of the assembly code for latency and validity and assigns it a score, comparing that to the score of the previous state. And, through reinforcement learning, it hangs on to information about how going down different branches of the tree works, given the program’s state. Over time, it “learns” how to achieve a winning game state (a completed sort) with a maximum score, meaning a minimum latency.
The main benefit of this system is that its training doesn’t have to involve any code examples. Instead, the system generates its own code examples and then evaluates them. In the process, it hangs on to information about combinations of instructions that are effective in sorting.
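The idea of generating candidates from scratch and judging them purely by results can be illustrated with a small Python sketch. It swaps in exhaustive search for reinforcement learning, which is only feasible because the toy problem (sorting three values with compare-and-swap steps) is tiny. The names `PAIRS`, `sorts_everything`, and `shortest_sorter` are hypothetical.

```python
from itertools import permutations, product

PAIRS = [(0, 1), (0, 2), (1, 2)]  # toy "instruction set": compare-and-swap on 3 slots

def sorts_everything(program, n=3):
    """True if the program sorts every permutation of n distinct values."""
    for perm in permutations(range(n)):
        values = list(perm)
        for i, j in program:
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
        if values != sorted(values):
            return False
    return True

def shortest_sorter(max_len=4):
    """Generate candidates shortest first and return the first one that works."""
    for length in range(1, max_len + 1):
        for program in product(PAIRS, repeat=length):
            if sorts_everything(program):
                return list(program)
    return None

found = shortest_sorter()
```

No human-written sorting routine appears anywhere in the search; the only guidance is the correctness check and the preference for shorter programs.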