Why Google stores billions of lines of code in a single repository

Rachel Potvin, Josh Levenberg
[doi] [Google Scholar] [DBLP] [Citeseer] [url]
Read: 21 September 2020

Communications of the ACM 59(7)
Association for Computing Machinery
New York, NY, USA
Pages 78-87
June 2016
Note(s): google
Papers: corbett:tocs:2013, sadowski:icse:2015, sadowski:cacm:2018, sadowski:icse-seip:2018

This paper describes Google’s monorepo approach: keeping (almost) all source code in a single repository, its pros and cons, and the tooling needed to support it at scale.

Since there is a lot of code (roughly two billion lines across about one billion files as of January 2015), Google developed “Piper” – its own source code version control system – because Perforce was no longer scaling. Piper is built on top of the Spanner database (corbett:tocs:2013).

Google practices “trunk-based development”: there are (virtually) no branches apart from a few release branches; all new code is committed to the trunk. A single trunk avoids the diamond-dependency problem, concentrates testing in one place, ensures that there is only one “true” copy of each piece of code, etc. Besides Piper, some of the tools that take advantage of or support this are:

  • Critique (sadowski:icse-seip:2018) code review system

  • CodeSearch

  • Tricorder (sadowski:cacm:2018) static analysis tools

  • automated testing infrastructure

  • Clipper to find dead code. (Or maybe to find dependencies?)
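Whichever of the two readings of Clipper is right, both reduce to reachability over the build dependency graph: the reachable set from a root target is its transitive dependencies, and targets outside every reachable set are dead-code candidates. A minimal sketch (the graph and target names are made up for illustration):

```python
def reachable(graph, roots):
    """Return all targets reachable from `roots` in a build
    dependency graph given as an adjacency list (target -> deps)."""
    seen = set()
    stack = list(roots)
    while stack:
        target = stack.pop()
        if target in seen:
            continue
        seen.add(target)
        stack.extend(graph.get(target, ()))
    return seen

# Toy dependency graph: "old_lib" is not reachable from any entry point.
deps = {"app": ["lib_a"], "lib_a": ["lib_b"], "lib_b": [], "old_lib": ["lib_b"]}
live = reachable(deps, ["app"])   # transitive dependencies of "app"
dead = set(deps) - live           # candidates for deletion
```

The same traversal answers both questions: `live` is the dependency closure, and `dead` is what a dead-code finder would flag.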

The monorepo is organized hierarchically, with different owners for different subtrees of the code. These owners review all code changes affecting their subtree. To support the kind of global cleanups that are a strength of monorepos, the “Rosie” tool splits a large change into separate review requests, one per owner.
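The splitting step can be sketched as grouping changed files by the owner of their nearest ancestor directory. This is a simplified, hypothetical model (real ownership lives in per-directory OWNERS files with more elaborate rules); the directory and owner names below are made up:

```python
import os

def split_change(changed_files, owners):
    """Group changed file paths into per-owner review requests,
    assigning each file to the owner of its nearest ancestor
    directory that has an entry in `owners`."""
    requests = {}
    for path in changed_files:
        d = os.path.dirname(path)
        while d and d not in owners:   # walk up toward the repo root
            d = os.path.dirname(d)
        owner = owners.get(d)
        requests.setdefault(owner, []).append(path)
    return requests

owners = {"base": "alice", "search/index": "bob", "search": "carol"}
change = ["base/strings.cc", "search/index/shard.cc", "search/ui/page.cc"]
requests = split_change(change, owners)
# One review request per owner: alice, bob, and carol each see
# only the files under the subtree they own.
```

Each resulting request can then be sent for review independently, which is what makes repository-wide cleanups tractable.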

A key usage pattern in a huge monorepo with many sub-owners seems to be accepting that getting all the reviews to land simultaneously is not going to happen. Changes to an API (say) are therefore made incrementally: first add the new API behind a conditional flag; then gradually migrate all callers off the old API; then, eventually, delete the old API and all the now-inactive calls to it.