How do git servers scale?

12/27/2019

Popular version control servers (like GitHub) likely handle an immense amount of traffic and need scalable, durable data storage. I was wondering how this is implemented behind the scenes.

I have a few guesses/assumptions about how it works, but I'm not sure they are 100% accurate:

  • Repositories are probably stored on disk rather than in a database (because a git server is already self-sufficient, AFAIK)
  • A single host is probably not enough to serve all the traffic, so some load balancing is needed
  • Since multiple servers are needed, each with its own storage, there is no point in keeping every repository on every server. (So I would expect each repository to be mapped to a host)
  • For reliability, each server is probably not a single host but a cluster of replicas that are kept in sync (maybe using Kubernetes etc.), and these are probably backed up periodically along with database backups.
  • There is probably a main load balancer application that routes each request to the appropriate cluster (so it knows which repository is mapped to which cluster)
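To make the "repository mapped to a cluster" guess concrete, here is a minimal sketch of a placement table. It only illustrates the assumption above and is not how any particular service does it; the names `Cluster` and `PlacementTable`, the replica addresses, and the "fewest repositories wins" rule are all hypothetical, and the in-memory dict stands in for what would be a database table.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    replicas: list[str]  # hypothetical addresses of the synced replicas
    repo_count: int = 0  # how many repositories this cluster currently owns


@dataclass
class PlacementTable:
    clusters: list[Cluster]
    # repository name -> owning cluster (a real system would persist this)
    assignments: dict[str, Cluster] = field(default_factory=dict)

    def cluster_for(self, repo: str) -> Cluster:
        """Return the cluster that owns repo, assigning one on first use."""
        if repo not in self.assignments:
            # Assumed placement rule: pick the cluster with the fewest repos.
            target = min(self.clusters, key=lambda c: c.repo_count)
            target.repo_count += 1
            self.assignments[repo] = target
        # Later lookups always return the same cluster for the same repo.
        return self.assignments[repo]


table = PlacementTable([
    Cluster("cluster-a", ["10.0.0.1", "10.0.0.2", "10.0.0.3"]),
    Cluster("cluster-b", ["10.0.1.1", "10.0.1.2", "10.0.1.3"]),
])
```

Calling `table.cluster_for("alice/project")` assigns the repository once and returns the same cluster on every later call, which is the property the load balancer would rely on.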

One other possibility is to store the entire .git directory in a database as a blob and have a scalable, stateless application fetch it for each request, perform the operation, store the result again, and then send the response. However, this is probably a really inefficient solution, so I think it is unlikely to be the underlying mechanism.

So my main questions are:

  • Do the assumptions above make sense / are they accurate?
  • How would one implement a load balancer application that directs every git request to the appropriate cluster? (e.g. would it work to map repositories to cluster IDs and IPs, store that mapping in a database, and put up a Node.js application that redirects incoming requests to the matching cluster IP?)
  • If the above is inaccurate, how would one go about implementing a git server that scales in general? (in case there are better approaches)
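For the second question, the lookup step could be sketched roughly like this (in Python rather than Node.js). It assumes Git's smart HTTP transport, where clients request `.../info/refs`, `.../git-upload-pack`, and `.../git-receive-pack` under the repository URL, and it assumes GitHub-style `owner/repo` paths. The `CLUSTER_BY_REPO` dict and the host names in it are hypothetical stand-ins for the database table, and actually proxying the request (or handling SSH traffic) is not shown.

```python
import re
from typing import Optional

# Paths a Git client requests over smart HTTP, with an optional ".git" suffix.
GIT_PATH = re.compile(
    r"^/(?P<repo>[^/]+/[^/]+?)(?:\.git)?"
    r"/(?:info/refs|git-upload-pack|git-receive-pack)$"
)

# Hypothetical mapping; in the question's design this lives in a database.
CLUSTER_BY_REPO = {
    "alice/project": "cluster-a.internal",
    "bob/tool": "cluster-b.internal",
}


def repo_from_path(path: str) -> Optional[str]:
    """Extract "owner/repo" from a request path (query string excluded)."""
    match = GIT_PATH.match(path)
    return match.group("repo") if match else None


def upstream_for(path: str) -> Optional[str]:
    """Return the cluster that should serve this path, or None if unknown."""
    repo = repo_from_path(path)
    if repo is None:
        return None  # not a Git smart HTTP request
    return CLUSTER_BY_REPO.get(repo)
```

A front-end would call `upstream_for(request_path)` and forward the request to the returned host, answering with a 404 when the result is `None`.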
-- ozgeneral
git
github
horizontal-scaling
kubernetes
load-balancing

1 Answer

12/27/2019

No need to rely on guesses.

For GitHub specifically, the githubengineering blog details what they had to use in order to scale to their current usage level.

Besides upgrading Rails or removing jQuery on the frontend side, they have:

Regarding Kubernetes:

-- VonC
Source: StackOverflow