How much GNU is there in GNU/Linux?

After building the infrastructure to analyse the code in an Ubuntu release, I decided to satisfy a simple curiosity: how much GNU software is actually part of a modern distribution? I picked Ubuntu natty (released in April) as a reference and used lines of code (LOC) as a rough metric for the size of a given project. I considered only the “main” repository, which is supposedly the core of the distribution: the part actually packaged by Ubuntu rather than repackaged from Debian.
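To make the metric concrete, here is a minimal sketch of counting LOC per package. It is not the code used for the figures in this post. It assumes each source package has already been unpacked into its own top-level directory, it counts non-blank lines only, and the extension list and the function names (`count_loc`, `loc_by_package`) are hypothetical.

```python
import os

# Hypothetical, deliberately short list of extensions treated as source code.
# A real tool would cover far more languages and would also strip comments.
SOURCE_EXTENSIONS = {".c", ".h", ".cpp", ".py", ".sh", ".pl"}


def count_loc(root):
    """Count non-blank lines in source files found anywhere under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() not in SOURCE_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the count going on files with odd encodings.
            with open(path, "r", encoding="utf-8", errors="replace") as handle:
                total += sum(1 for line in handle if line.strip())
    return total


def loc_by_package(unpacked_root):
    """Map each top-level directory (assumed: one source package) to its LOC."""
    result = {}
    for entry in sorted(os.listdir(unpacked_root)):
        path = os.path.join(unpacked_root, entry)
        if os.path.isdir(path):
            result[entry] = count_loc(path)
    return result
```

Counting non-blank lines is a crude choice, but it is cheap and treats every project the same way, which is all a rough size comparison needs.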

Figure 1: Total LOC split by project in Ubuntu natty's main repository

Figure 1 shows the total LOC in Ubuntu natty split by the major projects that produce it. By this metric GNU software is about 8% of the total. I didn’t include GNOME in the GNU category because it now seems to be effectively run outside GNU; including it would bring the GNU total to around 13%.
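The effect of moving GNOME in or out of the GNU category is just a regrouping of the same counts. The sketch below shows that regrouping under stated assumptions: the package names and LOC values are invented placeholders, chosen only so the result echoes the rounded percentages above. They are not measurements, and `category_shares` is a hypothetical helper.

```python
from collections import defaultdict


def category_shares(loc_by_package, category_of, default="small projects"):
    """Return each category's share of total LOC, in percent."""
    totals = defaultdict(int)
    for package, loc in loc_by_package.items():
        # Packages without an explicit category fall into the default bucket.
        totals[category_of.get(package, default)] += loc
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cat: 100.0 * loc / grand_total for cat, loc in totals.items()}


# Placeholder data: invented names and counts, not real measurements.
placeholder_loc = {
    "gnu-pkg": 80,
    "gnome-pkg": 50,
    "kernel-pkg": 90,
    "everything-else": 780,
}

# GNOME kept as its own category.
separate = category_shares(
    placeholder_loc,
    {"gnu-pkg": "GNU", "gnome-pkg": "GNOME", "kernel-pkg": "Kernel"},
)

# GNOME folded into GNU: only the mapping changes, the counts do not.
merged = category_shares(
    placeholder_loc,
    {"gnu-pkg": "GNU", "gnome-pkg": "GNU", "kernel-pkg": "Kernel"},
)

print(separate["GNU"], merged["GNU"])
```

Keeping the category mapping separate from the counts makes it easy to try alternative groupings without recounting anything.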

I found two things to be really surprising in this chart. The first is that the kernel is actually comparable in size to all the GNU software [1]. The second is that small projects actually dominate the total amount. It seems that at least for what Ubuntu packages, the origin of the software is highly dispersed.

Figure 2: LOC Split of GNU packages in Ubuntu natty's main repository

Figure 2 shows the split of the GNU category into its components. As you’d expect, glibc, gcc, binutils and gdb are the big-ticket items. What strikes me about this split is that nearly all of these packages have popular alternatives in use. It seems you could put together a fully functional distribution without any GNU software and not cause too much disruption to end users. gdb is probably the notable exception: it is still shipped even by projects that would rather avoid GNU software, like FreeBSD.

It seems that when it comes to modern Linux-based distributions the tendency has been for the distribution to be the organization point of a highly dispersed set of software sources. No single project accounts for more than 10% of the total and a complete modern system is only formed by this aggregation.

As before, the code for these comparisons is reasonably tidy and licensed under GPL2. It’s up on GitHub. This is very much a work in progress, and although I’m fairly confident the LOC counts are correct, I welcome bug reports.

  1. The kernel slice includes not just the kernel but also its direct dependencies like iptables and udev. The kernel itself is 6% of the total.