Visualising the Ubuntu Package Repository

Like most geeks, I like data. I also like making pretty pictures. This weekend, I found a way to make pretty pictures from some data I had lying around.

The Data: Ubuntu Packages

Ubuntu is made up of thousands of packages. Each package contains a control file that provides some meta-data about the package. For example, here is the control data for the autopilot tool:

Package: python-autopilot
Priority: optional
Section: universe/python
Installed-Size: 1827
Maintainer: Ubuntu Developers 
Original-Maintainer: Thomi Richards 
Architecture: all
Source: autopilot
Version: 1.2daily12.12.10-0ubuntu1
Depends: python (>= 2.7.1-0ubuntu2), python (<< 2.8), gir1.2-gconf-2.0, gir1.2-glib-2.0, gir1.2-gtk-2.0, gir1.2-ibus-1.0, python-compizconfig, python-dbus, python-junitxml, python-qt4, python-qt4-dbus, python-testscenarios, python-testtools, python-xdg, python-xlib, python-zeitgeist
Filename: pool/universe/a/autopilot/python-autopilot_1.2daily12.12.10-0ubuntu1_all.deb
Size: 578972
MD5sum: c36f6bbab8b5ee10053b63b41ad7189a
SHA1: 749cb0df1c94630f2b3f7a4a1cd50357e0bf0e4d
SHA256: 948eeee40ad025bfb84645f68012e6677bc4447784e4214a5512786aa023467c
Description-en: Utility to write and run integration tests easily
 The autopilot engine enables to ease the writing of python tests
 for your application manipulating your inputs like the mouse and
 keyboard. It also provides a lot of utilities linked to the X server
 and detecting applications.
Homepage: https://launchpad.net/autopilot
Description-md5: 1cea8e2d895c31846b8d3482f96a24d4
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Origin: Ubuntu

As you can see, there's a lot of information here. The bit I'm interested in is the 'Depends' line. This lists all the packages that are required in order for this package to work correctly. When you install autopilot, your package manager will install all its dependencies, and the dependencies of those packages, and so on. This is (in my opinion) the best feature of a modern Linux distribution compared with Windows.
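
To get a feel for how far those dependency chains go, here's a rough sketch of that recursive resolution in code. It shells out to 'apt-cache depends' (so it assumes apt is available), ignores alternatives and virtual packages, and is only meant to illustrate the idea - it's not how I built the graph below:

from subprocess import check_output

def transitive_depends(pkg, seen=None):
    """Return the (approximate) set of packages pulled in by installing pkg."""
    if seen is None:
        seen = set()
    if pkg in seen:
        return seen
    seen.add(pkg)
    try:
        output = check_output(['apt-cache', 'depends', pkg])
    except Exception:
        return seen  # unknown or purely virtual package
    for line in output.split('\n'):
        line = line.strip()
        # lines look like 'Depends: libc6'; '<...>' marks a virtual package
        if line.startswith('Depends:'):
            dep = line.split(':', 1)[1].strip()
            if not dep.startswith('<'):
                transitive_depends(dep, seen)
    return seen

print("python-autopilot pulls in %d packages" % len(transitive_depends('python-autopilot')))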

Packages and their dependent packages form a directed graph. My goal is to make pretty pictures of this graph to see if I can learn anything useful.

The Process: Python

First, I wanted to extract the data from the apt package manager and create a graph data structure I could fiddle with. Using the excellent graph-tool library, I came up with this horrible horrible piece of python code:

#!/usr/bin/env python

from re import match, split
from subprocess import check_output, CalledProcessError
from debian.deb822 import Deb822
from graph_tool.all import *

graph = Graph()
package_nodes = {}
package_info = {}


def get_package_list():
    """Return a list of packages, both installed and uninstalled."""
    output = check_output(['dpkg','-l','*'])
    packages = []
    for line in output.split('\n'):
        parts = line.split()
        if not parts or not match('[uirph][nicurhWt]', parts[0]):
            continue
        packages.append(parts[1])
    return packages


def get_package_info(pkg_name):
    """Get a dict-like object containing information for a specified package."""
    global package_info
    if pkg_name in package_info:
        return package_info.get(pkg_name)
    else:
        try:
            yaml_stream = check_output(['apt-cache','show',pkg_name])
        except CalledProcessError:
            print "Unable to find info for package: '%s'" % pkg_name
            package_info[pkg_name] = {}
            return {}
        d = Deb822(yaml_stream)
        package_info[pkg_name] = d
        return d


def get_graph_node_for_package(pkg_name):
    """Given a package name, return the graph node for that package. If the graph
    node does not exist, it is created, and its meta-data filled.

    """
    global graph
    global package_nodes

    if pkg_name not in package_nodes:
        n = graph.add_vertex()
        package_nodes[pkg_name] = n
        # add node properties:
        pkg_info = get_package_info(pkg_name)
        graph.vertex_properties["package-name"][n] = pkg_name
        graph.vertex_properties["installed-size"][n] = int(pkg_info.get('Installed-Size', 0))
        return n
    else:
        return package_nodes.get(pkg_name)


def get_sanitised_depends_list(depends_string):
    """Given a Depends string, return a list of package names. Versions are
    stripped, and alternatives are all shown.

    """
    if depends_string == '':
        return []
    parts = split('[,\|]', depends_string)
    return [ match('(\S*)', p.strip()).groups()[0] for p in parts]

if __name__ == '__main__':
    # Create property lists we need:
    graph.vertex_properties["package-name"] = graph.new_vertex_property("string")
    graph.vertex_properties["installed-size"] = graph.new_vertex_property("int")

    # get list of packages:
    packages = get_package_list()

    # graph_nodes = graph.add_vertex(n=len(packages))
    n = 0
    for pkg_name in packages:
        node = get_graph_node_for_package(pkg_name)
        pkg_info = get_package_info(pkg_name)
        # purely virtual packages won't have a package info object:
        if pkg_info:
            depends = pkg_info.get('Depends', '')
            depends = get_sanitised_depends_list(depends)
            for dependency in depends:
                graph.add_edge(node, get_graph_node_for_package(dependency))
        n += 1
        if n % 10 == 0:
            print "%d / /%d" % (n, len(packages))

    graph.save('graph.gml')


Yes, I realise this is terrible code. However, I also wrote it in about 10 minutes, and I'm not planning on using it for anything serious - this is an experiment!

Running this script gives me a 2.6MB .gml file (it also takes about half an hour - did I mention that the code is terrible?). I can then import this file into gephi, run a layout algorithm over it for the best part of an hour (during which time my laptop starts sounding a lot like a vacuum cleaner), and start making pretty pictures!
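
If you'd rather not leave Python, graph-tool can also do the layout and rendering itself. Here's a minimal sketch that reloads the saved graph and colours and sizes nodes by degree (the sfdp layout will still take a long while on a graph this size, and the result is rougher than what gephi produces):

from graph_tool.all import load_graph, sfdp_layout, graph_draw, prop_to_size

# reload the graph saved by the script above:
g = load_graph('graph.gml')
deg = g.degree_property_map('total')   # in-degree + out-degree

pos = sfdp_layout(g)                   # force-directed layout - slow!
graph_draw(g, pos=pos,
           vertex_fill_color=deg,
           vertex_size=prop_to_size(deg, mi=2, ma=15),
           output_size=(2000, 2000),
           output='packages.png')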

The Pretties:

Without further ado - here's the first rendering. This is the entire graph. The node colouring indicates the node degree (the number of edges connected to the node) - blue is low, red is high. Edges are coloured according to their target node.

These images are all rendered small enough to fit on the web page. Click on them to get the full image.
A few things are fairly interesting about this graph. First, there's a definite central node, surrounded by a community of other packages. This isn't that surprising - most things (everything?) rely on the standard C library eventually.
The graph has several other distinct communities as well. I've produced a number of images below that show the various communities, along with a short comment.
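
Out of curiosity, it's easy to check exactly which package that central node is. Tacking something like this onto the end of the script above reports the most-connected package (I'd expect it to be the C library, but the code just prints whatever it finds):

# uses the 'graph' object and 'package-name' property from the script above
names = graph.vertex_properties['package-name']
central = max(graph.vertices(),
              key=lambda v: v.in_degree() + v.out_degree())
print("Most connected package: %s (%d edges)" %
      (names[central], central.in_degree() + central.out_degree()))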

C++

These two large nodes are libgcc1 (top), and libstdc++ (bottom). As we'll see soon, the bottom-right corner of the graph is dominated by C++ projects.

Qt and KDE

This entire island of nodes is made up of the Qt and KDE libraries. The higher nodes are the Qt libraries (QtCore, QtGui, QtXml etc), and the nodes lower down are KDE libraries (kcalcore4, akonadi, kmime4 etc).

Python

The two large nodes here are 'python' and 'python2.7'. Interestingly, 'python3' is a much smaller community, just above the main python group.

System

Just below the python community there's a large, loosely-connected network of system tools. Notable members of this community include the Linux kernel packages, upstart, netbase, adduser, and many others.

Gnome


This is GNOME. At its core is 'libglib', and it expands out to libgtk, libgdk-pixbuf (along with many other libraries), and from there to various applications that use these libraries (gnome-settings-daemon for example).

Mono

At the very top of the graph, off on an island by themselves are the mono packages.

Others

The wonderful thing about this graph is that the neighbourhoods are fractal. I've outlined several of the large ones, but looking closer reveals small clusters of related packages. For example: multimedia packages:
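
One way to zoom in on a cluster like this without gephi is to filter the graph down to everything within a couple of dependency hops of a chosen package. A rough sketch (the package name and hop count are just examples, and it assumes the 'package-name' property survives the .gml round trip - running it at the end of the script above avoids any doubt):

from graph_tool.all import load_graph, GraphView

g = load_graph('graph.gml')
names = g.vertex_properties['package-name']

# find the vertex for the package we care about:
start = next(v for v in g.vertices() if names[v] == 'gstreamer0.10-plugins-base')

# simple two-hop expansion along the Depends edges:
keep = g.new_vertex_property('bool')
keep[start] = True
frontier = [start]
for hop in range(2):
    next_frontier = []
    for v in frontier:
        for u in v.out_neighbours():
            if not keep[u]:
                keep[u] = True
                next_frontier.append(u)
    frontier = next_frontier

cluster = GraphView(g, vfilt=keep)
print("%d packages within two hops" % cluster.num_vertices())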

This is Just the Beginning...

This is an amazing dataset, and really this is just the beginning. There are a number of things I want to look into, including:
  • Adding 'Recommends' and 'Suggests' links between packages - with a lower edge weight than Depends (there's a rough sketch of this below).
  • Colour coding nodes according to which repository section the package can be found in.
  • Trying to categorise libraries vs applications - do applications end up clustered like libraries do?
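
For the first of those, the change to the script is fairly small: add an edge property for the weight, and walk Recommends and Suggests the same way as Depends. The inner loop could become something like this (the weights are arbitrary, and it reuses the helper functions from the script above):

# new edge property to hold the weights:
edge_weight = graph.new_edge_property('double')
graph.edge_properties['weight'] = edge_weight

FIELD_WEIGHTS = {'Depends': 1.0, 'Recommends': 0.5, 'Suggests': 0.2}

for pkg_name in packages:
    node = get_graph_node_for_package(pkg_name)
    pkg_info = get_package_info(pkg_name)
    if not pkg_info:
        continue
    for field, weight in FIELD_WEIGHTS.items():
        for dependency in get_sanitised_depends_list(pkg_info.get(field, '')):
            e = graph.add_edge(node, get_graph_node_for_package(dependency))
            edge_weight[e] = weight
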
I'm open to suggestions, however - what do you think I should look into next?