Visualising the Ubuntu Package Repository

Like most geeks, I like data. I also like making pretty pictures. This weekend, I found a way to make pretty pictures from some data I had lying around.

The Data: Ubuntu Packages

Ubuntu is made up of thousands of packages. Each package contains a control file that provides some meta-data about the package. For example, here is the control data for the autopilot tool:

Package: python-autopilot
Priority: optional
Section: universe/python
Installed-Size: 1827
Maintainer: Ubuntu Developers 
Original-Maintainer: Thomi Richards 
Architecture: all
Source: autopilot
Version: 1.2daily12.12.10-0ubuntu1
Depends: python (>= 2.7.1-0ubuntu2), python (<< 2.8), gir1.2-gconf-2.0, gir1.2-glib-2.0, gir1.2-gtk-2.0, gir1.2-ibus-1.0, python-compizconfig, python-dbus, python-junitxml, python-qt4, python-qt4-dbus, python-testscenarios, python-testtools, python-xdg, python-xlib, python-zeitgeist
Filename: pool/universe/a/autopilot/python-autopilot_1.2daily12.12.10-0ubuntu1_all.deb
Size: 578972
MD5sum: c36f6bbab8b5ee10053b63b41ad7189a
SHA1: 749cb0df1c94630f2b3f7a4a1cd50357e0bf0e4d
SHA256: 948eeee40ad025bfb84645f68012e6677bc4447784e4214a5512786aa023467c
Description-en: Utility to write and run integration tests easily
 The autopilot engine enables to ease the writing of python tests
 for your application manipulating your inputs like the mouse and
 keyboard. It also provides a lot of utilities linked to the X server
 and detecting applications.
Homepage: https://launchpad.net/autopilot
Description-md5: 1cea8e2d895c31846b8d3482f96a24d4
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Origin: Ubuntu

As you can see, there's a lot of information here. The bit I'm interested in is the 'Depends' line. This lists all the packages that are required in order for this package to work correctly. When you install autopilot, your package manager will install all it's dependencies, and the dependencies of all those packages etc. This is (in my opinion), the best feature of a modern Linux distribution, compared with Windows.

Packages and their dependant packages form a directed graph. My goal is to make pretty pictures of this graph to see if I can learn anything useful.

Process: The Python

First, I wanted to extract the data from the apt package manager and create a graph data structure I could fiddle with. Using the excellent graph-tool library, I came up with this horrible horrible piece of python code:

#!/usr/bin/env python

from re import match, split
from subprocess import check_output
from debian.deb822 import Deb822
from graph_tool.all import *

graph = Graph()
package_nodes = {}
package_info = {}


def get_package_list():
    """Return a list of packages, both installed and uninstalled."""
    output = check_output(['dpkg','-l','*'])
    packages = []
    for line in output.split('\n'):
        parts = line.split()
        if not parts or not match('[uirph][nicurhWt]', parts[0]):
            continue
        packages.append(parts[1])
    return packages


def get_package_info(pkg_name):
    """Get a dict-like object containing information for a specified package."""
    global package_info
    if pkg_name in package_info:
        return package_info.get(pkg_name)
    else:
        try:
            yaml_stream = check_output(['apt-cache','show',pkg_name])
        except:
            print "Unable to find info for package: '%s'" % pkg_name
            package_info[pkg_name] = {}
            return {}
        d = Deb822(yaml_stream)
        package_info[pkg_name] = d
        return d


def get_graph_node_for_package(pkg_name):
    """Given a package name, return the graph node for that package. If the graph
    node does not exist, it is created, and it's meta-data filled.

    """
    global graph
    global package_nodes

    if pkg_name not in package_nodes:
        n = graph.add_vertex()
        package_nodes[pkg_name] = n
        # add node properties:
        pkg_info = get_package_info(pkg_name)
        graph.vertex_properties["package-name"][n] = pkg_name
        graph.vertex_properties["installed-size"][n] = int(pkg_info.get('Installed-Size', 0))
        return n
    else:
        return package_nodes.get(pkg_name)


def get_sanitised_depends_list(depends_string):
    """Given a Depends string, return a list of package names. Versions are
    stripped, and alternatives are all shown.

    """
    if depends_string == '':
        return []
    parts = split('[,\|]', depends_string)
    return [ match('(\S*)', p.strip()).groups()[0] for p in parts]

if __name__ == '__main__':
    # Create property lists we need:
    graph.vertex_properties["package-name"] = graph.new_vertex_property("string")
    graph.vertex_properties["installed-size"] = graph.new_vertex_property("int")

    # get list of packages:
    packages = get_package_list()

    # graph_nodes = graph.add_vertex(n=len(packages))
    n = 0
    for pkg_name in packages:
        node = get_graph_node_for_package(pkg_name)
        pkg_info = get_package_info(pkg_name)
        # purely virtual packages won't have a package info object:
        if pkg_info:
            depends = pkg_info.get('Depends', '')
            depends = get_sanitised_depends_list(depends)
            for dependancy in depends:
                graph.add_edge(node, get_graph_node_for_package(dependancy))
        n += 1
        if n % 10 == 0:
            print "%d / /%d" % (n, len(packages))

    graph.save('graph.gml')


Yes, I realise this is terrible code. However, I also wrote it in 10 minutes time, and I'm not planning on using it for anything serious - this is an experiment!

Running this script gives me a 2.6MB .gml file (it also takes about half an hour - did I mention that the code is terrible?). I can then import this file into gephi, run a layout algorithm over it for the best part of an hour (during which time my laptop starts sounding a lot like a vacuum cleaner), and start making pretty pictures!

The Pretties:

Without further ado - here's the first rendering. This is the entire graph. The node colouring indicates the node degree (the number of edges connected to the node) - blue is low, red is high. Edges are coloured according to their target node.

These images are all rendered small enough to fit on the web page. Click on them to get the full image.
A few things are fairly interesting about this graph. First, there's a definite central node, surrounded by a community of other packages. This isn't that surprising - most things (everything?) relies on the standard C library eventually.
The graph has several other distinct communities as well. I've produced a number of images below that show the various communities, along with a short comment.

C++

These two large nodes are libgcc1 (top), and libstdc++ (bottom). As we'll see soon, the bottom-right corder of the graph is dominated by C++ projects.

Qt and KDE

This entire island of nodes is made of up the Qt and KDE libraries. The higher nodes are the Qt libraries (QtCore, QtGui, QtXml etc), and the nodes lower down are KDE libraries (kcalcore4, akonadi, kmime4 etc).

Python

 The two large nodes here are 'python' and 'python2.7'. Interestingly, 'python3' is a much smaller community, just above the main python group.

System

Just below the python community there's a large, loosely-connected network of system tools. Notable members of this community include the Linux kernel packages, upstart, netbase, adduser, and many others.

Gnome


This is GNOME. At it's core is 'libglib', and it expands out to libgtk, libgdk-pixbuf (along with many other libraries), and from there to various applications that use these libraries (gnome-settings-daemon for example).

Mono

At the very top of the graph, off on an island by themselves are the mono packages.

Others

The wonderful thing about this graph is that the neighbourhoods are fractal. I've outlined several of the large ones, but looking closer reveals small clusters of related packages. For example: multimedia packages:

This is Just the Beginning...

This is an amazing dataset, and really this is just the beginning. There's a number of things I want to look into, including:
  • Adding 'Recommends' and 'Suggests' links between packages - with a lower edge weight than Depends.
  • Colour coding nodes according to which repository section the package can be found in.
  • Try to categorise libraries vs applications - do applications end up clustered like libraries do?
I'm open to suggestions however - what do you think I should into next?

23 comments:

Anonymous said...

Awesome project!
Id like to see what Lua is connected to.

Darren Gallagher said...

Hi,

What settings in gephi did you use?

Thanks
DazzaG

Jooyoung Hahn said...

It looks very interesting. Maybe, it can be used for a graphical profiling-analysis of C++ or Fortran code. It will show where is a bottle neck of complicated function calling structures.

ajx said...

Sounds like you got way too much time on your hands.

Anonymous said...

I would love these at desktop background resolutions! Amazing work.

Ankur said...

holy crap this is beautiful. i want to put this on my wall.

ajx said...
This comment has been removed by the author.
Eric Swanson said...

First off, I love this post. I plan on taking a look at the code and giving it a run when I get off of work.

Second, could you put the .gml file up somewhere so that we could take a look at it?

Sam Walton said...

Quite amazing and very good work.

Thomi Richards said...

Hi Everyone,

Thanks for the feedback. Next week I'll post a data dump containing the gml file, the gephi file, and some renderings at desktop resolutions - maybe some more interesting analysis as well.

Until then - cheers!

Alex said...

How about ignoring libc6, since anyway everything depends on it ?
Maybe you get clearer structures then.

Alex

naisanza said...

I can't describe how much I loved this. Thanks for creating this beautiful post.

Sergey "Shnatsel" Davidoff said...

I recall getting very similar graphs from apt-rdepends for debugging purposes. The scrips was as simple as "apt-rdepends --dotty | springgraph" with some parameters. But your solution is more elaborate and admittedly fancier!

Anonymous said...

Really cool! You've inspired me to try something similar with the dependencies of a web-cms.

Anonymous said...

From a security prespective it'd be interesting to see the dependencies stemming from libraries/apps with 0day vulnerabilities.

Anonymous said...

you could integrate popcon data.

alessandrozonin said...

That's great! Published into complex network analysis pinboard http://pinterest.com/alessandrozonin/

Ciao

Andrew said...

Looks nice, but is useless as a static image. You need to be able to make those graphs searchable and explorable to users, otherwise it's little more than (tiny) lines on a page.

There is a way to export these Gephi maps into the web and make them dynamic, but it's currently beyond me. Anyone know how? Is there a tutorial?

I'd also be interested in finding out what Gephi settings you used.

--Andrew
Twitter: @ajduffs

Martin Thoma said...

Those graphics are really cool!

I tried to run your script on my Ubuntu 10.04 machine, but I got:

> Traceback (most recent call last):
> File "autopilotproj.py", line 4, in
> from subprocess import check_output
> ImportError: cannot import name check_output

I have Python 2.6 (I guess I should update it...)
Work-around: http://stackoverflow.com/a/2924457/562769

Next:
> ImportError: No module named debian.deb822

Seems to be in python-debian, but I have python-debian :-(
Well, this has to wait until I update my system...

Thomi Richards said...

Hi everyone,

First, thanks for the feedback and words of encouragement. This is the first time I've done any sort of visualisation, so I'm still learning how to do this stuff.

I promise I'll do a follow-up post some time soon (hopefully this week), and release both the .gml and .gephi files, so you can play around with the data as well.

Finally, I'd love to produce Andrew's suggestion - an interactive visualisation of the graph. I can produce a 12MB SVG image, but most browsers choke when I try and load it. So I'm looking for a JS library to selectively load & display parts of a large SVG file. Any pointers most welcome :)

Cheers,

Anonymous said...

It'd be amazing if someone made this into an interactive HTML5 map.

Cacasodo said...

I second Andrew's comment to have a navigable, searchable version. Great work, Thomi!
'sodo

Thomi Richards said...

I actually started working on a navigable version, but quickly ran into difficulty with webGL - now that I've learned a bit more about webGL, I may be able to take this on again.

BTW, this is my old blog. If you haven't seen it already, I posted a follow-up to this articvle over here: http://www.tech-foo.net/ubuntu-package-repository-visualisation-take-2.html

Cheers,

Post a Comment