Move Along

Nothing to see here... I finally got fed up with blogspot being generally crappy, so I've moved to a new location. Join me there!

Visualising the Ubuntu Package Repository

Like most geeks, I like data. I also like making pretty pictures. This weekend, I found a way to make pretty pictures from some data I had lying around.

The Data: Ubuntu Packages

Ubuntu is made up of thousands of packages. Each package contains a control file that provides some meta-data about the package. For example, here is the control data for the autopilot tool:

Package: python-autopilot
Priority: optional
Section: universe/python
Installed-Size: 1827
Maintainer: Ubuntu Developers 
Original-Maintainer: Thomi Richards 
Architecture: all
Source: autopilot
Version: 1.2daily12.12.10-0ubuntu1
Depends: python (>= 2.7.1-0ubuntu2), python (<< 2.8), gir1.2-gconf-2.0, gir1.2-glib-2.0, gir1.2-gtk-2.0, gir1.2-ibus-1.0, python-compizconfig, python-dbus, python-junitxml, python-qt4, python-qt4-dbus, python-testscenarios, python-testtools, python-xdg, python-xlib, python-zeitgeist
Filename: pool/universe/a/autopilot/python-autopilot_1.2daily12.12.10-0ubuntu1_all.deb
Size: 578972
MD5sum: c36f6bbab8b5ee10053b63b41ad7189a
SHA1: 749cb0df1c94630f2b3f7a4a1cd50357e0bf0e4d
SHA256: 948eeee40ad025bfb84645f68012e6677bc4447784e4214a5512786aa023467c
Description-en: Utility to write and run integration tests easily
 The autopilot engine enables to ease the writing of python tests
 for your application manipulating your inputs like the mouse and
 keyboard. It also provides a lot of utilities linked to the X server
 and detecting applications.
Homepage: https://launchpad.net/autopilot
Description-md5: 1cea8e2d895c31846b8d3482f96a24d4
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Origin: Ubuntu

As you can see, there's a lot of information here. The bit I'm interested in is the 'Depends' line. This lists all the packages that are required in order for this package to work correctly. When you install autopilot, your package manager will install all its dependencies, and the dependencies of all those packages, and so on. This is (in my opinion) the best feature of a modern Linux distribution, compared with Windows.

Packages and their dependent packages form a directed graph. My goal is to make pretty pictures of this graph to see if I can learn anything useful.

The Process: Python

First, I wanted to extract the data from the apt package manager and create a graph data structure I could fiddle with. Using the excellent graph-tool library, I came up with this horrible horrible piece of python code:

#!/usr/bin/env python

from re import match, split
from subprocess import check_output, CalledProcessError
from debian.deb822 import Deb822
from graph_tool.all import *

graph = Graph()
package_nodes = {}
package_info = {}


def get_package_list():
    """Return a list of packages, both installed and uninstalled."""
    output = check_output(['dpkg','-l','*'])
    packages = []
    for line in output.split('\n'):
        parts = line.split()
        if not parts or not match('[uirph][nicurhWt]', parts[0]):
            continue
        packages.append(parts[1])
    return packages


def get_package_info(pkg_name):
    """Get a dict-like object containing information for a specified package."""
    global package_info
    if pkg_name in package_info:
        return package_info.get(pkg_name)
    else:
        try:
            yaml_stream = check_output(['apt-cache','show',pkg_name])
        except CalledProcessError:
            print "Unable to find info for package: '%s'" % pkg_name
            package_info[pkg_name] = {}
            return {}
        d = Deb822(yaml_stream)
        package_info[pkg_name] = d
        return d


def get_graph_node_for_package(pkg_name):
    """Given a package name, return the graph node for that package. If the graph
    node does not exist, it is created, and its meta-data filled.

    """
    global graph
    global package_nodes

    if pkg_name not in package_nodes:
        n = graph.add_vertex()
        package_nodes[pkg_name] = n
        # add node properties:
        pkg_info = get_package_info(pkg_name)
        graph.vertex_properties["package-name"][n] = pkg_name
        graph.vertex_properties["installed-size"][n] = int(pkg_info.get('Installed-Size', 0))
        return n
    else:
        return package_nodes.get(pkg_name)


def get_sanitised_depends_list(depends_string):
    """Given a Depends string, return a list of package names. Versions are
    stripped, and alternatives are all shown.

    """
    if depends_string == '':
        return []
    parts = split('[,\|]', depends_string)
    return [ match('(\S*)', p.strip()).groups()[0] for p in parts]

if __name__ == '__main__':
    # Create property lists we need:
    graph.vertex_properties["package-name"] = graph.new_vertex_property("string")
    graph.vertex_properties["installed-size"] = graph.new_vertex_property("int")

    # get list of packages:
    packages = get_package_list()

    # graph_nodes = graph.add_vertex(n=len(packages))
    n = 0
    for pkg_name in packages:
        node = get_graph_node_for_package(pkg_name)
        pkg_info = get_package_info(pkg_name)
        # purely virtual packages won't have a package info object:
        if pkg_info:
            depends = pkg_info.get('Depends', '')
            depends = get_sanitised_depends_list(depends)
            for dependency in depends:
                graph.add_edge(node, get_graph_node_for_package(dependency))
        n += 1
        if n % 10 == 0:
            print "%d / /%d" % (n, len(packages))

    graph.save('graph.gml')


Yes, I realise this is terrible code. However, I also wrote it in about ten minutes, and I'm not planning on using it for anything serious - this is an experiment!

Running this script gives me a 2.6MB .gml file (it also takes about half an hour - did I mention that the code is terrible?). I can then import this file into gephi, run a layout algorithm over it for the best part of an hour (during which time my laptop starts sounding a lot like a vacuum cleaner), and start making pretty pictures!
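
As a quick sanity check before firing up gephi, you can load the .gml file straight back into graph-tool and inspect the degree data - the same numbers that drive the node colouring below. Here's a minimal sketch (assuming the 'graph.gml' file produced by the script above):

#!/usr/bin/env python

from graph_tool.all import load_graph

# load the graph saved by the previous script:
graph = load_graph('graph.gml')

# total degree (in-edges + out-edges) for every vertex:
degrees = graph.degree_property_map('total')
names = graph.vertex_properties['package-name']

# print the ten most-connected packages:
ranked = sorted(graph.vertices(), key=lambda v: degrees[v], reverse=True)
for vertex in ranked[:10]:
    print "%s: %d" % (names[vertex], degrees[vertex])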

The Pretties:

Without further ado - here's the first rendering. This is the entire graph. The node colouring indicates the node degree (the number of edges connected to the node) - blue is low, red is high. Edges are coloured according to their target node.

These images are all rendered small enough to fit on the web page. Click on them to get the full image.

A few things are fairly interesting about this graph. First, there's a definite central node, surrounded by a community of other packages. This isn't that surprising - most things (everything?) rely on the standard C library eventually.

The graph has several other distinct communities as well. I've produced a number of images below that show the various communities, along with a short comment.

C++

These two large nodes are libgcc1 (top), and libstdc++ (bottom). As we'll see soon, the bottom-right corner of the graph is dominated by C++ projects.

Qt and KDE

This entire island of nodes is made up of the Qt and KDE libraries. The higher nodes are the Qt libraries (QtCore, QtGui, QtXml etc), and the nodes lower down are KDE libraries (kcalcore4, akonadi, kmime4 etc).

Python

The two large nodes here are 'python' and 'python2.7'. Interestingly, 'python3' is a much smaller community, just above the main python group.

System

Just below the python community there's a large, loosely-connected network of system tools. Notable members of this community include the Linux kernel packages, upstart, netbase, adduser, and many others.

Gnome

This is GNOME. At its core is 'libglib', and it expands out to libgtk, libgdk-pixbuf (along with many other libraries), and from there to various applications that use these libraries (gnome-settings-daemon, for example).

Mono

At the very top of the graph, off on an island by themselves, are the Mono packages.

Others

The wonderful thing about this graph is that the neighbourhoods are fractal. I've outlined several of the large ones, but looking closer reveals small clusters of related packages. For example, here are the multimedia packages:

This is Just the Beginning...

This is an amazing dataset, and really this is just the beginning. There are a number of things I want to look into, including:
  • Adding 'Recommends' and 'Suggests' links between packages - with a lower edge weight than Depends (see the sketch after this list).
  • Colour-coding nodes according to which repository section the package can be found in.
  • Trying to categorise libraries vs applications - do applications end up clustered like libraries do?
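
For the first item on that list, the change to the script above is small. Here's a rough sketch of what weighted edges might look like in graph-tool (the 1.0 and 0.5 weights are an arbitrary choice of mine):

# create an edge weight property alongside the vertex properties:
graph.edge_properties["weight"] = graph.new_edge_property("double")

# ...then, in the main loop, weight 'Depends' edges more heavily
# than 'Recommends' edges:
for dependency in get_sanitised_depends_list(pkg_info.get('Depends', '')):
    e = graph.add_edge(node, get_graph_node_for_package(dependency))
    graph.edge_properties["weight"][e] = 1.0
for recommendation in get_sanitised_depends_list(pkg_info.get('Recommends', '')):
    e = graph.add_edge(node, get_graph_node_for_package(recommendation))
    graph.edge_properties["weight"][e] = 0.5
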
I'm open to suggestions, however - what do you think I should look into next?

Writing UI Tests with Autopilot

At the UDS-r sprint in Copenhagen, I'll be running a session for anyone interested in Autopilot. There will be a demo, and Autopilot "experts" on hand to answer any questions you might have.

Autopilot is a tool for automating UI testing. We've been using it with great success in the last two Ubuntu cycles, and we're starting to support testing traditional Qt4, Qt5, QML, and GTK applications.

I hope to see you all there!

So you missed PyCon US...

If you're anything like me, you've watched another PyCon US come and go. Living in New Zealand makes attending overseas conferences an expensive proposition. So you've missed the conference. You've watched all the talks on pyvideo.org, but it's still not enough. You'd love to attend a PyCon in person, perhaps one in an exotic location (what a great opportunity for a family vacation). Of course, I have a solution: Come to Kiwi PyCon!

In contrast to the US PyCon, Kiwi PyCon is a smaller, more intimate affair, with a few hundred delegates, two streams, and plenty of chances to meet other python hackers from the Australia/New Zealand/Pacific region. Places are limited, and registrations are open, so here's what you need to do to beat the post-PyCon blues:
  1. Go to nz.pycon.org, and register for the conference. While you're there, check out the sponsorship options!
  2. If you're feeling brave, submit a talk proposal!
  3. Book accommodation and flights (we will soon have accommodation options listed on the website).
  4. Count down the days to the conference!
It's that simple. Do it now!

Experimenting with C++ std::make_shared

C++11 is upon us, and one of the more utilitarian changes in the new standard is the inclusion of the new smart pointer types: unique_ptr, shared_ptr and weak_ptr. An interesting related feature is std::make_shared - a function that returns a std::shared_ptr wrapping a type you specify. The documentation promises efficiency gains from using this function:

This function allocates memory for the T object and for the shared_ptr's control block with a single memory allocation. In contrast, the declaration std::shared_ptr<T> p(new T(Args...)) performs two memory allocations, which may incur unnecessary overhead.
I was curious: how much faster is make_shared than calling new yourself? Like any good scientist, I decided to verify the claim.

I wrote a small program and tested it. Here's my code:

#include <memory>
#include <string>

class Foo
{
public:
    typedef std::shared_ptr<Foo> Ptr;

    Foo()
    : a(42)
    , b(false)
    , c(12.234)
    , d("FooBarBaz")
    {}

private:
    int a;
    bool b;
    float c;
    std::string d;
};

const int loop_count = 100000000;
int main(int argc, char** argv)
{
    for (int i = 0; i < loop_count; i++)
    {
#ifdef USE_MAKE_SHARED
        Foo::Ptr p = std::make_shared<Foo>();
#else
        Foo::Ptr p = Foo::Ptr(new Foo);
#endif
    }
    return 0;
}

This is pretty simple - we either allocate 100 million pointers manually using new, or we use the new std::make_shared. I wanted my 'Foo' class to be simple enough to fit into a couple of lines, but to contain a number of different types, including at least one complex type. I built both variants of this small application with g++, and used the 'time' utility to measure its execution time.
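
If you want to reproduce the experiment, the build-and-time loop is easy to drive from a small script. Here's a minimal python sketch (the file name 'make_shared_test.cpp' is my own, and older g++ releases may need '-std=c++0x' instead of '-std=c++11'):

#!/usr/bin/env python

from subprocess import check_call
from time import time

variants = {
    'new': [],
    'make_shared': ['-DUSE_MAKE_SHARED'],
}

for name, extra_flags in variants.items():
    # build one binary per variant, with optimisations enabled
    # (drop '-O2' to reproduce the unoptimised numbers):
    check_call(['g++', '-std=c++11', '-O2'] + extra_flags +
               ['make_shared_test.cpp', '-o', 'test_' + name])
    # time the resulting binary:
    start = time()
    check_call(['./test_' + name])
    print "%s: %.2f seconds" % (name, time() - start)
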
I realise this is a pretty crude measurement, but the results are interesting nonetheless.

My initial results are confusing - it appears as if std::make_shared is slower than using new. Then I realised that I had not enabled any optimisations. Sure enough, adding '-O2' to the g++ command line gave me some more sensible results:

OK, so make_shared only seems to be faster with optimisations turned on, which is interesting in itself. At this point, I started wondering how other compilers would fare. I decided to pick on clang and run exactly the same tests once more:

Once again we see a very similar pattern between the optimised and non-optimised code. We can also see that clang is slightly slower than g++ (although it was significantly faster at compiling). For those of you who want the numbers:

Now I have evidence for convincing people to use make_shared in favour of new!

Kiwi PyCon Sponsorship Drive

Kiwi PyCon is approaching! You probably think that's a good thing, but if you're one of the poor volunteer organisers, that's a scary thought. Why? We have bills to pay, and very little income. That means it's time to shill for some cash! Below is an excerpt from our public sponsorship announcement email. If your company is willing to sponsor a good cause, please get in touch with me.

Kiwi PyCon is organised by the New Zealand Python User group - a not-for-profit organisation. We don’t make any profit from the conference, and the organisers donate their free time to make the event a success. We rely entirely on companies’ sponsorship to pay the bills.

Sponsorship has several advantages for you:

  • It’s an opportunity to get brand exposure in front of the foremost Python experts from New Zealand and around the world.
  • It presents a fantastic networking opportunity if you are looking to employ engineers now, or in the future.
  • Align yourself with market leaders and past sponsors such as GitHub, Weta Digital, Catalyst IT, and Mozilla. Become known as a Python promoter and industry leader.
  • Gold sponsors receive five complimentary tickets to the conference and their logo on the conference shirt and all print materials.

If you’d like to sponsor the conference, a document describing sponsorship opportunities is available here.

To get in touch, email kiwipycon-sponsorship@nzpug.org.

Python GObject Introspection oddness

I recently ported indicator-jenkins to Gtk3 using the python GObject Introspection Repository (gir) bindings. Ted Gould did most of the work; I just cleaned some bits up and made sure everything worked. One issue that puzzled me for a while is that the GObject library changed the way its "notify" signal works between GObject 2 and GObject 3. I've not seen any documentation of this change, so I'll describe it here.

For this example, let's make a very simple class that has a single property:

import gobject

class MyClass(gobject.GObject):
    prop = gobject.property(type=int)

...and a very simple callback function that we want to call whenever the value of 'prop' changes:

def cb(sender, prop):
    print "property '%s' changed on %r." % (prop.name, sender)

Finally, with GObject 2 we can create an instance of 'MyClass' and connect to the 'notify' signal like this:

inst = MyClass()
inst.connect("notify", cb)
inst.prop = 42

When we run this simple program we get the following output:
property 'prop' changed on <MyClass object at 0x...>.
... which is what we expected. However, if we port this code to GObject 3, it should look like this:

from gi.repository import GObject

class MyClass(GObject.GObject):
    prop = GObject.property(type=int)


def cb(sender, prop):
    print "property '%s' changed on %r." % (prop.name, sender)


inst = MyClass()
inst.connect("notify", cb)
inst.prop = 42

Running this, however, gives an error:

/usr/lib/python2.7/dist-packages/gi/_gobject/propertyhelper.py:171: Warning: g_value_get_object: assertion `G_VALUE_HOLDS_OBJECT (value)' failed
  instance.set_property(self.name, value)
Traceback (most recent call last):
  File "gobject3.py", line 8, in cb
    print "property '%s' changed on %r." % (prop.name, sender)
AttributeError: 'NoneType' object has no attribute 'name'

The 'prop' parameter in the callback is set to None.

There is a solution, however - connecting the callback to a more specific notify signal works as expected:

from gi.repository import GObject

class MyClass(GObject.GObject):
    prop = GObject.property(type=int)


def cb(sender, prop):
    print "property '%s' changed on %r." % (prop.name, sender)


inst = MyClass()
inst.connect("notify::prop", cb)
inst.prop = 42

It took me a while to figure this out - hopefully I've saved someone else that work.