My Diversions

January 23, 2008

The Language of The Year for 2008 is Scala

Filed under: Computer Science — Tom Davies @ 8:00 am

Last week a colleague pointed me at this post by Steve Yegge. It discusses ‘code’s worst enemy’ (size) and asks which new language, by being more expressive, can reduce line count.

Dijkstra said (via Overcoming Bias):

“If we wish to count lines of code, we should not regard them as ‘lines produced’ but as ‘lines spent’: the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.”

Less expressive languages mean that algorithms need to be duplicated to be applied to different data structures. Yegge doesn’t give any examples, but he’s presumably thinking of things like the paradise benchmark which requires that you can write a function which increases salaries for all employees of a company, without any knowledge of the complicated data structure which represents the company other than the type which represents a Salary.
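To make the paradise benchmark concrete, here is a rough Java sketch of a salary increase that knows only about the Salary type. The model classes (Company, Dept, Employee, Salary) and their fields are illustrative assumptions, not taken from the benchmark itself, and reflection stands in for the generic traversal a more expressive language would give you directly. It only follows object fields and Iterable collections, so maps and arrays are ignored.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical company model; only Salary is known to the traversal.
final class Salary {
    double amount;
    Salary(double amount) { this.amount = amount; }
}

final class Employee {
    String name;
    Salary salary;
    Employee(String name, double amount) { this.name = name; this.salary = new Salary(amount); }
}

final class Dept {
    Employee manager;
    List<Object> units = new ArrayList<>(); // employees or nested departments
    Dept(Employee manager) { this.manager = manager; }
}

final class Company {
    List<Dept> depts = new ArrayList<>();
}

final class Paradise {
    // Multiplies every Salary reachable from root by factor.
    static void increase(Object root, double factor) {
        Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        walk(root, factor, seen);
    }

    private static void walk(Object node, double factor, Set<Object> seen) {
        if (node == null || !seen.add(node)) return;          // skip nulls and cycles
        if (node instanceof Salary) {
            ((Salary) node).amount *= factor;
            return;
        }
        if (node instanceof Iterable) {                         // lists, sets and so on
            for (Object child : (Iterable<?>) node) walk(child, factor, seen);
            return;
        }
        if (node.getClass().getName().startsWith("java.")) return; // don't open JDK classes
        for (Field f : node.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers()) || f.getType().isPrimitive()) continue;
            try {
                f.setAccessible(true);
                walk(f.get(node), factor, seen);
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
    }
}
```

Calling Paradise.increase(company, 1.1) would raise every salary by 10% without the traversal mentioning Dept or Employee, at the cost of losing static type checking.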

Steve wants a language which runs on the JVM, which is entirely reasonable. It gives you immediate cross-platform consistency, and the potential to use the thousands of libraries written for Java, if the language has been written to support it.

He rejects JRuby because his colleagues don’t like it and chooses the next generation of Javascript, EcmaScript 4. This has an optional type system which looks as though it is a little less expressive than Java 5’s, and of course has the usual Javascript dynamism.

Many of the comments to Steve’s post suggest Scala, a functional/object oriented language with an advanced type system. I have looked at Scala in the past, but have never been able to ‘get into it’. I think this is because it is too much like Java in syntax, and supports the OO paradigm. In Haskell, by contrast, all you have are functions, algebraic types, and type classes, so you just have to get on with it. Scala left me not knowing where to start.

To remedy that I’ve forked out hard cash for the PDF+printed version of Odersky et al.’s “Programming in Scala”. I hope that the combination of having spent money and having a physical book will motivate me to learn the language.

January 1, 2008

Confluence, Spotlight and Hessian

Filed under: Computer Science, Java — Tom Davies @ 9:42 pm

My choice for the highlight of OS X 10.5, Leopard, is that Spotlight is worth using. It’s fast enough that it’s replaced Quicksilver (I was never a power Quicksilver user) as my application launcher.

I’ve also found it useful for general searching for emails, contacts, PDF documents and so on.

This inspired my second current spare-time project, a Spotlight importer and Quickview generator for Confluence.

There are two parts to this project, one which is easy and fun and another which is difficult and (relatively) dull. Of course I’ve only done the first part!

The fun part is getting Spotlight (and Quickview) working with Confluence data. Spotlight provides search results at the file level, so each distinct entity you want to be able to find must be represented by a file on your disk. To index a Confluence instance, all you do is crawl a space via the Confluence remote API (OS X has a good XML-RPC library), and create a file for each page/blog-post. The file contains enough information for the Spotlight indexer (which comes along after the file has been created and extracts metadata for the index) to do its job, plus what Quickview needs to create thumbnails and previews in the Finder:

  • Page metadata such as creator, editor, creation date, modification date, labels, title and so on.
  • The page markup.
  • A pre-generated thumbnail, which is just a ‘screenshot’ of the page being rendered. This is shown in the Finder’s Coverflow mode. (In fact it is shown in list mode too, but I think that may be a bug in the Finder.)
  • The HTML Confluence renders for the page, plus all the images shown on the page, so that Quickview can produce an HTML preview view without making any requests to Confluence.
  • The original URL the page is at, so that opening the file opens the original page in your default browser.
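The items above could be carried in a single per-page record. The sketch below shows one way the server side might model it as a plain serialisable Java class. The class name and every field name are assumptions for illustration only; the actual file layout used by the importer is not shown here.

```java
import java.io.Serializable;
import java.util.Date;
import java.util.List;
import java.util.Map;

// Hypothetical per-page record; every name here is illustrative.
final class PageRecord implements Serializable {
    private static final long serialVersionUID = 1L;

    final String title;
    final String creator;
    final String lastEditor;
    final Date created;
    final Date modified;
    final List<String> labels;
    final String markup;               // the page's wiki markup
    final byte[] thumbnail;            // pre-generated 'screenshot' of the page
    final String renderedHtml;         // HTML as Confluence rendered it
    final Map<String, byte[]> images;  // image name -> bytes, for off-line previews
    final String originalUrl;          // opened in the browser when the file is opened

    PageRecord(String title, String creator, String lastEditor, Date created,
               Date modified, List<String> labels, String markup, byte[] thumbnail,
               String renderedHtml, Map<String, byte[]> images, String originalUrl) {
        this.title = title;
        this.creator = creator;
        this.lastEditor = lastEditor;
        this.created = created;
        this.modified = modified;
        this.labels = labels;
        this.markup = markup;
        this.thumbnail = thumbnail;
        this.renderedHtml = renderedHtml;
        this.images = images;
        this.originalUrl = originalUrl;
    }
}
```

Keeping the rendered HTML and image bytes in the same record is what lets a preview be built without any further requests to the server.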

The Confluence demonstration space looks like this in the Finder’s Coverflow mode:

[Image: Coverflow view of the Confluence demo space]

And previews of pages look like this:

[Image: Preview of a Confluence page]

The previews are accessible off-line, but the links all point to the original Confluence instance. A useful enhancement would be to send them through a helper application which determined whether the server was available, and if it was not, opened the locally stored preview HTML.

The duller part is making the process efficient enough that many users can keep their Spotlight indexes up to date without imposing too much load on the Confluence server.

I plan to achieve this using a plugin which provides a specialised RPC service which returns only the data required, in the minimum number of requests, and shares some of the work done between clients by caching. As I have an irrational dislike of XML I’m writing the plugin using the Hessian protocol, which uses a binary encoding, and is available for Objective C. At present the plugin does not do any incremental updates, and doesn’t share data between clients — it just dumps the contents of a space. It also assembles the entire response in memory, rather than streaming it back, so I won’t be installing the plugin on confluence.atlassian.com any time soon :-)

How should the plugin work? I plan to host the data on Amazon S3, with base data produced each night and incremental data produced every few minutes. Each set of data will comprise a file for each ‘level’ of Confluence authorisation. That is, there will be a file containing the data visible to an anonymous user, then a file for each group containing data visible to a member of that group, but not visible to an anonymous user and finally a file for each user containing the data they can see by virtue of their identity, rather than their membership of a group. The user files are likely to be mostly empty.
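As a small sketch of the ‘one file per authorisation level’ idea, the method below lists the data files a given client would fetch. The file naming scheme (anonymous.dat, group-NAME.dat, user-NAME.dat) is purely an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class DataFiles {
    // Returns the files a client should download, least specific first.
    // A null user means an anonymous client. File names are hypothetical.
    static List<String> filesFor(String user, List<String> groups) {
        List<String> files = new ArrayList<>();
        files.add("anonymous.dat");                 // visible to everyone
        if (user == null) {
            return files;
        }
        for (String group : groups) {
            files.add("group-" + group + ".dat");   // visible via group membership
        }
        files.add("user-" + user + ".dat");         // visible via identity alone
        return files;
    }
}
```

Because each item lives in exactly one file, a client that merges the files it is entitled to sees everything it may see, and nothing is stored twice.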

Files stored in S3 are either private to the account owner, or publicly visible, so data will be encrypted with symmetric encryption. The client requests the keys to which it is entitled from the Confluence instance before downloading the data files from S3.
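A minimal sketch of the symmetric encryption step, using the standard javax.crypto API. The choice of AES in GCM mode, the 128-bit key, and the ‘IV followed by ciphertext’ file layout are all assumptions of this example; the plan above does not commit to a particular cipher.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

final class DataFileCrypto {
    private static final int IV_BYTES = 12;   // usual nonce size for GCM
    private static final int TAG_BITS = 128;  // authentication tag length

    // One key per data file; the server hands it only to entitled clients.
    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns a fresh random IV followed by the ciphertext and tag.
    static byte[] encrypt(SecretKey key, byte[] plain) {
        try {
            byte[] iv = new byte[IV_BYTES];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] sealed = cipher.doFinal(plain);
            byte[] out = Arrays.copyOf(iv, IV_BYTES + sealed.length);
            System.arraycopy(sealed, 0, out, IV_BYTES, sealed.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads the IV from the front of the file, then decrypts and verifies the rest.
    static byte[] decrypt(SecretKey key, byte[] fileBytes) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, fileBytes, 0, IV_BYTES));
            return cipher.doFinal(fileBytes, IV_BYTES, fileBytes.length - IV_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this layout the encrypted file can sit in a publicly readable bucket, and only clients that have been given the key can recover or tamper-check its contents.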

The advantage of using S3 is that the load on the IT infrastructure hosting Confluence is reduced. If this is not an issue the files could be placed on any available http server.

December 22, 2007

EC2, Confluence, S3 and PostgreSQL

Filed under: Computer Science — Tom Davies @ 4:39 am

Update: EC2 will have persistent storage!

My side projects at the moment are both work related — that is, they’re related to Confluence, our enterprise Wiki. Even though I don’t work on that team any more the product still has a lot of mindshare with me.

My latest spare time project is running Confluence on EC2.

Amazon announced SimpleDB recently. I haven’t decided what it’s good for yet: its lack of transactions means that it isn’t a drop-in replacement for a relational database, but rather a component of massively scalable distributed systems.

While looking at SimpleDB I had another look at EC2. EC2 provides virtual machines, running Linux (or other operating systems virtualised under Linux, as far as I know) which can be created and destroyed at will, and cost 10 cents an hour to run.

Getting Confluence running on EC2 was very easy — I picked one of Amazon’s pre-built Fedora Core 4 images, installed PostgreSQL and a JDK, and that was that. After you’ve customised an image you save it to S3, and start it (or many copies of it) again when you wish.

Confluence feels as though it’s running on a machine with the specs Amazon specify, with the stated 1.7GB of physical memory. I haven’t yet tried the high end instances — quad 2GHz Xeon equivalent.

The catch with EC2 is that while a virtual machine has a 150GB disk, this data isn’t persisted when you shut the machine down — or when it dies unexpectedly due to either a software problem in the hosted operating system, or because Amazon shuts down your instance without you asking. It isn’t clear how often instances die unexpectedly, but Amazon do say “We recommend you should not rely on a single instance to provide reliability for your data.”

So simply saving the image to S3 once a day isn’t the way to go:

  • You aren’t guarding against unexpected crashes — you may lose a day’s data.
  • You have to shut your application and DB down to get a consistent image.
  • You are storing your entire OS image in each backup. This is particularly wasteful if you have many instances of the same application, as these should all be able to start from identical virtual machine images, simply being passed a parameter to tell them where to get their data from.

Confluence can be configured to keep essentially all its dynamic data in the database, so what we need to do is to keep an ‘up to date’ backup of our PostgreSQL database on S3 at ‘all times’, that is, a backup which contains our current data minus at most n minutes of transactions. (OK, there’s also a Lucene index. If you deliberately shut down your machine every day, e.g. if you save money by keeping the instance up only during business hours, you would need to copy that index to S3 too, but if you are recovering from rare unexpected restarts you probably don’t mind reindexing.)

Fortunately Amazon provides ample free (as in beer) bandwidth between EC2 and S3, and PostgreSQL provides good on-line backup facilities, so this task is relatively easy.

PostgreSQL’s on-line backup lets you specify a command to archive each log segment as it is filled; to restore, you roll these segments forward from a full backup.

To summarise:

  • Use PostgreSQL 8.2 — on-line backups have some significant improvements over 8.1 and 8.0.
  • Read the excellent documentation

PostgreSQL must be configured to use Write Ahead Log (WAL) archiving, so postgresql.conf contains an archive command like:

archive_command = '/Users/tomd/bin/archivewal %p %f'

where archivewal is:

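#!/bin/sh
# $1 is %p (the segment's path relative to PGDATA), $2 is %f (its file name).
# Compress to a temp file, upload it, then clean up. A non-zero exit status
# tells PostgreSQL the segment was not archived, so it will retry later.
bzip2 -c "$PGDATA/$1" >"/tmp/$$" &&
s3put "confluence/local-db-wal-$2.bz2" "/tmp/$$" &&
rm "/tmp/$$"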

s3put is a command from Tim Kay’s aws utility.

The backup procedure is:

  • Tell PostgreSQL that you are starting a backup
  • Copy your PGDATA directory to S3. A simple tar archive is fine: because we are also copying the logs, the backup doesn’t have to be consistent.
  • Tell PostgreSQL that you have finished the backup. At this point the log files used during the backup will also be archived.
  • Check that those logs have made it to S3 and mark your backup as OK to restore from
  • Optionally delete older backups and older log file archives.

The restore procedure (which you would use when restarting an instance) is:

  • Provide the EC2 virtual machine with a startup parameter which tells it which S3 bucket to look in for backups.
  • Unpack PGDATA from your tar file (because we are starting a fresh copy of our EC2 image, PGDATA won’t exist when we start up, and postgres won’t be running)
  • Create a recovery.conf file in PGDATA which tells postgres which command to use to copy log files back from S3.
  • Start PostgreSQL. PostgreSQL will request all the log files it needs, that is, all the log files which were archived since the backup.

Once I have all this working and tested I’ll post a followup with the details, but I can tell you now that EC2 is the most fun you can have for 10 cents an hour!

October 3, 2007

CAL and Tapestry 5, Part 2: Algebraic Types and Forms

Filed under: CAL and Open Quark, Computer Science, Java, Tapestry — Tom Davies @ 5:29 am

In my previous post I described how to use a CAL function as part of the implementation of a Java class.

This post looks at interfacing CAL to Tapestry 5 using the ‘Java Bean’ conventions of getter and setter methods for the fields in an object.

Tapestry 5 provides a BeanEditForm component which simplifies providing CRUD operations for Beans. This is described in the second part of the Tapestry 5 tutorial.

By creating a Java class which provides a Bean with fields equivalent to the constructor parameters of a CAL algebraic data type we can use CAL to provide the data model for a web UI created with Tapestry. (more…)

September 28, 2007

Simplifying Code

Filed under: Computer Science, Java — Tom Davies @ 2:50 am

In the course of trying to adapt an ANTLR lexer to be used in an Intellij IDEA language plugin, I found that I’d written this:

token = lexer.nextToken();
if (token != null)
{
    tokenStart = findCharPos(token.getLine(), token.getColumn());

    if (token.getText() == null)
    {
        tokenEnd = tokenStart;
    }
    else
    {
        tokenEnd = tokenStart + token.getText().length() - 1;
        state = tokenEnd;
    }
    if (token.getType() == 1)
    {
        state = -1;
    }
}
else
{
    tokenStart = 0;
    tokenEnd = 0;
    state = -1;
}

What a mess! Which parts of the code influence the final value of tokenEnd? Should state be modified in the token.getText() == null case?

The problem with the code above is that it is structured as cases of the state of token, rather than as the functions which determine the values of the three variables.

I changed it to:

token = lexer.nextToken();
tokenStart = calcTokenStart(token);
tokenEnd = calcTokenEnd(tokenStart, token);
state = calcState(tokenEnd, token);
...
private int calcTokenStart(Token token)
{
    return token == null ? 0 : findCharPos(token.getLine(), token.getColumn());
}

private int calcTokenEnd(int tokenStart, Token token)
{
    if (token == null)
    {
        return 0;
    }
    else
    {
        String tokenText = token.getText();
        return tokenText == null ? tokenStart : tokenStart + tokenText.length();
    }
}

private int calcState(int tokenEnd, Token token)
{
    if (token == null || token.getType() == 1)
    {
        return -1;
    }
    else
    {
        return tokenEnd;
    }
}

This shows us each calculation in isolation, and allows us to see its dependencies. Note that this isn’t an ‘extract method’ refactoring; it comes from changing your view of the function from ‘processing a new token’ to ‘calculating the new values of the three state variables’.

Note that the functions above don’t use the state of the object, even though, for instance, this.tokenStart has been assigned when calcTokenEnd is called. That would reduce the clarity of the solution.

In fact, unless this class was specifically designed to be extensible, perhaps making these functions static would be clearer and safer.
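A sketch of what the static variant might look like for the two calculations that need nothing but their arguments. The Token interface below is a minimal stand-in written for this example (the real type comes from the lexer library), and TokenMath is a made-up holder class. calcTokenStart is left out because it calls findCharPos, which presumably does need the object’s state.

```java
// Minimal stand-in for the lexer's token type (hypothetical, for illustration).
interface Token {
    String getText();
    int getType();
}

final class TokenMath {
    private TokenMath() {}  // no instances: these are pure functions

    // End position: 0 for no token, the start for an empty token,
    // otherwise start plus the text length.
    static int calcTokenEnd(int tokenStart, Token token) {
        if (token == null) {
            return 0;
        }
        String tokenText = token.getText();
        return tokenText == null ? tokenStart : tokenStart + tokenText.length();
    }

    // State: -1 when there is no token or its type is 1, otherwise the token end.
    static int calcState(int tokenEnd, Token token) {
        return (token == null || token.getType() == 1) ? -1 : tokenEnd;
    }
}
```

Being static, these methods cannot read this.tokenStart or any other field even by accident, so the compiler now enforces the dependency list shown in each signature.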

September 24, 2007

CAL and the Tapestry 5 Tutorial

Filed under: CAL and Open Quark, Computer Science, Tapestry — Tom Davies @ 10:13 pm

The technique described in my previous post can be used to create Tapestry 5 pages which call CAL functions. Tapestry also uses Javassist to enhance pages, so adding CAL integration requires that Tapestry is reconfigured to apply the CAL transformations in addition to its own — I wasn’t able to find a way to transparently modify the classes before Tapestry sees them.

I’ve modified the Hi/Lo Guessing Game from the Tapestry Tutorial to use CAL implementations for some functions:

(more…)

September 23, 2007

Javassist and Annotations for Interfacing Java to CAL

Filed under: CAL and Open Quark, Computer Science — Tom Davies @ 6:41 am

In order for CAL to interoperate with Java frameworks we need to provide a Java class which delegates method calls to CAL functions. The framework sees only the Java class, and is unaware of the delegation which is taking place.

How we do this depends on a number of factors:

  • How easy it is to hook into the class/object creation process used by the framework.
  • Whether you need a Java class there at compile-time. Perhaps you’ll be using a mixture of Java and CAL, so that you need an interface or a class to compile against, or perhaps the only use the framework will make of the class is reflectively at runtime, in which case the entire class can be synthesised.

In both cases we want the minimum amount of textual overhead, and the ability to type check our code as early as possible.
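To illustrate the ‘interface at compile-time, implementation synthesised at runtime’ case in plain Java, here is a sketch using the JDK’s dynamic proxies. This is not the Javassist approach and involves no CAL at all: a java.util.function.IntBinaryOperator stands in for the function being delegated to, and the Adder interface and Delegation class are invented for the example.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.function.IntBinaryOperator;

// The interface a (hypothetical) framework would compile against.
interface Adder {
    int add(int a, int b);
}

final class Delegation {
    // Synthesises an Adder at runtime whose add() forwards to the given function.
    static Adder adderBackedBy(IntBinaryOperator function) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (method.getName().equals("add")) {
                return function.applyAsInt((Integer) args[0], (Integer) args[1]);
            }
            // toString, hashCode and equals also arrive here; a fuller version
            // would handle them instead of rejecting them.
            throw new UnsupportedOperationException(method.getName());
        };
        return (Adder) Proxy.newProxyInstance(
                Adder.class.getClassLoader(), new Class<?>[] { Adder.class }, handler);
    }
}
```

The caller sees only an Adder, for example Delegation.adderBackedBy((x, y) -> x + y).add(2, 3), and is unaware of the delegation. The trade-off is that nothing checks at compile-time that the function’s signature matches the interface method.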

The first method I’ll describe uses Javassist:

(more…)

September 17, 2007

Quote of the day, Tyler Cowen on Google

Filed under: Computer Science, Economics, General Interest — Tom Davies @ 5:46 am

“Everyone is smart and beautiful, and I didn’t want to leave.”

Interfacing CAL to Java Frameworks — Part 1

Filed under: CAL and Open Quark, Computer Science — Tom Davies @ 4:14 am

This article discusses the differences between CAL as a client of Java libraries and CAL modules as clients of Java frameworks.

CAL is a functional programming language which runs on the JVM. One of the advantages of a language which compiles to Java bytecode is that it is simple to call any of the many available Java libraries from CAL.

(more…)

September 15, 2007

I’ll miss you, SCO

Filed under: Computer Science — Tom Davies @ 6:20 am

Perhaps not many people will miss SCO, given their desperate intellectual property shenanigans, but I only remember the good times.

My second job out of university was at an investment bank, Dominguez Barry Samuel Montagu, as the co-developer of an equities options trading system. We had a brand new Compaq 80386, running at 16MHz. I don’t recall how much memory or disk that machine had (perhaps 1MB and 40MB), but it was an exciting machine to have at that point in time. Unfortunately the existing system, while well written, ran on Windows 3.11, with 16 bit addressing and the memory and performance problems that entailed. The 386 was actually a very nice processor, as long as you ran it in 32 bit mode, and so I convinced my boss to buy SCO Xenix, newly available for the 386. This made a vast difference to the performance of the application, once we got an 80387 with two sigma signs printed on it. (The 80387 was a floating point co-processor, the 80386 not having any floating point hardware, and early versions had a hardware bug which froze the system occasionally.)

SCO made that job much more enjoyable for me, and the success of a Unix based system ensured that future projects were also done on Unix rather than Microsoft platforms (for pedants, of course Xenix is a Microsoft platform).

So I’ll remember SCO for what it was, not what it became, which is the right way to think about any lost love.
