New look

It’s time to freshen things up, so I’ve changed the theme to something a bit lighter. The picture is of St Kilda beach in Melbourne, in case you were wondering…

Posted in Uncategorized | Leave a comment

Importing data to HBase

I’ve begun to experiment with Hadoop (with the aim of eventually running jobs on EC2) for a project.

Henry Garner (the company’s CTO) provided exported HBase tables containing tweets, along with content and URLs extracted from those tweets. So the first job was to import that data back into HBase.

I’m working on Ubuntu and initially installed the Cloudera Hadoop distribution. However, I found that the way its configuration files and jars were laid out made it harder for me to understand what was going on. Coupled with the fact that I’m running Natty Narwhal while the distribution is based on Lucid Lynx, I decided to uninstall it and use a fresh (and up-to-date) version (1.0.3) from the Hadoop website.

The instructions from Michael Noll on running a single-node Hadoop cluster on Ubuntu were clear and easy to follow. Hadoop was therefore installed in /usr/local/hadoop and I run jobs as hduser. I also installed the latest HBase (0.92.1 at the time of writing) in /usr/local/hbase.

After exporting the class paths and related variables:

export HBASE_HOME=/usr/local/hbase/
export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.92.1.jar:$HBASE_HOME:\
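The classpath line above is truncated. As a sketch of a fuller version (the conf directory and the ZooKeeper jar path are assumptions based on a typical HBase 0.92 layout, not taken from the original post, so adjust versions to match your install):

```shell
# Sketch of a fuller HADOOP_CLASSPATH; the conf dir and the zookeeper jar
# are assumptions for a typical HBase 0.92 install -- adjust to your layout.
export HBASE_HOME=/usr/local/hbase
export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.92.1.jar:$HBASE_HOME/conf:$HBASE_HOME/lib/zookeeper-3.4.3.jar
```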

Hadoop and HBase are then started (as hduser):

hduser:~$ /usr/local/hadoop/bin/start-all.sh
hduser:~$ /usr/local/hbase/bin/start-hbase.sh

I then created the schemas for the tables in the HBase shell (as not doing so leads to an exception):

hduser:~$ hbase shell
hbase(main):001:0> create 'twitter_accounts', 'raw', 'base', 'extra'
hbase(main):002:0> create 'content', 'raw', 'base', 'extra'
hbase(main):003:0> create 'tweets', 'raw', 'base', 'extra'
hbase(main):004:0> create 'short_urls', 'rel'

Here the table name is followed by the names of the column families to be created. For example, the ‘tweets’ table is created with three column families: ‘raw’, ‘base’ and ‘extra’, which matches the schema of the exported data.

The data were provided as gzipped tar files, so they were uncompressed into a local directory (hbase-likely). To import the files into HBase, the next step was to copy them from the local file system into the Hadoop Distributed File System (HDFS).

hduser:~$ hadoop fs -mkdir localtable
hduser:~$ hadoop fs -copyFromLocal hbase-likely/short_urls localtable/short_urls
hduser:~$ hadoop fs -copyFromLocal hbase-likely/content localtable/content
hduser:~$ hadoop fs -copyFromLocal hbase-likely/tweets localtable/tweets
hduser:~$ hadoop fs -copyFromLocal hbase-likely/twitter_accounts localtable/twitter_accounts
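Once copied, a quick listing shows what landed in the HDFS home directory (a sanity check, run as hduser on the same machine):

```shell
# List the exported table files now sitting in HDFS under /user/hduser
hadoop fs -ls
```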

Now I could finally import the data (which takes quite a while for big tables):

hduser:~$ hadoop jar $HBASE_HOME/hbase-0.92.1.jar import twitter_accounts localtable/twitter_accounts
hduser:~$ hadoop jar $HBASE_HOME/hbase-0.92.1.jar import tweets localtable/tweets
hduser:~$ hadoop jar $HBASE_HOME/hbase-0.92.1.jar import content localtable/content
hduser:~$ hadoop jar $HBASE_HOME/hbase-0.92.1.jar import short_urls localtable/short_urls

Ta da! Now we can look at the data. Back in the HBase shell:

scan 'tweets', {COLUMNS => 'base', LIMIT=>1}

returns just the first row from the ‘base’ column of the ‘tweets’ table. From this we can see what the data looks like and, importantly, what qualifiers are applied to the column data.
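As a sketch, a scan can also be narrowed to a single family:qualifier pair via the COLUMNS option (the qualifier name ‘base:text’ here is an assumption for illustration, not taken from the actual dataset):

```
hbase(main):005:0> scan 'tweets', {COLUMNS => 'base:text', LIMIT => 1}
```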

We can count the number of rows (which may take a long time):

hbase(main):008:0> count 'short_urls'

In another post I’ll look at how to do more complex queries using Clojure and Cascalog.

Posted in Clojure | 6 Comments

Reflections on Euroclojure

I spent Thursday and Friday last week attending the first Clojure conference outside of the US – Euroclojure – in London. I’ve got to say that I’ve come back feeling really excited by what I saw there.

I’ve been to many conferences before and this one was quite small in comparison with those. However, with 200 enthusiastic coders from the UK, Europe, the US and Canada, it was a really great and friendly place to be. More importantly there were plenty of great ideas talked about as well.

Clojure is a really exciting young language. Its embrace of Java and the JVM has allowed many at the conference to bring elegant solutions to existing legacy (Java) software without wholesale rewrites. In every case this was achieved with less code and faster results. That has to be great news!

Better still was hearing about things that are really fresh and new, such as live coding of music in Overtone (like modifying waveforms in real time to hear how the sound is affected), demonstrations of J.S. Bach’s canons in Overtone, and dynamic algorithmic art displays built with Clojure. Not that programming is just about art! There were also great talks on logic programming, solving concurrency issues in databases with Datomic, automating deployments in the cloud with Pallet, and data processing on Hadoop with Cascalog, to name a few.

So many talks in such a small amount of time was pretty mind blowing, especially given the heat, but meeting so many great people was fantastic. Roll on Euroclojure 2013!

Posted in Clojure | Leave a comment

Using EBS and S3 on Amazon EC2

Over the past couple of days I’ve been getting up to speed with Amazon Web Services (AWS). The aim is to run HBase on EC2 to store a large amount of data that I will later query with Cascalog.

The process hasn’t been straightforward as information is scattered about in many different places. Here I’ll present an overview of the steps that I’ve taken and where I’ve got to so far.

Having created an AWS account and noted down my access and secret keys, and having created a key pair and X.509 certificate, I uploaded the data I’d been given to analyse into an S3 storage “bucket” created through the AWS console. This gave me somewhere to store the data from which other Amazon services could easily access it.

I then launched a vanilla AWS micro instance in the “free tier” of AWS and logged in:

ssh -i mykey.pem ec2-user@xx.xx.xx.xx

where the “x”s represent the public IP of the AWS instance.

Having logged in, I enabled the “Extra Packages for Enterprise Linux” (EPEL) repository by editing /etc/yum.repos.d/epel.repo and changing “enabled” from 0 to 1.

I then ran:

sudo yum update

to update all the packages, before running:

sudo yum install s3cmd

to install the s3cmd package from EPEL. This is used to transfer the data that I’d uploaded to my S3 bucket into an Elastic Block Storage (EBS) volume, which I could then attach to my EC2 instance.

The advantage of EBS is that it persists after the instance is shut down, which is useful as it means I can attach it to my HBase instance later.

Back on my home machine I created my new EBS volume with:

ec2-create-volume -s 10 -z eu-west-1a -K privatekey.pem -C certificate.pem --region eu-west-1

which created a 10 GB volume for me (in the same region as my running instance and my S3 bucket). This process returned a volume ID. I then attached the volume to my EC2 instance:

ec2-attach-volume -d /dev/sdh -i (instance-id) (volume-id) -K privatekey.pem -C certificate.pem --region eu-west-1
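To check the attachment took, the EC2 API tools also provide a describe command (credential flags as in the commands above):

```shell
# Show volumes and their attachment state (look for "attached")
ec2-describe-volumes -K privatekey.pem -C certificate.pem --region eu-west-1
```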

So far so good.

In order to use the EBS volume in the instance, though, I needed to make a file system on it and mount it. So, back on EC2, check whether xfs support already exists and, if not, load the module into the kernel:

grep -q xfs /proc/filesystems || sudo modprobe xfs

To create an XFS file system, xfsprogs needs to be installed:

sudo yum install -y xfsprogs
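One step the listing doesn’t show is actually creating the file system on the fresh volume. Presumably something like the following is needed before the mount (device name as in the attach command above; note this wipes whatever is on the volume):

```shell
# Make an XFS file system on the newly attached EBS volume.
# Destructive: only run against a fresh, empty volume.
sudo mkfs.xfs /dev/sdh
```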

Great. Now I needed to mount the volume in the instance:

echo "/dev/sdh /vol xfs noatime 0 0" | sudo tee -a /etc/fstab

sudo mkdir -m 000 /vol

sudo mount /vol

and I’m good to go. I made a new directory for my data under /vol, and made it accessible to the ec2-user account:

cd /vol

sudo mkdir mydata
sudo chown ec2-user mydata/
sudo chgrp ec2-user mydata/
cd mydata/

Now I could copy the data over from S3. First I needed to configure the S3 tools:

s3cmd --configure

and added my access and secret keys. Then, finally, I could copy the data over so it’s accessible in EC2:
s3cmd get s3://mybucket/data.tar.gz .
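The archive then presumably needs unpacking before the import step (filename as in the command above; the guard just makes the line safe to re-run if the file isn’t there):

```shell
# Unpack the gzipped tar archive downloaded from S3, if present
if [ -f data.tar.gz ]; then tar xzf data.tar.gz; fi
```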

My experience of AWS is that it has not been possible to do all of this in the “free tier”, so it is costing me a bit of money, though not too much at present (around $10).

That’s quite a lot of information so I’ll stop here. I hope that someone finds it useful! If you do, please let me know.

In the next post I’ll describe how to get the data into HBase.

Posted in AWS | 1 Comment

New beginnings

After 10 years of sea level science it’s time to move on. I’ve been thinking of doing something new for a long time now and having the option to take voluntary redundancy from NOC is just what I needed. Time to explore and money to do it!

So what next? Over the past couple of years I’ve mainly looked for other sea level related posts. These days, things are different. For some reason it never occurred to me that I could really go and work for myself. I love programming, particularly in Lisp/Clojure and AI/Machine Learning. I’m used to statistics and use R almost every day. I have experience running small businesses (even if they are charities). So why not do something useful with those skills and make some money while I’m at it?

Time will tell how successful I am. I’m trying to do this in the most scientific way I can and follow Lean Startup methods. Hypothesis testing suits me very well and being prepared to fail in order to learn is OK with me too. From now on this blog will follow my adventures in trying to work out what to do next.

Posted in Uncategorized | 1 Comment

Sea level research and “open source” science

I’ve been inspired to start this blog after reading a Nature article by Timothy Gowers on the subject of “Massively Collaborative Mathematics”. Now, I’m no Fields Medallist like Timothy Gowers, but the possibility of conducting research out in the open, in such a way that anyone can contribute, is very interesting to me.

The aim of this blog is less a discussion of climate science in general than an attempt to engage with whoever is interested in the details of sea level research.

Sea level research is far wider than simply “climate science”. It has applications in many areas of the Earth sciences. I find it interesting to be reminded that the Permanent Service for Mean Sea Level (PSMSL), where I work, was originally founded (in 1933) with the objective of providing a fixed datum against which to measure land movements. It was only after collecting data for some years that people began to realise that sea level is far from static. Both land and sea levels change, which is one of the aspects of this research that make it so challenging.

The longer term aim of this blog then, is to see if collaborative sea level research is possible, in a similar manner to that described by Timothy Gowers in the post which started the PolyMath project.

Sea level research is very different from the mathematics that Timothy Gowers describes. To start with, it is very much an applied science dealing with observational datasets, and all the complexities that come with “real” data. However, as a minimum I think it is worth carrying out what is sometimes described as Open Notebook Science, which at least allows anyone who is interested to follow the development of ideas and, perhaps, to make useful comments on weaknesses in the approach or alternative ways to tackle the problem.

I hope that this blog will be far more than a diary of the work that I’m doing though. I’d like other people to contribute ideas and to work on the problems collaboratively. In my next post I’ll talk in a bit more detail about the kind of approach that I envisage and try to develop the ideas a bit further.

Posted in Uncategorized | 4 Comments