Importing data to HBase

I’ve begun to experiment with Hadoop (with the aim of eventually running jobs on EC2) for a project with Likely.co.

Henry Garner (Likely.co’s CTO) provided exported HBase tables containing tweets as well as content and urls extracted from the tweets. So the first job was to import that data back into HBase.

I’m working on Ubuntu and initially installed the Cloudera Hadoop distribution. However, I found that the way the configuration files and jars were distributed made it harder for me to understand what was going on. Coupled with the fact that I’m running Natty Narwhal while the distribution is based on Lucid Lynx, I decided to uninstall it and use a fresh (and up-to-date) version (1.0.3) from the Hadoop website.

The instructions from Michael Noll on running a single-node Hadoop cluster on Ubuntu were clear and easy to follow. Hadoop was therefore installed in /usr/local/hadoop and I run jobs as hduser. I also installed the latest HBase (0.92.1 at the time of writing) in /usr/local/hbase.

After exporting the classpath and related environment variables:

export HBASE_HOME=/usr/local/hbase/
export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.92.1.jar:$HBASE_HOME:\
$HBASE_HOME/lib/zookeeper-3.4.3.jar:$HBASE_HOME/conf:\
$HBASE_HOME/lib/guava-r09.jar
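
As a quick sanity check that the HBase jars are actually visible to Hadoop, the final classpath can be inspected (assuming the hadoop classpath subcommand is available in this Hadoop version):

hduser:~$ hadoop classpath | tr ':' '\n' | grep hbase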

Hadoop and HBase are then started (as hduser):

hduser:~$ /usr/local/hadoop/bin/start-all.sh
hduser:~$ /usr/local/hbase/bin/start-hbase.sh
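
Running jps (from the JDK) is a quick way to confirm everything came up: it should list the Hadoop daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker) plus HMaster, with the exact HBase processes depending on whether HBase is running in standalone or pseudo-distributed mode:

hduser:~$ jps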

I then created the schemas for the tables in the HBase shell (the import throws an exception if the tables don’t already exist):

hduser:~$ hbase shell
hbase(main):001:0> create 'twitter_accounts', 'raw', 'base', 'extra'
hbase(main):002:0> create 'content', 'raw', 'base', 'extra'
hbase(main):003:0> create 'tweets', 'raw', 'base', 'extra'
hbase(main):004:0> create 'short_urls', 'rel'

Here the table name is followed by the names of the column families to be created. For example, the ‘tweets’ table is created with three column families: ‘raw’, ‘base’ and ‘extra’, which matches the schema of Likely.co’s data.
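
The resulting schema can be double-checked with describe, which lists each column family along with its settings (versions, compression and so on):

hbase(main):005:0> describe 'tweets'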

The data were provided as gzipped tar files so they were uncompressed into a local directory (hbase-likely). To import the files into HBase, the next step is to copy them from the local file system into the Hadoop Distributed File System (HDFS):

hduser:~$ hadoop fs -mkdir localtable
hduser:~$ hadoop fs -copyFromLocal hbase-likely/short_urls\
 localtable/short_urls
hduser:~$ hadoop fs -copyFromLocal hbase-likely/content\
 localtable/content
hduser:~$ hadoop fs -copyFromLocal hbase-likely/tweets\
 localtable/tweets
hduser:~$ hadoop fs -copyFromLocal hbase-likely/twitter_accounts\
 localtable/twitter_accounts
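
A quick listing confirms that everything landed in HDFS:

hduser:~$ hadoop fs -ls localtable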

Now I could finally import the data (which takes quite a while for big tables):

hduser:~$ hadoop jar $HBASE_HOME/hbase-0.92.1.jar import twitter_accounts\
 localtable/twitter_accounts
hduser:~$ hadoop jar $HBASE_HOME/hbase-0.92.1.jar import tweets\
 localtable/tweets
hduser:~$ hadoop jar $HBASE_HOME/hbase-0.92.1.jar import content\
 localtable/content
hduser:~$ hadoop jar $HBASE_HOME/hbase-0.92.1.jar import short_urls\
 localtable/short_urls
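
Each import runs as a MapReduce job, so progress can be followed from another terminal (or via the JobTracker web UI, which is on port 50030 by default in a single-node setup):

hduser:~$ hadoop job -list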

Ta da! Now we can look at the data. Back in the HBase shell:

scan 'tweets', {COLUMNS => 'base', LIMIT=>1}

returns just the first row from the ‘base’ column family of the ‘tweets’ table. From this we can see what the data looks like and, importantly, what qualifiers are applied to the column data.
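
Once the qualifiers are known, a scan can be restricted to a single one. For example, if the ‘base’ family had a qualifier called ‘text’ (an invented name here, just to show the syntax), that would be:

scan 'tweets', {COLUMNS => 'base:text', LIMIT => 1}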

We can count the number of rows (which may take a long time):

hbase(main):008:0> count 'short_urls'
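
For the larger tables, the rowcounter MapReduce job bundled in the same HBase jar used for the import above should be quicker than the shell’s count:

hduser:~$ hadoop jar $HBASE_HOME/hbase-0.92.1.jar rowcounter tweets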

In another post I’ll look at how to do more complex queries using Clojure and Cascalog.

About simonholgate

I'm CEO of Sea Level Research Ltd (www.sealevelresearch.com) - a Liverpool, UK based startup that uses machine learning to predict sea level surges and optimise shipping movements into and out of port. I'm an oceanographer and I'm also a Clojure developer who is interested in democracy and Big Data.

6 Responses to Importing data to HBase

  1. For tweets, for example, you can use compression – this gives pretty good space saving…
    I’m also using HBase from Clojure – you can look to my fork of clojure-hbase-schemas: https://github.com/alexott/clojure-hbase-schemas

    • simonholgate says:

      Thanks for the comments and compression suggestion. It wasn’t a feature of HBase I was aware of but it makes sense. I’ll take a closer look.
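
      For anyone else wanting to try it, something like this in the HBase shell should switch a column family to gzip compression (untested on my side; with 0.92 the table needs to be disabled first, and existing data only gets recompressed as regions are compacted):

      hbase(main):001:0> disable 'tweets'
      hbase(main):002:0> alter 'tweets', {NAME => 'raw', COMPRESSION => 'GZ'}
      hbase(main):003:0> enable 'tweets'
      hbase(main):004:0> major_compact 'tweets'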

      Clojure-hbase-schemas would have been very useful! I’ll try to make use of it as it looks better than using the HBase shell for basic querying.

  2. Pranav says:

    Hi Simon, I was looking at your code in hbase-likely. How did you use cascading/maple to query row-id specific records from hbase?

    my code – http://pastebin.com/FkT3RQKg

    This query returns all rows which have columns ‘cf:a’. I want ‘cf:a’ only from row-id ‘row1’. How would you accomplish that? Thanks!

    • simonholgate says:

      Hi Pranav,

      if you know the row id (‘row1’ as you say) then you can do something like this:

      ;; assumes these imports in the ns declaration:
      ;; (:import [org.apache.hadoop.hbase HBaseConfiguration]
      ;;          [org.apache.hadoop.hbase.client HTable Get]
      ;;          [org.apache.hadoop.hbase.util Bytes])
      (defn get-key-from-table
        "Gets the designated key from the row in the table"
        [table-name row-id column-family key]
        (let [table (hbase-table table-name)
              g  (Get. (Bytes/toBytes row-id))
              r  (.get table g)
              nm (.getFamilyMap r (Bytes/toBytes column-family))]
          (.get nm (Bytes/toBytes key))))

      where hbase-table is:
      (defn hbase-table [table-name]
        ;; Note that (HBaseConfiguration.) is deprecated in HBase 0.95-SNAPSHOT
        ;; and will be replaced by (Configuration/create)
        (HTable. (HBaseConfiguration.) table-name))

      So you would use this as:
      (get-key-from-table "pranavs-table" "row1" "cf" "a")
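
      The value comes back as a byte array, so for string data you’d typically wrap the call, e.g.:

      (Bytes/toString (get-key-from-table "pranavs-table" "row1" "cf" "a"))

      assuming what’s stored is a UTF-8 string.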

      There may be a better way of course!

  3. Pranav says:

    Thanks for replying Simon. Actually, I wanted to know if there was a way cascading/maple can handle this. Alternatively, a way to get all columns for a column family using maple would be great too.

    I don’t see org.apache.hadoop.hbase.client.Get being used in the tap source, so I’m guessing it’s not supported, but I’m hoping I’m wrong.

  4. Pingback: Computer art « My News
