Importing data to HBase

I’ve begun to experiment with Hadoop (with the aim of eventually running jobs on EC2) for a project with

Henry Garner (’s CTO) provided exported HBase tables containing tweets, as well as content and URLs extracted from those tweets. So the first job was to import that data back into HBase.

I’m working on Ubuntu and initially installed the Cloudera Hadoop distribution. However, I found that the way its configuration files and jars were laid out made it harder for me to understand what was going on. Since I’m also running Natty Narwhal and the distribution is based on Lucid Lynx, I decided to uninstall it and use a fresh, up-to-date version (1.0.3) from the Hadoop website.

The instructions from Michael Noll on running a single-node Hadoop cluster on Ubuntu were clear and easy to follow. Following them, Hadoop was installed in /usr/local/hadoop and I run jobs as hduser. I also installed the latest HBase (0.92.1 at the time of writing) in /usr/local/hbase.

After exporting the class paths and related variables:

export HBASE_HOME=/usr/local/hbase/
export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.92.1.jar:$HBASE_HOME:\

I then started Hadoop and HBase (as hduser):

hduser:~$ /usr/local/hadoop/bin/
hduser:~$ /usr/local/hbase/bin/

I then created the schemas for the tables in the HBase shell (as not doing so leads to an exception):

hduser:~$ hbase shell
hbase(main):001:0> create 'twitter_accounts', 'raw', 'base', 'extra'
hbase(main):002:0> create 'content', 'raw', 'base', 'extra'
hbase(main):003:0> create 'tweets', 'raw', 'base', 'extra'
hbase(main):004:0> create 'short_urls', 'rel'

Here the table name is followed by the names of the column families to be created. For example, the ‘tweets’ table is created with three column families: ‘raw’, ‘base’ and ‘extra’, which matches the schema of the data I was given.
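
The pattern of a table name followed by its column family names can be sketched as a small shell helper that builds the create statements. Note that make_create is a hypothetical helper of my own, not part of HBase; it only prints text and never talks to a cluster, so the output would still need to be pasted into the HBase shell.

```shell
#!/usr/bin/env bash
# make_create is a hypothetical helper (not an HBase command).
# It prints an HBase shell `create` statement: the table name first,
# then each column family name, all single-quoted and comma-separated.
make_create() {
  local table="$1"
  shift
  local stmt="create '${table}'"
  local family
  for family in "$@"; do
    stmt="${stmt}, '${family}'"
  done
  printf '%s\n' "$stmt"
}

# The four tables and column families used in this post.
make_create twitter_accounts raw base extra
make_create content raw base extra
make_create tweets raw base extra
make_create short_urls rel
```

Keeping the table definitions in one place like this makes it easier to recreate the schemas after wiping a test installation.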

The data were provided as gzipped tar files, so I unpacked them into a local directory (hbase-likely). To import the files into HBase, the next step is to copy them from the local file system into the Hadoop Distributed File System (HDFS).

hduser:~$ mkdir localtable
hduser:~$ hadoop fs -copyFromLocal hbase-likely/short_urls\
hduser:~$ hadoop fs -copyFromLocal hbase-likely/content\
hduser:~$ hadoop fs -copyFromLocal hbase-likely/tweets\
hduser:~$ hadoop fs -copyFromLocal hbase-likely/twitter_accounts\

Now I could finally import the data (which takes quite a while for big tables):

hduser:~$ hadoop jar $HBASE_HOME/hbase-0.92.1.jar import twitter_accounts\
hduser:~$ hadoop jar $HBASE_HOME/hbase-0.92.1.jar import tweets\
hduser:~$ hadoop jar $HBASE_HOME/hbase-0.92.1.jar import content\
hduser:~$ hadoop jar $HBASE_HOME/hbase-0.92.1.jar import short_urls\
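
Since the copy and import steps are the same for every table, they could be driven by a single loop. The following is only a dry-run sketch: it prints the commands rather than running them. The HDFS directory /user/hduser/import is an assumption of mine for illustration (it is not the path used above), and plan_import is a hypothetical helper. The import tool takes the table name followed by the input directory.

```shell
#!/usr/bin/env bash
# Dry run: print the copy and import commands for each table.
HBASE_JAR="/usr/local/hbase/hbase-0.92.1.jar"  # HBase jar location from this post
LOCAL_DIR="hbase-likely"                       # local directory holding the unpacked exports
HDFS_DIR="/user/hduser/import"                 # ASSUMPTION: hypothetical HDFS target directory

# plan_import is a hypothetical helper: it prints the two commands
# needed for one table (copy to HDFS, then run the HBase import job).
plan_import() {
  local table="$1"
  printf 'hadoop fs -copyFromLocal %s/%s %s/%s\n' \
    "$LOCAL_DIR" "$table" "$HDFS_DIR" "$table"
  printf 'hadoop jar %s import %s %s/%s\n' \
    "$HBASE_JAR" "$table" "$HDFS_DIR" "$table"
}

for table in twitter_accounts tweets content short_urls; do
  plan_import "$table"
done
```

Printing the commands first gives a chance to check the paths before starting jobs that take a long time on big tables.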

Ta da! Now we can look at the data. Back in the HBase shell:

scan 'tweets', {COLUMNS => 'base', LIMIT=>1}

returns just the first row from the ‘base’ column family of the ‘tweets’ table. From this we can see what the data looks like and, importantly, which qualifiers are used within that column family.
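
To get a quick list of the qualifiers, one option is to save some scan output to a file and pull out the family:qualifier names. This is a rough sketch which assumes the usual scan output layout, where each cell is printed as column=family:qualifier, timestamp=..., value=... The sample rows and the qualifier names ‘text’ and ‘user’ are made up for illustration and are not taken from the real data; list_columns is a hypothetical helper.

```shell
#!/usr/bin/env bash
# list_columns is a hypothetical helper: it reads HBase shell scan output
# on stdin and prints each distinct family:qualifier once, sorted.
# It assumes qualifiers contain no commas.
list_columns() {
  sed -n 's/.*column=\([^,]*\),.*/\1/p' | sort -u
}

# Made-up sample in the shape of scan output (not real data).
sample=' row1  column=base:text, timestamp=1, value=hello
 row1  column=base:user, timestamp=1, value=someone
 row2  column=base:text, timestamp=2, value=world'

printf '%s\n' "$sample" | list_columns
```

Header and summary lines in the scan output contain no column= field, so the sed expression simply skips them.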

We can count the number of rows (which may take a long time):

hbase(main):008:0> count 'short_urls'

In another post I’ll look at how to do more complex queries using Clojure and Cascalog.

About simonholgate

I'm a Clojure developer working in Liverpool, UK. I'm a sometime oceanographer and interested in democracy and Big Data.
This entry was posted in Clojure.

6 Responses to Importing data to HBase

  1. For tweets, for example, you can use compression – this gives pretty good space saving…
    I’m also using HBase from Clojure – you can look to my fork of clojure-hbase-schemas:

    • simonholgate says:

      Thanks for the comments and compression suggestion. It wasn’t a feature of HBase I was aware of but it makes sense. I’ll take a closer look.

      Clojure-hbase-schemas would have been very useful! I’ll try to make use of it, as it looks better than using the HBase shell for basic querying.

  2. Pranav says:

    Hi Simon, I was looking at your code in hbase-likely. How did you use cascading/maple to query row-id specific records from hbase?

    my code –

    This query returns all rows which have columns ‘cf:a’. I want ‘cf:a’ only from row-id ‘row1’. How would you accomplish that? Thanks!

    • simonholgate says:

      Hi Pranav,

      if you know the row id (‘row1’ as you say) then you can do something like this:

      (import '[org.apache.hadoop.hbase HBaseConfiguration]
              '[org.apache.hadoop.hbase.client Get HTable]
              '[org.apache.hadoop.hbase.util Bytes])

      (defn get-key-from-table
        "Gets the designated key from the row in the table"
        [table-name row-id column-family key]
        (let [table (hbase-table table-name)
              g     (Get. (Bytes/toBytes row-id))
              r     (.get table g)
              nm    (.getFamilyMap r (Bytes/toBytes column-family))]
          (.get nm (Bytes/toBytes key))))

      where hbase-table is:
      (defn hbase-table [table-name]
        ;; Note that the (HBaseConfiguration.) constructor is deprecated in
        ;; HBase 0.95-SNAPSHOT; the replacement is (HBaseConfiguration/create)
        (HTable. (HBaseConfiguration.) table-name))

      So you would use this as:
      (get-key-from-table "pranavs-table" "row1" "cf" "a")

      There may be a better way of course!

  3. Pranav says:

    Thanks for replying, Simon. Actually, I wanted to know whether cascading/maple can handle this. Alternatively, a way to get all columns for a column family using maple would be great too.

    I don’t see org.apache.hadoop.hbase.client.Get being used in the tap source, so I’m guessing it’s not supported, but hoping I’m wrong.
