Tips for using CDH's Hadoop Distribution with Amazon's S3

  1. Hive doesn’t play well with extension-less files in S3. Make sure each file has an extension (e.g. by default it can’t read file-00000, but it can read file-00000.tar.gz).
  2. If you’re using Oozie, no file/folder actions performed in the <prepare> block will work for S3 files. Oozie helpers are hard-coded to use HDFS paths only.
  3. Also with Oozie, workflow.xml files cannot be stored in S3 (only HDFS) for the same reason: the HDFS dependency is hard-coded.
  4. Remember that S3 credentials added to core-site.xml are available to ANY job, no matter who runs it, so make sure those credentials have permissions strict enough to stop users deleting production data.

UPDATE

There’s actually a fix for (1). Run this in your Hive session before querying:

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

Now Hive can read the files whether or not they have an extension!
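To show where the setting fits, here is a minimal sketch of a Hive session that applies it and then queries extension-less files in S3. The bucket, path, table, and column names are made up for illustration, and the s3n:// scheme is an assumption; use whichever scheme your cluster is configured for.

```sql
-- Use the plain input format for this session only.
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Hypothetical external table over a folder of extension-less files
-- (e.g. file-00000). Bucket and path are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (line STRING)
LOCATION 's3n://example-bucket/events/';

-- Quick sanity check that Hive can now see the rows.
SELECT COUNT(*) FROM raw_events;
```

Because SET only affects the current session, it needs to go at the top of every script that reads these files, unless you change the default in your Hive configuration.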

Matthew Rathbone

CEO of Beekeeper Data. British. Data Nerd. Lucky husband and father.