Tips for using CDH's Hadoop Distribution with Amazon's S3

  1. Hive doesn’t play well with extension-less files in S3. Make sure each file has an extension (e.g. by default it can’t read file-00000, but it can read file-00000.tar.gz).
  2. If you’re using Oozie, file/folder actions performed in the <prepare> block won’t work on S3 files. The Oozie helpers are hard-coded to use HDFS paths only.
  3. Also with Oozie, workflow.xml files cannot be stored in S3 (only in HDFS) for the same reason: hard-coded dependencies.
  4. Remember that S3 credentials added to core-site.xml are available to ANY job, no matter who runs it, so make sure those credentials have permissions strict enough to stop users from deleting production data.
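
For (1), it can help to find the offending files before Hive trips over them. Here's a rough Python sketch, assuming you already have a list of object keys from a bucket listing. The function name `keys_without_extension` and the sample keys are made up for illustration; this isn't part of Hive or CDH.

```python
import posixpath


def keys_without_extension(keys):
    """Return the S3 object keys whose file name has no extension.

    Hypothetical helper for illustration: `keys` is any iterable of
    object-key strings, e.g. collected from a bucket listing.
    """
    missing = []
    for key in keys:
        if key.endswith("/"):
            # Keys ending in "/" are folder placeholders, not data files.
            continue
        name = posixpath.basename(key)
        _, ext = posixpath.splitext(name)
        if not ext:
            missing.append(key)
    return missing


if __name__ == "__main__":
    # Made-up sample keys, not taken from a real bucket.
    sample = [
        "logs/2013/file-00000",
        "logs/2013/file-00000.tar.gz",
        "logs/2013/",
    ]
    for key in keys_without_extension(sample):
        print(key)
```

This only flags the keys. What extension to give them depends on what the files actually contain, so renaming is left to you.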


There’s actually a fix for (1) — do this:


Now it can read any files!

Matthew Rathbone

CEO of Beekeeper Data. British. Data Nerd. Lucky husband and father.
