Tips for using CDH's Hadoop Distribution with Amazon's S3
- Hive doesn't play well with extension-less files in S3. Make sure each file has an extension (e.g. by default it can't read file-00000, but it can read file-00000.tar.gz).
- If you're using Oozie, file and folder actions in the <prepare> block won't work on S3 paths. The Oozie helpers are hard-coded to use HDFS paths only (see the workflow.xml sketch after this list).
- Also with Oozie, workflow.xml files themselves must be stored in HDFS, not S3, because of the same hard-coded HDFS dependencies.
- Remember that S3 credentials added to core-site.xml are available to ANY job, no matter who runs it, so make sure those credentials have permissions strict enough to prevent users from deleting production data (see the core-site.xml example below).
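
To make the <prepare> limitation concrete, here's a sketch of a workflow.xml fragment (the action name, paths, and bucket are all hypothetical). The <delete> and <mkdir> helpers resolve against HDFS, so the commented-out S3 variant fails even though the job itself can read and write S3 just fine:

```xml
<!-- Hypothetical workflow.xml fragment -->
<action name="process-logs">
  <map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <prepare>
      <!-- Works: an HDFS path -->
      <delete path="${nameNode}/user/me/output"/>
      <!-- Does NOT work: <delete path="s3n://my-bucket/output"/> -->
    </prepare>
    <!-- job configuration elided -->
  </map-reduce>
  <ok to="end"/>
  <error to="fail"/>
</action>
```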
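And to illustrate the last point, cluster-wide S3 credentials in core-site.xml look like the following (placeholder values shown). Because core-site.xml applies to the whole cluster, any job submitted by any user can use, and abuse, these keys:

```xml
<!-- core-site.xml: visible to every job on the cluster -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```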
There's actually a workaround for the first tip (Hive and extension-less files).
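A minimal sketch of one such workaround, assuming the simplest route of renaming the files in place so each one gets an extension (bucket and table paths are placeholders):

```bash
# Hypothetical sketch: give extension-less part files a .txt suffix so Hive
# will read them. Bucket and table paths are placeholders.
hadoop fs -ls 's3n://my-bucket/my-table/' | awk '{print $NF}' | grep '^s3n://' | \
while read -r f; do
  base="${f##*/}"                        # file name only
  case "$base" in
    *.*) ;;                              # already has an extension, skip
    *)   hadoop fs -mv "$f" "$f.txt" ;;  # rename in place
  esac
done
```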
Once every file has an extension, Hive can read them all!