Tips for using CDH's Hadoop Distribution with Amazon's S3


  1. Hive doesn’t play well with extension-less files in S3. Make sure each file has an extension (e.g. by default it can’t read file-00000, but it can read file-00000.tar.gz). See the update below for a fix.
  2. If you’re using Oozie, file/folder actions performed in the <prepare> block will not work on S3 paths. Oozie’s prepare helpers are hard-coded to use HDFS paths only, so keep <prepare> targets on HDFS (see the sketch after this list).
  3. Also with Oozie, workflow.xml files cannot be stored in S3 (only HDFS), for the same reason: hard-coded HDFS dependencies.
  4. Remember that S3 credentials added to core-site.xml are available to ANY job, no matter who runs it, so make sure those credentials have strict enough permissions to stop users from deleting production data (sample configuration below).
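
For tips 2 and 3 the practical workaround is to keep everything Oozie itself touches on HDFS. Here’s a minimal workflow sketch (the action name, script, and paths are all hypothetical) where the <prepare> delete targets HDFS while the job’s output still goes to S3:

<workflow-app xmlns="uri:oozie:workflow:0.4" name="s3-output-wf">
  <start to="pig-node"/>
  <action name="pig-node">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- <prepare> only works against HDFS; an s3n:// path here will fail -->
      <prepare>
        <delete path="${nameNode}/user/hadoop/staging/output"/>
      </prepare>
      <script>transform.pig</script>
      <!-- the job itself can still read and write S3 directly -->
      <param>OUTPUT=s3n://my-bucket/output/</param>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>

And per tip 3, remember that this workflow.xml itself has to be deployed to HDFS, not S3.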
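For tip 4, these are the core-site.xml entries in question (the values are placeholders). Any user who can submit a job to the cluster can read and use them, so back them with an IAM user that cannot delete production data:

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>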

UPDATE

There’s actually a fix for (1). Run this in Hive:

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

With that set, Hive can read extension-less files too.
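
For example (the bucket, table, and file names here are hypothetical), you can point an external table at a directory of extension-less files and query it after flipping the input format:

-- external table over an S3 directory containing files like file-00000
CREATE EXTERNAL TABLE logs (line STRING)
LOCATION 's3n://my-bucket/logs/';

-- without this, Hive can't read the extension-less files
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

SELECT COUNT(*) FROM logs;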
