Tips for using CDH's Hadoop Distribution with Amazon's S3
- Hive doesn't play well with extension-less files in S3. Make sure each file has an extension (e.g. by default it can't read file-00000, but it can read file-00000.tar.gz).
- If you're using Oozie, file and folder actions in the <prepare> block won't work on S3 paths. The Oozie helpers are hard-coded to use HDFS paths only (see the workflow.xml sketch after this list).
- Also with Oozie, workflow.xml files themselves must be stored in HDFS, not S3, because of the same hard-coded HDFS dependencies.
- Remember that S3 credentials added to core-site.xml are available to ANY job, no matter who runs it, so make sure those credentials have permissions strict enough to prevent users from deleting production data (see the core-site.xml example below).
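
To make the <prepare> limitation concrete, here's a sketch of a workflow.xml fragment (the action name, paths, and bucket are all hypothetical). The <delete> and <mkdir> helpers resolve against HDFS, so the commented-out S3 variant fails even though the job itself can read and write S3 just fine:

```xml
<!-- Hypothetical workflow.xml fragment -->
<action name="process-logs">
  <map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <prepare>
      <!-- Works: an HDFS path -->
      <delete path="${nameNode}/user/me/output"/>
      <!-- Does NOT work: <delete path="s3n://my-bucket/output"/> -->
    </prepare>
    <!-- job configuration elided -->
  </map-reduce>
  <ok to="end"/>
  <error to="fail"/>
</action>
```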
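And to illustrate the last point, cluster-wide S3 credentials in core-site.xml look like the following (placeholder values shown). Because core-site.xml applies to the whole cluster, any job submitted by any user can use, and abuse, these keys:

```xml
<!-- core-site.xml: visible to every job on the cluster -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```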
There's actually a workaround for the first tip (Hive and extension-less files).
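A minimal sketch of one such workaround, assuming the simplest route of renaming the files in place so each one gets an extension (bucket and table paths are placeholders):

```bash
# Hypothetical sketch: give extension-less part files a .txt suffix so Hive
# will read them. Bucket and table paths are placeholders.
hadoop fs -ls 's3n://my-bucket/my-table/' | awk '{print $NF}' | grep '^s3n://' | \
while read -r f; do
  base="${f##*/}"                        # file name only
  case "$base" in
    *.*) ;;                              # already has an extension, skip
    *)   hadoop fs -mv "$f" "$f.txt" ;;  # rename in place
  esac
done
```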
Once every file has an extension, Hive can read them all!