If you use HDFS, at some point you will need to interact with it programmatically. Files can be in several formats, and stored using a range of compression codecs.
Thankfully the HDFS APIs allow you to interact with files fairly easily, with only a little boilerplate.
In bash you can read any text-format file in HDFS, compressed or not, with hadoop fs -text followed by the file's path.
In Java or Scala you can read a file, or a directory of files, taking compression into account, using the function below.
In fact, this code is taken from my Hadoop test helper library, so I know it works. :-)
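The helper itself is not shown here. As a stand-in, here is a minimal sketch of the same idea in Java. It is not the library's code: the class name HdfsTextReader and the method readLines are made up for illustration. It assumes a Hadoop 2 or later client on the classpath, UTF-8 text, and that files whose names start with "_" or "." (such as _SUCCESS) should be skipped. The codec is chosen from each file's extension by CompressionCodecFactory, so a file without a recognised extension is read as plain text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Hypothetical helper, not the author's library code.
public class HdfsTextReader {

    /** Reads all lines of a file, or of every visible file directly inside a directory. */
    public static List<String> readLines(Path location, Configuration conf) throws IOException {
        FileSystem fs = location.getFileSystem(conf);
        CompressionCodecFactory codecs = new CompressionCodecFactory(conf);
        List<String> lines = new ArrayList<>();

        // listStatus returns the file itself for a file, or the children for a directory.
        for (FileStatus status : fs.listStatus(location)) {
            Path path = status.getPath();
            String name = path.getName();
            // Assumption: skip subdirectories and bookkeeping files such as _SUCCESS.
            if (status.isDirectory() || name.startsWith("_") || name.startsWith(".")) {
                continue;
            }
            // Codec is looked up by file extension; null means "not compressed".
            CompressionCodec codec = codecs.getCodec(path);
            try (InputStream raw = fs.open(path);
                 InputStream in = (codec == null) ? raw : codec.createInputStream(raw);
                 BufferedReader reader =
                         new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }
}
```

Calling HdfsTextReader.readLines(new Path("/some/output/dir"), new Configuration()) would collect every line into memory, which is fine for test output but probably not what you want for large data sets.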
Sequence files are harder to read as you have to read in key/value pairs. Here is a simple function, again taken straight from my Hadoop test helper library.
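That function is not shown here either. As a stand-in, here is a minimal sketch in Java, again not the library's code: the class name SequenceFileDump and the method readPairs are made up for illustration. It assumes a Hadoop 2 or later client on the classpath and that the key and value classes are Writable types whose toString output is good enough for inspection. The key and value classes are read from the file header, so the caller does not need to know them in advance.

```java
import java.io.IOException;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical helper, not the author's library code.
public class SequenceFileDump {

    /** Reads every key/value pair of one sequence file as strings. */
    public static List<Map.Entry<String, String>> readPairs(Path file, Configuration conf)
            throws IOException {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        try (SequenceFile.Reader reader =
                     new SequenceFile.Reader(conf, SequenceFile.Reader.file(file))) {
            // The header records the key and value classes; instantiate one of each.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            // next() refills the same two objects on every call, so copy the
            // contents out (here via toString) before reading the next pair.
            while (reader.next(key, value)) {
                pairs.add(new SimpleEntry<>(key.toString(), value.toString()));
            }
        }
        return pairs;
    }
}
```

Converting to strings keeps the sketch independent of the concrete key and value types. If you need the typed objects instead, you would have to create a fresh key and value instance for each record, or clone them, rather than reuse one pair.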