Reading data from HDFS programmatically using Java (and Scala)
If you use HDFS, at some point you will need to interact with it programmatically. Files can be in several formats, and stored using a range of compression codecs.
Thankfully the HDFS APIs let you work with these files fairly easily, with only a little boilerplate.
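In bash you can read any text-format file in HDFS (compressed or not) using the hadoop fs -text command, which decodes the file and prints it to standard output. For example, with a placeholder path: hadoop fs -text /path/to/file.gz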
In Java or Scala you can read a file, or a directory of files (taking compression into account), using the function below.
In fact, this code is taken from my Hadoop test helper library, so I know it works. :-)
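The helper library's own listing is not reproduced here, so what follows is only a minimal sketch of the approach, assuming Hadoop 2 or later on the classpath. The class name HdfsTextReader and the method readLines are hypothetical names chosen for illustration. It picks a codec from each file's extension using CompressionCodecFactory and falls back to reading the raw stream when no codec matches.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Hypothetical helper, not the author's library code.
public class HdfsTextReader {

    /** Reads every line of a file, or of each file directly inside a directory. */
    public static List<String> readLines(Path location, Configuration conf) throws IOException {
        FileSystem fs = location.getFileSystem(conf);
        CompressionCodecFactory codecs = new CompressionCodecFactory(conf);
        List<String> lines = new ArrayList<>();

        // listStatus returns the file itself for a file path, or the children for a directory.
        for (FileStatus status : fs.listStatus(location)) {
            Path file = status.getPath();
            String name = file.getName();
            // Assumption: skip subdirectories and bookkeeping files such as _SUCCESS.
            if (status.isDirectory() || name.startsWith("_") || name.startsWith(".")) {
                continue;
            }
            // The codec is inferred from the file extension; null means "treat as uncompressed".
            CompressionCodec codec = codecs.getCodec(file);
            try (InputStream raw = fs.open(file);
                 InputStream in = (codec == null) ? raw : codec.createInputStream(raw);
                 BufferedReader reader =
                         new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }
}
```

Because it is plain Java, the same method can be called from Scala unchanged. Note that it loads everything into memory and assumes UTF-8 text, which suits test helpers better than large production datasets.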
Sequence files are harder to read because you have to read them in as key/value pairs. Here is a simple function, again taken straight from my Hadoop test helper library.
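The library function itself is not shown here, so the following is a hedged sketch of how such a reader could look rather than the author's code. It assumes Hadoop 2 or later and that the file's keys and values are Writable types. The class name SequenceFileTextReader and the method readPairs are hypothetical. The key and value classes are taken from the file header, one instance of each is created and reused, and each pair is captured through toString().

```java
import java.io.IOException;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical helper, not the author's library code.
public class SequenceFileTextReader {

    /** Reads all key/value pairs of a sequence file as strings. */
    public static List<Map.Entry<String, String>> readPairs(Path file, Configuration conf)
            throws IOException {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        try (SequenceFile.Reader reader =
                     new SequenceFile.Reader(conf, SequenceFile.Reader.file(file))) {
            // The header records which classes were used to write the file.
            // Assumption: both are Writable implementations.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

            // next() refills the same two objects on every call, so copy
            // the contents out (here via toString) before moving on.
            while (reader.next(key, value)) {
                pairs.add(new SimpleEntry<>(key.toString(), value.toString()));
            }
        }
        return pairs;
    }
}
```

Converting to strings keeps the sketch generic across key and value types. If you need the typed values instead, copy each Writable before the next call to next(), since the reader overwrites the instances in place.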