If you use HDFS, at some point you will need to interact with it programmatically. Files can be in several formats, and can be stored with a variety of compression codecs.
Thankfully, the HDFS APIs let you work with these files fairly easily, with only a little boilerplate.
Text Files
In bash you can read any text-format file in HDFS (compressed or not) using the hadoop fs -text command followed by the path to the file.
In Java or Scala you can read a file, or a directory of files (taking compression into account), using the function below.
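As a rough illustration of the idea, here is a minimal Java sketch that reads every line from a file or a flat directory of files, picking a decompression codec from each file's extension. It is not the helper library's own code: the class name HdfsTextReader and the method readLines are hypothetical, and the sketch assumes UTF-8 text, a Hadoop 2 or later client on the classpath, and files small enough to hold in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Hypothetical helper, written for illustration only.
public class HdfsTextReader {

    // Reads all lines from a single file, or from every file directly inside a directory.
    public static List<String> readLines(Path location, Configuration conf) throws IOException {
        FileSystem fs = location.getFileSystem(conf);
        CompressionCodecFactory codecs = new CompressionCodecFactory(conf);
        List<String> lines = new ArrayList<>();

        // listStatus returns the file itself for a file, or the children for a directory.
        for (FileStatus status : fs.listStatus(location)) {
            if (status.isDirectory()) {
                continue; // assumption: nested directories are not traversed
            }
            Path file = status.getPath();
            String name = file.getName();
            if (name.startsWith("_") || name.startsWith(".")) {
                continue; // skip markers such as _SUCCESS and hidden files
            }
            // getCodec matches on the file extension (.gz, .bz2, ...); null means uncompressed.
            CompressionCodec codec = codecs.getCodec(file);
            try (InputStream raw = fs.open(file);
                 InputStream in = (codec == null) ? raw : codec.createInputStream(raw);
                 BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }
}
```

A call such as HdfsTextReader.readLines(new Path("/some/hypothetical/dir"), new Configuration()) would then return the lines of every data file in that directory. Because the codec is chosen by file extension, a compressed file without a recognised extension would be read as raw bytes.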
Sequence Files
Sequence files take a little more work, because you have to read them as key/value pairs. Here is a simple function, again taken straight from my Hadoop test helper library.
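To make the key/value loop concrete, here is a minimal Java sketch of a sequence file reader. It is an independent approximation rather than the helper library's listing: the class name HdfsSequenceFileReader and the method readPairs are hypothetical, and the sketch assumes the keys and values are Writable types and that their toString() output is a useful representation.

```java
import java.io.IOException;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical helper, written for illustration only.
public class HdfsSequenceFileReader {

    // Reads every record of a sequence file and returns the pairs as strings.
    public static List<Map.Entry<String, String>> readPairs(Path file, Configuration conf)
            throws IOException {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(file))) {
            // The file header records the key and value classes, so instantiate them reflectively.
            // Assumption: both classes implement Writable.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

            // next() refills the same key/value objects on each call,
            // so copy their contents out before moving on.
            while (reader.next(key, value)) {
                pairs.add(new SimpleEntry<>(key.toString(), value.toString()));
            }
        }
        return pairs;
    }
}
```

Converting to strings inside the loop matters because the reader reuses the key and value instances; storing the Writable objects themselves would leave a list in which every entry points at the last record. Record or block compression inside the sequence file is handled by the reader itself, so no codec lookup is needed here.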