Hadoop: RecordReader and FileInputFormat


Today’s new challenge…
I want to create a custom MapReduce job that can handle more than a single line at a time. It actually took me some time to understand the implementation of the default LineRecordReader class, not because the implementation exceeded my Java skill set, but rather because I was not familiar with its concept. In this article I describe my understanding of this implementation.

Since an InputSplit is nothing more than a chunk of one or more blocks, it should be pretty rare for a block boundary to land exactly on an end of line (EOL). Some of my records located around block boundaries will therefore be split across two different blocks. This raises the following questions:

  1. How can Hadoop guarantee that the lines read are 100% complete?
  2. How can Hadoop consolidate a line that starts in block B and ends up on…
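To make the boundary problem concrete, here is a minimal, self-contained sketch (plain Java, not Hadoop's actual API; the class and method names are my own) of the rule LineRecordReader applies: a reader whose split does not start at byte 0 skips everything up to the first newline, because that partial line belongs to the previous split, and every reader keeps reading past its split end until it finishes the line it started.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitLineReader {

    // Return the complete lines "owned" by the split [start, start + length).
    static List<String> readSplit(byte[] file, long start, long length) {
        List<String> lines = new ArrayList<>();
        int pos = (int) start;
        long end = start + length;

        // Rule 1: unless this is the first split, discard the partial
        // first line -- the previous split's reader already consumed it.
        if (start != 0) {
            while (pos < file.length && file[pos] != '\n') pos++;
            pos++; // step over the newline itself
        }

        // Rule 2: keep emitting lines as long as the line STARTS inside
        // the split; the last line may run past 'end' into the next block.
        while (pos < file.length && pos < end) {
            int lineStart = pos;
            while (pos < file.length && file[pos] != '\n') pos++;
            lines.add(new String(file, lineStart, pos - lineStart));
            pos++; // move past '\n'
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes();
        // Cut the 20-byte file in the middle of "bravo" (at byte 8):
        System.out.println(readSplit(data, 0, 8));  // [alpha, bravo]
        System.out.println(readSplit(data, 8, 12)); // [charlie]
    }
}
```

Note how the line "bravo", which straddles the split boundary at byte 8, is read exactly once: the first reader finishes it by reading past its own split end, and the second reader skips it.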

View the original article (1,127 more words)