When Spark reads from HDFS, you can subclass FileInputFormat<LongWritable, Text> to implement a custom TextInputFormat that filters the files from which input splits are computed, so that only the HDFS files you specify are actually read.
val value: RDD[(LongWritable, Text)] = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](url)
Overriding TextInputFormat:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class MyFileInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Honor a custom record delimiter if one is set in the job configuration;
        // LineRecordReader falls back to newline when the delimiter bytes are null.
        String delimiter = context.getConfiguration().get("textinputformat.record.delimiter");
        byte[] recordDelimiterBytes = null;
        if (null != delimiter) {
            recordDelimiterBytes = delimiter.getBytes(StandardCharsets.UTF_8);
        }
        return new LineRecordReader(recordDelimiterBytes);
    }
}
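As written, the class only customizes the record reader; the file filtering described at the top happens by overriding FileInputFormat's listStatus, which returns the list of files that splits are then computed from. Below is a minimal sketch of that override, assuming we want to keep only files whose names end in ".log"; the suffix and the predicate are illustrative assumptions, not part of the original code.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapreduce.JobContext;

// Add inside MyFileInputFormat: drop unwanted files before splits are computed.
@Override
protected List<FileStatus> listStatus(JobContext job) throws IOException {
    List<FileStatus> kept = new ArrayList<>();
    for (FileStatus status : super.listStatus(job)) {
        // Illustrative predicate (an assumption): keep only ".log" files.
        if (status.getPath().getName().endsWith(".log")) {
            kept.add(status);
        }
    }
    return kept;
}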
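With the custom format on the classpath, the driver call from the top changes only its InputFormat type parameter: sc.newAPIHadoopFile[LongWritable, Text, MyFileInputFormat](url). For reference, here is a sketch of an equivalent Java driver, which also shows where the textinputformat.record.delimiter key read by createRecordReader would be set; the path, app name, and delimiter value are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MyFileInputFormatDemo {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("MyFileInputFormatDemo"));
        Configuration conf = new Configuration();
        // Optional: the delimiter read by createRecordReader above; "\n" is the default.
        conf.set("textinputformat.record.delimiter", "\n");
        // Placeholder path; only files passing the listStatus filter are read.
        JavaPairRDD<LongWritable, Text> rdd = jsc.newAPIHadoopFile(
                "hdfs:///path/to/dir", MyFileInputFormat.class,
                LongWritable.class, Text.class, conf);
        System.out.println(rdd.count());
        jsc.stop();
    }
}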