Hadoop的Partitioner - Hadoop

TOP

Hadoop的Partitioner

2018-12-22 12:41:17 【大中小】浏览:38次

Partitioner

HashPartitioner、TotalOrderPartitioner、KeyFieldBasedPartitioner、BinaryPartitioner

public abstract class Partitioner<KEY, VALUE> {
	public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}

１)、HashPartitioner是mapreduce的默认partitioner。计算方法是
which reducer=(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks，得到当前的目的reducer。

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

2）、BinaryPartitioner继承于Partitioner<BinaryComparable, V>，是Partitioner的偏特化子类。该类提供leftOffset和rightOffset，在计算which reducer时仅对键值K的[rightOffset，leftOffset]这个区间取hash。
Which reducer=(hash & Integer.MAX_VALUE) % numReduceTasks

public int getPartition(BinaryComparable key, V value, int numPartitions) {
    int length = key.getLength();
    int leftIndex = (leftOffset + length) % length;
    int rightIndex = (rightOffset + length) % length;
    int hash = WritableComparator.hashBytes(key.getBytes(), 
      leftIndex, rightIndex - leftIndex + 1);
    return (hash & Integer.MAX_VALUE) % numPartitions;
}

3）、KeyFieldBasedPartitioner也是基于hash的个partitioner。和BinaryPatitioner不同，它提供了多个区间用于计算hash。当区间数为0时KeyFieldBasedPartitioner退化成HashPartitioner。

 public int getPartition(K2 key, V2 value, int numReduceTasks) {
    byte[] keyBytes;
    List <KeyDescription> allKeySpecs = keyFieldHelper.keySpecs();
    if (allKeySpecs.size() == 0) {
      return getPartition(key.toString().hashCode(), numReduceTasks);
    }
    try {
      keyBytes = key.toString().getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new RuntimeException("The current system does not " +
          "support UTF-8 encoding!", e);
    }
    // return 0 if the key is empty
    if (keyBytes.length == 0) {
      return 0;
    }
  
    int []lengthIndicesFirst = keyFieldHelper.getWordLengths(keyBytes, 0, 
        keyBytes.length);
    int currentHash = 0;
    for (KeyDescription keySpec : allKeySpecs) {
      int startChar = keyFieldHelper.getStartOffset(keyBytes, 0, 
        keyBytes.length, lengthIndicesFirst, keySpec);
       // no key found! continue
      if (startChar < 0) {
        continue;
      }
      int endChar = keyFieldHelper.getEndOffset(keyBytes, 0, keyBytes.length, 
          lengthIndicesFirst, keySpec);
      currentHash = hashCode(keyBytes, startChar, endChar, 
          currentHash);
    }
    return getPartition(currentHash, numReduceTasks);
  }
  protected int hashCode(byte[] b, int start, int end, int currentHash) {
    for (int i = start; i <= end; i++) {
      currentHash = 31*currentHash + b[i];
    }
    return currentHash;
  }
  protected int getPartition(int hash, int numReduceTasks) {
    return (hash & Integer.MAX_VALUE) % numReduceTasks;
}

4）、TotalOrderPartitioner这个类可以实现输出的全排序。不同于以上3个partitioner，这个类并不是基于hash的。

public class TotalOrderPartitioner<K extends WritableComparable<>,V>
    extends Partitioner<K,V> implements Configurable {
	private K[] readPartitions(FileSystem fs, Path p, Class<K> keyClass,
      Configuration conf) throws IOException {
		...
	  }
}

readPartitions()方法从给定的IFile读取切割点。

1、如果所有要排序的数据都用一个reduce来进行排序，数据量少的时候可以，如果数据量大的时候就不可行了。
2、可以创建一些列排好序的文件，然后串联起来，part-r-00000的最大值比part-r-00001的最小值要小，依次下去。。
主要思路是使用一个partitioner来描述全局排序的输出。
由此我们可以归纳出这样一个用hadoop对大量数据排序的步骤：
1）对待排序数据进行抽样；
2）对抽样数据进行排序，产生标尺；
3） Map对输入的每条数据计算其处于哪两个标尺之间；将数据发给对应区间ID的reduce
4） Reduce将获得数据直接输出。

InputSampler：

public class InputSampler<K,V> extends Configured implements Tool  {
	/**
	* Samples the first n records from s splits.
	* Inexpensive way to sample random data.
	*/
	public static class SplitSampler<K,V> implements Sampler<K,V> {
		...
	}
	/**
	* Sample from random points in the input.
	* General-purpose sampler. Takes numSamples / maxSplitsSampled inputs from
	* each split.
	*/
	public static class RandomSampler<K,V> implements Sampler<K,V> {
		...
	}
	/**
	* Sample from s splits at regular intervals.
	* Useful for sorted data.
	*/
	public static class IntervalSampler<K,V> implements Sampler<K,V> {
		...
	}
	/**
	* 编写一个分区文件给给定的Job,使用提供的取样器
	* Queries the sampler for a sample keyset, sorts by the output key
	* comparator, selects the keys for each rank, and writes to the destination
	* returned from {@link TotalOrderPartitioner#getPartitionFile}.
	*/
	public static <K,V> void writePartitionFile(Job job, Sampler<K,V> sampler) 
      throws IOException, ClassNotFoundException, InterruptedException {
		...
	}
}

3）、IntervalSampler<K,V> | 固定间隔采样 |　采样频率，划分数　｜　中　｜　

IntervalSampler<K,V>对有序的数据十分适用

InputSampler.Sampler<IntWritable, Text> sampler = new InputSampler.RandomSampler<IntWritable, Text>( 0.1, 10000, 10);

RandomSampler的三个参数分别是采样率、最大样本数、最大分区。

/**
     * Create a new RandomSampler.
     * @param freq Probability with which a key will be chosen.
     * @param numSamples Total number of samples to obtain from all selected
     *                   splits.
     * @param maxSplitsSampled The maximum number of splits to examine.
     */
    public RandomSampler(double freq, int numSamples, int maxSplitsSampled) {
      this.freq = freq;
      this.numSamples = numSamples;
      this.maxSplitsSampled = maxSplitsSampled;
    }


【大中小】【打印】【繁体】【投稿】【收藏】【推荐】【举报】【评论】【关闭】【返回顶部】

上一篇：Windows10 Pro 下 Hadoop2.7.3 编..	下一篇：Hadoop单节点配置