class GZIPOutputStream extends DeflaterOutputStream {
    public GZIPOutputStream(OutputStream out, int size, boolean syncFlush) throws IOException {
        super(out, new Deflater(Deflater.DEFAULT_COMPRESSION, true), size, syncFlush);
        writeHeader();
    }

    public void finish() throws IOException {
        if (!def.finished()) {
            def.finish();
            while (!def.finished()) {
                int len = def.deflate(buf, 0, buf.length);
                if (def.finished() && len <= buf.length - TRAILER_SIZE) {
                    // last deflater buffer. Fit trailer at the end
                    writeTrailer(buf, len);
                    len = len + TRAILER_SIZE;
                    out.write(buf, 0, len);
                    return;
                }
                if (len > 0)
                    out.write(buf, 0, len);
            }
            // if we can't fit the trailer at the end of the last
            // deflater buffer, we write it separately
            byte[] trailer = new byte[TRAILER_SIZE];
            writeTrailer(trailer, 0);
            out.write(trailer);
        }
    }
}
From the source above we can see the following:
GZIPOutputStream extends DeflaterOutputStream
The constructor creates the DeflaterOutputStream (with a raw, nowrap Deflater) and writes the gzip header
The finish method writes the trailer once all data has been compressed
In short:
The relationship between Gzip and Deflate: Gzip is a file format, while Deflate (Java's Deflater) is a compression algorithm
The Gzip format: Gzip compresses the data with the Deflate algorithm and wraps it with a header and a trailer
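This layout is easy to verify directly. The sketch below (class name GzipAnatomy is made up for illustration) compresses the same bytes once with GZIPOutputStream and once with a raw Deflater configured the same way the constructor above does, then checks that the gzip output is exactly a 10-byte header (starting with the magic bytes 0x1f 0x8b), the identical deflate body, and an 8-byte trailer whose first four bytes are the little-endian CRC32 of the input:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;

public class GzipAnatomy {
    public static void main(String[] args) throws Exception {
        byte[] data = "hello gzip".getBytes(StandardCharsets.US_ASCII);

        // Full gzip stream: header + deflate body + trailer
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        byte[] gzip = bos.toByteArray();

        // Raw deflate with the same settings GZIPOutputStream uses
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        def.setInput(data);
        def.finish();
        byte[] buf = new byte[1024];
        int n = def.deflate(buf);
        byte[] raw = Arrays.copyOf(buf, n);

        // 10-byte header starts with the gzip magic 0x1f 0x8b
        boolean headerOk = (gzip[0] & 0xff) == 0x1f && (gzip[1] & 0xff) == 0x8b;
        // the middle section is exactly the raw deflate output
        byte[] body = Arrays.copyOfRange(gzip, 10, gzip.length - 8);
        boolean bodyOk = Arrays.equals(body, raw);
        // trailer begins with the CRC32 of the uncompressed data, little-endian
        CRC32 crc = new CRC32();
        crc.update(data);
        long t = (gzip[gzip.length - 8] & 0xffL)
               | ((gzip[gzip.length - 7] & 0xffL) << 8)
               | ((gzip[gzip.length - 6] & 0xffL) << 16)
               | ((gzip[gzip.length - 5] & 0xffL) << 24);
        boolean trailerOk = t == crc.getValue();

        System.out.println(headerOk + " " + bodyOk + " " + trailerOk);
    }
}
```

The body comparison works because the constructor shown earlier builds its Deflater with exactly Deflater.DEFAULT_COMPRESSION and nowrap = true, so an identically configured Deflater produces byte-identical output.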
An amusing note in Flume's HDFSCompressedDataStream
@Override
public void sync() throws IOException {
    // We must use finish() and resetState() here -- flush() is apparently not
    // supported by the compressed output streams (it's a no-op).
    // Also, since resetState() writes headers, avoid calling it without an
    // additional write/append operation.
    // Note: There are bugs in Hadoop & JDK w/ pure-java gzip; see HADOOP-8522.
    serializer.flush();
    if (!isFinished) {
        cmpOut.finish();  // triggers the finish() logic shown above
        isFinished = true;
    }
    fsOut.flush();
    hflushOrSync(this.fsOut);
}
When is sync() called?
When the batchCounter in write() reaches batchSize
After HDFSEventSink finishes processing a transaction
Notice the remarkable comment: "Note: There are bugs in Hadoop & JDK w/ pure-java gzip; see HADOOP-8522."
Excerpted from: https://issues.apache.org/jira/browse/HADOOP-8522
* ResetableGzipOutputStream creates invalid gzip files when finish() and resetState() are used *
Description
ResetableGzipOutputStream creates invalid gzip files when finish() and resetState() are used. The issue is that finish() flushes the compressor buffer and writes the gzip CRC32 + data length trailer. After that, resetState() does not repeat the gzip header, but simply starts writing more deflate-compressed data. The resultant files are not readable by the Linux "gunzip" tool. ResetableGzipOutputStream should write valid multi-member gzip files.
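What HADOOP-8522 asks for is a valid multi-member gzip file: each finish()/resetState() cycle should produce a complete member with its own header and trailer. The sketch below (class name MultiMemberGzip is made up) shows that simply concatenating two complete gzip members yields a stream the JDK's GZIPInputStream, which supports multi-member files, decodes in full:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class MultiMemberGzip {
    // Compress one string into a complete gzip member (header + body + trailer)
    static byte[] member(String s) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Two complete members back to back -- a valid multi-member gzip file
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        all.write(member("hello "));
        all.write(member("world"));

        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(all.toByteArray()))) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

The buggy ResetableGzipOutputStream instead resumes writing bare deflate data after the first trailer, without a new header, which is why gunzip rejects the result.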
The common WARN: Unable to load native-hadoop library for your platform
This WARN is logged when Hadoop cannot find its native library and the load therefore fails
public final class NativeCodeLoader {
    static {
        try {
            System.loadLibrary("hadoop");
            LOG.debug("Loaded the native-hadoop library");
            nativeCodeLoaded = true;
        } catch (Throwable t) {
            // Ignore failure to load
            LOG.debug("Failed to load native-hadoop with error: " + t);
            LOG.debug("java.library.path=" + System.getProperty("java.library.path"));
        }
        if (!nativeCodeLoaded) {
            LOG.warn("Unable to load native-hadoop library for your platform... " +
                     "using builtin-java classes where applicable");
        }
    }
}
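The failure path that static block swallows is easy to reproduce. A minimal sketch (the library name is deliberately bogus, and it catches the concrete UnsatisfiedLinkError rather than Throwable): System.loadLibrary throws when the library cannot be found on java.library.path, which is exactly the condition that leads to the WARN above:

```java
public class LoadLibraryDemo {
    public static void main(String[] args) {
        try {
            // Stand-in for "hadoop": a library that is not on java.library.path.
            // loadLibrary throws UnsatisfiedLinkError when it cannot be found.
            System.loadLibrary("no-such-native-lib");
            System.out.println("loaded");
        } catch (UnsatisfiedLinkError t) {
            System.out.println("Unable to load native library; using builtin-java classes");
        }
    }
}
```

To make the real WARN go away, the compiled native library (e.g. libhadoop.so) has to be discoverable via java.library.path, as the second LOG.debug line above hints.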