Copyright notice: This is an original post by the author; do not reproduce without permission. https://blog.csdn.net/T1DMzks/article/details/62891335
On the command line, hadoop fs -du reports file sizes, but how do you get them through the Java API?
My first idea was recursion, which turned out to be very slow; later I found the FileSystem.getContentSummary method.
The slowest approach: recursion
Many similar recursive solutions circulate online; they are not recommended.
Using the FileSystem.getContentSummary method
Here is the relevant passage from the API docs:
The getSpaceConsumed() function in the ContentSummary class will return the actual space the file/directory occupies in the cluster i.e. it takes into account the replication factor set for the cluster.
For instance, if the replication factor in the hadoop cluster is set to 3 and the directory size is 1.5GB, the getSpaceConsumed() function will return the value as 4.5GB.
The getLength() function in the ContentSummary class will return the actual file/directory size.
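The 1.5 GB vs. 4.5 GB relationship quoted above is just multiplication by the replication factor. A minimal sketch of the arithmetic (pure Java, no cluster required; the numbers are illustrative, not taken from a real cluster):

```java
public class SpaceConsumedDemo {
    public static void main(String[] args) {
        short replication = 3;                           // dfs.replication for the cluster
        long logicalBytes = 1_610_612_736L;              // 1.5 GiB: what getLength() would report
        long spaceConsumed = logicalBytes * replication; // what getSpaceConsumed() would report
        System.out.println("getLength()        ~ " + logicalBytes + " bytes");
        System.out.println("getSpaceConsumed() ~ " + spaceConsumed + " bytes"); // 4.5 GiB
    }
}
```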
Sample code:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDirSize {
    public static void main(String[] args) {
        FileSystem hdfs = null;
        Configuration conf = new Configuration();
        try {
            hdfs = FileSystem.get(new URI("hdfs://192.xxx.xx.xx:9000"), conf, "username");
        } catch (Exception e) {
            e.printStackTrace();
        }
        Path filenamePath = new Path("/test/input");
        try {
            // physical space consumed (logical size multiplied by the replication factor)
            System.out.println("SPACE CONSUMED BY THE HDFS DIRECTORY : " + hdfs.getContentSummary(filenamePath).getSpaceConsumed());
            // logical size of the directory contents
            System.out.println("SIZE OF THE HDFS DIRECTORY : " + hdfs.getContentSummary(filenamePath).getLength());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
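Both calls above return raw byte counts. For log output it can help to render them human-readably; a small sketch (the helper name formatBytes is my own, not part of the Hadoop API):

```java
import java.util.Locale;

public class ByteFormat {
    // Hypothetical helper: convert a raw byte count into a human-readable string.
    static String formatBytes(long bytes) {
        if (bytes < 1024) return bytes + " B";
        String units = "KMGTPE";
        int exp = (int) (Math.log(bytes) / Math.log(1024));
        return String.format(Locale.ROOT, "%.1f %sB",
                bytes / Math.pow(1024, exp), units.charAt(exp - 1));
    }

    public static void main(String[] args) {
        System.out.println(formatBytes(1_610_612_736L)); // 1.5 GB
        System.out.println(formatBytes(4_831_838_208L)); // 4.5 GB
    }
}
```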
Problem log
Recording a very strange problem whose cause I have not yet found.
The native API is fast, but my own recursion was more than 10x slower, and even copying the native API's code out verbatim was also more than 10x slower.
Yet reading the source, the native API is recursive too.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TestString {
    public static FileSystem fs = null;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.xx.xx:9000");
        fs = FileSystem.get(conf);
        String rootPath = "/sparktest";

        long t1 = System.currentTimeMillis();
        ContentSummary contentSummary = fs.getContentSummary(new Path(rootPath));
        System.out.println("contentSummary.count: " + contentSummary.getFileCount()
                + " contentSummary length: " + contentSummary.getLength());
        long t2 = System.currentTimeMillis();
        System.out.println("Native getContentSummary took: " + (t2 - t1) + " ms");

        TestString ts = new TestString();
        FileModel fileModel = ts.new FileModel();
        long t3 = System.currentTimeMillis();
        ts.getFileLength(fileModel, new Path(rootPath));
        long t4 = System.currentTimeMillis();
        System.out.println("count: " + fileModel.count + " length: " + fileModel.length);
        System.out.println("Hand-written recursion took: " + (t4 - t3) + " ms");

        ContentSummary contentSummary1 = ts.getContentSummary(new Path(rootPath));
        long t5 = System.currentTimeMillis();
        System.out.println("contentSummary1.count: " + contentSummary1.getFileCount()
                + " contentSummary1 length: " + contentSummary1.getLength());
        System.out.println("getContentSummary copied from the source took: " + (t5 - t4) + " ms");
    }

    // Hand-written walk: issues a listStatus call for every directory it visits.
    public void getFileLength(FileModel fileModel, Path path) {
        try {
            FileStatus file = fs.getFileStatus(path);
            if (file.isFile()) {
                fileModel.count += 1;
                fileModel.length += file.getLen();
            } else {
                for (FileStatus curFile : fs.listStatus(path)) {
                    if (curFile.isFile()) {
                        fileModel.count += 1;
                        fileModel.length += curFile.getLen();
                    } else {
                        getFileLength(fileModel, curFile.getPath());
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Copied from FileSystem.getContentSummary for the timing comparison.
    public ContentSummary getContentSummary(Path f) throws IOException {
        FileStatus status = fs.getFileStatus(f);
        if (status.isFile()) {
            long length = status.getLen();
            return new ContentSummary.Builder().length(length)
                    .fileCount(1).directoryCount(0).spaceConsumed(length).build();
        }
        long[] summary = {0, 0, 1};
        for (FileStatus s : fs.listStatus(f)) {
            long length = s.getLen();
            ContentSummary c = s.isDirectory() ? getContentSummary(s.getPath())
                    : new ContentSummary.Builder().length(length)
                            .fileCount(1).directoryCount(0).spaceConsumed(length).build();
            summary[0] += c.getLength();
            summary[1] += c.getFileCount();
            summary[2] += c.getDirectoryCount();
        }
        return new ContentSummary.Builder().length(summary[0])
                .fileCount(summary[1]).directoryCount(summary[2])
                .spaceConsumed(summary[0]).build();
    }

    public class FileModel {
        long count = 0;
        long length = 0; // long, not int: an int overflows past ~2 GB of data
    }
}
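A back-of-the-envelope model of where a 10x-plus gap could come from, assuming it is dominated by network round trips (my own illustrative numbers, not measurements from the author's cluster): the copied FileSystem code pays one listStatus round trip to the NameNode per directory, whereas a single-RPC implementation pays one round trip total.

```java
public class RpcCostModel {
    public static void main(String[] args) {
        // Illustrative assumptions, not measurements.
        int directories = 2_000; // directories under the root being summarized
        double rttMs = 1.0;      // one listStatus round trip to the NameNode

        double clientRecursionMs = directories * rttMs; // one RPC per directory
        double singleRpcMs = rttMs;                     // one RPC, aggregation elsewhere

        System.out.printf("client-side recursion: ~%.0f ms%n", clientRecursionMs);
        System.out.printf("single RPC:            ~%.0f ms%n", singleRpcMs);
    }
}
```

Under these assumptions the recursive walk scales linearly with the number of directories, so even a modest tree makes the gap large.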
Likely explanation
The earlier test was actually timing the wrong class:
I copied the wrong implementation. What I copied was FileSystem's getContentSummary; when reading from HDFS, the method that actually runs is the override in DistributedFileSystem, so that is the one I should have copied.
Stepping carefully through that source, the call goes through the doCall method (next is never invoked).
Following doCall leads into org.apache.hadoop.hdfs.DFSClient.getContentSummary, which in turn calls namenode.getContentSummary(src);
here namenode is an instance of org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB, so the trail continues into
ClientNamenodeProtocolTranslatorPB's getContentSummary method.
Beyond that point I could no longer follow the code.
My guess is that either DFSClient's getContentSummary is implemented with multiple threads,
or the request is sent to the NameNode, the query is run on the cluster side, the results are aggregated at the NameNode, and only the final summary is returned; because the work happens in parallel rather than one round trip at a time, it is much faster.
Personally I think the second explanation is more likely.
The source is genuinely hard to read, especially once protocolPB gets involved.