Twenty Newsgroups Classification, Task 2: seq2sparse (Part 2)
--sequentialAccessVector (-seq)  (Optional) Whether output vectors should
                                 be SequentialAccessVectors. If set true
                                 else false
--namedVector (-nv)              (Optional) Whether output vectors should
                                 be NamedVectors. If set true else false
--logNormalize (-lnorm)          (Optional) Whether output vectors should
                                 be logNormalize. If set true else false
In the terminal output from yesterday's run, this step was invoked with the following command:
[plain]
./bin/mahout seq2sparse -i /home/mahout/mahout-work-mahout/20news-seq -o /home/mahout/mahout-work-mahout/20news-vectors -lnorm -nv -wt tfidf
Let us look only at the parameters used in this command. First, -lnorm means the output vectors are normalized with the log function (true when the flag is set). -nv means the output vectors are NamedVectors; what does "named" mean here? (It is not obvious from the help text; the sketch below makes it concrete.) Finally, -wt tfidf selects TF-IDF as the term-weighting scheme; see http://zh.wikipedia.org/wiki/TF-IDF for details.
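To make -nv and -lnorm concrete, here is a minimal sketch (assuming Mahout's math module, org.apache.mahout.math, is on the classpath; the document key string is made up for illustration). A NamedVector is simply an ordinary Vector wrapped together with a String name, and logNormalize() rescales the entries with the log function:
[java]
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class NamedVectorDemo {
  public static void main(String[] args) {
    // a small sparse vector with three non-zero term weights
    Vector v = new RandomAccessSparseVector(10);
    v.set(0, 1.0);
    v.set(3, 2.0);
    v.set(7, 4.0);

    // -nv: wrap the vector together with a name (this document key is
    // hypothetical), so the document ID survives into the output file
    NamedVector named = new NamedVector(v, "/rec.autos/101551");
    System.out.println(named.getName());

    // -lnorm: log-normalize the entries of the vector
    System.out.println(named.logNormalize());
  }
}
So the "named" in NamedVector is just the document key carried along with the vector, which is what lets us trace each output vector back to its document.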
Step (1) is the call at line 253 of SparseVectorsFromSequenceFiles:
[java]
DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);
Stepping into this call, we can see that the Mapper used is SequenceFileTokenizerMapper and that no Reducer is used. The Mapper's code is as follows:
[java]
protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  // analyze the document body; the key (document ID) serves as the Lucene field name
  TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
  CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
  StringTuple document = new StringTuple();
  stream.reset();
  // collect every non-empty token into the StringTuple
  while (stream.incrementToken()) {
    if (termAtt.length() > 0) {
      document.add(new String(termAtt.buffer(), 0, termAtt.length()));
    }
  }
  // emit (document ID, token list)
  context.write(key, document);
}
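Thus, for every document, the Mapper emits the document ID (Text) as the key and the token list (StringTuple) as the value. A minimal sketch of what that value holds (the tokens are made up; assuming Mahout's org.apache.mahout.common.StringTuple API):
[java]
import org.apache.mahout.common.StringTuple;

public class StringTupleDemo {
  public static void main(String[] args) {
    StringTuple document = new StringTuple();
    // hypothetical tokens, added in the order the TokenStream produced them
    document.add("today");
    document.add("also");
    document.add("tomorrow");
    // getEntries() returns the tokens in insertion order
    System.out.println(document.getEntries()); // [today, also, tomorrow]
  }
}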
The Mapper's setup method mainly instantiates the Analyzer; for the Analyzer API, see http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html. The method used in map is reusableTokenStream(String fieldName, Reader reader): "Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method."
To see the tokenization in isolation, write the following test program:
[java]
package mahout.fansy.test.bayes;
import java.io.IOException;
import java.io.StringReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.common.ClassUtils;
import org.apache.mahout.common.StringTuple;
import org.apache.mahout.vectorizer.DefaultAnalyzer;
import org.apache.mahout.vectorizer.DocumentProcessor;
public class TestSequenceFileTokenizerMapper {

  private static Analyzer analyzer = ClassUtils.instantiateAs("org.apache.mahout.vectorizer.DefaultAnalyzer",
      Analyzer.class);

  public static void main(String[] args) throws IOException {
    testMap();
  }

  // mirrors SequenceFileTokenizerMapper.map() outside of Hadoop
  public static void testMap() throws IOException {
    Text key = new Text("4096");
    Text value = new Text("today is also late.what about tomorrow ");
    TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    StringTuple document = new StringTuple();
    stream.reset();
    while (stream.incrementToken()) {
      if (termAtt.length() > 0) {
        document.add(new String(termAtt.buffer(), 0, termAtt.length()));
      }
    }
    // the original listing was cut off above; the loop and printout are
    // reconstructed from the Mapper logic so the test shows its output
    System.out.println("key: " + key + ", document: " + document.getEntries());
  }
}
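Running this test (with Mahout 0.7 and its bundled Lucene 3.x on the classpath) should print the tokens that SequenceFileTokenizerMapper would emit for this sentence. Note that DefaultAnalyzer delegates to Lucene's StandardAnalyzer, which lower-cases the text and drops common English stop words such as "is" and "about", so the printed StringTuple will contain fewer tokens than the raw input.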