Twenty Newsgroups Classification, Task 2: seq2sparse (Part 2)
--sequentialAccessVector (-seq)  (Optional) Whether output vectors should
                                 be SequentialAccessVectors. If set true
                                 else false
--namedVector (-nv)              (Optional) Whether output vectors should
                                 be NamedVectors. If set true else false
--logNormalize (-lnorm)          (Optional) Whether output vectors should
                                 be logNormalize. If set true else false
In the terminal output from yesterday's run, this step was invoked with the following command:
[plain]
./bin/mahout seq2sparse -i /home/mahout/mahout-work-mahout/20news-seq -o /home/mahout/mahout-work-mahout/20news-vectors -lnorm -nv -wt tfidf
Let us look only at the parameters used in this command. First, -lnorm means the output vectors are normalized with the log function (true when the flag is set). -nv means the output vectors are NamedVectors; what does "named" mean here? (It is not obvious from the help text; the sketch below makes it concrete.) Finally, -wt tfidf selects TF-IDF as the term-weighting scheme; see http://zh.wikipedia.org/wiki/TF-IDF for details.
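To make -nv and -lnorm concrete, here is a minimal sketch (assuming Mahout's math module, org.apache.mahout.math, is on the classpath; the document key string is made up for illustration). A NamedVector is simply an ordinary Vector wrapped together with a String name, and logNormalize() rescales the entries with the log function:
[java]
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class NamedVectorDemo {
  public static void main(String[] args) {
    // a small sparse vector with three non-zero term weights
    Vector v = new RandomAccessSparseVector(10);
    v.set(0, 1.0);
    v.set(3, 2.0);
    v.set(7, 4.0);

    // -nv: wrap the vector together with a name (this document key is
    // hypothetical), so the document ID survives into the output file
    NamedVector named = new NamedVector(v, "/rec.autos/101551");
    System.out.println(named.getName());

    // -lnorm: log-normalize the entries of the vector
    System.out.println(named.logNormalize());
  }
}
So the "named" in NamedVector is just the document key carried along with the vector, which is what lets us trace each output vector back to its document.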
Step (1) is the call at line 253 of SparseVectorsFromSequenceFiles:
[java]
DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);
Stepping into this call, we can see that the Mapper used is SequenceFileTokenizerMapper and that no Reducer is used. The Mapper's code is as follows:
[java]
protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  // analyze the document body; the key (document ID) serves as the Lucene field name
  TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
  CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
  StringTuple document = new StringTuple();
  stream.reset();
  // collect every non-empty token into the StringTuple
  while (stream.incrementToken()) {
    if (termAtt.length() > 0) {
      document.add(new String(termAtt.buffer(), 0, termAtt.length()));
    }
  }
  // emit (document ID, token list)
  context.write(key, document);
}
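Thus, for every document, the Mapper emits the document ID (Text) as the key and the token list (StringTuple) as the value. A minimal sketch of what that value holds (the tokens are made up; assuming Mahout's org.apache.mahout.common.StringTuple API):
[java]
import org.apache.mahout.common.StringTuple;

public class StringTupleDemo {
  public static void main(String[] args) {
    StringTuple document = new StringTuple();
    // hypothetical tokens, added in the order the TokenStream produced them
    document.add("today");
    document.add("also");
    document.add("tomorrow");
    // getEntries() returns the tokens in insertion order
    System.out.println(document.getEntries()); // [today, also, tomorrow]
  }
}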
The Mapper's setup method mainly instantiates the Analyzer; for the Analyzer API, see http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html. The method used in map is reusableTokenStream(String fieldName, Reader reader): "Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method."
To see the tokenization in isolation, write the following test program:
[java]
package mahout.fansy.test.bayes;
import java.io.IOException;
import java.io.StringReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.common.ClassUtils;
import org.apache.mahout.common.StringTuple;
import org.apache.mahout.vectorizer.DefaultAnalyzer;
import org.apache.mahout.vectorizer.DocumentProcessor;
public class TestSequenceFileTokenizerMapper {

  private static Analyzer analyzer = ClassUtils.instantiateAs("org.apache.mahout.vectorizer.DefaultAnalyzer",
      Analyzer.class);

  public static void main(String[] args) throws IOException {
    testMap();
  }

  // mirrors SequenceFileTokenizerMapper.map() outside of Hadoop
  public static void testMap() throws IOException {
    Text key = new Text("4096");
    Text value = new Text("today is also late.what about tomorrow ");
    TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    StringTuple document = new StringTuple();
    stream.reset();
    while (stream.incrementToken()) {
      if (termAtt.length() > 0) {
        document.add(new String(termAtt.buffer(), 0, termAtt.length()));
      }
    }
    // the original listing was cut off above; the loop and printout are
    // reconstructed from the Mapper logic so the test shows its output
    System.out.println("key: " + key + ", document: " + document.getEntries());
  }
}
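Running this test (with Mahout 0.7 and its bundled Lucene 3.x on the classpath) should print the tokens that SequenceFileTokenizerMapper would emit for this sentence. Note that DefaultAnalyzer delegates to Lucene's StandardAnalyzer, which lower-cases the text and drops common English stop words such as "is" and "about", so the printed StringTuple will contain fewer tokens than the raw input.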