public class IndexUtil extends Object
Constructor and Description |
---|
IndexUtil(Configuration conf) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | index(String key, WebPage page) Indexes a WebPage, adding the following fields: id, the default uniqueKey for the NutchDocument; digest, which identifies pages (like a unique ID) and is used to remove duplicates during the dedup procedure. |
public IndexUtil(Configuration conf)
public NutchDocument index(String key, WebPage page)
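Indexes a WebPage, adding the following fields:
id: default uniqueKey for the NutchDocument.
digest: identifies pages (like a unique ID) and is used to remove duplicates during the dedup procedure. It is computed with MD5Signature or TextProfileSignature.
Parameters:
key - The key of the page (reversed URL).
page - The WebPage.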
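The two ideas in this description, a reversed-URL key and a digest used for deduplication, can be illustrated with a small standalone sketch. It does not use Nutch classes: `IndexFieldsSketch` and its methods are hypothetical names introduced here. It assumes the key layout is the host labels in reverse order, followed by the scheme, an optional port, and the path, and it computes the digest as a plain MD5 over the page text, which is a simplification of what a signature class might do.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of the Nutch API. */
class IndexFieldsSketch {

    /** Builds a reversed-URL key: reversed host labels, then ":scheme[:port]" and the path (assumed layout). */
    static String reverseUrl(String url) {
        URI uri = URI.create(url);
        String[] labels = uri.getHost().split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            key.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    /** Returns the MD5 digest of the text as lowercase hex (simplified stand-in for a page signature). */
    static String md5Hex(String content) {
        try {
            byte[] hash = MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Maps each digest to the first key seen with it; later keys with the same digest are duplicates. */
    static Map<String, String> dedup(Map<String, String> contentByKey) {
        Map<String, String> keyByDigest = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : contentByKey.entrySet()) {
            keyByDigest.putIfAbsent(md5Hex(entry.getValue()), entry.getKey());
        }
        return keyByDigest;
    }
}
```

Under these assumptions, `reverseUrl("http://nutch.apache.org/index.html")` yields `org.apache.nutch:http/index.html`, and two pages with identical text collapse to a single digest entry in `dedup`.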
Copyright © 2015 The Apache Software Foundation