Package | Description |
---|---|
org.apache.lucene.analysis | API and code to convert text into indexable/searchable tokens. |
org.apache.lucene.analysis.cjk | Analyzer for Chinese, Japanese, and Korean. |
org.apache.lucene.analysis.cn | Analyzer for Chinese. |
org.apache.lucene.analysis.ngram | |
org.apache.lucene.analysis.ru | Analyzer for Russian. |
org.apache.lucene.analysis.sinks | Implementations of SinkTokenizer that might be useful. |
org.apache.lucene.analysis.standard | A fast grammar-based tokenizer constructed with JFlex. |
org.apache.lucene.wikipedia.analysis | |
Modifier and Type | Class and Description |
---|---|
class | CharTokenizer: An abstract base class for simple, character-oriented tokenizers. |
class | KeywordTokenizer: Emits the entire input as a single token. |
class | LetterTokenizer: A tokenizer that divides text at non-letters. |
class | LowerCaseTokenizer: Performs the function of LetterTokenizer and LowerCaseFilter together. |
class | SinkTokenizer: Can be used to cache Tokens for use in an Analyzer. |
class | WhitespaceTokenizer: A tokenizer that divides text at whitespace. |
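The character-oriented tokenizers above all follow the same pattern: scan the input character by character, accumulate characters that belong in a token, and emit the token at each boundary. The following plain-JDK sketch (illustrative only, not Lucene's implementation) mimics what LetterTokenizer combined with LowerCaseFilter, i.e. LowerCaseTokenizer, would produce:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: splits text at non-letter characters and
// lower-cases each token, approximating LowerCaseTokenizer's output.
// The class and method names here are hypothetical, not Lucene API.
public class SimpleLetterTokenizer {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                // still inside a token: accumulate, lower-cased
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // hit a non-letter boundary: emit the pending token
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // digits and punctuation are dropped, letters are lower-cased
        System.out.println(tokenize("Hello, Lucene 2013!")); // [hello, lucene]
    }
}
```

WhitespaceTokenizer differs only in its boundary test (whitespace instead of non-letters), which is why Lucene factors the shared scanning loop into the abstract CharTokenizer base class.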
Modifier and Type | Class and Description |
---|---|
class | CJKTokenizer: Modified from StopTokenizer, which does a decent job for most European languages. |
Modifier and Type | Class and Description |
---|---|
class | ChineseTokenizer: Extracts tokens from the stream using Character.getType(), treating each Chinese character as a single token. It differs from CJKTokenizer in its token parsing logic. |
Modifier and Type | Class and Description |
---|---|
class | EdgeNGramTokenizer: Tokenizes the input from an edge into n-grams of given size(s). |
class | NGramTokenizer: Tokenizes the input into n-grams of the given size(s). |
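The two n-gram tokenizers differ in where the grams are taken from: NGramTokenizer slides a window of fixed length across the whole input, while EdgeNGramTokenizer anchors every gram at one edge and grows it from the minimum to the maximum size. A minimal plain-JDK sketch of both ideas (illustrative only, not the Lucene implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of character n-gram tokenization; class and
// method names are hypothetical, not Lucene API.
public class NGrams {
    // All grams of length n sliding across the input (NGramTokenizer idea).
    public static List<String> ngrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    // Grams anchored at the front edge, sizes min..max (EdgeNGramTokenizer idea).
    public static List<String> edgeNgrams(String text, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int n = min; n <= max && n <= text.length(); n++) {
            out.add(text.substring(0, n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("lucene", 2));        // [lu, uc, ce, en, ne]
        System.out.println(edgeNgrams("lucene", 1, 3)); // [l, lu, luc]
    }
}
```

Edge n-grams are the classic building block for prefix ("search-as-you-type") matching, whereas full n-grams help with substring matching and languages without word delimiters.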
Modifier and Type | Class and Description |
---|---|
class | RussianLetterTokenizer: Extends LetterTokenizer by additionally looking up letters in a given "russian charset". |
Modifier and Type | Class and Description |
---|---|
class | DateRecognizerSinkTokenizer: Attempts to parse Token.termBuffer() as a Date using a DateFormat. |
class | TokenRangeSinkTokenizer: Counts tokens as they go by and saves to its internal list those between a lower and upper bound, exclusive of the upper bound. |
class | TokenTypeSinkTokenizer: Adds a token to the sink if its Token.type() matches the passed-in typeToMatch. |
Modifier and Type | Class and Description |
---|---|
class | StandardTokenizer: A grammar-based tokenizer constructed with JFlex. |
Modifier and Type | Class and Description |
---|---|
class | WikipediaTokenizer: Extension of StandardTokenizer that is aware of Wikipedia syntax. |
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.