public class Token extends Object implements Cloneable
The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display.
The type is a string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example, an end-of-sentence marker token might be implemented with type "eos". The default token type is "word".
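As a rough illustration of how offsets tie a token back to its source text, the sketch below wraps every span of a given type in brackets using nothing but its start and end offsets. `Span` and `OffsetHighlighter` are hypothetical stand-ins, not Lucene classes, and the end offset is assumed to be exclusive (one greater than the last character), matching the endOffset() description on this page.

```java
// Hypothetical stand-in for a token's offsets and type; not the Lucene Token class.
final class Span {
    final int start;   // offset of the first character in the source text
    final int end;     // one greater than the offset of the last character
    final String type; // lexical type, e.g. "word" or "eos"

    Span(int start, int end, String type) {
        this.start = start;
        this.end = end;
        this.type = type;
    }
}

final class OffsetHighlighter {
    // Wraps every span of the requested type in [ ] using only its offsets.
    // Spans are assumed to be sorted by start offset and non-overlapping.
    static String highlight(String source, Span[] spans, String type) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Span s : spans) {
            if (!s.type.equals(type)) {
                continue;
            }
            out.append(source, cursor, s.start); // text before the span
            out.append('[').append(source, s.start, s.end).append(']');
            cursor = s.end;
        }
        out.append(source, cursor, source.length()); // trailing text
        return out.toString();
    }
}
```

Because the source text itself is never modified, the same offsets could equally drive a KWIC display by widening the window around each span.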
A Token can optionally have metadata (a.k.a. Payload) in the form of a variable-length byte array. Use TermPositions.getPayloadLength() and TermPositions.getPayload(byte[], int) to retrieve the payloads from the index.
WARNING: The status of the Payloads feature is experimental. The APIs introduced here might change in the future and will not be supported anymore in such a case.
NOTE: As of 2.3, Token stores the term text internally as a malleable char[] termBuffer instead of String termText. The indexing code and core tokenizers have been changed to re-use a single Token instance, changing its buffer and other fields in-place as the Token is processed. This provides substantially better indexing performance as it saves the GC cost of allocating a new Token and String for every term. The APIs that accept String termText are still available, but a warning about the associated performance cost has been added (below). The termText() method has been deprecated.
Tokenizers and filters should try to re-use a Token instance when possible for best performance, by implementing the TokenStream.next(Token) API.
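The reuse idea can be sketched without Lucene on the classpath. In the example below, `MiniToken` and `MiniWhitespaceTokenizer` are hypothetical, simplified stand-ins (not the real Token or TokenStream); the point is only that next(...) fills and returns the caller's instance instead of allocating a new object per term.

```java
// Simplified stand-in for a reusable token; not Lucene's Token implementation.
final class MiniToken {
    char[] buffer = new char[8]; // term characters; only the first 'length' are valid
    int length;
    int startOffset;
    int endOffset;

    // Copies text[from, from + len) into the buffer, growing it only when needed.
    void setTerm(String text, int from, int len, int start, int end) {
        if (buffer.length < len) {
            buffer = new char[len];
        }
        text.getChars(from, from + len, buffer, 0);
        length = len;
        startOffset = start;
        endOffset = end;
    }

    String term() {
        return new String(buffer, 0, length);
    }
}

// Hypothetical whitespace tokenizer that fills the caller's token instead of allocating.
final class MiniWhitespaceTokenizer {
    private final String text;
    private int pos;

    MiniWhitespaceTokenizer(String text) {
        this.text = text;
    }

    // Returns the same instance that was passed in, or null when the input is exhausted.
    MiniToken next(MiniToken reusable) {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        if (pos == text.length()) {
            return null;
        }
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        reusable.setTerm(text, start, pos - start, start, pos);
        return reusable;
    }
}
```

A caller would allocate one `MiniToken` up front and pass it to every next(...) call; any token that must outlive the next call has to be copied first, since its fields are overwritten in place.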
Failing that, to create a new Token you should first use one of the constructors that starts with null text. To load the token from a char[], use setTermBuffer(char[], int, int). To load from a String, use setTermBuffer(String) or setTermBuffer(String, int, int). Alternatively, you can get the Token's termBuffer by calling either termBuffer(), if you know that your text is shorter than the capacity of the termBuffer, or resizeTermBuffer(int), if there is any possibility that you may need to grow the buffer. Fill in the characters of your term into this buffer, with String.getChars(int, int, char[], int) if loading from a string or with System.arraycopy(Object, int, Object, int, int), and finally call setTermLength(int) to set the length of the term text. See LUCENE-969 for details.
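The resize, fill, then set-length sequence described above might look roughly like this. `TermBufferSketch` is a hypothetical holder that only mirrors the termBuffer() / resizeTermBuffer(int) / setTermLength(int) trio; the doubling growth policy is an assumption of this sketch, not something the source specifies.

```java
// Hypothetical holder mirroring termBuffer()/resizeTermBuffer()/setTermLength();
// a sketch, not Lucene's Token.
final class TermBufferSketch {
    private char[] termBuffer = new char[4];
    private int termLength;

    char[] termBuffer() {
        return termBuffer;
    }

    int termLength() {
        return termLength;
    }

    // Grows the buffer to at least newSize, preserving the existing content.
    // Doubling is an assumed growth policy chosen for this sketch.
    char[] resizeTermBuffer(int newSize) {
        if (termBuffer.length < newSize) {
            char[] grown = new char[Math.max(newSize, termBuffer.length * 2)];
            System.arraycopy(termBuffer, 0, grown, 0, termBuffer.length);
            termBuffer = grown;
        }
        return termBuffer;
    }

    // Records how many characters of the buffer are valid.
    void setTermLength(int length) {
        if (length > termBuffer.length) {
            throw new IllegalArgumentException("length exceeds buffer capacity");
        }
        termLength = length;
    }

    // Loads a String: resize first because it may not fit, copy, then set the length.
    static void loadFromString(TermBufferSketch t, String s) {
        char[] buf = t.resizeTermBuffer(s.length());
        s.getChars(0, s.length(), buf, 0);
        t.setTermLength(s.length());
    }

    // Loads a slice of a char[] the same way, using System.arraycopy.
    static void loadFromChars(TermBufferSketch t, char[] src, int offset, int length) {
        char[] buf = t.resizeTermBuffer(length);
        System.arraycopy(src, offset, buf, 0, length);
        t.setTermLength(length);
    }
}
```

Forgetting the final setTermLength call is the easy mistake here: the buffer would hold the new characters while the recorded length still describes the previous term.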
Typical reuse patterns:
- Copying text from a string: return reusableToken.reinit(string, startOffset, endOffset[, type]);
- Copying some text from a string: return reusableToken.reinit(string, 0, string.length(), startOffset, endOffset[, type]);
- Copying text from a char[] buffer: return reusableToken.reinit(buffer, 0, buffer.length, startOffset, endOffset[, type]);
- Copying some text from a char[] buffer: return reusableToken.reinit(buffer, start, end - start, startOffset, endOffset[, type]);
- Copying from one Token to another: return reusableToken.reinit(source.termBuffer(), 0, source.termLength(), source.startOffset(), source.endOffset()[, source.type()]);
Because TokenStreams can be chained, one cannot assume that the Token's current type is correct.
See Also: Payload
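The reinit overloads are documented on this page as shorthand for clear() followed by the individual setters, with the type falling back to DEFAULT_TYPE when none is given. A minimal sketch of that contract, using a hypothetical `ReinitSketch` class that stores the term as a String for brevity (the real class uses a char[] buffer and also carries a payload, both omitted here):

```java
// Hypothetical stand-in illustrating the documented reinit contract; not Lucene's Token.
final class ReinitSketch {
    static final String DEFAULT_TYPE = "word";

    String term = "";
    int startOffset;
    int endOffset;
    String type = DEFAULT_TYPE;
    int flags;
    int positionIncrement = 1;

    // Resets term text, flags and positionIncrement to their defaults.
    // Offsets and type are not touched here; reinit overwrites them explicitly.
    void clear() {
        term = "";
        flags = 0;
        positionIncrement = 1;
    }

    // Shorthand for clear() plus the term, offset and type setters.
    ReinitSketch reinit(String newTerm, int newStartOffset, int newEndOffset, String newType) {
        clear();
        term = newTerm;
        startOffset = newStartOffset;
        endOffset = newEndOffset;
        type = newType;
        return this;
    }

    // Without an explicit type, the type is reset to DEFAULT_TYPE.
    ReinitSketch reinit(String newTerm, int newStartOffset, int newEndOffset) {
        return reinit(newTerm, newStartOffset, newEndOffset, DEFAULT_TYPE);
    }
}
```

Resetting the type on every reinit is what protects a filter from inheriting a stale type left behind by an earlier stage in the chain.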
Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_TYPE |
Constructor and Description |
---|
Token(): Constructs a Token with null text. |
Token(char[] startTermBuffer, int termBufferOffset, int termBufferLength, int start, int end): Constructs a Token with the given term buffer (offset and length) and start and end offsets. |
Token(int start, int end): Constructs a Token with null text and start and end offsets. |
Token(int start, int end, int flags): Constructs a Token with null text and start and end offsets plus flags. |
Token(int start, int end, String typ): Constructs a Token with null text and start and end offsets plus the Token type. |
Token(String text, int start, int end): Deprecated. |
Token(String text, int start, int end, int flags): Deprecated. |
Token(String text, int start, int end, String typ): Deprecated. |
Modifier and Type | Method and Description |
---|---|
void | clear(): Resets the term text, payload, flags, and positionIncrement to default. |
Object | clone() |
Token | clone(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset): Makes a clone, but replaces the term buffer and start/end offsets in the process. |
int | endOffset(): Returns this Token's ending offset, one greater than the position of the last character corresponding to this token in the source text. |
boolean | equals(Object obj) |
int | getFlags(): EXPERIMENTAL: While we think this is here to stay, we may want to change it to be a long. |
Payload | getPayload(): Returns this Token's payload. |
int | getPositionIncrement(): Returns the position increment of this Token. |
int | hashCode() |
Token | reinit(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset): Shorthand for calling clear(), setTermBuffer(char[], int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE. |
Token | reinit(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset, String newType): Shorthand for calling clear(), setTermBuffer(char[], int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String). |
Token | reinit(String newTerm, int newStartOffset, int newEndOffset): Shorthand for calling clear(), setTermBuffer(String), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE. |
Token | reinit(String newTerm, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset): Shorthand for calling clear(), setTermBuffer(String, int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE. |
Token | reinit(String newTerm, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset, String newType): Shorthand for calling clear(), setTermBuffer(String, int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String). |
Token | reinit(String newTerm, int newStartOffset, int newEndOffset, String newType): Shorthand for calling clear(), setTermBuffer(String), setStartOffset(int), setEndOffset(int), and setType(java.lang.String). |
void | reinit(Token prototype): Copy the prototype token's fields into this one. |
void | reinit(Token prototype, char[] newTermBuffer, int offset, int length): Copy the prototype token's fields into this one, with a different term. |
void | reinit(Token prototype, String newTerm): Copy the prototype token's fields into this one, with a different term. |
char[] | resizeTermBuffer(int newSize): Grows the termBuffer to at least size newSize, preserving the existing content. |
void | setEndOffset(int offset): Set the ending offset. |
void | setFlags(int flags) |
void | setPayload(Payload payload): Sets this Token's payload. |
void | setPositionIncrement(int positionIncrement): Set the position increment. |
void | setStartOffset(int offset): Set the starting offset. |
void | setTermBuffer(char[] buffer, int offset, int length): Copies the contents of buffer, starting at offset for length characters, into the termBuffer array. |
void | setTermBuffer(String buffer): Copies the contents of buffer into the termBuffer array. |
void | setTermBuffer(String buffer, int offset, int length): Copies the contents of buffer, starting at offset and continuing for length characters, into the termBuffer array. |
void | setTermLength(int length): Set number of valid characters (length of the term) in the termBuffer array. |
void | setTermText(String text): Deprecated. |
void | setType(String type): Set the lexical type. |
int | startOffset(): Returns this Token's starting offset, the position of the first character corresponding to this token in the source text. |
String | term(): Returns the Token's term text. |
char[] | termBuffer(): Returns the internal termBuffer character array which you can then directly alter. |
int | termLength(): Return number of valid characters (length of the term) in the termBuffer array. |
String | termText(): Deprecated. This method now has a performance penalty because the text is stored internally in a char[]. If possible, use termBuffer() and termLength() directly instead. If you really need a String, use term(). |
String | toString() |
String | type(): Returns this Token's lexical type. |
public static final String DEFAULT_TYPE
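public Token()
Constructs a Token with null text.
public Token(int start, int end)
Constructs a Token with null text and start and end offsets.
start - start offset in the source text
end - end offset in the source text
public Token(int start, int end, String typ)
Constructs a Token with null text and start and end offsets plus the Token type.
start - start offset in the source text
end - end offset in the source text
typ - the lexical type of this Token
public Token(int start, int end, int flags)
Constructs a Token with null text and start and end offsets plus flags.
start - start offset in the source text
end - end offset in the source text
flags - the bits to set for this token
public Token(String text, int start, int end)
Deprecated.
text - term text
start - start offset
end - end offset
public Token(String text, int start, int end, String typ)
Deprecated.
text - term text
start - start offset
end - end offset
typ - token type
public Token(String text, int start, int end, int flags)
Deprecated.
text - term text
start - start offset
end - end offset
flags - token type bits
public Token(char[] startTermBuffer, int termBufferOffset, int termBufferLength, int start, int end)
Constructs a Token with the given term buffer (offset and length) and start and end offsets.
Parameters: startTermBuffer, termBufferOffset, termBufferLength, start, end
public void setPositionIncrement(int positionIncrement)
Set the position increment. This determines the position of this token relative to the previous Token in a TokenStream, used in phrase searching.
The default value is one.
Some common uses for this are:
- Set it to zero to put multiple terms in the same position, for example when a word has multiple stems, so that phrase searches including either stem will match.
- Set it to values greater than one to inhibit exact phrase matches, for example so that phrases do not match across removed stop words.
positionIncrement - the distance from the prior term
See Also: TermPositions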
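To make the arithmetic concrete, the sketch below turns a sequence of position increments into absolute positions: an increment of zero stacks a token on the previous position, and an increment of two leaves a one-position gap. `PositionSketch` is a hypothetical helper, and starting the running position at -1 (so that a first increment of one lands on position 0) is an assumption of this sketch.

```java
// Hypothetical helper showing how increments accumulate into positions.
final class PositionSketch {
    // Converts per-token position increments into absolute positions.
    // Assumes the position before the first token is -1.
    static int[] positions(int[] increments) {
        int[] result = new int[increments.length];
        int position = -1;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            result[i] = position;
        }
        return result;
    }
}
```

For increments {1, 0, 2}, the second token shares position 0 with the first (as a second stem or synonym would), and the third lands on position 2, so an exact two-term phrase query spanning the gap would not match.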
public int getPositionIncrement()
Returns the position increment of this Token.
See Also: setPositionIncrement(int)
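public void setTermText(String text)
Deprecated. Use setTermBuffer(char[], int, int), setTermBuffer(String), or setTermBuffer(String, int, int) instead.
public final String termText()
Deprecated. This method now has a performance penalty because the text is stored internally in a char[]. If possible, use termBuffer() and termLength() directly instead. If you really need a String, use term().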
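public final String term()
Returns the Token's term text. This method has a performance penalty because the text is stored internally in a char[]. If possible, use termBuffer() and termLength() directly instead. If you really need a String, use this method, which is nothing more than a convenience call to new String(token.termBuffer(), 0, token.termLength()).
public final void setTermBuffer(char[] buffer, int offset, int length)
Copies the contents of buffer, starting at offset for length characters, into the termBuffer array.
buffer - the buffer to copy
offset - the index in the buffer of the first character to copy
length - the number of characters to copy
public final void setTermBuffer(String buffer)
Copies the contents of buffer into the termBuffer array.
buffer - the buffer to copy
public final void setTermBuffer(String buffer, int offset, int length)
Copies the contents of buffer, starting at offset and continuing for length characters, into the termBuffer array.
buffer - the buffer to copy
offset - the index in the buffer of the first character to copy
length - the number of characters to copy
public final char[] termBuffer()
Returns the internal termBuffer character array which you can then directly alter. If the array is too small for your token, use resizeTermBuffer(int) to increase it. After altering the buffer be sure to call setTermLength(int) to record the number of valid characters that were placed into the termBuffer.
public char[] resizeTermBuffer(int newSize)
Grows the termBuffer to at least size newSize, preserving the existing content. If the next operation is to change the contents of the term buffer, use setTermBuffer(char[], int, int), setTermBuffer(String), or setTermBuffer(String, int, int) to optimally combine the resize with the setting of the termBuffer.
newSize - minimum size of the new termBuffer
public final int termLength()
Return number of valid characters (length of the term) in the termBuffer array.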
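Since the term lives in a char[] with a separate valid length, a filter can often inspect it without building a String at all. The sketch below contrasts an allocation-free comparison with the String-producing form that term() is described as wrapping. `TermReadSketch` and its method names are hypothetical; they operate on a plain buffer and length rather than on a Lucene Token.

```java
// Hypothetical helpers operating on a (buffer, length) pair; not part of Lucene.
final class TermReadSketch {
    // Compares the first 'length' chars of buffer with keyword without allocating a String.
    static boolean termEquals(char[] buffer, int length, String keyword) {
        if (length != keyword.length()) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (buffer[i] != keyword.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // The String-producing equivalent: copies the valid prefix into a new String.
    static String asString(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }
}
```

Only the first `length` characters are meaningful; anything beyond that in the buffer is leftover from earlier, longer terms and must be ignored.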
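public final void setTermLength(int length)
Set number of valid characters (length of the term) in the termBuffer array. To grow the array beyond its current size, call resizeTermBuffer(int) first.
length - the truncated length
public final int startOffset()
Returns this Token's starting offset, the position of the first character corresponding to this token in the source text.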
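public void setStartOffset(int offset)
Set the starting offset.
See Also: startOffset()
public final int endOffset()
Returns this Token's ending offset, one greater than the position of the last character corresponding to this token in the source text.
public void setEndOffset(int offset)
Set the ending offset.
See Also: endOffset()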
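public final String type()
Returns this Token's lexical type.
public int getFlags()
EXPERIMENTAL: While we think this is here to stay, we may want to change it to be a long.
The flags are distinct from type(), although they do share similar purposes. The flags can be used to encode information about the token for use by other TokenFilters.
public void setFlags(int flags)
See Also: getFlags()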
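One common way to use an int as a set of flags is to give each property its own bit. The sketch below shows that convention; the flag names `FLAG_KEYWORD` and `FLAG_SYNONYM` are invented for illustration, since the source does not define the meaning of any particular bit.

```java
// Hypothetical bit-flag helpers; the bit meanings are invented for this sketch.
final class TokenFlagsSketch {
    static final int FLAG_KEYWORD = 1 << 0; // e.g. "do not stem this token"
    static final int FLAG_SYNONYM = 1 << 1; // e.g. "injected by a synonym filter"

    // Returns flags with the given bit turned on.
    static int set(int flags, int bit) {
        return flags | bit;
    }

    // Returns flags with the given bit turned off.
    static int clear(int flags, int bit) {
        return flags & ~bit;
    }

    // Reports whether the given bit is on.
    static boolean isSet(int flags, int bit) {
        return (flags & bit) != 0;
    }
}
```

An upstream filter would store the combined value via setFlags(int) and a downstream filter would test individual bits on the value read from getFlags(); both ends have to agree on the bit assignments, which is why this is kept separate from the human-readable type string.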
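public Payload getPayload()
Returns this Token's payload.
public void setPayload(Payload payload)
Sets this Token's payload.
public void clear()
Resets the term text, payload, flags, and positionIncrement to default.
public Token clone(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset)
Makes a clone, but replaces the term buffer and start/end offsets in the process.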
public Token reinit(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset, String newType)
Shorthand for calling clear(), setTermBuffer(char[], int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String).
public Token reinit(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset)
Shorthand for calling clear(), setTermBuffer(char[], int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE.
public Token reinit(String newTerm, int newStartOffset, int newEndOffset, String newType)
Shorthand for calling clear(), setTermBuffer(String), setStartOffset(int), setEndOffset(int), and setType(java.lang.String).
public Token reinit(String newTerm, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset, String newType)
Shorthand for calling clear(), setTermBuffer(String, int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String).
public Token reinit(String newTerm, int newStartOffset, int newEndOffset)
Shorthand for calling clear(), setTermBuffer(String), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE.
public Token reinit(String newTerm, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset)
Shorthand for calling clear(), setTermBuffer(String, int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE.
public void reinit(Token prototype)
Copy the prototype token's fields into this one.
Parameters: prototype
public void reinit(Token prototype, String newTerm)
Copy the prototype token's fields into this one, with a different term.
Parameters: prototype, newTerm
public void reinit(Token prototype, char[] newTermBuffer, int offset, int length)
Copy the prototype token's fields into this one, with a different term.
Parameters: prototype, newTermBuffer, offset, length
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.