Class HTMLParseState
- java.lang.Object
  - org.apache.manifoldcf.connectorcommon.fuzzyml.CharacterReceiver
    - org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
      - org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState
        - org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState
public class HTMLParseState extends TagParseState
This class takes the output of the basic tag parser and adapts it for typical HTML usage. For instance, it converts attribute lists into maps with lowercased attribute names. It also lowercases tag names.
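The attribute-list conversion described above can be pictured with a small sketch. It is not the library's code: `SketchAttr` is a hypothetical stand-in for `AttrNameValue`, `AttrLowercaser` is an illustrative name, and the choice to let a later duplicate attribute overwrite an earlier one is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the library's AttrNameValue; not the real class.
final class SketchAttr {
  final String name;
  final String value;

  SketchAttr(String name, String value) {
    this.name = name;
    this.value = value;
  }
}

final class AttrLowercaser {
  // Builds a map keyed by the lowercased attribute name. Values are left
  // untouched. A repeated name overwrites the earlier entry (an assumption,
  // not documented behavior).
  static Map<String, String> toLowercasedMap(List<SketchAttr> attributes) {
    Map<String, String> map = new HashMap<>();
    for (SketchAttr a : attributes) {
      map.put(a.name.toLowerCase(Locale.ROOT), a.value);
    }
    return map;
  }

  public static void main(String[] args) {
    List<SketchAttr> attrs = new ArrayList<>();
    attrs.add(new SketchAttr("HREF", "http://example.com/Page"));
    attrs.add(new SketchAttr("Class", "External"));
    Map<String, String> map = toLowercasedMap(attrs);
    System.out.println(map.get("href"));
    System.out.println(map.get("class"));
  }
}
```

Using `Locale.ROOT` keeps the lowercasing independent of the JVM's default locale, which matters for names containing characters such as `I` under a Turkish locale.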
-
Field Summary
-
Fields inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState
accumBuffer, ampBuffer, bTagDepth, currentAttrList, currentAttrName, currentAttrNameBuffer, currentState, currentTagName, currentTagNameBuffer, currentValueBuffer, inAmpersand, mapLookup, TAGPARSESTATE_IN_ATTR_LOOKING_FOR_VALUE, TAGPARSESTATE_IN_ATTR_NAME, TAGPARSESTATE_IN_ATTR_VALUE, TAGPARSESTATE_IN_BANG_TOKEN, TAGPARSESTATE_IN_BRACKET_TOKEN, TAGPARSESTATE_IN_CDATA_BODY, TAGPARSESTATE_IN_COMMENT, TAGPARSESTATE_IN_DOUBLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_END_TAG_NAME, TAGPARSESTATE_IN_QTAG_ATTR_LOOKING_FOR_VALUE, TAGPARSESTATE_IN_QTAG_ATTR_NAME, TAGPARSESTATE_IN_QTAG_ATTR_VALUE, TAGPARSESTATE_IN_QTAG_DOUBLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_QTAG_NAME, TAGPARSESTATE_IN_QTAG_SAW_QUESTION, TAGPARSESTATE_IN_QTAG_SINGLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_QTAG_UNQUOTED_ATTR_VALUE, TAGPARSESTATE_IN_SINGLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_TAG_NAME, TAGPARSESTATE_IN_TAG_SAW_SLASH, TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE, TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE_SAW_SLASH, TAGPARSESTATE_NEED_FINAL_BRACKET, TAGPARSESTATE_NORMAL, TAGPARSESTATE_SAWCOMMENTDASH, TAGPARSESTATE_SAWDASH, TAGPARSESTATE_SAWEXCLAMATION, TAGPARSESTATE_SAWLEFTANGLE, TAGPARSESTATE_SAWRIGHTBRACKET, TAGPARSESTATE_SAWSECONDCOMMENTDASH, TAGPARSESTATE_SAWSECONDRIGHTBRACKET
-
Fields inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
charBuffer
-
Constructor Summary
HTMLParseState()
Constructor.
-
Method Summary
protected boolean noteBTag(java.lang.String tagName)
This method is called for every <! <token> ... > construct, or 'btag'.
protected boolean noteBTagToken(java.lang.String token)
This method gets called for every token inside a btag.
protected boolean noteEndBTag()
This method is called for the end of every btag, or any time there's a naked '>' in the document.
protected boolean noteEndEscaped()
Called for the end of every cdata-like tag.
protected boolean noteEndTag(java.lang.String tagName)
This method gets called for every end tag.
protected boolean noteEscaped(java.lang.String token)
Called for the start of every cdata-like tag, e.g. <![ <token> [ ... ]]>
protected boolean noteEscapedCharacter(char thisChar)
This method gets called for every character that is found within an escape block, e.g. CDATA.
protected boolean noteQTag(java.lang.String tagName, java.util.List<AttrNameValue> attributes)
This method is called for every <? ... ?> construct, or 'qtag'.
protected boolean noteTag(java.lang.String tagName, java.util.List<AttrNameValue> attributes)
This method gets called for every tag.
protected boolean noteTag(java.lang.String tagName, java.util.Map<java.lang.String,java.lang.String> attributes)
Map version of the noteTag method.
protected boolean noteTagEnd(java.lang.String tagName)
Note end tag.
-
Methods inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState
acceptNewTag, attributeDecode, dealWithCharacter, dumpValues, isPunctuation, isWhitespace, mapChunk, newBuffer, noteNormalCharacter, outputAmpBuffer
-
Methods inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
dealWithCharacters, dealWithRemainder
-
Methods inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.CharacterReceiver
finishUp
-
Method Detail
-
noteTag
protected final boolean noteTag(java.lang.String tagName, java.util.List<AttrNameValue> attributes) throws ManifoldCFException
This method gets called for every tag. It is declared final in this class; to intercept tag begins, override the Map version of noteTag instead.
- Overrides:
noteTag in class TagParseState
- Returns:
true to halt further processing.
- Throws:
ManifoldCFException
-
noteTag
protected boolean noteTag(java.lang.String tagName, java.util.Map<java.lang.String,java.lang.String> attributes) throws ManifoldCFException
Map version of the noteTag method. The attributes arrive as a map rather than a list. Override this method to intercept tag begins.
- Returns:
true to halt further processing.
- Throws:
ManifoldCFException
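The relationship between the two noteTag overloads can be sketched as a final list-based callback that delegates to an overridable map-based one. Everything below is a stand-in written for illustration: `SketchHtmlState` is not the real HTMLParseState, attributes are modeled as `String[]` name/value pairs instead of `AttrNameValue`, and `LinkCollector` with its `limit` parameter is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in that mimics the shape described on this page.
// It is NOT the real HTMLParseState.
class SketchHtmlState {
  // List version: final, so subclasses cannot intercept it directly.
  final boolean noteTag(String tagName, List<String[]> attributes) {
    Map<String, String> map = new HashMap<>();
    for (String[] nameValue : attributes) {
      map.put(nameValue[0].toLowerCase(Locale.ROOT), nameValue[1]);
    }
    return noteTag(tagName.toLowerCase(Locale.ROOT), map);
  }

  // Map version: the extension point. Default is to keep processing.
  protected boolean noteTag(String tagName, Map<String, String> attributes) {
    return false;
  }
}

// Collects href values from anchor tags and asks to halt once a limit is hit.
final class LinkCollector extends SketchHtmlState {
  final List<String> links = new ArrayList<>();
  private final int limit;

  LinkCollector(int limit) {
    this.limit = limit;
  }

  @Override
  protected boolean noteTag(String tagName, Map<String, String> attributes) {
    if (tagName.equals("a")) {
      String href = attributes.get("href");
      if (href != null) {
        links.add(href);
      }
    }
    // Returning true signals that further processing should halt.
    return links.size() >= limit;
  }
}
```

Because the subclass only ever sees lowercased names, it can compare against `"a"` and `"href"` directly without handling `<A HREF=...>` as a special case.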
-
noteEndTag
protected final boolean noteEndTag(java.lang.String tagName) throws ManifoldCFException
This method gets called for every end tag. It is declared final in this class; to intercept tag ends, override noteTagEnd instead.
- Overrides:
noteEndTag in class TagParseState
- Returns:
true to halt further processing.
- Throws:
ManifoldCFException
-
noteTagEnd
protected boolean noteTagEnd(java.lang.String tagName) throws ManifoldCFException
Note end tag.
- Throws:
ManifoldCFException
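A subclass that only cares about a particular closing tag might look like the sketch below. It assumes that the final noteEndTag lowercases the name and then delegates to noteTagEnd, which this page does not state explicitly. `SketchEndTagState` and `HeadEndDetector` are hypothetical names, not library classes.

```java
import java.util.Locale;

// Hypothetical stand-in; it only mirrors the callback names on this page.
class SketchEndTagState {
  // Final entry point. Lowercasing and delegation are assumptions here.
  final boolean noteEndTag(String tagName) {
    return noteTagEnd(tagName.toLowerCase(Locale.ROOT));
  }

  // Extension point. Default is to keep processing.
  protected boolean noteTagEnd(String tagName) {
    return false;
  }
}

// Requests a halt as soon as the closing head tag is seen, for callers
// that only need information from the document head.
final class HeadEndDetector extends SketchEndTagState {
  int endTagsSeen = 0;

  @Override
  protected boolean noteTagEnd(String tagName) {
    endTagsSeen++;
    return tagName.equals("head");
  }
}
```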
-
noteQTag
protected final boolean noteQTag(java.lang.String tagName, java.util.List<AttrNameValue> attributes) throws ManifoldCFException
This method is called for every <? ... ?> construct, or 'qtag'. This is not useful for HTML.
- Overrides:
noteQTag in class TagParseState
- Returns:
true to halt further processing.
- Throws:
ManifoldCFException
-
noteBTag
protected final boolean noteBTag(java.lang.String tagName) throws ManifoldCFException
This method is called for every <! <token> ... > construct, or 'btag'. It is declared final in this class, so subclasses of HTMLParseState cannot override it.
- Overrides:
noteBTag in class TagParseState
- Returns:
true to halt further processing.
- Throws:
ManifoldCFException
-
noteEndBTag
protected final boolean noteEndBTag() throws ManifoldCFException
This method is called for the end of every btag, or any time there's a naked '>' in the document. It is declared final in this class, so subclasses of HTMLParseState cannot override it.
- Overrides:
noteEndBTag in class TagParseState
- Returns:
true to halt further processing.
- Throws:
ManifoldCFException
-
noteEscaped
protected final boolean noteEscaped(java.lang.String token) throws ManifoldCFException
Called for the start of every cdata-like tag, e.g. <![ <token> [ ... ]]>
- Overrides:
noteEscaped in class TagParseState
- Parameters:
token - may be empty.
- Returns:
true to halt further processing.
- Throws:
ManifoldCFException
-
noteEndEscaped
protected final boolean noteEndEscaped() throws ManifoldCFException
Called for the end of every cdata-like tag.
- Overrides:
noteEndEscaped in class TagParseState
- Returns:
true to halt further processing.
- Throws:
ManifoldCFException
-
noteBTagToken
protected final boolean noteBTagToken(java.lang.String token) throws ManifoldCFException
This method gets called for every token inside a btag.
- Overrides:
noteBTagToken in class TagParseState
- Returns:
true to halt further processing.
- Throws:
ManifoldCFException
-
noteEscapedCharacter
protected final boolean noteEscapedCharacter(char thisChar) throws ManifoldCFException
This method gets called for every character that is found within an escape block, e.g. CDATA. It is declared final in this class, so subclasses of HTMLParseState cannot override it.
- Overrides:
noteEscapedCharacter in class TagParseState
- Returns:
true to halt further processing.
- Throws:
ManifoldCFException
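Every callback on this page returns true to halt further processing. How a caller might honor that contract for a per-character callback can be sketched as follows. This is an illustration under assumptions, not the parser's actual loop: `CharCallback` and `HaltingFeeder` are hypothetical names.

```java
// Hypothetical per-character callback with the same return convention
// as the methods on this page: true means "halt further processing".
interface CharCallback {
  boolean note(char thisChar);
}

final class HaltingFeeder {
  // Feeds characters one at a time and stops as soon as the callback
  // returns true. Returns the number of characters handed to the callback.
  static int feed(String text, CharCallback callback) {
    for (int i = 0; i < text.length(); i++) {
      if (callback.note(text.charAt(i))) {
        return i + 1;
      }
    }
    return text.length();
  }

  public static void main(String[] args) {
    // Stop at the first ']' character.
    int consumed = feed("abc]def", ch -> ch == ']');
    System.out.println(consumed);
  }
}
```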