Class XMLFuzzyParseState
- java.lang.Object
  - org.apache.manifoldcf.connectorcommon.fuzzyml.CharacterReceiver
    - org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
      - org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState
        - org.apache.manifoldcf.connectorcommon.fuzzyml.XMLFuzzyParseState
-
- Direct Known Subclasses:
XMLFuzzyHierarchicalParseState
public class XMLFuzzyParseState extends TagParseState
Class to keep track of XML hierarchy in the face of possibly corrupt XML, with case-insensitive tags, etc. This class accepts what is supposedly XML but tolerates various kinds of handwritten corruption. Specific kinds of errors allowed include:
- Bad character encoding
- Tag case match problems; all attributes are (optionally) bashed to lower case
- Other parsing recoveries, to be added as they arise
The functionality of this class is also somewhat reduced compared with standard SAX-type parsers. No namespace interpretation is done, for instance; tag qnames are split into namespace name and local name, and nothing more. If you need more power, you can write a class extension that provides it.
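To make the "split into namespace name and local name" behavior concrete, here is a minimal, self-contained sketch. QNameSketch is a hypothetical helper, not part of ManifoldCF. Splitting at the first colon and using null for a missing prefix are assumptions made for illustration, not confirmed details of this class.

```java
import java.util.Locale;

// Hypothetical helper, NOT part of ManifoldCF: illustrates splitting a tag
// qname into a namespace name and a local name, with optional lower-casing.
final class QNameSketch {
    final String nameSpace;  // assumption: null when the qname has no prefix
    final String localName;

    private QNameSketch(String nameSpace, String localName) {
        this.nameSpace = nameSpace;
        this.localName = localName;
    }

    // Optionally lower-cases the qname, then splits it at the first colon.
    static QNameSketch split(String qname, boolean lowerCase) {
        String name = lowerCase ? qname.toLowerCase(Locale.ROOT) : qname;
        int colon = name.indexOf(':');
        if (colon == -1) {
            return new QNameSketch(null, name);
        }
        return new QNameSketch(name.substring(0, colon), name.substring(colon + 1));
    }

    public static void main(String[] args) {
        QNameSketch q = split("DC:Title", true);
        System.out.println(q.nameSpace + " / " + q.localName);  // prints "dc / title"
    }
}
```

No namespace URI lookup happens here, which mirrors the statement above that no namespace interpretation is done.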
-
-
Field Summary
Fields
Modifier and Type    Field
protected boolean    lowerCaseAttributes
protected boolean    lowerCaseBTags
protected boolean    lowerCaseEscapeTags
protected boolean    lowerCaseQAttributes
protected boolean    lowerCaseQTags
protected boolean    lowerCaseTags
-
Fields inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState
accumBuffer, ampBuffer, bTagDepth, currentAttrList, currentAttrName, currentAttrNameBuffer, currentState, currentTagName, currentTagNameBuffer, currentValueBuffer, inAmpersand, mapLookup, TAGPARSESTATE_IN_ATTR_LOOKING_FOR_VALUE, TAGPARSESTATE_IN_ATTR_NAME, TAGPARSESTATE_IN_ATTR_VALUE, TAGPARSESTATE_IN_BANG_TOKEN, TAGPARSESTATE_IN_BRACKET_TOKEN, TAGPARSESTATE_IN_CDATA_BODY, TAGPARSESTATE_IN_COMMENT, TAGPARSESTATE_IN_DOUBLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_END_TAG_NAME, TAGPARSESTATE_IN_QTAG_ATTR_LOOKING_FOR_VALUE, TAGPARSESTATE_IN_QTAG_ATTR_NAME, TAGPARSESTATE_IN_QTAG_ATTR_VALUE, TAGPARSESTATE_IN_QTAG_DOUBLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_QTAG_NAME, TAGPARSESTATE_IN_QTAG_SAW_QUESTION, TAGPARSESTATE_IN_QTAG_SINGLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_QTAG_UNQUOTED_ATTR_VALUE, TAGPARSESTATE_IN_SINGLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_TAG_NAME, TAGPARSESTATE_IN_TAG_SAW_SLASH, TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE, TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE_SAW_SLASH, TAGPARSESTATE_NEED_FINAL_BRACKET, TAGPARSESTATE_NORMAL, TAGPARSESTATE_SAWCOMMENTDASH, TAGPARSESTATE_SAWDASH, TAGPARSESTATE_SAWEXCLAMATION, TAGPARSESTATE_SAWLEFTANGLE, TAGPARSESTATE_SAWRIGHTBRACKET, TAGPARSESTATE_SAWSECONDCOMMENTDASH, TAGPARSESTATE_SAWSECONDRIGHTBRACKET
-
Fields inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
charBuffer
-
-
Constructor Summary
Constructors
XMLFuzzyParseState(boolean lowerCaseAttributes, boolean lowerCaseTags, boolean lowerCaseQAttributes, boolean lowerCaseQTags, boolean lowerCaseBTags, boolean lowerCaseEscapeTags)
Constructor.
-
Method Summary
Modifier and Type, Method, Description
protected boolean noteBTag(java.lang.String tagName)
This method is called for every <! <token> ... > construct, or 'btag'.
protected boolean noteBTagEx(java.lang.String tagName)
New version of the noteBTag method.
protected boolean noteBTagToken(java.lang.String token)
This method gets called for every token inside a btag.
protected boolean noteBTagTokenEx(java.lang.String token)
New version of the noteBTagToken method.
protected boolean noteEndTag(java.lang.String tagName)
This method gets called for every end tag.
protected boolean noteEndTagEx(java.lang.String tagName, java.lang.String nameSpace, java.lang.String localName)
Note end tag.
protected boolean noteEscaped(java.lang.String token)
Called for the start of every CDATA-like tag, e.g. <![ <token> [ ... ]]>
protected boolean noteEscapedEx(java.lang.String token)
New version of the noteEscaped method.
protected boolean noteQTag(java.lang.String tagName, java.util.List<AttrNameValue> attributes)
This method is called for every <? ... ?> construct, or 'qtag'.
protected boolean noteQTagEx(java.lang.String tagName, java.util.Map<java.lang.String,java.lang.String> attributes)
Map version of the noteQTag method.
protected boolean noteTag(java.lang.String tagName, java.util.List<AttrNameValue> attributes)
This method gets called for every tag.
protected boolean noteTagEx(java.lang.String tagName, java.lang.String nameSpace, java.lang.String localName, java.util.Map<java.lang.String,java.lang.String> attributes)
Map version of the noteTag method.
-
Methods inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState
acceptNewTag, attributeDecode, dealWithCharacter, dumpValues, isPunctuation, isWhitespace, mapChunk, newBuffer, noteEndBTag, noteEndEscaped, noteEscapedCharacter, noteNormalCharacter, outputAmpBuffer
-
Methods inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
dealWithCharacters, dealWithRemainder
-
Methods inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.CharacterReceiver
finishUp
-
-
-
-
Field Detail
-
lowerCaseAttributes
protected final boolean lowerCaseAttributes
-
lowerCaseTags
protected final boolean lowerCaseTags
-
lowerCaseQAttributes
protected final boolean lowerCaseQAttributes
-
lowerCaseQTags
protected final boolean lowerCaseQTags
-
lowerCaseBTags
protected final boolean lowerCaseBTags
-
lowerCaseEscapeTags
protected final boolean lowerCaseEscapeTags
-
-
Method Detail
-
noteTag
protected final boolean noteTag(java.lang.String tagName, java.util.List<AttrNameValue> attributes) throws ManifoldCFException
This method gets called for every tag. It is declared final in this class, so subclasses intercept tag begins by overriding noteTagEx.
- Overrides:
noteTag in class TagParseState
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteTagEx
protected boolean noteTagEx(java.lang.String tagName, java.lang.String nameSpace, java.lang.String localName, java.util.Map<java.lang.String,java.lang.String> attributes) throws ManifoldCFException
Map version of the noteTag method.
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
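The pairing of a final noteTag with an overridable noteTagEx suggests a template-method arrangement: the final method normalizes its input and hands off to the Ex hook, whose boolean result decides whether processing halts. The sketch below illustrates that arrangement with hypothetical stand-in classes (MiniFuzzyState, TitleFinder, TitleFinderDemo). It is not the ManifoldCF source. The normalization steps, the String[] attribute pairs, and the first-colon split are all assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the parse state; NOT the ManifoldCF class.
abstract class MiniFuzzyState {
    private final boolean lowerCaseTags;
    private final boolean lowerCaseAttributes;

    MiniFuzzyState(boolean lowerCaseTags, boolean lowerCaseAttributes) {
        this.lowerCaseTags = lowerCaseTags;
        this.lowerCaseAttributes = lowerCaseAttributes;
    }

    // Final entry point: normalizes case, converts the attribute list to a
    // map, splits the qname (assumed: at the first colon), then delegates.
    final boolean noteTag(String tagName, List<String[]> attributes) {
        String name = lowerCaseTags ? tagName.toLowerCase(Locale.ROOT) : tagName;
        Map<String, String> map = new LinkedHashMap<>();
        for (String[] pair : attributes) {
            String key = lowerCaseAttributes ? pair[0].toLowerCase(Locale.ROOT) : pair[0];
            map.put(key, pair[1]);
        }
        int colon = name.indexOf(':');
        String nameSpace = colon == -1 ? null : name.substring(0, colon);
        String localName = colon == -1 ? name : name.substring(colon + 1);
        return noteTagEx(name, nameSpace, localName, map);
    }

    // Overridable hook; returning true asks the caller to halt processing.
    protected boolean noteTagEx(String tagName, String nameSpace, String localName,
                                Map<String, String> attributes) {
        return false;
    }
}

// Hypothetical subclass: halts at the first title tag and records its lang.
final class TitleFinder extends MiniFuzzyState {
    String lang;

    TitleFinder() {
        super(true, true);
    }

    @Override
    protected boolean noteTagEx(String tagName, String nameSpace, String localName,
                                Map<String, String> attributes) {
        if (!localName.equals("title")) {
            return false;
        }
        lang = attributes.get("lang");
        return true;
    }
}

final class TitleFinderDemo {
    public static void main(String[] args) {
        List<String[]> attrs = new ArrayList<>();
        attrs.add(new String[] {"LANG", "en"});
        TitleFinder finder = new TitleFinder();
        System.out.println(finder.noteTag("Body", attrs));      // prints "false"
        System.out.println(finder.noteTag("DC:TITLE", attrs));  // prints "true"
        System.out.println(finder.lang);                        // prints "en"
    }
}
```

Because the case handling lives in the final method, the subclass can compare against lower-case literals such as "title" and "lang" without repeating the normalization.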
-
noteEndTag
protected final boolean noteEndTag(java.lang.String tagName) throws ManifoldCFException
This method gets called for every end tag. It is declared final in this class, so subclasses intercept tag ends by overriding noteEndTagEx.
- Overrides:
noteEndTag in class TagParseState
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteEndTagEx
protected boolean noteEndTagEx(java.lang.String tagName, java.lang.String nameSpace, java.lang.String localName) throws ManifoldCFException
Note end tag.
- Throws:
ManifoldCFException
-
noteQTag
protected final boolean noteQTag(java.lang.String tagName, java.util.List<AttrNameValue> attributes) throws ManifoldCFException
This method is called for every <? ... ?> construct, or 'qtag'. This is not useful for HTML.
- Overrides:
noteQTag in class TagParseState
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteQTagEx
protected boolean noteQTagEx(java.lang.String tagName, java.util.Map<java.lang.String,java.lang.String> attributes) throws ManifoldCFException
Map version of the noteQTag method.
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteBTag
protected final boolean noteBTag(java.lang.String tagName) throws ManifoldCFException
This method is called for every <! <token> ... > construct, or 'btag'. It is declared final in this class, so subclasses intercept these by overriding noteBTagEx.
- Overrides:
noteBTag in class TagParseState
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteBTagEx
protected boolean noteBTagEx(java.lang.String tagName) throws ManifoldCFException
New version of the noteBTag method.
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteEscaped
protected final boolean noteEscaped(java.lang.String token) throws ManifoldCFException
Called for the start of every CDATA-like tag, e.g. <![ <token> [ ... ]]>
- Overrides:
noteEscaped in class TagParseState
- Parameters:
token - may be empty
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteEscapedEx
protected boolean noteEscapedEx(java.lang.String token) throws ManifoldCFException
New version of the noteEscaped method.
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
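Since the token passed to the escaped-section callbacks may be empty, a handler should check for that before comparing it. The following is a minimal sketch of such a guard. EscapedTokenSketch and isCData are hypothetical names, not part of ManifoldCF, and treating the lowerCaseEscapeTags flag as "compare case-insensitively" is an assumption about how the flag would be used.

```java
import java.util.Locale;

// Hypothetical helper, NOT part of ManifoldCF.
final class EscapedTokenSketch {
    // Returns true only for a CDATA token; a null or empty token is
    // reported as "not CDATA" rather than compared.
    static boolean isCData(String token, boolean lowerCaseEscapeTags) {
        if (token == null || token.isEmpty()) {
            return false;
        }
        if (lowerCaseEscapeTags) {
            // Assumption: with the flag set, case differences are ignored.
            return token.toLowerCase(Locale.ROOT).equals("cdata");
        }
        return token.equals("CDATA");
    }

    public static void main(String[] args) {
        System.out.println(isCData("", true));        // prints "false"
        System.out.println(isCData("CData", true));   // prints "true"
        System.out.println(isCData("CData", false));  // prints "false"
    }
}
```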
-
noteBTagToken
protected final boolean noteBTagToken(java.lang.String token) throws ManifoldCFException
This method gets called for every token inside a btag.
- Overrides:
noteBTagToken in class TagParseState
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteBTagTokenEx
protected boolean noteBTagTokenEx(java.lang.String token) throws ManifoldCFException
New version of the noteBTagToken method.
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
-