Class XMLFuzzyHierarchicalParseState
- java.lang.Object
-
- org.apache.manifoldcf.connectorcommon.fuzzyml.CharacterReceiver
-
- org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
-
- org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState
-
- org.apache.manifoldcf.connectorcommon.fuzzyml.XMLFuzzyParseState
-
- org.apache.manifoldcf.connectorcommon.fuzzyml.XMLFuzzyHierarchicalParseState
-
public class XMLFuzzyHierarchicalParseState extends XMLFuzzyParseState
Class to keep track of XML hierarchy in the face of possibly corrupt XML, with optional case-insensitive tag matching. This class accepts what is supposedly XML but tolerates various kinds of handwritten corruption. The specific kinds of errors allowed include:
- Bad character encoding
- Tag case mismatches; all attributes are (optionally) converted to lower case, and tag names are compared in lower case if the case-sensitive comparison fails
- End tag matching problems, where an end tag has been lost
- Other parsing recoveries, to be added as they arise
The functionality of this class is also somewhat reduced compared with standard SAX-type parsers. No namespace interpretation is done, for instance; tag qnames are split into namespace name and local name, and nothing more. If you need more, you can write a subclass that adds it.
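The end tag recovery described above can be pictured as a stack of open elements. The following is a minimal sketch of that idea, not the ManifoldCF implementation: the class `LenientTagStack` and its methods are hypothetical names, and the rule used here (try an exact match first, then a case-insensitive one, and implicitly close any unclosed inner elements) is an assumption about one reasonable recovery policy.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration; this class is not part of ManifoldCF.
final class LenientTagStack {
  // Open element names, most recently opened first.
  private final Deque<String> open = new ArrayDeque<>();

  void startTag(String name) {
    open.push(name);
  }

  // Returns the elements closed by this end tag, innermost first.
  // An empty list means the end tag matched nothing and was ignored.
  List<String> endTag(String name) {
    int depth = find(name, false);   // exact match first
    if (depth < 0) {
      depth = find(name, true);      // then fall back to ignoring case
    }
    List<String> closed = new ArrayList<>();
    for (int i = 0; i <= depth; i++) {
      closed.add(open.pop());        // also closes inner elements whose end tag was lost
    }
    return closed;
  }

  int depth() {
    return open.size();
  }

  // Position of the nearest matching open element, or -1 if there is none.
  private int find(String name, boolean ignoreCase) {
    int i = 0;
    for (String candidate : open) {
      boolean match = ignoreCase ? candidate.equalsIgnoreCase(name) : candidate.equals(name);
      if (match) {
        return i;
      }
      i++;
    }
    return -1;
  }
}
```

With `Doc`, `item` and `b` open, an end tag `ITEM` closes `b` (whose end tag was lost) and then `item`, leaving only `Doc` open; an end tag that matches no open element is ignored.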
-
-
Field Summary
Fields
Modifier and Type | Field | Description
protected boolean | captureEscaped | Whether we're capturing escaped characters
protected java.lang.StringBuilder | characterBuffer | The current value buffer
protected XMLParsingContext | currentContext | The current context
protected static int | MAX_CHUNK_SIZE | This is the maximum size of a chunk of characters getting sent to the characters() method.
-
Fields inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.XMLFuzzyParseState
lowerCaseAttributes, lowerCaseBTags, lowerCaseEscapeTags, lowerCaseQAttributes, lowerCaseQTags, lowerCaseTags
-
Fields inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState
accumBuffer, ampBuffer, bTagDepth, currentAttrList, currentAttrName, currentAttrNameBuffer, currentState, currentTagName, currentTagNameBuffer, currentValueBuffer, inAmpersand, mapLookup, TAGPARSESTATE_IN_ATTR_LOOKING_FOR_VALUE, TAGPARSESTATE_IN_ATTR_NAME, TAGPARSESTATE_IN_ATTR_VALUE, TAGPARSESTATE_IN_BANG_TOKEN, TAGPARSESTATE_IN_BRACKET_TOKEN, TAGPARSESTATE_IN_CDATA_BODY, TAGPARSESTATE_IN_COMMENT, TAGPARSESTATE_IN_DOUBLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_END_TAG_NAME, TAGPARSESTATE_IN_QTAG_ATTR_LOOKING_FOR_VALUE, TAGPARSESTATE_IN_QTAG_ATTR_NAME, TAGPARSESTATE_IN_QTAG_ATTR_VALUE, TAGPARSESTATE_IN_QTAG_DOUBLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_QTAG_NAME, TAGPARSESTATE_IN_QTAG_SAW_QUESTION, TAGPARSESTATE_IN_QTAG_SINGLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_QTAG_UNQUOTED_ATTR_VALUE, TAGPARSESTATE_IN_SINGLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_TAG_NAME, TAGPARSESTATE_IN_TAG_SAW_SLASH, TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE, TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE_SAW_SLASH, TAGPARSESTATE_NEED_FINAL_BRACKET, TAGPARSESTATE_NORMAL, TAGPARSESTATE_SAWCOMMENTDASH, TAGPARSESTATE_SAWDASH, TAGPARSESTATE_SAWEXCLAMATION, TAGPARSESTATE_SAWLEFTANGLE, TAGPARSESTATE_SAWRIGHTBRACKET, TAGPARSESTATE_SAWSECONDCOMMENTDASH, TAGPARSESTATE_SAWSECONDRIGHTBRACKET
-
Fields inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
charBuffer
-
-
Constructor Summary
Constructors
Constructor | Description
XMLFuzzyHierarchicalParseState() | Constructor with default properties.
XMLFuzzyHierarchicalParseState(boolean lowerCaseAttributes, boolean lowerCaseTags, boolean lowerCaseQAttributes, boolean lowerCaseQTags, boolean lowerCaseBTags, boolean lowerCaseEscapeTags) | Constructor.
-
Method Summary
Modifier and Type | Method | Description
protected void | appendToCharacterBuffer(char thisChar) |
void | cleanup() | Call this method to clean up completely after a parse attempt, whether it succeeded or failed.
void | finishUp() | Called at the end of everything.
protected void | flushCharacterBuffer() |
XMLParsingContext | getContext() |
protected boolean | noteEndEscaped() | Called for the end of every cdata-like tag.
protected boolean | noteEndTagEx(java.lang.String tagName, java.lang.String nameSpace, java.lang.String localName) | Note end tag.
protected boolean | noteEscapedCharacter(char thisChar) | This method gets called for every character that is found within an escape block, e.g. CDATA.
protected boolean | noteEscapedEx(java.lang.String token) | New version of the noteEscapedTag method.
protected boolean | noteNormalCharacter(char thisChar) | This method gets called for every character that is not part of a tag etc.
protected boolean | noteTagEx(java.lang.String tagName, java.lang.String nameSpace, java.lang.String localName, java.util.Map<java.lang.String,java.lang.String> attributes) | Map version of the noteTag method.
void | setContext(XMLParsingContext context) |
-
Methods inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.XMLFuzzyParseState
noteBTag, noteBTagEx, noteBTagToken, noteBTagTokenEx, noteEndTag, noteEscaped, noteQTag, noteQTagEx, noteTag
-
Methods inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState
acceptNewTag, attributeDecode, dealWithCharacter, dumpValues, isPunctuation, isWhitespace, mapChunk, newBuffer, noteEndBTag, outputAmpBuffer
-
Methods inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
dealWithCharacters, dealWithRemainder
-
-
-
-
Field Detail
-
currentContext
protected XMLParsingContext currentContext
The current context
-
characterBuffer
protected java.lang.StringBuilder characterBuffer
The current value buffer
-
captureEscaped
protected boolean captureEscaped
Whether we're capturing escaped characters
-
MAX_CHUNK_SIZE
protected static final int MAX_CHUNK_SIZE
This is the maximum size of a chunk of characters getting sent to the characters() method.
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
XMLFuzzyHierarchicalParseState
public XMLFuzzyHierarchicalParseState()
Constructor with default properties.
-
XMLFuzzyHierarchicalParseState
public XMLFuzzyHierarchicalParseState(boolean lowerCaseAttributes, boolean lowerCaseTags, boolean lowerCaseQAttributes, boolean lowerCaseQTags, boolean lowerCaseBTags, boolean lowerCaseEscapeTags)
Constructor.
-
-
Method Detail
-
setContext
public void setContext(XMLParsingContext context)
-
getContext
public XMLParsingContext getContext()
-
cleanup
public void cleanup() throws ManifoldCFException
Call this method to clean up completely after a parse attempt, whether it succeeded or failed.
- Throws:
ManifoldCFException
-
noteTagEx
protected boolean noteTagEx(java.lang.String tagName, java.lang.String nameSpace, java.lang.String localName, java.util.Map<java.lang.String,java.lang.String> attributes) throws ManifoldCFException
Map version of the noteTag method.
- Overrides:
noteTagEx in class XMLFuzzyParseState
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
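noteTagEx receives the tag's qname together with its namespace name and local name, and the class description says no further namespace interpretation is done. A minimal sketch of such a split follows. The class `SplitName` is hypothetical, and both the split on the first colon and the `null` namespace name for unprefixed tags are assumptions for illustration, not confirmed library behavior.

```java
// Hypothetical illustration; this class is not part of ManifoldCF.
final class SplitName {
  final String nameSpace;  // prefix before the colon; null when the qname has no colon (assumption)
  final String localName;  // everything after the first colon, or the whole qname

  private SplitName(String nameSpace, String localName) {
    this.nameSpace = nameSpace;
    this.localName = localName;
  }

  // Splits a qname such as "dc:title" purely textually; the prefix is not
  // resolved against any xmlns declaration.
  static SplitName split(String qName) {
    int colon = qName.indexOf(':');
    if (colon == -1) {
      return new SplitName(null, qName);
    }
    return new SplitName(qName.substring(0, colon), qName.substring(colon + 1));
  }
}
```

Under these assumptions, `SplitName.split("dc:title")` yields the namespace name `dc` and the local name `title`, while `SplitName.split("title")` yields a `null` namespace name.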
-
noteEndTagEx
protected boolean noteEndTagEx(java.lang.String tagName, java.lang.String nameSpace, java.lang.String localName) throws ManifoldCFException
Note end tag.
- Overrides:
noteEndTagEx in class XMLFuzzyParseState
- Throws:
ManifoldCFException
-
noteNormalCharacter
protected boolean noteNormalCharacter(char thisChar) throws ManifoldCFException
This method gets called for every character that is not part of a tag etc. Override this method to intercept such characters.
- Overrides:
noteNormalCharacter in class TagParseState
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
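Several of the note* callbacks on this page return true to halt further processing. The sketch below shows that convention in isolation: a driver loop hands characters to a handler one at a time and stops as soon as the handler returns true. `HaltingFeeder` and `feed` are hypothetical names, and the loop is an assumption about how such a return value is typically consumed, not a copy of the library's own driver.

```java
import java.util.function.Predicate;

// Hypothetical illustration; this class is not part of ManifoldCF.
final class HaltingFeeder {
  // Feeds characters to the handler in order and stops once the handler
  // returns true. Returns the number of characters the handler saw.
  static int feed(CharSequence input, Predicate<Character> handler) {
    for (int i = 0; i < input.length(); i++) {
      if (handler.test(input.charAt(i))) {
        return i + 1;  // halt requested; later characters are never delivered
      }
    }
    return input.length();
  }
}
```

For example, a handler that returns true on `';'` sees only the first four characters of `"abc;def"`.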
-
appendToCharacterBuffer
protected void appendToCharacterBuffer(char thisChar) throws ManifoldCFException
- Throws:
ManifoldCFException
-
flushCharacterBuffer
protected void flushCharacterBuffer() throws ManifoldCFException
- Throws:
ManifoldCFException
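appendToCharacterBuffer and flushCharacterBuffer carry no description here, but together with characterBuffer and MAX_CHUNK_SIZE they suggest a buffer that hands text onward in bounded chunks. The following sketch shows that pattern under stated assumptions: `ChunkedCharacterBuffer` is a hypothetical class, the chunk size is a constructor parameter rather than the real constant (whose value is not given on this page), and delivered chunks are collected in a list instead of being passed to a characters() method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; this class is not part of ManifoldCF.
final class ChunkedCharacterBuffer {
  private final int maxChunkSize;
  private final StringBuilder buffer = new StringBuilder();
  private final List<String> delivered = new ArrayList<>();

  ChunkedCharacterBuffer(int maxChunkSize) {
    if (maxChunkSize <= 0) {
      throw new IllegalArgumentException("maxChunkSize must be positive");
    }
    this.maxChunkSize = maxChunkSize;
  }

  // Buffers one character and flushes automatically once the chunk is full.
  void append(char c) {
    buffer.append(c);
    if (buffer.length() >= maxChunkSize) {
      flush();
    }
  }

  // Delivers whatever is buffered (if anything) and empties the buffer.
  void flush() {
    if (buffer.length() > 0) {
      delivered.add(buffer.toString());
      buffer.setLength(0);
    }
  }

  List<String> delivered() {
    return delivered;
  }
}
```

With a chunk size of 3, appending the characters of `"abcdefg"` and then calling `flush()` delivers the chunks `abc`, `def` and `g`. Bounding the chunk size keeps memory use flat for very long text runs, at the cost of a consumer possibly seeing one logical text node in several pieces.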
-
noteEscapedEx
protected boolean noteEscapedEx(java.lang.String token) throws ManifoldCFException
New version of the noteEscapedTag method.
- Overrides:
noteEscapedEx in class XMLFuzzyParseState
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteEscapedCharacter
protected boolean noteEscapedCharacter(char thisChar) throws ManifoldCFException
This method gets called for every character that is found within an escape block, e.g. CDATA. Override this method to intercept such characters.
- Overrides:
noteEscapedCharacter in class TagParseState
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteEndEscaped
protected boolean noteEndEscaped() throws ManifoldCFException
Called for the end of every cdata-like tag.
- Overrides:
noteEndEscaped in class TagParseState
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
finishUp
public void finishUp() throws ManifoldCFException
Called at the end of everything.
- Overrides:
finishUp in class CharacterReceiver
- Throws:
ManifoldCFException
-
-