Class XMLFuzzyHierarchicalParseState
java.lang.Object
  org.apache.manifoldcf.connectorcommon.fuzzyml.CharacterReceiver
    org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
      org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState
        org.apache.manifoldcf.connectorcommon.fuzzyml.XMLFuzzyParseState
          org.apache.manifoldcf.connectorcommon.fuzzyml.XMLFuzzyHierarchicalParseState

public class XMLFuzzyHierarchicalParseState extends XMLFuzzyParseState
Class to keep track of XML hierarchy in the face of possibly corrupt XML, case-insensitive tags, and similar problems. This class accepts what is supposedly XML but tolerates various kinds of handwritten corruption. The specific kinds of errors allowed include:
- Bad character encoding
- Tag case match problems: all attributes are (optionally) converted to lower case, and tag names are compared in lower case if the case-sensitive comparison did not match
- End tag matching problems, where an end tag has been lost
- Other parsing recoveries, to be added as they arise

The functionality of this class is also somewhat reduced compared with standard SAX-type parsers. For instance, no namespace interpretation is done; tag qnames are split into namespace name and local name, and nothing more. If you need more power, you can write a subclass that adds it.
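The tag-matching recoveries described above can be illustrated with a small, self-contained sketch. This is not the ManifoldCF implementation: the class `LenientTagStack` and its methods are hypothetical names introduced here, and the policy shown (try an exact match against the open elements first, then a lower-case match, and implicitly close any unclosed elements above the match) is only one plausible reading of the description.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;

// Hypothetical illustration only; not part of ManifoldCF.
final class LenientTagStack {
    // Open elements, most recently opened first.
    private final Deque<String> open = new ArrayDeque<>();

    void startTag(String name) {
        open.push(name);
    }

    // Closes the nearest matching open element together with any unclosed
    // elements nested inside it. Returns the number of elements closed,
    // or 0 when the end tag matches nothing (the stray tag is ignored).
    int endTag(String name) {
        int position = findMatch(name);
        if (position < 0) {
            return 0;
        }
        for (int i = 0; i <= position; i++) {
            open.pop();
        }
        return position + 1;
    }

    int depth() {
        return open.size();
    }

    // Case-sensitive pass first; fall back to a lower-case comparison.
    private int findMatch(String name) {
        int i = 0;
        for (String candidate : open) {
            if (candidate.equals(name)) {
                return i;
            }
            i++;
        }
        String lower = name.toLowerCase(Locale.ROOT);
        i = 0;
        for (String candidate : open) {
            if (candidate.toLowerCase(Locale.ROOT).equals(lower)) {
                return i;
            }
            i++;
        }
        return -1;
    }
}
```

With `Doc`, `Item`, and `b` open in that order, an end tag `item` would close both `b` (whose end tag was lost) and `Item` (matched only after lower-casing), leaving `Doc` open.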
-
-
Field Summary
Fields
Modifier and Type | Field | Description
protected boolean | captureEscaped | Whether we're capturing escaped characters
protected java.lang.StringBuilder | characterBuffer | The current value buffer
protected XMLParsingContext | currentContext | The current context
protected static int | MAX_CHUNK_SIZE | This is the maximum size of a chunk of characters getting sent to the characters() method.
Fields inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.XMLFuzzyParseState
lowerCaseAttributes, lowerCaseBTags, lowerCaseEscapeTags, lowerCaseQAttributes, lowerCaseQTags, lowerCaseTags
-
Fields inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState
accumBuffer, ampBuffer, bTagDepth, currentAttrList, currentAttrName, currentAttrNameBuffer, currentState, currentTagName, currentTagNameBuffer, currentValueBuffer, inAmpersand, mapLookup, TAGPARSESTATE_IN_ATTR_LOOKING_FOR_VALUE, TAGPARSESTATE_IN_ATTR_NAME, TAGPARSESTATE_IN_ATTR_VALUE, TAGPARSESTATE_IN_BANG_TOKEN, TAGPARSESTATE_IN_BRACKET_TOKEN, TAGPARSESTATE_IN_CDATA_BODY, TAGPARSESTATE_IN_COMMENT, TAGPARSESTATE_IN_DOUBLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_END_TAG_NAME, TAGPARSESTATE_IN_QTAG_ATTR_LOOKING_FOR_VALUE, TAGPARSESTATE_IN_QTAG_ATTR_NAME, TAGPARSESTATE_IN_QTAG_ATTR_VALUE, TAGPARSESTATE_IN_QTAG_DOUBLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_QTAG_NAME, TAGPARSESTATE_IN_QTAG_SAW_QUESTION, TAGPARSESTATE_IN_QTAG_SINGLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_QTAG_UNQUOTED_ATTR_VALUE, TAGPARSESTATE_IN_SINGLE_QUOTES_ATTR_VALUE, TAGPARSESTATE_IN_TAG_NAME, TAGPARSESTATE_IN_TAG_SAW_SLASH, TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE, TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE_SAW_SLASH, TAGPARSESTATE_NEED_FINAL_BRACKET, TAGPARSESTATE_NORMAL, TAGPARSESTATE_SAWCOMMENTDASH, TAGPARSESTATE_SAWDASH, TAGPARSESTATE_SAWEXCLAMATION, TAGPARSESTATE_SAWLEFTANGLE, TAGPARSESTATE_SAWRIGHTBRACKET, TAGPARSESTATE_SAWSECONDCOMMENTDASH, TAGPARSESTATE_SAWSECONDRIGHTBRACKET
-
Fields inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
charBuffer
-
-
Constructor Summary
Constructors
Constructor | Description
XMLFuzzyHierarchicalParseState() | Constructor with default properties.
XMLFuzzyHierarchicalParseState(boolean lowerCaseAttributes, boolean lowerCaseTags, boolean lowerCaseQAttributes, boolean lowerCaseQTags, boolean lowerCaseBTags, boolean lowerCaseEscapeTags) | Constructor.
-
Method Summary
Modifier and Type | Method | Description
protected void | appendToCharacterBuffer(char thisChar) |
void | cleanup() | Call this method to clean up completely after a parse attempt, whether successful or failure.
void | finishUp() | Called at the end of everything.
protected void | flushCharacterBuffer() |
XMLParsingContext | getContext() |
protected boolean | noteEndEscaped() | Called for the end of every cdata-like tag.
protected boolean | noteEndTagEx(java.lang.String tagName, java.lang.String nameSpace, java.lang.String localName) | Note end tag.
protected boolean | noteEscapedCharacter(char thisChar) | This method gets called for every character that is found within an escape block, e.g. CDATA.
protected boolean | noteEscapedEx(java.lang.String token) | New version of the noteEscapedTag method.
protected boolean | noteNormalCharacter(char thisChar) | This method gets called for every character that is not part of a tag etc.
protected boolean | noteTagEx(java.lang.String tagName, java.lang.String nameSpace, java.lang.String localName, java.util.Map<java.lang.String,java.lang.String> attributes) | Map version of the noteTag method.
void | setContext(XMLParsingContext context) |
Methods inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.XMLFuzzyParseState
noteBTag, noteBTagEx, noteBTagToken, noteBTagTokenEx, noteEndTag, noteEscaped, noteQTag, noteQTagEx, noteTag
-
Methods inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState
acceptNewTag, attributeDecode, dealWithCharacter, dumpValues, isPunctuation, isWhitespace, mapChunk, newBuffer, noteEndBTag, outputAmpBuffer
-
Methods inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
dealWithCharacters, dealWithRemainder
-
-
-
-
Field Detail
-
currentContext
protected XMLParsingContext currentContext
The current context
-
characterBuffer
protected java.lang.StringBuilder characterBuffer
The current value buffer
-
captureEscaped
protected boolean captureEscaped
Whether we're capturing escaped characters
-
MAX_CHUNK_SIZE
protected static final int MAX_CHUNK_SIZE
This is the maximum size of a chunk of characters getting sent to the characters() method.
See Also: Constant Field Values
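The interplay between characterBuffer, MAX_CHUNK_SIZE, appendToCharacterBuffer and flushCharacterBuffer is not spelled out on this page. The sketch below shows one way such bounded chunking could work; it is an illustration under assumptions, not the class's actual code. `ChunkedCharacterBuffer`, its `sink` callback (standing in for the context's characters() method), and the constructor-supplied chunk size are all hypothetical, and the real value of MAX_CHUNK_SIZE is not given here.

```java
import java.util.function.Consumer;

// Hypothetical illustration only; not part of ManifoldCF.
final class ChunkedCharacterBuffer {
    private final int maxChunkSize;      // stand-in for MAX_CHUNK_SIZE (real value not stated)
    private final Consumer<String> sink; // stand-in for a characters() callback
    private StringBuilder buffer;        // allocated lazily, dropped on flush

    ChunkedCharacterBuffer(int maxChunkSize, Consumer<String> sink) {
        this.maxChunkSize = maxChunkSize;
        this.sink = sink;
    }

    // Buffers one character and emits a chunk as soon as the limit is reached,
    // so no single delivery exceeds maxChunkSize characters.
    void append(char thisChar) {
        if (buffer == null) {
            buffer = new StringBuilder();
        }
        buffer.append(thisChar);
        if (buffer.length() >= maxChunkSize) {
            flush();
        }
    }

    // Delivers whatever is pending; does nothing when the buffer is empty.
    void flush() {
        if (buffer != null && buffer.length() > 0) {
            sink.accept(buffer.toString());
        }
        buffer = null;
    }
}
```

In a design like this, a flush would also be needed whenever a start tag, end tag, or the end of input is reached, so that text is delivered before the event that follows it.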
-
-
Constructor Detail
-
XMLFuzzyHierarchicalParseState
public XMLFuzzyHierarchicalParseState()
Constructor with default properties.
-
XMLFuzzyHierarchicalParseState
public XMLFuzzyHierarchicalParseState(boolean lowerCaseAttributes, boolean lowerCaseTags, boolean lowerCaseQAttributes, boolean lowerCaseQTags, boolean lowerCaseBTags, boolean lowerCaseEscapeTags)
Constructor.
-
-
Method Detail
-
setContext
public void setContext(XMLParsingContext context)
-
getContext
public XMLParsingContext getContext()
-
cleanup
public void cleanup() throws ManifoldCFException
Call this method to clean up completely after a parse attempt, whether it succeeded or failed.
Throws: ManifoldCFException
-
noteTagEx
protected boolean noteTagEx(java.lang.String tagName, java.lang.String nameSpace, java.lang.String localName, java.util.Map<java.lang.String,java.lang.String> attributes) throws ManifoldCFException
Map version of the noteTag method.
Overrides: noteTagEx in class XMLFuzzyParseState
Returns: true to halt further processing.
Throws: ManifoldCFException
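The tagName, nameSpace and localName parameters correspond to the qname split mentioned in the class description. As a rough sketch of that relationship, assuming the split is made at the first colon and that an unprefixed tag has a null namespace name (both assumptions, since this page does not state them), a hypothetical helper `QNameParts` could look like this:

```java
// Hypothetical illustration only; not part of ManifoldCF.
final class QNameParts {
    final String nameSpace; // prefix before the colon, or null if none (assumption)
    final String localName; // remainder of the tag name

    private QNameParts(String nameSpace, String localName) {
        this.nameSpace = nameSpace;
        this.localName = localName;
    }

    // Splits a qualified tag name such as "dc:title" purely textually;
    // the prefix is not resolved to a namespace URI.
    static QNameParts split(String tagName) {
        int index = tagName.indexOf(':');
        if (index == -1) {
            return new QNameParts(null, tagName);
        }
        return new QNameParts(tagName.substring(0, index), tagName.substring(index + 1));
    }
}
```

Because the prefix is never mapped to a URI, two documents that bind different prefixes to the same namespace would yield different nameSpace values; a subclass that needs real namespace handling would have to track xmlns attributes itself.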
-
noteEndTagEx
protected boolean noteEndTagEx(java.lang.String tagName, java.lang.String nameSpace, java.lang.String localName) throws ManifoldCFException
Note end tag.
Overrides: noteEndTagEx in class XMLFuzzyParseState
Throws: ManifoldCFException
-
noteNormalCharacter
protected boolean noteNormalCharacter(char thisChar) throws ManifoldCFException
This method gets called for every character that is not part of a tag etc. Override this method to intercept such characters.
Overrides: noteNormalCharacter in class TagParseState
Returns: true to halt further processing.
Throws: ManifoldCFException
-
appendToCharacterBuffer
protected void appendToCharacterBuffer(char thisChar) throws ManifoldCFException
Throws: ManifoldCFException
-
flushCharacterBuffer
protected void flushCharacterBuffer() throws ManifoldCFException
Throws: ManifoldCFException
-
noteEscapedEx
protected boolean noteEscapedEx(java.lang.String token) throws ManifoldCFException
New version of the noteEscapedTag method.
Overrides: noteEscapedEx in class XMLFuzzyParseState
Returns: true to halt further processing.
Throws: ManifoldCFException
-
noteEscapedCharacter
protected boolean noteEscapedCharacter(char thisChar) throws ManifoldCFException
This method gets called for every character that is found within an escape block, e.g. CDATA. Override this method to intercept such characters.
Overrides: noteEscapedCharacter in class TagParseState
Returns: true to halt further processing.
Throws: ManifoldCFException
-
noteEndEscaped
protected boolean noteEndEscaped() throws ManifoldCFException
Called for the end of every cdata-like tag.
Overrides: noteEndEscaped in class TagParseState
Returns: true to halt further processing.
Throws: ManifoldCFException
-
finishUp
public void finishUp() throws ManifoldCFException
Called at the end of everything.
Overrides: finishUp in class CharacterReceiver
Throws: ManifoldCFException
-
-