Interface IProcessActivity
-
- All Superinterfaces:
IAbortActivity,ICarrydownActivity,IEventActivity,IFingerprintActivity,IHistoryActivity,INamingActivity
- All Known Implementing Classes:
WorkerThread.ProcessActivity
public interface IProcessActivity extends IHistoryActivity, IEventActivity, IAbortActivity, IFingerprintActivity, ICarrydownActivity
This interface abstracts from the activities that a connector's processDocuments() method can do. The processing flow for a document is expected to go something like this:
(1) The connector's processDocuments() method is called with a set of documents to be processed.
(2) The connector computes a version string for each document in the set as part of determining whether the document indeed needs to be refetched.
(3) For each document processed, there can be one of several dispositions:
(a) There is no such document (anymore): deleteDocument() called for the document.
(b) The document is (re)indexed: ingestDocumentWithException() is called for the document.
(c) The document is determined to be unchanged and no updates are needed: nothing needs to be called for the document.
(d) The document is determined to be unchanged BUT the version string needs to be updated: recordDocument() is called for the document.
(e) The document is determined to be unindexable BUT it still exists in the repository: noDocument() is called for the document.
(f) There was a service interruption: ServiceInterruption is thrown.
(4) In order to determine whether a document needs to be reindexed, the method checkDocumentNeedsReindexing() is available to return an opinion on that matter.
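The disposition logic in step (3) can be sketched as a single decision function. This is a minimal illustration, not ManifoldCF code: ProcessActivitySketch is a hypothetical, trimmed-down stand-in for IProcessActivity (no checked exceptions, and ingest() takes a URI string instead of a RepositoryDocument), and RepoEntry and RecordingActivity are likewise invented for the example. Dispositions (d) and (f) are left out because they depend on connector-specific knowledge.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for IProcessActivity so the sketch compiles without
// ManifoldCF on the classpath. The real methods declare ManifoldCFException
// (some also ServiceInterruption), and the real ingest method takes a
// RepositoryDocument; both are simplified away here.
interface ProcessActivitySketch {
  boolean checkDocumentNeedsReindexing(String documentIdentifier, String newVersionString);
  void deleteDocument(String documentIdentifier);
  void noDocument(String documentIdentifier, String version);
  void ingest(String documentIdentifier, String version, String documentURI);
}

// Hypothetical snapshot of what the repository reports for one document.
final class RepoEntry {
  final boolean indexable;
  final String version;
  final String uri;

  RepoEntry(boolean indexable, String version, String uri) {
    this.indexable = indexable;
    this.version = version;
    this.uri = uri;
  }
}

class DispositionSketch {
  // entry == null models "there is no such document (anymore)".
  static void processOne(String id, RepoEntry entry, ProcessActivitySketch activities) {
    if (entry == null) {
      activities.deleteDocument(id);                    // disposition (a)
    } else if (!activities.checkDocumentNeedsReindexing(id, entry.version)) {
      // disposition (c): unchanged, nothing needs to be called
    } else if (!entry.indexable) {
      activities.noDocument(id, entry.version);         // disposition (e)
    } else {
      activities.ingest(id, entry.version, entry.uri);  // disposition (b)
    }
  }
}

// In-memory recorder used only to exercise the sketch.
class RecordingActivity implements ProcessActivitySketch {
  final List<String> calls = new ArrayList<>();
  private final Map<String, String> versions = new HashMap<>();

  public boolean checkDocumentNeedsReindexing(String id, String newVersion) {
    return !newVersion.equals(versions.get(id));
  }

  public void deleteDocument(String id) {
    versions.remove(id);
    calls.add("delete " + id);
  }

  public void noDocument(String id, String version) {
    versions.put(id, version);
    calls.add("noDocument " + id);
  }

  public void ingest(String id, String version, String uri) {
    versions.put(id, version);
    calls.add("ingest " + id);
  }
}
```

Processing the same unchanged document twice with a RecordingActivity results in one ingest call, since the second pass falls into disposition (c).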
-
-
Field Summary
Fields
Modifier and Type, Field, Description:
static java.lang.String _rcsid
Fields inherited from interface org.apache.manifoldcf.crawler.interfaces.IHistoryActivity
BAD_URL, EXCLUDED_CONTENT, EXCLUDED_DATE, EXCLUDED_LENGTH, EXCLUDED_MIMETYPE, EXCLUDED_URL, NULL_URL
-
-
Method Summary
All Methods, Instance Methods, Abstract Methods
Modifier and Type, Method, Description:
void addDocumentReference(java.lang.String documentIdentifier)
    Add a document description to the current job's queue.
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType)
    Add a document description to the current job's queue.
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues)
    Add a document description to the current job's queue.
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime)
    Add a document description to the current job's queue.
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime, java.lang.String[] prereqEventNames)
    Add a document description to the current job's queue.
boolean checkDocumentNeedsReindexing(java.lang.String documentIdentifier, java.lang.String newVersionString)
    Check if a document needs to be reindexed, based on a computed version string.
boolean checkDocumentNeedsReindexing(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String newVersionString)
    Check if a document needs to be reindexed, based on a computed version string.
void deleteDocument(java.lang.String documentIdentifier)
    Delete the specified document permanently from the search engine index, and from the status table, along with all its components.
void ingestDocumentWithException(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data)
    Ingest the current document.
void ingestDocumentWithException(java.lang.String documentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data)
    Ingest the current document.
void noDocument(java.lang.String documentIdentifier, java.lang.String version)
    Remove the specified document from the search engine index, and update the recorded version information for the document.
void noDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version)
    Remove the specified document from the search engine index, and update the recorded version information for the document.
void recordDocument(java.lang.String documentIdentifier, java.lang.String version)
    Record a document version, WITHOUT reindexing it, or removing it.
void recordDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version)
    Record a document version, WITHOUT reindexing it, or removing it.
void removeDocument(java.lang.String documentIdentifier)
    Remove the specified document primary component permanently from the search engine index, and from the status table.
void retainAllComponentDocument(java.lang.String documentIdentifier)
    Retain all existing document components of a primary document.
void retainDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier)
    Retain existing document component.
void setDocumentOriginationTime(java.lang.String documentIdentifier, java.lang.Long originationTime)
    Override a document's origination time.
void setDocumentScheduleBounds(java.lang.String documentIdentifier, java.lang.Long lowerRecrawlBoundTime, java.lang.Long upperRecrawlBoundTime, java.lang.Long lowerExpireBoundTime, java.lang.Long upperExpireBoundTime)
    Override the schedule for the next time a document is crawled.
Methods inherited from interface org.apache.manifoldcf.crawler.interfaces.IAbortActivity
checkJobStillActive
-
Methods inherited from interface org.apache.manifoldcf.crawler.interfaces.ICarrydownActivity
retrieveParentData, retrieveParentDataAsFiles
-
Methods inherited from interface org.apache.manifoldcf.crawler.interfaces.IEventActivity
beginEventSequence, completeEventSequence, retryDocumentProcessing
-
Methods inherited from interface org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity
checkDateIndexable, checkDocumentIndexable, checkLengthIndexable, checkMimeTypeIndexable, checkURLIndexable
-
Methods inherited from interface org.apache.manifoldcf.crawler.interfaces.IHistoryActivity
recordActivity
-
Methods inherited from interface org.apache.manifoldcf.crawler.interfaces.INamingActivity
createConnectionSpecificString, createGlobalString, createJobSpecificString
-
-
-
-
Field Detail
-
_rcsid
static final java.lang.String _rcsid
- See Also:
- Constant Field Values
-
-
Method Detail
-
checkDocumentNeedsReindexing
boolean checkDocumentNeedsReindexing(java.lang.String documentIdentifier, java.lang.String newVersionString) throws ManifoldCFException
Check if a document needs to be reindexed, based on a computed version string. Call this method to determine whether reindexing is necessary. Pass in a newly-computed version string. This method will return "true" if the document needs to be re-indexed.
- Parameters:
documentIdentifier - is the document identifier.
newVersionString - is the newly-computed version string.
- Returns:
- true if the document needs to be reindexed.
- Throws:
ManifoldCFException
-
checkDocumentNeedsReindexing
boolean checkDocumentNeedsReindexing(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String newVersionString) throws ManifoldCFException
Check if a document needs to be reindexed, based on a computed version string. Call this method to determine whether reindexing is necessary. Pass in a newly-computed version string. This method will return "true" if the document needs to be re-indexed.
- Parameters:
documentIdentifier - is the document identifier.
componentIdentifier - is the component document identifier, if any.
newVersionString - is the newly-computed version string.
- Returns:
- true if the document needs to be reindexed.
- Throws:
ManifoldCFException
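As an illustration of what "a newly-computed version string" might look like and how the component identifier scopes the comparison, here is a small in-memory model. It is only a sketch under stated assumptions: the version format (last-modified time plus length) is one possible choice for a connector, and the comparison rule (reindex when the new string differs from the last recorded one for the same document and component pair) is an assumption about the framework's behavior, not something this page specifies. VersionCheckSketch and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

class VersionCheckSketch {
  // Assumed model of recorded versions, keyed by (documentIdentifier, componentIdentifier).
  // A null componentIdentifier stands for the primary document.
  private final Map<List<String>, String> recorded = new HashMap<>();

  // One possible version string: anything that changes whenever the content changes.
  static String computeVersion(long lastModifiedMillis, long lengthBytes) {
    return lastModifiedMillis + ":" + lengthBytes;
  }

  // Models what ingesting or recording a document would store.
  void record(String documentIdentifier, String componentIdentifier, String version) {
    recorded.put(Arrays.asList(documentIdentifier, componentIdentifier), version);
  }

  // Assumed rule: reindex when nothing is recorded yet or the string differs.
  boolean needsReindexing(String documentIdentifier, String componentIdentifier, String newVersion) {
    String old = recorded.get(Arrays.asList(documentIdentifier, componentIdentifier));
    return !Objects.equals(old, newVersion);
  }
}
```

In this model, recording a version for the primary document does not affect a component of the same document, which is the distinction the three-argument overload exists to express.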
-
addDocumentReference
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime, java.lang.String[] prereqEventNames) throws ManifoldCFException
Add a document description to the current job's queue.
- Parameters:
documentIdentifier - is the local document identifier to add (for the connector that fetched the document).
parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering desired for this kind of relationship. MUST be present in the case of carrydown information.
relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
dataNames - is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters!
dataValues - are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null. The type of each object must either be a String, or a CharacterInput.
originationTime - is the time, in ms since epoch, that the document originated. Pass null if none or unknown.
prereqEventNames - are the names of the prerequisite events which this document requires prior to processing. Pass null if none.
- Throws:
ManifoldCFException
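The dataNames and dataValues parameters are parallel arrays: dataValues[i] holds the values for dataNames[i]. The following sketch shows one way to build them from a map while enforcing the 255-character name limit stated above. CarrydownSketch is a hypothetical helper, not part of ManifoldCF, and it only handles String values (the CharacterInput case is omitted).

```java
import java.util.List;
import java.util.Map;

class CarrydownSketch {
  static final int MAX_NAME_LENGTH = 255;  // limit stated in the parameter description

  final String[] dataNames;
  final Object[][] dataValues;

  // Builds parallel arrays: dataValues[i] holds the values for dataNames[i].
  CarrydownSketch(Map<String, List<String>> carrydown) {
    dataNames = new String[carrydown.size()];
    dataValues = new Object[carrydown.size()][];
    int i = 0;
    for (Map.Entry<String, List<String>> entry : carrydown.entrySet()) {
      String name = entry.getKey();
      if (name.length() > MAX_NAME_LENGTH) {
        throw new IllegalArgumentException("Carry-down name longer than 255 characters: " + name.length());
      }
      dataNames[i] = name;
      dataValues[i] = entry.getValue().toArray();  // String values only in this sketch
      i++;
    }
  }
}
```

A connector could then pass the two arrays along with a non-null parent, for example activities.addDocumentReference(childId, parentId, relationshipType, c.dataNames, c.dataValues), where activities, childId, parentId, and relationshipType are whatever the connector has in scope.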
-
addDocumentReference
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime) throws ManifoldCFException
Add a document description to the current job's queue.
- Parameters:
documentIdentifier - is the document identifier to add (for the connector that fetched the document).
parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering desired for this kind of relationship. MUST be present in the case of carrydown information.
relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
dataNames - is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters!
dataValues - are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null. The type of each object must either be a String, or a CharacterInput.
originationTime - is the time, in ms since epoch, that the document originated. Pass null if none or unknown.
- Throws:
ManifoldCFException
-
addDocumentReference
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues) throws ManifoldCFException
Add a document description to the current job's queue.
- Parameters:
documentIdentifier - is the document identifier to add (for the connector that fetched the document).
parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering desired for this kind of relationship. MUST be present in the case of carrydown information.
relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
dataNames - is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters!
dataValues - are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null. The type of each object must either be a String, or a CharacterInput.
- Throws:
ManifoldCFException
-
addDocumentReference
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType) throws ManifoldCFException
Add a document description to the current job's queue.
- Parameters:
documentIdentifier - is the document identifier to add (for the connector that fetched the document).
parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering desired for this kind of relationship.
relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
- Throws:
ManifoldCFException
-
addDocumentReference
void addDocumentReference(java.lang.String documentIdentifier) throws ManifoldCFException
Add a document description to the current job's queue. This method is equivalent to addDocumentReference(documentIdentifier,null,null).
- Parameters:
documentIdentifier - is the document identifier to add (for the connector that fetched the document).
- Throws:
ManifoldCFException
-
ingestDocumentWithException
void ingestDocumentWithException(java.lang.String documentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data) throws ManifoldCFException, ServiceInterruption, java.io.IOException
Ingest the current document.
- Parameters:
documentIdentifier - is the document's identifier.
version - is the version of the document, as reported by the getDocumentVersions() method of the corresponding repository connector. An empty version string signals that there is no calculable document version string, and that the document should always be indexed.
documentURI - is the URI to use to retrieve this document from the search interface (and is also the unique key in the index).
data - is the document data. The data is closed after ingestion is complete.
- Throws:
java.io.IOException - only when data stream reading fails.
ManifoldCFException
ServiceInterruption
-
ingestDocumentWithException
void ingestDocumentWithException(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data) throws ManifoldCFException, ServiceInterruption, java.io.IOException
Ingest the current document.
- Parameters:
documentIdentifier - is the document's identifier.
componentIdentifier - is the component document identifier, if any.
version - is the version of the document, as reported by the getDocumentVersions() method of the corresponding repository connector.
documentURI - is the URI to use to retrieve this document from the search interface (and is also the unique key in the index).
data - is the document data. The data is closed after ingestion is complete.
- Throws:
java.io.IOException - only when data stream reading fails.
ManifoldCFException
ServiceInterruption
-
noDocument
void noDocument(java.lang.String documentIdentifier, java.lang.String version) throws ManifoldCFException, ServiceInterruption
Remove the specified document from the search engine index, and update the recorded version information for the document.
- Parameters:
documentIdentifier - is the document's local identifier.
version - is the version string to be recorded for the document.
- Throws:
ManifoldCFException
ServiceInterruption
-
noDocument
void noDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version) throws ManifoldCFException, ServiceInterruption
Remove the specified document from the search engine index, and update the recorded version information for the document.
- Parameters:
documentIdentifier - is the document's local identifier.
componentIdentifier - is the component document identifier, if any.
version - is the version string to be recorded for the document.
- Throws:
ManifoldCFException
ServiceInterruption
-
removeDocument
void removeDocument(java.lang.String documentIdentifier) throws ManifoldCFException, ServiceInterruption
Remove the specified document primary component permanently from the search engine index, and from the status table. Use this method when your document has components and previously also had a primary document, but will not have a primary document again for the foreseeable future. This is a rare situation.
- Parameters:
documentIdentifier - is the document's identifier.
- Throws:
ManifoldCFException
ServiceInterruption
-
retainDocument
void retainDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier) throws ManifoldCFException
Retain existing document component. Use this method to signal that an already-existing document component does not need to be reindexed. The default behavior is to remove components that are not mentioned during processing.
- Parameters:
documentIdentifier - is the document's identifier.
componentIdentifier - is the component document identifier, which cannot be null.
- Throws:
ManifoldCFException
-
retainAllComponentDocument
void retainAllComponentDocument(java.lang.String documentIdentifier) throws ManifoldCFException
Retain all existing document components of a primary document. Use this method to signal that no document components need to be reindexed. The default behavior is to remove components that are not mentioned during processing.
- Parameters:
documentIdentifier - is the document's identifier.
- Throws:
ManifoldCFException
-
recordDocument
void recordDocument(java.lang.String documentIdentifier, java.lang.String version) throws ManifoldCFException
Record a document version, WITHOUT reindexing it, or removing it. (Other documents with the same URL, however, will still be removed.) This is useful if the version string changes but the document contents are known not to have changed.
- Parameters:
documentIdentifier - is the document identifier.
version - is the document version.
- Throws:
ManifoldCFException
-
recordDocument
void recordDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version) throws ManifoldCFException
Record a document version, WITHOUT reindexing it, or removing it. (Other documents with the same URL, however, will still be removed.) This is useful if the version string changes but the document contents are known not to have changed.
- Parameters:
documentIdentifier - is the document identifier.
componentIdentifier - is the component document identifier, if any.
version - is the document version.
- Throws:
ManifoldCFException
-
deleteDocument
void deleteDocument(java.lang.String documentIdentifier) throws ManifoldCFException
Delete the specified document permanently from the search engine index, and from the status table, along with all its components. This method does NOT keep track of any document version information for the document and thus can lead to "churn", whereby the same document is queued, processed, and removed on subsequent crawls. It is therefore preferable to use noDocument() instead, in any case where the same decision will need to be made over and over.
- Parameters:
documentIdentifier - is the document's identifier.
- Throws:
ManifoldCFException
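The "churn" described for deleteDocument() can be illustrated with a toy model. This is a sketch under assumptions, not the framework's implementation: ChurnSketch is a hypothetical class whose map stands in for the status table, and it assumes the same unindexable document is rediscovered on every crawl with an unchanged version string.

```java
import java.util.HashMap;
import java.util.Map;

class ChurnSketch {
  // Hypothetical stand-in for the status table: document identifier -> recorded version.
  private final Map<String, String> statusTable = new HashMap<>();

  // Counts how often the connector had to fetch and evaluate the document.
  int fetches = 0;

  // Modeled after deleteDocument(): no version information survives.
  void deleteDocument(String id) {
    statusTable.remove(id);
  }

  // Modeled after noDocument(): the version is recorded even though nothing is indexed.
  void noDocument(String id, String version) {
    statusTable.put(id, version);
  }

  boolean checkDocumentNeedsReindexing(String id, String newVersion) {
    return !newVersion.equals(statusTable.get(id));
  }

  // One crawl pass over a document that exists but turns out to be unindexable.
  void crawlOnce(String id, String version, boolean useNoDocument) {
    if (!checkDocumentNeedsReindexing(id, version)) {
      return;  // decision already recorded, no work needed
    }
    fetches++;
    if (useNoDocument) {
      noDocument(id, version);
    } else {
      deleteDocument(id);
    }
  }

  // Runs the given number of crawls and reports how many fetches were needed.
  static int fetchesAfter(int crawls, boolean useNoDocument) {
    ChurnSketch sketch = new ChurnSketch();
    for (int i = 0; i < crawls; i++) {
      sketch.crawlOnce("doc-1", "v1", useNoDocument);
    }
    return sketch.fetches;
  }
}
```

In this model, three crawls cost three fetches when deleteDocument() is used and one fetch when noDocument() is used, because only the latter leaves a version behind for the next comparison.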
-
setDocumentScheduleBounds
void setDocumentScheduleBounds(java.lang.String documentIdentifier, java.lang.Long lowerRecrawlBoundTime, java.lang.Long upperRecrawlBoundTime, java.lang.Long lowerExpireBoundTime, java.lang.Long upperExpireBoundTime) throws ManifoldCFException
Override the schedule for the next time a document is crawled. Calling this method allows you to set an upper recrawl bound, lower recrawl bound, upper expire bound, lower expire bound, or a combination of these, on a specific document. This method is only effective if the job is a continuous one, and if the identifier you pass in is being processed.
- Parameters:
documentIdentifier - is the document's identifier.
lowerRecrawlBoundTime - is the time in ms since epoch that the reschedule time should not fall BELOW, or null if none.
upperRecrawlBoundTime - is the time in ms since epoch that the reschedule time should not rise ABOVE, or null if none.
lowerExpireBoundTime - is the time in ms since epoch that the expire time should not fall BELOW, or null if none.
upperExpireBoundTime - is the time in ms since epoch that the expire time should not rise ABOVE, or null if none.
- Throws:
ManifoldCFException
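All four bounds are absolute times in milliseconds since the epoch, with null meaning "no bound". A small sketch of computing a recrawl window relative to the current time follows; ScheduleBoundsSketch and recrawlWindow are hypothetical names introduced for illustration, and only the java.time calls are real API.

```java
import java.time.Duration;
import java.time.Instant;

class ScheduleBoundsSketch {
  // Returns the four bounds in parameter order:
  // {lowerRecrawlBoundTime, upperRecrawlBoundTime, lowerExpireBoundTime, upperExpireBoundTime}.
  // The expire bounds are left null, meaning "no bound".
  static Long[] recrawlWindow(Instant now, Duration earliest, Duration latest) {
    Long lower = now.plus(earliest).toEpochMilli();
    Long upper = now.plus(latest).toEpochMilli();
    return new Long[] { lower, upper, null, null };
  }
}
```

A connector might then call activities.setDocumentScheduleBounds(id, b[0], b[1], b[2], b[3]) with the resulting array b, where activities and id are whatever the connector has in scope. As stated above, this only has an effect in a continuous job and for a document currently being processed.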
-
setDocumentOriginationTime
void setDocumentOriginationTime(java.lang.String documentIdentifier, java.lang.Long originationTime) throws ManifoldCFException
Override a document's origination time. Use this method to signal the framework that a document's origination time is something other than the first time it was crawled.
- Parameters:
documentIdentifier - is the document's identifier.
originationTime - is the document's origination time, or null if unknown.
- Throws:
ManifoldCFException
-
-