Interface IProcessActivity
-
- All Superinterfaces:
IAbortActivity, ICarrydownActivity, IEventActivity, IFingerprintActivity, IHistoryActivity, INamingActivity
- All Known Implementing Classes:
WorkerThread.ProcessActivity
public interface IProcessActivity extends IHistoryActivity, IEventActivity, IAbortActivity, IFingerprintActivity, ICarrydownActivity
This interface abstracts the activities that a connector's processDocuments() method can perform. The processing flow for a document is expected to go something like this:
(1) The connector's processDocuments() method is called with a set of documents to be processed.
(2) The connector computes a version string for each document in the set, as part of determining whether the document needs to be refetched.
(3) Each document processed ends in one of several dispositions:
  (a) The document no longer exists: deleteDocument() is called for the document.
  (b) The document is (re)indexed: ingestDocumentWithException() is called for the document.
  (c) The document is unchanged and no updates are needed: nothing needs to be called for the document.
  (d) The document is unchanged BUT the version string needs to be updated: recordDocument() is called for the document.
  (e) The document is unindexable BUT still exists in the repository: noDocument() is called for the document.
  (f) There was a service interruption: ServiceInterruption is thrown.
(4) To determine whether a document needs to be reindexed, the connector can call checkDocumentNeedsReindexing(), which returns an opinion on that matter.
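The processing flow described above can be sketched as a small stand-alone program. This is only an illustration under stated assumptions, not the framework's implementation: ActivitySketch, RecordingActivity, RepoEntry and DispositionSketch are hypothetical names, the interface is a cut-down stand-in for IProcessActivity with the checked exceptions omitted, a plain String stands in for RepositoryDocument, and the version comparison is a simplification of whatever the real framework does. Dispositions (d) and (f) are not modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, cut-down stand-in for the subset of IProcessActivity used below. */
interface ActivitySketch {
  boolean checkDocumentNeedsReindexing(String documentIdentifier, String newVersionString);
  void ingestDocumentWithException(String documentIdentifier, String version, String documentURI, String data);
  void noDocument(String documentIdentifier, String version);
  void deleteDocument(String documentIdentifier);
}

/** Hypothetical repository entry; null content models "exists but is unindexable". */
final class RepoEntry {
  final String version;
  final String content;
  RepoEntry(String version, String content) {
    this.version = version;
    this.content = content;
  }
}

/** Records calls in order and remembers versions, so the flow can be observed. */
final class RecordingActivity implements ActivitySketch {
  final Map<String, String> recordedVersions = new HashMap<>();
  final List<String> calls = new ArrayList<>();

  @Override
  public boolean checkDocumentNeedsReindexing(String id, String newVersion) {
    // Assumption: a document needs reindexing when its version differs from the recorded one.
    return !newVersion.equals(recordedVersions.get(id));
  }

  @Override
  public void ingestDocumentWithException(String id, String version, String uri, String data) {
    recordedVersions.put(id, version);
    calls.add("ingest:" + id);
  }

  @Override
  public void noDocument(String id, String version) {
    recordedVersions.put(id, version);   // the version is kept, so the decision is remembered
    calls.add("noDocument:" + id);
  }

  @Override
  public void deleteDocument(String id) {
    recordedVersions.remove(id);         // no version information is kept
    calls.add("delete:" + id);
  }
}

final class DispositionSketch {
  /** Walks the document set and picks one disposition per document. */
  static void processDocuments(String[] ids, Map<String, RepoEntry> repo, ActivitySketch activities) {
    for (String id : ids) {
      RepoEntry entry = repo.get(id);
      if (entry == null) {
        activities.deleteDocument(id);                       // (a) no such document
        continue;
      }
      if (!activities.checkDocumentNeedsReindexing(id, entry.version)) {
        continue;                                            // (c) unchanged, nothing to call
      }
      if (entry.content == null) {
        activities.noDocument(id, entry.version);            // (e) exists but unindexable
      } else {
        activities.ingestDocumentWithException(              // (b) (re)index
            id, entry.version, "http://example.invalid/" + id, entry.content);
      }
    }
  }
}
```

Running processDocuments() twice over the same set shows the difference between the dispositions: documents handled through ingestDocumentWithException() or noDocument() are skipped on the second pass because their version was recorded, while a document handled through deleteDocument() is processed again.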
-
-
Field Summary
Modifier and Type, Field, Description
static java.lang.String _rcsid
-
Fields inherited from interface org.apache.manifoldcf.crawler.interfaces.IHistoryActivity
BAD_URL, EXCLUDED_CONTENT, EXCLUDED_DATE, EXCLUDED_LENGTH, EXCLUDED_MIMETYPE, EXCLUDED_URL, NULL_URL
-
-
Method Summary
Modifier and Type, Method, Description
void addDocumentReference(java.lang.String documentIdentifier)
Add a document description to the current job's queue.
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType)
Add a document description to the current job's queue.
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues)
Add a document description to the current job's queue.
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime)
Add a document description to the current job's queue.
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime, java.lang.String[] prereqEventNames)
Add a document description to the current job's queue.
boolean checkDocumentNeedsReindexing(java.lang.String documentIdentifier, java.lang.String newVersionString)
Check if a document needs to be reindexed, based on a computed version string.
boolean checkDocumentNeedsReindexing(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String newVersionString)
Check if a document needs to be reindexed, based on a computed version string.
void deleteDocument(java.lang.String documentIdentifier)
Delete the specified document permanently from the search engine index, and from the status table, along with all its components.
void ingestDocumentWithException(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data)
Ingest the current document.
void ingestDocumentWithException(java.lang.String documentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data)
Ingest the current document.
void noDocument(java.lang.String documentIdentifier, java.lang.String version)
Remove the specified document from the search engine index, and update the recorded version information for the document.
void noDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version)
Remove the specified document from the search engine index, and update the recorded version information for the document.
void recordDocument(java.lang.String documentIdentifier, java.lang.String version)
Record a document version, WITHOUT reindexing it, or removing it.
void recordDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version)
Record a document version, WITHOUT reindexing it, or removing it.
void removeDocument(java.lang.String documentIdentifier)
Remove the specified document primary component permanently from the search engine index, and from the status table.
void retainAllComponentDocument(java.lang.String documentIdentifier)
Retain all existing document components of a primary document.
void retainDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier)
Retain existing document component.
void setDocumentOriginationTime(java.lang.String documentIdentifier, java.lang.Long originationTime)
Override a document's origination time.
void setDocumentScheduleBounds(java.lang.String documentIdentifier, java.lang.Long lowerRecrawlBoundTime, java.lang.Long upperRecrawlBoundTime, java.lang.Long lowerExpireBoundTime, java.lang.Long upperExpireBoundTime)
Override the schedule for the next time a document is crawled.
-
Methods inherited from interface org.apache.manifoldcf.crawler.interfaces.IAbortActivity
checkJobStillActive
-
Methods inherited from interface org.apache.manifoldcf.crawler.interfaces.ICarrydownActivity
retrieveParentData, retrieveParentDataAsFiles
-
Methods inherited from interface org.apache.manifoldcf.crawler.interfaces.IEventActivity
beginEventSequence, completeEventSequence, retryDocumentProcessing
-
Methods inherited from interface org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity
checkDateIndexable, checkDocumentIndexable, checkLengthIndexable, checkMimeTypeIndexable, checkURLIndexable
-
Methods inherited from interface org.apache.manifoldcf.crawler.interfaces.IHistoryActivity
recordActivity
-
Methods inherited from interface org.apache.manifoldcf.crawler.interfaces.INamingActivity
createConnectionSpecificString, createGlobalString, createJobSpecificString
-
-
-
-
Field Detail
-
_rcsid
static final java.lang.String _rcsid
- See Also:
- Constant Field Values
-
-
Method Detail
-
checkDocumentNeedsReindexing
boolean checkDocumentNeedsReindexing(java.lang.String documentIdentifier, java.lang.String newVersionString) throws ManifoldCFException
Check if a document needs to be reindexed, based on a computed version string. Call this method to determine whether reindexing is necessary. Pass in a newly-computed version string. This method will return "true" if the document needs to be re-indexed.
- Parameters:
documentIdentifier - is the document identifier.
newVersionString - is the newly-computed version string.
- Returns:
true if the document needs to be reindexed.
- Throws:
ManifoldCFException
-
checkDocumentNeedsReindexing
boolean checkDocumentNeedsReindexing(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String newVersionString) throws ManifoldCFException
Check if a document needs to be reindexed, based on a computed version string. Call this method to determine whether reindexing is necessary. Pass in a newly-computed version string. This method will return "true" if the document needs to be re-indexed.
- Parameters:
documentIdentifier - is the document identifier.
componentIdentifier - is the component document identifier, if any.
newVersionString - is the newly-computed version string.
- Returns:
true if the document needs to be reindexed.
- Throws:
ManifoldCFException
-
addDocumentReference
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime, java.lang.String[] prereqEventNames) throws ManifoldCFException
Add a document description to the current job's queue.
- Parameters:
documentIdentifier - is the local document identifier to add (for the connector that fetched the document).
parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering is desired for this kind of relationship. MUST be present in the case of carrydown information.
relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
dataNames - is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters!
dataValues - are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null. The type of each object must be either a String or a CharacterInput.
originationTime - is the time, in ms since epoch, that the document originated. Pass null if none or unknown.
prereqEventNames - are the names of the prerequisite events which this document requires prior to processing. Pass null if none.
- Throws:
ManifoldCFException
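The constraints on the carry-down arguments can be checked before calling addDocumentReference(). The following is a minimal sketch, assuming that dataValues holds one row of values per entry in dataNames (the parameter description suggests this but does not spell it out). CarrydownArgsSketch and firstProblem are hypothetical names, and because the sketch is stand-alone it accepts only String values, although the real method also accepts CharacterInput.

```java
/** Hypothetical pre-flight check for the carry-down arguments of addDocumentReference(). */
final class CarrydownArgsSketch {
  static final int MAX_NAME_LENGTH = 255;

  /** Returns null when the arguments look consistent, else a description of the first problem found. */
  static String firstProblem(String parentIdentifier, String[] dataNames, Object[][] dataValues) {
    if (dataNames == null) {
      return null;  // no carry-down data at all; dataValues may then be null too
    }
    if (parentIdentifier == null) {
      return "parentIdentifier must be present when carry-down data is passed";
    }
    // Assumption: one row of values per data name.
    if (dataValues == null || dataValues.length != dataNames.length) {
      return "dataValues needs one row of values per entry in dataNames";
    }
    for (int i = 0; i < dataNames.length; i++) {
      if (dataNames[i].length() > MAX_NAME_LENGTH) {
        return "data name at index " + i + " is longer than " + MAX_NAME_LENGTH + " characters";
      }
      if (dataValues[i] == null) {
        continue;  // the documentation does not cover this case; the sketch tolerates it
      }
      for (Object value : dataValues[i]) {
        // The real API also allows CharacterInput, which is not available in this sketch.
        if (!(value instanceof String)) {
          return "value for '" + dataNames[i] + "' is not a String";
        }
      }
    }
    return null;
  }
}
```

A connector could run such a check in its own unit tests so that an over-long data name is caught early rather than at crawl time.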
-
addDocumentReference
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime) throws ManifoldCFException
Add a document description to the current job's queue.
- Parameters:
documentIdentifier - is the document identifier to add (for the connector that fetched the document).
parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering is desired for this kind of relationship. MUST be present in the case of carrydown information.
relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
dataNames - is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters!
dataValues - are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null. The type of each object must be either a String or a CharacterInput.
originationTime - is the time, in ms since epoch, that the document originated. Pass null if none or unknown.
- Throws:
ManifoldCFException
-
addDocumentReference
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues) throws ManifoldCFException
Add a document description to the current job's queue.
- Parameters:
documentIdentifier - is the document identifier to add (for the connector that fetched the document).
parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering is desired for this kind of relationship. MUST be present in the case of carrydown information.
relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
dataNames - is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters!
dataValues - are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null. The type of each object must be either a String or a CharacterInput.
- Throws:
ManifoldCFException
-
addDocumentReference
void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType) throws ManifoldCFException
Add a document description to the current job's queue.
- Parameters:
documentIdentifier - is the document identifier to add (for the connector that fetched the document).
parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering is desired for this kind of relationship.
relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
- Throws:
ManifoldCFException
-
addDocumentReference
void addDocumentReference(java.lang.String documentIdentifier) throws ManifoldCFException
Add a document description to the current job's queue. This method is equivalent to addDocumentReference(documentIdentifier,null,null).
- Parameters:
documentIdentifier - is the document identifier to add (for the connector that fetched the document).
- Throws:
ManifoldCFException
-
ingestDocumentWithException
void ingestDocumentWithException(java.lang.String documentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data) throws ManifoldCFException, ServiceInterruption, java.io.IOException
Ingest the current document.
- Parameters:
documentIdentifier - is the document's identifier.
version - is the version of the document, as reported by the getDocumentVersions() method of the corresponding repository connector. An empty version string signals that there is no calculable document version string, and that the document should always be indexed.
documentURI - is the URI to use to retrieve this document from the search interface (and is also the unique key in the index).
data - is the document data. The data is closed after ingestion is complete.
- Throws:
java.io.IOException - only when data stream reading fails.
ManifoldCFException
ServiceInterruption
-
ingestDocumentWithException
void ingestDocumentWithException(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data) throws ManifoldCFException, ServiceInterruption, java.io.IOException
Ingest the current document.
- Parameters:
documentIdentifier - is the document's identifier.
componentIdentifier - is the component document identifier, if any.
version - is the version of the document, as reported by the getDocumentVersions() method of the corresponding repository connector.
documentURI - is the URI to use to retrieve this document from the search interface (and is also the unique key in the index).
data - is the document data. The data is closed after ingestion is complete.
- Throws:
java.io.IOException - only when data stream reading fails.
ManifoldCFException
ServiceInterruption
-
noDocument
void noDocument(java.lang.String documentIdentifier, java.lang.String version) throws ManifoldCFException, ServiceInterruption
Remove the specified document from the search engine index, and update the recorded version information for the document.
- Parameters:
documentIdentifier - is the document's local identifier.
version - is the version string to be recorded for the document.
- Throws:
ManifoldCFException
ServiceInterruption
-
noDocument
void noDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version) throws ManifoldCFException, ServiceInterruption
Remove the specified document from the search engine index, and update the recorded version information for the document.
- Parameters:
documentIdentifier - is the document's local identifier.
componentIdentifier - is the component document identifier, if any.
version - is the version string to be recorded for the document.
- Throws:
ManifoldCFException
ServiceInterruption
-
removeDocument
void removeDocument(java.lang.String documentIdentifier) throws ManifoldCFException, ServiceInterruption
Remove the specified document's primary component permanently from the search engine index, and from the status table. Use this method when your document has components and previously also had a primary document, but will not have a primary document again for the foreseeable future. This is a rare situation.
- Parameters:
documentIdentifier - is the document's identifier.
- Throws:
ManifoldCFException
ServiceInterruption
-
retainDocument
void retainDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier) throws ManifoldCFException
Retain existing document component. Use this method to signal that an already-existing document component does not need to be reindexed. The default behavior is to remove components that are not mentioned during processing.
- Parameters:
documentIdentifier - is the document's identifier.
componentIdentifier - is the component document identifier, which cannot be null.
- Throws:
ManifoldCFException
-
retainAllComponentDocument
void retainAllComponentDocument(java.lang.String documentIdentifier) throws ManifoldCFException
Retain all existing document components of a primary document. Use this method to signal that no document components need to be reindexed. The default behavior is to remove components that are not mentioned during processing.
- Parameters:
documentIdentifier - is the document's identifier.
- Throws:
ManifoldCFException
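The rule that components not mentioned during processing are removed by default can be modeled in a few lines. This is a toy model for a single primary document, not the framework's bookkeeping: ComponentRetentionSketch and all of its methods are hypothetical, and treating an ingested component as "mentioned" is an assumption.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical model of which components survive one processing pass. */
final class ComponentRetentionSketch {
  private final Set<String> existing;
  private final Set<String> mentioned = new LinkedHashSet<>();

  ComponentRetentionSketch(Set<String> existingComponents) {
    this.existing = new LinkedHashSet<>(existingComponents);
  }

  /** Models a component that was (re)indexed during this pass. */
  void ingestComponent(String componentIdentifier) {
    mentioned.add(componentIdentifier);
  }

  /** Models retainDocument(): the component is unchanged but must be kept. */
  void retainDocument(String componentIdentifier) {
    if (componentIdentifier == null) {
      throw new IllegalArgumentException("componentIdentifier cannot be null");
    }
    mentioned.add(componentIdentifier);
  }

  /** Models retainAllComponentDocument(): every existing component is kept. */
  void retainAllComponentDocument() {
    mentioned.addAll(existing);
  }

  /** Components that existed before the pass and were never mentioned during it. */
  Set<String> componentsRemovedAfterProcessing() {
    Set<String> removed = new LinkedHashSet<>(existing);
    removed.removeAll(mentioned);
    return removed;
  }
}
```

With existing components p1, p2 and p3, ingesting p1 and retaining p2 leaves p3 as the only component slated for removal; calling retainAllComponentDocument() afterwards leaves none.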
-
recordDocument
void recordDocument(java.lang.String documentIdentifier, java.lang.String version) throws ManifoldCFException
Record a document version, WITHOUT reindexing it, or removing it. (Other documents with the same URL, however, will still be removed.) This is useful if the version string changes but the document contents are known not to have changed.
- Parameters:
documentIdentifier - is the document identifier.
version - is the document version.
- Throws:
ManifoldCFException
-
recordDocument
void recordDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version) throws ManifoldCFException
Record a document version, WITHOUT reindexing it, or removing it. (Other documents with the same URL, however, will still be removed.) This is useful if the version string changes but the document contents are known not to have changed.
- Parameters:
documentIdentifier - is the document identifier.
componentIdentifier - is the component document identifier, if any.
version - is the document version.
- Throws:
ManifoldCFException
-
deleteDocument
void deleteDocument(java.lang.String documentIdentifier) throws ManifoldCFException
Delete the specified document permanently from the search engine index, and from the status table, along with all its components. This method does NOT keep track of any document version information for the document and thus can lead to "churn", whereby the same document is queued, processed, and removed on subsequent crawls. It is therefore preferable to use noDocument() instead, in any case where the same decision will need to be made over and over.
- Parameters:
documentIdentifier - is the document's identifier.
- Throws:
ManifoldCFException
-
setDocumentScheduleBounds
void setDocumentScheduleBounds(java.lang.String documentIdentifier, java.lang.Long lowerRecrawlBoundTime, java.lang.Long upperRecrawlBoundTime, java.lang.Long lowerExpireBoundTime, java.lang.Long upperExpireBoundTime) throws ManifoldCFException
Override the schedule for the next time a document is crawled. Calling this method allows you to set an upper recrawl bound, lower recrawl bound, upper expire bound, lower expire bound, or a combination of these, on a specific document. This method is only effective if the job is a continuous one, and if the identifier you pass in is being processed.
- Parameters:
documentIdentifier - is the document's identifier.
lowerRecrawlBoundTime - is the time in ms since epoch that the reschedule time should not fall BELOW, or null if none.
upperRecrawlBoundTime - is the time in ms since epoch that the reschedule time should not rise ABOVE, or null if none.
lowerExpireBoundTime - is the time in ms since epoch that the expire time should not fall BELOW, or null if none.
upperExpireBoundTime - is the time in ms since epoch that the expire time should not rise ABOVE, or null if none.
- Throws:
ManifoldCFException
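All four bounds are absolute times in milliseconds since the epoch, and each may be null. A small helper can turn relative delays into those values. This is a sketch only: ScheduleBoundsSketch and recrawlWindow are hypothetical names, and expressing the delays in minutes is an arbitrary choice for the example.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that builds the four arguments for setDocumentScheduleBounds(). */
final class ScheduleBoundsSketch {
  /**
   * Returns {lowerRecrawlBoundTime, upperRecrawlBoundTime, lowerExpireBoundTime, upperExpireBoundTime}.
   * A null entry means "no bound"; this sketch leaves both expire bounds unset.
   */
  static Long[] recrawlWindow(long nowMillis, long minDelayMinutes, long maxDelayMinutes) {
    Long lower = nowMillis + TimeUnit.MINUTES.toMillis(minDelayMinutes);
    Long upper = nowMillis + TimeUnit.MINUTES.toMillis(maxDelayMinutes);
    return new Long[] {lower, upper, null, null};
  }
}
```

Inside a connector the result would be passed straight through, for example activities.setDocumentScheduleBounds(documentIdentifier, w[0], w[1], w[2], w[3]), where w is the returned array and activities is the IProcessActivity handed to processDocuments().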
-
setDocumentOriginationTime
void setDocumentOriginationTime(java.lang.String documentIdentifier, java.lang.Long originationTime) throws ManifoldCFException
Override a document's origination time. Use this method to signal the framework that a document's origination time is something other than the first time it was crawled.
- Parameters:
documentIdentifier - is the document's identifier.
originationTime - is the document's origination time, or null if unknown.
- Throws:
ManifoldCFException
-
-