Class WorkerThread.ProcessActivity
- java.lang.Object
-
- org.apache.manifoldcf.crawler.system.WorkerThread.ProcessActivity
-
- All Implemented Interfaces:
IAbortActivity,ICarrydownActivity,IEventActivity,IFingerprintActivity,IHistoryActivity,INamingActivity,IProcessActivity
- Enclosing class:
- WorkerThread
protected static class WorkerThread.ProcessActivity extends java.lang.Object implements IProcessActivity
Process activity class wraps access to the ingester and job queue.
-
-
Field Summary
Fields Modifier and Type Field Description protected java.util.Set<java.lang.String>abortSetprotected java.util.Set<java.lang.String>allComponentsSetprotected IRepositoryConnectionconnectionprotected java.lang.StringconnectionNameprotected IRepositoryConnectorconnectorprotected IRepositoryConnectionManagerconnMgrprotected longcurrentTimeprotected java.util.Set<java.lang.String>documentDeletedSetprotected java.lang.LongexpireIntervalprotected inthopcountModeprotected IIncrementalIngesteringesterprotected WorkerThread.OutputActivityingestLoggerprotected java.lang.LongjobIDprotected IJobManagerjobManagerprotected java.lang.String[]legalLinkTypesprotected java.util.Map<java.lang.String,java.lang.Long>lowerExpireBoundsprotected java.util.Map<java.lang.String,java.lang.Long>lowerRescheduleBoundsprotected java.lang.LongmaxIntervalprotected java.util.Map<java.lang.String,java.lang.Long>originationTimesprotected IPipelineSpecificationpipelineSpecificationprotected java.util.Map<java.lang.String,QueuedDocument>previousDocumentsprotected java.lang.StringprocessIDprotected java.lang.LongrecrawlIntervalprotected java.util.Map<WorkerThread.DocumentReference,WorkerThread.DocumentReference>referenceListprotected IReprioritizationTrackerrtprotected java.util.Map<java.lang.String,java.util.Set<java.lang.String>>touchedComponentSetprotected java.util.Set<java.lang.String>touchedPrimarySetprotected java.util.Set<java.lang.String>touchedSetprotected java.util.Map<java.lang.String,java.lang.Long>upperExpireBoundsprotected java.util.Map<java.lang.String,java.lang.Long>upperRescheduleBounds-
Fields inherited from interface org.apache.manifoldcf.crawler.interfaces.IHistoryActivity
BAD_URL, EXCLUDED_CONTENT, EXCLUDED_DATE, EXCLUDED_LENGTH, EXCLUDED_MIMETYPE, EXCLUDED_URL, NULL_URL
-
Fields inherited from interface org.apache.manifoldcf.crawler.interfaces.IProcessActivity
_rcsid
-
-
Constructor Summary
Constructors Constructor Description ProcessActivity(java.lang.Long jobID, java.lang.String processID, IReprioritizationTracker rt, IJobManager jobManager, IIncrementalIngester ingester, java.lang.String connectionName, IPipelineSpecification pipelineSpecification, java.util.Map<java.lang.String,QueuedDocument> previousDocuments, long currentTime, java.lang.Long expireInterval, java.lang.Long recrawlInterval, java.lang.Long maxInterval, int hopcountMode, IRepositoryConnection connection, IRepositoryConnector connector, IRepositoryConnectionManager connMgr, java.lang.String[] legalLinkTypes, WorkerThread.OutputActivity ingestLogger)Constructor.
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description voidaddDocumentReference(java.lang.String localIdentifier)Add a document description to the current job's queue.voidaddDocumentReference(java.lang.String localIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType)Add a document description to the current job's queue.voidaddDocumentReference(java.lang.String localIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues)Add a document description to the current job's queue.voidaddDocumentReference(java.lang.String localIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime)Add a document description to the current job's queue.voidaddDocumentReference(java.lang.String localIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime, java.lang.String[] prereqEventNames)Add a document description to the current job's queue.booleanbeginEventSequence(java.lang.String eventName)Begin an event sequence.java.lang.LongcalculateDocumentExpireTime(long currentTime, java.lang.String localIdentifier)java.lang.LongcalculateDocumentRescheduleTime(long currentTime, long timeAmt, java.lang.String localIdentifier)protected voidcheckAllComponentsMultipleDispositions(java.lang.String documentIdentifier)booleancheckDateIndexable(java.util.Date date)Detect if a date is indexable or not.booleancheckDocumentIndexable(java.io.File localFile)Check whether a document is indexable by the currently specified output connector.booleancheckDocumentNeedsReindexing(java.lang.String documentIdentifier, java.lang.String newVersionString)Check if a document needs to be reindexed, based on a computed version string.booleancheckDocumentNeedsReindexing(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String newVersionString)Check if a document needs to be reindexed, based on a computed version string.voidcheckJobStillActive()Check whether current job is still active.booleancheckLengthIndexable(long length)Check whether a document of a specified length is indexable by the currently specified output connector.booleancheckMimeTypeIndexable(java.lang.String mimeType)Check whether a mime type is indexable by the currently specified output connector.protected voidcheckMultipleDispositions(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String componentIdentifierHash)booleancheckURLIndexable(java.lang.String url)Pre-determine whether a document's URL is indexable by this connector.voidcompleteEventSequence(java.lang.String eventName)Complete an event sequence.protected IPipelineSpecificationWithVersionscomputePipelineSpecificationWithVersions(java.lang.String documentIdentifierHash, java.lang.String componentIdentifierHash, java.lang.String documentIdentifier)java.lang.StringcreateConnectionSpecificString(java.lang.String simpleString)Create a connection-specific string from a simple string.java.lang.StringcreateGlobalString(java.lang.String simpleString)Create a global string from a simple string.java.lang.StringcreateJobSpecificString(java.lang.String simpleString)Create a job-based string from a simple string.voiddeleteDocument(java.lang.String documentIdentifier)Delete the specified document from the search engine index, and from the status table.voiddiscard()Clean up any dangling information, before abandoning this process activity objectvoidflush()Flush the outstanding references into the database.java.lang.LonggetDocumentExpirationLowerBoundTime(java.lang.String localIdentifier)Find a document's lower expiration time bound, if anyjava.lang.LonggetDocumentExpirationUpperBoundTime(java.lang.String localIdentifier)Find a document's upper expiration time bound, if anyjava.lang.LonggetDocumentOriginationTime(java.lang.String localIdentifier)Get a document's origination timejava.lang.LonggetDocumentRescheduleLowerBoundTime(java.lang.String localIdentifier)Find a document's lower rescheduling time bound, if anyjava.lang.LonggetDocumentRescheduleUpperBoundTime(java.lang.String localIdentifier)Find a document's upper rescheduling time bound, if anyvoidingestDocumentWithException(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data)Ingest the current document.voidingestDocumentWithException(java.lang.String documentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data)Ingest the current document.voidnoDocument(java.lang.String documentIdentifier, java.lang.String version)Remove the specified document from the search engine index, while keeping track of the version information for it (to reduce churn).voidnoDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version)Remove the specified document from the search engine index, and update the recorded version information for the document.protected voidprocessDocumentReferences()Process outstanding document references, in batch.voidrecordActivity(java.lang.Long startTime, java.lang.String activityType, java.lang.Long dataSize, java.lang.String entityIdentifier, java.lang.String resultCode, java.lang.String resultDescription, java.lang.String[] childIdentifiers)Record time-stamped information about the activity of the connector.voidrecordDocument(java.lang.String documentIdentifier, java.lang.String version)Record a document version, but don't ingest it.voidrecordDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version)Record a document version, WITHOUT reindexing it, or removing it.voidremoveDocument(java.lang.String documentIdentifier)Remove the specified document primary component permanently from the search engine index, and from the status table.voidresetTimes()Reset the recorded timesvoidretainAllComponentDocument(java.lang.String documentIdentifier)Retain all existing document components of a primary document.voidretainDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier)Retain existing document component.java.lang.String[]retrieveParentData(java.lang.String localIdentifier, java.lang.String dataName)Retrieve data passed from parents to a specified child document.CharacterInput[]retrieveParentDataAsFiles(java.lang.String localIdentifier, java.lang.String dataName)Retrieve data passed from parents to a specified child document.voidretryDocumentProcessing(java.lang.String localIdentifier)Abort processing a document (for sequencing reasons).voidsetDocumentOriginationTime(java.lang.String localIdentifier, java.lang.Long originationTime)Override a document's origination time.voidsetDocumentScheduleBounds(java.lang.String localIdentifier, java.lang.Long lowerRecrawlBoundTime, java.lang.Long upperRecrawlBoundTime, java.lang.Long lowerExpireBoundTime, java.lang.Long upperExpireBoundTime)Override the schedule for the next time a document is crawled.protected voidtouchAllComponentsSet(java.lang.String documentIdentifier)protected voidtouchComponentSet(java.lang.String documentIdentifier, java.lang.String componentIdentifierHash)booleanwasDocumentAborted(java.lang.String documentIdentifier)Check whether a document was aborted or not.booleanwasDocumentComponentTouched(java.lang.String documentIdentifier, java.lang.String componentIdentifierHash)Check whether a document component was touched or not.booleanwasDocumentDeleted(java.lang.String documentIdentifier)Check whether document was deleted or not.booleanwasDocumentTouched(java.lang.String documentIdentifier)Check whether a document (and its version string) was touched or not.
-
-
-
Field Detail
-
jobID
protected final java.lang.Long jobID
-
processID
protected final java.lang.String processID
-
jobManager
protected final IJobManager jobManager
-
ingester
protected final IIncrementalIngester ingester
-
connectionName
protected final java.lang.String connectionName
-
pipelineSpecification
protected final IPipelineSpecification pipelineSpecification
-
previousDocuments
protected final java.util.Map<java.lang.String,QueuedDocument> previousDocuments
-
currentTime
protected final long currentTime
-
expireInterval
protected final java.lang.Long expireInterval
-
recrawlInterval
protected final java.lang.Long recrawlInterval
-
maxInterval
protected final java.lang.Long maxInterval
-
hopcountMode
protected final int hopcountMode
-
connection
protected final IRepositoryConnection connection
-
connector
protected final IRepositoryConnector connector
-
connMgr
protected final IRepositoryConnectionManager connMgr
-
legalLinkTypes
protected final java.lang.String[] legalLinkTypes
-
ingestLogger
protected final WorkerThread.OutputActivity ingestLogger
-
rt
protected final IReprioritizationTracker rt
-
referenceList
protected final java.util.Map<WorkerThread.DocumentReference,WorkerThread.DocumentReference> referenceList
-
lowerRescheduleBounds
protected final java.util.Map<java.lang.String,java.lang.Long> lowerRescheduleBounds
-
upperRescheduleBounds
protected final java.util.Map<java.lang.String,java.lang.Long> upperRescheduleBounds
-
lowerExpireBounds
protected final java.util.Map<java.lang.String,java.lang.Long> lowerExpireBounds
-
upperExpireBounds
protected final java.util.Map<java.lang.String,java.lang.Long> upperExpireBounds
-
originationTimes
protected final java.util.Map<java.lang.String,java.lang.Long> originationTimes
-
abortSet
protected final java.util.Set<java.lang.String> abortSet
-
touchedSet
protected final java.util.Set<java.lang.String> touchedSet
-
documentDeletedSet
protected final java.util.Set<java.lang.String> documentDeletedSet
-
allComponentsSet
protected final java.util.Set<java.lang.String> allComponentsSet
-
touchedComponentSet
protected final java.util.Map<java.lang.String,java.util.Set<java.lang.String>> touchedComponentSet
-
touchedPrimarySet
protected final java.util.Set<java.lang.String> touchedPrimarySet
-
-
Constructor Detail
-
ProcessActivity
public ProcessActivity(java.lang.Long jobID, java.lang.String processID, IReprioritizationTracker rt, IJobManager jobManager, IIncrementalIngester ingester, java.lang.String connectionName, IPipelineSpecification pipelineSpecification, java.util.Map<java.lang.String,QueuedDocument> previousDocuments, long currentTime, java.lang.Long expireInterval, java.lang.Long recrawlInterval, java.lang.Long maxInterval, int hopcountMode, IRepositoryConnection connection, IRepositoryConnector connector, IRepositoryConnectionManager connMgr, java.lang.String[] legalLinkTypes, WorkerThread.OutputActivity ingestLogger)Constructor.- Parameters:
jobManager- is the job manageringester- is the ingester
-
-
Method Detail
-
discard
public void discard() throws ManifoldCFExceptionClean up any dangling information, before abandoning this process activity object- Throws:
ManifoldCFException
-
wasDocumentTouched
public boolean wasDocumentTouched(java.lang.String documentIdentifier)
Check whether a document (and its version string) was touched or not.
-
wasDocumentComponentTouched
public boolean wasDocumentComponentTouched(java.lang.String documentIdentifier, java.lang.String componentIdentifierHash)Check whether a document component was touched or not.
-
wasDocumentDeleted
public boolean wasDocumentDeleted(java.lang.String documentIdentifier)
Check whether document was deleted or not.
-
wasDocumentAborted
public boolean wasDocumentAborted(java.lang.String documentIdentifier)
Check whether a document was aborted or not.
-
checkDocumentNeedsReindexing
public boolean checkDocumentNeedsReindexing(java.lang.String documentIdentifier, java.lang.String newVersionString) throws ManifoldCFExceptionCheck if a document needs to be reindexed, based on a computed version string. Call this method to determine whether reindexing is necessary. Pass in a newly-computed version string. This method will return "true" if the document needs to be re-indexed.- Specified by:
checkDocumentNeedsReindexingin interfaceIProcessActivity- Parameters:
documentIdentifier- is the document identifier.newVersionString- is the newly-computed version string.- Returns:
- true if the document needs to be reindexed.
- Throws:
ManifoldCFException
-
checkDocumentNeedsReindexing
public boolean checkDocumentNeedsReindexing(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String newVersionString) throws ManifoldCFExceptionCheck if a document needs to be reindexed, based on a computed version string. Call this method to determine whether reindexing is necessary. Pass in a newly-computed version string. This method will return "true" if the document needs to be re-indexed.- Specified by:
checkDocumentNeedsReindexingin interfaceIProcessActivity- Parameters:
documentIdentifier- is the document identifier.componentIdentifier- is the component document identifier, if any.newVersionString- is the newly-computed version string.- Returns:
- true if the document needs to be reindexed.
- Throws:
ManifoldCFException
-
addDocumentReference
public void addDocumentReference(java.lang.String localIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime, java.lang.String[] prereqEventNames) throws ManifoldCFExceptionAdd a document description to the current job's queue.- Specified by:
addDocumentReferencein interfaceIProcessActivity- Parameters:
localIdentifier- is the local document identifier to add (for the connector that fetched the document).parentIdentifier- is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering desired for this kind of relationship.relationshipType- is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.dataNames- is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters!dataValues- are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null. The type of each object must either be a String, or a CharacterInput.originationTime- is the time, in ms since epoch, that the document originated. Pass null if none or unknown.prereqEventNames- are the names of the prerequisite events which this document requires prior to processing. Pass null if none.- Throws:
ManifoldCFException
-
addDocumentReference
public void addDocumentReference(java.lang.String localIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime) throws ManifoldCFExceptionAdd a document description to the current job's queue.- Specified by:
addDocumentReferencein interfaceIProcessActivity- Parameters:
localIdentifier- is the local document identifier to add (for the connector that fetched the document).parentIdentifier- is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering desired for this kind of relationship.relationshipType- is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.dataNames- is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters!dataValues- are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null.originationTime- is the time, in ms since epoch, that the document originated. Pass null if none or unknown.- Throws:
ManifoldCFException
-
addDocumentReference
public void addDocumentReference(java.lang.String localIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues) throws ManifoldCFExceptionAdd a document description to the current job's queue.- Specified by:
addDocumentReferencein interfaceIProcessActivity- Parameters:
localIdentifier- is the local document identifier to add (for the connector that fetched the document).parentIdentifier- is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering desired for this kind of relationship.relationshipType- is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.dataNames- is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters!dataValues- are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null.- Throws:
ManifoldCFException
-
addDocumentReference
public void addDocumentReference(java.lang.String localIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType) throws ManifoldCFExceptionAdd a document description to the current job's queue.- Specified by:
addDocumentReferencein interfaceIProcessActivity- Parameters:
localIdentifier- is the local document identifier to add (for the connector that fetched the document).parentIdentifier- is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering desired for this kind of relationship.relationshipType- is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.- Throws:
ManifoldCFException
-
addDocumentReference
public void addDocumentReference(java.lang.String localIdentifier) throws ManifoldCFExceptionAdd a document description to the current job's queue. This method is equivalent to addDocumentReference(localIdentifier,null,null).- Specified by:
addDocumentReferencein interfaceIProcessActivity- Parameters:
localIdentifier- is the local document identifier to add (for the connector that fetched the document).- Throws:
ManifoldCFException
-
retrieveParentData
public java.lang.String[] retrieveParentData(java.lang.String localIdentifier, java.lang.String dataName) throws ManifoldCFExceptionRetrieve data passed from parents to a specified child document.- Specified by:
retrieveParentDatain interfaceICarrydownActivity- Parameters:
localIdentifier- is the document identifier of the document we want the recorded data for.dataName- is the name of the data items to retrieve.- Returns:
- an array containing the unique data values passed from ALL parents. Note that these are in no particular order, and there will not be any duplicates.
- Throws:
ManifoldCFException
-
retrieveParentDataAsFiles
public CharacterInput[] retrieveParentDataAsFiles(java.lang.String localIdentifier, java.lang.String dataName) throws ManifoldCFException
Retrieve data passed from parents to a specified child document.- Specified by:
retrieveParentDataAsFilesin interfaceICarrydownActivity- Parameters:
localIdentifier- is the document identifier of the document we want the recorded data for.dataName- is the name of the data items to retrieve.- Returns:
- an array containing the unique data values passed from ALL parents. Note that these are in no particular order, and there will not be any duplicates.
- Throws:
ManifoldCFException
-
recordDocument
public void recordDocument(java.lang.String documentIdentifier, java.lang.String version) throws ManifoldCFExceptionRecord a document version, but don't ingest it.- Specified by:
recordDocumentin interfaceIProcessActivity- Parameters:
documentIdentifier- is the document identifier.version- is the document version.- Throws:
ManifoldCFException
-
recordDocument
public void recordDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version) throws ManifoldCFExceptionRecord a document version, WITHOUT reindexing it, or removing it. (Other documents with the same URL, however, will still be removed.) This is useful if the version string changes but the document contents are known not to have changed.- Specified by:
recordDocumentin interfaceIProcessActivity- Parameters:
documentIdentifier- is the document identifier.componentIdentifier- is the component document identifier, if any.version- is the document version.- Throws:
ManifoldCFException
-
ingestDocumentWithException
public void ingestDocumentWithException(java.lang.String documentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data) throws ManifoldCFException, ServiceInterruption, java.io.IOExceptionIngest the current document.- Specified by:
ingestDocumentWithExceptionin interfaceIProcessActivity- Parameters:
documentIdentifier- is the document's local identifier.version- is the version of the document, as reported by the getDocumentVersions() method of the corresponding repository connector.documentURI- is the URI to use to retrieve this document from the search interface (and is also the unique key in the index).data- is the document data. The data is closed after ingestion is complete.- Throws:
java.io.IOException- only when data stream reading fails.ManifoldCFExceptionServiceInterruption
-
ingestDocumentWithException
public void ingestDocumentWithException(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data) throws ManifoldCFException, ServiceInterruption, java.io.IOExceptionIngest the current document.- Specified by:
ingestDocumentWithExceptionin interfaceIProcessActivity- Parameters:
documentIdentifier- is the document's identifier.componentIdentifier- is the component document identifier, if any.version- is the version of the document, as reported by the getDocumentVersions() method of the corresponding repository connector.documentURI- is the URI to use to retrieve this document from the search interface (and is also the unique key in the index).data- is the document data. The data is closed after ingestion is complete.- Throws:
java.io.IOException- only when data stream reading fails.ManifoldCFExceptionServiceInterruption
-
noDocument
public void noDocument(java.lang.String documentIdentifier, java.lang.String version) throws ManifoldCFException, ServiceInterruptionRemove the specified document from the search engine index, while keeping track of the version information for it (to reduce churn).- Specified by:
noDocumentin interfaceIProcessActivity- Parameters:
documentIdentifier- is the document's local identifier.version- is the version string to be recorded for the document.- Throws:
ManifoldCFExceptionServiceInterruption
-
noDocument
public void noDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version) throws ManifoldCFException, ServiceInterruptionRemove the specified document from the search engine index, and update the recorded version information for the document.- Specified by:
noDocumentin interfaceIProcessActivity- Parameters:
documentIdentifier- is the document's local identifier.componentIdentifier- is the component document identifier, if any.version- is the version string to be recorded for the document.- Throws:
ManifoldCFExceptionServiceInterruption
-
removeDocument
public void removeDocument(java.lang.String documentIdentifier) throws ManifoldCFException, ServiceInterruptionRemove the specified document primary component permanently from the search engine index, and from the status table. Use this method when your document has components and now also has a primary document, but will not have a primary document again for the foreseeable future. This is a rare situation.- Specified by:
removeDocumentin interfaceIProcessActivity- Parameters:
documentIdentifier- is the document's identifier.- Throws:
ManifoldCFExceptionServiceInterruption
-
retainDocument
public void retainDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier) throws ManifoldCFExceptionRetain existing document component. Use this method to signal that an already-existing document component does not need to be reindexed. The default behavior is to remove components that are not mentioned during processing.- Specified by:
retainDocumentin interfaceIProcessActivity- Parameters:
documentIdentifier- is the document's identifier.componentIdentifier- is the component document identifier, which cannot be null.- Throws:
ManifoldCFException
-
retainAllComponentDocument
public void retainAllComponentDocument(java.lang.String documentIdentifier) throws ManifoldCFExceptionRetain all existing document components of a primary document. Use this method to signal that no document components need to be reindexed. The default behavior is to remove components that are not mentioned during processing.- Specified by:
retainAllComponentDocumentin interfaceIProcessActivity- Parameters:
documentIdentifier- is the document's identifier.- Throws:
ManifoldCFException
-
deleteDocument
public void deleteDocument(java.lang.String documentIdentifier) throws ManifoldCFExceptionDelete the specified document from the search engine index, and from the status table. This method does NOT keep track of version information for the document and thus can lead to "churn", whereby the same document is queued, processed, and removed on subsequent crawls. It is therefore preferable to use noDocument() instead, in any case where the same decision will need to be made over and over.- Specified by:
deleteDocumentin interfaceIProcessActivity- Parameters:
documentIdentifier- is the document's identifier.- Throws:
ManifoldCFException
-
setDocumentScheduleBounds
public void setDocumentScheduleBounds(java.lang.String localIdentifier, java.lang.Long lowerRecrawlBoundTime, java.lang.Long upperRecrawlBoundTime, java.lang.Long lowerExpireBoundTime, java.lang.Long upperExpireBoundTime) throws ManifoldCFExceptionOverride the schedule for the next time a document is crawled. Calling this method allows you to set an upper recrawl bound, lower recrawl bound, upper expire bound, lower expire bound, or a combination of these, on a specific document. This method is only effective if the job is a continuous one, and if the identifier you pass in is being processed.- Specified by:
setDocumentScheduleBoundsin interfaceIProcessActivity- Parameters:
localIdentifier- is the document's local identifier.lowerRecrawlBoundTime- is the time in ms since epoch that the reschedule time should not fall BELOW, or null if none.upperRecrawlBoundTime- is the time in ms since epoch that the reschedule time should not rise ABOVE, or null if none.lowerExpireBoundTime- is the time in ms since epoch that the expire time should not fall BELOW, or null if none.upperExpireBoundTime- is the time in ms since epoch that the expire time should not rise ABOVE, or null if none.- Throws:
ManifoldCFException
-
setDocumentOriginationTime
public void setDocumentOriginationTime(java.lang.String localIdentifier, java.lang.Long originationTime) throws ManifoldCFExceptionOverride a document's origination time. Use this method to signal the framework that a document's origination time is something other than the first time it was crawled.- Specified by:
setDocumentOriginationTimein interfaceIProcessActivity- Parameters:
localIdentifier- is the document's local identifier.originationTime- is the document's origination time, or null if unknown.- Throws:
ManifoldCFException
-
getDocumentRescheduleLowerBoundTime
public java.lang.Long getDocumentRescheduleLowerBoundTime(java.lang.String localIdentifier)
Find a document's lower rescheduling time bound, if any
-
getDocumentRescheduleUpperBoundTime
public java.lang.Long getDocumentRescheduleUpperBoundTime(java.lang.String localIdentifier)
Find a document's upper rescheduling time bound, if any
-
getDocumentExpirationLowerBoundTime
public java.lang.Long getDocumentExpirationLowerBoundTime(java.lang.String localIdentifier)
Find a document's lower expiration time bound, if any
-
getDocumentExpirationUpperBoundTime
public java.lang.Long getDocumentExpirationUpperBoundTime(java.lang.String localIdentifier)
Find a document's upper expiration time bound, if any
-
getDocumentOriginationTime
public java.lang.Long getDocumentOriginationTime(java.lang.String localIdentifier)
Get a document's origination time
-
calculateDocumentRescheduleTime
public java.lang.Long calculateDocumentRescheduleTime(long currentTime, long timeAmt, java.lang.String localIdentifier)
-
calculateDocumentExpireTime
public java.lang.Long calculateDocumentExpireTime(long currentTime, java.lang.String localIdentifier)
-
resetTimes
public void resetTimes()
Reset the recorded times
-
recordActivity
public void recordActivity(java.lang.Long startTime, java.lang.String activityType, java.lang.Long dataSize, java.lang.String entityIdentifier, java.lang.String resultCode, java.lang.String resultDescription, java.lang.String[] childIdentifiers) throws ManifoldCFExceptionRecord time-stamped information about the activity of the connector.- Specified by:
recordActivityin interfaceIHistoryActivity- Parameters:
startTime- is either null or the time since the start of epoch in milliseconds (Jan 1, 1970). Every activity has an associated time; the startTime field records when the activity began. A null value indicates that the start time and the finishing time are the same.activityType- is a string which is fully interpretable only in the context of the connector involved, which is used to categorize what kind of activity is being recorded. For example, a web connector might record a "fetch document" activity. Cannot be null.dataSize- is the number of bytes of data involved in the activity, or null if not applicable.entityIdentifier- is a (possibly long) string which identifies the object involved in the history record. The interpretation of this field will differ from connector to connector. May be null.resultCode- contains a terse description of the result of the activity. The description is limited in size to 255 characters, and can be interpreted only in the context of the current connector. May be null.resultDescription- is a (possibly long) human-readable string which adds detail, if required, to the result described in the resultCode field. This field is not meant to be queried on. May be null.childIdentifiers- is a set of child entity identifiers associated with this activity. May be null.- Throws:
ManifoldCFException
-
flush
public void flush() throws ManifoldCFExceptionFlush the outstanding references into the database.- Throws:
ManifoldCFException
-
processDocumentReferences
protected void processDocumentReferences() throws ManifoldCFExceptionProcess outstanding document references, in batch.- Throws:
ManifoldCFException
-
checkJobStillActive
public void checkJobStillActive() throws ManifoldCFException, ServiceInterruptionCheck whether current job is still active. This method is provided to allow an individual connector that needs to wait on some long-term condition to give up waiting due to the job itself being aborted. If the connector should abort, this method will raise a properly-formed ServiceInterruption, which if thrown to the caller, will signal that the current processing activity remains incomplete and must be retried when the job is resumed.- Specified by:
checkJobStillActivein interfaceIAbortActivity- Throws:
ManifoldCFExceptionServiceInterruption
-
beginEventSequence
public boolean beginEventSequence(java.lang.String eventName) throws ManifoldCFExceptionBegin an event sequence. This method should be called by a connector when a sequencing event should enter the "pending" state. If the event is already in that state, this method will return false, otherwise true. The connector has the responsibility of appropriately managing sequencing given the response status.- Specified by:
beginEventSequencein interfaceIEventActivity- Parameters:
eventName- is the event name.- Returns:
- false if the event is already in the "pending" state.
- Throws:
ManifoldCFException
-
completeEventSequence
public void completeEventSequence(java.lang.String eventName) throws ManifoldCFExceptionComplete an event sequence. This method should be called to signal that an event is no longer in the "pending" state. This can mean that the prerequisite processing is completed, but it can also mean that prerequisite processing was aborted or cannot be completed. Note well: This method should not be called unless the connector is CERTAIN that an event is in progress, and that the current thread has the sole right to complete it. Otherwise, race conditions can develop which would be difficult to diagnose.- Specified by:
completeEventSequencein interfaceIEventActivity- Parameters:
eventName- is the event name.- Throws:
ManifoldCFException
-
retryDocumentProcessing
public void retryDocumentProcessing(java.lang.String localIdentifier) throws ManifoldCFExceptionAbort processing a document (for sequencing reasons). This method should be called in order to cause the specified document to be requeued for later processing. While this is similar in some respects to the semantics of a ServiceInterruption, it is applicable to only one document at a time, and also does not specify any delay period, since it is presumed that the reason for the requeue is because of sequencing issues synchronized around an underlying event.- Specified by:
retryDocumentProcessingin interfaceIEventActivity- Parameters:
localIdentifier- is the document identifier to requeue- Throws:
ManifoldCFException
-
checkDateIndexable
public boolean checkDateIndexable(java.util.Date date) throws ManifoldCFException, ServiceInterruptionDetect if a date is indexable or not. This method is used by participating repository connectors to pre-filter the number of unusable documents that will be passed to this output connector.- Specified by:
checkDateIndexablein interfaceIFingerprintActivity- Parameters:
date- is the date of the document; may be null- Returns:
- true if a document with that date is indexable by this connector.
- Throws:
ManifoldCFExceptionServiceInterruption
-
checkMimeTypeIndexable
public boolean checkMimeTypeIndexable(java.lang.String mimeType) throws ManifoldCFException, ServiceInterruptionCheck whether a mime type is indexable by the currently specified output connector.- Specified by:
checkMimeTypeIndexablein interfaceIFingerprintActivity- Parameters:
mimeType- is the mime type to check, not including any character set specification.- Returns:
- true if the mime type is indexable.
- Throws:
ManifoldCFExceptionServiceInterruption
-
checkDocumentIndexable
public boolean checkDocumentIndexable(java.io.File localFile) throws ManifoldCFException, ServiceInterruptionCheck whether a document is indexable by the currently specified output connector.- Specified by:
checkDocumentIndexablein interfaceIFingerprintActivity- Parameters:
localFile- is the local copy of the file to check.- Returns:
- true if the document is indexable.
- Throws:
ManifoldCFExceptionServiceInterruption
-
checkLengthIndexable
public boolean checkLengthIndexable(long length) throws ManifoldCFException, ServiceInterruptionCheck whether a document of a specified length is indexable by the currently specified output connector.- Specified by:
checkLengthIndexablein interfaceIFingerprintActivity- Parameters:
length- is the length to check.- Returns:
- true if the document is indexable.
- Throws:
ManifoldCFExceptionServiceInterruption
-
checkURLIndexable
public boolean checkURLIndexable(java.lang.String url) throws ManifoldCFException, ServiceInterruptionPre-determine whether a document's URL is indexable by this connector. This method is used by participating repository connectors to help filter out documents that are not worth indexing.- Specified by:
checkURLIndexablein interfaceIFingerprintActivity- Parameters:
url- is the URL of the document.- Returns:
- true if the file is indexable.
- Throws:
ManifoldCFExceptionServiceInterruption
-
createGlobalString
public java.lang.String createGlobalString(java.lang.String simpleString)
Create a global string from a simple string.- Specified by:
createGlobalStringin interfaceINamingActivity- Parameters:
simpleString- is the simple string.- Returns:
- a global string.
-
createConnectionSpecificString
public java.lang.String createConnectionSpecificString(java.lang.String simpleString)
Create a connection-specific string from a simple string.- Specified by:
createConnectionSpecificStringin interfaceINamingActivity- Parameters:
simpleString- is the simple string.- Returns:
- a connection-specific string.
-
createJobSpecificString
public java.lang.String createJobSpecificString(java.lang.String simpleString)
Create a job-based string from a simple string.- Specified by:
createJobSpecificStringin interfaceINamingActivity- Parameters:
simpleString- is the simple string.- Returns:
- a job-specific string.
-
checkAllComponentsMultipleDispositions
protected void checkAllComponentsMultipleDispositions(java.lang.String documentIdentifier)
-
checkMultipleDispositions
protected void checkMultipleDispositions(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String componentIdentifierHash)
-
touchAllComponentsSet
protected void touchAllComponentsSet(java.lang.String documentIdentifier)
-
touchComponentSet
protected void touchComponentSet(java.lang.String documentIdentifier, java.lang.String componentIdentifierHash)
-
computePipelineSpecificationWithVersions
protected IPipelineSpecificationWithVersions computePipelineSpecificationWithVersions(java.lang.String documentIdentifierHash, java.lang.String componentIdentifierHash, java.lang.String documentIdentifier)
-
-