Class SeedingActivity
- java.lang.Object
  - org.apache.manifoldcf.crawler.system.SeedingActivity
- All Implemented Interfaces: IAbortActivity, IHistoryActivity, INamingActivity, ISeedingActivity
public class SeedingActivity extends java.lang.Object implements ISeedingActivity
This class represents the things you can do with the framework while seeding.
-
-
Field Summary
Fields
Modifier and Type                       Field
static java.lang.String                 _rcsid
protected IRepositoryConnection         connection
protected java.lang.String              connectionName
protected IRepositoryConnector          connector
protected IRepositoryConnectionManager  connManager
protected int                           documentCount
protected java.lang.String[]            documentHashList
protected java.lang.String[]            documentList
protected java.lang.String[][]          documentPrereqList
protected int                           hopcountMethod
protected java.lang.Long                jobID
protected IJobManager                   jobManager
protected java.lang.String[]            legalLinkTypes
protected static int                    MAX_COUNT
protected boolean                       overrideSchedule
protected java.lang.String              processID
protected int                           remainingDocumentCount
protected java.lang.String[]            remainingDocumentHashList
protected IReprioritizationTracker      rt
-
Fields inherited from interface org.apache.manifoldcf.crawler.interfaces.IHistoryActivity
BAD_URL, EXCLUDED_CONTENT, EXCLUDED_DATE, EXCLUDED_LENGTH, EXCLUDED_MIMETYPE, EXCLUDED_URL, NULL_URL
-
-
Constructor Summary
Constructors
SeedingActivity(java.lang.String connectionName, IRepositoryConnectionManager connManager, IJobManager jobManager, IReprioritizationTracker rt, IRepositoryConnection connection, IRepositoryConnector connector, java.lang.Long jobID, java.lang.String[] legalLinkTypes, boolean overrideSchedule, int hopcountMethod, java.lang.String processID)
    Constructor.
-
Method Summary
All Methods (all are instance methods and concrete methods)
void addSeedDocument(java.lang.String documentIdentifier)
    Record a "seed" document identifier.
void addSeedDocument(java.lang.String documentIdentifier, java.lang.String[] prereqEventNames)
    Record a "seed" document identifier.
void addUnqueuedSeedDocument(java.lang.String documentIdentifier)
    This method receives document identifiers that should be considered part of the seeds, but do not need to be queued for processing at this time.
void checkJobStillActive()
    Check whether current job is still active.
java.lang.String createConnectionSpecificString(java.lang.String simpleString)
    Create a connection-specific string from a simple string.
java.lang.String createGlobalString(java.lang.String simpleString)
    Create a global string from a simple string.
java.lang.String createJobSpecificString(java.lang.String simpleString)
    Create a job-based string from a simple string.
void doneSeeding(boolean isPartial)
    Finish a seeding pass.
void recordActivity(java.lang.Long startTime, java.lang.String activityType, java.lang.Long dataSize, java.lang.String entityIdentifier, java.lang.String resultCode, java.lang.String resultDescription, java.lang.String[] childIdentifiers)
    Record time-stamped information about the activity of the connector.
protected void writeSeedDocuments(java.lang.String[] docIDHashes, java.lang.String[] docIDs, java.lang.String[][] prereqEventNames)
    Write specified documents after calculating their priorities.
-
-
-
Field Detail
-
_rcsid
public static final java.lang.String _rcsid
- See Also:
- Constant Field Values
-
MAX_COUNT
protected static final int MAX_COUNT
- See Also:
- Constant Field Values
-
processID
protected final java.lang.String processID
-
connectionName
protected final java.lang.String connectionName
-
connManager
protected final IRepositoryConnectionManager connManager
-
jobManager
protected final IJobManager jobManager
-
rt
protected final IReprioritizationTracker rt
-
connection
protected final IRepositoryConnection connection
-
connector
protected final IRepositoryConnector connector
-
jobID
protected final java.lang.Long jobID
-
legalLinkTypes
protected final java.lang.String[] legalLinkTypes
-
overrideSchedule
protected final boolean overrideSchedule
-
hopcountMethod
protected final int hopcountMethod
-
documentHashList
protected final java.lang.String[] documentHashList
-
documentList
protected final java.lang.String[] documentList
-
documentPrereqList
protected final java.lang.String[][] documentPrereqList
-
documentCount
protected int documentCount
-
remainingDocumentHashList
protected final java.lang.String[] remainingDocumentHashList
-
remainingDocumentCount
protected int remainingDocumentCount
-
-
Constructor Detail
-
SeedingActivity
public SeedingActivity(java.lang.String connectionName, IRepositoryConnectionManager connManager, IJobManager jobManager, IReprioritizationTracker rt, IRepositoryConnection connection, IRepositoryConnector connector, java.lang.Long jobID, java.lang.String[] legalLinkTypes, boolean overrideSchedule, int hopcountMethod, java.lang.String processID)
Constructor.
-
-
Method Detail
-
addSeedDocument
public void addSeedDocument(java.lang.String documentIdentifier, java.lang.String[] prereqEventNames) throws ManifoldCFException
Record a "seed" document identifier. Seeds passed to this method will be loaded into the job's queue at the beginning of the job's execution, and for continuous crawling jobs, periodically throughout the crawl. All documents passed to this method are placed on the "pending documents" list, and are marked as being seed documents. All pending documents will be processed to determine if they have changed or have been deleted. It is not a big problem if the connector puts more documents onto the pending list than are strictly necessary; the only cost is extra overall work. It is always acceptable to send more documents to this method rather than fewer.
- Specified by: addSeedDocument in interface ISeedingActivity
- Parameters:
documentIdentifier - is the identifier of the document to add to the "pending" queue.
prereqEventNames - is the list of prerequisite events required for this document, or null if none.
- Throws: ManifoldCFException
-
addSeedDocument
public void addSeedDocument(java.lang.String documentIdentifier) throws ManifoldCFException
Record a "seed" document identifier. Seeds passed to this method will be loaded into the job's queue at the beginning of the job's execution, and for continuous crawling jobs, periodically throughout the crawl. All documents passed to this method are placed on the "pending documents" list, and are marked as being seed documents. All pending documents will be processed to determine if they have changed or have been deleted. It is not a big problem if the connector puts more documents onto the pending list than are strictly necessary; the only cost is extra overall work. It is always acceptable to send more documents to this method rather than fewer.
- Specified by: addSeedDocument in interface ISeedingActivity
- Parameters:
documentIdentifier - is the identifier of the document to add to the "pending" queue.
- Throws: ManifoldCFException
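The two overloads above can be illustrated with a minimal sketch. It assumes a hypothetical `SeedSink` interface that mirrors only the two addSeedDocument signatures, so that it compiles without ManifoldCF on the classpath; the document identifiers and the event name are invented, and the real methods additionally declare ManifoldCFException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the two addSeedDocument overloads; not the real ISeedingActivity. */
interface SeedSink {
    void addSeedDocument(String documentIdentifier);
    void addSeedDocument(String documentIdentifier, String[] prereqEventNames);
}

/** Records what was seeded so the sketch can run standalone. */
class RecordingSeedSink implements SeedSink {
    final List<String> seeded = new ArrayList<>();

    @Override
    public void addSeedDocument(String documentIdentifier) {
        // In this stand-in, the one-argument form delegates with null prerequisites.
        addSeedDocument(documentIdentifier, null);
    }

    @Override
    public void addSeedDocument(String documentIdentifier, String[] prereqEventNames) {
        String prereqs = prereqEventNames == null ? "none" : Arrays.toString(prereqEventNames);
        seeded.add(documentIdentifier + " prereqs=" + prereqs);
    }
}

class SeedingSketch {
    /** Seeds every candidate; over-seeding is acceptable, so no change filtering is done here. */
    static void seedAll(SeedSink activities, List<String> candidates) {
        for (String id : candidates) {
            activities.addSeedDocument(id);
        }
    }

    public static void main(String[] args) {
        RecordingSeedSink sink = new RecordingSeedSink();
        seedAll(sink, Arrays.asList("doc-1", "doc-2"));
        // "login-complete" is an invented prerequisite event name.
        sink.addSeedDocument("doc-3", new String[] {"login-complete"});
        sink.seeded.forEach(System.out::println);
    }
}
```

Because sending extra identifiers only costs additional work, the sketch passes every candidate through rather than trying to guess which ones have changed.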
-
addUnqueuedSeedDocument
public void addUnqueuedSeedDocument(java.lang.String documentIdentifier) throws ManifoldCFException
This method receives document identifiers that should be considered part of the seeds, but do not need to be queued for processing at this time. (This method is used to keep the hopcount tables up to date.) It is allowed to receive more identifiers than it strictly needs to, specifically identifiers that may have also been sent to the addSeedDocument() method above. However, the connector must constrain the identifiers it sends by the document specification. This method only needs to be called if the connector supports hopcount determination (which it should signal by having more than zero legal relationship types returned by the getRelationshipTypes() method).
- Specified by: addUnqueuedSeedDocument in interface ISeedingActivity
- Parameters:
documentIdentifier - is the identifier of the document to consider as a seed, but not to put in the "pending" queue.
- Throws: ManifoldCFException
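One way a connector might divide its seeds between the queued and unqueued methods is sketched below. The `HopcountSeedSink` interface, the `needsProcessing` flags, and the `supportsHopcount` switch are all assumptions made for illustration; how a real connector decides that a seed needs no processing is not specified here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in mirroring two ISeedingActivity methods; the real ones throw ManifoldCFException. */
interface HopcountSeedSink {
    void addSeedDocument(String documentIdentifier);
    void addUnqueuedSeedDocument(String documentIdentifier);
}

/** Collects identifiers into two lists so the routing can be inspected. */
class RecordingHopcountSink implements HopcountSeedSink {
    final List<String> queued = new ArrayList<>();
    final List<String> unqueued = new ArrayList<>();

    @Override
    public void addSeedDocument(String documentIdentifier) {
        queued.add(documentIdentifier);
    }

    @Override
    public void addUnqueuedSeedDocument(String documentIdentifier) {
        unqueued.add(documentIdentifier);
    }
}

class UnqueuedSeedSketch {
    /**
     * Routes each seed: those flagged true are queued for processing; the rest are
     * reported as unqueued seeds, but only when the connector supports hopcounts.
     */
    static void seed(HopcountSeedSink activities,
                     Map<String, Boolean> needsProcessing,
                     boolean supportsHopcount) {
        for (Map.Entry<String, Boolean> entry : needsProcessing.entrySet()) {
            if (entry.getValue()) {
                activities.addSeedDocument(entry.getKey());
            } else if (supportsHopcount) {
                activities.addUnqueuedSeedDocument(entry.getKey());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Boolean> seeds = new LinkedHashMap<>();
        seeds.put("root-a", true);   // invented identifiers
        seeds.put("root-b", false);
        RecordingHopcountSink sink = new RecordingHopcountSink();
        seed(sink, seeds, true);
        System.out.println("queued=" + sink.queued + " unqueued=" + sink.unqueued);
    }
}
```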
-
doneSeeding
public void doneSeeding(boolean isPartial) throws ManifoldCFException
Finish a seeding pass.
- Throws: ManifoldCFException
-
recordActivity
public void recordActivity(java.lang.Long startTime, java.lang.String activityType, java.lang.Long dataSize, java.lang.String entityIdentifier, java.lang.String resultCode, java.lang.String resultDescription, java.lang.String[] childIdentifiers) throws ManifoldCFException
Record time-stamped information about the activity of the connector.
- Specified by: recordActivity in interface IHistoryActivity
- Parameters:
startTime - is either null or the time in milliseconds since the start of the epoch (Jan 1, 1970). Every activity has an associated time; the startTime field records when the activity began. A null value indicates that the start time and the finishing time are the same.
activityType - is a string which is fully interpretable only in the context of the connector involved, which is used to categorize what kind of activity is being recorded. For example, a web connector might record a "fetch document" activity. Cannot be null.
dataSize - is the number of bytes of data involved in the activity, or null if not applicable.
entityIdentifier - is a (possibly long) string which identifies the object involved in the history record. The interpretation of this field will differ from connector to connector. May be null.
resultCode - contains a terse description of the result of the activity. The description is limited in size to 255 characters, and can be interpreted only in the context of the current connector. May be null.
resultDescription - is a (possibly long) human-readable string which adds detail, if required, to the result described in the resultCode field. This field is not meant to be queried on. May be null.
childIdentifiers - is a set of child entity identifiers associated with this activity. May be null.
- Throws: ManifoldCFException
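The parameter rules above can be sketched as a small helper that assembles a history record. `HistorySink` is a hypothetical stand-in for the recordActivity signature, and truncating an over-long result code on the caller's side is an assumption of this sketch, not documented framework behavior.

```java
/** Hypothetical stand-in for IHistoryActivity.recordActivity; the real method throws ManifoldCFException. */
interface HistorySink {
    void recordActivity(Long startTime, String activityType, Long dataSize,
                        String entityIdentifier, String resultCode,
                        String resultDescription, String[] childIdentifiers);
}

class HistorySketch {
    /** The documented size limit for resultCode. */
    static final int MAX_RESULT_CODE_LENGTH = 255;

    /** Caller-side guard (an assumption of this sketch): cut result codes down to the limit. */
    static String clampResultCode(String resultCode) {
        if (resultCode == null || resultCode.length() <= MAX_RESULT_CODE_LENGTH) {
            return resultCode;
        }
        return resultCode.substring(0, MAX_RESULT_CODE_LENGTH);
    }

    /** Records one "fetch document" activity; childIdentifiers is null because none apply. */
    static void recordFetch(HistorySink activities, String entityIdentifier,
                            long startMillis, long bytes,
                            String resultCode, String resultDescription) {
        activities.recordActivity(
            Long.valueOf(startMillis),      // start time in epoch milliseconds
            "fetch document",               // activityType: must not be null
            Long.valueOf(bytes),            // dataSize in bytes
            entityIdentifier,
            clampResultCode(resultCode),
            resultDescription,
            null);
    }

    public static void main(String[] args) {
        HistorySink printer = (start, type, size, entity, code, desc, children) ->
            System.out.println(type + " " + entity + " " + size + " bytes -> " + code);
        long start = System.currentTimeMillis();
        recordFetch(printer, "http://example.com/a", start, 1024L, "OK", null);
    }
}
```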
-
writeSeedDocuments
protected void writeSeedDocuments(java.lang.String[] docIDHashes, java.lang.String[] docIDs, java.lang.String[][] prereqEventNames) throws ManifoldCFException
Write specified documents after calculating their priorities.
- Throws: ManifoldCFException
-
checkJobStillActive
public void checkJobStillActive() throws ManifoldCFException, ServiceInterruption
Check whether the current job is still active. This method allows an individual connector that needs to wait on some long-term condition to give up waiting when the job itself is aborted. If the connector should abort, this method will raise a properly-formed ServiceInterruption, which, if thrown to the caller, will signal that the current seeding activity remains incomplete and must be retried when the job is resumed.
- Specified by: checkJobStillActive in interface IAbortActivity
- Throws: ManifoldCFException, ServiceInterruption
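The waiting pattern described above might look like the following sketch. `AbortCheck` and `JobAbortedSignal` are hypothetical stand-ins for IAbortActivity and ServiceInterruption; the stand-in exception is unchecked to keep the example short, whereas the real method declares its exceptions in its throws clause. The poll interval and poll limit are invented parameters.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for ServiceInterruption (unchecked here for brevity). */
class JobAbortedSignal extends RuntimeException {
    JobAbortedSignal(String message) {
        super(message);
    }
}

/** Hypothetical stand-in for IAbortActivity. */
interface AbortCheck {
    void checkJobStillActive();
}

class WaitSketch {
    /**
     * Polls a long-running condition, checking before every poll whether the job is
     * still active. Returns true once the condition holds, false if the poll limit is
     * reached or the thread is interrupted. An abort signal from the check propagates
     * to the caller, which is how the wait gets abandoned.
     */
    static boolean waitFor(BooleanSupplier ready, AbortCheck activities,
                           long pollMillis, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            activities.checkJobStillActive();
            if (ready.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long deadline = System.currentTimeMillis() + 50;
        boolean ok = waitFor(() -> System.currentTimeMillis() >= deadline, () -> { }, 10L, 100);
        System.out.println("condition met: " + ok);
    }
}
```

Letting the signal propagate, rather than catching it inside the loop, matches the documented intent that the exception reaches the caller so the seeding pass can be retried later.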
-
createGlobalString
public java.lang.String createGlobalString(java.lang.String simpleString)
Create a global string from a simple string.
- Specified by: createGlobalString in interface INamingActivity
- Parameters:
simpleString - is the simple string.
- Returns: a global string.
-
createConnectionSpecificString
public java.lang.String createConnectionSpecificString(java.lang.String simpleString)
Create a connection-specific string from a simple string.
- Specified by: createConnectionSpecificString in interface INamingActivity
- Parameters:
simpleString - is the simple string.
- Returns: a connection-specific string.
-
createJobSpecificString
public java.lang.String createJobSpecificString(java.lang.String simpleString)
Create a job-specific string from a simple string.
- Specified by: createJobSpecificString in interface INamingActivity
- Parameters:
simpleString - is the simple string.
- Returns: a job-specific string.
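The three naming methods differ only in scope, which the sketch below tries to make concrete. The `NamingSketch` class and its `G:`, `C:`, and `J:` prefix formats are invented for illustration; the strings the framework actually produces are not documented here and may look different.

```java
/** Hypothetical illustration of the three naming scopes; the prefix formats are invented. */
class NamingSketch {
    private final String connectionName;
    private final Long jobID;

    NamingSketch(String connectionName, Long jobID) {
        this.connectionName = connectionName;
        this.jobID = jobID;
    }

    /** Same result for every connection and job. */
    String createGlobalString(String simpleString) {
        return "G:" + simpleString;
    }

    /** Differs between connections, shared by jobs on the same connection. */
    String createConnectionSpecificString(String simpleString) {
        return "C:" + connectionName + ":" + simpleString;
    }

    /** Differs between jobs, even on the same connection. */
    String createJobSpecificString(String simpleString) {
        return "J:" + jobID + ":" + simpleString;
    }

    public static void main(String[] args) {
        NamingSketch jobOne = new NamingSketch("web", 1L);
        NamingSketch jobTwo = new NamingSketch("web", 2L);
        System.out.println(jobOne.createGlobalString("login"));
        System.out.println(jobOne.createConnectionSpecificString("login"));
        System.out.println(jobOne.createJobSpecificString("login"));
        System.out.println(jobTwo.createJobSpecificString("login"));
    }
}
```

A plausible use, given the prereqEventNames parameter of addSeedDocument, is to derive event names that do not collide across jobs or connections, though that connection is an inference rather than something stated on this page.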
-
-