Class QueueTracker
- java.lang.Object
-
- org.apache.manifoldcf.crawler.interfaces.QueueTracker
-
public class QueueTracker extends java.lang.Object
This class attempts to assign document priorities so as to achieve as much balance as possible between documents having different bins. A document's priority is assigned when the document is added to the queue, and is recalculated when a job is aborted or when the crawler daemon is started. Document priorities are strictly obeyed when documents are chosen from the queue and handed to worker threads; higher-priority documents always take precedence, except for deliberate priority adjustments specified by the job priority.
The priority values themselves are logarithmic: 0.0 is the highest, and the larger the number, the lower the priority.
The calculation of each document priority handed out by this module is based on:
- the number of documents having a given bin (tracked)
- the performance of a connection (gathered through statistics)
- the throttling that applies to each document bin
The queuing prioritization model hooks into the document lifecycle in the following places:
(1) When a document is added to the queue (and thus when its priority is handed out)
(2) When documents that were *supposed* to be added to the queue turn out to already be there and already have an established priority (in which case the priority that was handed out before is returned to the pool for reuse)
(3) When a document is pulled from the database queue (which sets the current highest priority level that should not be exceeded in step (1))
The assignment prioritization model is largely independent of the queuing prioritization model, and is used to select among documents that have been marked "active" as they are handed to worker threads. These events cause information to be logged:
(1) When a document is handed to a worker thread
(2) When the worker thread completes the document
-
-
Nested Class Summary
Nested Classes (modifier and type, class, description):
protected static class QueueTracker.BinCount
This is the class which allows a mutable integer count value to be saved in the bincount table.
-
Field Summary
Fields (modifier and type, field, description):
static java.lang.String _rcsid
protected java.util.Map<java.lang.String,QueueTracker.BinCount> activeBinCounts
These are the bin counts for active threads
protected static double binReductionFactor
Factor by which bins are reduced
protected PerformanceStatistics performanceStatistics
These are the accumulated performance averages for all connections etc.
protected java.util.Map<java.lang.String,QueueTracker.BinCount> queuedBinCounts
These are the bin counts for tracking the documents that are on the active queue, but are not being processed yet
-
Constructor Summary
Constructors (constructor, description):
QueueTracker()
Constructor
-
Method Summary
Methods (modifier and type, method, description):
void addRecord(java.lang.String[] binNames)
Add an access record to the queue tracker.
void beginProcessing(java.lang.String[] binNames)
Note that we are beginning processing for a document with a particular set of bins.
double calculateAssignmentRating(java.lang.String[] binNames, IRepositoryConnection connection)
Calculate an assignment rating for a set of bins based on what's currently in use.
void endProcessing(java.lang.String[] binNames)
Note that we have completed processing of a document with a given set of bins.
PerformanceStatistics getCurrentStatistics()
Obtain the current performance statistics object
void noteConnectionPerformance(int docCount, java.lang.String connectionName, long elapsedTime)
Note the time required to successfully complete a set of documents.
-
-
-
Field Detail
-
_rcsid
public static final java.lang.String _rcsid
- See Also:
- Constant Field Values
-
binReductionFactor
protected static final double binReductionFactor
Factor by which bins are reduced
- See Also:
- Constant Field Values
-
performanceStatistics
protected final PerformanceStatistics performanceStatistics
These are the accumulated performance averages for all connections etc.
-
queuedBinCounts
protected final java.util.Map<java.lang.String,QueueTracker.BinCount> queuedBinCounts
These are the bin counts for tracking the documents that are on the active queue, but are not being processed yet
-
activeBinCounts
protected final java.util.Map<java.lang.String,QueueTracker.BinCount> activeBinCounts
These are the bin counts for active threads
-
-
Method Detail
-
addRecord
public void addRecord(java.lang.String[] binNames)
Add an access record to the queue tracker. This happens when a document is added to the in-memory queue, and allows us to keep track of that particular event so we can schedule in a way that meets our distribution goals.
- Parameters:
binNames - the set of bins, as returned from the connector in question, for the document that is being queued. These bins are considered global in nature.
-
noteConnectionPerformance
public void noteConnectionPerformance(int docCount, java.lang.String connectionName, long elapsedTime)
Note the time required to successfully complete a set of documents. This allows this module to keep track of the performance characteristics of each individual connection, so distribution across connections can be balanced properly.
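The bookkeeping described above can be illustrated with a minimal sketch. It is not the ManifoldCF implementation: the class ConnectionPerformanceSketch, its Totals helper, and averageMillisPerDocument are hypothetical names, and it assumes that elapsedTime is given in milliseconds and that a plain running total per connection is an adequate stand-in for whatever averaging PerformanceStatistics actually performs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-connection performance tracking.
// Not the ManifoldCF PerformanceStatistics class.
final class ConnectionPerformanceSketch {
    // Running totals for one connection (assumed representation).
    private static final class Totals {
        long documents;
        long elapsedMillis;
    }

    private final Map<String, Totals> byConnection = new HashMap<>();

    // Same parameter shape as QueueTracker.noteConnectionPerformance;
    // elapsedTime is assumed to be in milliseconds.
    synchronized void noteConnectionPerformance(int docCount, String connectionName, long elapsedTime) {
        Totals totals = byConnection.computeIfAbsent(connectionName, k -> new Totals());
        totals.documents += docCount;
        totals.elapsedMillis += elapsedTime;
    }

    // Average time per document for a connection, or 0.0 if nothing was recorded.
    synchronized double averageMillisPerDocument(String connectionName) {
        Totals totals = byConnection.get(connectionName);
        if (totals == null || totals.documents == 0) {
            return 0.0;
        }
        return (double) totals.elapsedMillis / (double) totals.documents;
    }
}
```

A scheduler could compare these per-connection averages when deciding how to spread work across connections, which is the balancing goal the method description mentions.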
-
getCurrentStatistics
public PerformanceStatistics getCurrentStatistics()
Obtain the current performance statistics object
-
beginProcessing
public void beginProcessing(java.lang.String[] binNames)
Note that we are beginning processing for a document with a particular set of bins. This method is called when a worker thread starts work on a set of documents.
-
endProcessing
public void endProcessing(java.lang.String[] binNames)
Note that we have completed processing of a document with a given set of bins. This method is called when a worker thread has finished with a document.
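How beginProcessing and endProcessing might maintain per-bin active counts can be sketched as follows. This is an illustration under assumptions, not the actual QueueTracker code: the class ActiveBinCountSketch, its Count helper (loosely modeled on the idea of a mutable integer like QueueTracker.BinCount), and the activeCount accessor are hypothetical, and removing a bin when its count reaches zero is a design choice of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of active-thread bin counting.
// Not the ManifoldCF QueueTracker implementation.
final class ActiveBinCountSketch {
    // Mutable integer holder, so the map entry can be updated in place.
    private static final class Count {
        int value;
    }

    // Stands in for the activeBinCounts field.
    private final Map<String, Count> activeBinCounts = new HashMap<>();

    // A worker thread starts work: bump the count of every bin involved.
    synchronized void beginProcessing(String[] binNames) {
        for (String bin : binNames) {
            activeBinCounts.computeIfAbsent(bin, k -> new Count()).value++;
        }
    }

    // A worker thread finishes: decrement, dropping bins that reach zero.
    synchronized void endProcessing(String[] binNames) {
        for (String bin : binNames) {
            Count count = activeBinCounts.get(bin);
            if (count == null) {
                continue; // unmatched end; ignored in this sketch
            }
            count.value--;
            if (count.value <= 0) {
                activeBinCounts.remove(bin);
            }
        }
    }

    // Number of worker threads currently active in the given bin.
    synchronized int activeCount(String bin) {
        Count count = activeBinCounts.get(bin);
        return count == null ? 0 : count.value;
    }
}
```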
-
calculateAssignmentRating
public double calculateAssignmentRating(java.lang.String[] binNames, IRepositoryConnection connection)
Calculate an assignment rating for a set of bins based on what's currently in use. This rating is used to help determine which documents returned from a queuing query actually get made "active", and which ones are skipped for the moment. The rating for each bin is 1 divided by (1 plus the active thread count for that bin). The higher the rating, the better. The per-bin ratings are combined by multiplying them together and then taking the nth root (where n is the number of bins) to normalize for the number of bins. The repository connection is used to reduce the priority of assignment, based on the fetch rate that will result from this set of bins.
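The per-bin rating and its combination amount to a geometric mean of 1 / (1 + active count) over the bins. The sketch below illustrates only that part. The class AssignmentRatingSketch and its methods are hypothetical, the adjustment based on the repository connection's fetch rate is deliberately left out because its details are not given here, and returning 1.0 for an empty bin set is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the rating formula described above.
// Not the ManifoldCF implementation; the connection-based reduction is omitted.
final class AssignmentRatingSketch {
    // Active thread count per bin (stands in for activeBinCounts).
    private final Map<String, Integer> activeCounts = new HashMap<>();

    // Record that a worker thread is now active in each of these bins.
    void beginProcessing(String[] binNames) {
        for (String bin : binNames) {
            activeCounts.merge(bin, 1, Integer::sum);
        }
    }

    // Geometric mean of 1 / (1 + active count) across the bins.
    double rating(String[] binNames) {
        if (binNames.length == 0) {
            return 1.0; // assumption: no bins means nothing is contended
        }
        double product = 1.0;
        for (String bin : binNames) {
            int active = activeCounts.getOrDefault(bin, 0);
            product *= 1.0 / (1.0 + active);
        }
        // The nth root normalizes for the number of bins.
        return Math.pow(product, 1.0 / binNames.length);
    }
}
```

For example, with three active threads in bin "a" and none in bin "b", the per-bin ratings are 1/4 and 1, their product is 1/4, and the square root gives a combined rating of 0.5.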
-
-