Class QueueTracker
- java.lang.Object
  - org.apache.manifoldcf.crawler.interfaces.QueueTracker

public class QueueTracker extends java.lang.Object

This class attempts to provide document priorities in order to achieve as much balance as possible between documents having different bins. A document's priority is assigned at the time the document is added to the queue, and is recalculated when a job is aborted or when the crawler daemon is started.

The document priorities are strictly obeyed when documents are chosen from the queue and handed to worker threads; higher-priority documents always have precedence, except where the job priority specifies a deliberate priority adjustment. The priority values themselves are logarithmic: 0.0 is the highest, and the larger the number, the lower the priority.

The calculation of each document priority handed out by this module is based on:
- number of documents having a given bin (tracked)
- performance of a connection (gathered through statistics)
- throttling that applies to each document bin

The queuing prioritization model hooks into the document lifecycle in the following places:
(1) When a document is added to the queue (and thus when its priority is handed out)
(2) When documents that were *supposed* to be added to the queue turn out to already be there and already have an established priority (in which case the priority that was handed out before is returned to the pool for reuse)
(3) When a document is pulled from the database queue (which sets the current highest priority level that should not be exceeded in step (1))

The assignment prioritization model is largely independent of the queuing prioritization model, and is used to select among documents that have been marked "active" as they are handed to worker threads. These events cause information to be logged:
(1) When a document is handed to a worker thread
(2) When the worker thread completes the document
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
- protected static class QueueTracker.BinCount: This is the class which allows a mutable integer count value to be saved in the bincount table.

Field Summary
Fields (Modifier and Type, Field, Description):
- static java.lang.String _rcsid
- protected java.util.Map<java.lang.String,QueueTracker.BinCount> activeBinCounts: These are the bin counts for active threads
- protected static double binReductionFactor: Factor by which bins are reduced
- protected PerformanceStatistics performanceStatistics: These are the accumulated performance averages for all connections etc.
- protected java.util.Map<java.lang.String,QueueTracker.BinCount> queuedBinCounts: These are the bin counts for tracking the documents that are on the active queue, but are not being processed yet

Constructor Summary
Constructors (Constructor, Description):
- QueueTracker(): Constructor

Method Summary
Methods (Modifier and Type, Method, Description):
- void addRecord(java.lang.String[] binNames): Add an access record to the queue tracker.
- void beginProcessing(java.lang.String[] binNames): Note that we are beginning processing for a document with a particular set of bins.
- double calculateAssignmentRating(java.lang.String[] binNames, IRepositoryConnection connection): Calculate an assignment rating for a set of bins based on what's currently in use.
- void endProcessing(java.lang.String[] binNames): Note that we have completed processing of a document with a given set of bins.
- PerformanceStatistics getCurrentStatistics(): Obtain the current performance statistics object
- void noteConnectionPerformance(int docCount, java.lang.String connectionName, long elapsedTime): Note the time required to successfully complete a set of documents.
 
Field Detail

_rcsid
public static final java.lang.String _rcsid
See Also: Constant Field Values

binReductionFactor
protected static final double binReductionFactor
Factor by which bins are reduced
See Also: Constant Field Values

performanceStatistics
protected final PerformanceStatistics performanceStatistics
These are the accumulated performance averages for all connections etc.

queuedBinCounts
protected final java.util.Map<java.lang.String,QueueTracker.BinCount> queuedBinCounts
These are the bin counts for tracking the documents that are on the active queue, but are not being processed yet

activeBinCounts
protected final java.util.Map<java.lang.String,QueueTracker.BinCount> activeBinCounts
These are the bin counts for active threads
 
Method Detail

addRecord
public void addRecord(java.lang.String[] binNames)
Add an access record to the queue tracker. This happens when a document is added to the in-memory queue, and allows us to keep track of that particular event so we can schedule in a way that meets our distribution goals.
Parameters:
- binNames - the set of bins, as returned from the connector in question, for the document that is being queued. These bins are considered global in nature.

noteConnectionPerformance
public void noteConnectionPerformance(int docCount, java.lang.String connectionName, long elapsedTime)
Note the time required to successfully complete a set of documents. This allows this module to keep track of the performance characteristics of each individual connection, so distribution across connections can be balanced properly.

getCurrentStatistics
public PerformanceStatistics getCurrentStatistics()
Obtain the current performance statistics object

beginProcessing
public void beginProcessing(java.lang.String[] binNames)
Note that we are beginning processing for a document with a particular set of bins. This method is called when a worker thread starts work on a set of documents.

endProcessing
public void endProcessing(java.lang.String[] binNames)
Note that we have completed processing of a document with a given set of bins. This method is called when a worker thread has finished with a document.

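The bookkeeping described for addRecord, beginProcessing, and endProcessing can be pictured with a minimal sketch. It assumes that queuing a document increments a per-bin queued count, that starting work moves the count from queued to active, and that finishing decrements the active count; the page does not spell this out, so treat it as an illustration rather than the ManifoldCF implementation. The class BinCountSketch, its Count type, and the queuedCount/activeCount accessors are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; not the actual QueueTracker code. */
final class BinCountSketch {
    /** Mutable counter, playing a role similar to QueueTracker.BinCount. */
    static final class Count {
        int value;
    }

    // Hypothetical stand-ins for queuedBinCounts and activeBinCounts.
    private final Map<String, Count> queued = new HashMap<>();
    private final Map<String, Count> active = new HashMap<>();

    /** A document with these bins was added to the in-memory queue. */
    synchronized void addRecord(String[] binNames) {
        for (String bin : binNames) {
            increment(queued, bin);
        }
    }

    /** Assumption: starting work moves each bin from queued to active. */
    synchronized void beginProcessing(String[] binNames) {
        for (String bin : binNames) {
            decrement(queued, bin);
            increment(active, bin);
        }
    }

    /** A worker thread finished a document with these bins. */
    synchronized void endProcessing(String[] binNames) {
        for (String bin : binNames) {
            decrement(active, bin);
        }
    }

    synchronized int queuedCount(String bin) {
        Count c = queued.get(bin);
        return c == null ? 0 : c.value;
    }

    synchronized int activeCount(String bin) {
        Count c = active.get(bin);
        return c == null ? 0 : c.value;
    }

    private static void increment(Map<String, Count> counts, String bin) {
        counts.computeIfAbsent(bin, k -> new Count()).value++;
    }

    private static void decrement(Map<String, Count> counts, String bin) {
        Count c = counts.get(bin);
        if (c == null) {
            return; // nothing recorded for this bin
        }
        if (--c.value <= 0) {
            counts.remove(bin); // drop empty entries so the map stays small
        }
    }
}
```

Storing a mutable counter object as the map value avoids re-boxing an Integer on every update, which is presumably why a BinCount class exists at all.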
calculateAssignmentRating
public double calculateAssignmentRating(java.lang.String[] binNames, IRepositoryConnection connection)
Calculate an assignment rating for a set of bins based on what's currently in use. This rating is used to help determine which documents returned from a queuing query actually get made "active", and which ones are skipped for the moment. The rating for each bin is 1 divided by (1 plus the active thread count for that bin). The higher the rating, the better. The per-bin ratings are combined by multiplying them together and then taking the nth root (where n is the number of bins) to normalize for the number of bins. The repository connection is used to reduce the priority of assignment, based on the fetch rate that will result from this set of bins.
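The combination rule described above is a geometric mean of the per-bin ratings. The following sketch shows only that arithmetic; it deliberately omits the reduction based on the repository connection's fetch rate, whose details are not given here. The class AssignmentRatingSketch, the rating method, and the plain Map of active counts are hypothetical stand-ins, not the library's API.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; not the actual calculateAssignmentRating code. */
final class AssignmentRatingSketch {
    /**
     * Geometric mean of 1 / (1 + activeCount) over the given bins.
     * activeCounts is a hypothetical stand-in for activeBinCounts.
     */
    static double rating(String[] binNames, Map<String, Integer> activeCounts) {
        if (binNames.length == 0) {
            return 1.0; // assumption: with no bins, nothing competes
        }
        double product = 1.0;
        for (String bin : binNames) {
            int activeThreads = activeCounts.getOrDefault(bin, 0);
            product *= 1.0 / (1.0 + activeThreads);
        }
        // The nth root normalizes for the number of bins.
        return Math.pow(product, 1.0 / binNames.length);
    }

    public static void main(String[] args) {
        Map<String, Integer> active = new HashMap<>();
        active.put("host-a", 3); // three threads currently active on host-a
        System.out.println(rating(new String[] {"host-a"}, active));
        System.out.println(rating(new String[] {"host-a", "host-b"}, active));
    }
}
```

With three active threads on one bin and none on the other, the single-bin rating is 1/4, and the two-bin rating is the square root of 1/4 times 1. Without the nth root, a document with many bins would always rate lower than one with few, regardless of how busy those bins are.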
 