Class QueueTracker


  • public class QueueTracker
    extends java.lang.Object
    This class attempts to assign document priorities so as to achieve as much balance as possible between documents having different bins. A document's priority is assigned at the time the document is added to the queue, and is recalculated when a job is aborted or when the crawler daemon is started. Document priorities are strictly obeyed when documents are chosen from the queue and handed to worker threads; higher-priority documents always take precedence, except where the job priority specifies a deliberate priority adjustment. The priority values themselves are logarithmic: 0.0 is the highest, and the larger the number, the lower the priority.

    The calculation of each document priority handed out by this module is based on:
    - the number of documents having a given bin (tracked)
    - the performance of a connection (gathered through statistics)
    - the throttling that applies to each document bin

    The queuing prioritization model hooks into the document lifecycle in the following places:
    (1) When a document is added to the queue (and thus when its priority is handed out)
    (2) When documents that were *supposed* to be added to the queue turn out to already be there with an established priority (in which case the priority that was just handed out is returned to the pool for reuse)
    (3) When a document is pulled from the database queue (which sets the current highest priority level that should not be exceeded in step (1))

    The assignment prioritization model is largely independent of the queuing prioritization model, and is used to select among documents that have been marked "active" as they are handed to worker threads. These events cause information to be logged:
    (1) When a document is handed to a worker thread
    (2) When the worker thread completes the document
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      protected static class  QueueTracker.BinCount
      This is the class which allows a mutable integer count value to be saved in the bincount table.
    • Constructor Summary

      Constructors 
      Constructor Description
      QueueTracker()
      Constructor
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      void addRecord​(java.lang.String[] binNames)
      Add an access record to the queue tracker.
      void beginProcessing​(java.lang.String[] binNames)
      Note that we are beginning processing for a document with a particular set of bins.
      double calculateAssignmentRating​(java.lang.String[] binNames, IRepositoryConnection connection)
      Calculate an assignment rating for a set of bins based on what's currently in use.
      void endProcessing​(java.lang.String[] binNames)
      Note that we have completed processing of a document with a given set of bins.
      PerformanceStatistics getCurrentStatistics()
      Obtain the current performance statistics object.
      void noteConnectionPerformance​(int docCount, java.lang.String connectionName, long elapsedTime)
      Note the time required to successfully complete a set of documents.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • binReductionFactor

        protected static final double binReductionFactor
        Factor by which bins are reduced
        See Also:
        Constant Field Values
      • performanceStatistics

        protected final PerformanceStatistics performanceStatistics
        These are the accumulated performance averages for all connections etc.
      • queuedBinCounts

        protected final java.util.Map<java.lang.String,​QueueTracker.BinCount> queuedBinCounts
        These are the bin counts for tracking the documents that are on the active queue, but are not being processed yet
      • activeBinCounts

        protected final java.util.Map<java.lang.String,​QueueTracker.BinCount> activeBinCounts
        These are the bin counts for active threads
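        The two bin-count maps above hold a mutable counter per bin name. The following is a minimal sketch of how such a table could work, not the actual implementation: BinCountSketch and BinCountTableSketch are hypothetical names, and removing a bin when its count reaches zero is an assumption made here to keep the map small.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for QueueTracker.BinCount: a mutable integer holder. */
class BinCountSketch {
    private int value;

    BinCountSketch(int startingValue) {
        this.value = startingValue;
    }

    int getValue() {
        return value;
    }

    void increment() {
        value++;
    }

    /** Decrements the count and reports whether it has dropped to zero or below. */
    boolean decrement() {
        value--;
        return value <= 0;
    }
}

/** Hypothetical table keyed by bin name, in the spirit of queuedBinCounts and activeBinCounts. */
class BinCountTableSketch {
    private final Map<String, BinCountSketch> counts = new HashMap<>();

    /** Adds one to the count of every bin in the array. */
    synchronized void increment(String[] binNames) {
        for (String bin : binNames) {
            BinCountSketch count = counts.get(bin);
            if (count == null) {
                counts.put(bin, new BinCountSketch(1));
            } else {
                // The holder is mutated in place, so no second put() is needed.
                count.increment();
            }
        }
    }

    /** Subtracts one from every bin in the array; empty bins are dropped (assumption). */
    synchronized void decrement(String[] binNames) {
        for (String bin : binNames) {
            BinCountSketch count = counts.get(bin);
            if (count != null && count.decrement()) {
                counts.remove(bin);
            }
        }
    }

    /** Returns the current count for a bin, or 0 if the bin is not tracked. */
    synchronized int get(String bin) {
        BinCountSketch count = counts.get(bin);
        return count == null ? 0 : count.getValue();
    }
}
```

        A mutable holder lets the counter be updated without replacing the map entry on every change, which is presumably why a dedicated BinCount class exists rather than a map of Integer values.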
    • Constructor Detail

      • QueueTracker

        public QueueTracker()
        Constructor
    • Method Detail

      • addRecord

        public void addRecord​(java.lang.String[] binNames)
        Add an access record to the queue tracker. This happens when a document is added to the in-memory queue, and allows us to keep track of that particular event so we can schedule in a way that meets our distribution goals.
        Parameters:
        binNames - are the set of bins, as returned from the connector in question, for the document that is being queued. These bins are considered global in nature.
      • noteConnectionPerformance

        public void noteConnectionPerformance​(int docCount,
                                              java.lang.String connectionName,
                                              long elapsedTime)
        Note the time required to successfully complete a set of documents. This allows this module to keep track of the performance characteristics of each individual connection, so distribution across connections can be balanced properly.
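        As an illustration of the kind of bookkeeping this method enables, the sketch below accumulates per-connection totals and derives an average time per document. It is not the PerformanceStatistics implementation: the class name ConnectionPerformanceSketch, the averaging scheme, and the treatment of elapsedTime as milliseconds are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-connection performance accumulator. */
class ConnectionPerformanceSketch {
    // Per-connection totals: index 0 = documents completed, index 1 = elapsed time (assumed ms).
    private final Map<String, long[]> totals = new HashMap<>();

    /** Mirrors the shape of noteConnectionPerformance(docCount, connectionName, elapsedTime). */
    synchronized void note(int docCount, String connectionName, long elapsedTime) {
        long[] total = totals.computeIfAbsent(connectionName, name -> new long[2]);
        total[0] += docCount;
        total[1] += elapsedTime;
    }

    /** Average elapsed time per document, or 0.0 when nothing has been recorded. */
    synchronized double averageTimePerDocument(String connectionName) {
        long[] total = totals.get(connectionName);
        if (total == null || total[0] == 0L) {
            return 0.0;
        }
        return (double) total[1] / (double) total[0];
    }
}
```

        A slower connection yields a larger average, which a scheduler could use to hand that connection proportionally fewer documents.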
      • getCurrentStatistics

        public PerformanceStatistics getCurrentStatistics()
        Obtain the current performance statistics object.
      • beginProcessing

        public void beginProcessing​(java.lang.String[] binNames)
        Note that we are beginning processing for a document with a particular set of bins. This method is called when a worker thread starts work on a document.
      • endProcessing

        public void endProcessing​(java.lang.String[] binNames)
        Note that we have completed processing of a document with a given set of bins. This method is called when a worker thread has finished with a document.
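        Because beginProcessing and endProcessing bracket the work on a document, a caller would normally want the two calls paired even when processing fails. The sketch below shows one way to do that with try/finally. ProcessingBracketSketch is a hypothetical stand-in with its own simplified counters, not the QueueTracker class, and the process helper is an illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tracker showing how begin/end calls can be kept paired. */
class ProcessingBracketSketch {
    private final Map<String, Integer> active = new HashMap<>();

    synchronized void beginProcessing(String[] binNames) {
        for (String bin : binNames) {
            active.merge(bin, 1, Integer::sum);
        }
    }

    synchronized void endProcessing(String[] binNames) {
        for (String bin : binNames) {
            // Returning null from the remapping function removes the entry.
            active.computeIfPresent(bin, (name, count) -> count > 1 ? count - 1 : null);
        }
    }

    synchronized int activeCount(String bin) {
        return active.getOrDefault(bin, 0);
    }

    /** Runs the work with the bins marked active, and always releases them afterwards. */
    void process(String[] binNames, Runnable work) {
        beginProcessing(binNames);
        try {
            work.run();
        } finally {
            endProcessing(binNames);
        }
    }
}
```

        With this pattern an exception thrown by the work does not leave a bin permanently counted as active.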
      • calculateAssignmentRating

        public double calculateAssignmentRating​(java.lang.String[] binNames,
                                                IRepositoryConnection connection)
        Calculate an assignment rating for a set of bins based on what's currently in use. This rating is used to help determine which documents returned from a queuing query actually get made "active", and which ones are skipped for the moment. The rating for each bin is 1 divided by (1 plus the active thread count for that bin); the higher the rating, the better. The per-bin ratings are combined by multiplying them together and then taking the nth root (where n is the number of bins) to normalize for the number of bins, which is their geometric mean. The repository connection is used to reduce the priority of assignment, based on the fetch rate that will result from this set of bins.
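        The per-bin rating and its combination can be sketched as follows. This covers only the geometric-mean part of the description; the adjustment based on the repository connection's fetch rate is omitted because its formula is not given here. AssignmentRatingSketch, its rating method, and the plain map of active counts are hypothetical, and returning 1.0 for an empty bin list is an assumption.

```java
import java.util.Map;

/** Hypothetical illustration of the bin-rating arithmetic described above. */
class AssignmentRatingSketch {
    /**
     * Combines per-bin ratings of 1 / (1 + activeThreads) into a single value
     * by taking the nth root of their product, where n is the number of bins.
     */
    static double rating(String[] binNames, Map<String, Integer> activeCounts) {
        if (binNames.length == 0) {
            return 1.0; // Assumption: a document with no bins is unconstrained.
        }
        double product = 1.0;
        for (String bin : binNames) {
            int activeThreads = activeCounts.getOrDefault(bin, 0);
            product *= 1.0 / (1.0 + activeThreads);
        }
        return Math.pow(product, 1.0 / binNames.length);
    }
}
```

        For example, with one active thread on the first bin and three on the second, the per-bin ratings are 0.5 and 0.25, their product is 0.125, and the combined rating is the square root of 0.125, roughly 0.354. A document whose bins have no active threads rates 1.0, the best possible value.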