Interface IProcessActivity

  • All Superinterfaces:
    IAbortActivity, ICarrydownActivity, IEventActivity, IFingerprintActivity, IHistoryActivity, INamingActivity
    All Known Implementing Classes:
    WorkerThread.ProcessActivity

    public interface IProcessActivity
    extends IHistoryActivity, IEventActivity, IAbortActivity, IFingerprintActivity, ICarrydownActivity
    This interface abstracts the activities that a connector's processDocuments() method can perform. The processing flow for a document is expected to go something like this:
    (1) The connector's processDocuments() method is called with a set of documents to be processed.
    (2) The connector computes a version string for each document in the set, as part of determining whether the document actually needs to be refetched.
    (3) For each document processed, there can be one of several dispositions:
        (a) There is no such document (anymore): deleteDocument() is called for the document.
        (b) The document is (re)indexed: ingestDocumentWithException() is called for the document.
        (c) The document is determined to be unchanged and no updates are needed: nothing needs to be called for the document.
        (d) The document is determined to be unchanged BUT the version string needs to be updated: recordDocument() is called for the document.
        (e) The document is determined to be unindexable BUT it still exists in the repository: noDocument() is called for the document.
        (f) There was a service interruption: ServiceInterruption is thrown.
    (4) To determine whether a document needs to be reindexed, the method checkDocumentNeedsReindexing() is available to return an opinion on that matter.
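    The flow above can be sketched with an in-memory stand-in. Everything in this example is hypothetical: Activities is a cut-down imitation of this interface (its ingest method takes a plain String instead of a RepositoryDocument), Entry and RecordingActivities are invented for illustration, and the hash-based version string is only one possible choice. Dispositions (d) and (f) are omitted for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ProcessFlowSketch {
    /** Hypothetical stand-in for a subset of IProcessActivity (not the real interface). */
    interface Activities {
        boolean checkDocumentNeedsReindexing(String documentIdentifier, String newVersionString);
        void ingestDocumentWithException(String documentIdentifier, String version, String documentURI, String data);
        void noDocument(String documentIdentifier, String version);
        void deleteDocument(String documentIdentifier);
    }

    /** Hypothetical repository entry: content plus a flag saying whether it may be indexed. */
    static final class Entry {
        final String content;
        final boolean indexable;
        Entry(String content, boolean indexable) { this.content = content; this.indexable = indexable; }
    }

    /** In-memory fake that remembers recorded versions and logs every call. */
    static final class RecordingActivities implements Activities {
        final Map<String, String> recordedVersions = new HashMap<>();
        final List<String> log = new ArrayList<>();
        public boolean checkDocumentNeedsReindexing(String id, String newVersion) {
            return !newVersion.equals(recordedVersions.get(id)); // assumed comparison rule
        }
        public void ingestDocumentWithException(String id, String version, String uri, String data) {
            recordedVersions.put(id, version);
            log.add("ingest " + id);
        }
        public void noDocument(String id, String version) {
            recordedVersions.put(id, version);
            log.add("noDocument " + id);
        }
        public void deleteDocument(String id) {
            recordedVersions.remove(id);
            log.add("delete " + id);
        }
    }

    /** Step (1): called with a set of documents; steps (2) and (3) run per document. */
    static void processDocuments(String[] ids, Map<String, Entry> repository, Activities activities) {
        for (String id : ids) {
            Entry entry = repository.get(id);
            if (entry == null) { activities.deleteDocument(id); continue; }              // (a)
            String version = Integer.toHexString(entry.content.hashCode());               // (2)
            if (!activities.checkDocumentNeedsReindexing(id, version)) continue;          // (4), (c)
            if (!entry.indexable) { activities.noDocument(id, version); continue; }       // (e)
            activities.ingestDocumentWithException(id, version,
                "http://example.invalid/" + id, entry.content);                           // (b)
        }
    }

    /** Runs two crawls over the same repository and returns the call log. */
    static List<String> demo() {
        Map<String, Entry> repository = new HashMap<>();
        repository.put("a", new Entry("hello", true));
        repository.put("b", new Entry("secret", false));
        RecordingActivities activities = new RecordingActivities();
        String[] ids = {"a", "b", "c"};
        processDocuments(ids, repository, activities);
        processDocuments(ids, repository, activities);
        return activities.log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

    In this toy model the second crawl makes no call for "a" or "b", because their versions were recorded on the first crawl, while the missing document "c" is deleted again.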
    • Method Summary

      Modifier and Type Method Description
      void addDocumentReference(java.lang.String documentIdentifier)
      Add a document description to the current job's queue.
      void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType)
      Add a document description to the current job's queue.
      void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues)
      Add a document description to the current job's queue.
      void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime)
      Add a document description to the current job's queue.
      void addDocumentReference(java.lang.String documentIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime, java.lang.String[] prereqEventNames)
      Add a document description to the current job's queue.
      boolean checkDocumentNeedsReindexing(java.lang.String documentIdentifier, java.lang.String newVersionString)
      Check if a document needs to be reindexed, based on a computed version string.
      boolean checkDocumentNeedsReindexing(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String newVersionString)
      Check if a document needs to be reindexed, based on a computed version string.
      void deleteDocument(java.lang.String documentIdentifier)
      Delete the specified document permanently from the search engine index, and from the status table, along with all its components.
      void ingestDocumentWithException(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data)
      Ingest the current document.
      void ingestDocumentWithException(java.lang.String documentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data)
      Ingest the current document.
      void noDocument(java.lang.String documentIdentifier, java.lang.String version)
      Remove the specified document from the search engine index, and update the recorded version information for the document.
      void noDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version)
      Remove the specified document from the search engine index, and update the recorded version information for the document.
      void recordDocument(java.lang.String documentIdentifier, java.lang.String version)
      Record a document version, WITHOUT reindexing it, or removing it.
      void recordDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier, java.lang.String version)
      Record a document version, WITHOUT reindexing it, or removing it.
      void removeDocument(java.lang.String documentIdentifier)
      Remove the specified document primary component permanently from the search engine index, and from the status table.
      void retainAllComponentDocument(java.lang.String documentIdentifier)
      Retain all existing document components of a primary document.
      void retainDocument(java.lang.String documentIdentifier, java.lang.String componentIdentifier)
      Retain existing document component.
      void setDocumentOriginationTime(java.lang.String documentIdentifier, java.lang.Long originationTime)
      Override a document's origination time.
      void setDocumentScheduleBounds(java.lang.String documentIdentifier, java.lang.Long lowerRecrawlBoundTime, java.lang.Long upperRecrawlBoundTime, java.lang.Long lowerExpireBoundTime, java.lang.Long upperExpireBoundTime)
      Override the schedule for the next time a document is crawled.
    • Method Detail

      • checkDocumentNeedsReindexing

        boolean checkDocumentNeedsReindexing(java.lang.String documentIdentifier,
                                             java.lang.String newVersionString)
                                      throws ManifoldCFException
        Check if a document needs to be reindexed, based on a computed version string. Call this method to determine whether reindexing is necessary. Pass in a newly-computed version string. This method will return "true" if the document needs to be re-indexed.
        Parameters:
        documentIdentifier - is the document identifier.
        newVersionString - is the newly-computed version string.
        Returns:
        true if the document needs to be reindexed.
        Throws:
        ManifoldCFException
      • checkDocumentNeedsReindexing

        boolean checkDocumentNeedsReindexing(java.lang.String documentIdentifier,
                                             java.lang.String componentIdentifier,
                                             java.lang.String newVersionString)
                                      throws ManifoldCFException
        Check if a document needs to be reindexed, based on a computed version string. Call this method to determine whether reindexing is necessary. Pass in a newly-computed version string. This method will return "true" if the document needs to be re-indexed.
        Parameters:
        documentIdentifier - is the document identifier.
        componentIdentifier - is the component document identifier, if any.
        newVersionString - is the newly-computed version string.
        Returns:
        true if the document needs to be reindexed.
        Throws:
        ManifoldCFException
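        As a rough illustration of the two overloads, the sketch below keeps one recorded version per (document, component) pair and reports a document as needing reindexing when the new string differs. The class, the key scheme, and the "lastModified:size" version format are all hypothetical; the comparison rule and the treatment of an empty version string (taken from the description of the version parameter of ingestDocumentWithException()) are assumptions about behavior, not the framework's code.

```java
import java.util.HashMap;
import java.util.Map;

final class VersionCheckSketch {
    // Hypothetical store of the version last recorded for each (document, component) pair.
    private final Map<String, String> recorded = new HashMap<>();

    // In this sketch a null component identifier stands for the primary document.
    private static String key(String documentIdentifier, String componentIdentifier) {
        return documentIdentifier + "\u0000" + (componentIdentifier == null ? "" : componentIdentifier);
    }

    /** Remember the version that was current when the document was last handled. */
    void record(String documentIdentifier, String componentIdentifier, String version) {
        recorded.put(key(documentIdentifier, componentIdentifier), version);
    }

    /** Assumed rule: reindex when the version is empty or differs from the recorded one. */
    boolean needsReindexing(String documentIdentifier, String componentIdentifier, String newVersionString) {
        if (newVersionString.isEmpty()) {
            return true; // no calculable version: always index
        }
        return !newVersionString.equals(recorded.get(key(documentIdentifier, componentIdentifier)));
    }

    /** Hypothetical version string built from cheap repository metadata. */
    static String versionString(long lastModifiedMillis, long sizeBytes) {
        return lastModifiedMillis + ":" + sizeBytes;
    }
}
```

        A connector would typically compute such a string from metadata it can obtain without fetching the full document, so that unchanged documents can be skipped cheaply.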
      • addDocumentReference

        void addDocumentReference(java.lang.String documentIdentifier,
                                  java.lang.String parentIdentifier,
                                  java.lang.String relationshipType,
                                  java.lang.String[] dataNames,
                                  java.lang.Object[][] dataValues,
                                  java.lang.Long originationTime,
                                  java.lang.String[] prereqEventNames)
                           throws ManifoldCFException
        Add a document description to the current job's queue.
        Parameters:
        documentIdentifier - is the local document identifier to add (for the connector that fetched the document).
        parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering desired for this kind of relationship. MUST be present in the case of carrydown information.
        relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
        dataNames - is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters!
        dataValues - are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null. The type of each object must either be a String, or a CharacterInput.
        originationTime - is the time, in ms since epoch, that the document originated. Pass null if none or unknown.
        prereqEventNames - are the names of the prerequisite events which this document requires prior to processing. Pass null if none.
        Throws:
        ManifoldCFException
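        The dataNames and dataValues parameters can be read as parallel arrays, where dataValues[i] holds the values for dataNames[i]. That reading is an assumption based on the parameter types and descriptions above. The helper below is a hypothetical sketch (the class, the Carrydown holder, and toArrays are not part of the API) that builds both arrays from a map and enforces the 255-character name limit.

```java
import java.util.List;
import java.util.Map;

final class CarrydownArraysSketch {
    static final int MAX_NAME_LENGTH = 255; // limit stated for each carry-down name

    /** Hypothetical holder for the two parallel arrays. */
    static final class Carrydown {
        final String[] dataNames;
        final Object[][] dataValues;
        Carrydown(String[] dataNames, Object[][] dataValues) {
            this.dataNames = dataNames;
            this.dataValues = dataValues;
        }
    }

    /** Converts name -> values into parallel arrays, preserving the map's iteration order. */
    static Carrydown toArrays(Map<String, List<String>> values) {
        String[] names = new String[values.size()];
        Object[][] data = new Object[values.size()][];
        int i = 0;
        for (Map.Entry<String, List<String>> entry : values.entrySet()) {
            if (entry.getKey().length() > MAX_NAME_LENGTH) {
                throw new IllegalArgumentException(
                    "carry-down name exceeds " + MAX_NAME_LENGTH + " characters");
            }
            names[i] = entry.getKey();
            // Only String values are produced here; CharacterInput is the other permitted type.
            data[i] = entry.getValue().toArray();
            i++;
        }
        return new Carrydown(names, data);
    }
}
```

        The resulting arrays would be passed as dataNames and dataValues, together with a non-null parentIdentifier, since the parent must be present when carry-down information is supplied.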
      • addDocumentReference

        void addDocumentReference(java.lang.String documentIdentifier,
                                  java.lang.String parentIdentifier,
                                  java.lang.String relationshipType,
                                  java.lang.String[] dataNames,
                                  java.lang.Object[][] dataValues,
                                  java.lang.Long originationTime)
                           throws ManifoldCFException
        Add a document description to the current job's queue.
        Parameters:
        documentIdentifier - is the document identifier to add (for the connector that fetched the document).
        parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering desired for this kind of relationship. MUST be present in the case of carrydown information.
        relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
        dataNames - is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters!
        dataValues - are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null. The type of each object must either be a String, or a CharacterInput.
        originationTime - is the time, in ms since epoch, that the document originated. Pass null if none or unknown.
        Throws:
        ManifoldCFException
      • addDocumentReference

        void addDocumentReference(java.lang.String documentIdentifier,
                                  java.lang.String parentIdentifier,
                                  java.lang.String relationshipType,
                                  java.lang.String[] dataNames,
                                  java.lang.Object[][] dataValues)
                           throws ManifoldCFException
        Add a document description to the current job's queue.
        Parameters:
        documentIdentifier - is the document identifier to add (for the connector that fetched the document).
        parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering desired for this kind of relationship. MUST be present in the case of carrydown information.
        relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
        dataNames - is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters!
        dataValues - are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null. The type of each object must either be a String, or a CharacterInput.
        Throws:
        ManifoldCFException
      • addDocumentReference

        void addDocumentReference(java.lang.String documentIdentifier,
                                  java.lang.String parentIdentifier,
                                  java.lang.String relationshipType)
                           throws ManifoldCFException
        Add a document description to the current job's queue.
        Parameters:
        documentIdentifier - is the document identifier to add (for the connector that fetched the document).
        parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null, if no hopcount filtering desired for this kind of relationship.
        relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
        Throws:
        ManifoldCFException
      • addDocumentReference

        void addDocumentReference(java.lang.String documentIdentifier)
                           throws ManifoldCFException
        Add a document description to the current job's queue. This method is equivalent to addDocumentReference(documentIdentifier,null,null).
        Parameters:
        documentIdentifier - is the document identifier to add (for the connector that fetched the document).
        Throws:
        ManifoldCFException
      • ingestDocumentWithException

        void ingestDocumentWithException(java.lang.String documentIdentifier,
                                         java.lang.String version,
                                         java.lang.String documentURI,
                                         RepositoryDocument data)
                                  throws ManifoldCFException,
                                         ServiceInterruption,
                                         java.io.IOException
        Ingest the current document.
        Parameters:
        documentIdentifier - is the document's identifier.
        version - is the version of the document, as reported by the getDocumentVersions() method of the corresponding repository connector. An empty version string signals that there is no calculable document version string, and that the document should always be indexed.
        documentURI - is the URI to use to retrieve this document from the search interface (and is also the unique key in the index).
        data - is the document data. The data is closed after ingestion is complete.
        Throws:
        java.io.IOException - only when data stream reading fails.
        ManifoldCFException
        ServiceInterruption
      • ingestDocumentWithException

        void ingestDocumentWithException(java.lang.String documentIdentifier,
                                         java.lang.String componentIdentifier,
                                         java.lang.String version,
                                         java.lang.String documentURI,
                                         RepositoryDocument data)
                                  throws ManifoldCFException,
                                         ServiceInterruption,
                                         java.io.IOException
        Ingest the current document.
        Parameters:
        documentIdentifier - is the document's identifier.
        componentIdentifier - is the component document identifier, if any.
        version - is the version of the document, as reported by the getDocumentVersions() method of the corresponding repository connector.
        documentURI - is the URI to use to retrieve this document from the search interface (and is also the unique key in the index).
        data - is the document data. The data is closed after ingestion is complete.
        Throws:
        java.io.IOException - only when data stream reading fails.
        ManifoldCFException
        ServiceInterruption
      • noDocument

        void noDocument(java.lang.String documentIdentifier,
                        java.lang.String version)
                 throws ManifoldCFException,
                        ServiceInterruption
        Remove the specified document from the search engine index, and update the recorded version information for the document.
        Parameters:
        documentIdentifier - is the document's local identifier.
        version - is the version string to be recorded for the document.
        Throws:
        ManifoldCFException
        ServiceInterruption
      • noDocument

        void noDocument(java.lang.String documentIdentifier,
                        java.lang.String componentIdentifier,
                        java.lang.String version)
                 throws ManifoldCFException,
                        ServiceInterruption
        Remove the specified document from the search engine index, and update the recorded version information for the document.
        Parameters:
        documentIdentifier - is the document's local identifier.
        componentIdentifier - is the component document identifier, if any.
        version - is the version string to be recorded for the document.
        Throws:
        ManifoldCFException
        ServiceInterruption
      • removeDocument

        void removeDocument(java.lang.String documentIdentifier)
                     throws ManifoldCFException,
                            ServiceInterruption
        Remove the specified document primary component permanently from the search engine index, and from the status table. Use this method when your document has components and previously also had a primary document, but will not have a primary document again for the foreseeable future. This is a rare situation.
        Parameters:
        documentIdentifier - is the document's identifier.
        Throws:
        ManifoldCFException
        ServiceInterruption
      • retainDocument

        void retainDocument(java.lang.String documentIdentifier,
                            java.lang.String componentIdentifier)
                     throws ManifoldCFException
        Retain existing document component. Use this method to signal that an already-existing document component does not need to be reindexed. The default behavior is to remove components that are not mentioned during processing.
        Parameters:
        documentIdentifier - is the document's identifier.
        componentIdentifier - is the component document identifier, which cannot be null.
        Throws:
        ManifoldCFException
      • retainAllComponentDocument

        void retainAllComponentDocument(java.lang.String documentIdentifier)
                                 throws ManifoldCFException
        Retain all existing document components of a primary document. Use this method to signal that no document components need to be reindexed. The default behavior is to remove components that are not mentioned during processing.
        Parameters:
        documentIdentifier - is the document's identifier.
        Throws:
        ManifoldCFException
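        The default rule described for retainDocument() and retainAllComponentDocument(), that components not mentioned during processing are removed, can be pictured with a toy model. The class below is entirely hypothetical and tracks only component names. It assumes that ingesting or retaining a component counts as mentioning it, and it says nothing about how the framework actually performs its bookkeeping.

```java
import java.util.Collection;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

final class ComponentRetentionSketch {
    private final Set<String> existing;                      // components indexed on an earlier crawl
    private final Set<String> mentioned = new TreeSet<>();   // components touched during this pass

    ComponentRetentionSketch(Collection<String> existingComponents) {
        this.existing = new TreeSet<>(existingComponents);
    }

    /** A (re)indexed component is mentioned by definition. */
    void ingest(String componentIdentifier) {
        mentioned.add(Objects.requireNonNull(componentIdentifier));
    }

    /** Keep one already-existing component without reindexing it. */
    void retainDocument(String componentIdentifier) {
        Objects.requireNonNull(componentIdentifier, "componentIdentifier cannot be null");
        if (existing.contains(componentIdentifier)) {
            mentioned.add(componentIdentifier);
        }
    }

    /** Keep every already-existing component. */
    void retainAllComponentDocument() {
        mentioned.addAll(existing);
    }

    /** Components that survive the pass; unmentioned ones are dropped. */
    Set<String> finish() {
        return new TreeSet<>(mentioned);
    }
}
```

        Starting from components a, b, and c, retaining a and ingesting a new component d leaves a and d in this model, while calling retainAllComponentDocument() alone leaves a, b, and c.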
      • recordDocument

        void recordDocument(java.lang.String documentIdentifier,
                            java.lang.String version)
                     throws ManifoldCFException
        Record a document version, WITHOUT reindexing it, or removing it. (Other documents with the same URL, however, will still be removed.) This is useful if the version string changes but the document contents are known not to have changed.
        Parameters:
        documentIdentifier - is the document identifier.
        version - is the document version.
        Throws:
        ManifoldCFException
      • recordDocument

        void recordDocument(java.lang.String documentIdentifier,
                            java.lang.String componentIdentifier,
                            java.lang.String version)
                     throws ManifoldCFException
        Record a document version, WITHOUT reindexing it, or removing it. (Other documents with the same URL, however, will still be removed.) This is useful if the version string changes but the document contents are known not to have changed.
        Parameters:
        documentIdentifier - is the document identifier.
        componentIdentifier - is the component document identifier, if any.
        version - is the document version.
        Throws:
        ManifoldCFException
      • deleteDocument

        void deleteDocument(java.lang.String documentIdentifier)
                     throws ManifoldCFException
        Delete the specified document permanently from the search engine index, and from the status table, along with all its components. This method does NOT keep track of any document version information for the document and thus can lead to "churn", whereby the same document is queued, processed, and removed on subsequent crawls. It is therefore preferable to use noDocument() instead, in any case where the same decision will need to be made over and over.
        Parameters:
        documentIdentifier - is the document's identifier.
        Throws:
        ManifoldCFException
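        The "churn" described for deleteDocument() can be made concrete with a small simulation. The code below is a hypothetical model, not framework code: it assumes a document is skipped on later crawls only when its recorded version matches, and it counts how often the connector must repeat its (possibly expensive) decision for one unindexable document whose version never changes.

```java
import java.util.HashMap;
import java.util.Map;

final class ChurnSketch {
    /**
     * Simulates repeated crawls of one unchanged, unindexable document and
     * returns how many times the connector had to make its decision again.
     */
    static int decisionsMade(boolean useNoDocument, int crawls) {
        Map<String, String> recordedVersions = new HashMap<>();
        String documentIdentifier = "doc-1";
        String version = "v1";
        int decisions = 0;
        for (int i = 0; i < crawls; i++) {
            if (version.equals(recordedVersions.get(documentIdentifier))) {
                continue; // version already recorded: nothing to do on this crawl
            }
            decisions++; // fetch and inspect the document, then decide it is unindexable
            if (useNoDocument) {
                recordedVersions.put(documentIdentifier, version); // noDocument() keeps the version
            } else {
                recordedVersions.remove(documentIdentifier);       // deleteDocument() keeps nothing
            }
        }
        return decisions;
    }

    public static void main(String[] args) {
        System.out.println("noDocument: " + decisionsMade(true, 5));
        System.out.println("deleteDocument: " + decisionsMade(false, 5));
    }
}
```

        Over five crawls the model makes the decision once when the version is recorded and five times when it is not.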
      • setDocumentScheduleBounds

        void setDocumentScheduleBounds(java.lang.String documentIdentifier,
                                       java.lang.Long lowerRecrawlBoundTime,
                                       java.lang.Long upperRecrawlBoundTime,
                                       java.lang.Long lowerExpireBoundTime,
                                       java.lang.Long upperExpireBoundTime)
                                throws ManifoldCFException
        Override the schedule for the next time a document is crawled. Calling this method allows you to set an upper recrawl bound, lower recrawl bound, upper expire bound, lower expire bound, or a combination of these, on a specific document. This method is only effective if the job is a continuous one, and if the identifier you pass in is being processed.
        Parameters:
        documentIdentifier - is the document's identifier.
        lowerRecrawlBoundTime - is the time in ms since epoch that the reschedule time should not fall BELOW, or null if none.
        upperRecrawlBoundTime - is the time in ms since epoch that the reschedule time should not rise ABOVE, or null if none.
        lowerExpireBoundTime - is the time in ms since epoch that the expire time should not fall BELOW, or null if none.
        upperExpireBoundTime - is the time in ms since epoch that the expire time should not rise ABOVE, or null if none.
        Throws:
        ManifoldCFException
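        The bound parameters can be read as a clamp on a proposed time: the result should not fall below the lower bound or rise above the upper bound, and a null bound imposes no limit. The helper below is a hypothetical sketch of that reading. How the framework computes the proposed time and applies the bounds internally is not described here and is not modeled.

```java
import java.time.Duration;

final class ScheduleBoundsSketch {
    /** Clamps a proposed time (ms since epoch) to optional lower and upper bounds. */
    static long clamp(long proposedTime, Long lowerBound, Long upperBound) {
        long result = proposedTime;
        if (lowerBound != null && result < lowerBound) {
            result = lowerBound; // should not fall BELOW the lower bound
        }
        if (upperBound != null && result > upperBound) {
            result = upperBound; // should not rise ABOVE the upper bound
        }
        return result;
    }

    /** Builds a bound value a given number of hours after a reference time. */
    static Long hoursAfter(long referenceTimeMillis, long hours) {
        return referenceTimeMillis + Duration.ofHours(hours).toMillis();
    }
}
```

        For example, a connector that wants a document recrawled no sooner than one hour and no later than one day from now could pass hoursAfter(now, 1) and hoursAfter(now, 24) as the recrawl bounds and null for both expire bounds.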
      • setDocumentOriginationTime

        void setDocumentOriginationTime(java.lang.String documentIdentifier,
                                        java.lang.Long originationTime)
                                 throws ManifoldCFException
        Override a document's origination time. Use this method to signal the framework that a document's origination time is something other than the first time it was crawled.
        Parameters:
        documentIdentifier - is the document's identifier.
        originationTime - is the document's origination time, or null if unknown.
        Throws:
        ManifoldCFException
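        Since the origination time is nullable, and the same parameter of addDocumentReference() is given in ms since epoch, a connector that holds the time in another form needs a conversion. The helper below is a hypothetical example that assumes the repository reports an ISO-8601 instant such as "2020-01-01T00:00:00Z", and maps anything missing or unparseable to null.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

final class OriginationTimeSketch {
    /** Returns ms since epoch for an ISO-8601 instant, or null if it is missing or malformed. */
    static Long toOriginationTime(String isoTimestamp) {
        if (isoTimestamp == null || isoTimestamp.isEmpty()) {
            return null; // unknown origination time
        }
        try {
            return Instant.parse(isoTimestamp).toEpochMilli();
        } catch (DateTimeParseException e) {
            return null; // treat an unparseable value as unknown
        }
    }
}
```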