Interface IIncrementalIngester

  • All Known Implementing Classes:
    IncrementalIngester

    public interface IIncrementalIngester
    This interface describes the incremental ingestion API.

    The expected client flow for this API is:
      1) Use the API to fetch a document's version.
      2) Decide whether to ingest based on that version.
      3) If the decision is to ingest, call the ingest method in the API.

    The module described by this interface is responsible for keeping track of what has been sent where, and of the corresponding version of each document so indexed. The space over which this takes place is defined by the individual output connection; that is, the output connection appears to "remember" what documents were handed to it.

    A secondary purpose of this module is to provide a mapping between the key by which a document is described internally (an identifier hash, plus the name of an identifier space) and the way the document is identified in the output space (the name of an output connection, plus a URI that is considered local to that output connection's space).
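
The three-step client flow described above can be illustrated with an in-memory stand-in. This is only a sketch of the version-check pattern, not the ManifoldCF implementation: the class `VersionTrackingSketch`, its methods, and the key layout are all hypothetical, and a plain map stands in for the persistent record of what was sent to each output connection.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the bookkeeping described above.
// It is NOT the ManifoldCF API; every name and the key layout are assumptions.
class VersionTrackingSketch {
    // Key: output connection name + identifier class + identifier hash.
    private final Map<String, String> indexedVersions = new HashMap<>();

    static String key(String outputConnection, String identifierClass, String identifierHash) {
        return outputConnection + "|" + identifierClass + "|" + identifierHash;
    }

    // Step 1: fetch the version last recorded for the document (null if never indexed).
    String lastIndexedVersion(String key) {
        return indexedVersions.get(key);
    }

    // Step 2: decide whether to ingest by comparing the new version with the recorded one.
    boolean needsIngest(String key, String newVersion) {
        return !newVersion.equals(lastIndexedVersion(key));
    }

    // Step 3: "ingest" the document and remember the version that was sent.
    void ingest(String key, String newVersion) {
        indexedVersions.put(key, newVersion);
    }

    public static void main(String[] args) {
        VersionTrackingSketch tracker = new VersionTrackingSketch();
        String k = key("example-output", "exampleRepository", "abc123");
        System.out.println(tracker.needsIngest(k, "v1")); // true: never indexed
        tracker.ingest(k, "v1");
        System.out.println(tracker.needsIngest(k, "v1")); // false: versions agree
        System.out.println(tracker.needsIngest(k, "v2")); // true: version changed
    }
}
```

Because the output connection name is part of the key, the same document can carry a different recorded version for each output connection, which mirrors the "remembering" behaviour described above.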
    • Method Detail

      • getLastIndexedOutputConnectionName

        java.lang.String getLastIndexedOutputConnectionName​(IPipelineSpecificationBasic pipelineSpecificationBasic)
        From a pipeline specification, get the name of the output connection that will be indexed last in the pipeline.
        Parameters:
        pipelineSpecificationBasic - is the basic pipeline specification.
        Returns:
        the last indexed output connection name.
      • getFirstIndexedOutputConnectionName

        java.lang.String getFirstIndexedOutputConnectionName​(IPipelineSpecificationBasic pipelineSpecificationBasic)
        From a pipeline specification, get the name of the output connection that will be indexed first in the pipeline.
        Parameters:
        pipelineSpecificationBasic - is the basic pipeline specification.
        Returns:
        the first indexed output connection name.
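
As a rough illustration of the two lookups above, the sketch below models a pipeline as an ordered list of stages and picks the first and last output stages. It assumes that "indexed first" and "indexed last" follow stage order, which this page does not state explicitly; the `Stage` type and both helper methods are hypothetical and do not belong to ManifoldCF.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical pipeline model: an ordered list of stages, some of which are outputs.
// Assumption: outputs are indexed in the order in which they appear in the list.
class PipelineOrderSketch {
    static final class Stage {
        final String connectionName;
        final boolean isOutput;

        Stage(String connectionName, boolean isOutput) {
            this.connectionName = connectionName;
            this.isOutput = isOutput;
        }
    }

    // Name of the first output stage in pipeline order, or null if there is none.
    static String firstOutput(List<Stage> stages) {
        for (Stage s : stages) {
            if (s.isOutput) {
                return s.connectionName;
            }
        }
        return null;
    }

    // Name of the last output stage in pipeline order, or null if there is none.
    static String lastOutput(List<Stage> stages) {
        for (int i = stages.size() - 1; i >= 0; i--) {
            if (stages.get(i).isOutput) {
                return stages.get(i).connectionName;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Stage> pipeline = Arrays.asList(
            new Stage("extract-text", false),
            new Stage("index-a", true),
            new Stage("add-metadata", false),
            new Stage("index-b", true));
        System.out.println(firstOutput(pipeline)); // index-a
        System.out.println(lastOutput(pipeline));  // index-b
    }
}
```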
      • checkLengthIndexable

        boolean checkLengthIndexable​(IPipelineSpecification pipelineSpecification,
                                     long length,
                                     IOutputCheckActivity activity)
                              throws ManifoldCFException,
                                     ServiceInterruption
        Pre-determine whether a document's length is indexable by this connector. This method is used by participating repository connectors to help filter out documents that are too long to be indexable.
        Parameters:
        pipelineSpecification - is the IPipelineSpecification object for this pipeline.
        length - is the length of the document.
        activity - are the activities available to this method.
        Returns:
        true if the file is indexable.
        Throws:
        ManifoldCFException
        ServiceInterruption
      • checkURLIndexable

        boolean checkURLIndexable​(IPipelineSpecification pipelineSpecification,
                                  java.lang.String url,
                                  IOutputCheckActivity activity)
                           throws ManifoldCFException,
                                  ServiceInterruption
        Pre-determine whether a document's URL is indexable by this connector. This method is used by participating repository connectors to help filter out documents that are not indexable.
        Parameters:
        pipelineSpecification - is the IPipelineSpecification object for this pipeline.
        url - is the url of the document.
        activity - are the activities available to this method.
        Returns:
        true if the file is indexable.
        Throws:
        ManifoldCFException
        ServiceInterruption
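
The two pre-checks above (length and URL) let a repository connector skip a document before fetching its content. The following sketch shows the shape of such a pre-filter under invented rules: the 16 MiB limit, the scheme test, and all names are assumptions for illustration only, since the real decision is delegated to the connectors in the pipeline.

```java
// Hypothetical pre-filter. The limit and the URL rule are invented for illustration;
// in ManifoldCF the answer comes from the pipeline's connectors, not from fixed constants.
class IndexabilityPrefilterSketch {
    static final long MAX_LENGTH_BYTES = 16L * 1024 * 1024;

    // Stand-in for a length check: reject negative or oversized lengths.
    static boolean lengthIndexable(long length) {
        return length >= 0 && length <= MAX_LENGTH_BYTES;
    }

    // Stand-in for a URL check: accept only http and https URLs.
    static boolean urlIndexable(String url) {
        return url != null && (url.startsWith("http://") || url.startsWith("https://"));
    }

    // A caller would fetch the document only if both pre-checks pass.
    static boolean shouldFetch(long length, String url) {
        return lengthIndexable(length) && urlIndexable(url);
    }

    public static void main(String[] args) {
        System.out.println(shouldFetch(1024, "https://example.com/a.pdf"));                 // true
        System.out.println(shouldFetch(MAX_LENGTH_BYTES + 1, "https://example.com/a.pdf")); // false
        System.out.println(shouldFetch(1024, "ftp://example.com/a.pdf"));                   // false
    }
}
```

Running both checks before fetching avoids downloading content that would be rejected at indexing time anyway.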
      • checkFetchDocument

        boolean checkFetchDocument​(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions,
                                   java.lang.String newDocumentVersion,
                                   java.lang.String newAuthorityNameString)
        Determine whether we need to fetch or refetch a document. Pass in the pipeline specification with existing version info, plus the new document version and authority name strings. If no outputs need to be updated, this method returns false. If any output needs updating, it returns true.
        Parameters:
        pipelineSpecificationWithVersions - is the pipeline specification including new version info for all transformation and output connections.
        newDocumentVersion - is the newly-determined document version.
        newAuthorityNameString - is the newly-determined authority name.
        Returns:
        true if the document needs to be refetched.
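
The "any output needs updating" rule can be sketched as a loop over the per-output records. This is a simplified, hypothetical model: it compares only the document version and authority name, whereas a real pipeline specification also carries version information for transformation and output connections. The `Recorded` type and the `needsFetch` method are illustrative names, not ManifoldCF ones.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical, simplified fetch decision: refetch if ANY output is out of date.
class FetchDecisionSketch {
    // What was last recorded for one output connection (assumed shape).
    static final class Recorded {
        final String documentVersion;
        final String authorityName;

        Recorded(String documentVersion, String authorityName) {
            this.documentVersion = documentVersion;
            this.authorityName = authorityName;
        }
    }

    static boolean needsFetch(Map<String, Recorded> recordedByOutput,
                              String newDocumentVersion,
                              String newAuthorityName) {
        for (Recorded r : recordedByOutput.values()) {
            if (r == null) {
                return true; // never indexed into this output
            }
            if (!Objects.equals(r.documentVersion, newDocumentVersion)) {
                return true; // document changed since it was indexed here
            }
            if (!Objects.equals(r.authorityName, newAuthorityName)) {
                return true; // authority changed since it was indexed here
            }
        }
        return false; // every output already holds this version
    }

    public static void main(String[] args) {
        Map<String, Recorded> recorded = new LinkedHashMap<>();
        recorded.put("index-a", new Recorded("v2", "ldap"));
        recorded.put("index-b", new Recorded("v1", "ldap"));
        System.out.println(needsFetch(recorded, "v2", "ldap")); // true: index-b is stale
        recorded.put("index-b", new Recorded("v2", "ldap"));
        System.out.println(needsFetch(recorded, "v2", "ldap")); // false: all outputs current
    }
}
```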
      • documentRecord

        void documentRecord​(IPipelineSpecificationBasic pipelineSpecificationBasic,
                            java.lang.String identifierClass,
                            java.lang.String identifierHash,
                            java.lang.String componentHash,
                            java.lang.String documentVersion,
                            long recordTime)
                     throws ManifoldCFException
        Record a document version, but don't ingest it. The purpose of this method is to update document version information without reindexing the document.
        Parameters:
        pipelineSpecificationBasic - is the basic pipeline specification needed.
        identifierClass - is the name of the space in which the identifier hash should be interpreted.
        identifierHash - is the hashed document identifier.
        componentHash - is the hashed component identifier, if any.
        documentVersion - is the document version.
        recordTime - is the time at which the recording took place, in milliseconds since epoch.
        Throws:
        ManifoldCFException
      • documentNoData

        void documentNoData​(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions,
                            java.lang.String identifierClass,
                            java.lang.String identifierHash,
                            java.lang.String componentHash,
                            java.lang.String documentVersion,
                            java.lang.String authorityName,
                            long recordTime,
                            IOutputActivity activities)
                     throws ManifoldCFException,
                            ServiceInterruption
        Remove a document from specified indexes, just as if an empty document was indexed, and record the necessary version information. This method is conceptually similar to documentIngest(), but does not actually take a document or allow it to be transformed. If there is a document already indexed, it is removed from the index.
        Parameters:
        pipelineSpecificationWithVersions - is the pipeline specification with already-fetched output versioning information.
        identifierClass - is the name of the space in which the identifier hash should be interpreted.
        identifierHash - is the hashed document identifier.
        componentHash - is the hashed component identifier, if any.
        documentVersion - is the document version.
        authorityName - is the name of the authority associated with the document, if any.
        recordTime - is the time at which the recording took place, in milliseconds since epoch.
        activities - is an object providing a set of methods that the implementer can use to perform the operation.
        Throws:
        ManifoldCFException
        ServiceInterruption
      • documentIngest

        boolean documentIngest​(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions,
                               java.lang.String identifierClass,
                               java.lang.String identifierHash,
                               java.lang.String componentHash,
                               java.lang.String documentVersion,
                               java.lang.String authorityName,
                               RepositoryDocument data,
                               long ingestTime,
                               java.lang.String documentURI,
                               IOutputActivity activities)
                        throws ManifoldCFException,
                               ServiceInterruption,
                               java.io.IOException
        Ingest a document. This ingests the document, and notes it. If this is a repeat ingestion of the document, this method also REMOVES ALL OLD METADATA. When complete, the index will contain only the metadata described by the RepositoryDocument object passed to this method. ServiceInterruption is thrown if the document ingestion must be rescheduled.
        Parameters:
        pipelineSpecificationWithVersions - is the pipeline specification with already-fetched output versioning information.
        identifierClass - is the name of the space in which the identifier hash should be interpreted.
        identifierHash - is the hashed document identifier.
        componentHash - is the hashed component identifier, if any.
        documentVersion - is the document version.
        authorityName - is the name of the authority associated with the document, if any.
        data - is the document data. The data is closed after ingestion is complete.
        ingestTime - is the time at which the ingestion took place, in milliseconds since epoch.
        documentURI - is the URI of the document, which will be used as the key of the document in the index.
        activities - is an object providing a set of methods that the implementer can use to perform the operation.
        Returns:
        true if the ingest was ok, false if the ingest is illegal (and should not be repeated).
        Throws:
        java.io.IOException - only if data stream throws an IOException.
        ManifoldCFException
        ServiceInterruption
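
The note that a repeat ingestion removes all old metadata describes replace semantics rather than merge semantics. The sketch below illustrates the difference with a plain map keyed by document URI; the class and method names are hypothetical and no ManifoldCF types are involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical index keyed by document URI, illustrating replace-on-reingest.
class ReplaceMetadataSketch {
    private final Map<String, Map<String, String>> index = new HashMap<>();

    // Replace, do not merge: the new metadata overwrites whatever was stored before,
    // so fields missing from the new document disappear from the index.
    void ingest(String documentURI, Map<String, String> metadata) {
        index.put(documentURI, new HashMap<>(metadata));
    }

    Map<String, String> metadataFor(String documentURI) {
        return index.get(documentURI);
    }

    public static void main(String[] args) {
        ReplaceMetadataSketch sketch = new ReplaceMetadataSketch();

        Map<String, String> first = new HashMap<>();
        first.put("title", "Draft");
        first.put("author", "alice");
        sketch.ingest("doc://1", first);

        Map<String, String> second = new HashMap<>();
        second.put("title", "Final");
        sketch.ingest("doc://1", second);

        // Only the metadata from the second ingestion remains.
        System.out.println(sketch.metadataFor("doc://1").containsKey("author")); // false
        System.out.println(sketch.metadataFor("doc://1").get("title"));          // Final
    }
}
```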
      • documentRemove

        void documentRemove​(IPipelineConnections pipelineConnections,
                            java.lang.String identifierClass,
                            java.lang.String identifierHash,
                            java.lang.String componentHash,
                            IOutputRemoveActivity activities)
                     throws ManifoldCFException,
                            ServiceInterruption
        Remove a document component from the search engine index.
        Parameters:
        pipelineConnections - is the pipeline specification.
        identifierClass - is the name of the space in which the identifier hash should be interpreted.
        identifierHash - is the hash of the id of the document.
        componentHash - is the hashed component identifier, if any.
        activities - is the object to use to log the details of the removal attempt. May be null.
        Throws:
        ManifoldCFException
        ServiceInterruption
      • documentRemoveMultiple

        void documentRemoveMultiple​(IPipelineConnections pipelineConnections,
                                    java.lang.String[] identifierClasses,
                                    java.lang.String[] identifierHashes,
                                    java.lang.String componentHash,
                                    IOutputRemoveActivity activities)
                             throws ManifoldCFException,
                                    ServiceInterruption
        Remove multiple document components from the search engine index.
        Parameters:
        pipelineConnections - is the pipeline specification.
        identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
        identifierHashes - are the hashes of the ids of the documents.
        componentHash - is the hashed component identifier, if any.
        activities - is the object to use to log the details of the removal attempt. May be null.
        Throws:
        ManifoldCFException
        ServiceInterruption
      • documentCheckMultiple

        void documentCheckMultiple​(IPipelineSpecificationBasic pipelineSpecificationBasic,
                                   java.lang.String[] identifierClasses,
                                   java.lang.String[] identifierHashes,
                                   long checkTime)
                            throws ManifoldCFException
        Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).
        Parameters:
        pipelineSpecificationBasic - is a pipeline specification.
        identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
        identifierHashes - are the set of document identifier hashes.
        checkTime - is the time at which the check took place, in milliseconds since epoch.
        Throws:
        ManifoldCFException
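
The "Multiple" methods on this page take an array of identifier classes alongside an array of identifier hashes. The sketch below assumes the usual parallel-array convention, in which element i of one array pairs with element i of the other; that pairing is an inference from the parameter descriptions, and the class, key format, and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical record of "last checked" times, fed from parallel arrays.
// Assumption: identifierClasses[i] describes the space for identifierHashes[i].
class ParallelArraysSketch {
    private final Map<String, Long> lastCheckTimes = new HashMap<>();

    void checkMultiple(String[] identifierClasses, String[] identifierHashes, long checkTime) {
        if (identifierClasses.length != identifierHashes.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        for (int i = 0; i < identifierHashes.length; i++) {
            // One document per index position: (class, hash) forms the internal key.
            lastCheckTimes.put(identifierClasses[i] + ":" + identifierHashes[i], checkTime);
        }
    }

    // Returns the recorded check time in milliseconds since epoch, or null if never checked.
    Long lastCheckTime(String identifierClass, String identifierHash) {
        return lastCheckTimes.get(identifierClass + ":" + identifierHash);
    }

    public static void main(String[] args) {
        ParallelArraysSketch sketch = new ParallelArraysSketch();
        sketch.checkMultiple(
            new String[] {"repoA", "repoB"},
            new String[] {"hash1", "hash2"},
            1700000000000L);
        System.out.println(sketch.lastCheckTime("repoA", "hash1")); // 1700000000000
        System.out.println(sketch.lastCheckTime("repoA", "hash2")); // null
    }
}
```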
      • documentCheck

        void documentCheck​(IPipelineSpecificationBasic pipelineSpecificationBasic,
                           java.lang.String identifierClass,
                           java.lang.String identifierHash,
                           long checkTime)
                    throws ManifoldCFException
        Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).
        Parameters:
        pipelineSpecificationBasic - is a basic pipeline specification.
        identifierClass - is the name of the space in which the identifier hash should be interpreted.
        identifierHash - is the hashed document identifier.
        checkTime - is the time at which the check took place, in milliseconds since epoch.
        Throws:
        ManifoldCFException
      • documentDeleteMultiple

        void documentDeleteMultiple​(IPipelineConnections[] pipelineConnections,
                                    java.lang.String[] identifierClasses,
                                    java.lang.String[] identifierHashes,
                                    IOutputRemoveActivity activities)
                             throws ManifoldCFException,
                                    ServiceInterruption
        Delete multiple documents, and their components, from the search engine index.
        Parameters:
        pipelineConnections - are the pipeline specifications associated with the documents.
        identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
        identifierHashes - is the array of document identifier hashes of the documents.
        activities - is the object to use to log the details of the removal attempt. May be null.
        Throws:
        ManifoldCFException
        ServiceInterruption
      • documentDeleteMultiple

        void documentDeleteMultiple​(IPipelineConnections pipelineConnections,
                                    java.lang.String[] identifierClasses,
                                    java.lang.String[] identifierHashes,
                                    IOutputRemoveActivity activities)
                             throws ManifoldCFException,
                                    ServiceInterruption
        Delete multiple documents, and their components, from the search engine index.
        Parameters:
        pipelineConnections - is the pipeline specification.
        identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
        identifierHashes - is the array of document identifier hashes of the documents.
        activities - is the object to use to log the details of the removal attempt. May be null.
        Throws:
        ManifoldCFException
        ServiceInterruption
      • documentDelete

        void documentDelete​(IPipelineConnections pipelineConnections,
                            java.lang.String identifierClass,
                            java.lang.String identifierHash,
                            IOutputRemoveActivity activities)
                     throws ManifoldCFException,
                            ServiceInterruption
        Delete a document, and all its components, from the search engine index.
        Parameters:
        pipelineConnections - is the pipeline specification.
        identifierClass - is the name of the space in which the identifier hash should be interpreted.
        identifierHash - is the hash of the id of the document.
        activities - is the object to use to log the details of the removal attempt. May be null.
        Throws:
        ManifoldCFException
        ServiceInterruption
      • getPipelineDocumentIngestDataMultiple

        void getPipelineDocumentIngestDataMultiple​(IngestStatuses rval,
                                                   IPipelineSpecificationBasic[] pipelineSpecificationBasics,
                                                   java.lang.String[] identifierClasses,
                                                   java.lang.String[] identifierHashes)
                                            throws ManifoldCFException
        Look up ingestion data for a set of documents.
        Parameters:
        rval - is a map of output key to document data, in no particular order, which will be loaded with all matching results.
        pipelineSpecificationBasics - are the pipeline specifications corresponding to the identifier classes and hashes.
        identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
        identifierHashes - is the array of document identifier hashes to look up.
        Throws:
        ManifoldCFException
      • getPipelineDocumentIngestDataMultiple

        void getPipelineDocumentIngestDataMultiple​(IngestStatuses rval,
                                                   IPipelineSpecificationBasic pipelineSpecificationBasic,
                                                   java.lang.String[] identifierClasses,
                                                   java.lang.String[] identifierHashes)
                                            throws ManifoldCFException
        Look up ingestion data for a SET of documents.
        Parameters:
        rval - is a map of output key to document data, in no particular order, which will be loaded with all matching results.
        pipelineSpecificationBasic - is the pipeline specification for all documents.
        identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
        identifierHashes - is the array of document identifier hashes to look up.
        Throws:
        ManifoldCFException
      • getPipelineDocumentIngestData

        void getPipelineDocumentIngestData​(IngestStatuses rval,
                                           IPipelineSpecificationBasic pipelineSpecificationBasic,
                                           java.lang.String identifierClass,
                                           java.lang.String identifierHash)
                                    throws ManifoldCFException
        Look up ingestion data for a document.
        Parameters:
        rval - is a map of output key to document data, in no particular order, which will be loaded with all matching results.
        pipelineSpecificationBasic - is the pipeline specification for the document.
        identifierClass - is the name of the space in which the identifier hash should be interpreted.
        identifierHash - is the hash of the id of the document.
        Throws:
        ManifoldCFException
      • getDocumentUpdateIntervalMultiple

        long[] getDocumentUpdateIntervalMultiple​(IPipelineSpecificationBasic pipelineSpecificationBasic,
                                                 java.lang.String[] identifierClasses,
                                                 java.lang.String[] identifierHashes)
                                          throws ManifoldCFException
        Calculate the average time interval between changes for each of a set of documents. This is based on the data gathered for each document.
        Parameters:
        pipelineSpecificationBasic - is the basic pipeline specification.
        identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
        identifierHashes - are the hashes of the ids of the documents.
        Returns:
        the number of milliseconds between changes for each document, or 0 for any document where this cannot be calculated.
        Throws:
        ManifoldCFException
      • getDocumentUpdateInterval

        long getDocumentUpdateInterval​(IPipelineSpecificationBasic pipelineSpecificationBasic,
                                       java.lang.String identifierClass,
                                       java.lang.String identifierHash)
                                throws ManifoldCFException
        Calculate the average time interval between changes for a document. This is based on the data gathered for the document.
        Parameters:
        pipelineSpecificationBasic - is the basic pipeline specification.
        identifierClass - is the name of the space in which the identifier hash should be interpreted.
        identifierHash - is the hash of the id of the document.
        Returns:
        the number of milliseconds between changes, or 0 if this cannot be calculated.
        Throws:
        ManifoldCFException
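
One plausible way to derive such an average is sketched below. The formula and its inputs (first ingest time, last ingest time, and a change count) are assumptions made for illustration; this page only states that the result is in milliseconds and is 0 when it cannot be calculated.

```java
// Hypothetical calculation of an average change interval.
// Assumed inputs: the first and last ingest times (ms since epoch) and the number of changes seen.
class UpdateIntervalSketch {
    static long averageInterval(long firstIngestTime, long lastIngestTime, long changeCount) {
        // With no recorded changes (or inconsistent times) no average can be formed.
        if (changeCount <= 0 || lastIngestTime < firstIngestTime) {
            return 0L;
        }
        return (lastIngestTime - firstIngestTime) / changeCount;
    }

    public static void main(String[] args) {
        System.out.println(averageInterval(1000L, 7000L, 3L)); // 2000
        System.out.println(averageInterval(1000L, 1000L, 0L)); // 0
    }
}
```

A caller could use a value like this to schedule rechecks less often for documents that rarely change, treating 0 as "no estimate available".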
      • resetOutputConnection

        void resetOutputConnection​(IOutputConnection outputConnection)
                            throws ManifoldCFException
        Reset all documents belonging to a specific output connection, because we have learned that the target system has been reconfigured. This will force all such documents to be reindexed the next time they are checked.
        Parameters:
        outputConnection - is the output connection associated with this action.
        Throws:
        ManifoldCFException
      • removeOutputConnection

        void removeOutputConnection​(IOutputConnection outputConnection)
                             throws ManifoldCFException
        Remove all knowledge of an output index from the system. This is appropriate when the output index no longer exists and you wish to delete the associated job.
        Parameters:
        outputConnection - is the output connection associated with this action.
        Throws:
        ManifoldCFException
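
The difference between resetting and removing an output connection can be sketched with the same kind of in-memory bookkeeping. In this hypothetical model, a reset keeps each document record but blanks its version so the next comparison fails and forces reindexing, while a remove discards the records entirely. The class, its methods, and the empty-string sentinel are assumptions, not ManifoldCF behaviour.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical records: output connection name -> (document key -> recorded version).
class OutputConnectionRecordsSketch {
    private final Map<String, Map<String, String>> records = new HashMap<>();

    void record(String output, String docKey, String version) {
        records.computeIfAbsent(output, name -> new HashMap<>()).put(docKey, version);
    }

    // Reset: keep the records but blank every version (empty string is an assumed sentinel).
    void reset(String output) {
        Map<String, String> docs = records.get(output);
        if (docs != null) {
            docs.replaceAll((docKey, version) -> "");
        }
    }

    // Remove: forget everything about this output connection.
    void remove(String output) {
        records.remove(output);
    }

    boolean isKnown(String output, String docKey) {
        Map<String, String> docs = records.get(output);
        return docs != null && docs.containsKey(docKey);
    }

    // A document needs indexing when its recorded version differs from the current one.
    boolean needsIndex(String output, String docKey, String currentVersion) {
        Map<String, String> docs = records.get(output);
        String recorded = (docs == null) ? null : docs.get(docKey);
        return !currentVersion.equals(recorded);
    }

    public static void main(String[] args) {
        OutputConnectionRecordsSketch sketch = new OutputConnectionRecordsSketch();
        sketch.record("index-a", "doc1", "v1");
        System.out.println(sketch.needsIndex("index-a", "doc1", "v1")); // false
        sketch.reset("index-a");
        System.out.println(sketch.needsIndex("index-a", "doc1", "v1")); // true
        System.out.println(sketch.isKnown("index-a", "doc1"));          // true
        sketch.remove("index-a");
        System.out.println(sketch.isKnown("index-a", "doc1"));          // false
    }
}
```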