Interface IPipelineConnector

  • All Superinterfaces:
    IConnector
    All Known Subinterfaces:
    IOutputConnector, ITransformationConnector
    All Known Implementing Classes:
    BaseOutputConnector, BaseTransformationConnector

    public interface IPipelineConnector
    extends IConnector
    This interface describes an instance of a connector which can live in a chained processing pipeline. Both transformation connectors and output connectors are expected to extend this interface. Pipeline connectors have two basic functions: (1) Processing documents, and optionally passing them to the next pipeline stage; (2) Determining if a document is acceptable, optionally by querying the next pipeline stage.
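The two functions described above can be sketched as a small chain of stages. This is an illustration only: `SketchStage`, `UpperCaseStage`, `CollectingStage`, and `PipelineSketch` are hypothetical stand-ins invented for this example, not ManifoldCF types, and the real interface reaches the next stage through activity objects rather than a direct reference.

```java
import java.util.Locale;

/** Hypothetical stand-in for one pipeline stage; NOT the real IPipelineConnector. */
abstract class SketchStage {
    protected final SketchStage next; // null for the final (output-like) stage

    SketchStage(SketchStage next) {
        this.next = next;
    }

    /** Function (2): decide whether a document is acceptable. */
    abstract boolean accepts(String mimeType);

    /** Function (1): process a document, optionally passing it on. */
    abstract boolean process(String content);
}

/** Transformation-like stage: modifies the content, then forwards it. */
class UpperCaseStage extends SketchStage {
    UpperCaseStage(SketchStage next) {
        super(next);
    }

    @Override
    boolean accepts(String mimeType) {
        // No rule of its own, so the decision is delegated to the next stage.
        return next.accepts(mimeType);
    }

    @Override
    boolean process(String content) {
        return next.process(content.toUpperCase(Locale.ROOT));
    }
}

/** Output-like stage: end of the chain, keeps what it receives. */
class CollectingStage extends SketchStage {
    final StringBuilder stored = new StringBuilder();

    CollectingStage() {
        super(null);
    }

    @Override
    boolean accepts(String mimeType) {
        return "text/plain".equals(mimeType);
    }

    @Override
    boolean process(String content) {
        stored.append(content);
        return true;
    }
}

class PipelineSketch {
    /** Returns the stored text, or null if the chain rejects the mime type. */
    static String run(String mimeType, String content) {
        CollectingStage output = new CollectingStage();
        SketchStage head = new UpperCaseStage(output);
        if (!head.accepts(mimeType)) {
            return null; // rejected before any processing happens
        }
        head.process(content);
        return output.stored.toString();
    }
}
```

Calling `PipelineSketch.run("text/plain", "hello")` sends the text through both stages, while a mime type the final stage refuses is rejected without any processing.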
    • Field Detail

      • DOCUMENTSTATUS_ACCEPTED

        static final int DOCUMENTSTATUS_ACCEPTED
        Document accepted
        See Also:
        Constant Field Values
      • DOCUMENTSTATUS_REJECTED

        static final int DOCUMENTSTATUS_REJECTED
        Document permanently rejected
        See Also:
        Constant Field Values
    • Method Detail

      • getPipelineDescription

        VersionContext getPipelineDescription​(Specification spec)
                                       throws ManifoldCFException,
                                              ServiceInterruption
        Get a pipeline version object, given a pipeline specification object. The version string is used to uniquely describe the pertinent details of the specification and the configuration, to allow the Connector Framework to determine whether a document will need to be processed again. Note that the contents of any document cannot be considered by this method; only configuration and specification information can be considered. This method presumes that the underlying connector object has been configured.
        Parameters:
        spec - is the current pipeline specification object for this connection for the job that is doing the crawling.
        Returns:
        a version object, including a string of unlimited length, which uniquely describes configuration and specification in such a way that if two such strings are equal, nothing that affects how or whether the document is indexed will be different.
        Throws:
        ManifoldCFException
        ServiceInterruption
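One possible way to build a version string with the property described above is to pack the relevant settings in a deterministic order with unambiguous field boundaries. The sketch below assumes the settings are available as a plain map. `VersionStringSketch` and `pack` are hypothetical names, and the sketch covers only the string, not the `VersionContext` object the real method returns.

```java
import java.util.Map;
import java.util.TreeMap;

class VersionStringSketch {
    /** Packs settings into a deterministic string: sorted keys, length-prefixed fields. */
    static String pack(Map<String, String> settings) {
        StringBuilder sb = new StringBuilder();
        // TreeMap iterates in key order, so insertion order cannot change the result.
        for (Map.Entry<String, String> e : new TreeMap<>(settings).entrySet()) {
            appendField(sb, e.getKey());
            appendField(sb, e.getValue());
        }
        return sb.toString();
    }

    private static void appendField(StringBuilder sb, String value) {
        // The length prefix keeps "ab" + "c" distinct from "a" + "bc".
        sb.append(value.length()).append(':').append(value);
    }
}
```

Two maps holding the same settings produce the same string regardless of insertion order, and changing any key or value changes the string, which is what lets the framework decide whether a document must be processed again.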
      • checkDateIndexable

        boolean checkDateIndexable​(VersionContext pipelineDescription,
                                   java.util.Date date,
                                   IOutputCheckActivity checkActivity)
                            throws ManifoldCFException,
                                   ServiceInterruption
        Detect if a document date is acceptable or not. This method is used to determine whether it makes sense to fetch a document in the first place.
        Parameters:
        pipelineDescription - is the document's pipeline version string, for this connection.
        date - is the date of the document.
        checkActivity - is an object including the activities that can be performed by this method.
        Returns:
        true if the document with that date can be accepted by this connector.
        Throws:
        ManifoldCFException
        ServiceInterruption
      • checkMimeTypeIndexable

        boolean checkMimeTypeIndexable​(VersionContext pipelineDescription,
                                       java.lang.String mimeType,
                                       IOutputCheckActivity checkActivity)
                                throws ManifoldCFException,
                                       ServiceInterruption
        Detect if a mime type is acceptable or not. This method is used to determine whether it makes sense to fetch a document in the first place.
        Parameters:
        pipelineDescription - is the document's pipeline version string, for this connection.
        mimeType - is the mime type of the document.
        checkActivity - is an object including the activities that can be performed by this method.
        Returns:
        true if the mime type can be accepted by this connector.
        Throws:
        ManifoldCFException
        ServiceInterruption
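A mime type check in a pipeline stage might combine a local rule with a query to the next stage, roughly as sketched below. `DownstreamCheck` is a hypothetical stand-in for the role played by the `checkActivity` argument, and the allow-list is an assumption made for the example; a real connector would derive its rules from the pipeline description.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for the downstream-query role of the check activity object. */
interface DownstreamCheck {
    boolean mimeTypeAccepted(String mimeType);
}

class MimeTypeCheckSketch {
    // Assumed allow-list for illustration only.
    private static final Set<String> LOCAL_ALLOWED = Set.of("text/plain", "text/html");

    static boolean checkMimeType(String mimeType, DownstreamCheck downstream) {
        if (mimeType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalize case before comparing.
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        // Reject locally first; the next stage is asked only if this stage would accept.
        return LOCAL_ALLOWED.contains(base) && downstream.mimeTypeAccepted(base);
    }
}
```

Because `&&` short-circuits, a type this stage refuses never triggers a downstream query.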
      • checkDocumentIndexable

        boolean checkDocumentIndexable​(VersionContext pipelineDescription,
                                       java.io.File localFile,
                                       IOutputCheckActivity checkActivity)
                                throws ManifoldCFException,
                                       ServiceInterruption
        Pre-determine whether a document (passed here as a File object) is acceptable or not. This method is used to determine whether a document needs to be actually transferred. This hook is provided mainly to support search engines that only handle a small set of accepted file types.
        Parameters:
        pipelineDescription - is the document's pipeline version string, for this connection.
        localFile - is the local file to check.
        checkActivity - is an object including the activities that can be done by this method.
        Returns:
        true if the file is acceptable, false if not.
        Throws:
        ManifoldCFException
        ServiceInterruption
      • checkLengthIndexable

        boolean checkLengthIndexable​(VersionContext pipelineDescription,
                                     long length,
                                     IOutputCheckActivity checkActivity)
                              throws ManifoldCFException,
                                     ServiceInterruption
        Pre-determine whether a document's length is acceptable. This method is used to determine whether to fetch a document in the first place.
        Parameters:
        pipelineDescription - is the document's pipeline version string, for this connection.
        length - is the length of the document.
        checkActivity - is an object including the activities that can be done by this method.
        Returns:
        true if the length is acceptable, false if not.
        Throws:
        ManifoldCFException
        ServiceInterruption
      • checkURLIndexable

        boolean checkURLIndexable​(VersionContext pipelineDescription,
                                  java.lang.String url,
                                  IOutputCheckActivity checkActivity)
                           throws ManifoldCFException,
                                  ServiceInterruption
        Pre-determine whether a document's URL is acceptable. This method is used to help filter out documents that cannot be indexed in advance.
        Parameters:
        pipelineDescription - is the document's pipeline version string, for this connection.
        url - is the URL of the document.
        checkActivity - is an object including the activities that can be done by this method.
        Returns:
        true if the URL is acceptable, false if not.
        Throws:
        ManifoldCFException
        ServiceInterruption
      • addOrReplaceDocumentWithException

        int addOrReplaceDocumentWithException​(java.lang.String documentURI,
                                              VersionContext pipelineDescription,
                                              RepositoryDocument document,
                                              java.lang.String authorityNameString,
                                              IOutputAddActivity activities)
                                       throws ManifoldCFException,
                                              ServiceInterruption,
                                              java.io.IOException
        Add (or replace) a document in the output data store using the connector. This method presumes that the connector object has been configured, and it is thus able to communicate with the output data store should that be necessary.
        Parameters:
        documentURI - is the URI of the document. The URI is presumed to be the unique identifier which the output data store will use to process and serve the document. This URI is constructed by the repository connector which fetches the document, and is thus universal across all output connectors.
        pipelineDescription - includes the description string that was constructed for this document by the getPipelineDescription() method.
        document - is the document data to be processed (handed to the output data store).
        authorityNameString - is the name of the authority responsible for authorizing any access tokens passed in with the repository document. May be null.
        activities - is the handle to an object that the implementer of a pipeline connector may use to perform operations, such as logging processing activity, or sending a modified document to the next stage in the pipeline.
        Returns:
        the document status (accepted or permanently rejected).
        Throws:
        java.io.IOException - only if there's a stream error reading the document data.
        ManifoldCFException
        ServiceInterruption
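For a transformation-style stage, the flow described above (inspect the document, optionally modify it, send it to the next stage, and return a status) could look like the sketch below. All names here are hypothetical: `NextStage` stands in for the forwarding role of the `activities` object, a map of fields stands in for `RepositoryDocument`, and the status values are placeholders rather than the real `DOCUMENTSTATUS_ACCEPTED` and `DOCUMENTSTATUS_REJECTED` constants.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the "send to next stage" role of the activities object. */
interface NextStage {
    int send(String documentURI, Map<String, String> fields);
}

class AddOrReplaceSketch {
    // Placeholder values for this sketch only; real code should reference the
    // DOCUMENTSTATUS_ACCEPTED / DOCUMENTSTATUS_REJECTED constants of the interface.
    static final int ACCEPTED = 0;
    static final int REJECTED = 1;

    static int addOrReplace(String documentURI, Map<String, String> fields, NextStage next) {
        String title = fields.get("title");
        if (title == null || title.isEmpty()) {
            // Rejected here, so the document is never forwarded downstream.
            return REJECTED;
        }
        // Work on a copy so the caller's document is left unchanged.
        Map<String, String> modified = new LinkedHashMap<>(fields);
        modified.put("title", title.trim());
        modified.put("processedBy", "sketch");
        // The final status is whatever the rest of the pipeline reports.
        return next.send(documentURI, modified);
    }
}
```

Returning the downstream result unchanged means a rejection anywhere later in the chain is reported back to the caller.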
      • getFormCheckJavascriptMethodName

        java.lang.String getFormCheckJavascriptMethodName​(int connectionSequenceNumber)
        Obtain the name of the form check javascript method to call.
        Parameters:
        connectionSequenceNumber - is the unique number of this connection within the job.
        Returns:
        the name of the form check javascript method.
      • getFormPresaveCheckJavascriptMethodName

        java.lang.String getFormPresaveCheckJavascriptMethodName​(int connectionSequenceNumber)
        Obtain the name of the form presave check javascript method to call.
        Parameters:
        connectionSequenceNumber - is the unique number of this connection within the job.
        Returns:
        the name of the form presave check javascript method.
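Because several pipeline connections can appear on the same job page, the returned JavaScript method names need to differ per connection. One way to achieve that is to embed the sequence number in the name, as in this sketch. The `"s" + n + "_..."` naming scheme and the class name `FormCheckNameSketch` are assumptions made for illustration, not a documented contract.

```java
class FormCheckNameSketch {
    // Assumed naming scheme: the sequence number is embedded in the name so that
    // two connections on one job page never emit colliding JavaScript functions.
    static String formCheckName(int connectionSequenceNumber) {
        return "s" + connectionSequenceNumber + "_checkSpecification";
    }

    static String formPresaveCheckName(int connectionSequenceNumber) {
        return "s" + connectionSequenceNumber + "_checkSpecificationForSave";
    }
}
```

The JavaScript emitted by the header method would then need to define functions with exactly these names for the same sequence number.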
      • outputSpecificationHeader

        void outputSpecificationHeader​(IHTTPOutput out,
                                       java.util.Locale locale,
                                       Specification os,
                                       int connectionSequenceNumber,
                                       java.util.List<java.lang.String> tabsArray)
                                throws ManifoldCFException,
                                       java.io.IOException
        Output the specification header section. This method is called in the head section of a job page which has selected a pipeline connection of the current type. Its purpose is to add the required tabs to the list, and to output any javascript methods that might be needed by the job editing HTML.
        Parameters:
        out - is the output to which any HTML should be sent.
        locale - is the preferred locale of the output.
        os - is the current pipeline specification for this connection.
        connectionSequenceNumber - is the unique number of this connection within the job.
        tabsArray - is the list of tab names. Add to this list any tab names that are specific to the connector.
        Throws:
        ManifoldCFException
        java.io.IOException
      • outputSpecificationBody

        void outputSpecificationBody​(IHTTPOutput out,
                                     java.util.Locale locale,
                                     Specification os,
                                     int connectionSequenceNumber,
                                     int actualSequenceNumber,
                                     java.lang.String tabName)
                              throws ManifoldCFException,
                                     java.io.IOException
        Output the specification body section. This method is called in the body section of a job page which has selected a pipeline connection of the current type. Its purpose is to present the required form elements for editing. Implementers can assume that the HTML output by this method will be placed within appropriate <html>, <body>, and <form> tags. The name of the form is "editjob".
        Parameters:
        out - is the output to which any HTML should be sent.
        locale - is the preferred locale of the output.
        os - is the current pipeline specification for this job.
        connectionSequenceNumber - is the unique number of this connection within the job.
        actualSequenceNumber - is the connection within the job that has currently been selected.
        tabName - is the current tab name.
        Throws:
        ManifoldCFException
        java.io.IOException
      • processSpecificationPost

        java.lang.String processSpecificationPost​(IPostParameters variableContext,
                                                  java.util.Locale locale,
                                                  Specification os,
                                                  int connectionSequenceNumber)
                                           throws ManifoldCFException
        Process a specification post. This method is called at the start of a job's edit or view page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the pipeline specification accordingly. The name of the posted form is "editjob".
        Parameters:
        variableContext - contains the post data, including binary file-upload information.
        locale - is the preferred locale of the output.
        os - is the current pipeline specification for this job.
        connectionSequenceNumber - is the unique number of this connection within the job.
        Returns:
        null if all is well, or a string error message if there is an error that should prevent saving of the job (and cause a redirection to an error page).
        Throws:
        ManifoldCFException
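The return contract above (null on success, an error message otherwise) can be illustrated with a sketch that reads one posted field and updates the specification. Plain maps stand in for `IPostParameters` and `Specification` here. The field name `maxlength`, the `"s" + n + "_"` prefix, and the class name are hypothetical choices for this example.

```java
import java.util.Map;

class SpecificationPostSketch {
    /**
     * posted: stand-in for the post data; spec: stand-in for the specification.
     * Returns null if all is well, or an error message that should prevent saving.
     */
    static String processPost(Map<String, String> posted,
                              Map<String, String> spec,
                              int connectionSequenceNumber) {
        // Assumed convention: form fields are prefixed with the connection's sequence number.
        String prefix = "s" + connectionSequenceNumber + "_";
        String raw = posted.get(prefix + "maxlength");
        if (raw == null) {
            // Nothing was posted for this connection; leave the specification untouched.
            return null;
        }
        try {
            long value = Long.parseLong(raw.trim());
            if (value < 0) {
                return "Maximum length must not be negative";
            }
            spec.put("maxlength", Long.toString(value));
            return null;
        } catch (NumberFormatException e) {
            return "Maximum length must be a whole number: " + raw;
        }
    }
}
```

Treating a missing field as "nothing posted" matters because the method is also called when no form data for this connection is present.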
      • viewSpecification

        void viewSpecification​(IHTTPOutput out,
                               java.util.Locale locale,
                               Specification os,
                               int connectionSequenceNumber)
                        throws ManifoldCFException,
                               java.io.IOException
        View specification. This method is called in the body section of a job's view page. Its purpose is to present the pipeline specification information to the user. Implementers can assume that the HTML output by this method will be placed within appropriate <html> and <body> tags.
        Parameters:
        out - is the output to which any HTML should be sent.
        locale - is the preferred locale of the output.
        os - is the current pipeline specification for this job.
        connectionSequenceNumber - is the unique number of this connection within the job.
        Throws:
        ManifoldCFException
        java.io.IOException