Interface IIncrementalIngester
-
- All Known Implementing Classes:
IncrementalIngester
public interface IIncrementalIngester
This interface describes the incremental ingestion API.
The expected client flow for this API is:
1) Use the API to fetch a document's version.
2) Decide whether to ingest based on that version.
3) If the decision is to ingest, call the API's ingest method.
The module described by this interface is responsible for keeping track of what has been sent where, and of the corresponding version of each document indexed. The space over which this takes place is defined by the individual output connection; that is, the output connection seems to "remember" what documents were handed to it.
A secondary purpose of this module is to provide a mapping between the key by which a document is described internally (an identifier hash, plus the name of an identifier space) and the way the document is identified in the output space (the name of an output connection, plus a URI which is considered local to that output connection space).
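The three-step client flow described above can be sketched with a small in-memory stand-in. This is only an illustration under stated assumptions: `IngestFlowSketch`, `fetchVersion`, `ingest`, and `process` are hypothetical names, not part of this interface, and a real client would call the methods documented on this page with pipeline specification objects instead of plain strings.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the version bookkeeping; NOT the real implementation. */
class IngestFlowSketch {
    // Remembered version per (output connection, identifier class, identifier hash).
    private final Map<String, String> versions = new HashMap<>();

    private static String key(String outputConnection, String identifierClass, String identifierHash) {
        // A NUL separator keeps the three parts from colliding when concatenated.
        return outputConnection + "\u0000" + identifierClass + "\u0000" + identifierHash;
    }

    /** Step 1: fetch the version recorded for a document, or null if it was never indexed. */
    String fetchVersion(String outputConnection, String identifierClass, String identifierHash) {
        return versions.get(key(outputConnection, identifierClass, identifierHash));
    }

    /** Step 3: "ingest" the document; here that only records the new version. */
    void ingest(String outputConnection, String identifierClass, String identifierHash, String version) {
        versions.put(key(outputConnection, identifierClass, identifierHash), version);
    }

    /** Runs all three steps and returns true if the document was ingested. */
    boolean process(String outputConnection, String identifierClass, String identifierHash, String currentVersion) {
        String recorded = fetchVersion(outputConnection, identifierClass, identifierHash);
        // Step 2: ingest only when the recorded version differs (or nothing was recorded).
        boolean needsIngest = !currentVersion.equals(recorded);
        if (needsIngest) {
            ingest(outputConnection, identifierClass, identifierHash, currentVersion);
        }
        return needsIngest;
    }
}
```

Because the map key includes the output connection name, the same document is tracked separately for each output connection, which mirrors the "remembering" behavior described above.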
-
-
Field Summary
Fields (Modifier and Type, Field, Description)
static java.lang.String _rcsid
-
Method Summary
All Methods (Modifier and Type, Method, Description)
boolean checkDateIndexable(IPipelineSpecification pipelineSpecification, java.util.Date date, IOutputCheckActivity activity)
Check if a document date is indexable.
boolean checkDocumentIndexable(IPipelineSpecification pipelineSpecification, java.io.File localFile, IOutputCheckActivity activity)
Check if a file is indexable.
boolean checkFetchDocument(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String newDocumentVersion, java.lang.String newAuthorityNameString)
Determine whether we need to fetch or refetch a document.
boolean checkLengthIndexable(IPipelineSpecification pipelineSpecification, long length, IOutputCheckActivity activity)
Pre-determine whether a document's length is indexable by this connector.
boolean checkMimeTypeIndexable(IPipelineSpecification pipelineSpecification, java.lang.String mimeType, IOutputCheckActivity activity)
Check if a mime type is indexable.
boolean checkURLIndexable(IPipelineSpecification pipelineSpecification, java.lang.String url, IOutputCheckActivity activity)
Pre-determine whether a document's URL is indexable by this connector.
void clearAll()
Flush all knowledge of what was ingested before.
void deinstall()
Uninstall the incremental ingestion manager.
void documentCheck(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash, long checkTime)
Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).
void documentCheckMultiple(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, long checkTime)
Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).
void documentDelete(IPipelineConnections pipelineConnections, java.lang.String identifierClass, java.lang.String identifierHash, IOutputRemoveActivity activities)
Delete a document, and all its components, from the search engine index.
void documentDeleteMultiple(IPipelineConnections[] pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity activities)
Delete multiple documents, and their components, from the search engine index.
void documentDeleteMultiple(IPipelineConnections pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity activities)
Delete multiple documents, and their components, from the search engine index.
boolean documentIngest(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, java.lang.String authorityName, RepositoryDocument data, long ingestTime, java.lang.String documentURI, IOutputActivity activities)
Ingest a document.
void documentNoData(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, java.lang.String authorityName, long recordTime, IOutputActivity activities)
Remove a document from specified indexes, just as if an empty document was indexed, and record the necessary version information.
void documentRecord(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, long recordTime)
Record a document version, but don't ingest it.
void documentRemove(IPipelineConnections pipelineConnections, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, IOutputRemoveActivity activities)
Remove a document component from the search engine index.
void documentRemoveMultiple(IPipelineConnections pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, java.lang.String componentHash, IOutputRemoveActivity activities)
Remove multiple document components from the search engine index.
long getDocumentUpdateInterval(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash)
Calculate the average time interval between changes for a document.
long[] getDocumentUpdateIntervalMultiple(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes)
Calculate the average time interval between changes for a document.
java.lang.String getFirstIndexedOutputConnectionName(IPipelineSpecificationBasic pipelineSpecificationBasic)
From a pipeline specification, get the name of the output connection that will be indexed first in the pipeline.
java.lang.String getLastIndexedOutputConnectionName(IPipelineSpecificationBasic pipelineSpecificationBasic)
From a pipeline specification, get the name of the output connection that will be indexed last in the pipeline.
VersionContext getOutputDescription(IOutputConnection outputConnection, Specification spec)
Get an output version string for a document.
void getPipelineDocumentIngestData(IngestStatuses rval, IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash)
Look up ingestion data for a document.
void getPipelineDocumentIngestDataMultiple(IngestStatuses rval, IPipelineSpecificationBasic[] pipelineSpecificationBasics, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes)
Look up ingestion data for a set of documents.
void getPipelineDocumentIngestDataMultiple(IngestStatuses rval, IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes)
Look up ingestion data for a set of documents.
VersionContext getTransformationDescription(ITransformationConnection transformationConnection, Specification spec)
Get a transformation version string for a document.
void install()
Install the incremental ingestion manager.
void removeOutputConnection(IOutputConnection outputConnection)
Remove all knowledge of an output index from the system.
void resetOutputConnection(IOutputConnection outputConnection)
Reset all documents belonging to a specific output connection, because we've got information that that system has been reconfigured.
-
-
-
Field Detail
-
_rcsid
static final java.lang.String _rcsid
- See Also:
- Constant Field Values
-
-
Method Detail
-
install
void install() throws ManifoldCFException
Install the incremental ingestion manager.
- Throws:
ManifoldCFException
-
deinstall
void deinstall() throws ManifoldCFException
Uninstall the incremental ingestion manager.
- Throws:
ManifoldCFException
-
clearAll
void clearAll() throws ManifoldCFException
Flush all knowledge of what was ingested before.
- Throws:
ManifoldCFException
-
getLastIndexedOutputConnectionName
java.lang.String getLastIndexedOutputConnectionName(IPipelineSpecificationBasic pipelineSpecificationBasic)
From a pipeline specification, get the name of the output connection that will be indexed last in the pipeline.
- Parameters:
pipelineSpecificationBasic - is the basic pipeline specification.
- Returns:
- the last indexed output connection name.
-
getFirstIndexedOutputConnectionName
java.lang.String getFirstIndexedOutputConnectionName(IPipelineSpecificationBasic pipelineSpecificationBasic)
From a pipeline specification, get the name of the output connection that will be indexed first in the pipeline.
- Parameters:
pipelineSpecificationBasic - is the basic pipeline specification.
- Returns:
- the first indexed output connection name.
-
getOutputDescription
VersionContext getOutputDescription(IOutputConnection outputConnection, Specification spec) throws ManifoldCFException, ServiceInterruption
Get an output version string for a document.
- Parameters:
outputConnection - is the output connection associated with this action.
spec - is the output specification.
- Returns:
- the description string.
- Throws:
ManifoldCFException
ServiceInterruption
-
getTransformationDescription
VersionContext getTransformationDescription(ITransformationConnection transformationConnection, Specification spec) throws ManifoldCFException, ServiceInterruption
Get a transformation version string for a document.
- Parameters:
transformationConnection - is the transformation connection associated with this action.
spec - is the transformation specification.
- Returns:
- the description string.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkDateIndexable
boolean checkDateIndexable(IPipelineSpecification pipelineSpecification, java.util.Date date, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Check if a document date is indexable.
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
date - is the date to check.
activity - are the activities available to this method.
- Returns:
- true if the document with that date is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkMimeTypeIndexable
boolean checkMimeTypeIndexable(IPipelineSpecification pipelineSpecification, java.lang.String mimeType, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Check if a mime type is indexable.
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
mimeType - is the mime type to check.
activity - are the activities available to this method.
- Returns:
- true if the mimeType is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkDocumentIndexable
boolean checkDocumentIndexable(IPipelineSpecification pipelineSpecification, java.io.File localFile, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Check if a file is indexable.
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
localFile - is the local file to check.
activity - are the activities available to this method.
- Returns:
- true if the local file is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkLengthIndexable
boolean checkLengthIndexable(IPipelineSpecification pipelineSpecification, long length, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Pre-determine whether a document's length is indexable by this connector. This method is used by participating repository connectors to help filter out documents that are too long to be indexable.
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
length - is the length of the document.
activity - are the activities available to this method.
- Returns:
- true if the file is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkURLIndexable
boolean checkURLIndexable(IPipelineSpecification pipelineSpecification, java.lang.String url, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Pre-determine whether a document's URL is indexable by this connector. This method is used by participating repository connectors to help filter out documents that are not indexable.
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
url - is the URL of the document.
activity - are the activities available to this method.
- Returns:
- true if the file is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkFetchDocument
boolean checkFetchDocument(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String newDocumentVersion, java.lang.String newAuthorityNameString)
Determine whether we need to fetch or refetch a document. Pass in information including the pipeline specification with existing version info, plus new document and parameter version strings. If no outputs need to be updated, then this method will return false. If any outputs need updating, then true is returned.
- Parameters:
pipelineSpecificationWithVersions - is the pipeline specification including new version info for all transformation and output connections.
newDocumentVersion - is the newly-determined document version.
newAuthorityNameString - is the newly-determined authority name.
- Returns:
- true if the document needs to be refetched.
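The "any output needs updating" rule can be illustrated with a simplified sketch. Everything here is an assumption for illustration: `FetchDecisionSketch`, `OutputState`, and `needsFetch` are hypothetical names, and the sketch compares only the document version and authority name, whereas the real pipeline specification also carries transformation and output version information.

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical illustration of a refetch decision; NOT the real implementation. */
class FetchDecisionSketch {
    /** What one output is assumed to remember about a document. */
    static final class OutputState {
        final String documentVersion; // null if the document was never indexed there
        final String authorityName;   // null if no authority is associated

        OutputState(String documentVersion, String authorityName) {
            this.documentVersion = documentVersion;
            this.authorityName = authorityName;
        }
    }

    /** Returns true if at least one output holds stale information; false if all agree. */
    static boolean needsFetch(List<OutputState> outputs, String newDocumentVersion, String newAuthorityName) {
        for (OutputState output : outputs) {
            boolean versionChanged = !Objects.equals(output.documentVersion, newDocumentVersion);
            boolean authorityChanged = !Objects.equals(output.authorityName, newAuthorityName);
            if (versionChanged || authorityChanged) {
                return true; // one stale output is enough to require a fetch
            }
        }
        return false;
    }
}
```

Using `Objects.equals` keeps the comparison null-safe, so an output that has never seen the document (a null recorded version) always triggers a fetch.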
-
documentRecord
void documentRecord(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, long recordTime) throws ManifoldCFException
Record a document version, but don't ingest it. The purpose of this method is to update document version information without reindexing the document.
- Parameters:
pipelineSpecificationBasic - is the basic pipeline specification needed.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
componentHash - is the hashed component identifier, if any.
documentVersion - is the document version.
recordTime - is the time at which the recording took place, in milliseconds since epoch.
- Throws:
ManifoldCFException
-
documentNoData
void documentNoData(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, java.lang.String authorityName, long recordTime, IOutputActivity activities) throws ManifoldCFException, ServiceInterruption
Remove a document from specified indexes, just as if an empty document was indexed, and record the necessary version information. This method is conceptually similar to documentIngest(), but does not actually take a document or allow it to be transformed. If there is a document already indexed, it is removed from the index.
- Parameters:
pipelineSpecificationWithVersions - is the pipeline specification with already-fetched output versioning information.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
componentHash - is the hashed component identifier, if any.
documentVersion - is the document version.
authorityName - is the name of the authority associated with the document, if any.
recordTime - is the time at which the recording took place, in milliseconds since epoch.
activities - is an object providing a set of methods that the implementer can use to perform the operation.
- Throws:
ManifoldCFException
ServiceInterruption
-
documentIngest
boolean documentIngest(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, java.lang.String authorityName, RepositoryDocument data, long ingestTime, java.lang.String documentURI, IOutputActivity activities) throws ManifoldCFException, ServiceInterruption, java.io.IOException
Ingest a document. This ingests the document, and notes it. If this is a repeat ingestion of the document, this method also REMOVES ALL OLD METADATA. When complete, the index will contain only the metadata described by the RepositoryDocument object passed to this method. ServiceInterruption is thrown if the document ingestion must be rescheduled.
- Parameters:
pipelineSpecificationWithVersions - is the pipeline specification with already-fetched output versioning information.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
componentHash - is the hashed component identifier, if any.
documentVersion - is the document version.
authorityName - is the name of the authority associated with the document, if any.
data - is the document data. The data is closed after ingestion is complete.
ingestTime - is the time at which the ingestion took place, in milliseconds since epoch.
documentURI - is the URI of the document, which will be used as the key of the document in the index.
activities - is an object providing a set of methods that the implementer can use to perform the operation.
- Returns:
- true if the ingest was ok, false if the ingest is illegal (and should not be repeated).
- Throws:
java.io.IOException - only if data stream throws an IOException.
ManifoldCFException
ServiceInterruption
-
documentRemove
void documentRemove(IPipelineConnections pipelineConnections, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Remove a document component from the search engine index.
- Parameters:
pipelineConnections - is the pipeline specification.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hash of the id of the document.
componentHash - is the hashed component identifier, if any.
activities - is the object to use to log the details of the ingestion attempt. May be null.
- Throws:
ManifoldCFException
ServiceInterruption
-
documentRemoveMultiple
void documentRemoveMultiple(IPipelineConnections pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, java.lang.String componentHash, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Remove multiple document components from the search engine index.
- Parameters:
pipelineConnections - is the pipeline specification.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - are the hashes of the ids of the documents.
componentHash - is the hashed component identifier, if any.
activities - is the object to use to log the details of the ingestion attempt. May be null.
- Throws:
ManifoldCFException
ServiceInterruption
-
documentCheckMultiple
void documentCheckMultiple(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, long checkTime) throws ManifoldCFException
Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).
- Parameters:
pipelineSpecificationBasic - is a pipeline specification.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the set of document identifier hashes.
checkTime - is the time at which the check took place, in milliseconds since epoch.
- Throws:
ManifoldCFException
-
documentCheck
void documentCheck(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash, long checkTime) throws ManifoldCFException
Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).
- Parameters:
pipelineSpecificationBasic - is a basic pipeline specification.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
checkTime - is the time at which the check took place, in milliseconds since epoch.
- Throws:
ManifoldCFException
-
documentDeleteMultiple
void documentDeleteMultiple(IPipelineConnections[] pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Delete multiple documents, and their components, from the search engine index.
- Parameters:
pipelineConnections - are the pipeline specifications associated with the documents.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes of the documents.
activities - is the object to use to log the details of the ingestion attempt. May be null.
- Throws:
ManifoldCFException
ServiceInterruption
-
documentDeleteMultiple
void documentDeleteMultiple(IPipelineConnections pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Delete multiple documents, and their components, from the search engine index.
- Parameters:
pipelineConnections - is the pipeline specification.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes of the documents.
activities - is the object to use to log the details of the ingestion attempt. May be null.
- Throws:
ManifoldCFException
ServiceInterruption
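The "Multiple" methods take the identifier classes and identifier hashes as separate arrays. The sketch below shows one way a caller might build such arrays from a list of document keys, assuming that entries at the same index describe the same document (the parameter descriptions imply this but do not state it). `ParallelArraysSketch`, `DocKey`, and `toParallelArrays` are hypothetical helper names, not part of this API.

```java
import java.util.List;

/** Hypothetical helper for building index-aligned argument arrays; NOT part of the API. */
class ParallelArraysSketch {
    /** Internal document key: the identifier space name plus the hashed identifier. */
    static final class DocKey {
        final String identifierClass;
        final String identifierHash;

        DocKey(String identifierClass, String identifierHash) {
            this.identifierClass = identifierClass;
            this.identifierHash = identifierHash;
        }
    }

    /**
     * Splits the keys into two arrays of equal length.
     * Element [0] holds the identifier classes, element [1] the identifier hashes;
     * position i in both arrays refers to the i-th key in the input list.
     */
    static String[][] toParallelArrays(List<DocKey> keys) {
        String[] identifierClasses = new String[keys.size()];
        String[] identifierHashes = new String[keys.size()];
        for (int i = 0; i < keys.size(); i++) {
            identifierClasses[i] = keys.get(i).identifierClass;
            identifierHashes[i] = keys.get(i).identifierHash;
        }
        return new String[][] { identifierClasses, identifierHashes };
    }
}
```

Building both arrays in a single loop keeps them the same length and in the same order, which is the property the index-aligned assumption depends on.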
-
documentDelete
void documentDelete(IPipelineConnections pipelineConnections, java.lang.String identifierClass, java.lang.String identifierHash, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Delete a document, and all its components, from the search engine index.
- Parameters:
pipelineConnections - is the pipeline specification.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hash of the id of the document.
activities - is the object to use to log the details of the ingestion attempt. May be null.
- Throws:
ManifoldCFException
ServiceInterruption
-
getPipelineDocumentIngestDataMultiple
void getPipelineDocumentIngestDataMultiple(IngestStatuses rval, IPipelineSpecificationBasic[] pipelineSpecificationBasics, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) throws ManifoldCFException
Look up ingestion data for a set of documents.
- Parameters:
rval - is a map of output key to document data, in no particular order, which will be loaded with all matching results.
pipelineSpecificationBasics - are the pipeline specifications corresponding to the identifier classes and hashes.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes to look up.
- Throws:
ManifoldCFException
-
getPipelineDocumentIngestDataMultiple
void getPipelineDocumentIngestDataMultiple(IngestStatuses rval, IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) throws ManifoldCFException
Look up ingestion data for a set of documents.
- Parameters:
rval - is a map of output key to document data, in no particular order, which will be loaded with all matching results.
pipelineSpecificationBasic - is the pipeline specification for all documents.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes to look up.
- Throws:
ManifoldCFException
-
getPipelineDocumentIngestData
void getPipelineDocumentIngestData(IngestStatuses rval, IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash) throws ManifoldCFException
Look up ingestion data for a document.
- Parameters:
rval - is a map of output key to document data, in no particular order, which will be loaded with all matching results.
pipelineSpecificationBasic - is the pipeline specification for the document.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hash of the id of the document.
- Throws:
ManifoldCFException
-
getDocumentUpdateIntervalMultiple
long[] getDocumentUpdateIntervalMultiple(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) throws ManifoldCFException
Calculate the average time interval between changes for a document. This is based on the data gathered for the document.
- Parameters:
pipelineSpecificationBasic - is the basic pipeline specification.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - are the hashes of the ids of the documents.
- Returns:
- the number of milliseconds between changes for each document, or 0 where this cannot be calculated.
- Throws:
ManifoldCFException
-
getDocumentUpdateInterval
long getDocumentUpdateInterval(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash) throws ManifoldCFException
Calculate the average time interval between changes for a document. This is based on the data gathered for the document.
- Parameters:
pipelineSpecificationBasic - is the basic pipeline specification.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hash of the id of the document.
- Returns:
- the number of milliseconds between changes, or 0 if this cannot be calculated.
- Throws:
ManifoldCFException
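This page does not say how the average interval is derived from "the data gathered for the document". One plausible calculation, shown purely as an assumption, divides the time between the first and last observed change by the number of gaps between changes and returns 0 when fewer than two changes were seen. `UpdateIntervalSketch`, `averageInterval`, and its three parameters are hypothetical and do not come from this API.

```java
/** Hypothetical illustration of an average-interval calculation; NOT the real implementation. */
class UpdateIntervalSketch {
    /**
     * @param firstChangeTime time of the first observed change, in milliseconds since epoch
     * @param lastChangeTime  time of the last observed change, in milliseconds since epoch
     * @param changeCount     number of observed changes
     * @return average milliseconds between changes, or 0 if it cannot be calculated
     */
    static long averageInterval(long firstChangeTime, long lastChangeTime, long changeCount) {
        // With fewer than two changes there is no gap to average.
        if (changeCount < 2 || lastChangeTime <= firstChangeTime) {
            return 0L;
        }
        // N changes are separated by N - 1 gaps.
        return (lastChangeTime - firstChangeTime) / (changeCount - 1);
    }
}
```

Returning 0 for the undefined case matches the documented contract, so a caller can treat 0 as "no estimate available" rather than "changes constantly".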
-
resetOutputConnection
void resetOutputConnection(IOutputConnection outputConnection) throws ManifoldCFException
Reset all documents belonging to a specific output connection, because we have learned that the target system has been reconfigured. This will force all such documents to be reindexed the next time they are checked.
- Parameters:
outputConnection - is the output connection associated with this action.
- Throws:
ManifoldCFException
-
removeOutputConnection
void removeOutputConnection(IOutputConnection outputConnection) throws ManifoldCFException
Remove all knowledge of an output index from the system. This is appropriate when the output index no longer exists and you wish to delete the associated job.
- Parameters:
outputConnection - is the output connection associated with this action.
- Throws:
ManifoldCFException
-
-