Interface IIncrementalIngester
-
- All Known Implementing Classes:
IncrementalIngester
public interface IIncrementalIngester
This interface describes the incremental ingestion API.
SOME NOTES:
The expected client flow for this API is to:
1) Use the API to fetch a document's version.
2) Decide whether to ingest based on that version.
3) If the decision is to ingest, call the ingest method in the API.
The module described by this interface is responsible for keeping track of what has been sent where, and also the corresponding version of each document so indexed. The space over which this takes place is defined by the individual output connection; that is, the output connection in effect "remembers" which documents were handed to it.
A secondary purpose of this module is to provide a mapping between the key by which a document is described internally (an identifier hash, plus the name of an identifier space) and the way the document is identified in the output space (the name of an output connection, plus a URI which is considered local to that output connection space).
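The three-step client flow above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the ManifoldCF implementation: `IngestFlowSketch`, `lookupVersion`, and `ingestIfChanged` are hypothetical names, and the in-memory map stands in for whatever storage the real module uses. The key combines an output connection name with the identifier class and identifier hash, mirroring the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the bookkeeping this interface describes.
// None of these names come from ManifoldCF.
final class IngestFlowSketch {
    // Last indexed version per (output connection, identifier class, identifier hash).
    private final Map<String, String> versions = new HashMap<>();

    private static String key(String outputConnection, String identifierClass, String identifierHash) {
        return outputConnection + "|" + identifierClass + "|" + identifierHash;
    }

    // Step 1: fetch the version recorded for a document, or null if it was never indexed.
    String lookupVersion(String outputConnection, String identifierClass, String identifierHash) {
        return versions.get(key(outputConnection, identifierClass, identifierHash));
    }

    // Steps 2 and 3: decide based on the version, and ingest only when it changed.
    // Returns true if an ingestion was performed.
    boolean ingestIfChanged(String outputConnection, String identifierClass,
                            String identifierHash, String newVersion) {
        String oldVersion = lookupVersion(outputConnection, identifierClass, identifierHash);
        if (newVersion.equals(oldVersion)) {
            return false; // versions agree, nothing to do
        }
        // A real implementation would hand the document to the output connection here.
        versions.put(key(outputConnection, identifierClass, identifierHash), newVersion);
        return true;
    }

    public static void main(String[] args) {
        IngestFlowSketch sketch = new IngestFlowSketch();
        System.out.println(sketch.ingestIfChanged("output1", "space1", "abc", "v1")); // true
        System.out.println(sketch.ingestIfChanged("output1", "space1", "abc", "v1")); // false
        System.out.println(sketch.ingestIfChanged("output1", "space1", "abc", "v2")); // true
    }
}
```

Keying on the output connection as well as the document identifier is what lets the same document be up to date in one output and stale in another.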
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
static java.lang.String _rcsid
-
Method Summary
All Methods (Instance Methods, Abstract Methods), listed as Modifier and Type, Method, Description:
boolean checkDateIndexable(IPipelineSpecification pipelineSpecification, java.util.Date date, IOutputCheckActivity activity)
- Check if a document date is indexable.
boolean checkDocumentIndexable(IPipelineSpecification pipelineSpecification, java.io.File localFile, IOutputCheckActivity activity)
- Check if a file is indexable.
boolean checkFetchDocument(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String newDocumentVersion, java.lang.String newAuthorityNameString)
- Determine whether we need to fetch or refetch a document.
boolean checkLengthIndexable(IPipelineSpecification pipelineSpecification, long length, IOutputCheckActivity activity)
- Pre-determine whether a document's length is indexable by this connector.
boolean checkMimeTypeIndexable(IPipelineSpecification pipelineSpecification, java.lang.String mimeType, IOutputCheckActivity activity)
- Check if a mime type is indexable.
boolean checkURLIndexable(IPipelineSpecification pipelineSpecification, java.lang.String url, IOutputCheckActivity activity)
- Pre-determine whether a document's URL is indexable by this connector.
void clearAll()
- Flush all knowledge of what was ingested before.
void deinstall()
- Uninstall the incremental ingestion manager.
void documentCheck(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash, long checkTime)
- Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).
void documentCheckMultiple(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, long checkTime)
- Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).
void documentDelete(IPipelineConnections pipelineConnections, java.lang.String identifierClass, java.lang.String identifierHash, IOutputRemoveActivity activities)
- Delete a document, and all its components, from the search engine index.
void documentDeleteMultiple(IPipelineConnections[] pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity activities)
- Delete multiple documents, and their components, from the search engine index.
void documentDeleteMultiple(IPipelineConnections pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity activities)
- Delete multiple documents, and their components, from the search engine index.
boolean documentIngest(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, java.lang.String authorityName, RepositoryDocument data, long ingestTime, java.lang.String documentURI, IOutputActivity activities)
- Ingest a document.
void documentNoData(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, java.lang.String authorityName, long recordTime, IOutputActivity activities)
- Remove a document from specified indexes, just as if an empty document was indexed, and record the necessary version information.
void documentRecord(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, long recordTime)
- Record a document version, but don't ingest it.
void documentRemove(IPipelineConnections pipelineConnections, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, IOutputRemoveActivity activities)
- Remove a document component from the search engine index.
void documentRemoveMultiple(IPipelineConnections pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, java.lang.String componentHash, IOutputRemoveActivity activities)
- Remove multiple document components from the search engine index.
long getDocumentUpdateInterval(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash)
- Calculate the average time interval between changes for a document.
long[] getDocumentUpdateIntervalMultiple(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes)
- Calculate the average time interval between changes for a document.
java.lang.String getFirstIndexedOutputConnectionName(IPipelineSpecificationBasic pipelineSpecificationBasic)
- From a pipeline specification, get the name of the output connection that will be indexed first in the pipeline.
java.lang.String getLastIndexedOutputConnectionName(IPipelineSpecificationBasic pipelineSpecificationBasic)
- From a pipeline specification, get the name of the output connection that will be indexed last in the pipeline.
VersionContext getOutputDescription(IOutputConnection outputConnection, Specification spec)
- Get an output version string for a document.
void getPipelineDocumentIngestData(IngestStatuses rval, IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash)
- Look up ingestion data for a document.
void getPipelineDocumentIngestDataMultiple(IngestStatuses rval, IPipelineSpecificationBasic[] pipelineSpecificationBasics, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes)
- Look up ingestion data for a set of documents.
void getPipelineDocumentIngestDataMultiple(IngestStatuses rval, IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes)
- Look up ingestion data for a set of documents.
VersionContext getTransformationDescription(ITransformationConnection transformationConnection, Specification spec)
- Get transformation version string for a document.
void install()
- Install the incremental ingestion manager.
void removeOutputConnection(IOutputConnection outputConnection)
- Remove all knowledge of an output index from the system.
void resetOutputConnection(IOutputConnection outputConnection)
- Reset all documents belonging to a specific output connection, because we've got information that that system has been reconfigured.
-
-
-
Field Detail
-
_rcsid
static final java.lang.String _rcsid
- See Also:
- Constant Field Values
-
-
Method Detail
-
install
void install() throws ManifoldCFException
Install the incremental ingestion manager.
- Throws:
ManifoldCFException
-
deinstall
void deinstall() throws ManifoldCFException
Uninstall the incremental ingestion manager.
- Throws:
ManifoldCFException
-
clearAll
void clearAll() throws ManifoldCFException
Flush all knowledge of what was ingested before.
- Throws:
ManifoldCFException
-
getLastIndexedOutputConnectionName
java.lang.String getLastIndexedOutputConnectionName(IPipelineSpecificationBasic pipelineSpecificationBasic)
From a pipeline specification, get the name of the output connection that will be indexed last in the pipeline.
- Parameters:
pipelineSpecificationBasic - is the basic pipeline specification.
- Returns:
- the last indexed output connection name.
-
getFirstIndexedOutputConnectionName
java.lang.String getFirstIndexedOutputConnectionName(IPipelineSpecificationBasic pipelineSpecificationBasic)
From a pipeline specification, get the name of the output connection that will be indexed first in the pipeline.
- Parameters:
pipelineSpecificationBasic - is the basic pipeline specification.
- Returns:
- the first indexed output connection name.
-
getOutputDescription
VersionContext getOutputDescription(IOutputConnection outputConnection, Specification spec) throws ManifoldCFException, ServiceInterruption
Get an output version string for a document.
- Parameters:
outputConnection - is the output connection associated with this action.
spec - is the output specification.
- Returns:
- the description string.
- Throws:
ManifoldCFException
ServiceInterruption
-
getTransformationDescription
VersionContext getTransformationDescription(ITransformationConnection transformationConnection, Specification spec) throws ManifoldCFException, ServiceInterruption
Get transformation version string for a document.
- Parameters:
transformationConnection - is the transformation connection associated with this action.
spec - is the transformation specification.
- Returns:
- the description string.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkDateIndexable
boolean checkDateIndexable(IPipelineSpecification pipelineSpecification, java.util.Date date, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Check if a document date is indexable.
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
date - is the date to check.
activity - are the activities available to this method.
- Returns:
- true if the document with that date is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkMimeTypeIndexable
boolean checkMimeTypeIndexable(IPipelineSpecification pipelineSpecification, java.lang.String mimeType, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Check if a mime type is indexable.
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
mimeType - is the mime type to check.
activity - are the activities available to this method.
- Returns:
- true if the mimeType is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkDocumentIndexable
boolean checkDocumentIndexable(IPipelineSpecification pipelineSpecification, java.io.File localFile, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Check if a file is indexable.
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
localFile - is the local file to check.
activity - are the activities available to this method.
- Returns:
- true if the local file is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkLengthIndexable
boolean checkLengthIndexable(IPipelineSpecification pipelineSpecification, long length, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Pre-determine whether a document's length is indexable by this connector. This method is used by participating repository connectors to help filter out documents that are too long to be indexable.
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
length - is the length of the document.
activity - are the activities available to this method.
- Returns:
- true if the file is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkURLIndexable
boolean checkURLIndexable(IPipelineSpecification pipelineSpecification, java.lang.String url, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Pre-determine whether a document's URL is indexable by this connector. This method is used by participating repository connectors to help filter out documents that are not indexable.
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
url - is the url of the document.
activity - are the activities available to this method.
- Returns:
- true if the file is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkFetchDocument
boolean checkFetchDocument(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String newDocumentVersion, java.lang.String newAuthorityNameString)
Determine whether we need to fetch or refetch a document. Pass in information including the pipeline specification with existing version info, plus new document and parameter version strings. If no outputs need to be updated, then this method will return false. If any outputs need updating, then true is returned.
- Parameters:
pipelineSpecificationWithVersions - is the pipeline specification including new version info for all transformation and output connections.
newDocumentVersion - is the newly-determined document version.
newAuthorityNameString - is the newly-determined authority name.
- Returns:
- true if the document needs to be refetched.
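The "false if no outputs need updating, true if any do" rule can be illustrated with a small stand-alone sketch. It is an assumption-laden simplification, not ManifoldCF code: `FetchDecisionSketch`, `OutputState`, and `needsFetch` are hypothetical, and the sketch compares only the document version and authority name, leaving out the transformation and output version strings that the real pipeline specification also carries.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the "any output needs updating" rule; not ManifoldCF code.
final class FetchDecisionSketch {
    // What one output is assumed to remember about its last indexing of the document.
    static final class OutputState {
        final String documentVersion; // null if never indexed on this output
        final String authorityName;   // null if never indexed on this output

        OutputState(String documentVersion, String authorityName) {
            this.documentVersion = documentVersion;
            this.authorityName = authorityName;
        }
    }

    // Returns true if at least one output is out of date, false if every output agrees.
    static boolean needsFetch(List<OutputState> outputs, String newDocumentVersion, String newAuthorityName) {
        for (OutputState state : outputs) {
            if (!newDocumentVersion.equals(state.documentVersion)
                    || !newAuthorityName.equals(state.authorityName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<OutputState> upToDate = Arrays.asList(
                new OutputState("v2", "authA"), new OutputState("v2", "authA"));
        List<OutputState> oneStale = Arrays.asList(
                new OutputState("v2", "authA"), new OutputState("v1", "authA"));
        System.out.println(needsFetch(upToDate, "v2", "authA")); // false
        System.out.println(needsFetch(oneStale, "v2", "authA")); // true
    }
}
```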
-
documentRecord
void documentRecord(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, long recordTime) throws ManifoldCFException
Record a document version, but don't ingest it. The purpose of this method is to update document version information without reindexing the document.
- Parameters:
pipelineSpecificationBasic - is the basic pipeline specification needed.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
componentHash - is the hashed component identifier, if any.
documentVersion - is the document version.
recordTime - is the time at which the recording took place, in milliseconds since epoch.
- Throws:
ManifoldCFException
-
documentNoData
void documentNoData(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, java.lang.String authorityName, long recordTime, IOutputActivity activities) throws ManifoldCFException, ServiceInterruption
Remove a document from specified indexes, just as if an empty document was indexed, and record the necessary version information. This method is conceptually similar to documentIngest(), but does not actually take a document or allow it to be transformed. If there is a document already indexed, it is removed from the index.
- Parameters:
pipelineSpecificationWithVersions - is the pipeline specification with already-fetched output versioning information.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
componentHash - is the hashed component identifier, if any.
documentVersion - is the document version.
authorityName - is the name of the authority associated with the document, if any.
recordTime - is the time at which the recording took place, in milliseconds since epoch.
activities - is an object providing a set of methods that the implementer can use to perform the operation.
- Throws:
ManifoldCFException
ServiceInterruption
-
documentIngest
boolean documentIngest(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, java.lang.String authorityName, RepositoryDocument data, long ingestTime, java.lang.String documentURI, IOutputActivity activities) throws ManifoldCFException, ServiceInterruption, java.io.IOException
Ingest a document. This ingests the document, and notes it. If this is a repeat ingestion of the document, this method also REMOVES ALL OLD METADATA. When complete, the index will contain only the metadata described by the RepositoryDocument object passed to this method. ServiceInterruption is thrown if the document ingestion must be rescheduled.
- Parameters:
pipelineSpecificationWithVersions - is the pipeline specification with already-fetched output versioning information.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
componentHash - is the hashed component identifier, if any.
documentVersion - is the document version.
authorityName - is the name of the authority associated with the document, if any.
data - is the document data. The data is closed after ingestion is complete.
ingestTime - is the time at which the ingestion took place, in milliseconds since epoch.
documentURI - is the URI of the document, which will be used as the key of the document in the index.
activities - is an object providing a set of methods that the implementer can use to perform the operation.
- Returns:
- true if the ingest was ok, false if the ingest is illegal (and should not be repeated).
- Throws:
java.io.IOException - only if data stream throws an IOException.
ManifoldCFException
ServiceInterruption
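The replace-not-merge behavior of a repeat ingestion can be shown with a toy index. This is a sketch under assumptions, not the real ingestion path: `ReingestSketch` and its methods are hypothetical, and a plain map keyed by document URI stands in for the search engine index.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of replace-on-reingest semantics; not ManifoldCF code.
final class ReingestSketch {
    // Toy index: document URI -> metadata fields.
    private final Map<String, Map<String, String>> index = new HashMap<>();

    // Stores exactly the given metadata for the URI. Fields from an earlier
    // ingestion of the same URI are dropped rather than merged.
    void ingest(String documentURI, Map<String, String> metadata) {
        index.put(documentURI, new HashMap<>(metadata));
    }

    // Returns the stored metadata, or null if the URI was never ingested.
    Map<String, String> lookup(String documentURI) {
        return index.get(documentURI);
    }

    public static void main(String[] args) {
        ReingestSketch sketch = new ReingestSketch();

        Map<String, String> first = new HashMap<>();
        first.put("title", "Original");
        first.put("author", "Someone");
        sketch.ingest("doc://1", first);

        Map<String, String> second = new HashMap<>();
        second.put("title", "Updated");
        sketch.ingest("doc://1", second);

        // The "author" field from the first ingestion is gone.
        System.out.println(sketch.lookup("doc://1")); // prints {title=Updated}
    }
}
```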
-
documentRemove
void documentRemove(IPipelineConnections pipelineConnections, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Remove a document component from the search engine index.
- Parameters:
pipelineConnections - is the pipeline specification.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hash of the id of the document.
componentHash - is the hashed component identifier, if any.
activities - is the object to use to log the details of the ingestion attempt. May be null.
- Throws:
ManifoldCFException
ServiceInterruption
-
documentRemoveMultiple
void documentRemoveMultiple(IPipelineConnections pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, java.lang.String componentHash, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Remove multiple document components from the search engine index.
- Parameters:
pipelineConnections - is the pipeline specification.
identifierClasses - are the names of the spaces in which the identifier hash should be interpreted.
identifierHashes - are the hashes of the ids of the documents.
componentHash - is the hashed component identifier, if any.
activities - is the object to use to log the details of the ingestion attempt. May be null.
- Throws:
ManifoldCFException
ServiceInterruption
-
documentCheckMultiple
void documentCheckMultiple(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, long checkTime) throws ManifoldCFException
Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).
- Parameters:
pipelineSpecificationBasic - is a pipeline specification.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - are the set of document identifier hashes.
checkTime - is the time at which the check took place, in milliseconds since epoch.
- Throws:
ManifoldCFException
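The "Multiple" variants take parallel arrays of identifier classes and identifier hashes. The sketch below assumes that element i of each array together names document i; that pairing is an inference from the parameter descriptions, and `CheckMultipleSketch` with its methods is hypothetical rather than ManifoldCF code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the parallel-array calling convention; not ManifoldCF code.
final class CheckMultipleSketch {
    // Last check time per document, keyed by identifier class and hash.
    private final Map<String, Long> checkTimes = new HashMap<>();

    private static String key(String identifierClass, String identifierHash) {
        return identifierClass + "|" + identifierHash;
    }

    // Assumption: identifierClasses[i] and identifierHashes[i] together name document i.
    void checkMultiple(String[] identifierClasses, String[] identifierHashes, long checkTime) {
        if (identifierClasses.length != identifierHashes.length) {
            throw new IllegalArgumentException("identifier arrays must have the same length");
        }
        for (int i = 0; i < identifierHashes.length; i++) {
            checkTimes.put(key(identifierClasses[i], identifierHashes[i]), checkTime);
        }
    }

    // Returns the last recorded check time, or null if the document was never checked.
    Long lastCheckTime(String identifierClass, String identifierHash) {
        return checkTimes.get(key(identifierClass, identifierHash));
    }

    public static void main(String[] args) {
        CheckMultipleSketch sketch = new CheckMultipleSketch();
        sketch.checkMultiple(new String[] {"space1", "space1"}, new String[] {"h1", "h2"}, 1000L);
        System.out.println(sketch.lastCheckTime("space1", "h2")); // 1000
    }
}
```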
-
documentCheck
void documentCheck(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash, long checkTime) throws ManifoldCFException
Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).
- Parameters:
pipelineSpecificationBasic - is a basic pipeline specification.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
checkTime - is the time at which the check took place, in milliseconds since epoch.
- Throws:
ManifoldCFException
-
documentDeleteMultiple
void documentDeleteMultiple(IPipelineConnections[] pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Delete multiple documents, and their components, from the search engine index.
- Parameters:
pipelineConnections - are the pipeline specifications associated with the documents.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes of the documents.
activities - is the object to use to log the details of the ingestion attempt. May be null.
- Throws:
ManifoldCFException
ServiceInterruption
-
documentDeleteMultiple
void documentDeleteMultiple(IPipelineConnections pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Delete multiple documents, and their components, from the search engine index.
- Parameters:
pipelineConnections - is the pipeline specification.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes of the documents.
activities - is the object to use to log the details of the ingestion attempt. May be null.
- Throws:
ManifoldCFException
ServiceInterruption
-
documentDelete
void documentDelete(IPipelineConnections pipelineConnections, java.lang.String identifierClass, java.lang.String identifierHash, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Delete a document, and all its components, from the search engine index.
- Parameters:
pipelineConnections - is the pipeline specification.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hash of the id of the document.
activities - is the object to use to log the details of the ingestion attempt. May be null.
- Throws:
ManifoldCFException
ServiceInterruption
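The difference between removing one component and deleting a document with all its components can be sketched with a toy record of what is indexed. The names here (`RemoveVersusDeleteSketch`, `record`, `removeComponent`, `deleteDocument`) are hypothetical, and the data layout is an assumption made for illustration only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch contrasting component removal with document deletion; not ManifoldCF code.
final class RemoveVersusDeleteSketch {
    // (identifier class + identifier hash) -> hashes of the components currently indexed.
    private final Map<String, Set<String>> indexed = new HashMap<>();

    private static String key(String identifierClass, String identifierHash) {
        return identifierClass + "|" + identifierHash;
    }

    void record(String identifierClass, String identifierHash, String componentHash) {
        indexed.computeIfAbsent(key(identifierClass, identifierHash), k -> new HashSet<>())
               .add(componentHash);
    }

    // Analogue of removing a single document component.
    void removeComponent(String identifierClass, String identifierHash, String componentHash) {
        String k = key(identifierClass, identifierHash);
        Set<String> components = indexed.get(k);
        if (components != null) {
            components.remove(componentHash);
            if (components.isEmpty()) {
                indexed.remove(k);
            }
        }
    }

    // Analogue of deleting a document and all its components.
    void deleteDocument(String identifierClass, String identifierHash) {
        indexed.remove(key(identifierClass, identifierHash));
    }

    int componentCount(String identifierClass, String identifierHash) {
        Set<String> components = indexed.get(key(identifierClass, identifierHash));
        return components == null ? 0 : components.size();
    }

    public static void main(String[] args) {
        RemoveVersusDeleteSketch sketch = new RemoveVersusDeleteSketch();
        sketch.record("space1", "doc", "partA");
        sketch.record("space1", "doc", "partB");
        sketch.removeComponent("space1", "doc", "partA");
        System.out.println(sketch.componentCount("space1", "doc")); // 1
        sketch.deleteDocument("space1", "doc");
        System.out.println(sketch.componentCount("space1", "doc")); // 0
    }
}
```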
-
getPipelineDocumentIngestDataMultiple
void getPipelineDocumentIngestDataMultiple(IngestStatuses rval, IPipelineSpecificationBasic[] pipelineSpecificationBasics, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) throws ManifoldCFException
Look up ingestion data for a set of documents.
- Parameters:
rval - is a map of output key to document data, in no particular order, which will be loaded with all matching results.
pipelineSpecificationBasics - are the pipeline specifications corresponding to the identifier classes and hashes.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes to look up.
- Throws:
ManifoldCFException
-
getPipelineDocumentIngestDataMultiple
void getPipelineDocumentIngestDataMultiple(IngestStatuses rval, IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) throws ManifoldCFException
Look up ingestion data for a set of documents.
- Parameters:
rval - is a map of output key to document data, in no particular order, which will be loaded with all matching results.
pipelineSpecificationBasic - is the pipeline specification for all documents.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes to look up.
- Throws:
ManifoldCFException
-
getPipelineDocumentIngestData
void getPipelineDocumentIngestData(IngestStatuses rval, IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash) throws ManifoldCFException
Look up ingestion data for a document.
- Parameters:
rval - is a map of output key to document data, in no particular order, which will be loaded with all matching results.
pipelineSpecificationBasic - is the pipeline specification for the document.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hash of the id of the document.
- Throws:
ManifoldCFException
-
getDocumentUpdateIntervalMultiple
long[] getDocumentUpdateIntervalMultiple(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) throws ManifoldCFException
Calculate the average time interval between changes for a document. This is based on the data gathered for the document.
- Parameters:
pipelineSpecificationBasic - is the basic pipeline specification.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - are the hashes of the ids of the documents.
- Returns:
- the number of milliseconds between changes, or 0 if this cannot be calculated.
- Throws:
ManifoldCFException
-
getDocumentUpdateInterval
long getDocumentUpdateInterval(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash) throws ManifoldCFException
Calculate the average time interval between changes for a document. This is based on the data gathered for the document.
- Parameters:
pipelineSpecificationBasic - is the basic pipeline specification.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hash of the id of the document.
- Returns:
- the number of milliseconds between changes, or 0 if this cannot be calculated.
- Throws:
ManifoldCFException
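One way such an average could be computed is sketched below. The inputs are assumptions, since the page does not say which data is gathered: the sketch supposes the first and last ingestion times plus a count of changes observed after the first ingestion. `UpdateIntervalSketch` and `averageIntervalMillis` are hypothetical names.

```java
// Hypothetical sketch of an average-change-interval calculation; not ManifoldCF code.
final class UpdateIntervalSketch {
    // Assumed inputs: first and last ingestion times in milliseconds since epoch,
    // and the number of changes seen after the first ingestion.
    // Returns 0 when the interval cannot be calculated.
    static long averageIntervalMillis(long firstIngestMillis, long lastIngestMillis, long changesAfterFirst) {
        if (changesAfterFirst <= 0 || lastIngestMillis < firstIngestMillis) {
            return 0L;
        }
        return (lastIngestMillis - firstIngestMillis) / changesAfterFirst;
    }

    public static void main(String[] args) {
        System.out.println(averageIntervalMillis(0L, 6000L, 3L));    // 2000
        System.out.println(averageIntervalMillis(5000L, 5000L, 0L)); // 0
    }
}
```

Returning 0 for the "cannot be calculated" case matches the return contract stated above, so callers can treat 0 as "no estimate" rather than "changes constantly".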
-
resetOutputConnection
void resetOutputConnection(IOutputConnection outputConnection) throws ManifoldCFException
Reset all documents belonging to a specific output connection, because we've got information that that system has been reconfigured. This will force all such documents to be reindexed the next time they are checked.
- Parameters:
outputConnection - is the output connection associated with this action.
- Throws:
ManifoldCFException
-
removeOutputConnection
void removeOutputConnection(IOutputConnection outputConnection) throws ManifoldCFException
Remove all knowledge of an output index from the system. This is appropriate when the output index no longer exists and you wish to delete the associated job.
- Parameters:
outputConnection - is the output connection associated with this action.
- Throws:
ManifoldCFException
-
-