Class IncrementalIngester
- java.lang.Object
- org.apache.manifoldcf.core.database.BaseTable
- org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester
- All Implemented Interfaces:
IIncrementalIngester
public class IncrementalIngester extends BaseTable implements IIncrementalIngester
Incremental ingestion API implementation. This class is responsible for keeping track of what has been sent where, and also the corresponding version of each document so indexed. The space over which this takes place is defined by the individual output connection - that is, the output connection seems to "remember" what documents were handed to it. A secondary purpose of this module is to provide a mapping between the key by which a document is described internally (by an identifier hash, plus the name of an identifier space), and the way the document is identified in the output space (by the name of an output connection, plus a URI which is considered local to that output connection space).
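The key mapping described above can be pictured with a small, self-contained sketch. Everything here is hypothetical: the class DocKeySketch, the ':' separator inside makeKey, and the in-memory map are assumptions made for illustration only; the real class stores this state in the ingeststatus table and its makeKey implementation is not shown on this page.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the ManifoldCF implementation.
final class DocKeySketch {
    // Maps (output connection name, internal document key) to the URI used in that output space.
    private final Map<String, String> uriByConnectionAndKey = new HashMap<>();

    // Builds an internal key from an identifier space name and an identifier hash.
    // The ':' separator is an assumption.
    static String makeKey(String identifierClass, String identifierHash) {
        return identifierClass + ":" + identifierHash;
    }

    // Records that a document was handed to an output connection under a given URI.
    void remember(String outputConnectionName, String docKey, String documentUri) {
        uriByConnectionAndKey.put(outputConnectionName + "\n" + docKey, documentUri);
    }

    // Returns the URI recorded for this document in this output space, or null if none.
    String lookupUri(String outputConnectionName, String docKey) {
        return uriByConnectionAndKey.get(outputConnectionName + "\n" + docKey);
    }
}
```

The point of the sketch is that the same internal key can map to a different URI, or to nothing at all, for each output connection.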
ingeststatus
Field | Type | Description
id | BIGINT | Primary Key
connectionname | VARCHAR(32) | Reference:outputconnections.connectionname
dockey | VARCHAR(73) |
componenthash | VARCHAR(40) |
docuri | LONGTEXT |
urihash | VARCHAR(40) |
lastversion | LONGTEXT |
lastoutputversion | LONGTEXT |
lasttransformationversion | LONGTEXT |
changecount | BIGINT |
firstingest | BIGINT |
lastingest | BIGINT |
authorityname | VARCHAR(32) |
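The changecount, firstingest, and lastingest columns are the kind of data from which an average change interval, as reported by getDocumentUpdateInterval, could be derived. The sketch below is an assumption about that arithmetic, not the documented implementation; the class name, the method name, and the choice to return 0 when no interval can be computed are all hypothetical.

```java
// Hypothetical illustration only; the real calculation is not shown on this page.
final class UpdateIntervalSketch {
    // Average time between changes, in milliseconds, given the number of recorded
    // changes and the first and last ingestion timestamps (milliseconds since epoch).
    // Returns 0 when there is not enough data (an assumed convention).
    static long averageIntervalMillis(long changeCount, long firstIngest, long lastIngest) {
        if (changeCount <= 0L || lastIngest <= firstIngest) {
            return 0L;
        }
        return (lastIngest - firstIngest) / changeCount;
    }
}
```

For example, four changes recorded between timestamps 1000 and 9000 give an average interval of 2000 milliseconds.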
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
protected static class | IncrementalIngester.DeleteInfo | This class contains the information necessary to delete a document
protected static class | IncrementalIngester.MonitoredAddActivityWrapper | This class passes everything through, and monitors what happens so that the framework can compensate for any transformation connector coding errors.
protected static class | IncrementalIngester.OutputActivitiesWrapper |
protected static class | IncrementalIngester.OutputAddActivitiesWrapper |
class | IncrementalIngester.OutputAddEntryPoint |
protected static class | IncrementalIngester.OutputRecordingActivity | Wrapper class for add activity.
protected static class | IncrementalIngester.OutputRemoveActivitiesWrapper |
static class | IncrementalIngester.PipelineAddEntryPoint | This class describes the entry stage of an add pipeline.
static class | IncrementalIngester.PipelineAddFanout | This class describes the entry stage of multiple siblings in an add pipeline.
static class | IncrementalIngester.PipelineCheckEntryPoint | This class describes the entry stage of a check pipeline.
static class | IncrementalIngester.PipelineCheckFanout | This class describes the entry stage of multiple siblings in a check pipeline.
protected class | IncrementalIngester.PipelineObject |
protected class | IncrementalIngester.PipelineObjectWithVersions |
protected static class | IncrementalIngester.TransformationRecordingActivity | Wrapper class for add activity.
-
Field Summary
Fields
Modifier and Type | Field | Description
static java.lang.String | _rcsid |
protected static java.lang.String | authorityNameField |
protected static java.lang.String | changeCountField |
protected static java.lang.String | componentHashField |
protected IOutputConnectionManager | connectionManager |
protected static java.lang.String | docKeyField |
protected static java.lang.String | docURIField |
protected static java.lang.String | firstIngestField |
protected static java.lang.String | idField |
protected static java.lang.String | lastIngestField |
protected static java.lang.String | lastOutputVersionField |
protected static java.lang.String | lastTransformationVersionField |
protected static java.lang.String | lastVersionField |
protected ILockManager | lockManager |
protected IOutputConnectorPool | outputConnectorPool |
protected static java.lang.String | outputConnNameField |
protected IThreadContext | threadContext |
protected ITransformationConnectorPool | transformationConnectorPool |
protected static java.lang.String | uriHashField |
-
Fields inherited from class org.apache.manifoldcf.core.database.BaseTable
dbInterface, tableName
-
-
Constructor Summary
Constructors
Constructor | Description
IncrementalIngester(IThreadContext threadContext, IDBInterface database) | Constructor.
-
Method Summary
Modifier and Type | Method | Description
boolean | checkDateIndexable(IPipelineSpecification pipelineSpecification, java.util.Date date, IOutputCheckActivity activity) | Check if a date is indexable.
boolean | checkDocumentIndexable(IPipelineSpecification pipelineSpecification, java.io.File localFile, IOutputCheckActivity activity) | Check if a file is indexable.
boolean | checkFetchDocument(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String newDocumentVersion, java.lang.String newAuthorityNameString) | Determine whether we need to fetch or refetch a document.
boolean | checkLengthIndexable(IPipelineSpecification pipelineSpecification, long length, IOutputCheckActivity activity) | Pre-determine whether a document's length is indexable by this connector.
boolean | checkMimeTypeIndexable(IPipelineSpecification pipelineSpecification, java.lang.String mimeType, IOutputCheckActivity activity) | Check if a mime type is indexable.
boolean | checkURLIndexable(IPipelineSpecification pipelineSpecification, java.lang.String url, IOutputCheckActivity activity) | Pre-determine whether a document's URL is indexable by this connector.
void | clearAll() | Flush all knowledge of what was ingested before.
protected static java.lang.String[] | computeLockArray(java.lang.String documentURIHash, java.lang.String oldURIHash, java.lang.String outputConnectionName) |
protected static java.lang.String | computePackedTransformationVersion(IPipelineSpecification pipelineSpecification, int stage) | Compute a transformation version given a pipeline specification and starting output stage.
protected static java.lang.String | createURILockName(java.lang.String outputConnectionName, java.lang.String uriHash) |
void | deinstall() | Uninstall the incremental ingestion manager.
protected void | deleteRowIds(java.util.List<java.lang.Long> list) | Delete a chunk of row ids.
void | documentCheck(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash, long checkTime) | Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).
void | documentCheckMultiple(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, long checkTime) | Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).
void | documentDelete(IPipelineConnections pipelineConnections, java.lang.String identifierClass, java.lang.String identifierHash, IOutputRemoveActivity activities) | Delete a document from the search engine index.
void | documentDeleteMultiple(IPipelineConnections[] pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity activities) | Delete multiple documents from the search engine index.
void | documentDeleteMultiple(IPipelineConnections pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity originalActivities) | Delete multiple documents from the search engine index.
boolean | documentIngest(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, java.lang.String authorityName, RepositoryDocument data, long ingestTime, java.lang.String documentURI, IOutputActivity activities) | Ingest a document.
void | documentNoData(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, java.lang.String authorityName, long recordTime, IOutputActivity activities) | Remove a document from specified indexes, just as if an empty document was indexed, and record the necessary version information.
void | documentRecord(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, long recordTime) | Record a document version, but don't ingest it.
void | documentRemove(IPipelineConnections pipelineConnections, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, IOutputRemoveActivity activities) | Remove a document component from the search engine index.
void | documentRemoveMultiple(IPipelineConnections pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, java.lang.String componentHash, IOutputRemoveActivity activities) | Remove multiple document components from the search engine index.
protected static java.lang.String[] | extractOutputConnectionNames(IPipelineSpecificationBasic pipelineSpecificationBasic) |
protected void | findRowIdsForDocIds(java.lang.String[] outputConnectionNames, java.util.Set<java.lang.Long> rowIDSet, java.util.List<java.lang.String> paramValues) | Given values and parameters corresponding to a set of hash values, add corresponding table row id's to the output map.
protected void | findRowIdsForDocIds(java.lang.String outputConnectionName, java.util.Set<java.lang.Long> rowIDSet, java.util.List<java.lang.String> paramValues) | Given values and parameters corresponding to a set of hash values, add corresponding table row id's to the output map.
protected void | findRowIdsForDocIds(java.lang.String outputConnectionName, java.util.Set<java.lang.Long> rowIDSet, java.util.List<java.lang.String> paramValues, java.lang.String componentHash) | Given values and parameters corresponding to a set of hash values, add corresponding table row id's to the output map.
protected void | findRowIdsForURIs(java.lang.String outputConnectionName, java.util.Set<java.lang.Long> rowIDSet, java.util.Set<java.lang.String> uris, java.util.List<java.lang.String> hashParamValues) | Given values and parameters corresponding to a set of hash values, add corresponding table row id's to the output map.
long | getDocumentUpdateInterval(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash) | Calculate the average time interval between changes for a document.
long[] | getDocumentUpdateIntervalMultiple(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) | Calculate the average time interval between changes for a document.
protected void | getDocumentURIChunk(java.util.List<IncrementalIngester.DeleteInfo> rval, java.lang.String outputConnectionName, java.util.List<java.lang.String> list) | Get a chunk of document uris.
protected void | getDocumentURIChunk(java.util.List<IncrementalIngester.DeleteInfo> rval, java.lang.String outputConnectionName, java.util.List<java.lang.String> list, java.lang.String componentHash) | Get a chunk of document uris.
protected java.util.List<IncrementalIngester.DeleteInfo> | getDocumentURIMultiple(java.lang.String outputConnectionName, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) | Find out under what URIs a SET of documents is currently ingested.
protected java.util.List<IncrementalIngester.DeleteInfo> | getDocumentURIMultiple(java.lang.String outputConnectionName, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, java.lang.String componentHash) | Find out under what URIs a SET of documents is currently ingested.
java.lang.String | getFirstIndexedOutputConnectionName(IPipelineSpecificationBasic pipelineSpecificationBasic) | From a pipeline specification, get the name of the output connection that will be indexed first in the pipeline.
protected void | getIntervals(long[] rval, java.lang.String[] outputConnectionNames, java.util.List<java.lang.String> list, java.util.Map<java.lang.String,java.lang.Integer> returnMap) | Query for and calculate the interval for a bunch of hashcodes.
java.lang.String | getLastIndexedOutputConnectionName(IPipelineSpecificationBasic pipelineSpecificationBasic) | From a pipeline specification, get the name of the output connection that will be indexed last in the pipeline.
VersionContext | getOutputDescription(IOutputConnection outputConnection, Specification spec) | Get an output version string for a document.
void | getPipelineDocumentIngestData(IngestStatuses rval, IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash) | Look up ingestion data for a document.
protected void | getPipelineDocumentIngestDataChunk(IngestStatuses rval, java.util.Map<java.lang.String,java.lang.Integer> map, java.lang.String[] outputConnectionNames, java.util.List<java.lang.String> list, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) | Get a chunk of document ingest data records.
void | getPipelineDocumentIngestDataMultiple(IngestStatuses rval, IPipelineSpecificationBasic[] pipelineSpecificationBasics, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) | Look up ingestion data for a set of documents.
void | getPipelineDocumentIngestDataMultiple(IngestStatuses rval, IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) | Look up ingestion data for a SET of documents.
VersionContext | getTransformationDescription(ITransformationConnection transformationConnection, Specification spec) | Get transformation version string for a document.
void | install() | Install the incremental ingestion manager.
protected static java.lang.String | makeKey(java.lang.String documentClass, java.lang.String documentHash) | Make a key from a document class and a hash.
protected int | maxClauseDocumentIngestDataChunk(java.lang.String outputConnectionName) | Count the clauses.
protected int | maxClauseDocumentURIChunk(java.lang.String outputConnectionName) | Calculate how many clauses at a time.
protected int | maxClauseDocumentURIChunk(java.lang.String outputConnectionName, java.lang.String componentHash) | Calculate how many clauses at a time.
protected int | maxClauseGetIntervals(java.lang.String[] outputConnectionNames) | Calculate the number of clauses.
protected int | maxClausePipelineDocumentIngestDataChunk(java.lang.String[] outputConnectionNames) | Count the clauses.
protected int | maxClausesDeleteRowIds() | Calculate the maximum number of clauses.
protected int | maxClausesRowIdsForDocIds(java.lang.String outputConnectionName) | Calculate the maximum number of doc ids we should use.
protected int | maxClausesRowIdsForDocIds(java.lang.String[] outputConnectionNames) | Calculate the maximum number of doc ids we should use.
protected int | maxClausesRowIdsForDocIds(java.lang.String outputConnectionName, java.lang.String componentHash) | Calculate the maximum number of doc ids we should use.
protected int | maxClausesRowIdsForURIs(java.lang.String outputConnectionName) | Calculate the clauses.
protected int | maxClausesUpdateRowIds() | Calculate the number of clauses.
protected void | noteDocumentIngest(java.lang.String outputConnectionName, java.lang.String docKey, java.lang.String componentHash, java.lang.String documentVersion, java.lang.String transformationVersion, java.lang.String outputVersion, java.lang.String authorityNameString, long ingestTime, java.lang.String documentURI, java.lang.String documentURIHash) | Note the ingestion of a document, or the "update" of a document.
protected static void | pack(java.lang.StringBuilder sb, java.lang.String value, char delim) |
protected static void | packList(java.lang.StringBuilder output, java.lang.String[] values, char delimiter) |
protected IncrementalIngester.PipelineObject | pipelineGrab(IPipelineSpecification pipelineConnections) | Grab the entire pipeline.
protected IncrementalIngester.PipelineObjectWithVersions | pipelineGrabWithVersions(IPipelineSpecificationWithVersions pipelineConnections) | Grab the entire pipeline.
protected void | removeDocument(IOutputConnection connection, java.lang.String documentURI, java.lang.String outputDescription, IOutputRemoveActivity activities) | Remove document, using the specified output connection, via the standard pool.
void | removeOutputConnection(IOutputConnection outputConnection) | Remove all knowledge of an output index from the system.
void | resetOutputConnection(IOutputConnection outputConnection) | Reset all documents belonging to a specific output connection, because we've got information that that system has been reconfigured.
protected void | updateRowIds(java.util.List<java.lang.Long> list, long checkTime) | Update a chunk of row ids.
-
Methods inherited from class org.apache.manifoldcf.core.database.BaseTable
addTableIndex, analyzeTable, beginTransaction, buildConjunctionClause, constructCountClause, constructDistinctOnClause, constructDoubleCastClause, constructOffsetLimitClause, constructRegexpClause, constructSubstringClause, endTransaction, findConjunctionClauseMax, getDatabaseCacheKey, getDBInterface, getMaxInClause, getMaxOrClause, getSleepAmt, getTableIndexes, getTableName, getTableSchema, getTransactionID, getWindowedReportMaxRows, makeTableKey, noteModifications, performAddIndex, performAlter, performCommit, performCreate, performDelete, performDrop, performInsert, performModification, performQuery, performQuery, performRemoveIndex, performUpdate, prepareRowForSave, readRow, reindexTable, signalRollback, sleepFor
-
-
-
-
Field Detail
-
_rcsid
public static final java.lang.String _rcsid
- See Also:
- Constant Field Values
-
idField
protected static final java.lang.String idField
- See Also:
- Constant Field Values
-
outputConnNameField
protected static final java.lang.String outputConnNameField
- See Also:
- Constant Field Values
-
docKeyField
protected static final java.lang.String docKeyField
- See Also:
- Constant Field Values
-
componentHashField
protected static final java.lang.String componentHashField
- See Also:
- Constant Field Values
-
docURIField
protected static final java.lang.String docURIField
- See Also:
- Constant Field Values
-
uriHashField
protected static final java.lang.String uriHashField
- See Also:
- Constant Field Values
-
lastVersionField
protected static final java.lang.String lastVersionField
- See Also:
- Constant Field Values
-
lastOutputVersionField
protected static final java.lang.String lastOutputVersionField
- See Also:
- Constant Field Values
-
lastTransformationVersionField
protected static final java.lang.String lastTransformationVersionField
- See Also:
- Constant Field Values
-
changeCountField
protected static final java.lang.String changeCountField
- See Also:
- Constant Field Values
-
firstIngestField
protected static final java.lang.String firstIngestField
- See Also:
- Constant Field Values
-
lastIngestField
protected static final java.lang.String lastIngestField
- See Also:
- Constant Field Values
-
authorityNameField
protected static final java.lang.String authorityNameField
- See Also:
- Constant Field Values
-
threadContext
protected final IThreadContext threadContext
-
lockManager
protected final ILockManager lockManager
-
connectionManager
protected final IOutputConnectionManager connectionManager
-
outputConnectorPool
protected final IOutputConnectorPool outputConnectorPool
-
transformationConnectorPool
protected final ITransformationConnectorPool transformationConnectorPool
-
-
Constructor Detail
-
IncrementalIngester
public IncrementalIngester(IThreadContext threadContext, IDBInterface database) throws ManifoldCFException
Constructor.
- Throws:
ManifoldCFException
-
-
Method Detail
-
install
public void install() throws ManifoldCFException
Install the incremental ingestion manager.
- Specified by:
install in interface IIncrementalIngester
- Throws:
ManifoldCFException
-
deinstall
public void deinstall() throws ManifoldCFException
Uninstall the incremental ingestion manager.
- Specified by:
deinstall in interface IIncrementalIngester
- Throws:
ManifoldCFException
-
clearAll
public void clearAll() throws ManifoldCFException
Flush all knowledge of what was ingested before.
- Specified by:
clearAll in interface IIncrementalIngester
- Throws:
ManifoldCFException
-
getLastIndexedOutputConnectionName
public java.lang.String getLastIndexedOutputConnectionName(IPipelineSpecificationBasic pipelineSpecificationBasic)
From a pipeline specification, get the name of the output connection that will be indexed last in the pipeline.
- Specified by:
getLastIndexedOutputConnectionName in interface IIncrementalIngester
- Parameters:
pipelineSpecificationBasic - is the basic pipeline specification.
- Returns:
- the last indexed output connection name.
-
getFirstIndexedOutputConnectionName
public java.lang.String getFirstIndexedOutputConnectionName(IPipelineSpecificationBasic pipelineSpecificationBasic)
From a pipeline specification, get the name of the output connection that will be indexed first in the pipeline.
- Specified by:
getFirstIndexedOutputConnectionName in interface IIncrementalIngester
- Parameters:
pipelineSpecificationBasic - is the basic pipeline specification.
- Returns:
- the first indexed output connection name.
-
checkDateIndexable
public boolean checkDateIndexable(IPipelineSpecification pipelineSpecification, java.util.Date date, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Check if a date is indexable.
- Specified by:
checkDateIndexable in interface IIncrementalIngester
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
date - is the date to check.
activity - are the activities available to this method.
- Returns:
- true if the date is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkMimeTypeIndexable
public boolean checkMimeTypeIndexable(IPipelineSpecification pipelineSpecification, java.lang.String mimeType, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Check if a mime type is indexable.
- Specified by:
checkMimeTypeIndexable in interface IIncrementalIngester
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
mimeType - is the mime type to check.
activity - are the activities available to this method.
- Returns:
- true if the mimeType is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkDocumentIndexable
public boolean checkDocumentIndexable(IPipelineSpecification pipelineSpecification, java.io.File localFile, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Check if a file is indexable.
- Specified by:
checkDocumentIndexable in interface IIncrementalIngester
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
localFile - is the local file to check.
activity - are the activities available to this method.
- Returns:
- true if the local file is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkLengthIndexable
public boolean checkLengthIndexable(IPipelineSpecification pipelineSpecification, long length, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Pre-determine whether a document's length is indexable by this connector. This method is used by participating repository connectors to help filter out documents that are too long to be indexable.
- Specified by:
checkLengthIndexable in interface IIncrementalIngester
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
length - is the length of the document.
activity - are the activities available to this method.
- Returns:
- true if a document of this length is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkURLIndexable
public boolean checkURLIndexable(IPipelineSpecification pipelineSpecification, java.lang.String url, IOutputCheckActivity activity) throws ManifoldCFException, ServiceInterruption
Pre-determine whether a document's URL is indexable by this connector. This method is used by participating repository connectors to help filter out documents that are not indexable.
- Specified by:
checkURLIndexable in interface IIncrementalIngester
- Parameters:
pipelineSpecification - is the IPipelineSpecification object for this pipeline.
url - is the url of the document.
activity - are the activities available to this method.
- Returns:
- true if the URL is indexable.
- Throws:
ManifoldCFException
ServiceInterruption
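As a rough illustration of how a repository connector might use the pre-fetch checks above to avoid downloading content that would be rejected anyway, the following sketch chains three of them. The IndexabilityChecks interface and the shouldFetch helper are hypothetical stand-ins with simplified signatures; they are not ManifoldCF types, and the real methods also take a pipeline specification and an IOutputCheckActivity.

```java
// Hypothetical illustration only; simplified stand-ins for the real check methods.
final class PreFetchFilterSketch {
    // Assumed, simplified view of the checks; not a ManifoldCF interface.
    interface IndexabilityChecks {
        boolean checkURLIndexable(String url);
        boolean checkMimeTypeIndexable(String mimeType);
        boolean checkLengthIndexable(long length);
    }

    // Returns true only if every check accepts the document; the first
    // failing check short-circuits the rest, so no fetch is attempted.
    static boolean shouldFetch(IndexabilityChecks checks, String url, String mimeType, long length) {
        return checks.checkURLIndexable(url)
            && checks.checkMimeTypeIndexable(mimeType)
            && checks.checkLengthIndexable(length);
    }
}
```

Ordering the cheapest checks first is a design choice of the sketch, not something this page specifies.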
-
pipelineGrabWithVersions
protected IncrementalIngester.PipelineObjectWithVersions pipelineGrabWithVersions(IPipelineSpecificationWithVersions pipelineConnections) throws ManifoldCFException
Grab the entire pipeline.
- Parameters:
pipelineConnections - the pipeline specification with version information
- Returns:
- the pipeline description, or null if any part of the pipeline cannot be grabbed.
- Throws:
ManifoldCFException
-
pipelineGrab
protected IncrementalIngester.PipelineObject pipelineGrab(IPipelineSpecification pipelineConnections) throws ManifoldCFException
Grab the entire pipeline.
- Parameters:
pipelineConnections - the pipeline specification
- Returns:
- the pipeline description, or null if any part of the pipeline cannot be grabbed.
- Throws:
ManifoldCFException
-
getOutputDescription
public VersionContext getOutputDescription(IOutputConnection outputConnection, Specification spec) throws ManifoldCFException, ServiceInterruption
Get an output version string for a document.
- Specified by:
getOutputDescription in interface IIncrementalIngester
- Parameters:
outputConnection - is the output connection associated with this action.
spec - is the output specification.
- Returns:
- the description string.
- Throws:
ManifoldCFException
ServiceInterruption
-
getTransformationDescription
public VersionContext getTransformationDescription(ITransformationConnection transformationConnection, Specification spec) throws ManifoldCFException, ServiceInterruption
Get transformation version string for a document.
- Specified by:
getTransformationDescription in interface IIncrementalIngester
- Parameters:
transformationConnection - is the transformation connection associated with this action.
spec - is the transformation specification.
- Returns:
- the description string.
- Throws:
ManifoldCFException
ServiceInterruption
-
checkFetchDocument
public boolean checkFetchDocument(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String newDocumentVersion, java.lang.String newAuthorityNameString)
Determine whether we need to fetch or refetch a document. Pass in information including the pipeline specification with existing version info, plus new document and parameter version strings. If no outputs need to be updated, then this method will return false. If any outputs need updating, then true is returned.
- Specified by:
checkFetchDocument in interface IIncrementalIngester
- Parameters:
pipelineSpecificationWithVersions - is the pipeline specification including new version info for all transformation and output connections.
newDocumentVersion - is the newly-determined document version.
newAuthorityNameString - is the newly-determined authority name.
- Returns:
- true if the document needs to be refetched.
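The decision rule described for checkFetchDocument (return true as soon as any output needs updating, false otherwise) can be sketched as below. This is a minimal illustration under stated assumptions: the OutputVersions holder, its fields, and needsFetch are hypothetical names, and the assumption that "needs updating" means any of the recorded document, authority, transformation, or output versions differs from the new one is inferred from the ingeststatus columns, not confirmed by this page.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical illustration only; not the ManifoldCF implementation.
final class FetchCheckSketch {
    // Recorded ("old") and freshly computed ("new") version strings for one output.
    static final class OutputVersions {
        final String oldDocumentVersion;
        final String oldAuthorityName;
        final String oldTransformationVersion;
        final String oldOutputVersion;
        final String newTransformationVersion;
        final String newOutputVersion;

        OutputVersions(String oldDocumentVersion, String oldAuthorityName,
                       String oldTransformationVersion, String oldOutputVersion,
                       String newTransformationVersion, String newOutputVersion) {
            this.oldDocumentVersion = oldDocumentVersion;
            this.oldAuthorityName = oldAuthorityName;
            this.oldTransformationVersion = oldTransformationVersion;
            this.oldOutputVersion = oldOutputVersion;
            this.newTransformationVersion = newTransformationVersion;
            this.newOutputVersion = newOutputVersion;
        }
    }

    // Returns true if at least one output has a version mismatch, false if all agree.
    // A null old version (nothing recorded yet) never equals a non-null new version.
    static boolean needsFetch(List<OutputVersions> outputs, String newDocumentVersion, String newAuthorityName) {
        for (OutputVersions o : outputs) {
            boolean unchanged = Objects.equals(o.oldDocumentVersion, newDocumentVersion)
                && Objects.equals(o.oldAuthorityName, newAuthorityName)
                && Objects.equals(o.oldTransformationVersion, o.newTransformationVersion)
                && Objects.equals(o.oldOutputVersion, o.newOutputVersion);
            if (!unchanged) {
                return true;
            }
        }
        return false;
    }
}
```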
-
computePackedTransformationVersion
protected static java.lang.String computePackedTransformationVersion(IPipelineSpecification pipelineSpecification, int stage)
Compute a transformation version given a pipeline specification and starting output stage.
- Parameters:
pipelineSpecification - is the pipeline specification.
stage - is the stage number of the output stage.
- Returns:
- the transformation version string, which will be a composite of all the transformations applied.
-
packList
protected static void packList(java.lang.StringBuilder output, java.lang.String[] values, char delimiter)
-
pack
protected static void pack(java.lang.StringBuilder sb, java.lang.String value, char delim)
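The pack and packList helpers are listed here without descriptions. One plausible reading, given that computePackedTransformationVersion returns a composite string, is that they append delimiter-terminated, escaped values to a builder. The sketch below follows that reading; the escaping scheme (backslash before the delimiter or another backslash) and the leading element count in packList are assumptions, and PackSketch is a hypothetical class that only mirrors the parameter lists shown above.

```java
// Hypothetical illustration only; the behaviour of the real helpers is not documented on this page.
final class PackSketch {
    // Appends value to sb, escaping the delimiter and backslash, then appends the delimiter.
    static void pack(StringBuilder sb, String value, char delim) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == delim || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        sb.append(delim);
    }

    // Appends the number of values (assumed), then each value, all packed with the same delimiter.
    static void packList(StringBuilder output, String[] values, char delimiter) {
        pack(output, Integer.toString(values.length), delimiter);
        for (String value : values) {
            pack(output, value, delimiter);
        }
    }
}
```

Escaping matters because the packed string is compared against a stored version: without it, two different lists of values could collapse to the same string.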
-
documentRecord
public void documentRecord(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, long recordTime) throws ManifoldCFException
Record a document version, but don't ingest it. The purpose of this method is to update document version information without reindexing the document.
- Specified by:
documentRecord in interface IIncrementalIngester
- Parameters:
pipelineSpecificationBasic - is the basic pipeline specification needed.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
componentHash - is the hashed component identifier, if any.
documentVersion - is the document version.
recordTime - is the time at which the recording took place, in milliseconds since epoch.
- Throws:
ManifoldCFException
-
documentNoData
public void documentNoData(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, java.lang.String authorityName, long recordTime, IOutputActivity activities) throws ManifoldCFException, ServiceInterruption
Remove a document from specified indexes, just as if an empty document was indexed, and record the necessary version information. This method is conceptually similar to documentIngest(), but does not actually take a document or allow it to be transformed. If there is a document already indexed, it is removed from the index.
- Specified by:
documentNoData in interface IIncrementalIngester
- Parameters:
pipelineSpecificationWithVersions - is the pipeline specification with already-fetched output versioning information.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
componentHash - is the hashed component identifier, if any.
documentVersion - is the document version.
authorityName - is the name of the authority associated with the document, if any.
recordTime - is the time at which the recording took place, in milliseconds since epoch.
activities - is an object providing a set of methods that the implementer can use to perform the operation.
- Throws:
ManifoldCFException
ServiceInterruption
-
documentIngest
public boolean documentIngest(IPipelineSpecificationWithVersions pipelineSpecificationWithVersions, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, java.lang.String documentVersion, java.lang.String authorityName, RepositoryDocument data, long ingestTime, java.lang.String documentURI, IOutputActivity activities) throws ManifoldCFException, ServiceInterruption, java.io.IOException
Ingest a document. This ingests the document, and notes it. If this is a repeat ingestion of the document, this method also REMOVES ALL OLD METADATA. When complete, the index will contain only the metadata described by the RepositoryDocument object passed to this method. ServiceInterruption is thrown if the document ingestion must be rescheduled.
- Specified by:
documentIngest in interface IIncrementalIngester
- Parameters:
pipelineSpecificationWithVersions - is the pipeline specification with already-fetched output versioning information.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
componentHash - is the hashed component identifier, if any.
documentVersion - is the document version.
authorityName - is the name of the authority associated with the document, if any.
data - is the document data. The data is closed after ingestion is complete.
ingestTime - is the time at which the ingestion took place, in milliseconds since epoch.
documentURI - is the URI of the document, which will be used as the key of the document in the index.
activities - is an object providing a set of methods that the implementer can use to perform the operation.
- Returns:
- true if the ingest was ok, false if the ingest is illegal (and should not be repeated).
- Throws:
java.io.IOException - only if data stream throws an IOException.
ManifoldCFException
ServiceInterruption
-
documentRemove
public void documentRemove(IPipelineConnections pipelineConnections, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String componentHash, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Remove a document component from the search engine index.- Specified by:
documentRemovein interfaceIIncrementalIngester- Parameters:
pipelineConnections- is the pipeline specification.identifierClass- is the name of the space in which the identifier hash should be interpreted.identifierHash- is the hash of the id of the document.componentHash- is the hashed component identifier, if any.activities- is the object to use to log the details of the ingestion attempt. May be null.- Throws:
ManifoldCFExceptionServiceInterruption
-
extractOutputConnectionNames
protected static java.lang.String[] extractOutputConnectionNames(IPipelineSpecificationBasic pipelineSpecificationBasic)
-
documentCheckMultiple
public void documentCheckMultiple(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, long checkTime) throws ManifoldCFException
Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).- Specified by:
documentCheckMultiplein interfaceIIncrementalIngester- Parameters:
pipelineSpecificationBasic- is a pipeline specification.identifierClasses- are the names of the spaces in which the identifier hashes should be interpreted.identifierHashes- are the set of document identifier hashes.checkTime- is the time at which the check took place, in milliseconds since epoch.- Throws:
ManifoldCFException
-
documentCheck
public void documentCheck(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash, long checkTime) throws ManifoldCFException
Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).- Specified by:
documentCheckin interfaceIIncrementalIngester- Parameters:
pipelineSpecificationBasic- is a basic pipeline specification.identifierClass- is the name of the space in which the identifier hash should be interpreted.identifierHash- is the hashed document identifier.checkTime- is the time at which the check took place, in milliseconds since epoch.- Throws:
ManifoldCFException
-
maxClausesUpdateRowIds
protected int maxClausesUpdateRowIds()
Calculate the maximum number of clauses to use when updating row ids.
-
updateRowIds
protected void updateRowIds(java.util.List<java.lang.Long> list, long checkTime) throws ManifoldCFException
Update a chunk of row ids.- Throws:
ManifoldCFException
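The pairing of a maxClauses method with a "chunk" method suggests that long id lists are processed in bounded batches. The following is a minimal sketch of that pattern, assuming a fixed cap; RowIdChunkSketch, MAX_CLAUSES and chunk are hypothetical names, and the actual cap and the SQL issued per chunk are not shown on this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating chunked processing; not a ManifoldCF class. */
final class RowIdChunkSketch {
    /** Assumed cap for illustration; the real value would come from maxClausesUpdateRowIds(). */
    static final int MAX_CLAUSES = 3;

    /** Split row ids into consecutive chunks of at most maxClauses entries. */
    static List<List<Long>> chunk(List<Long> rowIds, int maxClauses) {
        if (maxClauses <= 0) {
            throw new IllegalArgumentException("maxClauses must be positive");
        }
        List<List<Long>> chunks = new ArrayList<>();
        for (int start = 0; start < rowIds.size(); start += maxClauses) {
            int end = Math.min(start + maxClauses, rowIds.size());
            chunks.add(new ArrayList<>(rowIds.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Long> ids = List.of(1L, 2L, 3L, 4L, 5L, 6L, 7L);
        for (List<Long> c : chunk(ids, MAX_CLAUSES)) {
            // Each chunk stands in for one call such as updateRowIds(c, checkTime).
            System.out.println(c);
        }
    }
}
```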
-
documentDeleteMultiple
public void documentDeleteMultiple(IPipelineConnections[] pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Delete multiple documents from the search engine index.- Specified by:
documentDeleteMultiplein interfaceIIncrementalIngester- Parameters:
pipelineConnections- are the pipeline specifications associated with the documents.identifierClasses- are the names of the spaces in which the identifier hashes should be interpreted.identifierHashes- is the array of document identifier hashes for the documents.activities- is the object to use to log the details of the ingestion attempt. May be null.- Throws:
ManifoldCFExceptionServiceInterruption
-
createURILockName
protected static java.lang.String createURILockName(java.lang.String outputConnectionName, java.lang.String uriHash)
-
documentDeleteMultiple
public void documentDeleteMultiple(IPipelineConnections pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity originalActivities) throws ManifoldCFException, ServiceInterruption
Delete multiple documents from the search engine index.- Specified by:
documentDeleteMultiplein interfaceIIncrementalIngester- Parameters:
pipelineConnections- is the pipeline specification.identifierClasses- are the names of the spaces in which the identifier hashes should be interpreted.identifierHashes- is the array of document identifier hashes for the documents.originalActivities- is the object to use to log the details of the ingestion attempt. May be null.- Throws:
ManifoldCFExceptionServiceInterruption
-
documentRemoveMultiple
public void documentRemoveMultiple(IPipelineConnections pipelineConnections, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, java.lang.String componentHash, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Remove multiple document components from the search engine index.- Specified by:
documentRemoveMultiplein interfaceIIncrementalIngester- Parameters:
pipelineConnections- is the pipeline specification.identifierClasses- are the names of the spaces in which the identifier hashes should be interpreted.identifierHashes- are the hashes of the ids of the documents.componentHash- is the hashed component identifier, if any.activities- is the object to use to log the details of the ingestion attempt. May be null.- Throws:
ManifoldCFExceptionServiceInterruption
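The remove methods take a componentHash while the delete methods do not. One way to read that difference, offered here only as an assumption, is that removal targets a single component of a document whereas deletion drops the document with all of its components. ComponentRemovalSketch and its methods are hypothetical and model only that reading.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of per-component bookkeeping; not a ManifoldCF class. */
final class ComponentRemovalSketch {
    // docKey -> component hashes recorded for it (null stands for "no component")
    private final Map<String, Set<String>> components = new HashMap<>();

    void add(String docKey, String componentHash) {
        components.computeIfAbsent(docKey, k -> new HashSet<>()).add(componentHash);
    }

    /** Assumed "remove" semantics: drop one component, keep the others. */
    void removeComponent(String docKey, String componentHash) {
        Set<String> recorded = components.get(docKey);
        if (recorded != null) {
            recorded.remove(componentHash);
            if (recorded.isEmpty()) {
                components.remove(docKey);
            }
        }
    }

    /** Assumed "delete" semantics: drop the document and every component. */
    void deleteDocument(String docKey) {
        components.remove(docKey);
    }

    int componentCount(String docKey) {
        Set<String> recorded = components.get(docKey);
        return recorded == null ? 0 : recorded.size();
    }
}
```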
-
maxClausesRowIdsForURIs
protected int maxClausesRowIdsForURIs(java.lang.String outputConnectionName)
Calculate the maximum number of clauses to use when finding row ids for URIs.
-
findRowIdsForURIs
protected void findRowIdsForURIs(java.lang.String outputConnectionName, java.util.Set<java.lang.Long> rowIDSet, java.util.Set<java.lang.String> uris, java.util.List<java.lang.String> hashParamValues) throws ManifoldCFException
Given values and parameters corresponding to a set of hash values, add the corresponding table row ids to the output set (rowIDSet).- Throws:
ManifoldCFException
-
maxClausesRowIdsForDocIds
protected int maxClausesRowIdsForDocIds(java.lang.String outputConnectionName)
Calculate the maximum number of doc ids we should use.
-
maxClausesRowIdsForDocIds
protected int maxClausesRowIdsForDocIds(java.lang.String outputConnectionName, java.lang.String componentHash)
Calculate the maximum number of doc ids we should use.
-
maxClausesRowIdsForDocIds
protected int maxClausesRowIdsForDocIds(java.lang.String[] outputConnectionNames)
Calculate the maximum number of doc ids we should use.
-
findRowIdsForDocIds
protected void findRowIdsForDocIds(java.lang.String outputConnectionName, java.util.Set<java.lang.Long> rowIDSet, java.util.List<java.lang.String> paramValues) throws ManifoldCFException
Given values and parameters corresponding to a set of hash values, add the corresponding table row ids to the output set (rowIDSet).- Throws:
ManifoldCFException
-
findRowIdsForDocIds
protected void findRowIdsForDocIds(java.lang.String outputConnectionName, java.util.Set<java.lang.Long> rowIDSet, java.util.List<java.lang.String> paramValues, java.lang.String componentHash) throws ManifoldCFException
Given values and parameters corresponding to a set of hash values, add the corresponding table row ids to the output set (rowIDSet).- Throws:
ManifoldCFException
-
findRowIdsForDocIds
protected void findRowIdsForDocIds(java.lang.String[] outputConnectionNames, java.util.Set<java.lang.Long> rowIDSet, java.util.List<java.lang.String> paramValues) throws ManifoldCFException
Given values and parameters corresponding to a set of hash values, add the corresponding table row ids to the output set (rowIDSet).- Throws:
ManifoldCFException
-
maxClausesDeleteRowIds
protected int maxClausesDeleteRowIds()
Calculate the maximum number of clauses.
-
deleteRowIds
protected void deleteRowIds(java.util.List<java.lang.Long> list) throws ManifoldCFException
Delete a chunk of row ids.- Throws:
ManifoldCFException
-
documentDelete
public void documentDelete(IPipelineConnections pipelineConnections, java.lang.String identifierClass, java.lang.String identifierHash, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Delete a document from the search engine index.- Specified by:
documentDeletein interfaceIIncrementalIngester- Parameters:
pipelineConnections- is the pipeline specification.identifierClass- is the name of the space in which the identifier hash should be interpreted.identifierHash- is the hash of the id of the document.activities- is the object to use to log the details of the ingestion attempt. May be null.- Throws:
ManifoldCFExceptionServiceInterruption
-
getDocumentURIMultiple
protected java.util.List<IncrementalIngester.DeleteInfo> getDocumentURIMultiple(java.lang.String outputConnectionName, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) throws ManifoldCFException
Find out under which URIs a set of documents is currently ingested.- Parameters:
identifierHashes- is the array of document id's to check.- Returns:
- the list of current document URIs. Null is returned for identifiers that don't exist in the index.
- Throws:
ManifoldCFException
-
getDocumentURIMultiple
protected java.util.List<IncrementalIngester.DeleteInfo> getDocumentURIMultiple(java.lang.String outputConnectionName, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, java.lang.String componentHash) throws ManifoldCFException
Find out under which URIs a set of documents is currently ingested.- Parameters:
outputConnectionName- is the output connection name.identifierClasses- is the array of identifier classes.identifierHashes- is the array of document id's to check.componentHash- is the component hash to check.- Returns:
- the list of current document URIs. Null is returned for identifiers that don't exist in the index.
- Throws:
ManifoldCFException
-
getPipelineDocumentIngestDataMultiple
public void getPipelineDocumentIngestDataMultiple(IngestStatuses rval, IPipelineSpecificationBasic[] pipelineSpecificationBasics, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) throws ManifoldCFException
Look up ingestion data for a set of documents.- Specified by:
getPipelineDocumentIngestDataMultiplein interfaceIIncrementalIngester- Parameters:
rval- is a map of output key to document data, in no particular order, which will be loaded with all matching results.pipelineSpecificationBasics- are the pipeline specifications corresponding to the identifier classes and hashes.identifierClasses- are the names of the spaces in which the identifier hashes should be interpreted.identifierHashes- is the array of document identifier hashes to look up.- Throws:
ManifoldCFException
-
getPipelineDocumentIngestDataMultiple
public void getPipelineDocumentIngestDataMultiple(IngestStatuses rval, IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) throws ManifoldCFException
Look up ingestion data for a SET of documents.- Specified by:
getPipelineDocumentIngestDataMultiplein interfaceIIncrementalIngester- Parameters:
rval- is a map of output key to document data, in no particular order, which will be loaded with all matching results.pipelineSpecificationBasic- is the pipeline specification for all documents.identifierClasses- are the names of the spaces in which the identifier hashes should be interpreted.identifierHashes- is the array of document identifier hashes to look up.- Throws:
ManifoldCFException
-
getPipelineDocumentIngestDataChunk
protected void getPipelineDocumentIngestDataChunk(IngestStatuses rval, java.util.Map<java.lang.String,java.lang.Integer> map, java.lang.String[] outputConnectionNames, java.util.List<java.lang.String> list, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) throws ManifoldCFException
Get a chunk of document ingest data records.- Parameters:
rval- is the document ingest status array where the data should be put.map- is the map from id to index.list- is the parameter list for the query.- Throws:
ManifoldCFException
-
getPipelineDocumentIngestData
public void getPipelineDocumentIngestData(IngestStatuses rval, IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash) throws ManifoldCFException
Look up ingestion data for a document.- Specified by:
getPipelineDocumentIngestDatain interfaceIIncrementalIngester- Parameters:
rval- is a map of output key to document data, in no particular order, which will be loaded with all matching results.pipelineSpecificationBasic- is the pipeline specification for the document.identifierClass- is the name of the space in which the identifier hash should be interpreted.identifierHash- is the hash of the id of the document.- Throws:
ManifoldCFException
-
getDocumentUpdateIntervalMultiple
public long[] getDocumentUpdateIntervalMultiple(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes) throws ManifoldCFException
Calculate the average time interval between changes for a document. This is based on the data gathered for the document.- Specified by:
getDocumentUpdateIntervalMultiplein interfaceIIncrementalIngester- Parameters:
pipelineSpecificationBasic- is the basic pipeline specification.identifierClasses- are the names of the spaces in which the identifier hashes should be interpreted.identifierHashes- are the hashes of the ids of the documents.- Returns:
- the number of milliseconds between changes, or 0 if this cannot be calculated.
- Throws:
ManifoldCFException
-
getDocumentUpdateInterval
public long getDocumentUpdateInterval(IPipelineSpecificationBasic pipelineSpecificationBasic, java.lang.String identifierClass, java.lang.String identifierHash) throws ManifoldCFException
Calculate the average time interval between changes for a document. This is based on the data gathered for the document.- Specified by:
getDocumentUpdateIntervalin interfaceIIncrementalIngester- Parameters:
pipelineSpecificationBasic- is the basic pipeline specification.identifierClass- is the name of the space in which the identifier hash should be interpreted.identifierHash- is the hash of the id of the document.- Returns:
- the number of milliseconds between changes, or 0 if this cannot be calculated.
- Throws:
ManifoldCFException
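The ingeststatus table lists changecount, firstingest and lastingest columns, which is enough to estimate an average interval between changes. The sketch below assumes the simple formula (lastingest - firstingest) / changecount and returns 0 when that cannot be evaluated; UpdateIntervalSketch and averageInterval are hypothetical names, and the formula actually used by the class is not stated on this page.

```java
/** Hypothetical calculation of an average change interval; not a ManifoldCF class. */
final class UpdateIntervalSketch {
    /**
     * Average number of milliseconds between changes, or 0 if it cannot be calculated
     * (no recorded changes, or no time elapsed between first and last ingestion).
     */
    static long averageInterval(long firstIngest, long lastIngest, long changeCount) {
        if (changeCount <= 0 || lastIngest <= firstIngest) {
            return 0L;
        }
        return (lastIngest - firstIngest) / changeCount;
    }

    public static void main(String[] args) {
        // Three changes spread over 6000 ms
        System.out.println(averageInterval(1000L, 7000L, 3L));
    }
}
```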
-
maxClauseGetIntervals
protected int maxClauseGetIntervals(java.lang.String[] outputConnectionNames)
Calculate the maximum number of clauses to use when querying for intervals.
-
getIntervals
protected void getIntervals(long[] rval, java.lang.String[] outputConnectionNames, java.util.List<java.lang.String> list, java.util.Map<java.lang.String,java.lang.Integer> returnMap) throws ManifoldCFException
Query for and calculate the interval for a bunch of hashcodes.- Parameters:
rval- is the array to stuff calculated return values into.list- is the list of parameters.returnMap- is a mapping from document id to rval index.- Throws:
ManifoldCFException
-
resetOutputConnection
public void resetOutputConnection(IOutputConnection outputConnection) throws ManifoldCFException
Reset all documents belonging to a specific output connection, because we have learned that the target system has been reconfigured. This will force all such documents to be reindexed the next time they are checked.- Specified by:
resetOutputConnectionin interfaceIIncrementalIngester- Parameters:
outputConnection- is the output connection associated with this action.- Throws:
ManifoldCFException
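A reset that forces reindexing can be modeled by blanking the recorded version for every document of a connection, so that the next version comparison fails. This is a sketch under that assumption only; ResetConnectionSketch, its nested maps, and the use of an empty string as the "forgotten" version are hypothetical and do not reflect the actual table update.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of reset-then-recheck; not a ManifoldCF class. */
final class ResetConnectionSketch {
    // connection name -> (document key -> last recorded version)
    private final Map<String, Map<String, String>> versions = new HashMap<>();

    void noteIngest(String connection, String docKey, String version) {
        versions.computeIfAbsent(connection, k -> new HashMap<>()).put(docKey, version);
    }

    /** Blank every recorded version for the connection; the entries themselves are kept. */
    void reset(String connection) {
        Map<String, String> recorded = versions.get(connection);
        if (recorded != null) {
            recorded.replaceAll((docKey, version) -> "");
        }
    }

    /** A document needs ingestion when no matching version is on record. */
    boolean needsIngest(String connection, String docKey, String currentVersion) {
        Map<String, String> recorded = versions.get(connection);
        String last = (recorded == null) ? null : recorded.get(docKey);
        return last == null || !last.equals(currentVersion);
    }
}
```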
-
removeOutputConnection
public void removeOutputConnection(IOutputConnection outputConnection) throws ManifoldCFException
Remove all knowledge of an output index from the system. This is appropriate when the output index no longer exists and you wish to delete the associated job.- Specified by:
removeOutputConnectionin interfaceIIncrementalIngester- Parameters:
outputConnection- is the output connection associated with this action.- Throws:
ManifoldCFException
-
noteDocumentIngest
protected void noteDocumentIngest(java.lang.String outputConnectionName, java.lang.String docKey, java.lang.String componentHash, java.lang.String documentVersion, java.lang.String transformationVersion, java.lang.String outputVersion, java.lang.String authorityNameString, long ingestTime, java.lang.String documentURI, java.lang.String documentURIHash) throws ManifoldCFException
Note the ingestion of a document, or the "update" of a document.- Parameters:
outputConnectionName- is the name of the output connection.docKey- is the key string describing the document.componentHash- is the component identifier hash for this document.documentVersion- is a string describing the new version of the document.transformationVersion- is a string describing all current transformations for the document.outputVersion- is the version string calculated for the output connection.authorityNameString- is the name of the relevant authority connection.ingestTime- is the time at which the ingestion took place, in milliseconds since epoch.documentURI- is the uri the document can be accessed at, or null (which signals that we are to record the version, but no ingestion took place).documentURIHash- is the hash of the document uri.- Throws:
ManifoldCFException
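Noting an ingestion amounts to an insert-or-update on the ingeststatus record. The sketch below uses field names taken from the table columns (lastversion, changecount, firstingest, lastingest); the update rule itself is an assumption: a new row starts with a change count of 1, and a repeat ingestion increments the count and moves lastingest forward. IngestStatusSketch and Row are hypothetical types.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory version of the ingeststatus bookkeeping; not a ManifoldCF class. */
final class IngestStatusSketch {
    static final class Row {
        String lastVersion;
        long changeCount;
        long firstIngest;
        long lastIngest;
    }

    // (connection name, document key) -> status row
    private final Map<List<String>, Row> rows = new HashMap<>();

    /** Insert a row on first ingestion, otherwise update the existing row. */
    Row noteIngest(String connection, String docKey, String version, long ingestTime) {
        List<String> key = List.of(connection, docKey);
        Row row = rows.get(key);
        if (row == null) {
            row = new Row();
            row.firstIngest = ingestTime; // set once, never changed afterwards
            row.changeCount = 1;
            rows.put(key, row);
        } else {
            row.changeCount++;
        }
        row.lastVersion = version;
        row.lastIngest = ingestTime;
        return row;
    }
}
```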
-
maxClauseDocumentURIChunk
protected int maxClauseDocumentURIChunk(java.lang.String outputConnectionName)
Calculate how many clauses to use at a time.
-
getDocumentURIChunk
protected void getDocumentURIChunk(java.util.List<IncrementalIngester.DeleteInfo> rval, java.lang.String outputConnectionName, java.util.List<java.lang.String> list) throws ManifoldCFException
Get a chunk of document uris.- Parameters:
rval- is the list where the URI records should be put.list- are the doc keys for the query.- Throws:
ManifoldCFException
-
maxClauseDocumentURIChunk
protected int maxClauseDocumentURIChunk(java.lang.String outputConnectionName, java.lang.String componentHash)
Calculate how many clauses to use at a time.
-
getDocumentURIChunk
protected void getDocumentURIChunk(java.util.List<IncrementalIngester.DeleteInfo> rval, java.lang.String outputConnectionName, java.util.List<java.lang.String> list, java.lang.String componentHash) throws ManifoldCFException
Get a chunk of document uris.- Parameters:
rval- is the list where the URI records should be put.list- are the doc keys for the query.componentHash- is the component hash, if any, for the query.- Throws:
ManifoldCFException
-
maxClauseDocumentIngestDataChunk
protected int maxClauseDocumentIngestDataChunk(java.lang.String outputConnectionName)
Calculate the maximum number of clauses to use per query.
-
maxClausePipelineDocumentIngestDataChunk
protected int maxClausePipelineDocumentIngestDataChunk(java.lang.String[] outputConnectionNames)
Calculate the maximum number of clauses to use per query.
-
removeDocument
protected void removeDocument(IOutputConnection connection, java.lang.String documentURI, java.lang.String outputDescription, IOutputRemoveActivity activities) throws ManifoldCFException, ServiceInterruption
Remove document, using the specified output connection, via the standard pool.
-
makeKey
protected static java.lang.String makeKey(java.lang.String documentClass, java.lang.String documentHash)
Make a key from a document class and a hash.
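A key built from a document class and a hash could be as simple as joining the two with a separator. The sketch below assumes a ":" separator, which is not confirmed by this page; DocKeySketch is a hypothetical class that only illustrates the idea of folding the identifier space and the identifier hash into the single dockey column.

```java
/** Hypothetical key builder; the ":" separator is an assumption, not taken from the source. */
final class DocKeySketch {
    static String makeKey(String documentClass, String documentHash) {
        return documentClass + ":" + documentHash;
    }

    public static void main(String[] args) {
        // Parallel arrays, as used by the *Multiple methods, map to one key per index.
        String[] identifierClasses = {"jobA", "jobB"};
        String[] identifierHashes = {"0a1b2c", "3d4e5f"};
        for (int i = 0; i < identifierHashes.length; i++) {
            System.out.println(makeKey(identifierClasses[i], identifierHashes[i]));
        }
    }
}
```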
-
computeLockArray
protected static java.lang.String[] computeLockArray(java.lang.String documentURIHash, java.lang.String oldURIHash, java.lang.String outputConnectionName)
-
-