Apache > ManifoldCF > Release Documentation
 

Programmatic Operation


A certain subset of ManifoldCF users want to think of ManifoldCF as an engine that they can poke from whatever other system they are developing. While ManifoldCF is not precisely a document indexing engine per se, it can certainly be controlled programmatically. Right now, there are three principal ways of achieving this control.

Control by Servlet API

ManifoldCF provides a servlet-based JSON API that gives you the complete ability to define connections and jobs, and to control job execution. (You can read about JSON at json.org.) The API is designed to be RESTful in character: it makes full use of the HTTP verbs GET, PUT, POST, and DELETE, and represents objects as URLs.

URL format

The basic format of the JSON servlet resource URLs is as follows:

http[s]://<server_and_port>/mcf-api-service/json/<resource>

The servlet ignores request data, except when the PUT or POST verb is used. In that case, the request data is presumed to be a JSON object. The servlet responds either with an error response code (either 400 or 500) with an appropriate explanatory message, or with a 200 (OK), 201 (CREATED), 401 (UNAUTHORIZED), or 404 (NOT FOUND) response code along with a response JSON object.

JSON equivalents for ManifoldCF

ManifoldCF treats certain JSON forms as equivalent, for the purposes of readability. For example, the array form "foo" : [ { ... } ] is treated equivalently to "foo" : { ... } whenever there is only one array element. This gives a coder some flexibility in how they encode JSON in requests. Please also be aware that similar compressions will occur in the JSON responses from the API servlet, and your code must be able to deal with this possibility. The following table describes some of the equivalences:

Form | Equivalent
[ { ... } ] | { ... }
"foo" : { "_value_" : "bar" } | "foo" : "bar"
"_children_" : [ "foo":{ ... }, "foo":{ ... } ] | "foo" : [ { ... }, { ... } ]
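Because of these compressions, client code has to tolerate either spelling of the same object when reading responses. One way to cope (a sketch, not part of ManifoldCF) is to normalize responses back to a canonical form, unwrapping single-element arrays and "_value_" wrappers:

```python
import json

def normalize(node):
    """Recursively undo the API's readability compressions (a sketch):
    - a single-element array [x] is treated like x, so unwrap it;
    - {"_value_": v} collapses to the bare value v."""
    if isinstance(node, list):
        items = [normalize(x) for x in node]
        return items[0] if len(items) == 1 else items
    if isinstance(node, dict):
        if set(node) == {"_value_"}:
            return normalize(node["_value_"])
        return {k: normalize(v) for k, v in node.items()}
    return node

# Both spellings of the same object normalize identically:
a = json.loads('{"foo":[{"_value_":"bar"}]}')
b = json.loads('{"foo":"bar"}')
# normalize(a) == normalize(b) == {"foo": "bar"}
```

Note that this sketch picks one canonical direction (unwrapping); a client that always expects arrays could equally wrap single objects instead.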

Resources and commands

The actual available resources and commands are as follows:

Resource | Verb | What it does | Input format/query args | Output format
LOGIN | POST | Log in the specified user | {"userID":<user_name>, "password":<password>} | {}
authorizationdomains | GET | List all registered authorization domains | N/A | {"authorizationdomain":[<list_of_authorization_domain_objects>]} OR {"error":<error_text>}
outputconnectors | GET | List all registered output connectors | N/A | {"outputconnector":[<list_of_output_connector_objects>]} OR {"error":<error_text>}
transformationconnectors | GET | List all registered transformation connectors | N/A | {"transformationconnector":[<list_of_transformation_connector_objects>]} OR {"error":<error_text>}
mappingconnectors | GET | List all registered mapping connectors | N/A | {"mappingconnector":[<list_of_mapping_connector_objects>]} OR {"error":<error_text>}
authorityconnectors | GET | List all registered authority connectors | N/A | {"authorityconnector":[<list_of_authority_connector_objects>]} OR {"error":<error_text>}
repositoryconnectors | GET | List all registered repository connectors | N/A | {"repositoryconnector":[<list_of_repository_connector_objects>]} OR {"error":<error_text>}
notificationconnectors | GET | List all registered notification connectors | N/A | {"notificationconnector":[<list_of_notification_connector_objects>]} OR {"error":<error_text>}
authoritygroups | GET | List all authority groups | N/A | {"authoritygroup":[<list_of_authority_group_objects>]} OR {"error":<error_text>}
authoritygroups/<encoded_group_name> | GET | Get a specific authority group | N/A | {"authoritygroup":<authority_group_object>} OR { } OR {"error":<error_text>}
authoritygroups/<encoded_group_name> | PUT | Save or create an authority group | {"authoritygroup":<authority_group_object>} | { } OR {"error":<error_text>}
authoritygroups/<encoded_group_name> | DELETE | Delete an authority group | N/A | { } OR {"error":<error_text>}
outputconnections | GET | List all output connections | N/A | {"outputconnection":[<list_of_output_connection_objects>]} OR {"error":<error_text>}
outputconnections/<encoded_connection_name> | GET | Get a specific output connection | N/A | {"outputconnection":<output_connection_object>} OR { } OR {"error":<error_text>}
outputconnections/<encoded_connection_name> | PUT | Save or create an output connection | {"outputconnection":<output_connection_object>} | { } OR {"error":<error_text>}
outputconnections/<encoded_connection_name> | DELETE | Delete an output connection | N/A | { } OR {"error":<error_text>}
status/outputconnections/<encoded_connection_name> | GET | Check the status of an output connection | N/A | {"check_result":<message>} OR {"error":<error_text>}
info/outputconnections/<encoded_connection_name>/<connector_specific_resource> | GET | Retrieve arbitrary connector-specific resource | N/A | <response_data> OR {"error":<error_text>} OR {"service_interruption":<error_text>}
transformationconnections | GET | List all transformation connections | N/A | {"transformationconnection":[<list_of_transformation_connection_objects>]} OR {"error":<error_text>}
transformationconnections/<encoded_connection_name> | GET | Get a specific transformation connection | N/A | {"transformationconnection":<transformation_connection_object>} OR { } OR {"error":<error_text>}
transformationconnections/<encoded_connection_name> | PUT | Save or create a transformation connection | {"transformationconnection":<transformation_connection_object>} | { } OR {"error":<error_text>}
transformationconnections/<encoded_connection_name> | DELETE | Delete a transformation connection | N/A | { } OR {"error":<error_text>}
status/transformationconnections/<encoded_connection_name> | GET | Check the status of a transformation connection | N/A | {"check_result":<message>} OR {"error":<error_text>}
info/transformationconnections/<encoded_connection_name>/<connector_specific_resource> | GET | Retrieve arbitrary connector-specific resource | N/A | <response_data> OR {"error":<error_text>} OR {"service_interruption":<error_text>}
clearversions/<encoded_output_connection_name> | PUT | Forget previous indexed document versions | N/A | { } OR {"error":<error_text>}
clearrecords/<encoded_output_connection_name> | PUT | Remove all previous indexing records | N/A | { } OR {"error":<error_text>}
mappingconnections | GET | List all mapping connections | N/A | {"mappingconnection":[<list_of_mapping_connection_objects>]} OR {"error":<error_text>}
mappingconnections/<encoded_connection_name> | GET | Get a specific mapping connection | N/A | {"mappingconnection":<mapping_connection_object>} OR { } OR {"error":<error_text>}
mappingconnections/<encoded_connection_name> | PUT | Save or create a mapping connection | {"mappingconnection":<mapping_connection_object>} | { } OR {"error":<error_text>}
mappingconnections/<encoded_connection_name> | DELETE | Delete a mapping connection | N/A | { } OR {"error":<error_text>}
status/mappingconnections/<encoded_connection_name> | GET | Check the status of a mapping connection | N/A | {"check_result":<message>} OR {"error":<error_text>}
authorityconnections | GET | List all authority connections | N/A | {"authorityconnection":[<list_of_authority_connection_objects>]} OR {"error":<error_text>}
authorityconnections/<encoded_connection_name> | GET | Get a specific authority connection | N/A | {"authorityconnection":<authority_connection_object>} OR { } OR {"error":<error_text>}
authorityconnections/<encoded_connection_name> | PUT | Save or create an authority connection | {"authorityconnection":<authority_connection_object>} | { } OR {"error":<error_text>}
authorityconnections/<encoded_connection_name> | DELETE | Delete an authority connection | N/A | { } OR {"error":<error_text>}
status/authorityconnections/<encoded_connection_name> | GET | Check the status of an authority connection | N/A | {"check_result":<message>} OR {"error":<error_text>}
repositoryconnections | GET | List all repository connections | N/A | {"repositoryconnection":[<list_of_repository_connection_objects>]} OR {"error":<error_text>}
repositoryconnections/<encoded_connection_name> | GET | Get a specific repository connection | N/A | {"repositoryconnection":<repository_connection_object>} OR { } OR {"error":<error_text>}
repositoryconnections/<encoded_connection_name> | PUT | Save or create a repository connection | {"repositoryconnection":<repository_connection_object>} | { } OR {"error":<error_text>}
repositoryconnections/<encoded_connection_name> | DELETE | Delete a repository connection | N/A | { } OR {"error":<error_text>}
status/repositoryconnections/<encoded_connection_name> | GET | Check the status of a repository connection | N/A | {"check_result":<message>} OR {"error":<error_text>}
info/repositoryconnections/<encoded_connection_name>/<connector_specific_resource> | GET | Retrieve arbitrary connector-specific resource | N/A | <response_data> OR {"error":<error_text>} OR {"service_interruption":<error_text>}
notificationconnections | GET | List all notification connections | N/A | {"notificationconnection":[<list_of_notification_connection_objects>]} OR {"error":<error_text>}
notificationconnections/<encoded_connection_name> | GET | Get a specific notification connection | N/A | {"notificationconnection":<notification_connection_object>} OR { } OR {"error":<error_text>}
notificationconnections/<encoded_connection_name> | PUT | Save or create a notification connection | {"notificationconnection":<notification_connection_object>} | { } OR {"error":<error_text>}
notificationconnections/<encoded_connection_name> | DELETE | Delete a notification connection | N/A | { } OR {"error":<error_text>}
status/notificationconnections/<encoded_connection_name> | GET | Check the status of a notification connection | N/A | {"check_result":<message>} OR {"error":<error_text>}
info/notificationconnections/<encoded_connection_name>/<connector_specific_resource> | GET | Retrieve arbitrary connector-specific resource | N/A | <response_data> OR {"error":<error_text>} OR {"service_interruption":<error_text>}
clearhistory/<encoded_repository_connection_name> | PUT | Clear history linked with repository connection | N/A | <response_data> OR {"error":<error_text>} OR {"service_interruption":<error_text>}
jobs | GET | List all job definitions | N/A | {"job":[<list_of_job_objects>]} OR {"error":<error_text>}
jobs | POST | Create a job | {"job":<job_object>} | {"job_id":<job_identifier>} OR {"error":<error_text>}
jobs/<job_id> | GET | Get a specific job definition | N/A | {"job":<job_object>} OR { } OR {"error":<error_text>}
jobs/<job_id> | PUT | Save a job definition | {"job":<job_object>} | { } OR {"error":<error_text>}
jobs/<job_id> | DELETE | Delete a job definition | N/A | { } OR {"error":<error_text>}
jobstatuses | GET | List all jobs and their status | maxcount=<maximum_documents_to_count> | {"jobstatus":[<list_of_job_status_objects>]} OR {"error":<error_text>}
jobstatuses/<job_id> | GET | Get a specific job's status | maxcount=<maximum_documents_to_count> | {"jobstatus":<job_status_object>} OR { } OR {"error":<error_text>}
jobstatusesnocounts | GET | List all jobs and their status, returning '0' for all counts | N/A | {"jobstatus":[<list_of_job_status_objects>]} OR {"error":<error_text>}
jobstatusesnocounts/<job_id> | GET | Get a specific job's status, returning '0' for all counts | N/A | {"jobstatus":<job_status_object>} OR { } OR {"error":<error_text>}
start/<job_id> | PUT | Start a specified job manually | N/A | { } OR {"error":<error_text>}
startminimal/<job_id> | PUT | Start a specified job manually, minimal run requested | N/A | { } OR {"error":<error_text>}
abort/<job_id> | PUT | Abort a specified job | N/A | { } OR {"error":<error_text>}
restart/<job_id> | PUT | Stop and start a specified job | N/A | { } OR {"error":<error_text>}
restartminimal/<job_id> | PUT | Stop and start a specified job, minimal run requested | N/A | { } OR {"error":<error_text>}
pause/<job_id> | PUT | Pause a specified job | N/A | { } OR {"error":<error_text>}
resume/<job_id> | PUT | Resume a specified job | N/A | { } OR {"error":<error_text>}
reseed/<job_id> | PUT | Reset incremental seeding for a specified job | N/A | { } OR {"error":<error_text>}
repositoryconnectionhistory/<encoded_connection_name> | GET | Get a history report | <history_query_parameters> | {"row":[{"column":[{"name":<col_name>,"value":<col_value>}, ...]}, ...]} OR {"error":<error_text>}
repositoryconnectionquery/<encoded_connection_name> | GET | Get a queue report | <queue_query_parameters> | {"row":[{"column":[{"name":<col_name>,"value":<col_value>}, ...]}, ...]} OR {"error":<error_text>}
repositoryconnectionactivities/<encoded_connection_name> | GET | Get a list of legal activities for a connection | N/A | {"activity":[<activity_name>, ...]} OR {"error":<error_text>}
repositoryconnectionjobs/<encoded_connection_name> | GET | Get a list of jobs for a connection | N/A | {"job":[<list_of_job_objects>]} OR {"error":<error_text>}
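A thin client over this table needs nothing more than URL assembly plus JSON encoding and decoding. The sketch below uses only Python's standard library; the base URL (host, port, and context path) is an assumption and must match your deployment:

```python
import json
import urllib.request

# Assumed host/port for a local deployment; adjust as needed.
BASE = "http://localhost:8345/mcf-api-service/json"

def build_request(resource, verb="GET", payload=None):
    """Build an HTTP request for the JSON servlet.
    PUT/POST requests carry a JSON object as the request body."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(BASE + "/" + resource, data=data, method=verb)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req

def api(resource, verb="GET", payload=None):
    """Execute a call and decode the JSON response (empty body -> {}).
    urlopen raises on the 400/500 error responses."""
    with urllib.request.urlopen(build_request(resource, verb, payload)) as resp:
        body = resp.read()
        return json.loads(body) if body else {}

# Usage against a live server:
#   api("LOGIN", "POST", {"userID": "admin", "password": "admin"})
#   api("jobstatuses")
```

Remember that the responses are subject to the JSON compressions described earlier, so decode defensively.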

History query parameters

The history query parameters and their meanings are as follows:

Parameter | Report type | Multivalued? | Meaning
report | All | No | The kind of history report desired; legal values are "simple", "maxactivity", "maxbandwidth", and "result"; defaults to "simple"
starttime | All | No | Starting time in ms since epoch; defaults to "0"
endtime | All | No | Ending time in ms since epoch; defaults to now
activity | All | Yes | Which activities you want to see
entitymatch | All | No | Regular expression matching entity identifier; defaults to ""
entitymatch_insensitive | All | No | Case-insensitive version of entitymatch
resultcodematch | All | No | Regular expression matching result code; defaults to ""
resultcodematch_insensitive | All | No | Case-insensitive version of resultcodematch
sortcolumn | All | Yes | Result column to sort the result by
sortcolumn_direction | All | Yes | Direction to sort the corresponding column ("ascending" or "descending")
startrow | All | No | Starting row in resultset to return; defaults to 0
rowcount | All | No | Maximum number of rows to return; defaults to 20
idbucket | maxactivity, maxbandwidth, result | No | Regular expression selecting which part of the entity identifier to use as an aggregation key; defaults to "()"
idbucket_insensitive | maxactivity, maxbandwidth, result | No | Case-insensitive version of idbucket
resultcodebucket | result | No | Regular expression selecting which part of the result code to use as an aggregation key; defaults to "(.*)"
resultcodebucket_insensitive | result | No | Case-insensitive version of resultcodebucket
interval | maxactivity, maxbandwidth | No | Size of window in milliseconds for assessing rate; defaults to 300000

Each report type has different return columns, as listed below:

Report type | Return columns
simple | starttime, resultcode, resultdesc, identifier, activity, bytes, elapsedtime
maxactivity | starttime, endtime, activitycount, idbucket
maxbandwidth | starttime, endtime, bytecount, idbucket
result | idbucket, resultcodebucket, eventcount
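In practice, a history report is just a GET against repositoryconnectionhistory/<encoded_connection_name> with these parameters in the query string; multivalued parameters are simply repeated. A sketch (the connection name and activity names below are invented):

```python
from urllib.parse import urlencode

def history_query(params):
    """Encode history report parameters as a query string. Multivalued
    parameters (e.g. "activity", "sortcolumn") are passed as lists and
    repeated in the output (doseq=True)."""
    return urlencode(params, doseq=True)

# Hypothetical query: a simple report of two activities, first 10 rows.
qs = history_query({
    "report": "simple",
    "starttime": "0",
    "activity": ["fetch", "process"],  # multivalued: repeated in the URL
    "rowcount": "10",
})
url = "repositoryconnectionhistory/MyConnection?" + qs
```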

Queue query parameters

The queue query parameters and their meanings are as follows:

Parameter | Report type | Multivalued? | Meaning
report | All | No | The kind of queue report desired; legal values are "document" or "status"; defaults to "document"
now | All | No | The time in milliseconds since epoch to perform the queue assessment relative to; defaults to current time
idmatch | All | No | Regular expression matching document identifier; defaults to ""
idmatch_insensitive | All | No | Case-insensitive version of idmatch
statematch | All | Yes | State to match; valid values are "neverprocessed", "previouslyprocessed", "outofscope"
statusmatch | All | Yes | Status to match; valid values are "inactive", "processing", "expiring", "deleting", "readyforprocessing", "readyforexpiration", "waitingforprocessing", "waitingforexpiration", "waitingforever", and "hopcountexceeded"
sortcolumn | All | Yes | Result column to sort the result by
sortcolumn_direction | All | Yes | Direction to sort the corresponding column ("ascending" or "descending")
startrow | All | No | Starting row in resultset to return; defaults to 0
rowcount | All | No | Maximum number of rows to return; defaults to 20
idbucket | status | No | Regular expression selecting which part of the document identifier to use as an aggregation key; defaults to "()"
idbucket_insensitive | status | No | Case-insensitive version of idbucket

Each report type has different return columns, as listed below:

Report type | Return columns
document | identifier, job, state, status, scheduled, action, retrycount, retrylimit
status | idbucket, inactive, processing, expiring, deleting, processready, expireready, processwaiting, expirewaiting, waitingforever, hopcountexceeded

Authorization domain objects

The JSON fields an authorization domain object has are as follows:

Field | Meaning
"description" | The optional description of the authorization domain
"domain_name" | The internal name of the authorization domain, i.e. what is sent to the Authority Service

Output connector objects

The JSON fields an output connector object has are as follows:

Field | Meaning
"description" | The optional description of the connector
"class_name" | The class name of the class implementing the connector

Transformation connector objects

The JSON fields a transformation connector object has are as follows:

Field | Meaning
"description" | The optional description of the connector
"class_name" | The class name of the class implementing the connector

Mapping connector objects

The JSON fields a mapping connector object has are as follows:

Field | Meaning
"description" | The optional description of the connector
"class_name" | The class name of the class implementing the connector

Authority connector objects

The JSON fields an authority connector object has are as follows:

Field | Meaning
"description" | The optional description of the connector
"class_name" | The class name of the class implementing the connector

Repository connector objects

The JSON fields a repository connector object has are as follows:

Field | Meaning
"description" | The optional description of the connector
"class_name" | The class name of the class implementing the connector

Notification connector objects

The JSON fields a notification connector object has are as follows:

Field | Meaning
"description" | The optional description of the connector
"class_name" | The class name of the class implementing the connector

Authority group objects

Authority group names, when they are part of a URL, should be encoded as follows:

  1. All instances of '.' should be replaced by '..'.
  2. All instances of '/' should be replaced by '.+'.
  3. The URL should be encoded using standard URL utf-8-based %-encoding.
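The same three encoding steps apply to every encoded name in this document (connection names are encoded identically), so they are worth capturing once in code. A sketch:

```python
from urllib.parse import quote

def encode_name(name):
    """Encode an authority group or connection name for use in an API URL:
    1. '.' becomes '..'
    2. '/' becomes '.+'
    3. standard UTF-8-based %-encoding of the result."""
    name = name.replace(".", "..").replace("/", ".+")
    return quote(name, safe="")

# e.g. encode_name("My.Group/One") -> "My..Group.%2BOne"
```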

The JSON fields an authority group object has are as follows:

Field | Meaning
"name" | The unique name of the group
"description" | The description of the group

Output connection objects

Output connection names, when they are part of a URL, should be encoded as follows:

  1. All instances of '.' should be replaced by '..'.
  2. All instances of '/' should be replaced by '.+'.
  3. The URL should be encoded using standard URL utf-8-based %-encoding.

The JSON fields an output connection object has are as follows:

Field | Meaning
"name" | The unique name of the connection
"description" | The description of the connection
"class_name" | The java class name of the class implementing the connection
"max_connections" | The total number of outstanding connections allowed to exist at a time
"configuration" | The configuration object for the connection, which is specific to the connection class

Transformation connection objects

Transformation connection names, when they are part of a URL, should be encoded as follows:

  1. All instances of '.' should be replaced by '..'.
  2. All instances of '/' should be replaced by '.+'.
  3. The URL should be encoded using standard URL utf-8-based %-encoding.

The JSON fields a transformation connection object has are as follows:

Field | Meaning
"name" | The unique name of the connection
"description" | The description of the connection
"class_name" | The java class name of the class implementing the connection
"max_connections" | The total number of outstanding connections allowed to exist at a time
"configuration" | The configuration object for the connection, which is specific to the connection class

Mapping connection objects

Mapping connection names, when they are part of a URL, should be encoded as follows:

  1. All instances of '.' should be replaced by '..'.
  2. All instances of '/' should be replaced by '.+'.
  3. The URL should be encoded using standard URL utf-8-based %-encoding.

The JSON fields for a mapping connection object are as follows:

Field | Meaning
"name" | The unique name of the connection
"description" | The description of the connection
"class_name" | The java class name of the class implementing the connection
"max_connections" | The total number of outstanding connections allowed to exist at a time
"configuration" | The configuration object for the connection, which is specific to the connection class
"prerequisite" | The mapping connection prerequisite, if any

Authority connection objects

Authority connection names, when they are part of a URL, should be encoded as follows:

  1. All instances of '.' should be replaced by '..'.
  2. All instances of '/' should be replaced by '.+'.
  3. The URL should be encoded using standard URL utf-8-based %-encoding.

The JSON fields for an authority connection object are as follows:

Field | Meaning
"name" | The unique name of the connection
"description" | The description of the connection
"class_name" | The java class name of the class implementing the connection
"max_connections" | The total number of outstanding connections allowed to exist at a time
"configuration" | The configuration object for the connection, which is specific to the connection class
"prerequisite" | The mapping connection prerequisite, if any
"authdomain" | The authorization domain for the authority connection, if any
"authgroup" | The required authority group for the authority connection

Repository connection objects

Repository connection names, when they are part of a URL, should be encoded as follows:

  1. All instances of '.' should be replaced by '..'.
  2. All instances of '/' should be replaced by '.+'.
  3. The URL should be encoded using standard URL utf-8-based %-encoding.

The JSON fields for a repository connection object are as follows:

Field | Meaning
"name" | The unique name of the connection
"description" | The description of the connection
"class_name" | The java class name of the class implementing the connection
"max_connections" | The total number of outstanding connections allowed to exist at a time
"configuration" | The configuration object for the connection, which is specific to the connection class
"acl_authority" | The (optional) name of the authority group that will enforce security for this connection
"throttle" | An array of throttle objects, which control how quickly documents can be requested from this connection

Each throttle object has the following fields:

Field | Meaning
"match" | The regular expression which is used to match a document's bins to determine if the throttle should be applied
"match_description" | Optional text describing the meaning of the throttle
"rate" | The maximum fetch rate to use if the throttle applies, in fetches per minute
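For illustration, a repository connection object with one throttle might look like the sketch below. The connection name, description, authority group, and limits are invented examples, and the class name is only an assumption about a file system connector deployment:

```python
# Hypothetical repository connection object with a single throttle.
# This dict would be the body of: PUT repositoryconnections/<encoded_name>
connection = {
    "repositoryconnection": {
        "name": "MyFileConnection",            # invented name
        "description": "Example file system connection",
        # Assumed connector class; verify against your installation.
        "class_name": "org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector",
        "max_connections": "10",
        "acl_authority": "MyGroup",            # invented authority group
        "throttle": [
            {
                "match": "()",                 # one bin matching everything
                "match_description": "All documents",
                "rate": "120",                 # fetches per minute
            }
        ],
        "configuration": {},                   # connection-class specific
    }
}
```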

Notification connection objects

Notification connection names, when they are part of a URL, should be encoded as follows:

  1. All instances of '.' should be replaced by '..'.
  2. All instances of '/' should be replaced by '.+'.
  3. The URL should be encoded using standard URL utf-8-based %-encoding.

The JSON fields for a notification connection object are as follows:

Field | Meaning
"name" | The unique name of the connection
"description" | The description of the connection
"class_name" | The java class name of the class implementing the connection
"max_connections" | The total number of outstanding connections allowed to exist at a time
"configuration" | The configuration object for the connection, which is specific to the connection class

Job objects

The JSON fields for a job are as follows:

Field | Meaning
"id" | The job's identifier, if present. If not present, ManifoldCF will create one (and will also create the job when saved).
"description" | Text describing the job
"repository_connection" | The name of the repository connection to use with the job
"document_specification" | The document specification object for the job, whose format is repository-connection specific
"start_mode" | The start mode for the job, which can be one of "schedule window start", "schedule window anytime", or "manual"
"run_mode" | The run mode for the job, which can be either "continuous" or "scan once"
"hopcount_mode" | The hopcount mode for the job, which can be one of "accurate", "no delete", or "never delete"
"priority" | The job's priority, typically "5"
"recrawl_interval" | The default time between recrawls of a document (if the job is "continuous"), in milliseconds, or "infinite" for infinity
"max_recrawl_interval" | The maximum time between recrawls of a document (if the job is "continuous"), in milliseconds, or "infinite" for infinity
"expiration_interval" | The time until a document expires (if the job is "continuous"), in milliseconds, or "infinite" for infinity
"reseed_interval" | The time between reseeding operations (if the job is "continuous"), in milliseconds, or "infinite" for infinity
"hopcount" | An array of hopcount objects, describing the link types and associated maximum hops permitted for the job
"schedule" | An array of schedule objects, describing when the job should be started and run
"pipelinestage" | An array of pipelinestage objects, describing the transformation pipeline
"notificationstage" | An array of notificationstage objects, describing the notifications

Each pipelinestage object has the following fields:

Field | Meaning
"stage_id" | The unique identifier for the pipeline stage
"stage_prerequisite" | The unique identifier for the preceding pipeline stage; may be missing if none
"stage_isoutput" | "true" if the stage is an output connection
"stage_connectionname" | The connection name for the pipeline stage
"stage_description" | A description of the pipeline stage
"stage_specification" | The specification string for the pipeline stage

Each notificationstage object has the following fields:

Field | Meaning
"stage_id" | The unique identifier for the notification stage
"stage_connectionname" | The connection name for the notification stage
"stage_description" | A description of the notification stage
"stage_specification" | The specification string for the notification stage

Each hopcount object has the following fields:

Field | Meaning
"link_type" | The connection-type-dependent type of a link for which a hop count restriction is specified
"count" | The maximum number of hops allowed for the associated link type, starting at a seed

Each schedule object has the following fields:

Field | Meaning
"timezone" | The optional time zone for the schedule object; if not present, the default server time zone is used
"duration" | The optional length of the described time window, in milliseconds; if not present, duration is considered infinite
"dayofweek" | The optional day-of-the-week enumeration object
"monthofyear" | The optional month-of-the-year enumeration object
"dayofmonth" | The optional day-of-the-month enumeration object
"year" | The optional year enumeration object
"hourofday" | The optional hour-of-the-day enumeration object
"minutesofhour" | The optional minutes-of-the-hour enumeration object
"requestminimum" | Optional flag indicating whether the job run will be minimal or not ("true" means minimal)

Each enumeration object describes an array of integers using the form:

{"value":[<integer_list>]}

Each integer is a zero-based index describing which entity is being specified. For example, for "dayofweek", 0 corresponds to Sunday, etc., and thus "dayofweek":{"value":[0,6]} would describe Saturdays and Sundays.
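Putting the job and schedule fields together, a job object suitable for POSTing to the jobs resource might look like the following sketch. The connection names are invented, and the empty specification objects stand in for connector-specific content:

```python
# Hypothetical job: weekend crawl window starting at 2 AM.
# POST this dict to "jobs"; the response carries {"job_id": ...}.
job = {
    "job": {
        "description": "Example crawl job",
        "repository_connection": "MyFileConnection",   # invented name
        "document_specification": {},                  # connector-specific
        "start_mode": "schedule window start",
        "run_mode": "scan once",
        "hopcount_mode": "accurate",
        "priority": "5",
        "schedule": [
            {
                "dayofweek": {"value": [0, 6]},  # 0 = Sunday, 6 = Saturday
                "hourofday": {"value": [2]},     # 2 AM
                "requestminimum": "false",
            }
        ],
        "pipelinestage": [
            {
                "stage_id": "0",
                "stage_isoutput": "true",
                "stage_connectionname": "MyOutputConnection",  # invented
                "stage_specification": {},       # connector-specific
            }
        ],
    }
}
```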

Job status objects

The JSON fields of a job status object are as follows:

Field | Meaning
"job_id" | The job identifier
"status" | The job status, having the possible values: "not yet run", "running", "paused", "done", "waiting", "stopping", "resuming", "starting up", "cleaning up", "error", "aborting", "restarting", "running no connector", and "terminating"
"error_text" | The error text, if the status is "error"
"start_time" | The job start time, in milliseconds since Jan 1, 1970
"end_time" | The job end time, in milliseconds since Jan 1, 1970
"documents_in_queue" | The total number of documents in the queue for the job
"documents_outstanding" | The number of documents for the job that are currently considered 'active'
"documents_processed" | The number of documents in the queue for the job that have been processed at least once

Connection-type-specific objects

As you may note when trying to use the above JSON API methods, you cannot get very far in defining connections or jobs without knowing the JSON format of a connection's configuration information, or a job's connection-specific document specification and output specification information. The form of these objects is controlled by the Java implementation of the underlying connector, and is translated directly into JSON, so if you write your own connector you should be able to figure out what it will be in the API. For connectors that are already part of ManifoldCF, documenting these connector-specific objects remains an outstanding task; it has not yet been undertaken.

Luckily, it is pretty easy to learn a lot about the objects in question by simply creating connections and jobs in the ManifoldCF crawler UI, and then inspecting the resulting JSON objects through the API. In this way, it should be possible to do a decent job of coding most API-based integrations. The one place where difficulties will certainly occur is if you try to completely replace the ManifoldCF crawler UI with one of your own. This is because most connectors have methods that communicate with their respective back-ends in order to allow the user to select appropriate values. For example, the path drill-down presented by the LiveLink connector requires the connector to interrogate the appropriate LiveLink repository in order to populate its path selection pull-downs. There is, at this time, only one sanctioned way to accomplish the same job using the API: use the appropriate "connection_type/execute/type-specific_command" command to perform the necessary functions. A set of useful functions has been coded for every appropriate connector, but the exact commands for each connector, and their JSON syntax, remain undocumented for now.

File system connector

The file system connector has no configuration information, and no connector-specific commands. However, it does have document specification information. The information looks something like this:

{
  "startpoint" : [
    {
      "_attribute_path" : "c:\\path_to_files",
      "include" : [
        { "_attribute_type" : "file", "_attribute_match" : "*.txt" },
        { "_attribute_type" : "file", "_attribute_match" : "*.doc" },
        { "_attribute_type" : "directory", "_attribute_match" : "*" }
      ],
      "exclude" : [ "*.mov" ]
    }
  ]
}

As you can see, multiple starting paths are possible, and the inclusion and exclusion rules also can be one or multiple.
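Because the XML-to-JSON translation marks XML attributes with a "_attribute_" prefix, the same document specification can be assembled programmatically. A sketch mirroring the example above (the starting path is, of course, just an example):

```python
def include_rule(rule_type, match):
    """One inclusion rule; "_attribute_" keys correspond to XML attributes
    in the connector's native specification format."""
    return {"_attribute_type": rule_type, "_attribute_match": match}

# Hypothetical file system document specification: one starting path,
# three inclusion rules, one exclusion pattern.
document_specification = {
    "startpoint": [
        {
            "_attribute_path": "c:\\path_to_files",
            "include": [
                include_rule("file", "*.txt"),
                include_rule("file", "*.doc"),
                include_rule("directory", "*"),
            ],
            "exclude": ["*.mov"],
        }
    ]
}
```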

Control via Commands

For script writers, there currently exist a number of ManifoldCF execution commands. These commands primarily cover defining connections and jobs, controlling jobs, and running reports. The following table lists the current suite.

Command | What it does
org.apache.manifoldcf.agents.DefineOutputConnection | Create a new output connection
org.apache.manifoldcf.agents.DeleteOutputConnection | Delete an existing output connection
org.apache.manifoldcf.agents.DefineTransformationConnection | Create a new transformation connection
org.apache.manifoldcf.agents.DeleteTransformationConnection | Delete an existing transformation connection
org.apache.manifoldcf.authorities.ChangeAuthSpec | Modify an authority's configuration information
org.apache.manifoldcf.authorities.CheckAll | Check all authorities to be sure they are functioning
org.apache.manifoldcf.authorities.DefineAuthorityConnection | Create a new authority connection
org.apache.manifoldcf.authorities.DeleteAuthorityConnection | Delete an existing authority connection
org.apache.manifoldcf.authorities.DefineMappingConnection | Create a new mapping connection
org.apache.manifoldcf.authorities.DeleteMappingConnection | Delete an existing mapping connection
org.apache.manifoldcf.crawler.AbortJob | Abort a running job
org.apache.manifoldcf.crawler.AddScheduledTime | Add a schedule record to a job
org.apache.manifoldcf.crawler.ChangeJobDocSpec | Modify a job's specification information
org.apache.manifoldcf.crawler.DefineJob | Create a new job
org.apache.manifoldcf.crawler.DefineRepositoryConnection | Create a new repository connection
org.apache.manifoldcf.crawler.DeleteJob | Delete an existing job
org.apache.manifoldcf.crawler.DeleteRepositoryConnection | Delete an existing repository connection
org.apache.manifoldcf.crawler.ExportConfiguration | Write the complete list of all connection definitions and job specifications to a file
org.apache.manifoldcf.crawler.FindJob | Locate a job identifier given a job's name
org.apache.manifoldcf.crawler.GetJobSchedule | Find a job's schedule given a job's identifier
org.apache.manifoldcf.crawler.ImportConfiguration | Import configuration as written by a previous ExportConfiguration command
org.apache.manifoldcf.crawler.ListJobStatuses | List the status of all jobs
org.apache.manifoldcf.crawler.ListJobs | List the identifiers for all jobs
org.apache.manifoldcf.crawler.PauseJob | Given a job identifier, pause the specified job
org.apache.manifoldcf.crawler.RestartJob | Given a job identifier, restart the specified job
org.apache.manifoldcf.crawler.RunDocumentStatus | Run a document status report
org.apache.manifoldcf.crawler.RunMaxActivityHistory | Run a maximum activity report
org.apache.manifoldcf.crawler.RunMaxBandwidthHistory | Run a maximum bandwidth report
org.apache.manifoldcf.crawler.RunQueueStatus | Run a queue status report
org.apache.manifoldcf.crawler.RunResultHistory | Run a result history report
org.apache.manifoldcf.crawler.RunSimpleHistory | Run a simple history report
org.apache.manifoldcf.crawler.StartJob | Start a job
org.apache.manifoldcf.crawler.WaitForJobDeleted | After a job has been deleted, wait until the delete has completed
org.apache.manifoldcf.crawler.WaitForJobInactive | After a job has been started or aborted, wait until the job ceases all activity
org.apache.manifoldcf.crawler.WaitJobPaused | After a job has been paused, wait for the pause to take effect

Control by direct code

Control by direct Java code is quite a reasonable thing to do. The sources of the above commands should give a pretty clear idea of how to proceed, if that's what you want to do.

Caveats

The above commands know nothing about the differences between connection types. Instead, they deal with configuration and specification information in the form of XML documents. Normally, these XML documents are hidden from a system integrator, unless they happen to look into the database with a tool such as psql. But the commands above will often require such XML documents to be included as part of the command execution.

This has one major consequence. Any application that would manipulate connections and jobs directly cannot be connection-type independent - these applications must know the proper form of XML to submit to the command. So, it is not possible to use these command APIs to write one's own UI wrapper, without sacrificing some of the repository independence that ManifoldCF by itself maintains.