Performance tuning
To get the most out of ManifoldCF performance-wise, there are a few things you need to know: how to configure the project so that it performs optimally, what hardware works best, and how to tell whether you've actually done everything properly, which requires data to compare against. This page aims to answer all of those questions.
Configuration for performance
The goal of performance tuning for ManifoldCF is to take maximum advantage of parallelism in the system doing the work, and to make sure there are no bottlenecks anywhere that would slow things down. The most important underpinning of ManifoldCF is the database, since that is the only persistent storage mechanism ManifoldCF uses. Getting the database right is therefore the first goal.
Selecting the database
Start by using PostgreSQL rather than Derby, because Derby has known performance problems when it comes to handling deadlocks. Database deadlocks arise naturally in highly threaded systems like ManifoldCF; the risk of their arising can be reduced, but it cannot be entirely eliminated. Derby, however, can deadlock when a simple SELECT against a table happens at the same time as a DELETE against the same table, and it then hangs for a minute before it detects the deadlock. Obviously that behavior is incompatible with high performance, so use PostgreSQL if you care at all about crawler performance. See the how-to-deploy page for a description of how to run ManifoldCF under PostgreSQL.
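For reference, switching to PostgreSQL is done in properties.xml. A minimal sketch is shown below; the credential values are placeholders, and the full set of database properties for your deployment is described on the how-to-deploy page.

```xml
<configuration>
  <!-- Select the PostgreSQL database implementation instead of the default Derby -->
  <property name="org.apache.manifoldcf.databaseimplementationclass"
            value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
  <!-- Placeholder credentials for the ManifoldCF database user -->
  <property name="org.apache.manifoldcf.database.username" value="manifoldcf"/>
  <property name="org.apache.manifoldcf.database.password" value="your_password_here"/>
</configuration>
```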
Certain PostgreSQL versions are also known to generate bad plans for ManifoldCF queries. When this happens, crawls of any size may become extremely slow. The ManifoldCF log will start to include many warnings of the sort, "Query took more than a minute", with a corresponding dumped plan that shows a sequential scan of a large table. At this point you should suspect you have a bad version of PostgreSQL. Known bad versions include 8.3.12. Known good versions are 8.3.7, 8.3.8, and 8.4.5.
Configuring PostgreSQL correctly
The key configuration changes you need to make to PostgreSQL from its out-of-the-box settings are intended to:
- Set PostgreSQL up with enough database handles so that handle availability will not be a bottleneck;
- Make sure PostgreSQL has enough shared memory allocated to support the number of handles you selected;
- Turn off autovacuuming.
The postgresql.conf file is where you set most of these options. Some recommended settings are described in the deployment page. The postgresql.conf file describes the relationship between parameters, especially between the number of database handles and the amount of shared memory allocated. This can differ significantly from version to version, so it never hurts to read the text in that file, and understand what you are trying to achieve.
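As a starting point, the sketch below illustrates the three goals above for a dedicated ManifoldCF database server. The values are illustrative only; the right shared-memory figure depends on your PostgreSQL version and available RAM, as the comments in postgresql.conf itself explain.

```
# Enough handles for all ManifoldCF processes, plus a margin for other clients
max_connections = 200
# Shared memory must be sized to support the connection count chosen above
shared_buffers = 1024MB
# Vacuum on a schedule instead (see "Database maintenance" below)
autovacuum = off
```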
The number of database handles you need will depend on your ManifoldCF setup. If you use the Quick Start, for instance, fewer handles are needed, because only one process is used. The formula relating handle count to other parameters of ManifoldCF is presented below.
manifoldcf_db_pool_size * number_of_manifoldcf_processes <= maximum_postgresql_database_handles - 2
The number of processes you might have depends on how you deployed ManifoldCF. If you used the Quick Start, you will only have one process. But if you deployed in a more distributed way, you will have at least one process for the agents daemon, as well as one process for each web application. If you anticipate that a command-line utility could be used at the same time, that's one more process. These multiply quickly, so the number of database handles you need to make available can get quite large, unless you artificially limit the ManifoldCF pool size instead.
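As a hypothetical worked example, suppose you run only the Quick Start (one process) with a pool of 95 handles:

95 * 1 = 95 <= 200 - 2 = 198

so a PostgreSQL max_connections setting of 200 is more than sufficient. With three processes (the agents daemon plus two web applications) at the same pool size, you would need 95 * 3 + 2 = 287 handles, which is why distributed deployments either raise max_connections or limit the per-process pool size.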
Setting the parameters that control the size of the database connection pool is covered in the next section.
Setting the ManifoldCF database handle pool size
The database handle pool size must be set correctly, or ManifoldCF will not perform well, and may even deadlock waiting to get a database handle. The properties.xml parameter that controls this is org.apache.manifoldcf.database.maxhandles. The formula you should use to properly set the value is below.
worker_thread_count + delete_thread_count + expiration_thread_count + cleanup_thread_count + 10 < manifoldcf_db_pool_size
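For example, with 50 worker threads and the default 10 threads each for deletion, expiration, and cleanup, the left side of the formula is 50 + 10 + 10 + 10 + 10 = 90, so the pool must be larger than 90 handles. A properties.xml entry such as the following (the value of 95 is just an illustration) would satisfy the formula:

```xml
<!-- Pool must exceed 50 + 10 + 10 + 10 + 10 = 90 handles for this thread mix -->
<property name="org.apache.manifoldcf.database.maxhandles" value="95"/>
```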
Setting the number of worker, delete, and expiration threads
The number of each variety of thread you choose depends on a number of factors that are specific to the kinds of tasks you expect to do. First, note that constraints based on your hardware may have the effect of setting an upper bound on the total number of threads. If, for example, memory constraints on your system have the effect of limiting the number of available PostgreSQL handles, the total threads will also be limited as a result of applying the formulas already given.
If you do not have any such constraints, then you can choose the number of threads based on other hardware factors. Typically, the number of processors is the main factor in arriving at a total thread count; a value of between 12 and 35 threads per processor is typical. The optimal number for you will require some experimentation.
The threads then have to be allocated to the worker, deletion, or expiration category. If your work load does not require much in the way of deleting documents or expiring them, it is usually adequate to retain the default of 10 deletion and 10 expiration threads, and simply adjust the worker thread count. The worker thread count parameter is org.apache.manifoldcf.crawler.threads. See the deployment page for a list of all of these parameters.
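For instance, on a two-processor machine you might start with 50 worker threads (25 per processor, within the typical range above), matching the pool-sizing example earlier. The value below is only a starting point for experimentation:

```xml
<!-- 50 worker threads; deletion, expiration, and cleanup threads left at their defaults of 10 -->
<property name="org.apache.manifoldcf.crawler.threads" value="50"/>
```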
Database maintenance
Once you have the database and ManifoldCF configured correctly, you will discover that the performance of the system gradually degrades over time. This is because PostgreSQL requires periodic maintenance in order to function optimally. This maintenance is called vacuuming.
Our recommendation is to vacuum on a schedule, and to use the "full" variant of the vacuum command (e.g. "VACUUM FULL"). PostgreSQL gives you the option of lesser vacuums, some of which can be done in background, but in our experience these are very expensive performance-wise, and are not very helpful either. "VACUUM FULL" makes a complete new copy of the database, a table at a time, stored in an optimal way. It is also reasonably quick, considering what it is doing.
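One straightforward way to vacuum on a schedule is a cron entry that runs PostgreSQL's vacuumdb utility during an idle window. The database name, user, and schedule below are placeholders; note that VACUUM FULL takes exclusive locks while it rewrites each table, so it is best run while the crawler is idle.

```
# Run a full vacuum of the ManifoldCF database every night at 2:00 AM
0 2 * * * vacuumdb --full -U manifoldcf -d dbname
```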
Some results
We've run performance tests on several systems. Depending on hardware configuration, we've seen anywhere from 16 to 57 documents per second. We tested with three different systems, running the same crawl of 306,944 documents on each. The table below shows the relevant configurations and results:
| System | Processors (2+ GHz) | Memory | Disk drives | Elapsed time (seconds) | Documents per second |
|---|---|---|---|---|---|
| Desktop | 2 | 8 GB | 7,200 RPM | 19,492 | 16 |
| Laptop | 2 | 4 GB | Samsung SSD RBX | 9,230 | 33 |
| Server | 8 | 8 GB | 10,000 RPM | 5,366 | 57 |
For these tests, we ran the Quick Start example configuration from ManifoldCF, with the exception of using an external PostgreSQL database instead of the embedded Derby. We also altered the ManifoldCF and PostgreSQL configuration from the default settings to maximize system resource usage. The table below shows the key configuration changes.
| Workers | ManifoldCF DB connections | PostgreSQL connections | Max repository connections | JVM memory |
|---|---|---|---|---|
| 100 | 105 | 200 | 105 | 1024 MB |
Additionally, we made postgresql.conf changes as shown in the table below:
| Parameter | Value |
|---|---|
| shared_buffers | 1024MB |
| checkpoint_segments | 300 |
| maintenance_work_mem | 2MB |
| tcpip_socket | true |
| max_connections | 200 |
| checkpoint_timeout | 900 |
| datestyle | ISO,European |
| autovacuum | off |
There are some interesting conclusions, for example the effect of the solid-state drive in the laptop. Even though addressable memory was reduced to 4 GB, the laptop processed twice as many documents per second as the desktop did with its slower disks. The other interesting fact is that the server had lower-performing disks, but four times as many processors, and it was roughly twice as fast as the laptop.