Java Reference
In-Depth Information
The advantage of the MemoryWorkloadManager is that it is very easy to set up.
Just create a new instance of the MemoryWorkloadManager, and your spider is ready
to go. Because everything is stored in memory, there is no database or file system to set up.
The main disadvantage of the MemoryWorkloadManager is that it cannot hold a
large number of URLs. Because of this, the MemoryWorkloadManager is limited
to processing URLs from only a single host. If you would like to process URLs from many
different web hosts, you will need to use the SQLWorkloadManager.
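To make the trade-off concrete, here is a minimal sketch of how an in-memory workload could be organized. It is not the book's MemoryWorkloadManager; the class name MemoryWorkloadSketch and its methods are hypothetical, chosen only to illustrate the idea that a queue and a set held in memory are all the "set up" that is needed, and that URLs from other hosts are simply refused.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration only; not the book's MemoryWorkloadManager.
class MemoryWorkloadSketch {
    private final String host;                              // the single host this workload accepts
    private final Queue<URI> waiting = new ArrayDeque<>();  // URLs not yet processed
    private final Set<URI> known = new HashSet<>();         // every URL ever accepted

    MemoryWorkloadSketch(URI start) {
        if (start.getHost() == null) {
            throw new IllegalArgumentException("Start URL must include a host: " + start);
        }
        this.host = start.getHost();
        add(start);
    }

    // Queues a URL if it is on the accepted host and has not been seen before.
    // Returns true if the URL was accepted.
    boolean add(URI url) {
        if (url.getHost() == null || !url.getHost().equalsIgnoreCase(host)) {
            return false;  // different (or missing) host: outside this workload
        }
        if (!known.add(url)) {
            return false;  // already known: do not queue it twice
        }
        waiting.add(url);
        return true;
    }

    // Returns the next URL to process, or null when nothing is waiting.
    URI next() {
        return waiting.poll();
    }

    // Number of distinct URLs accepted so far; all of them live in memory.
    int size() {
        return known.size();
    }
}
```

Every accepted URL stays in the set for the life of the spider, which is why a structure like this suits a single site but not a crawl across many hosts.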
SQL Workload Management
The SQLWorkloadManager uses an SQL database to hold the list of URLs. This
allows the SQLWorkloadManager to process a much larger number of URLs than the
MemoryWorkloadManager. Additionally, the SQLWorkloadManager can process
multiple hosts.
The main disadvantage of the SQLWorkloadManager is that it is complex to set up.
For example, you must create a database with the correct table structure. You must also verify that
the spider has the correct login information and drivers for the database. None of this is terribly
difficult; however, it is more complex than the simple MemoryWorkloadManager.
Other Workload Managers
Some databases require specialized workload managers. One such example is Oracle. A
specialized workload manager named OracleWorkloadManager is provided for Oracle.
Oracle requires slightly different forms for several of the SQL statements used by the
workload manager. As a result, it is necessary to create a special workload manager for
Oracle. The OracleWorkloadManager class is very short. It simply inherits from
WorkloadManagement and replaces a few of the SQL statements.
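The idea of a short subclass that swaps out a few SQL statements can be sketched as follows. This is only an illustration under stated assumptions: the class names SqlWorkloadSketch and OracleWorkloadSketch, the table spider_workload, and its columns are all hypothetical and are not taken from the book's code. The one database fact relied on is that older Oracle releases do not accept a LIMIT clause, so a row limit is commonly expressed with ROWNUM instead.

```java
// Hypothetical base class: each SQL statement is exposed through an
// overridable method so a subclass can replace only the ones that differ.
class SqlWorkloadSketch {
    // Fetch a single waiting URL. LIMIT is accepted by MySQL and several
    // other databases. Table and column names are assumptions.
    protected String selectNextWaiting() {
        return "SELECT workload_id, url FROM spider_workload "
             + "WHERE status = 'W' LIMIT 1";
    }

    // Mark one URL as processed. This form is portable, so no override is needed.
    protected String markProcessed() {
        return "UPDATE spider_workload SET status = 'P' WHERE workload_id = ?";
    }
}

// Hypothetical Oracle variant: inherits everything and overrides only the
// statement whose syntax Oracle does not accept.
class OracleWorkloadSketch extends SqlWorkloadSketch {
    @Override
    protected String selectNextWaiting() {
        // ROWNUM restricts the number of rows returned in Oracle.
        return "SELECT workload_id, url FROM spider_workload "
             + "WHERE status = 'W' AND ROWNUM <= 1";
    }
}
```

Keeping each statement behind its own method means a database-specific manager stays short: it overrides the handful of statements that differ and inherits all of the remaining logic unchanged.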
Currently, the OracleWorkloadManager, the MemoryWorkloadManager,
and the SQLWorkloadManager are the only supported workload managers,
although others may be supported in the future. One example might be a
FileSystemWorkloadManager. This workload manager would use a directory
on the file system to store the URL list. This would have a capacity similar to that of the
SQLWorkloadManager, but would not require a relational database.
Implementing a Memory Based WorkloadManager
You will now see how the MemoryWorkloadManager class is implemented. The
memory workload manager stores the list of URLs in several memory-based objects. The
MemoryWorkloadManager is shown in Listing 14.7.