The advantage of the MemoryWorkloadManager is that it is very easy to set up. Just create a new instance of the MemoryWorkloadManager, and your spider is ready to go. Because everything is stored in memory, there is no database or file system to set up.
The main disadvantage of the MemoryWorkloadManager is that it cannot hold a large number of URLs. Because of this, the MemoryWorkloadManager is limited to processing URLs from only a single host. If you would like to process URLs from many different web hosts, you will need to use the SQLWorkloadManager.
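The trade-off described above can be illustrated with a minimal sketch of an in-memory URL list that accepts URLs from one host only. This is not the book's MemoryWorkloadManager; the class name SimpleMemoryWorkload and all of its methods are hypothetical, and the sketch assumes that a queue of waiting URLs plus a set of already-seen URLs is enough to show the idea.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration only; not the book's MemoryWorkloadManager.
class SimpleMemoryWorkload {
    private final String host;                              // the single host this workload accepts
    private final Queue<URI> waiting = new ArrayDeque<>();  // URLs not yet handed out
    private final Set<URI> known = new HashSet<>();         // every URL accepted so far

    SimpleMemoryWorkload(URI start) {
        if (start.getHost() == null) {
            throw new IllegalArgumentException("Start URL must include a host: " + start);
        }
        this.host = start.getHost();
        add(start);
    }

    // Queues the URL and returns true; returns false for duplicates
    // and for URLs that belong to a different host.
    boolean add(URI url) {
        if (url.getHost() == null || !url.getHost().equalsIgnoreCase(host)) {
            return false;
        }
        if (!known.add(url)) {
            return false;
        }
        waiting.add(url);
        return true;
    }

    // Returns the next waiting URL, or null when nothing is left.
    URI next() {
        return waiting.poll();
    }

    int waitingCount() {
        return waiting.size();
    }

    public static void main(String[] args) {
        SimpleMemoryWorkload workload =
                new SimpleMemoryWorkload(URI.create("http://www.example.com/"));
        System.out.println(workload.add(URI.create("http://www.example.com/page.html")));
        System.out.println(workload.add(URI.create("http://other.example.org/")));
        System.out.println(workload.waitingCount());
    }
}
```

Everything lives in ordinary collections, so nothing needs to be configured, but the whole list disappears when the program exits and its size is bounded by available memory.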
SQL Workload Management
The SQLWorkloadManager uses an SQL database to hold the list of URLs. This allows the SQLWorkloadManager to process a much larger amount of data than the MemoryWorkloadManager. Additionally, the SQLWorkloadManager can process multiple hosts.
The main disadvantage of the SQLWorkloadManager is that it is complex to set up. For example, you must create a database with the correct table structure. You must also verify that the spider has the correct login information and drivers for the database. None of this is terribly difficult; however, it is more complex than the simple MemoryWorkloadManager.
Other Workload Managers
Some databases require specialized workload managers. One such example is Oracle, for which a specialized workload manager named OracleWorkloadManager is provided.
Oracle requires slightly different forms for several of the SQL statements used by the workload manager. As a result, it is necessary to create a special workload manager for Oracle. The OracleWorkloadManager class is very short. It simply inherits from WorkloadManagement and replaces a few of the SQL statements.
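The pattern of inheriting and replacing a few statements can be sketched as follows. The text does not say which statements differ for Oracle, so this example uses one well-known difference as an assumption: Oracle has no LIMIT clause, and ROWNUM can be used to restrict the number of rows instead. The class names, method names, and the spider_workload table are all hypothetical and are not taken from the book's code.

```java
// Hypothetical names throughout; this is not the book's class hierarchy.
class GenericSqlStatements {
    // Selects a single waiting URL using LIMIT, which databases such as
    // MySQL and PostgreSQL accept.
    String selectNextWaiting() {
        return "SELECT workload_id, url FROM spider_workload "
                + "WHERE status = 'W' LIMIT 1";
    }

    // Plain SQL that needs no database-specific variant.
    String countWaiting() {
        return "SELECT COUNT(*) FROM spider_workload WHERE status = 'W'";
    }
}

// Overrides only the statement that Oracle cannot run as written.
class OracleSqlStatements extends GenericSqlStatements {
    @Override
    String selectNextWaiting() {
        // Oracle has no LIMIT clause; ROWNUM restricts the row count instead.
        return "SELECT workload_id, url FROM spider_workload "
                + "WHERE status = 'W' AND ROWNUM <= 1";
    }
}

class SqlStatementsDemo {
    public static void main(String[] args) {
        GenericSqlStatements generic = new GenericSqlStatements();
        GenericSqlStatements oracle = new OracleSqlStatements();
        System.out.println(generic.selectNextWaiting());
        System.out.println(oracle.selectNextWaiting());
        System.out.println(oracle.countWaiting()); // inherited unchanged
    }
}
```

Because only the differing statement is overridden, the subclass stays short and every other query is inherited unchanged.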
Currently, the OracleWorkloadManager, the MemoryWorkloadManager, and the SQLWorkloadManager are the only supported workload managers, although others may be supported in the future. One example might be a FileSystemWorkloadManager. This workload manager would use a directory on the file system to store the URL list. This would give it a capacity similar to that of the SQLWorkloadManager, but without requiring a relational database.
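Since no FileSystemWorkloadManager exists, the following is only a rough sketch of how a directory-backed URL list might look. The class DirectoryUrlList, the file name urls.txt, and the one-URL-per-line layout are all assumptions made for illustration. The sketch scans the whole file on every lookup, which a real implementation would need to replace with some form of index.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; no such class exists in the spider described here.
class DirectoryUrlList {
    private final Path file; // holds one URL per line

    DirectoryUrlList(Path directory) {
        try {
            Files.createDirectories(directory);
            this.file = directory.resolve("urls.txt"); // assumed file name
            if (Files.notExists(file)) {
                Files.createFile(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Appends the URL to the file unless it is already listed.
    boolean add(String url) {
        if (contains(url)) {
            return false;
        }
        try {
            Files.write(file, Collections.singletonList(url),
                    StandardCharsets.UTF_8, StandardOpenOption.APPEND);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Linear scan of the file; simple, but slow for very large lists.
    boolean contains(String url) {
        return readAll().contains(url);
    }

    int size() {
        return readAll().size();
    }

    private List<String> readAll() {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Unlike a list held in memory, the contents of the directory remain available to a new DirectoryUrlList instance after the program restarts, and no database server or driver is involved.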
Implementing a Memory Based WorkloadManager
You will now see how the MemoryWorkloadManager class is implemented. The memory workload manager stores the list of URLs in several memory-based objects. The MemoryWorkloadManager is shown in Listing 14.7.