sandbox you into some kind of virtual machine environment that is very carefully
constrained and isolated - but this is hard to do, and a big security risk, and they do
not have the computational capacity anyway, so running your “programs” on the
repository site is unlikely in many cases. Or they will support a small, constrained set
of high-level queries where they can bound the computational demand and the
functionality of the queries.
(I will note in passing another issue. Literatures are, and will be, most typically
scattered across large numbers of repositories or libraries. So from the user's
perspective, one is not working with a single source of content but with many in parallel. This
also changes the tradeoffs and indeed even the feasibility of shipping queries to the
data rather than copying the aggregated data somewhere and computing on it.)
There is a simmering debate that comes down to this: are open-access
articles going to be liquid and mobile, or are the tools of text mining going to be,
for most practical purposes, defined by the publishing community because you will
need to run those tools in their environment? Will publishers let you run only
specific tools, or charge you extra if you want to run other, computationally
intensive tools? They may choose to let you run only things that are fairly
inexpensive. You can ask the same question not just about publishers (and other
repositories) with regard to articles but also more generally about cultural memory
organizations and the materials that they house. The Library of Congress got some
publicity about a year and a half ago, when it was announced that they were going to
preserve and host the Twitter archive. So they have now got these data feeds coming
in from Twitter. Housing them on disk is only a moderate problem; if you talk to
the people there, the really intractable problem is how to provide meaningful access
to this resource, because of the scale of computational provisioning necessary. It is a
question of where the Library of Congress gets the computational capacity to deal
with the kind of queries that people are going to want to run across this database,
which are not simply “show me tweet 21,000,000,992”; they are going to be asking
questions about the nature of the social graph, retweet patterns, and things that are
genuinely expensive to compute.
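To make that contrast concrete, here is a minimal sketch in Python. The file layout, field names, and function names are hypothetical, purely illustrative, and not any actual Library of Congress or Twitter interface: a lookup of a single tweet by identifier can stop as soon as it finds a match (or be answered from an index), while even a simple retweet-pattern question forces a pass over the entire archive, which at the scale of the Twitter feed is exactly the provisioning problem described above.

```python
import json
from collections import Counter

# Hypothetical layout: the archive as newline-delimited JSON, one tweet per
# line, with "id", "user", and an optional "retweeted_user" field.

def fetch_tweet(path, tweet_id):
    # Cheap query: with an index on tweet id this is a single lookup;
    # even this naive scan can stop as soon as the tweet is found.
    with open(path) as f:
        for line in f:
            tweet = json.loads(line)
            if tweet["id"] == tweet_id:
                return tweet
    return None

def retweet_in_degrees(path):
    # Expensive query: counting how often each user is retweeted requires
    # reading every tweet in the archive; there is no shortcut, and at the
    # scale of the full Twitter feed that demands serious provisioning.
    counts = Counter()
    with open(path) as f:
        for line in f:
            source = json.loads(line).get("retweeted_user")
            if source:
                counts[source] += 1
    return counts
```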
So back to the clouds. Can we do some of this computation in the clouds, where at
least in some cases there is already a public market in computational resources and an
infrastructure (albeit a heavy-handed one) for isolating users? Can we imagine an
environment where, if you really want to dig into the treasures of a poorly funded
cultural heritage organization (or maybe not even a poorly funded one, just one without
infinite resources), the deal is that you buy some computing cycles in the same
cloud that the cultural heritage organization occupies, so that the data transfer
is manageable within that cloud? (In most cases, if it is a big collection of
data, you certainly cannot download it yourself, because the consumer broadband
infrastructure either cannot handle it or prices it out of reach.)
There are some other interesting variations one can imagine here. In some parts of
the United States now (and, I think, in other countries as well), we are asking questions
about the role of libraries, and particularly public libraries, in society going forward.
We are also asking how much bandwidth a public library should have
and what it would do with it. There are at least a few experiments that are starting