I would just note parenthetically that there are also tough problems on the
curatorial side of many archives and special collections here; one of the big ones is
redaction. When the size of the print record was pretty small, you could find enough
human beings to go through and redact indiscreet things and pull items that should
stay private; in government settings you could make decisions about whether
something was classified and whether it could be declassified. Now you have just an
unmanageable problem when you start talking about things like government records
or personal papers. Can you let users compute on them and then select a few things
that a human curator might go through and appraise and redact if necessary? Can you
let that computation happen safely without too much implicit information leaking out
to cause trouble? These are strange and wonderful new areas of research that are
taking the stage as we struggle with this environment.
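To make this a little more concrete, here is a minimal sketch in Python of the kind of machine-assisted triage I have in mind: the computation flags items that might contain sensitive material so that a human curator only has to appraise and redact the flagged few. The patterns and file layout are illustrative assumptions of mine, not a real redaction workflow.

    import re
    from pathlib import Path

    # Very rough indicators of potentially sensitive content; these patterns
    # are assumptions for illustration, not a real redaction policy.
    SENSITIVE_PATTERNS = {
        "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
        "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
        "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def flag_for_review(collection_dir):
        """Yield (item name, kinds of matches) for items a curator should appraise."""
        for item in Path(collection_dir).glob("*.txt"):
            text = item.read_text(errors="ignore")
            hits = [name for name, pattern in SENSITIVE_PATTERNS.items()
                    if pattern.search(text)]
            if hits:
                yield item.name, hits

    # Hypothetical usage: print the handful of items needing human appraisal.
    for name, hits in flag_for_review("collection"):
        print(f"{name}: flagged for {', '.join(hits)}")

The interesting research questions sit on top of such crude passes: how much can be computed over the unflagged material, and how much implicit information leaks anyway.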
Finally, I want to really stress that the need for computational capability is not just
to permit humans to do the kind of access or research that they have traditionally done
in an environment where the amount of content available has grown unmanageable. It
is also central to being able to ask very new kinds of questions, whether about graphs
of social, intellectual, economic or other connectivity; about the outcomes of
inference or the identification of consensus or contradictions within very large
collections of text or data; about statistical correlations and the identification of
outliers. One could, for example, run an analysis of major collections of
Greek antiquities worldwide, attempting to computationally characterize an
“archetypal” version of a common kind of vase and the patterns of variation
around it, and then to link this to the geography of excavation sites.
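Purely to give a flavor of what such a computation might look like, here is a small sketch: reduce each vase to a vector of shape measurements, take the centroid as the “archetype”, and see how far the vases from each excavation region fall from it. The measurements, file format, and method are illustrative assumptions, not a description of any actual project.

    import csv
    from collections import defaultdict
    import numpy as np

    def load_measurements(path):
        """Read rows of (region, height, max_diameter, neck_ratio, handle_count)."""
        regions, features = [], []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                regions.append(row["region"])
                features.append([float(row[k]) for k in
                                 ("height", "max_diameter", "neck_ratio", "handle_count")])
        return regions, np.array(features)

    regions, X = load_measurements("vase_measurements.csv")   # hypothetical file
    X_norm = (X - X.mean(axis=0)) / X.std(axis=0)             # put measurements on one scale

    archetype = X_norm.mean(axis=0)                           # the "archetypal" vase
    distances = np.linalg.norm(X_norm - archetype, axis=1)    # deviation of each vase

    # Average deviation per excavation region: a crude way to link
    # patterns of variation to geography.
    by_region = defaultdict(list)
    for region, d in zip(regions, distances):
        by_region[region].append(d)
    for region, ds in sorted(by_region.items()):
        print(f"{region}: mean deviation {np.mean(ds):.2f} over {len(ds)} vases")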
So let us return to this issue of access implying computation and connect it to
clouds. We are now moving into an environment where more and more kinds of
access actually require meaningful amounts of computation, and we have some hard
questions here. One is where the computation happens; the way we answer this
largely determines who chooses and provides the tools, and who sets the limits on
what you, as a user of information, can do. Let us look very quickly at a few scenarios.
Let us imagine that I want to do a complex computation over a substantial slice of
the recent literature in molecular biology - perhaps, say, 750,000 articles - or over a
big slice of the Twitter archive. One option is to download the material and do my
computations locally, if the publishers or other repositories housing it will let me.
That is not at all clear: some publishers do not agree, as a matter of policy, that this
kind of download-and-compute scenario is part of open access or something they need
to support in offering public access; I may not have the storage needed to hold all
these articles, or the bandwidth to my local resources that would let me download
them in reasonable time and at reasonable cost; and some repositories may not have
the computational provisioning even to support this kind of bulk downloading, or may
rate limit it to reduce the computational impact, which translates into a very long
download time. But given all these caveats, I could download the material and do my
computations locally.
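As a rough sketch of that last caveat - the repository endpoint, identifier scheme, and one-request-per-second limit below are purely hypothetical - even a cooperative bulk download has to pace itself against whatever rate limit the repository imposes, and at this scale the pacing alone dominates the elapsed time.

    import os
    import time
    import requests

    BASE_URL = "https://repository.example.org/articles"   # hypothetical endpoint
    REQUESTS_PER_SECOND = 1.0                               # hypothetical rate limit

    def bulk_download(article_ids, out_dir="corpus"):
        os.makedirs(out_dir, exist_ok=True)
        delay = 1.0 / REQUESTS_PER_SECOND
        for article_id in article_ids:
            response = requests.get(f"{BASE_URL}/{article_id}", timeout=30)
            response.raise_for_status()
            with open(os.path.join(out_dir, f"{article_id}.xml"), "wb") as f:
                f.write(response.content)
            time.sleep(delay)   # stay under the repository's rate limit

    # At one request per second, 750,000 articles take about 750,000 seconds
    # of wall-clock time - on the order of nine days of continuous downloading.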
Or, in theory, I could send my computation over to the repository. Well, how many
sites do you know that say “send me your arbitrary programs, I'm happy to run them
and see what they do on my site, what fun”? No, what they will do is they will