Monday, October 20, 2008

google and newspapers

Recently, when Google announced it was getting into the newspaper digitization business, many of us digitizing newspapers already took note. And who wouldn't? It's Google: they do a lot of great things and they've got a lot of money to do more. They've tried their hand at books so it's only natural that newspapers should follow. Their announcement wasn't unexpected.
(Google sample newspaper page)

Nevertheless, it gave us pause to consider the impact(s) this might have on our own digitization efforts in several key areas:

  • the long-term preservation of the digital data, quality imaging notwithstanding (see below)
  • for those of us funded by grants - our livelihoods
  • and, most importantly, title selections

I can't imagine Google would have financial worries about maintaining the enormous amount of data these newspapers generate. Even if they save their master files in a compressed format like JPG, JP2, or, God forbid, some lesser format, they're still faced with loads of material to save in perpetuity. Choosing the right format and thinking in forever terms are but two issues involved with digital preservation, all of which are beyond the scope of this posting.

As to our livelihoods - between Google and the current economic collapse/crisis, it feels kind of silly to even talk about. Let's just be thankful to have jobs and leave it at that for now.

But title selection is a different animal altogether. If you're an NDNP awardee, as we are here at the University of Kentucky, then you're bound by the NEH rules. Of particular importance here is the fact that we cannot digitize titles that have been digitized by another entity, whether it's a commercial entity or someone like Google who may make them freely available. 

Some argue that there's plenty to go around, and that's a reasonable enough argument. There are millions, if not billions, of historic newspaper pages waiting to be digitized. So, yes, there's plenty to do in that respect. But what happens to "collections"? What happens to their preservation? And who is responsible for those two things?

Picture this: what would you think if you, as a researcher - professional or layperson - landed upon a website that had tons of newspaper pages only to find that just a few newspaper titles are available? Would you feel cheated? Would you feel like you've wasted your time because, now, you have to keep looking for what you need? Or would you feel satisfied?

Take Chronicling America or our own Kentuckiana Digital Library... How strange would it be to look at Kentucky's newspapers at the end of NDNP's 20-year cycle to find we have every historic Kentucky newspaper except Louisa's Big Sandy News or the Kentucky Reporter, for instance? Wouldn't it seem odd for the University of Kentucky - the state's flagship university and Kentucky's sole NDNP content provider - to have everything except those two titles? Would you feel cheated? Would you feel like you've wasted your time because, now, you have to keep looking for what you need? Or would you feel satisfied?

And what would we say, as an arbiter for the state's historic collections and digital preservation, to those newspapers who may have opted to have their titles digitized by Google or some other outfit instead of UK when their stuff comes up missing, corrupt, distorted, or otherwise unusable? "Since you didn't let us preserve the material, it's just lost. Sorry about your luck, Mr. Publisher."

In fact, it's not the publisher who stands to lose, but all of us - Kentuckian, American, Global citizens alike. Newspapers are a shared history and should be free to everyone. Further, it seems childish to want anything but the best preservation standards applied to every single page, no matter what your role may be. After all, who are we making this stuff for if not our children, or our children's children? Is it simply to glorify ourselves or is it really because this stuff matters?

I'd like to think it's the latter.


  1. The rule that we do not digitize any titles that have already been done may have to be modified. The NEH surely has standards, and if a particular approach does not comply with those, then it does not matter; we'll do it the right way.
    Rules and regulations are written and designed with the best outcome at heart, one would hope. Sometimes a change in technology or a change in thinking, concerns, whatever, may lead to a different approach. Sometimes that could even be an improvement. (That does not seem to be the case with Google, though.)
    Is there a published policy on how Google plans to deal with the master files?

  2. It seems to me that there is no shame in partnering with Google. The internet makes it easy to link to other websites, and if Google is digitizing titles in your state, it makes sense to link to Google's resources from your state's digital newspaper website. Also, has anyone from the digital newspaper program contacted Google and told them what we are up to?

  3. A great post. Lots of food for thought.

    In the end libraries just want what is best for current and future researchers:

    - Access and preservation

    The old news article rule applies to this issue: Who, What, When, Where, Why, How?

    Why digitize? Do intentions have to be altruistic only? I don't think so but we have to be careful because this element invades every other aspect.

    Who does the work and when matters less than what is digitized and how. Where is also crucial. Say no to more silos. Yes, I think patrons care that they have to search multiple sites, with various levels of access.

    I guess I'm more afraid that this push by Google will cause programs like NDNP to back off and dry up. Libraries, archives, and museums need to be involved.

  4. A concern of ours, as I think you've alluded to, is that Google does not preserve the archives according to METS / ALTO standards, which go a long way to ensure the accuracy of the records.