I heard a really interesting story on All Things Considered last Thursday. A computer scientist from Carnegie Mellon came up with a brilliant idea. He developed a system that takes words from digitized print sources that OCR programs could not correctly decipher and uses them as the anti-bot key text that appears on certain websites (like Ticketmaster). So, now, instead of just keying computer-generated text rendered in fuzzy images, people are prompted to key words that are often easy for humans to recognize, but almost impossible for the bots to comprehend. The websites also capture the data entered by the end users in order to improve the text capture for digitized books. When enough people agree on what the word is, the data is fed back to the source digitization project and used to improve the OCR-generated full text. The really cool thing is how much work has been accomplished through micro-contributions of time and knowledge made by millions of people.
This is just totally cool. Wouldn’t it be fabulous if we could find similar methods for capturing data that would help to improve metadata for bibliographic resources? Imagine if OCLC could come up with a similar mechanism for collecting variations on WorldCat master records made by individual libraries and individual users. Master records could be enhanced substantially without painstaking work from OCLC Quality Control staff.
The most valuable and expensive aspect of cataloging is capturing human knowledge effectively. We need systems that will allow end users to make small contributions to enhancing metadata easily and seamlessly, and that give professionals the tools they need to quickly and systematically analyze this data, so that it can be incorporated into the infrastructure. That seems like a key part of the Semantic Web: developing ways to capture, organize, and relate little bits of information and knowledge from all over the place into a coherent whole.
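To make the "when enough people agree" step a bit more concrete, here is a minimal sketch of how such an agreement check might work. The radio story did not describe the actual rules, so the function name `consensus` and both thresholds (`min_votes`, `min_share`) are my own assumptions, not the real system's values.

```python
from collections import Counter


def consensus(transcriptions, min_votes=3, min_share=0.75):
    """Return the agreed-upon reading of a word, or None if there is no clear winner.

    min_votes and min_share are illustrative assumptions, not values
    taken from any real system.
    """
    # Normalize lightly so differences in case or stray spaces don't split the vote.
    cleaned = [t.strip().lower() for t in transcriptions if t.strip()]
    if len(cleaned) < min_votes:
        return None  # not enough contributions yet
    word, count = Counter(cleaned).most_common(1)[0]
    # Accept the most common reading only if enough contributors chose it.
    return word if count / len(cleaned) >= min_share else None
```

The same idea could apply to metadata: a proposed correction to a master record would only be passed along to a professional for review once enough independent contributors had suggested the same change.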
Arghhh!! When will this insanity end?
I’ve spent a good portion of the past two days struggling to update the records in our online catalog for titles included in Oxford Reference Online. This process is so annoying and frustrating that I’m about ready to give up entirely. Why don’t I? Because I need to add our holdings for these titles to WorldCat.
A couple of years ago, I made the mistake of loading the MARC records provided in the database publisher’s free record set into our local catalog. The main problem now is that I need to get holdings for these titles added to OCLC in support of the new, WorldCat-based Summit catalog. So, no problem, I thought. I’ll just extract the ISBNs from the publisher-supplied records I still have in the database, put those into a text file, and do a batch search of WorldCat to download the matching records to a local file in Connexion. Then I’ll export the records to our local catalog, overlaying on ISBN. Sounds pretty straightforward, but it’s actually a huge pain in the butt!
First problem: I used screen-scraping (the only method possible) to gather current ISBNs for the 71 titles for which we still have non-OCLC records in our database, and saved them in a plain text file that I then uploaded for batch searching in Connexion Client. All 71 of them came up with multiple matches, even though I used all possible limits to try to restrict my searching to just records for the online/electronic resource versions of these books. I’ve slogged through records for about 15 titles so far, and I’ve observed common characteristics that appear on the most acceptable records, but Connexion Client won’t let me filter records within my local file based on those characteristics (e.g., a specific member library symbol in the 040, encoding level I, etc.). So, the only way to select the records to use for this project is to look through all of the 3-5 records retrieved for each ISBN. To catalog 71 titles, therefore, I must examine 3-5 times that many records. If OCLC is going to permit so many duplicate records in WorldCat, they really need to give us more options for limiting and filtering the records retrieved in response to a search. In this case, if I could limit to records with a particular encoding level, to English-language records (OCLC only allows you to limit based on the language of the content, not the record itself), or to records contributed by a particular member library other than DLC, it would save me a lot of work. And I would have to do this just to add our holdings to WorldCat, even if I weren’t exporting the records to our local catalog as well.
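Here is a rough sketch of the kind of filter I wish I had. It is not something Connexion Client offers; it assumes the retrieved records have already been reduced to simple Python dictionaries. The key names (`encoding_level`, `catalog_language`, `symbols_040`) and the library symbol `XXX` are hypothetical. In a real MARC record, the encoding level lives in Leader position 17 and the language of cataloging in 040 $b.

```python
def filter_records(records, levels=("I",), language="eng", symbol=None):
    """Keep only the records that match the characteristics of a 'good' record.

    Each record is a plain dict with hypothetical keys:
      encoding_level   -- e.g. "I" (Leader/17 in a real MARC record)
      catalog_language -- language of the record itself (040 $b), not of the content
      symbols_040      -- list of library symbols found in the 040 field
    """
    kept = []
    for rec in records:
        if rec["encoding_level"] not in levels:
            continue
        if rec["catalog_language"] != language:
            continue
        if symbol is not None and symbol not in rec["symbols_040"]:
            continue
        kept.append(rec)
    return kept


# Made-up sample: three candidate records retrieved for one ISBN.
candidates = [
    {"id": 1, "encoding_level": "I", "catalog_language": "eng", "symbols_040": ["XXX"]},
    {"id": 2, "encoding_level": "M", "catalog_language": "eng", "symbols_040": ["DLC"]},
    {"id": 3, "encoding_level": "I", "catalog_language": "ger", "symbols_040": ["XXX"]},
]
best = filter_records(candidates, symbol="XXX")
```

With a filter like this, the 3-5 records per ISBN would shrink to one or two worth a human look, instead of every one of them needing review.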
Second problem: I have to review and make some edits to each record, even if the record is of good quality. Notably, I must update the 049 and add a 949 to each record in order to get our III system to process the records correctly when they load. I wrote a macro that does most of this work for me, but that took me about an hour this morning, including the time I had to spend updating our III load profiles to optimize overlay based on ISBN. Even after specifying that overlay comparison be based upon the normalized form of the ISBN in the 020 field, the III system doesn’t seem to normalize the 020 correctly in all cases. For example, when the ISBN is followed by (pbk.), the III normalization program retains pbk as part of the ISBN, so overlay doesn’t work if that isn’t on the incoming OCLC record. Thus, I have to check and clean up the 020 fields in the existing records in our catalog or the overlay won’t work in many cases.
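For what it’s worth, the normalization I expected is not complicated. The sketch below is my own illustration, not how the III system actually works: it takes the leading ISBN from an 020 value, drops hyphens, and ignores anything after it, such as (pbk.). The function name `normalize_isbn` and the sample numbers are made up, and the check digit is not validated.

```python
import re


def normalize_isbn(raw):
    """Return the bare 10- or 13-character ISBN from an 020 value, or None.

    Only the leading run of digits, hyphens, and X is considered, so a
    trailing qualifier such as "(pbk.)" is ignored. The check digit is
    not verified.
    """
    match = re.match(r"\s*([0-9Xx-]+)", raw)
    if match is None:
        return None
    isbn = match.group(1).replace("-", "").upper()
    return isbn if len(isbn) in (10, 13) else None


def same_isbn(existing_020, incoming_020):
    """True when both values normalize to the same ISBN, so overlay can match."""
    a = normalize_isbn(existing_020)
    b = normalize_isbn(incoming_020)
    return a is not None and a == b
```

Comparing on the normalized form means "0-19-280000-0 (pbk.)" in the existing record and "0192800000" on the incoming OCLC record would be treated as a match, with no manual cleanup of the 020.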
Third problem: The titles in this database are based on the latest edition of the same title in print. And the publisher doesn’t provide any kind of notification or list of updated content, so you’re left on your own to find the updated titles. Since it has been a year since I last worked on this database, I need to search for updated content and dead links at this point, too. This leads to lots more manual review and checking, since there is no way other than human review to determine if the bibliographic record matches the content currently online at the database site.
This is too hard! And I, as the cataloger, am too distracted keeping track of all the mechanical aspects of searching, selecting, and downloading records to focus attention on the intellectual aspects of cataloging, like providing subject access to these resources that suits our local context. There has to be a better way to do this kind of stuff!!!