Saturday, October 22, 2011

Info Retrieval Systems: Challenges on Organizing the Web

The professor asked us our thoughts on curated collections versus automatic collections (such as those built by collecting websites via web crawlers) and on the challenges of a changing Web (1.0, 2.0, 3.0) for automated information retrieval. I mentioned my experience interning with the Internet Public Library.

The Internet Public Library has a virtual collection of web pages and those items are also prone to the same issues of collection management as physical collections through changing content or even disappearing content. One of the issues we handled on the IPL was correcting dead links. The IPL had an automatic dead link resolver which found dead links and tried proposing corrections, but sometimes we humans had to search for the new URL for a resource.
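To make the dead link idea concrete, here is a minimal sketch of how a checker might flag links for a human to review. It is not the IPL's actual resolver, which I never saw the code for; the function names and the rule "no answer, or a status of 400 and up, means dead" are my own assumptions.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if nothing answers."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # The server answered, but with an error code such as 404.
        return error.code
    except (OSError, ValueError):
        # No usable answer: unknown host, timeout, or malformed URL.
        return None


def find_dead_links(urls, get_status=http_status):
    """Return the URLs that look dead (assumed rule: no answer or status >= 400)."""
    dead = []
    for url in urls:
        status = get_status(url)
        if status is None or status >= 400:
            dead.append(url)
    return dead
```

Passing in a different `get_status` function makes it possible to try the logic without touching the network. Proposing a corrected URL is the harder part, and that is where we humans still came in.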

A more pressing point about the difference between curated and automated collections is the requirement by curated collection policy to establish the authoritativeness of the source when determining relevancy to an information need. This was another thing that we did a lot as IPL interns. We looked at the authoritativeness of a source by reading the About Us page and seeing what sorts of relationships a person had with an established organization or professional association.

Unethical Search Engine Optimizers are only looking to increase page rank for a site. One of the things that drove me nuts with submitted sites to the IPL was a page which was obviously created to drive traffic to ads on the site. A person would create a website/blog about a given topic, like "Novelty Cake Pans" (this is a real site submission which I rejected), and say that their site was authoritative about the topic and then it would be covered in ads. So the content is created to drive people to the product. These sorts of sites were submitted to the IPL for the link ranking, not because they were authoritative on the subject.

If one of the bragging rights of a search engine is the number of pages indexed, then we know that pages are not going to be excluded from indexing based on authoritativeness. A library's job is exactly to exclude "resources" which do not follow an editorial policy (such as having verifiable data).

As far as the "Semantic Web" goes, it seems to me that one of the fundamental ideas is having computer systems communicate with each other, using processes that don't involve human intervention. This means that IR and search move out of the realm of finding text results for a query entered in text, relying more on data about data (metadata). The challenges are determining what other kinds of data will need to be queried and how the machines involved will do so.

After having taken the Content Representation (INFO 622) and Metadata and Resource Description (INFO 662) courses, I am so tempted to think that moving toward the semantic web will require a huge initial human intervention. I think about the case of the Library of Congress and other national libraries moving toward the Resource Description and Access metadata scheme so that huge links between individual components of information can be made. This is how I understand the semantic web.

The point is that after that human intervention in organizing data, whether putting it in definitional tags like <resume></resume> or taking thesauri and turning them into ontologies, the computers will be able to query this type of metadata in order to give better results for a person's text query.
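As a rough illustration of what querying definitional tags could look like, here is a small sketch using Python's standard XML parser. The page content and every tag other than <resume> are made up for the example; no real markup standard is implied.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: a human has already wrapped the content in descriptive tags.
PAGE = """
<page>
  <resume>
    <name>Jane Example</name>
    <skill>cataloging</skill>
    <skill>metadata</skill>
  </resume>
  <advert>Buy novelty cake pans!</advert>
</page>
"""


def skills_in_resumes(xml_text):
    """Collect the text of every <skill> found inside a <resume> element."""
    root = ET.fromstring(xml_text.strip())
    skills = []
    for resume in root.iter("resume"):
        for skill in resume.findall("skill"):
            skills.append(skill.text)
    return skills
```

Because the query asks for skills inside resumes rather than for matching words, the advertisement on the same page never enters the results.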

I'm also interested in ways the semantic web can harness non-text resources such as numerical data/statistics, maps and coordinates, video, audio and still images. Imagine making a text-based query and getting data results that the IR system allows the user to manipulate. I remember answering questions on the IPL when a lady wanted statistical information about children of a certain age in US households. The census information that I could find collapsed ages into ranges (e.g., 1-5, 6-11...), but the user wanted one specific age range. In this instance, having access to the original numbers and being able to manipulate them would have made her day. This is what I think the semantic web should be able to do.
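Here is a tiny sketch of the kind of manipulation I mean. The counts are invented for illustration and are not census figures; the point is only that single-year numbers can be re-grouped into whatever range the user asks for.

```python
# Hypothetical counts of children by single year of age (made-up numbers).
CHILDREN_BY_AGE = {1: 410, 2: 395, 3: 402, 4: 388, 5: 391, 6: 379, 7: 384, 8: 377}


def count_in_range(counts_by_age, youngest, oldest):
    """Add up the counts for every age from youngest to oldest, inclusive."""
    return sum(
        count
        for age, count in counts_by_age.items()
        if youngest <= age <= oldest
    )


# A published table might only offer the 1-5 range...
published_range = count_in_range(CHILDREN_BY_AGE, 1, 5)
# ...while the user actually wants ages 4 through 7.
requested_range = count_in_range(CHILDREN_BY_AGE, 4, 7)
```

With only the collapsed ranges in hand, the second number cannot be recovered at all, which is why access to the original data matters.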

The biggest challenge, I still think, is that no matter what, there will still be people trying to change the game in their favor. One day XML tags may be standard in web pages, but people will still spam the content within the tags or the tags themselves in order to boost their rankings. I hope there will be a differentiation between authoritative sources and non-authoritative sources and that people will be able to select which set to search for information, but that both will still be equally valid. (On the one hand, it's nice to see research and cited studies; on the other hand, sometimes a person just wants to read the reviews of other "people on the street" in regards to a product.)

Follow up: I expressed my interest in how the web can organize non-textual information in this post and wanted to share my term paper on indexing video.

Friday, September 23, 2011

Information Retrieval Systems: thoughts on Boolean Search

So I'm taking a challenging (for me) course this term with lots of great subjects for thought. This week we are looking at Boolean Search (AND, OR, NOT statements). The prof wanted to know if we used it and in what situations we don't use it.

I said that I use Boolean searching in a library catalog or database for things that I know exist, and keyword relevancy searching for things that may or may not exist. Boolean searching is difficult when there is some doubt about whether a document exists, or about the appropriate terms.
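For anyone who hasn't seen what the three operators do under the hood, here is a minimal sketch. The three "documents" are invented, and real catalogs are far more elaborate, but AND, OR and NOT really do boil down to set operations on lists of matching records.

```python
# Hypothetical mini-collection: record number -> descriptive text.
DOCS = {
    1: "funny book about american culture",
    2: "serious book about culture",
    3: "funny cat videos",
}


def build_index(documents):
    """Map each word to the set of record numbers that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


INDEX = build_index(DOCS)


def term(word):
    """Return the records containing word (an empty set if there are none)."""
    return INDEX.get(word.lower(), set())


funny_and_culture = term("funny") & term("culture")  # AND: both words
funny_or_culture = term("funny") | term("culture")   # OR: either word
culture_not_funny = term("culture") - term("funny")  # NOT: first without second
```

Notice that a misremembered term simply returns an empty set, which is exactly why Boolean searching is unforgiving when you are unsure of the words.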

For example, just this week I had a library patron who wanted "the book by Laura Ingels about the demise of culture with the catchy funny title. She's a national commentator and was just interviewed about the book." It turned out that using Google's keyword search "laura funny culture book" (and not a Boolean search in the catalog) was the perfect way to find the book, which was by Laura Ingraham and titled Of Thee I Zing. The Google result which answered the query was an Amazon page of the book which (I guess) pulled together Laura from the author name, culture from the title, and funny from the reviews.

I also looked at another question of the prof's which was about controlled vocabulary and expecting users to know what they are looking for (at least in terms of the specific words they use to search). I think it's fair to expect some level of responsibility on the part of the user for learning the vocabulary of their search need. So if a user is searching a specific microcosm of information (say a health database), I have no problem recommending that the user become as familiar as possible with the language of her search (such as by browsing the controlled vocab/ thesaurus), in order to consciously craft a search using general or specific terms, preferred terms or natural language and understand how the difference in the terms chosen will affect the types of documents and the relevance of the documents returned.

However, I especially like Saracevic's third powerful idea about information science, that of "interaction, enabling direct exchanges and feedback between systems and people engaged in IR processes." (p. 1052) If the system is designed for interaction, then the user might be able to define more precise queries through computer prompting of related keywords. Sort of a brainstorming session with the computer.

Saracevic, T. (1999). "Information science." Journal of the American Society for Information Science. 50(12), 1051-1063.

Tuesday, August 23, 2011

List of Books Read in Summer 2011

This summer I decided not to take library classes so that I could spend a few months with my family. I have also had the chance to read some books, some thought-provoking, others simply pure fun. I read a few that seemed worth putting some words "on paper" about, but I didn't know whether I should write a review here or on LibraryThing...

What I liked about the books was the element of realism and how true they were to lived experience.
The titles:
  • The Elephant Keeper- written by Christopher Nicholson, which broke my heart and which at times I would have preferred not to read
  • Catfish Alley- written by Lynne Bryant, a well-made book about life in the southern US that reflected the culture of deliberate ignorance about how history still touches us (because of reading this book, I still haven't read "The Help")
  • Impossible- written by Nancy Werlin, very hard to summarize- better to read the book
  • Fairy Bad Day- written by Amanda Ashby- okay, this one I read for pure fun : )

Friday, July 22, 2011

I Just Created a Local Installation of WordPress!

I have given myself a summer project to create a video catalog using WordPress and either the Dublin Core metadata scheme or a subset of the scheme created by Rutgers University for their OpenMIC cataloging software. I'm working with Kinetic Illusions to catalog both their raw video files and their completed project files to describe the raw files based on subject matter, geographic location, time of year, etc and describe which raw files were used in which completed projects.
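Before touching WordPress, it helps me to sketch the records on paper, or in this case in a few lines of Python. The field names below are borrowed from Dublin Core elements (identifier, title, subject, coverage, date, relation), but the records, the identifiers and the use of "relation" to list the raw files in a project are my own assumptions, not Kinetic's data or the OpenMIC scheme.

```python
# Hypothetical raw video files described with a handful of Dublin Core elements.
RAW_FILES = [
    {"identifier": "raw-001", "title": "Harbor at dawn",
     "subject": ["boats", "sunrise"], "coverage": "Example Bay", "date": "2011-06"},
    {"identifier": "raw-002", "title": "Farmers market",
     "subject": ["food", "crowds"], "coverage": "Example Town", "date": "2011-07"},
    {"identifier": "raw-003", "title": "Sailboat race",
     "subject": ["boats", "sports"], "coverage": "Example Bay", "date": "2011-07"},
]

# Hypothetical completed projects; "relation" lists the raw files each one used.
PROJECTS = [
    {"identifier": "proj-A", "title": "Tourism spot", "relation": ["raw-001", "raw-003"]},
    {"identifier": "proj-B", "title": "Local food feature", "relation": ["raw-002"]},
]


def raw_files_about(subject):
    """Return identifiers of raw files tagged with the given subject."""
    return [f["identifier"] for f in RAW_FILES if subject in f["subject"]]


def projects_using(raw_id):
    """Return identifiers of completed projects that used the given raw file."""
    return [p["identifier"] for p in PROJECTS if raw_id in p["relation"]]
```

Even this toy version answers the two questions the catalog has to answer: which raw footage covers a subject, and which finished projects used a given piece of footage.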

So one step of the project is to figure out how Kinetic will use the database and what metadata elements are of utmost importance.

As Kinetic is planning to create a WordPress version of their website, as WordPress is the software of choice for their web services, and as I saw that Library Technology Reports published the Using WordPress as a Library Content Management System report, I thought creating the catalog using WP would be a good project. Only problem is... I don't do WordPress.

So I'm reading Lisa Sabin-Wilson's WordPress All-in-One for Dummies and the steps are very clear. However, she assumes an installation on the web and doesn't cover creating a local installation, meaning one on my own computer. So I had to go find software which would recreate the online environment that WordPress requires. Namely, I needed a database and a server. As I use a Mac, I discovered MAMP: Mac, Apache, MySQL, PHP. (I don't remember exactly how; I must have googled "wordpress local installation Mac.")

Now, I'm seriously superstitious about getting things to work, so I desperately searched for some step-by-step instructions for installing MAMP 1.9.6 and WP 3.2.1. The following YouTube video was created using slightly older versions of the software, but it was mostly the same.

I followed these instructions step by step and successfully installed WordPress! Needless to say, much celebration ensued as I am not tech architecture savvy, but that's why they created the helpful software. Thanks, local community, for sharing your resources and your knowledge!

Now back to the for Dummies book.

Monday, July 18, 2011

Stuff I Saw I Thought was Cool in the ALA11 Stacks

Here is a list of the stuff that I picked up information on in the Stacks, so that I may now recycle the paper and keep the list electronically in perpetuity!!!!!
  • iimageretrieval
    • e-scan scanning station
    • CopiBook scanning station
  • Open Repository- a commercial consultant/ service provider for customizing the open source product DSpace
  • The Horn Book Magazine- yeah, I've seen it on magazine stands, but never looked at one.
  • Library Advocate's Handbook- put together by the Office for Library Advocacy (OLA)
  • Unshelved- the comic about libraries
  • the Knowledge Imaging Centers (scanners) at Image Access- these things are amazing! They typically use a V cradle to hold the book by the cover/spine rather than have the customer put the book pages down and smash the spine. The particular model you choose depends on the application, from the size of the document scanned to whether it is for library users to scan documents.
    • KIC Bookeye 4, Bookeye 2
    • KIC Rebel (B4C6 Color Scanner/ Copier ? I wonder if this is its technical term)
    • KIC BookEdge Color Scanner/ Copier
  • G. T. Labs Comics about scientists. The cover art looked great, which is what induced me to pick up the handout:
    • T Minus
    • Levitation
    • Wire Mothers
    • Bone Sharps, Cowboys, and Thunder Lizards
    • Suspended in Language
    • Fallout
    • Dignifying Science: Stories about Women Scientists
    • Two Fisted Science
  • Academic Video Online by Alexander Street Press
This is the list thus far. I'm a bit tired looking through these, so I'll do a part two later.

    Monday, July 11, 2011

    Book Review: Introducing RDA: a Guide to the Basics

    The book's title tells it all. This is a very general introduction to RDA and at times I felt it was repetitive in the information. (Such as using the example of the cataloger no longer abbreviating words when transcribing information from the object to the record. This one was used a lot!) But, if you're a student in grad school, haven't taken a MARC class and just want to inform yourself of the future in cataloging, this book is a good start.

    The seven chapters are:
    1. What is RDA?
    2. RDA and the International Context
    3. FRBR and FRAD in RDA
    4. Continuity with AACR2
    5. Where Do We See Changes?
    6. Implementing RDA
    7. Advantages, Present and Future
    The book mainly talks about the foundation and concepts behind RDA, which is basically to support the user in her search for information by separating descriptive elements from each other and making relationships explicit so that the user could do this sort of search: "this item was created by that person who also created these things, one of which was published by this company, which specializes in audiobooks." Katrina Clifford also reviews the book for Ariadne and probably does a much better job than I could.
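To convince myself that explicit relationships really do make that sort of search possible, I sketched it with a handful of made-up statements. This is not RDA's actual element set or any real catalog data, just subject, relationship, object triples with names and relationship labels I invented.

```python
# Hypothetical relationship statements: (subject, relationship, object).
TRIPLES = [
    ("Item A", "created by", "Person P"),
    ("Item B", "created by", "Person P"),
    ("Item C", "created by", "Person P"),
    ("Item B", "published by", "Company Q"),
    ("Company Q", "specializes in", "audiobooks"),
]


def objects(subject, relation):
    """Everything the subject points to through the given relationship."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


def subjects(relation, obj):
    """Everything that points to obj through the given relationship."""
    return [s for s, r, o in TRIPLES if r == relation and o == obj]


def follow_trail(item):
    """Item -> its creator -> the creator's other items -> publisher -> specialty."""
    trail = []
    for creator in objects(item, "created by"):
        for other in subjects("created by", creator):
            if other == item:
                continue
            for publisher in objects(other, "published by"):
                for specialty in objects(publisher, "specializes in"):
                    trail.append((creator, other, publisher, specialty))
    return trail
```

Calling `follow_trail("Item A")` walks exactly the chain in the quoted search. None of this works if the creator and publisher are buried in one free-text field, which is the argument for separating the elements in the first place.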

    As I'll be returning the book to the library (not the sort of book you'd want to buy), I'm going to take a moment to list some of the resources in no particular order that I'd like to read for further, perhaps more concrete, information:
    • Coyle, K. (2010). RDA vocabularies for a twenty-first century data environment. Library Technology Reports. 46(2).
    • Delsey, T. (2008). RDA, FRBR/FRAD and implementation scenarios. 5JSC/Editor/4; www.rda-jsc.org/doc/editor4.pdf
    • Delsey, T. (2007). RDA database implementation scenarios. 5JSC/Editor/2; www.rda-jsc.org/docs/5editor2.pdf
    • JSC presentations: www.rda-jsc.org/rdapresentations.html
    • Beacom, M. (2007). Cataloging Cultural Objects (CCO), Resource Description and Access (RDA), and the future of metadata content. VRA Bulletin 34(1), 81-85.
    • IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records: Final Report. Munich: Saur, 1998. www.ifla.org/en/publications/functional-requirements-for-bibliographic-records.
    • Tillett, B. (2004). What is FRBR? A conceptual model for the bibliographic universe. Washington, DC: Cataloging Distribution Service, Library of Congress. http://www.loc.gov/cds/downloads/FRBR.PDF
    • Curran, Mary. (2009). Serials in RDA: A starter's tour and kit. Serials Librarian 57(4), 306-324.
    • Hitchens, A., & Symons, E. (2009). Preparing catalogers for RDA training. Cataloging & Classification Quarterly 47(8), 691-707.
    All in all a good resource to get you started. It certainly helped me to understand the structural differences between AACR2 (Anglo bias, bias toward the container of the information object) and RDA (entity relationship based) and it provided some good resources for further study. I'd like more (free) info on FRAD and FRSAD (authority and subject authority data), but alas, I've not stumbled upon it. I'll share it here when I do.

    Friday, July 8, 2011

    RDA should be called RAD

    On my last day at ALA '11 I attended a session with no knowledge of the topic. I sat down at a table and was immediately greeted with, "How do you know RDA?" To which I replied, "I don't. But I keep seeing the acronym and thought I should probably learn something."

    The session was titled Education and Training for Using RDA and moderated by Kathryn La Barre and Marjorie Bloss. There were many other acronyms tossed about (FRBR, FRAD, PCC, RDA) and I waited and took notes until I heard one I recognized: MARC. Ah, this has to do with cataloging.

    As it turns out, the American Library Assoc. (ALA), Australian Committee on Cataloguing, British Library, Canadian Committee on Cataloguing, Chartered Institute of Library and Information Professionals (CILIP) (UK based), and the Library of Congress (LOC) all got together (from 2003 to the present) to develop a new cataloging standard called Resource Description and Access (RDA) to replace the Anglo-American Cataloguing Rules (AACR2). This I learned from reading Introducing RDA: a Guide to the Basics by Chris Oliver.

    Essentially what is going down is the community recognized that the way AACR2 was structured made it hard to describe new types of resources such as online multimedia (apparently AACR2 is structured around the container that an information resource is housed in such as a book or a music CD) and that it was created for the print environment, not for the digital environment. So AACR2 doesn't easily allow for descriptions of new types of resources nor does its structure emphasize relationships between objects or between objects' creators so that data can be used outside of the environment in which it is created, which is what a networked environment excels at.

    Back to the session: I came away with the idea that RDA is a MEGA-schema. It lists all the elements to describe an information object and tells you how to fill the data values, from something as "minor" as don't include the article "The" for title information to which controlled vocabularies to consult for specific elements to detailing the relationships between items that share the same name and concept (a work), but not the same expression of the content (book, translation of book to another language, audiobook, adapted movie, etc). This is nothing new for experienced catalogers- the old rules AACR2 did just that. There is a core element set which defines the minimum information to describe an object and the rules leave it up to the individual library to decide how to display the information in the catalog.

    I think the radical change is that so many people have invested so much time learning the MARC format, which is a metadata schema that is not easy to learn and they are waiting to see when the big institutions will begin using RDA before they charge forward. Vendors will need to change their databases, people will need to be trained in the new structure. It is a costly business to change standards.

    As the point of the session was how institutions (whether academic programs teaching students or service libraries with working catalogers) were handling training, a representative of the Library of Congress shared that their training materials could be found on the Library of Congress Documentation for the RDA (Resource Description and Access) Test page. Among the other LOC educational podcasts on iTunesU is a set of five talks on RDA which introduce the listener to the history and context of RDA's development, the principles it is based on and the implications of its implementation. (You may need iTunes installed to access them, but it's free to download.)

    There was also a representative of Minitex who said that he broke down RDA training for copy catalogers by media. Those trainings are only in person, but if you make the trip, I'm sure they'd be able to include you. (Minitex also has a load of interesting trainings that are self paced or webinar based besides the RDA trainings.)

    Two things I'm excited about:
    1. Not having to be an expert on MARC. I did not take a cataloging course because it was so book focused and focused on MARC. I don't like MARC's lack of granularity (the more granular an item, the easier it is to create filters and to reuse data elements) and I don't like MARC's weird rules about where to put things based on where other things were put. Me, I'm a "describe the resources using a form and let the computer put the items into the proper fields" person. And I wanted to learn about describing all sorts of things, such as digital objects, multimedia objects, etc.
    2. Freeing the data. If the Library of Congress is paid for by the American people, then the data in the databases should be easily accessed, easily harvested and easily re-purposed. Either by other libraries or by people creating a database of their personal library. Also, by describing an information object by its relationships, we are creating ways to link that object to other, related objects outside of the library. In this way a person's exploration is unbounded so that they link into the library's description of an object then could link out to a museum's related items.

    Thursday, June 30, 2011

    Context, Content, Contraptions

    Speakers: Paulette Hasier, DARPA/Advanced Resource Technologies Inc.
    Tim ...?
    Sponsored by LITA
    Sunday, June 26, 2011 - 10:30am - 12:00pm

    The environment in which we are operating:
    • Users still want the dual worlds of print and online
    • In 2010, Amazon announced Kindle book sales overtook hardcover book sales (105 ebooks to 100 print)
    • Gartner analysis reports 65% of users have paid for online content (music, articles, newspapers, etc)
    • In 2011, worldwide smartphone sales will pass worldwide PC sales in units
    The contraptions people use are not only to read books, but to enhance learning and provide interactive learning.
    • Scan on demand and digitization (this was a new feature of the scanning devices being sold by the vendors in the stacks)
    • iPads for reference
    • Loan of e-readers
    The Tim speaker (from Research Services?) shared the goals his library wanted in order to support their users in the field:
    • Make the subscription resources searchable through a mobile app. This entailed creating one app which held the logon info to authenticate the user for all the databases.
    • Make news updates dynamic. This was solved by making a web design which allowed the user to select the organizational view they wanted based on their science research. For example, the chemists would get the view with links to the chemistry databases and their departmental news, etc. The physicists would get the web view which supported their research needs. 
    Websites/ Places etc. mentioned in the session
    • Goddard Library: http://library.gsfc.nasa.gov/public/

      Monday, June 27, 2011

      Sunday at the Conference

      When I was a newbie ALA conference attendee I sweetly and innocently thought that I would be able to attend all the programs which interested me. Below is Sunday's list in order by time:

      Start: 8am
      Picture Books Go Digital
      Lost in Translation: the Emerging Technology Librarian and the New Technology

      Start: 10:30am
      Context, Content, and Contraptions
      It's All About Them: Developing Information Service with User Experience Design
      Linked Library Data Interest Group
      Drupal 101
      Pecha Kucha: Teens and Technology

      Start 1:30pm
      Creating Multimedia Metadata: Controlled Vocabularies Across Time and Space
      Science Programming 101: Presenting Excellent Science Programs in Your Library
      Top Technology Trends

      Start 3:30pm
      Sue Gardner of Wikipedia
      Consultants for Technical Services
      Teens Reading Digitally

      Start 5:30pm
      The Laugh's On Us: Paula Poundstone

      The stuff in bold is the stuff I actually got to, and yes, I slept right through the 8am sessions. I actually left Sue's talk early too, b/c I didn't want to miss Paula, but I miscalculated and found myself in the stacks at 4:30pm. I spoke to two separate vendors (one sells digitizing equipment, the other provides DSpace consulting services) before the lights were shut off and we were all escorted out to the shuttle buses at 5pm.

      I also actually chatted with people today (whoa!)- standing around the grab and go lunch table talking about upcoming conferences in the Caribbean, on the shuttle bus comparing impressions of Sue Gardner, and on a street car, where I met my doppelganger: another metadata librarian student who likes aquariums and attended the same morning session I did!

      I have lost any naivety about being able to see it all primarily because one cannot power walk 5 city blocks in heels. I can only hope people who attended those other sessions will blog about what they learned.

      Sunday, June 26, 2011

      Sunday Morning is a Time to Rest

      I've been at the ALA Annual Conference in New Orleans for a few days now and my body still hasn't realized that going to bed at 12:30am is a bad thing. (It thinks it is just 10:30pm). What with the aerobic workout that is the Convention Center, I need more rest. Who thought it was a logistically good idea to place thematically related sessions back to back and on opposite ends of a 5 city block long building with three floors? I should have paid more attention to Connie Willis's short story, At the Rialto.

      My partner back home keeps telling me I need to go see the sights, but today Sue Gardner of Wikipedia will be speaking and I want to hear what she has to say to a profession of people who spent the better part of 5 years criticizing Wikipedia as "un-authoritative." (Which I think only raised the awareness of citing your sources among the "amateur" contributors and quickly became a non-issue.) If only librarians had thought of Wikipedia! But I think that has to do with the profession's opinion that we collect and organize, we do not create.

      Anyhoo- I did find the NOMA and the Edgar Degas House and plan to make side trips either Monday or Tuesday before I leave. And I had GREAT Indian food (a dal, aloo gobi, naan and basmati rice) at Salt and Pepper which has a mix for vegetarians and omnivores.