Observe, Analyze... Blog: info 624

Saturday, October 22, 2011

Info Retrieval Systems: Challenges on Organizing the Web

The professor asked for our thoughts on curated collections versus automatic collections (such as those built by collecting websites via web crawlers) and on the challenges of a changing Web (1.0, 2.0, 3.0) for automated information retrieval. I mentioned my experience interning with the Internet Public Library.

The Internet Public Library has a virtual collection of web pages and those items are also prone to the same issues of collection management as physical collections through changing content or even disappearing content. One of the issues we handled on the IPL was correcting dead links. The IPL had an automatic dead link resolver which found dead links and tried proposing corrections, but sometimes we humans had to search for the new URL for a resource.
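To make the dead link idea concrete, here is a minimal sketch in Python. It is not the IPL's actual tool, which I never saw the inside of. I am assuming a link counts as dead when the request fails or returns a 4xx/5xx status, and that "proposing a correction" means looking the URL up in a table of known moves. The function names and the known_moves table are my own inventions for illustration.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if the request fails."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # No answer at all: DNS failure, refused connection, timeout.
        return None


def find_dead_links(urls, get_status=http_status):
    """Return the URLs whose status is missing or is 400 and above."""
    dead = []
    for url in urls:
        status = get_status(url)
        if status is None or status >= 400:
            dead.append(url)
    return dead


def propose_correction(dead_url, known_moves):
    """Look up a replacement URL in a hypothetical table of known moves.

    Returns None when there is no suggestion, which is the case where
    a human has to go searching for the new URL.
    """
    return known_moves.get(dead_url)
```

Passing get_status as a parameter means the checker can be tried out with a plain dictionary of made-up statuses instead of real network requests.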

A more pressing point about the difference between curated and automated collections is the requirement by curated collection policy to establish the authoritativeness of the source when determining relevancy to an information need. This was another thing that we did a lot as IPL interns. We looked at the authoritativeness of a source by reading the About Us page and seeing what sorts of relationships a person had with an established organization or professional association.

Unethical search engine optimizers are only looking to increase page rank for a site. One of the things that drove me nuts about sites submitted to the IPL was the page that was obviously created to drive traffic to the ads on it. A person would create a website/blog about a given topic, like "Novelty Cake Pans" (this is a real site submission which I rejected), claim that the site was authoritative on the topic, and then cover it in ads. So the content is created to drive people to the product. These sorts of sites were submitted to the IPL for the link ranking, not because they were authoritative on the subject.

If one of the bragging rights of a search engine is the number of pages indexed, then we know that pages are not going to be excluded from indexing based on authoritativeness. A library's job is exactly to exclude "resources" which do not follow an editorial policy (such as having verifiable data).

As far as the "Semantic Web" goes, it seems to me that one of the fundamental ideas is having computer systems communicate with each other, using processes that don't involve human intervention. This means that IR and search move out of the realm of finding text results for a query entered in text, relying more on data about data (metadata). The challenges are determining what other kinds of data will need to be queried and how the machines involved will do so.

After having taken the Content Representation (INFO 622) and Metadata and Resource Description (INFO 662) courses, I am so tempted to think that moving toward the semantic web will require a huge initial human intervention. I think about the case of the Library of Congress and other national libraries moving toward the Resource Description and Access metadata scheme so that huge links between individual components of information can be made. This is how I understand the semantic web.

The point is that after that human intervention in organizing data, whether by putting it in definitional tags like <resume></resume> or by taking thesauri and turning them into ontologies, computers will be able to query this type of metadata to give better results for a person's text query.
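As a rough illustration of what querying tagged data could look like, here is a small Python sketch using the standard library's XML parser. Only the <resume> tag comes from my example above; the <people>, <name> and <skill> tags, the sample records, and the function name are all invented for illustration and do not follow any real metadata scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every tag except <resume> is made up for this sketch.
SAMPLE = """
<people>
  <resume>
    <name>Ada Example</name>
    <skill>cataloging</skill>
    <skill>metadata</skill>
  </resume>
  <resume>
    <name>Ben Example</name>
    <skill>web design</skill>
  </resume>
</people>
""".strip()


def names_with_skill(xml_text, skill):
    """Return the names on every resume that lists the given skill."""
    root = ET.fromstring(xml_text)
    wanted = skill.strip().lower()
    matches = []
    for resume in root.iter("resume"):
        # Compare skills case-insensitively, ignoring empty elements.
        skills = [s.text.strip().lower() for s in resume.findall("skill") if s.text]
        if wanted in skills:
            matches.append(resume.findtext("name"))
    return matches


print(names_with_skill(SAMPLE, "metadata"))
```

The query only works because a human decided in advance what counts as a resume, a name and a skill, which is the up-front intervention I mean.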

I'm also interested in ways the semantic web can harness non-text resources such as numerical data/statistics, maps and coordinates, video, audio and still images. Imagine making a text-based query and getting data results that the IR system allows the user to manipulate. I remember answering questions on the IPL when a lady wanted statistical information about children of a certain age in US households. The census information that I could find collapsed ages into ranges (e.g., 1-5, 6-11...), but the user wanted one specific age range. In this instance, having access to the original numbers and being able to manipulate them would have made her day. This is what I think the semantic web should be able to do.
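Here is a tiny Python sketch of what I mean by manipulating the original numbers. It assumes the system exposes a count for each single year of age. The counts below are made up for illustration and are not census figures, and the function name is my own.

```python
def count_in_range(counts_by_age, low, high):
    """Sum the counts for every age from low to high, inclusive."""
    return sum(n for age, n in counts_by_age.items() if low <= age <= high)


# Made-up counts per single year of age (NOT real census data).
counts = {1: 10, 2: 12, 3: 9, 4: 11, 5: 8, 6: 7, 7: 13}

# The published table might only offer the fixed 1-5 range...
published = count_in_range(counts, 1, 5)   # 10 + 12 + 9 + 11 + 8 = 50

# ...but with the underlying numbers the user can pick any range she likes.
custom = count_in_range(counts, 4, 7)      # 11 + 8 + 7 + 13 = 39

print(published, custom)
```

Once the ages are collapsed into ranges, the 4-7 figure cannot be recovered, which is why access to the unaggregated data matters.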

The biggest challenge, I still think, is that no matter what, there will still be people trying to change the game in their favor. One day XML tags may be standard in web pages, but people will still spam the content within the tags or the tags themselves in order to boost their rankings. I hope there will be a differentiation between authoritative sources and non-authoritative sources and that people will be able to select which set to search for information, but that both will still be equally valid. (On the one hand, it's nice to see research and cited studies; on the other hand, sometimes a person just wants to read the reviews of other "people on the street" regarding a product.)

Follow up: I expressed my interest in how the web can organize non-textual information in this post and wanted to share my term paper on indexing video.

Friday, September 23, 2011

Information Retrieval Systems: thoughts on Boolean Search

So I'm taking a challenging (for me) course this term with lots of great subjects for thought. This week we are looking at Boolean Search (AND, OR, NOT statements). The prof wanted to know if we used it and in what situations we don't use it.

I said that I use Boolean searching in a library catalog or database for things that I know exist, and keyword relevancy searching for things that may or may not exist. Boolean searching is difficult when there is some doubt about whether a document exists, or about the appropriate terms to use.
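For anyone who has not seen how AND, OR and NOT behave, here is a minimal Python sketch of Boolean search over a tiny inverted index. The three "catalog records" and the function names are invented for illustration, and a real catalog does far more (field searching, stemming, phrase matching).

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_and(index, a, b):
    """Documents containing both terms."""
    return index.get(a, set()) & index.get(b, set())


def search_or(index, a, b):
    """Documents containing either term."""
    return index.get(a, set()) | index.get(b, set())


def search_not(index, a, b):
    """Documents containing a but not b."""
    return index.get(a, set()) - index.get(b, set())


# Hypothetical catalog records, titles only.
docs = {
    1: "little house on the prairie",
    2: "prairie fires a biography",
    3: "the house of mirth",
}
index = build_index(docs)

print(search_and(index, "house", "prairie"))   # {1}
print(search_or(index, "house", "prairie"))    # {1, 2, 3}
print(search_not(index, "house", "prairie"))   # {3}
```

Notice that every query returns an exact set: a misspelled or wrong term simply returns nothing, which is why Boolean searching is hard when you are unsure of the terms.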

For example, just this week I had a library patron who wanted "the book by Laura Ingels about the demise of culture with the catchy funny title. She's a national commentator and was just interviewed about the book." It turned out that using Google's keyword search "laura funny culture book" (and not a Boolean search in the catalog) was the perfect way to find the book, which was by Laura Ingraham and titled Of Thee I Zing. The Google result which answered the query was an Amazon page of the book which (I guess) pulled together Laura from the author name, culture from the title, and funny from the reviews.

I also looked at another of the prof's questions, which was about controlled vocabulary and expecting users to know what they are looking for (at least in terms of the specific words they use to search). I think it's fair to expect some level of responsibility on the part of the user for learning the vocabulary of their search need. So if a user is searching a specific microcosm of information (say a health database), I have no problem recommending that the user become as familiar as possible with the language of her search (such as by browsing the controlled vocabulary/thesaurus). That way she can consciously craft a search using general or specific terms, preferred terms or natural language, and understand how the choice of terms will affect the types and relevance of the documents returned.

However, I especially like Saracevic's third powerful idea about information science, that of "interaction, enabling direct exchanges and feedback between systems and people engaged in IR processes" (p. 1052). If the system is designed for interaction, then the user might be able to define more precise queries through computer prompting of related keywords. Sort of a brainstorming session with the computer.
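A rough Python sketch of that kind of prompting, assuming the system has a thesaurus that links what the user typed to a preferred term and some related terms. The thesaurus entries, their structure, and the function name are all hypothetical and not taken from any real controlled vocabulary.

```python
# Hypothetical mini-thesaurus: entry term -> preferred term and related terms.
THESAURUS = {
    "heart attack": {
        "preferred": "myocardial infarction",
        "related": ["chest pain", "coronary disease"],
    },
    "teens": {
        "preferred": "adolescents",
        "related": ["young adults"],
    },
}


def suggest_terms(query):
    """Return the preferred term first, then related terms, or [] if unknown."""
    entry = THESAURUS.get(query.strip().lower())
    if entry is None:
        return []
    return [entry["preferred"]] + entry["related"]


print(suggest_terms("Heart attack"))
```

The user still chooses which suggestion to follow, so the exchange stays a conversation and not an automatic rewrite of the query.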

Saracevic, T. (1999). "Information science." Journal of the American Society for Information Science, 50(12), 1051-1063.