Saturday, October 22, 2011

Info Retrieval Systems: Challenges on Organizing the Web

The professor asked for our thoughts on curated collections versus automatic collections (such as those built by web crawlers), and on the challenges a changing Web (1.0, 2.0, 3.0) poses for automated information retrieval. I mentioned my experience interning with the Internet Public Library.

The Internet Public Library maintains a virtual collection of web pages, and those items are prone to the same collection-management issues as physical items: content changes, or disappears entirely. One of the issues we handled at the IPL was correcting dead links. The IPL had an automatic dead-link resolver that found dead links and proposed corrections, but sometimes we humans had to search for a resource's new URL.
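The first pass of a dead-link checker like that can be sketched in a few lines. This is a hypothetical illustration, not the IPL's actual code; the names `check_links` and `fetch_status` are mine, and the fetcher is injected so the logic can be shown without touching the network.

```python
# Hypothetical sketch of a dead-link checker's first pass: all names
# here (check_links, fetch_status) are illustrative, not the IPL's code.

def check_links(urls, fetch_status):
    """Split urls into (live, dead) using fetch_status(url) -> HTTP code."""
    live, dead = [], []
    for url in urls:
        try:
            status = fetch_status(url)
        except OSError:  # DNS failure, connection refused, timeout...
            dead.append(url)
            continue
        (live if status < 400 else dead).append(url)
    return live, dead

# In production fetch_status might wrap urllib.request; here, a stub:
def demo_status(url):
    return {"http://a.example/": 200, "http://b.example/": 404}[url]

live, dead = check_links(["http://a.example/", "http://b.example/"], demo_status)
print(live, dead)  # ['http://a.example/'] ['http://b.example/']
```

The machine can flag the dead links this way, but, as above, finding where a resource *moved to* is the part that still took human searching.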

A more pressing difference between curated and automated collections is that curated collection policies require establishing the authoritativeness of a source when determining its relevance to an information need. This was another thing we did a lot as IPL interns: we assessed a source's authoritativeness by reading its About Us page and seeing what relationships the author had with established organizations or professional associations.

Unethical search engine optimizers are only looking to increase a site's ranking. One of the things that drove me nuts about sites submitted to the IPL was the page obviously created to drive traffic to its ads. A person would create a website or blog on a given topic, like "Novelty Cake Pans" (a real submission, which I rejected), claim the site was authoritative on the topic, and then cover it in ads. The content exists to drive people to the product. These sites were submitted to the IPL for the link ranking, not because they were authoritative on the subject.

If one of the bragging rights of a search engine is the number of pages indexed, then we know that pages are not going to be excluded from indexing based on authoritativeness. A library's job is exactly to exclude "resources" which do not follow an editorial policy (such as having verifiable data).

As far as the "Semantic Web" goes, it seems to me that one of the fundamental ideas is having computer systems communicate with each other through processes that don't involve human intervention. This means that IR and search move out of the realm of finding text results for a query entered as text and rely more on data about data (metadata). The challenges are determining what other kinds of data will need to be queried and how the machines involved will do so.

After having taken the Content Representation (INFO 622) and Metadata and Resource Description (INFO 662) courses, I am tempted to think that moving toward the semantic web will require a huge initial human intervention. I think of the Library of Congress and other national libraries moving toward the Resource Description and Access metadata scheme so that rich links can be made between individual components of information. This is how I understand the semantic web.

The point is that after that human intervention in organizing data, whether wrapping it in definitional tags like <resume></resume> or turning thesauri into ontologies, computers will be able to query this metadata and give better results for a person's text query.
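To make that concrete: once content carries definitional tags, a machine can query structure instead of matching raw text. A minimal sketch, with entirely hypothetical markup (the `<resume>`, `<name>`, and `<skill>` tags are just for illustration):

```python
# Sketch: querying definitional tags instead of matching raw text.
# The markup here is hypothetical, invented for illustration.
import xml.etree.ElementTree as ET

doc = """<page>
  <resume><name>Jane Doe</name><skill>cataloging</skill></resume>
  <article>Unrelated prose about cake pans.</article>
</page>"""

tree = ET.fromstring(doc)
# A "find every resume mentioning cataloging" query over the metadata:
matches = [r.findtext("name")
           for r in tree.iter("resume")
           if any(s.text == "cataloging" for s in r.iter("skill"))]
print(matches)  # ['Jane Doe']
```

A plain text search for "cataloging" would also surface the cake-pan article if that word happened to appear in it; the tagged query returns only items *asserted* to be resumes.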

I'm also interested in ways the semantic web can harness non-text resources such as numerical data and statistics, maps and coordinates, video, audio, and still images. Imagine making a text-based query and getting data results that the IR system lets the user manipulate. I remember answering a question on the IPL from a woman who wanted statistics about children of a certain age in US households. The census information I could find collapsed ages into ranges (e.g., 1-5, 6-11...), but she wanted one specific age range. In that instance, having access to the original numbers and being able to manipulate them would have made her day. This is what I think the semantic web should be able to do.
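The manipulation she needed is trivial once the raw numbers are exposed as data rather than pre-collapsed brackets. A sketch with made-up counts (the figures below are invented, not census data):

```python
# Sketch: with raw per-age counts exposed (these numbers are made up),
# a user can build any age range, not just the published brackets.
raw_counts = {age: 100 + age for age in range(0, 12)}  # fabricated demo data

def count_in_range(counts, low, high):
    """Sum the counts for ages low..high inclusive."""
    return sum(n for age, n in counts.items() if low <= age <= high)

print(count_in_range(raw_counts, 3, 4))  # 207, i.e. 103 + 104
```

The published table forces everyone into the same 1-5 bracket; the data-level query answers "ages 3-4" or any other slice just as easily.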

The biggest challenge, I still think, is that no matter what, there will always be people trying to change the game in their favor. One day XML tags may be standard in web pages, but people will still spam the content within the tags, or the tags themselves, to boost their rankings. I hope there will be a differentiation between authoritative and non-authoritative sources, that people will be able to choose which set to search, and that both will remain equally valid. (On the one hand, it's nice to see research and cited studies; on the other, sometimes a person just wants to read reviews from other "people on the street" regarding a product.)

Follow up: I expressed my interest in how the web can organize non-textual information in this post and wanted to share my term paper on indexing video.