LiFiDeA

Jinyoung Kim on Information Retrieval and Personal Information Management

Why care about PIM?

A few days ago I had dinner with an executive of a big internet company. When I mentioned that PIM is my research interest, he told me that it is not an interesting topic in any sense and that I had better give it another thought.

To counter his argument, and as a sequel to my earlier post on the major concepts of PIM, I want to write about why one should care about PIM, from the perspectives of both business and research.

From a business perspective, PIM matters because it is the best way to understand the user, and the business model of most internet companies depends critically on that understanding.

It’s not a coincidence that most of Google’s core services manage personal information such as emails, schedules and so on. Having that many people manage their information with Google is its competitive advantage. Bing may catch up with Google in terms of index size and ranking algorithm, yet it cannot match Google without all the user information Google has access to. It’s like comparing a bank teller with a private banker.

From a researcher’s view, PIM provides a venue for interdisciplinary research where databases, IR, HCI and many other fields in CS should make a combined effort. It starts with storing heterogeneous information items, continues with choosing relevant items based on the user’s expression of an information need, and ends with presenting the result effectively. The challenge is that a system should combine these steps into one seamless experience.

A perhaps less obvious reason is that one can easily experiment with PIM research by solving one’s own PIM problem. You cannot (and may not want to) build a web-scale search system, yet you can index your own documents and see whether you can do better than existing solutions. Once you make it work for your own problem, you can probably persuade the people around you to try it on theirs as well.

Although my previous work builds on the idea of a reusable test collection for PIM (desktop search) research, I believe that PIM research should be done with real users, and that the best way is to build something useful for yourself and later see how it works for other people.

Filed under: Category, Personal Information Management

CIKM – Overview

I just attended CIKM in Hong Kong. The conference was very successful. There were many interesting papers and the organizers did a good job. (I also liked the food here much better than at SIGIR in Boston ;)

Let me post about a couple of interesting trends and works. I would summarize the big picture as ‘a strenuous effort for convergence’. Although the CIKM conference itself aims to provide a venue for collaboration between IR, DB and KM researchers, I could observe many efforts toward a more unified approach.

The keynote speech on the first day was about a DBMS architecture in which DB and IR components are tightly integrated. The speaker, Kyu-Young Whang, asserted that major concepts of IR such as the inverted index and relevance matching operators should be embedded in the DBMS as core components.

Given that the full-text search capability of current RDBMSs is implemented as an extension layer on top of traditional DBMS components (e.g. B-tree index, query optimizer, and so on), tighter integration will help execute queries efficiently when traditional RDBMS data types (e.g. string, number and time) have to be combined with full-text data types.

He demonstrated this vision by building a large-scale web search engine which can handle filtering operations such as site-specific search efficiently. Although the evaluation part was somewhat questionable, I agree with the main point he made.
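To make the site-specific search example concrete, here is a minimal sketch of how a structured filter and a keyword match can be evaluated together as set intersections over two indexes. It is only a toy illustration and not the architecture presented in the keynote; the page data, the `example.org` and `example.com` site names, and the function names are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical pages: a structured attribute (site) plus full text.
pages = {
    1: {"site": "example.org", "text": "database index tuning"},
    2: {"site": "example.org", "text": "inverted index for retrieval"},
    3: {"site": "example.com", "text": "inverted index compression"},
}

def build_indexes(pages):
    """Build an inverted index over terms and a second index over the site attribute."""
    term_index = defaultdict(set)
    site_index = defaultdict(set)
    for page_id, page in pages.items():
        site_index[page["site"]].add(page_id)
        for term in page["text"].split():
            term_index[term].add(page_id)
    return term_index, site_index

def site_search(query, site, term_index, site_index):
    """Return ids of pages on `site` that contain every query term."""
    result = set(site_index.get(site, set()))
    for term in query.split():
        result &= term_index.get(term, set())
    return sorted(result)

term_index, site_index = build_indexes(pages)
print(site_search("inverted index", "example.org", term_index, site_index))
```

Because the site filter is just another posting list here, it can be intersected in the same pass as the keyword terms instead of being applied as a separate step afterwards.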

There was also a panel discussion titled ‘Information Extraction Meets Relational Databases’, which started with the thought-provoking question ‘Where would you spend your next million dollars to solve this challenge?’. Each panelist represented his own field of interest: Andrei Broder (Yahoo!) for Web Search, Edward Chung (Google China) for Data Mining and so on. During the discussion, Andrei used the illustrative example of answering queries like ‘Brett Favre’ (a football player) to make the point that presenting automatically extracted information can help users fulfill their information needs more effectively.

While the overall conclusion of the discussion was nothing new (there should be a convergence for the benefit of users), it was interesting to see how each subfield is limited and how findings in a related field can address the very issue. For instance, since information extraction works reasonably well only for limited domains, it was suggested that aggregated query logs can provide useful clues on which object types or properties to focus on as extraction targets. Getting back to the example query ‘Brett Favre’, the extracted information can in turn improve search relevance for queries related to a specific object type.

In spite of these efforts for convergence, I also got the impression that researchers are not quite ready for such a trend. For one thing, I was surprised that none of the XML keyword retrieval papers from database people cited my recent work, although they even used the same collection (IMDB)! Anyway, I believe that we are headed in the right direction and that conferences like CIKM can play a crucial role on our way forward.

I plan to write a follow-up post on the papers that particularly drew my attention.

Filed under: Category

Principles of Keeping Personal Information Effectively

PIM is arguably as much about the user as it is about the system, since overall effectiveness depends critically on the right behavior of the user as well as of the system, or on the combination of both. Therefore, as a PIM researcher, I often think about what the best practices of doing PIM are, so that the system can support and provide incentives for these practices.

As a starting point, these are things I found over the years, mostly on the ‘keeping’ side of PIM:

Keep only when it reduces access cost

When a large portion of information is a few googlings away, it doesn’t make sense to bear the cost of keeping it yourself. Keep it when re-finding is relatively costly (and uncertain).

Check what you have before finding or creating

In many cases we get so lazy that we don’t check what we kept before, and instead find or create it from scratch. This can cost extra effort, especially when we have already put some effort into that item. For instance, I often find, print and read a paper only to discover that I have already read it and have an annotated hardcopy somewhere in my paper archive.

Keep with findability in mind

Attach a cue so that it can be easily found later. Add a tag or use searchable keywords so that it can be found again efficiently when it is needed. If you can’t find it later, it’s as good as not having it.

Simplify decision when keeping

I hate using folders to organize files because I have to make multiple decisions and the chance of finding an item later depends on each of them. Using a simpler hierarchy or labels reduces the cognitive burden and improves findability.

Keep a single reference

The DRY (Don’t Repeat Yourself) principle holds true for personal information as well. If you have to look in many places to find something (e.g. your plan for project A), you will be confused each time and end up with many different versions of the same information.

Share the information whenever possible

Sharing is often the best way of keeping, since it motivates you to complete your thoughts and strengthens your memory, and the feedback you get from other people helps you develop those thoughts further. For instance, when I find an interesting paper, I sometimes send an email with comments to the author rather than write a summary and keep it to myself.

From the many points I made above, one lesson is that effective finding is key to effective keeping, while only the other direction seems obvious (e.g. more tags, better findability). The possibility of easily finding what we have enables us to keep a smaller amount of things with less effort.

In the following post, I’ll write about how I tried to address each point in LiFiDeA, the PIM system prototype I’m working on. I’m also curious about what your own wisdom is in personal information management.

Filed under: Personal Information Management

Total Recall — A Future of PIM

The people behind the MyLifeBits project, Gordon Bell and Jim Gemmell, just published a book, ‘Total Recall’, on their effort to create a digitized version of their memory (e-memory, as they call it). Their main point is that most people can (and will) keep an almost complete record of what they see, hear and experience, and that this record will not only serve as a perfect memory (hence the title ‘Total Recall’) but also help people improve their lives by finding and remedying bad patterns in the collected data.

In addition to providing a detailed account of their vision, they also suggest ways people can start their own MyLifeBits project using various kinds of capturing, storage and search solutions.

They also delve into subtle issues of maintaining e-memory, like secure backup solutions and privacy concerns. As for future directions, they hope to capture more data with less effort by using devices like a mirror equipped with a camera, which would record the daily change in one’s appearance. I have kept my eyes on the project for years, and it’s certainly exciting that this book will get many people interested in what PIM can potentially do for them.

From an IR perspective, this opens up a new line of research on personalized and context-sensitive retrieval of information, since this real-time collection of diverse personal information should enable a better understanding of the user. Also, existing retrieval models should evolve to deal with this heterogeneous collection where documents make up only a small fraction. Being able to infer what kind of item is requested will be another interesting challenge.

Taking a more practical view, while this is certainly a compelling long-term vision, it is not clear to me whether it can have an immediate appeal to most people, who might be more keen on finding a better way of dealing with existing information than on having many new types of information available.

Another related issue is that their focus is still on the ‘keeping’ side of personal information rather than the ‘finding’ and ‘organizing’ sides. The PIM book I mentioned earlier introduces the MyLifeBits project as a ‘Save Everything’ approach to personal information, which reconfirms this point. According to the book, MyLifeBits uses a SQL database for data storage and a hand-crafted hierarchical classification and labeling scheme to support scoped search and browsing, which is not as advanced as its keeping mechanism.

While I agree that keeping is the starting point of PIM, I think we need more effort on the other sides. Keeping will be motivated and sustainable only when the stored information can be retrieved effectively and used for other purposes.

Filed under: Personal Information Management

CIKM 2009 Interesting Papers

More coverage of CIKM 2009. Here is the list of papers that drew my attention, regarding structured document retrieval, query modeling and search personalization. I tried to link the PDF version of each paper whenever possible.

Among new topics, Characterizing and Predicting Search Engine Switching Behavior by Ryen White and Susan Dumais seemed most interesting; it reads like a sequel to Ryen’s earlier work, Enhancing Web Search by Promoting Multiple Search Engine Use.

Structured Document Retrieval

A Framework for Semantic Link Discovery over Relational Data
Oktie Hassanzadeh (University of Toronto), Anastasios Kementsietsidis (IBM T.J. Watson Research Center), Lipyeow Lim (IBM T.J. Watson Research Center), Renee J. Miller (University of Toronto), Min Wang (IBM T.J. Watson Research Center)

Effective XML Content and Structure Retrieval with Relevance Ranking
Xiping Liu (Jiangxi University of Finance and Economics), Changxuan Wan (Jiangxi University of Finance and Economics), Lei Chen (Hong Kong University of Science and Technology)

Language-model-based Ranking for Queries on RDF-Graphs
Shady Elbassuoni (Max-Planck Institute for Informatics), Maya Ramanath (Max-Planck Institute for Informatics), Ralf Schenkel (Max-Planck Institute for Informatics), Marcin Sydow (Polish-Japanese Institute of Information Technology), Gerhard Weikum (Max-Planck Institute for Informatics)

Learning document aboutness from implicit user feedback and document structure
Deepa Paranjpe (Yahoo! Labs)

Query Modeling

Semi-Supervised Learning of Semantic Classes for Query Understanding — from the Web and for the Web
Ye-Yi Wang (Microsoft Corporation), Raphael Hoffmann (University of Washington), Xiao Li (Microsoft Corporation), Jakub Szymanski (Microsoft Corporation)

Product Query Classification
Dou Shen (Microsoft), Ying Li (Microsoft), Xiao Li (Microsoft Research), Dengyong Zhou (Microsoft Research)

Personalization

Adaptive Relevance Feedback in Information Retrieval
Yuanhua Lv (University of Illinois at Urbana-Champaign), ChengXiang Zhai (University of Illinois at Urbana-Champaign)

PQC: Personalized Query Classification
Bin Cao (The Hong Kong University of Science and Technology), Jian-Tao Sun (Microsoft Research Asia), Evan Wei Xiang (The Hong Kong University of Science and Technology), Derek Hao Hu (The Hong Kong University of Science and Technology), Qiang Yang (The Hong Kong University of Science and Technology), Zheng Chen (Microsoft Research Asia)

Personalized Social Search Based on the User’s Social Network
David Carmel (IBM Research Lab in Haifa), Naama Zwerdling (IBM Research Lab in Haifa), Ido Guy (IBM Research Lab in Haifa), Shila Ofek-Koifman (IBM Research Lab in Haifa), Nadav Har’el (IBM Research Lab in Haifa), Inbal Ronen (IBM Research Lab in Haifa), Erel Uziel (IBM Research Lab in Haifa), Sivan Yogev (IBM Research Lab in Haifa), Sergey Chernov (Leibniz University)

Novel Topics

Characterizing and Predicting Search Engine Switching Behavior
Ryen W White (Microsoft Research), Susan T Dumais (Microsoft Research)

Improving Search Engines Using Human Computation Games
Hao Ma (The Chinese University of Hong Kong), Raman Chandrasekar (Microsoft Research), Chris Quirk (Microsoft Research), Abhishek Gupta (Georgia Institute of Technology)

Beyond Hyperlinks: Organizing Information Footprints in Search Logs to Support Effective Browsing
Xuanhui Wang (UIUC), Bin Tan (UIUC), Azadeh Shakery (University of Tehran), ChengXiang Zhai (UIUC)

Clustering and Exploring Search Results using Timeline Constructions
Omar Alonso (University of California, Davis), Michael Gertz (University of Heidelberg), Ricardo Baeza-Yates (Yahoo! Research)

Mashup-based Information Retrieval for Domain Experts
Anand Ranganathan (IBM TJ Watson Research Center), Anton Riabov (IBM TJ Watson Research Center), Octavian Udrea (IBM TJ Watson Research Center)

ETC

Usage Based Effectiveness Measures
Leif Azzopardi (University of Glasgow)

Expected Reciprocal Rank for Graded Relevance
Olivier Chapelle (Yahoo! Labs), Donald Metzler (Yahoo! Labs), Ya Zhang (Yahoo! Labs), Pierre Grinspan (Google Inc)

Filed under: Information Retrieval

CIKM 2009 Statistics of Titles and Institutions

CIKM 2009 (The 18th ACM Conference on Information and Knowledge Management) will be an interesting venue that brings together researchers in the DB, IR and KM fields. As an IR researcher whose interests span structured document retrieval and personal information (knowledge) management, I find it possibly the most relevant conference.

I got the following word statistics from paper titles. As you can imagine, the top terms show that this conference is primarily focused on the retrieval of data and information using many statistical techniques. Keywords such as ‘efficient’, ‘structure’ and ‘xml’ represent the database side, while ‘web’, ‘ranking’, ‘relevance’, ‘social’ and ‘feedback’ seem to be more on the IR side. KM in this conference may be related to ‘semantic’, ‘extraction’, ‘graph’ and ‘mining’.

query : 46
search : 39
web : 36
ranking : 35
retrieval : 27
data : 26
learning : 25
information : 18
model : 16
classification : 16
clustering : 15
mining : 15
relevance : 14
efficient : 14
topic : 14
document : 14
social : 13
graph : 12
semantic : 11
domain : 11
structure : 10
xml : 10
extraction : 10
feedback : 10
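Counts of this kind can be produced with a few lines of Python. The following is a minimal sketch, assuming simple lowercase tokenization and a small hand-picked stopword list; both are assumptions for illustration and not necessarily the procedure behind the numbers above. The three sample titles are CIKM 2009 papers listed elsewhere on this blog.

```python
import re
from collections import Counter

# Hypothetical stopword list; the list used for the statistics above is not known.
STOPWORDS = {"a", "an", "and", "for", "from", "in", "of", "on", "the", "to", "using", "with"}

def title_term_counts(titles):
    """Count lowercase word occurrences across paper titles, skipping stopwords."""
    counts = Counter()
    for title in titles:
        for token in re.findall(r"[a-z]+", title.lower()):
            if token not in STOPWORDS:
                counts[token] += 1
    return counts

titles = [
    "Adaptive Relevance Feedback in Information Retrieval",
    "PQC: Personalized Query Classification",
    "Product Query Classification",
]
counts = title_term_counts(titles)
for term, n in counts.most_common(2):
    print(f"{term} : {n}")
```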

The next statistics, on the institutions of authors, show that corporate research labs are dominant here, similarly to SIGIR 2009. Chinese labs and universities are particularly strong, reminding us that this conference is held in Hong Kong. Also, IBM Research seems to have more papers than in IR-only conferences.

(Yahoo! Labs) : 52
(Microsoft Research Asia) : 32
(IBM China Research Lab) : 27
(Yahoo! Research) : 24
(The Chinese University of Hong Kong) : 20
(Tsinghua University) : 20
(Microsoft Research) : 19
(Peking University) : 16
(Yahoo! Inc.) : 14
(University of Glasgow) : 14
(Pennsylvania State University) : 14
(University of Illinois at Urbana-Champaign) : 13
(IBM Research) : 12
(University of Amsterdam) : 12
(IBM Research Lab in Haifa) : 11
(Nanyang Technological University) : 10
(IBM T.J. Watson Research Center) : 9
(University of Massachusetts Amherst) : 9
(University of Waterloo) : 9

Yet the numbers above turn out to be somewhat deceptive if you look at the following statistics based only on the institution of the first author. Academic institutions strike back here. Perhaps many of the industry papers are written by interns, or they simply have more co-authors.

(Yahoo! Labs) : 12
(The Chinese University of Hong Kong) : 7
(University of Illinois at Urbana-Champaign) : 6
(IBM China Research Lab) : 6
(University of Glasgow) : 6
(Pennsylvania State University) : 6
(Peking University) : 6
(Microsoft Research) : 5
(Tsinghua University) : 5
(University of Waterloo) : 4
(Microsoft Research Asia) : 4
(University of Amsterdam) : 4
(Nanyang Technological University) : 4
(University of Kansas) : 3
(University of Massachusetts Amherst) : 3
(Drexel University) : 3
(National Taiwan University) : 3
(Microsoft) : 3
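The difference between the two tables comes down to how affiliations are tallied. Here is a minimal sketch, assuming each paper is stored as an ordered list of author affiliations with the first author first; this data layout and the function names are assumptions for illustration. The two sample papers are taken from the CIKM 2009 listing on this blog.

```python
from collections import Counter

# Each paper is an ordered list of author affiliations (first author first).
papers = [
    # Expected Reciprocal Rank for Graded Relevance
    ["Yahoo! Labs", "Yahoo! Labs", "Yahoo! Labs", "Google Inc"],
    # Improving Search Engines Using Human Computation Games
    ["The Chinese University of Hong Kong", "Microsoft Research",
     "Microsoft Research", "Georgia Institute of Technology"],
]

def count_all_authors(papers):
    """Every author contributes one count to his or her institution."""
    return Counter(affiliation for paper in papers for affiliation in paper)

def count_first_authors(papers):
    """Only the first author of each paper is counted."""
    return Counter(paper[0] for paper in papers if paper)

all_counts = count_all_authors(papers)
first_counts = count_first_authors(papers)
print(all_counts["Microsoft Research"], first_counts["Microsoft Research"])
```

In this tiny sample an institution with two co-authors and no first author disappears from the second tally, which is the kind of shift visible between the two tables.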

Also check out the conference webpage for the full listing of papers.

Filed under: Information Retrieval

Book Review – Personal Information Management (1)

If you’re looking for an overview of the current status of PIM research, the book titled ‘Personal Information Management’ by William Jones and Jaime Teevan is definitely the best (and maybe only) choice.

I have known William Jones and his PIM research group at UW for a long time, and they have pioneered the research efforts on PIM. Jaime Teevan is also a renowned researcher in the field of re-finding and search personalization. In this and the following posts, I intend to provide my perspective on the key ideas of this book.

First off, it’s worth pointing out that they provide several working definitions regarding PIM such as:

  • Information Item : a packaging of information in a persistent form that can be acquired, created, viewed, stored and so on.
    (e.g. paper/electronic documents, emails, webpages, etc.)
  • Personal Information : the information a person keeps, about a person, experienced by a person, directed to a person. Personal information is substantiated in the form of information items.
  • Personal Space of Information (PSI) : personal information combines to form a single personal space of information, which is a collection of information items related to a person.

The definition of personal information management (PIM) naturally follows: the practice and study of the activities people perform to acquire, organize, maintain, retrieve, use and control the distribution of information items. In other words, PIM encompasses all the activities regarding one’s PSI. In addition to this definition centered on information items, they suggest major activities of PIM such as:

  • Finding / re-finding activities : move from need to information
    (e.g. I need to find a good restaurant for dinner)
  • Keeping activities : move from information to need
    (e.g. Where should I save the paper he sent me?)
  • Meta-level activities : evaluation, management, organization and making sense of the PSI itself

Among these activities, the authors point out that meta-level activities are often put off because they are important in the long run but are rarely urgent, while finding and keeping activities are constantly prompted by events in a typical day.

Overall, I think they did an excellent job of providing a structure by which one can think about the problem of PIM. I’ll often refer to these definitions in the future.

Filed under: Personal Information Management

Retrieval Experiments in Pseudo-desktop Collections

My paper ‘Retrieval Experiments in Pseudo-desktop Collections’ (co-authored with my advisor Bruce Croft) will be presented at CIKM 2009. It is about a new model of desktop search research, in which we introduce the ‘pseudo-desktop’, a simulated desktop collection composed of automatically gathered documents and generated queries. A method to validate the generated collection is suggested as well.

From the retrieval model perspective, we saw desktop search as a known-item search task over a semistructured document collection, since people usually look for what they already know of and each document on a desktop has metadata. For instance, an e-mail has sender and receiver fields in addition to the usual title and content fields. We presented an improved retrieval model based on PRM-S, which was introduced in my previous work.

I believe that the significance of this work is threefold. First off, it is an effort to bring more scientific rigor to desktop search (or searching personal information in general). People have studied desktop search for a long time, yet they mostly built their own systems and reported the results of user studies, which lack reproducibility. The pseudo-desktop can be a sharable data collection which addresses this problem: new researchers can test their algorithms against state-of-the-art baselines without building yet another desktop search engine.

Also, the experimental results show the value of simulation as a method in IR research. Simulated queries are not only free to get but also give total control over the parameters. In our experiments, using algorithmically generated queries with many different characteristics, we could gain insights into the performance of the tested retrieval methods. Another paper in SIGIR 2009 also demonstrated this value of simulated queries.
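To give a feel for what a simulated known-item query is, here is a minimal sketch: pick a target document, then sample query terms from it in proportion to term frequency. This is a simplified stand-in and not the generation procedure from the paper; the sample documents, the function name and the parameters are all assumptions for illustration.

```python
import random
from collections import Counter

def generate_known_item_query(docs, query_length=2, seed=0):
    """Pick a target document and sample query terms from it, weighted by term frequency."""
    rng = random.Random(seed)  # seeded so that the simulation is repeatable
    target_id = rng.choice(sorted(docs))
    tf = Counter(docs[target_id].split())
    terms = list(tf)
    weights = [tf[term] for term in terms]
    query = rng.choices(terms, weights=weights, k=query_length)
    return target_id, query

# Hypothetical desktop items.
docs = {
    "email_1": "meeting schedule for the project review meeting",
    "email_2": "draft of the workshop paper attached",
    "note_1": "reading list on desktop search evaluation",
}
target_id, query = generate_known_item_query(docs, query_length=2, seed=42)
print(target_id, query)
```

Since the target document is known for every generated query, retrieval methods can be compared by the rank at which they return that target, and parameters such as query length can be varied freely.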

Lastly, PRM-S, a novel retrieval model based on the mapping between query words and document structure, was found to be useful in noisy settings like e-mails (e.g. much word overlap between document fields) as well as in collections with clean structure like a movie database.

Recently, I’ve been working on the development of LiFiDeA, a prototype PIM system, with which I plan to compare the experimental results on pseudo-desktop and real desktop collections. I believe that this will provide the ultimate validation for our approach here.

Filed under: Information Retrieval, Personal Information Management

Personal Information Management (PIM) as a Science

For a long time, I have been thinking that better PIM will lead to a more productive and fulfilling life, since it is about supporting one’s brain in saving, processing and retrieving information, a brain which constantly suffers from overload in this internet age. This seemed like a very interesting problem which merits graduate-level study.

After I got here at UMass and started to study and do research on it, I noticed several characteristics of PIM as a topic of scientific inquiry, which are mostly due to its ‘personal’ nature. In other words, PIM is different from other research topics because it is deeply rooted in one’s practices and life.

The first thing is that PIM research should aim at principles beyond idiosyncrasies, or at a framework (system) that can accommodate these different needs and practices. While the system should learn from the user’s behavior, sometimes it needs to give the user incentives to follow the right path.

Another thing is that PIM research should overcome the difficulty of getting data. The first issue is that personal data is mostly private by nature and therefore hard to obtain and share. I tried to address this problem by creating a simulated collection with similar characteristics in my SIGIR workshop paper. The second issue is that personal data is much smaller in scale, which poses a big challenge for any supervised or semi-supervised learning approach.

In spite of these issues, I think it is a fascinating problem with huge impact, since the findings are immediately relevant to millions of people who suffer from information overload. Also, I like the fact that I can test the system on myself and verify the results with my own data before doing a user study, which enables real-time feedback and fast iteration.

Filed under: Personal Information Management

How would you search for your favorite movie?

How would you search for your favorite movie? I tried to answer this question in my paper ‘A Probabilistic Retrieval Model for Semistructured Data’, which was presented at ECIR’09 in Toulouse, France.

In this problem, we can assume that the collection is structured by many different fields (e.g. title, cast, genre, etc.) and that users may recall partial information on some aspects of the target item. People have either used an advanced search interface or let the user issue a structured query (e.g. XPath), which turns the problem into one of structured retrieval.

I wanted to see the problem differently. The first obvious thing was that people usually don’t want to use a multi-field search interface. Also, it’s beyond the capability of average users to formulate an XPath query (it’s hard even for me!). I thought it would be useful if users could issue a simple keyword query yet get the same effect as with those advanced techniques.

And a thought on typical querying behavior made me realize that we implicitly map each query term to some aspect of the item we are looking for. Let’s assume a user is trying to find the movie ‘French Kiss’ with partial information about its cast (‘meg ryan’) and genre (‘romance’). He or she may type ‘meg ryan romance’, and it is clear which aspect of the data (movie) the user meant by each query word. We can infer this mapping between query term and document field by Bayesian estimation (more details in the paper).

Given this observation, and taking into account that each aspect of information is encoded in a different XML element, it is natural that a ranking algorithm for this kind of document can benefit from this mapping between query words and document fields. The solution is to put a higher weight on the element which seems to be what the user intended. In the above example, the ‘cast’ element needs to be weighted higher for ‘meg ryan’, and the ‘genre’ element for ‘romance’.
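The idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the exact model from the paper: the field prior is uniform, the per-field document language model uses simple linear smoothing with an arbitrary weight `lam`, and the toy collection (of which only the first entry describes a real movie) is made up for the example.

```python
from collections import Counter

FIELDS = ("title", "cast", "genre")

# Toy collection. Only the first entry describes a real movie; the other two are made up.
DOCS = {
    "french_kiss": {"title": "french kiss", "cast": "meg ryan kevin kline", "genre": "romance comedy"},
    "made_up_1": {"title": "a romance in paris", "cast": "jane doe ryan smith", "genre": "drama"},
    "made_up_2": {"title": "city chase", "cast": "john roe", "genre": "action romance"},
}

def field_term_counts(docs):
    """Collection-level term counts for each field."""
    counts = {field: Counter() for field in FIELDS}
    for doc in docs.values():
        for field in FIELDS:
            counts[field].update(doc[field].split())
    return counts

def p_term_given_field(term, field, coll):
    """P(term | field), estimated from the whole collection."""
    total = sum(coll[field].values())
    return coll[field][term] / total if total else 0.0

def field_mapping(term, coll):
    """P(field | term) by Bayes' rule with a uniform field prior (an assumption)."""
    likelihood = {field: p_term_given_field(term, field, coll) for field in FIELDS}
    norm = sum(likelihood.values())
    if norm == 0:  # unseen term: fall back to a uniform mapping
        return {field: 1.0 / len(FIELDS) for field in FIELDS}
    return {field: p / norm for field, p in likelihood.items()}

def score(query, doc, coll, lam=0.1):
    """Product over query terms of mapping-weighted, smoothed field likelihoods."""
    result = 1.0
    for term in query.split():
        mapping = field_mapping(term, coll)
        term_prob = 0.0
        for field in FIELDS:
            tokens = doc[field].split()
            in_doc = tokens.count(term) / len(tokens) if tokens else 0.0
            smoothed = (1 - lam) * in_doc + lam * p_term_given_field(term, field, coll)
            term_prob += mapping[field] * smoothed
        result *= term_prob
    return result

def rank(query, docs):
    """Rank documents by score, best first."""
    coll = field_term_counts(docs)
    scored = [(name, score(query, doc, coll)) for name, doc in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

COLL = field_term_counts(DOCS)
print(field_mapping("romance", COLL))
print(rank("meg ryan romance", DOCS))
```

In this toy collection ‘romance’ is mapped mostly to the genre field and ‘meg’ and ‘ryan’ to the cast field, so a match in the intended field counts for more than a match elsewhere.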

This simple idea later turned out to improve retrieval performance significantly. The performance gain was more noticeable for collections with clear semantics (e.g. movie descriptions), since it was easier for the system to map each query word to the correct document field.

I’m currently working on applying this retrieval model to the desktop search problem, where the XML data are replaced with documents with metadata fields.

Filed under: Information Retrieval
