PeterGroves.com

Email Event Extraction

Posted: June 4, 2006

Information Extraction is the process of filling in the fields of a database from some form of text. For instance, a company called Burning Glass produces software that automatically files resumes submitted electronically to a company's human resources department. The relevant data for a number of fields, such as name, years of experience, skills, and education, are extracted and entered into a database. It is then much easier to search the database for the proper applicants for a position than if an employee had to manually read through every resume looking for people with the appropriate skills.

An application that is commonly tested in academic literature, but that has not been exploited commercially, is extracting event information from emails and adding it to one's calendar. For instance, if you are on an email list that provides upcoming community events, you could create a system that automatically places all the events you're notified about on your calendar.

The caveat is that good results in a particular domain require a model trained specifically for that domain. (The domain is what a person's emails normally contain - do you get concert announcements, seminar announcements, or appointment reminders?) A good interface is therefore needed to let the user quickly and easily correct improper extractions. Even if the model performs poorly, a good interface will still improve on the current situation, where a user must open a blank calendar entry and then type in the required fields based on what they've read in the email.

The general idea can be extended to other sources of text, most obviously web pages, but also Usenet newsgroups. That can wait, in my opinion.

There are actually two possible applications for such technology that I'd like to pursue. They are discussed below.

Adaptive Email Information Extraction

This involves creating a plugin for an existing personal information management (PIM) application. While I don't believe there is an official definition, a PIM is normally an email client, calendar, and address book in one. The prime example is Microsoft Outlook (by "prime" I mean "best known," and not "good," of course). There are several open source applications, such as Ximian Evolution, KPIM, some Gnome apps that can work together (Balsa, GnomeCal, etc.), and Chandler, which is only in its initial stages but has ambitious goals.

The plugin would simply add a sidebar to the window where the text of an email is displayed. A drop-down list would let you select which of your templates you want to fill in, with the best match automatically selected. By template I mean a definition of a set of fields (date, time, speaker, location, etc.) that the user has grouped to form the main fields of an event. So if you get concert announcements you might have a "show" template with the fields: performer, date, time, location, and price. If the model fills in the template incorrectly, the user can fix it quickly by highlighting the correct text in the email and then clicking the icon for that field in the sidebar (the list of field icons would change depending on which template is selected).
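To make the template idea concrete, here is a minimal sketch of how a template and one filled-in instance might be represented. The class names (Template, FilledTemplate), the correct method, and the sample values are all hypothetical and made up for illustration; they are not taken from the prototype.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Template:
    """A named group of fields that together describe one kind of event."""
    name: str
    fields: Tuple[str, ...]


@dataclass
class FilledTemplate:
    """One email's extraction: field name -> extracted text (None if empty)."""
    template: Template
    values: Dict[str, Optional[str]] = field(default_factory=dict)
    verified: bool = False

    def __post_init__(self) -> None:
        # Give every template field a slot, even if the model found nothing.
        for name in self.template.fields:
            self.values.setdefault(name, None)

    def correct(self, field_name: str, highlighted_text: str) -> None:
        """Stand-in for the user highlighting text and clicking a field icon."""
        if field_name not in self.template.fields:
            raise KeyError(f"{field_name!r} is not a field of {self.template.name!r}")
        self.values[field_name] = highlighted_text.strip()


# The "show" template from the text above.
SHOW = Template("show", ("performer", "date", "time", "location", "price"))

# Made-up example: the model put the time in the price field; the user fixes it.
entry = FilledTemplate(SHOW, {"performer": "The Example Band", "price": "8pm"})
entry.correct("price", " $10 ")
entry.verified = True
```

Keeping the template separate from the filled-in values means the sidebar can rebuild its list of field icons from SHOW.fields alone, and a verified entry can later serve as a training example.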

The model for each template would improve as it is given more emails to train on. The question is how often the model would need to be retrained, or whether the model could be truly adaptive, updated cheaply (with little computation) after every email is verified. That depends on the model, which could be abstracted out so that different algorithms can be tried. There are a number of researchers in academia working on this problem, so there is no shortage of possible methods.
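One way to picture "abstracting out the model" is an interface with two operations: extract fields from an email, and update from one verified email. The sketch below is only an illustration under that assumption. The Extractor interface and the CuePrefixExtractor class are hypothetical names, and the cue-prefix rule is a deliberately trivial stand-in for a real learning algorithm, not the method used in the prototype.

```python
from abc import ABC, abstractmethod
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional


class Extractor(ABC):
    """Hypothetical interface the plugin could code against."""

    @abstractmethod
    def extract(self, email_text: str) -> Dict[str, Optional[str]]:
        """Return a guess for every field of the template."""

    @abstractmethod
    def update(self, email_text: str, verified: Dict[str, str]) -> None:
        """Learn from one email whose fields the user has verified."""


class CuePrefixExtractor(Extractor):
    """Toy model: remembers the text preceding each verified value on its line."""

    def __init__(self, fields: Iterable[str]) -> None:
        self.fields = tuple(fields)
        self.cues = defaultdict(Counter)  # field name -> counts of line prefixes

    def update(self, email_text: str, verified: Dict[str, str]) -> None:
        # Cheap incremental update: one pass over the email, no retraining.
        for line in email_text.splitlines():
            for name, value in verified.items():
                pos = line.find(value)
                if pos <= 0:
                    continue  # value not on this line, or nothing precedes it
                prefix = line[:pos].strip().lower()
                if prefix:
                    self.cues[name][prefix] += 1

    def extract(self, email_text: str) -> Dict[str, Optional[str]]:
        result: Dict[str, Optional[str]] = {name: None for name in self.fields}
        for line in email_text.splitlines():
            stripped = line.strip()
            for name in self.fields:
                if result[name] is not None:
                    continue  # keep the first match for each field
                for cue, _count in self.cues[name].most_common():
                    if stripped.lower().startswith(cue):
                        result[name] = stripped[len(cue):].strip()
                        break
        return result


model = CuePrefixExtractor(["location", "date"])
before = model.extract("Where: Library\nWhen: Monday")
model.update("Where: Town Hall\nWhen: Friday",
             {"location": "Town Hall", "date": "Friday"})
after = model.extract("Where: Library\nWhen: Monday")
```

Because the plugin would only call extract and update, a batch-trained statistical model could sit behind the same interface by buffering verified emails in update and retraining periodically.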

Furthermore, with such a system in place, it would be trivial to extract contact information from a person's signature at the end of an email. The template would simply contain name, address, title, etc. instead of date and time, and the results would be exported, along with the sender's email address, to the address book.

I worked on a prototype of such a system in a Special Topics on Information Retrieval class (cs497cxz at the University of Illinois) in the spring of 2003 with Liping Chen and Yan Shi. We are continuing our work this semester (Fall 2003), hopefully integrating such a system into Ximian Evolution. We may, however, pursue a slightly different path and try to publish a paper on how accuracy improves with new training data over time (that is, how much better the model does as the user verifies more and more extractions). A copy of the final report from last semester can be found here.

Autogenerated Web Calendar

The same basic technique could also be used by a web calendar service. Community or local web pages are rarely up-to-date with information on upcoming events, so people are unlikely to check them regularly, even though anyone with an internet connection can. On the other hand, event organizers and promoters often send out current information on events to mailing lists of some sort. That information only goes to people on the list, however, and joining may require some form of membership. If a single email address were subscribed to the public mailing lists of all the activist groups, concert promoters, schools, government groups, etc., a very current and complete calendar could be produced. Traditionally, this would be infeasible, as every event would have to be manually added by a volunteer (or employee), with little return for their time. Using the information extraction techniques discussed earlier, however, the problem would be reduced to simply verifying the calendar entries the system had extracted.

I would like to work on the back end for such a system and release it as open source (it would share much of its code with the PIM plugin). This would allow any community group to use the software in their area. I believe money could be made by running the system in a large market and selling ads. Ads could be either generic and static, or chosen based on which events the user had queried or was currently looking at. Theoretically, once the code was written, little would need to be done day-to-day beyond maintaining the code, checking the calendar entries, and selling ads. I've also submitted the idea to mySociety.org, which is in search of civic-minded projects to sponsor. Their idea is to pay people a stipend to work on web-based projects for the greater good and then release the code or run some service. I believe this project would be a good match for such an organization.


Copyright 2007 Peter Groves. This text may be reproduced only in its entirety in any medium without royalty provided this copyright notice is included.