Simon Overell's Publications

Invited Talks

Using AI to get Answers from the Internet
ECIR Industry Day
The Open University, Milton Keynes
1 April 2010

True Knowledge is a pioneer in a new class of Internet search technology that’s aimed at dramatically improving the experience of finding known facts on the Web. Their first service - the True Knowledge Answer Engine - is a major step toward fulfilling a longstanding Internet industry goal: providing consumers with instant answers to complex questions, with a single click. Picking up where search engines leave off, True Knowledge’s path-breaking Answer Engine automates the laborious, time-consuming work that users generally must do to get final answers to their questions. True Knowledge does this by structuring data in a way that enables computers to work and think as humans do, drawing inferences and conclusions when needed to find the information that’s requested. Another key differentiator: True Knowledge is tapping subject matter experts around the globe to build its information repository - bringing together the benefits of machine-driven automation and people-driven intelligence. Simon Overell of True Knowledge will lead us through the story of how they applied AI techniques to make the break from search engines that give links to search engines that give facts.

Using AI to get Answers from the Internet
Real AI
Peterhouse College, Cambridge, UK
17 December 2009

The World According to Wikipedia
Centre for Digital Video Processing, Dublin City University
26 June 2008

Detecting Locations and Events in Wikipedia
Imperial College Internet Centre, London
29 April 2008

Wikipedia is the largest encyclopaedia mankind has ever known. It contains over 10 million articles across 250 languages and is now the 9th most visited site on the Internet. Wikipedia has led the way for user-generated-content sites such as Flickr and YouTube. In this talk, Simon will present his work on mining location and temporal references from Wikipedia, and will show that despite its best efforts at neutrality, Wikipedia still reflects the cultural biases of its contributors. By analysing different language versions of Wikipedia we can show how different locations and events have significance to different peoples. The talk will conclude with a summary of the applications of the work to Information Retrieval, Computer Science and beyond.

Distribution of Location References in Wikipedia (Short talk)
Million Books Workshop
Imperial College Internet Centre, London
14 March 2008

Classifying Wikipedia pages at home and abroad
Yahoo! UK, London
20 November 2007

Proposing a geographic co-occurrence model as a tool for GIR
Natural Language Processing Group, University of Sheffield
10 July 2007

The motivation behind developing such a tool is to improve performance on Geographic Information Retrieval problems such as placename disambiguation (if "Sheffield" appears in text, which Sheffield is it?) and geographic relevance (if "Sheffield" appears in a query, are "Yorkshire", "Manchester" or "Derby" relevant?). The talk will cover the development of a geographic co-occurrence model mined from Wikipedia and similar user-generated content. The co-occurrence model is similar to a language model but contains only geographic entities. The accuracy and clarity of the co-occurrence model are also quantified. The talk will begin with a description of how Wikipedia can be mined for named-entity associations and an overview of the field of Geographic Information Retrieval, followed by details of the co-occurrence model and its application. The talk will conclude with future directions, including the application of the described techniques to the CLEF corpora.
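As a rough illustration of what a geographic co-occurrence model can look like, the sketch below counts how often pairs of locations are mentioned in the same document and uses the counts to estimate which places are related to a query place. It is not the model presented in the talk: the toy corpus, the function names `build_cooccurrence` and `related_places`, and the assumption that locations arrive already disambiguated are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(docs):
    """Count how often each pair of locations is mentioned in the same document."""
    pair_counts = defaultdict(Counter)
    for locations in docs:
        # Use a set so a place repeated within one document is counted once.
        for a, b in combinations(sorted(set(locations)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return pair_counts


def related_places(pair_counts, place, top_n=3):
    """Rank other places by the estimated probability P(other | place)."""
    counts = pair_counts.get(place, Counter())
    total = sum(counts.values())
    if total == 0:
        return []
    # Sort by descending count, breaking ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(other, n / total) for other, n in ranked[:top_n]]


# Hypothetical toy corpus: each document is the list of locations it mentions.
docs = [
    ["Sheffield, UK", "Yorkshire, UK", "Manchester, UK"],
    ["Sheffield, UK", "Yorkshire, UK"],
    ["Sheffield, UK", "Derby, UK"],
    ["Sheffield, Alabama", "Florence, Alabama"],
]

model = build_cooccurrence(docs)
print(related_places(model, "Sheffield, UK"))
```

In this toy setting the model plays the role of a language model restricted to geographic entities: the relative frequencies give a simple answer to the geographic relevance question for a query such as "Sheffield".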

Placename disambiguation with co-occurrence models
Knowledge Media Institute, Open University
6 December 2006

My talk will cover an introduction to Geographic Information Retrieval (GIR) and the advantages provided by indexing placenames as unambiguous locations. I will describe our GIR system, which generates a large-scale co-occurrence model and applies this model to the problem of placename disambiguation. The data for the model is mined from Wikipedia and applied to the GeoCLEF corpus. For example, when "London" is referred to in text, is it "London, UK" or "London, Ontario"? The motivation behind this problem is to make un-annotated data machine readable and allow users to query and browse data geographically. The talk will begin with a description of GIR, placename disambiguation techniques and the use of Wikipedia as a corpus, followed by a description of my probabilistic models, which use first and higher orders of co-occurrence. The talk will conclude with our findings on how Information Retrieval methods can be enhanced with Geographic Knowledge.
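The "London" example can be made concrete with a small sketch of first-order co-occurrence disambiguation. This is only an illustration under stated assumptions, not the system described in the talk: the counts in `COOCCURRENCE` are invented, and the add-one smoothed log-probability score in the hypothetical function `disambiguate` is one simple scoring choice among many.

```python
import math
from collections import Counter

# Hypothetical co-occurrence counts (invented for illustration): how often each
# candidate location appears alongside other places in a training corpus.
COOCCURRENCE = {
    "London, UK": Counter(
        {"Paris, France": 40, "Manchester, UK": 35, "Toronto, Ontario": 2}
    ),
    "London, Ontario": Counter(
        {"Toronto, Ontario": 30, "Windsor, Ontario": 12, "Paris, France": 1}
    ),
}


def disambiguate(candidates, context, cooccurrence, smoothing=1.0):
    """Pick the candidate that best explains the places seen in the context.

    Each candidate is scored by the sum of smoothed log-probabilities of the
    context places given that candidate.
    """
    vocabulary = {p for counts in cooccurrence.values() for p in counts}
    vocabulary |= set(context)
    scores = {}
    for candidate in candidates:
        counts = cooccurrence.get(candidate, Counter())
        total = sum(counts.values()) + smoothing * len(vocabulary)
        scores[candidate] = sum(
            math.log((counts[place] + smoothing) / total) for place in context
        )
    return max(scores, key=scores.get), scores


candidates = ["London, UK", "London, Ontario"]
best, scores = disambiguate(
    candidates, ["Toronto, Ontario", "Windsor, Ontario"], COOCCURRENCE
)
print(best)
```

Smoothing keeps a candidate from being ruled out entirely when a context place was never seen with it in training, which matters when counts are sparse.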

Evaluating co-occurrence models applied to disambiguation
University of Glasgow
13 November 2006

My presentation will cover the evaluation of large-scale co-occurrence models for disambiguation. The data for the models is mined from Wikipedia and applied to the GeoCLEF corpus. The mining and application parts of the system are entirely independent to avoid bias. The specific problem I am applying co-occurrence models to is place name disambiguation (for example when “London” is referred to in text, is it "London, UK" or "London, Ontario"?). The motivation behind this problem is to make un-annotated data machine readable and allow users to query and browse data geographically. With the recent introduction of the geographic track to the Cross Language Evaluation Forum there is now a standardised way to test Geographic Information Systems.

I have evaluated three approaches to applying co-occurrence to place name disambiguation:
1. Assign a co-occurrence index to place triplets.
2. Infer co-occurrence classifiers from the ground truth.
3. Represent the places occurring in the training data as vectors in a high dimensional space.

The talk will begin with a description of place name disambiguation techniques and the use of Wikipedia as a corpus, followed by a description of my probabilistic models, which use first and higher orders of co-occurrence. The talk will conclude with my intended future work: expansion beyond place names to all named entities.
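The third approach listed above, representing places as vectors, might be sketched as follows. This is a minimal illustration rather than the evaluated method: the sparse vectors in `LOCATION_VECTORS` are invented, and the use of cosine similarity and the function name `nearest_location` are assumptions made for the example.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_location(context_places, location_vectors):
    """Return the candidate whose vector is closest to the context vector."""
    query = Counter(context_places)
    return max(location_vectors, key=lambda loc: cosine(query, location_vectors[loc]))


# Hypothetical training vectors: each dimension is a place name, each value is
# how often it co-occurred with the candidate location (numbers are invented).
LOCATION_VECTORS = {
    "London, UK": Counter({"Paris": 3, "Manchester": 4}),
    "London, Ontario": Counter({"Toronto": 5, "Windsor": 2}),
}

print(nearest_location(["Toronto", "Windsor"], LOCATION_VECTORS))
```

With one dimension per place name the space is high dimensional but sparse, so storing only the non-zero entries keeps the vectors compact.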


Simon Overell's Publications
My PhD topic was Geographic Information Retrieval. I've written papers on Geographic Disambiguation and Modelling, filed patents on Classification and Accurate NLP at Scale, and given talks on Extracting Data from Wikipedia and the Web.

Theses

PhD Thesis. Geographic Information Retrieval: Classification, Disambiguation and Modelling. (Imperial College London, 2009)

Master’s Thesis. TRIDE: Implementation of a Teleo-Reactive Integrated Development Environment. (Imperial College London, 2005)

Journal Articles

View of the world according to Wikipedia: Are we all little Steinbergs? (JOCS, 2011)

Using co-occurrence models for placename disambiguation. (IJGIS, 2008)

Conference & Workshop Papers

Classifying Tags using Open Content Resources. (WSDM, 2009, Barcelona)

Geographic Co-occurrence as a Tool for GIR. (GIR @ CIKM, 2007, Lisbon)
...

Invited Talks

I've given 9 invited talks covering my PhD, research at Yahoo! and work at True Knowledge.

Invited Articles

The Problem of Place Name Ambiguity (The SIGSPATIAL Special, 2011)

Are we getting it right? The results of the Student Survey (Informer, Spring 2008)

Patents

I've written various patents, all broadly related to classification. Four have been granted with previous employers and two are pending with Spider.io.

Evaluation Conference Papers

A key part of Information Retrieval is evaluation. Due to the efforts of the TREC and CLEF conferences there is now a series of standardised data sets for these evaluations. I've taken part in three CLEF conferences and one TREC conference, publishing 10 papers.

Posters

Distribution of Location References in Wikipedia (The Future of Multimedia Knowledge Management 2008, Milton Keynes)

SIRIL: A multidimensional browsing framework (MMKM Workshop 2007, Milton Keynes)

Citations

Both Google Scholar and Microsoft Academic Search maintain co-author and citation lists.