As part of my PhD I have produced several data sets, which are available for download from this page. I will continue to release more data as I produce it. All data is released under the Apache 2.0 License. If you have any questions, please e-mail me: seo01@docDOTicDOTac.uk.
Wikipedia ground truth
A manually annotated sample of 1000 Wikipedia pages.
Description
We selected a random sample of 1000 Wikipedia pages, extracted all the links from these pages, and annotated whether each link refers to a location. If a link does refer to a location, we include the corresponding unique ID from the TGN gazetteer. The Wikipedia dump used is dated 3 December 2005.
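To make the annotation scheme concrete, here is a minimal Python sketch of how one annotated link could be represented in memory. The class name, field names, and helper function are illustrative assumptions, not the layout of the released files, and the ID in the example is a placeholder rather than a real TGN entry.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LinkAnnotation:
    """One annotated link. Field names are hypothetical, not the released format."""
    page_title: str        # Wikipedia page the link was extracted from
    link_target: str       # title of the page the link points to
    tgn_id: Optional[int]  # TGN gazetteer ID if the link is a location, else None

    @property
    def is_location(self) -> bool:
        # A link counts as a location exactly when it carries a TGN ID.
        return self.tgn_id is not None


def location_fraction(annotations: Sequence[LinkAnnotation]) -> float:
    """Fraction of links annotated as locations; 0.0 for an empty sample."""
    if not annotations:
        return 0.0
    return sum(a.is_location for a in annotations) / len(annotations)


# Placeholder records for illustration only (the ID is not a real TGN ID).
sample = [
    LinkAnnotation("Example page", "Example place", 1234567),
    LinkAnnotation("Example page", "Example person", None),
]
print(location_fraction(sample))
```

Storing the gazetteer ID as an optional value keeps the location flag and the ID in one field, so a record cannot claim to be a location without saying which one.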
Citation
@inproceedings{overell06a,
title={Identifying and grounding descriptions of places.},
author={Simon Overell and Stefan R\"uger},
year={2006},
month={August},
editor={Chris Jones and Ross Purves},
booktitle={SIGIR Workshop on Geographic Information Retrieval},
pages={14--16},
location={Seattle, USA}
}
Geographic Co-occurrence model
An automatically generated geographic co-occurrence model extracted from Wikipedia.
Citation
@inproceedings{overell07f,
title={Geographic Co-occurrence as a Tool for GIR.},
author={Simon Overell and Stefan R\"uger},
year={2007},
month={November},
editor={Chris Jones and Ross Purves},
booktitle={CIKM Workshop on Geographic Information Retrieval},
pages={71--76},
location={Lisbon, Portugal}
}
More data has recently been released via my API.
For more information on how this data was generated, please read my publications.