Automatic extraction of topic hierarchies based on WordNet

dc.contributor.author: Brey, Gerhard [en]
dc.contributor.author: Vieira, Miguel [en]
dc.date.accessioned: 2012-04-17T23:39:36Z
dc.date.available: 2012-04-17T23:39:36Z
dc.date.created: 2012-03 [en_AU]
dc.description.abstract: The aim of the research described here is the automatic generation of a topic hierarchy, using WordNet as the basis for a faceted browser interface, with a collection of 19th-century periodical texts as the test corpus. Our research was motivated by the Castanet algorithm, which was developed and successfully applied to short descriptions of documents. In our research we adapt the algorithm so that it can be applied to the full text of documents. The algorithm for the automatic generation of the topic hierarchy has three main processes: data preparation, wherein data is prepared so that the information contained within the texts is more easily accessible; target term extraction, wherein terms that are considered relevant to classify each text are selected; and topic tree generation, wherein the tree is built using the target terms. We evaluated samples of the resulting topic tree and found that over 90% of the topics are relevant, i.e. they clearly illustrate what the articles are about and the topic hierarchy adequately relates to the content of the articles. Future work will address problems resulting from mis‐OCRed words, erroneous disambiguation, and language anachronisms. Faceted browsing interfaces based on topic hierarchies are easy and intuitive to navigate, and as our results demonstrate, topic hierarchies form an appropriate basis for this type of data navigation. We are confident that our approach can successfully be applied to other corpora and should yield even better results if there are no OCR issues to contend with. Since WordNet is available in several languages, it should also be possible to apply our approach to corpora in other languages. [en_AU]
dc.description.sponsorship: Australian Academy of the Humanities; the ANU College of Arts and Social Sciences [en_AU]
dc.format.extent: 20 slides [en_AU]
dc.format.mimetype: application/pdf [en_AU]
dc.identifier.citation: Brey, G. & Vieira, M. (March 2012). Automatic extraction of topic hierarchies based on WordNet. Presentation at the Digital Humanities Australasia 2012: Building, Mapping, Connecting [Conference][aaDH2012]. Canberra, Australia: ANU [en_AU]
dc.identifier.uri: http://hdl.handle.net/1885/8990
dc.language.iso: en_AU [en_AU]
dc.provenance: The copyright is owned by the authors. The conference organisers make no claim over copyright. [en_AU]
dc.publisher: Australasian Association for Digital Humanities [en_AU]
dc.relation.ispartof: Australasian Association for Digital Humanities Conference (1st : 2012 : The Australian National University, Canberra, ACT) [en_AU]
dc.rights: Author/s retain copyright [en_AU]
dc.title: Automatic extraction of topic hierarchies based on WordNet [en_AU]
dc.type: Conference presentation [en_AU]
dcterms.accessRights: Open Access
local.contributor.affiliation: Brey, Gerhard, King's College London, Department of Digital Humanities [en_AU]
local.contributor.affiliation: Vieira, Miguel, King's College London, Department of Digital Humanities [en_AU]
local.description.notes: Inaugural Conference of the Australasian Association for Digital Humanities, held 27-30 March 2012. Presentation given by Jamie Norrish. [en_AU]
local.publisher.url: http://aa-dh.org/conference/ [en_AU]
local.type.status: Published Version [en_AU]
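
The three processes named in the abstract (data preparation, target term extraction, topic tree generation) can be illustrated with a minimal sketch. This is not the authors' implementation: the hard-coded `TOY_HYPERNYM_PATHS` table is a hypothetical stand-in for WordNet lookups, the stopword list and frequency-based term selection are simplifying assumptions, and merging hypernym paths into a tree is only one plausible reading of how a tree can be built from target terms.

```python
from collections import Counter

# Hypothetical stand-in for WordNet hypernym paths (root first).
# A real system would look these up in WordNet, not hard-code them.
TOY_HYPERNYM_PATHS = {
    "railway": ["entity", "artifact", "line", "railway"],
    "steamship": ["entity", "artifact", "vessel", "steamship"],
    "parliament": ["entity", "group", "legislature", "parliament"],
}

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "was", "by"}


def prepare(text):
    """Data preparation: lowercase, strip punctuation, drop stopwords."""
    tokens = (tok.strip(".,;:!?\"'()").lower() for tok in text.split())
    return [t for t in tokens if t and t not in STOPWORDS]


def extract_target_terms(tokens, top_n=2):
    """Target term extraction: most frequent tokens with a known path."""
    counts = Counter(t for t in tokens if t in TOY_HYPERNYM_PATHS)
    return [term for term, _ in counts.most_common(top_n)]


def _new_node():
    return {"children": {}, "docs": []}


def build_topic_tree(docs_terms):
    """Topic tree generation: merge each term's hypernym path into one tree."""
    tree = _new_node()
    for doc_id, terms in docs_terms.items():
        for term in terms:
            node = tree
            for label in TOY_HYPERNYM_PATHS[term]:
                node = node["children"].setdefault(label, _new_node())
            # Attach the document to the leaf node for its target term.
            node["docs"].append(doc_id)
    return tree


if __name__ == "__main__":
    docs = {
        "a1": "The railway and the steamship. The railway was opened.",
        "a2": "Parliament debated the railway.",
    }
    terms = {d: extract_target_terms(prepare(t)) for d, t in docs.items()}
    tree = build_topic_tree(terms)
    print(sorted(tree["children"]["entity"]["children"]))
```

In this sketch, documents that share a target term end up under the same leaf, and terms with a common hypernym share a branch, which is the property a faceted browser would navigate.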

Downloads

Original bundle

Name: Brey_Automatic2012.pdf
Size: 1.59 MB
Format: Adobe Portable Document Format
Description: PowerPoint presentation slides

License bundle

Name: license.txt
Size: 68 B
Format: Item-specific license agreed upon to submission