Ontology driven Temporal Text Mining on Organisational Data for Extracting Temporal Valid Knowledge

The Ontology Editor allows users to construct and maintain ontologies through a tree-based, drag-and-drop graphical interface. The editor manages, imports, and exports multiple ontologies, and provides human-readable reports on ontological structure. It exposes an extensive API that allows third-party code to launch an editor instance, add and remove ontology structure, and drive the GUI. It also exposes a DOM interface that allows the current ontology to be treated as a single XML document. Although the Editor was constructed with Project Parmenides in mind, Wordmap have already applied the lessons learned during development to their main product, and expect significant commercial benefit from the exercise. Aside from business process, Wordmap do not expect to exploit this result by selling the Editor in its current form, but rather to use it as an internal R&D platform. However, licensing agreements will always be available.

Based on an API (application programming interface) for defining the behaviour of document/data processing components, the Component Workflow Architecture enables the arrangement of any number of components into a directed graph (processing workflow). Two consecutive components in a processing workflow are connected via a queue; the output of the first component is buffered in the queue and will be consumed by the second component as soon as the second component finishes its current processing task. Apart from the one-producer-one-consumer queuing paradigm, which leads to component pipeline workflows, more sophisticated queuing strategies can meet the needs of more complicated flows of processing, e.g. having a component receiving input from multiple components or directing the output of one component to multiple successor components.
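The one-producer-one-consumer case described above can be illustrated with a minimal sketch. It is not the Component Workflow Architecture API: the names `run_component` and `run_pipeline`, the use of threads, and the end-of-stream sentinel are all assumptions made for illustration. Each component is modelled as a plain function, and each pair of consecutive components shares one FIFO queue.

```python
import queue
import threading

_STOP = object()  # hypothetical sentinel marking the end of the input stream


def run_component(func, inbox, outbox):
    """Consume items from inbox one at a time and buffer results in outbox."""
    while True:
        item = inbox.get()  # blocks until the predecessor has produced something
        if item is _STOP:
            outbox.put(_STOP)  # pass the end-of-stream marker downstream
            return
        outbox.put(func(item))


def run_pipeline(components, documents):
    """Chain components so that each consecutive pair is connected by a queue."""
    queues = [queue.Queue() for _ in range(len(components) + 1)]
    threads = [
        threading.Thread(target=run_component, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(components)
    ]
    for thread in threads:
        thread.start()
    for doc in documents:
        queues[0].put(doc)
    queues[0].put(_STOP)

    results = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

For example, `run_pipeline([str.lower, str.split], ["Hello World"])` passes each document through a lower-casing step and then a tokenising step. Fan-in or fan-out flows would need several components reading from or writing to the same queue, which this sketch does not cover.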

The Knowledge Extraction System (KES) supports the semi-automatic extraction of structured data from documents, the storage of that data in a database, and the eventual querying of this database in support of knowledge management activities. The KES consists of an information extraction module that takes press releases as input, analyses them using natural language processing techniques to identify entities such as people, products and companies, and fills in specific business event templates (such as a sale-purchase event) with the extracted data. Templates can be modified by the user. Once accepted, the information in the templates is stored in a database that can subsequently be queried using specific questions.

The RELFIN-Annotator takes a set of topics and a clustering produced by the RELFIN-Learner through the RELFIN internal API and uses them to assign XML labels to a set of documents. The documents must have been NLP-pre-processed and must adhere to the ParDoc format. The RELFIN-Annotator is implemented as a component module, appropriate for use in the pipelines of the Parmenides Resource Manager. It is not a standalone module: it requires the RELFIN-Learner, which discovers the topics and delivers the clusters.

The relative ordering tool (ROTE) was designed as a support tool for the evaluation process in the Parmenides project. It supports users in determining the relative importance of quality characteristics of the software under evaluation. This information is then used to combine the results of applying the metrics to arrive at an overall evaluation of the components and ultimately the system as a whole. It works by presenting users with pairs of characteristics and eliciting which of the two they consider more important (or whether they are equally important). The users' responses are recorded as an ordered list of characteristics reflecting the users' beliefs. This improves both efficiency and accuracy in the process of ordering a (potentially) long list of items. The tool should be useful in other large-scale evaluation projects. In addition, we feel that it could be a useful support tool in similar tasks where a human is asked to give a subjective judgement of the relative merits of a large range of items (for example certain types of examination script, or social sciences data).
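The idea of turning pairwise answers into an ordered list can be sketched as follows. This is an illustration only, not ROTE's actual procedure: the function name `order_by_pairwise_judgement` and the `ask` callback are hypothetical, and the sketch simply delegates the choice of which pairs to present to Python's built-in sort. It also assumes the answers are consistent (transitive), which human judgements need not be.

```python
from functools import cmp_to_key


def order_by_pairwise_judgement(items, ask):
    """Order items from most to least important using pairwise answers.

    ask(a, b) is a hypothetical callback standing in for the user:
    it returns 1 if a is more important, -1 if b is, and 0 if they are equal.
    """
    def compare(a, b):
        # The more important item must sort first, so the answer is negated.
        return -ask(a, b)

    # sorted() is stable, so items judged equal keep their original order.
    return sorted(items, key=cmp_to_key(compare))


# Simulated user whose beliefs are encoded as hidden weights (made-up values).
weights = {"reliability": 3, "usability": 2, "efficiency": 1, "portability": 1}


def simulated_user(a, b):
    return (weights[a] > weights[b]) - (weights[a] < weights[b])


ranking = order_by_pairwise_judgement(
    ["efficiency", "reliability", "portability", "usability"], simulated_user
)
```

Because a comparison sort needs far fewer than all possible pairs for long lists, an approach of this kind asks the user fewer questions than exhaustive pairwise comparison would.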

The evaluation framework built up for the Parmenides system as a whole is intended to be re-usable for evaluating other systems that reveal information or knowledge from large amounts of data. Reflecting the complex nature of the Parmenides system itself, the framework will include a number of evaluation methodologies for different components of the system. These could be applied to simpler knowledge discovery systems, for example information extraction, term recognition or ontology building, as well as to the most complex data and text mining systems, which may subsume these processes. Whilst some of the methodologies for evaluating the more established technologies, such as information extraction, are largely based on current general practice, other parts of the framework will be more innovative. An example is the user-oriented evaluation of temporal data mining where, although a considerable amount of work has been done in diagnostic and progress evaluation by developers, the development of user-centred evaluation methodologies for this and related technologies is still in its infancy. Currently, draft versions of these methodologies have been produced, which must be applied and validated once stable versions of the Parmenides software are available. These methodologies can then be made available to all interested players (developers, users and evaluators) after the end of the project. The benefit of such an evaluation framework will ultimately be to increase the take-up of these technologies, based on a realistic understanding of real user requirements (on the part of developers) and of the realistic potential of the technologies (on the part of the potential user).

Preliminary results on ontology evaluation methodologies, in the particular area of semi-automated, unsupervised ontology learning with data mining methods.

The sequence mining algorithm analyses large datasets of events in order to discover relationships of frequently occurring episodes (sets of events). It consists of a framework that supports data pre-processing allowing users to formulate different scenarios upon which the knowledge discovery is applied.
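As an illustration of what discovering frequently occurring episodes can mean, the sketch below counts how often sets of event types occur together inside a time window. It is not the project's algorithm, which the description does not specify: the window-based support measure, the function name `frequent_episodes`, and the parameters `window`, `min_support` and `size` are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_episodes(events, window, min_support, size=2):
    """Return episodes (sets of event types) that co-occur frequently.

    events: list of (timestamp, event_type) pairs, sorted by timestamp.
    One window [t, t + window) is opened at every event; the support of an
    episode is the number of windows that contain all of its event types.
    """
    counts = Counter()
    for i, (start, _) in enumerate(events):
        types_in_window = set()
        for timestamp, event_type in events[i:]:
            if timestamp >= start + window:
                break
            types_in_window.add(event_type)
        # Each episode is counted at most once per window.
        for episode in combinations(sorted(types_in_window), size):
            counts[episode] += 1
    return {ep: n for ep, n in counts.items() if n >= min_support}


# Made-up event log: (timestamp, event type).
log = [(1, "A"), (2, "B"), (5, "A"), (6, "B"), (7, "C"), (10, "A"), (11, "B")]
episodes = frequent_episodes(log, window=3, min_support=2)
```

Changing `window` or `min_support` corresponds loosely to formulating a different scenario over the same pre-processed data.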

SVM Categorizer performs automatic categorization of given documents, based on the Support Vector Machines algorithm. The algorithm first needs to be trained with a number of positive and negative samples from each thematic category. The tool then, based on the previous training, assigns each new document to one or more categories, setting one as primary and the rest as secondary.
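The final assignment step can be sketched independently of the SVM training itself. The example below assumes that one classifier per category has already produced a decision score for the document (positive meaning "belongs to the category"). The function name `assign_categories`, the threshold, and the fallback to the best-scoring category are illustrative assumptions, not the tool's documented behaviour.

```python
def assign_categories(scores, threshold=0.0):
    """Split category scores into one primary and any secondary categories.

    scores: dict mapping category name to a decision score (assumed input).
    Categories scoring above the threshold are accepted; the best one becomes
    primary. If none is accepted, the best-scoring category is used anyway so
    that every document receives at least one category.
    """
    ranked = sorted(scores, key=lambda category: scores[category], reverse=True)
    accepted = [category for category in ranked if scores[category] > threshold]
    if not accepted:
        accepted = ranked[:1]
    return accepted[0], accepted[1:]


# Made-up scores for a single document.
primary, secondary = assign_categories(
    {"finance": 1.4, "sports": -0.8, "technology": 0.3}
)
```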

The Wordmap Termfinder is a technology for extracting terms and ontology fragments from text. Terms and fragments are useful for semi-automating ontology construction, information extraction, and many other NLP tasks. The existing code is structured as a single self-contained Java library for easy integration with other products. Key innovative features are: algorithmic efficiency in a combinatorially difficult domain; the possibility of using a rich target tag set (C5, though Penn is also supported); and the possibility of using multiple term extraction algorithms, including a new non-statistical method. The Termfinder is mature code; Wordmap will begin demonstrations to customers in Q4 2004, and expects to sell products containing this technology in 2005. The Termfinder also uses a standardized XML format for unambiguously specifying Brill Tagger transformations. We expect to open-source this aspect of the technology as an NLP community resource.

A set of tools and algorithms for event and argument extraction (with a specific focus on biomedical scientific literature). The tools are still under development. Results so far are documented in the following publication: Rinaldi et al., Mining Relations in the GENIA corpus, ECML/PKDD workshop on Data Mining and Text Mining for Bioinformatics, Pisa, September 2004.

This is a part of speech tagger built using a novel implementation of Brill's transformation-based error-driven learning algorithm. The main advantage this has over the original implementation is speed at run-time, due to the memory-based implementation. The tagger can be trained for different languages and tag-sets.
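For readers unfamiliar with transformation-based tagging, the run-time side of the approach can be sketched as follows: every word first receives an initial tag from a lexicon, and an ordered list of learned transformations then corrects tags in context. The sketch shows only one rule template ("change tag X to Y when the previous tag is Z"); the lexicon, the rule, and the function names are made up for illustration and say nothing about how this particular implementation stores or applies its rules.

```python
def initial_tags(tokens, lexicon, default="NN"):
    """Assign each token its lexicon tag, or a default tag if it is unknown."""
    return [lexicon.get(token.lower(), default) for token in tokens]


def apply_rules(tags, rules):
    """Apply transformations in order.

    rules: ordered list of (from_tag, to_tag, previous_tag) triples.
    A tag is rewritten when it equals from_tag and the tag to its left
    equals previous_tag.
    """
    tags = list(tags)  # work on a copy
    for from_tag, to_tag, previous_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == previous_tag:
                tags[i] = to_tag
    return tags


# Hypothetical lexicon and a single hypothetical learned rule.
lexicon = {"she": "PRP", "wants": "VBZ", "to": "TO", "race": "NN"}
rules = [("NN", "VB", "TO")]  # a noun directly after "to" becomes a verb

tokens = ["She", "wants", "to", "race"]
tagged = apply_rules(initial_tags(tokens, lexicon), rules)
```

Training consists of learning the ordered rule list from a tagged corpus, which is why the same machinery can be retrained for other languages and tag sets.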

The "Parmenides Concept Monitor" is an interactive module that monitors the popularity of application-specific concepts derived from a document collection as the collection accumulates or changes over time. As its first input, the "Parmenides Concept Monitor" receives a set of concepts or concept combinations discovered by the RELFIN-Learner from a document collection characteristic of the application. These concepts do not characterise the collection as a whole but are representative of document fragments, which correspond to in-document topics. As its second input, the "Parmenides Concept Monitor" receives an accumulating collection of documents, partitioned into time intervals. It identifies changes in the popularity of in-document topics and in the strength of the correlations among them. It encompasses different criteria for change detection and an alert mechanism. The "Parmenides Concept Monitor" has an API to the RELFIN-Learner, used to read the input set of concepts. Documents input to the "Parmenides Concept Monitor" must have been linguistically pre-processed using the modules of the Parmenides system, as required by the RELFIN-Learner and the RELFIN-Annotator.
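One simple change-detection criterion can be sketched as follows. It is an assumption made for illustration, not one of the monitor's documented criteria: popularity is taken to be the share of documents in an interval that contain a concept, and an alert is raised when that share moves by at least a threshold between consecutive intervals. The names `popularity` and `detect_changes` are hypothetical, and documents are reduced to sets of concept labels.

```python
def popularity(concept, documents):
    """Share of documents in one time interval that contain the concept."""
    if not documents:
        return 0.0
    return sum(1 for doc in documents if concept in doc) / len(documents)


def detect_changes(concepts, intervals, threshold):
    """Return (concept, interval_index, delta) alerts for large popularity shifts.

    intervals: list of time intervals, each a list of documents;
    each document is modelled as the set of concepts found in it.
    """
    alerts = []
    for concept in concepts:
        series = [popularity(concept, docs) for docs in intervals]
        for k in range(1, len(series)):
            delta = series[k] - series[k - 1]
            if abs(delta) >= threshold:
                alerts.append((concept, k, delta))
    return alerts


# Made-up collection partitioned into two time intervals.
intervals = [
    [{"merger"}, {"merger", "lawsuit"}, {"lawsuit"}, {"merger"}],
    [{"lawsuit"}, {"lawsuit"}, {"merger", "lawsuit"}, {"lawsuit"}],
]
alerts = detect_changes(["merger", "lawsuit"], intervals, threshold=0.4)
```

Monitoring the strength of correlations between concepts would follow the same pattern, with a co-occurrence measure in place of `popularity`.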

The CAFETIERE analysis system does information and knowledge extraction by the application of patterns defined over multiple layers of analysis of a textual document. A distinguishing feature is that it is closely coupled with an ontology management system, such as PS-NKRL (developed by Wordmap in the project) or Protege. The system is designed to extract facts as well as information about domain terms, entities and relations between them. Because of the ontology grounding, the extracted entities are represented not by text strings but as knowledge-based instances, allowing other facts about them to be accessed via textual annotations. The system is supported by an annotation browser and editor, which can be used either to validate the automatic analysis of a document collection, or to build up a corpus of training data usable for the construction of a specific knowledge extraction system.

A multi-layered XML-based annotation scheme, used within the project as an interchange format between different tools. It can be used to enrich documents with simple semantic annotation, without resorting to more complex approaches such as RDF. See the papers referenced in the "documentation" section for a detailed description and results.

The Terminology Structuring Tool allows the detection of structuring relations (in particular synonymy and hyponymy) among a set of pre-extracted terms. It is based upon a set of morphological, syntactic and semantic rules, and makes use of external resources (WordNet).

In the Parmenides project, UMD has acquired expertise in accompanying the process of ontology establishment, using as input a rudimentary taxonomy of terms, a collection of representative texts, and descriptions of entity types that are important for the universe of discourse. UMD can apply the RELFIN-Learner (see result "RELFIN-Learner") or further data mining modules for the discovery of concept combinations that are characteristic of document fragments that occur frequently in the application. Consulting support on further steps of ontology establishment is also possible.
