Knowledge-Base and Database Systems

Until quite recently, the two areas of knowledge-based systems and databases have usually kept some distance from each other. While their preoccupations have not been totally different, their techniques have been developed independently. It has always been appreciated that there is no sharp dividing line between data and knowledge, as the distinction varies according to the context in which it is made. At some periods this appreciation was the basis for little more than an honorary association between the two areas, but the connection is now stronger than that, and growing in strength, because the existing achievements of each area fill a gap or need in the continuing development of the other. It is possible to infer new information (generalisations, and not merely new facts derived from given data) from database material, for which methods devised in research on knowledge-based systems are immediately relevant. And databases are natural places to look for the large volumes of information that are needed if techniques for knowledge processing are to be tested thoroughly and validated.

The projects described in this section of the research report include examples of situations where knowledge-based techniques and databases come together: in interpretation of medical images, in exploiting stored multilingual texts in order for computers to give better support to human translators, and in construction of stocks of knowledge that can allow automated reasoning in terms of cases. A common lesson in the experience we have gained in these situations is that uniting databases and knowledge-based schemes within one framework raises questions (compatibility of representations, acquisition of the implicit knowledge that lies behind a particular database and its design assumptions, and effective joint working of all the software components in a unified environment) which are not given proper attention in present textbooks. These are effects of scale: they can be downplayed in textbook examples, which are simple because they are small, but must be confronted for systems that hold and process large amounts of information. Learning how to deal with large applications is an integral part of several of our projects.

The projects involving medical applications which are mentioned below are continuations or natural consequences of work that has been carried out for some years in the Department, and involve recently-expanded collaborations with specialist groups in medicine and in medical computing. Broadly speaking, the second area of application is textual (the treatment of documents, either as part of a general filing or office-automation scheme or as sources of assistance to translators), and it also benefits from collaboration with specialist researchers and users outside computer science. We expect that our activities in both areas will expand along with the general growth of these areas. In medicine the steady growth is a matter of record. For knowledge-based textual and linguistic processing, we expect to be able to take advantage of the recent increased emphasis on "language engineering" in the European Union and its prominent place in the EU's Fourth Framework Programme of research and development in information technology.

A further strand of our research involves improvements in the foundations of database systems, such as the representation of incompleteness and uncertainty in data, and the theoretical treatment of the issues underlying object-oriented database models. All of these topics, too, have implications for knowledge-based systems, and (as treated in our database research) give perspectives that complement the views that have previously come from logicians working on artificial intelligence.

Knowledge-Based Systems

  1. The Malformation Syndromes Diagnostician (MSD) Project

Felicity Dams, John Washbrook, with E.T. Keravnou (Department of Computer Science, University of Cyprus) and C.M. Hall, R.M. Dawood, D. Shaw (Department of Radiology, The Hospital for Sick Children)

This project is part of a continuing collaboration between radiologists at the Hospital for Sick Children, Great Ormond Street, and computer scientists at University College London and the University of Cyprus. The motivation for the work is to create a knowledge-based system to aid in the diagnosis of skeletal abnormalities. These abnormalities include about 2000 skeletal malformation syndromes as well as about 120 skeletal dysplasias.

The objectives of the project are to:

This work builds upon the Skeletal Dysplasias Diagnostician (SDD) project.

The most significant practical innovation for the project has been the introduction of a new model for background, or "commonsense" medical knowledge, and an associated inference engine. The new model supports a canonical (standard) form for representing features. This has two major advantages over the original model:

  1. it eliminates synonyms: when a feature is input by the user, it is translated into the system's preferred term for internal use, which saves a great deal of time in subsequent processing.
  2. it contains a great deal of taxonomic and meronomic knowledge, enabling the system to make much more sense of the information which it is given (a minimal sketch of this kind of lookup follows the list).
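
To make the idea of a canonical feature form concrete, the following is a minimal sketch of a synonym and taxonomy lookup of the kind described above; the terms, synonym table and is-a/part-of links are invented for illustration and are not drawn from the MSD knowledge base.

    # Illustrative sketch only: canonicalising a user-supplied feature and
    # generalising it through taxonomic (is-a) links. All terms are invented.

    SYNONYMS = {                              # user phrasing -> preferred term
        "bowed femur": "femoral bowing",
        "bending of the femur": "femoral bowing",
    }

    IS_A = {                                  # taxonomic knowledge
        "femoral bowing": "long-bone bowing",
        "tibial bowing": "long-bone bowing",
    }

    PART_OF = {                               # meronomic knowledge
        "femur": "lower limb",
        "tibia": "lower limb",
    }

    def canonical(feature: str) -> str:
        """Map a user-supplied feature onto the preferred term."""
        return SYNONYMS.get(feature.lower(), feature.lower())

    def generalisations(term: str):
        """Follow is-a links upwards so that more general findings can match."""
        while term in IS_A:
            term = IS_A[term]
            yield term

    print(canonical("Bowed femur"))                    # femoral bowing
    print(list(generalisations("femoral bowing")))     # ['long-bone bowing']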

A complete new set of syndromes (those displaying sclerosis) has been entered into the dysplasia knowledge base ready for the next set of user trials. This also necessitated several changes to the data model and background reasoning model, including:

These changes to the background reasoning have been very successful, and have improved the performance of the system considerably.

The tasks for the next twelve months are to:

Sophia Prevezanou, John Washbrook, with A. Todd-Pokropek (Department of Medical Physics), E.T. Keravnou (Department of Computer Science, University of Cyprus) and P. Taylor (I.C.R.F.)

This project is part of a continuing programme of work, originally funded by the HEFCE, and is a collaboration primarily between the Departments of Medical Physics and Computer Science at University College London, a group at the Imperial Cancer Research Fund, several external companies (in particular AuRa Scientific Ltd and IMSTAR S.A.) and the International Atomic Energy Agency. The motivation for the work is to create a knowledge-based system to assist in image analysis. Two main target application areas were chosen: expert-aided quality assurance in nuclear medicine, and expert-aided gel electrophoresis image quantitation.

The objectives of the project are to:

Since there is significant commercial interest in the system, an important requirement has been that the software packages developed should run on different machines, in particular both on Unix-based workstations and on MS-DOS-based PC platforms.

To date the bulk of the work has been in the development of suitable image processing tools. A complete quality assurance package has been written for the nuclear medicine application. This is now being pilot-tested in twenty hospitals, primarily in developing countries, under the sponsorship of the IAEA. A large part of the gel electrophoresis image analysis package has also been completed. Both packages run under Unix (SunOS and Solaris), MS-DOS and OS/2. In addition to the specific code related to the two application areas, a large number of general-purpose image processing functions have been added. Packages can be configured which are appropriate for many medical image processing applications.

The most significant practical innovation for the project has been to:

For a software package in these application areas to be useful, it must either perform real data acquisition itself, or control the data acquisition software provided by some other manufacturer. Given the enormous variability of systems in the field, only the latter solution is practical. Thus not only must appropriate data transfer mechanisms exist, but control functions must also be created. Both have been tackled, for example by using the INTERFILE file transfer format for nuclear medicine, and by providing appropriate interfaces to manufacturer-specific formats or general-purpose graphics formats. A suitable control mechanism has been provided for the quality assurance problem, and this has now been implemented by two suppliers of acquisition hardware.
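
By way of illustration of the file transfer side only, INTERFILE headers are plain ASCII "key := value" records. The following is a minimal sketch of a reader for such a header, not the project's transfer code; the file name and the keys looked up at the end are hypothetical examples.

    # Minimal sketch of reading an INTERFILE-style header, assuming the usual
    # ASCII conventions: one "key := value" pair per line, ';' for comments.

    def read_interfile_header(path: str) -> dict:
        header = {}
        with open(path, "r", errors="replace") as f:
            for line in f:
                line = line.split(";", 1)[0].strip()     # drop comments
                if ":=" not in line:
                    continue
                key, _, value = line.partition(":=")
                header[key.strip().lstrip("!").lower()] = value.strip()
        return header

    # Hypothetical usage:
    # hdr = read_interfile_header("study001.hdr")
    # print(hdr.get("matrix size [1]"), hdr.get("number format"))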

The tasks for the next twelve months are to:

John Campbell

Case-based reasoning (CBR) is now a standard research topic in knowledge-based systems, after being a small minority interest as recently as three years ago. It is now widely appreciated that there are many examples of expert knowledge that are particular and anecdotal, and that cannot be expressed within the standard representations for knowledge (rules etc.) without losing essential information. There is no substantial debate about what should be represented in a case, but the literature contains a very wide selection of decisions about how to represent cases.

Work in the past year has confirmed our previous classification of CBR as a particularisation (more accurately, several specific particularisations) of analogical reasoning. This has led to identification of particular features of past research on analogical reasoning (especially psychologically-based research) that are effective in structuring case information, in contrast to the general situation in CBR where cases are given a rather flat representation which does not take into account temporal or narrative cues.

We have been conducting experiments to test the view that assessment of similarity can often be speeded up if some redundancy is accepted in the representations of case material, and that retrieval is of higher quality (as judged by human experts) and more effective (in answering non-experts' queries and needs) if it takes into account the narrative structures mentioned above. The results are positive as far as the experiments have progressed, but to be more convincing they need to be based on larger collections of case material. A start has been made on collecting data for one area, transport routing, which can in principle yield much information.

Knowledge-Based Systems Support for Engineering Designers

Janet McDonnell

This investigation is concerned with the potential of knowledge-based systems (KBS) technology to make a key contribution, at a strategic level, to the support of practising engineering designers. The premise is that an object-oriented knowledge-based system can provide a framework within which specialised CAD tools, conventional databases, intelligent data handling, and traditional computer programs for analysis, simulation, and so on are integrated in such a way that each continues to make its particular contribution to automating or assisting aspects of designers' work, but each is also set in a context (by the framework) which is relevant, timely, and meaningful to the designer.

The work is based on the belief that an understanding of designers' perspectives should be the primary formative influence on the nature and structure of any KBS which is intended to assist with designing and the decision-making which it involves. For this reason, it is essential to study real design activity in the natural setting of design practice, and to make empirical studies the central core in both the formulation and the evaluation of design decision support systems. The theoretical aspect of the work so far has concentrated on three themes: exploring the relationship between design competence and professional design practice, with a view to understanding how design support systems can fit into the professional environment; investigating and testing the best ways of eliciting what engineering design activities entail, and of representing the outcomes in ways that lead to identification and specification of software components which realise the features and qualities that must be exhibited by an engineering design support system; and the design issues associated with mapping the essential components identified onto an object-oriented software platform.

A case study based on the design of high voltage electricity distribution networks has been completed. A number of knowledge elicitation techniques have been applied within a methodological framework suggested by Pask's Conversation Theory, including some based on the use of systemic grammar networks. This approach allows knowledge engineering to proceed effectively on the basis of knowledge elicitation and representation through the construction of a language to suit a purpose, rather than on the basis of making psychological claims about designers' cognitive processes.

Whilst the work reported here refers to engineering design, the nature of designing during the early stages of design, sometimes termed conceptual design, appears to share significant similarities across different design disciplines. It has therefore proved valuable to draw upon design studies in other design disciplines in the theoretical research, and it is a deliberate research aim to be able to demonstrate components of a KBS design support architecture which are valid for broader application, e.g. architectural components for supporting design justification through reference to design alternatives, an aspect of a designer's work which is common to most, if not all, professional design practice.

Developing Techniques for Knowledge-Based Problem Solving in Time-Bounded Situations

Niladri Chatterjee (supervised by John Campbell)

Our work in the ongoing project on developing techniques for time-critical decision making in knowledge-based domains has been primarily along two directions:

While our earlier reports contain embryonic ideas on both of these aspects, over the past year we have identified the implementational bottlenecks in each of them and have been able to provide solutions for both.

As we expected, the main difficulty in building a cache is in choosing a manageable set of indices that are adequate for representing different problems in the domain. In the absence of clear domain theories (as often happens with real domains) we propose to perform an analysis of selected past cases to ascertain the significance of various features in the domain. We have developed a novel classification scheme for identifying the roles of different features in the overall case. We have also developed a formalism for case representations within the memory, so that their adaptation in a given situation is smooth and requires less computational time than in previous approaches.
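
The flavour of such feature-weighted indexing and retrieval is sketched below. This is an illustration only: the cases, feature names and weights are invented (loosely in the spirit of the airport ground-operations application mentioned later), and the project's classification scheme and cache organisation are considerably richer.

    # Schematic sketch: cached cases indexed by weighted features, with the
    # nearest case retrieved for a new problem. All data here are invented.

    CASES = [
        {"features": {"runway": "R1", "delay_min": 10, "weather": "fog"},
         "solution": "reroute taxiing via T3"},
        {"features": {"runway": "R2", "delay_min": 45, "weather": "clear"},
         "solution": "reschedule departure slot"},
    ]

    WEIGHTS = {"runway": 1.0, "delay_min": 2.0, "weather": 3.0}   # feature significance

    def similarity(f1: dict, f2: dict) -> float:
        score = 0.0
        for name, w in WEIGHTS.items():
            a, b = f1.get(name), f2.get(name)
            if isinstance(a, (int, float)) and isinstance(b, (int, float)):
                score += w / (1.0 + abs(a - b))     # graded match for numbers
            elif a is not None and a == b:
                score += w                          # exact match for symbols
        return score

    def retrieve(problem: dict) -> dict:
        return max(CASES, key=lambda c: similarity(problem, c["features"]))

    print(retrieve({"runway": "R1", "delay_min": 12, "weather": "fog"})["solution"])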

The technique of "knowledge interpolation" has been conceived with a view to performing this adaptation. Earlier we thought of using interpolation for case adaptations only. However, over the past year we have not only developed methods for carrying out interpolation, but also have been able to extend the idea of interpolation for some other reasoning paradigms (e.g. rule-based).

Our recent work in this area has primarily focused on the following topics:

The applications considered have been ground operations control in an airport and frequency allocation in short-wave radio communications. The different paradigms that we have considered for this purpose have been default solutions, rules and cases.

A Computational Framework for Rule-Based and Case-Based Reasoning Applied to Matrimonial Home Settlement after Divorce

Kamalendu Pal (supervised by John Campbell)

This research is part of a project on legal decision-making systems. Decision makers, faced with new problems, benefit from the experience of similar cases that have already been decided. In many legal domains, it is very difficult to arrive at a final decision using only statutory rules. We have developed a computer-based prototype system which can help in working out the final decision for a new case in hand, using both general legal rules and specific information taken from previously-decided similar cases. Developing such a hybrid computational framework is desirable because neither rules nor past cases, taken alone, are sufficient for a legal decision-making process even in supposedly simple legal areas, e.g. property settlement in divorce proceedings (the specific area addressed in this work). There are many other situations where such a two-paradigm approach is demanded by the underlying knowledge.

Fuzzy information plays a vital role in this computational framework. Legal decision-makers often use descriptions such as "the length of the marriage is long" or "short", or "the spouses are poor" or "rich", in their judgements, but there is no evidence as to how to quantify those linguistic terms. Fuzziness occurs when the boundary of a piece of information is not clear-cut; concepts such as poor, rich, very rich, or middling are therefore fuzzy, and no single quantitative value defines a man as rich or poor. We use fuzzy set theory to handle such information in matrimonial home settlement. "Fuzziness" and "uncertainty" are two distinct inexact concepts employed in our prototype system.
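
As a purely illustrative example of how such linguistic terms can be given graded rather than crisp boundaries, the sketch below defines simple ramp-shaped membership functions; the break points are invented for illustration and carry no legal significance.

    # Illustrative only: fuzzy membership functions for "the marriage is long"
    # and "the spouse is rich". The calibration points below are hypothetical.

    def ramp(x, a, b):
        """Membership rising linearly from 0 at a to 1 at b."""
        if x <= a:
            return 0.0
        if x >= b:
            return 1.0
        return (x - a) / (b - a)

    def long_marriage(years):          # hypothetical calibration
        return ramp(years, 5, 15)

    def rich(income):                  # hypothetical calibration (pounds per year)
        return ramp(income, 30_000, 80_000)

    print(long_marriage(8), rich(50_000))    # partial memberships: 0.3 and 0.4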

This system takes as input a description of a current situation and produces initial advice for the user, provided that at least one of the rules in the rule base fires. If no rule fires, the computation can still suggest some partial or tentative advice. The computation also exploits features of the case being considered, to choose similar cases from a relevant case-base and display these previous decisions to the user in a way that should facilitate the final decision.

TRANSLEARN: Assistance in Natural-Language Translation

John Campbell, Niladri Chatterjee and Neil Morgenstern, with A. Fang (Survey of English Usage, UCL)

In European organisations, there is an obvious need to make the translation of official documents into all CEC languages (and, in the future, into languages of new applicants to join the European Union) more efficient and less labour intensive. Many documents include parts in standardised formats and phrasings, apart from minor differences in items such as dates and serial numbers. Hence, if a piece of text X in language B already exists in a database of translations, as a translation from language A, it is undesirable for a translator to have to repeat the process of translation to B when X turns up in a new document written in A.

The TRANSLEARN project, which is part of the CEC programme LRE on information technology for natural-language processing, is intended to deal with this situation. Our collaborators are natural-language specialists: the Institute for Language and Speech Processing, and Knowledge S.A., both in Greece; ILTEC in Portugal; and the large French company SITE which produces documentation and translations. We have access to large samples of databases of CEC documents, with parallel versions in English, Greek, Portuguese and French.

We have specialised in producing high-quality software for parsing English text and tagging its components with part-of-speech labels, and in finding the best fast discriminators for alignment of texts that represent (in principle) the same thing in different languages. During the current year we have produced efficient software for this latter purpose, for inclusion in an overall "translator's workbench" for TRANSLEARN. Our material has been tested in the project's mid-term review, and it has been shown to exceed the previous state of the art in performance and capability.
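
The project's alignment software is not reproduced here, but the simplest idea underlying many alignment discriminators is that mutual translations tend to have similar lengths. The toy sketch below illustrates that idea only; it is not TRANSLEARN's algorithm, and the example sentences are invented.

    # Toy sketch of a length-based alignment discriminator (illustrative only).

    def length_score(src: str, tgt: str) -> float:
        """Crude similarity in (0, 1]: 1.0 when the character lengths match."""
        a, b = len(src), len(tgt)
        return min(a, b) / max(a, b) if max(a, b) else 1.0

    def align_one_to_one(src_sents, tgt_sents, threshold=0.6):
        """Greedy 1-1 pairing of parallel sentence lists above a length-ratio threshold."""
        return [(s, t) for s, t in zip(src_sents, tgt_sents)
                if length_score(s, t) >= threshold]

    pairs = align_one_to_one(
        ["The Council adopted the regulation on 12 May."],
        ["Le Conseil a adopté le règlement le 12 mai."])
    print(pairs)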

Database Systems

  1. Incomplete Information in Databases

Mark Levene and G. Loizou (Birkbeck College)

We are investigating data modelling, query languages and data dependencies (i.e. integrity constraints) in relational databases (both flat and nested) in the context of missing or incomplete information. We have formalised two models of incomplete information, which we briefly describe.

In the first approach we allow only one unmarked generic null value whose semantics we do not further specify. The motivation for the choice of a generic null is our desire to investigate only fundamental semantics which are common to all unmarked null types. This has led us to define a preorder on relations, which allows us to measure the relative information content of relations. We have also defined a procedure, called the extended chase procedure, for testing satisfaction of null data dependencies and for making inferences by using these null data dependencies. The extended chase procedure is shown to generalise the classical chase procedure, which is of major importance in classical relational database theory. As a consequence of our approach we are also able to capture the novel notion of losslessness in nested relations, called null extended lossless decomposition.
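
For the reader unfamiliar with such orderings, one common way of formalising relative information content is sketched below (the project's precise definitions may differ): a tuple, and by extension a relation, becomes more informative as null values are replaced by actual values.

    t_1 \sqsubseteq t_2 \iff \text{for every attribute } A,\ t_1[A] = \mathrm{null} \ \text{or}\ t_1[A] = t_2[A]

    r_1 \sqsubseteq r_2 \iff \text{for every } t_1 \in r_1 \ \text{there exists}\ t_2 \in r_2 \ \text{such that}\ t_1 \sqsubseteq t_2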

In the second approach we interpret the unmarked null as "value exists but is unknown at the present time" and give the semantics of the resulting incomplete relations in terms of the possible worlds they induce. This approach allows us to define strong and weak functional dependencies (FDs), which are two alternative interpretations of what it means for an FD to be satisfied in an incomplete relation. Strong satisfaction means that the FD holds in all possible worlds, and weak satisfaction means that it holds in at least one possible world. We have presented a sound and complete axiom system for the interaction of strong and weak FDs, which corresponds to a subset of propositional modal logic.
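
In symbols, writing POSS(r) for the set of possible worlds induced by an incomplete relation r (the notation here is ours, for exposition only):

    r \models_s X \rightarrow Y \iff w \models X \rightarrow Y \ \text{for every}\ w \in \mathrm{POSS}(r)

    r \models_w X \rightarrow Y \iff w \models X \rightarrow Y \ \text{for at least one}\ w \in \mathrm{POSS}(r)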

We have extended these results by investigating the effect that incomplete information has on inclusion dependencies (INDs), which generalise the concept of referential integrity in relational databases. We have exhibited a sound and complete axiom system for INDs and weak FDs over incomplete databases. An interesting result of this research is that the implication problem for weak FDs and NINDs over incomplete databases is decidable and EXPTIME-complete. By contrast, when no nulls are allowed, this implication problem is undecidable. This undecidability result has motivated several researchers to restrict their attention to FDs and noncircular INDs. Our results imply that when considering nulls in relational database design we need not assume that NINDs are noncircular.

The Additivity Problem in Incomplete Relational Databases

Mark Levene and G. Loizou (Birkbeck College)

Functional dependencies (FDs) and inclusion dependencies (INDs) are the most fundamental integrity constraints that arise in practice in relational databases. This work is concerned with a particular problem that arises with respect to FDs and INDs in the presence of incomplete information. An FD or IND is weakly satisfied in an incomplete database if there exists a possible world of this incomplete database in which the FD or IND is satisfied in the standard way. Additivity is the property of the equivalence of the weak satisfaction of a set of FDs and INDs, S, with the individual weak satisfaction of each member of S in the incomplete database concerned. From the user's point of view, an incomplete database which is not additive may be considered contradictory. We have shown that in general weak satisfaction of FDs and INDs is not additive. The problem that arises is: under what conditions is weak satisfaction of FDs and INDs additive? We have solved this problem for the following cases: when S is a set of FDs, when S is a set of unary INDs, and when S is a set of FDs and unary INDs. We have shown that if the set of INDs is unary then checking whether S is additive can be done in time polynomial in the size of S.
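
Stated schematically (with notation as in the previous subsection, used here only for exposition), additivity of a set S for an incomplete database r is the equivalence

    r \models_w S \iff r \models_w \sigma \ \text{for every}\ \sigma \in S

where the left-hand side requires a single possible world satisfying all of S simultaneously. The left-to-right direction always holds; it is the converse that can fail, and the research characterises when it does not.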

Computable Database Queries

Mark Levene and G. Loizou (Birkbeck College)

Computable database queries differ from standard computable mappings in that their inputs and outputs are sets of records (or, in general, objects) rather than strings of characters. This has an effect on the semantics of computable queries, since a user posing a database query does not need to be concerned with the storage details of the records in the database. In particular, the user can ignore the order in which the records are stored. Our approach is to view a computable query as being realised via a Turing-computable mapping from strings to strings and an encoding, which encodes the input set of records into an appropriate string. We have analysed several subclasses of computable queries and shown the equivalence of these subclasses to ones already defined in the database literature. One of our main results shows that the class of encoding-independent computable queries is equivalent to the well-known class of generic queries.
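
For reference, genericity can be stated (in its simplest form, without the usual allowance for a finite set of distinguished constants) as the requirement that a query q commutes with every permutation of the underlying domain:

    q(\rho(d)) = \rho(q(d)) \quad \text{for every database } d \ \text{and every domain permutation}\ \rho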

Object-Oriented and Graph-Based Database Models

Mark Levene with R. Offen (Macquarie University, Sydney) and A. Poulovassilis (King's College)

Object-oriented data models attempt to overcome the deficiencies of value-oriented data models, such as the relational model, and entity-based data models, such as the functional data model. In value-oriented models objects are identified by the values of their key attributes, whereas in object-oriented models object identity is independent of attribute values and is achieved by equipping each database object with a unique identifier. Several advantages accrue with attribute-independent identity: arbitrarily complex objects can be represented, objects can share common sub-objects, and updates are simplified. As well as attribute-independent object identity, object-oriented data models provide mechanisms for inheritance between classes of objects and for encapsulation of object structure and behaviour. This encapsulation of object semantics is not found in the traditional value-oriented or entity-based data models.

Object-oriented databases are a relatively recent database research area and still lack a well-accepted formal foundation. In previous work we have shown how hypergraphs can provide a simple unifying formalism for the main concepts of object-oriented data modelling, namely representation of objects, sharing of sub-objects, and inheritance. Following on from this work, we have developed a graph-based database model called the Hypernode Model. The main data structure of this model is the hypernode: a directed graph whose nodes may themselves be directed graphs. Hypernodes can be used to represent arbitrarily complex objects and can support the encapsulation of information to any level. We have developed a procedural query and update language for the hypernode model, called HNQL, which has been shown to be computationally complete.
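
A minimal sketch of the nesting and sharing that hypernodes provide is given below. This is an illustration of the data structure only (it is not HNQL), and the labels and values are invented.

    # Illustrative sketch: a hypernode is a directed graph whose nodes may
    # themselves be hypernodes, allowing arbitrary nesting and shared sub-objects.

    class Hypernode:
        def __init__(self, label):
            self.label = label
            self.nodes = set()      # atomic values or other Hypernode objects
            self.edges = set()      # directed edges (source, target) over self.nodes

        def add_edge(self, a, b):
            self.nodes.update((a, b))
            self.edges.add((a, b))

    address = Hypernode("address")
    address.add_edge("street", "Gower Street")

    person = Hypernode("person")
    person.add_edge("name", "Ada")
    person.add_edge("lives_at", address)       # nesting: a graph inside a graph

    company = Hypernode("company")
    company.add_edge("located_at", address)    # sharing of the same sub-object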

We have extended the work on the Hypernode Model by formalising a unified, graph-based data and process model, which can provide a rigorous, canonical approach to modelling for software engineering. This has been achieved by combining the Hypernode data model with a nested Petri net process model. We have demonstrated how data and process modelling can be combined seamlessly by defining process bases. A process base consists of a repository of hypernodes (which have a nested digraph structure) representing data entities, and a repository of process diagrams (which have a nested Petri net structure) representing the processes acting on the database. We have shown that, by constructing the process accessibility graph, we can answer questions with respect to the enabling of process nodes in process diagrams. In addition, we have shown that although the reachability problem with respect to process bases is, in the general case, undecidable, certain reachability problems, such as whether a process base is dead or potentially alive, are decidable and can be answered by constructing the reachability graph. A computer-aided software engineering modelling tool based on our approach is currently under development at the CSIRO-Macquarie Joint Research Centre for Advanced Systems Engineering in Sydney. This tool extends the initial prototype which was developed at UCL.

Semantic Data Modelling

Wilfred Siu Hung Ng (supervised by Mark Levene)

The number of new applications in need of database support is rapidly growing in different application areas such as Hypertext and CAD. It is essential that the data semantics in these new applications is well-understood, represented, and communicated by both users and database systems. In this context we are currently studying two aspects of semantics in relational databases.

The first aspect concerns the properties of ordering tuples in a relation. The relational model assumes that the way in which the information is arranged is unimportant, i.e., that relations are simple collections of records. However, information sometimes has a pre-defined positional arrangement; for example, Hypertext is naturally described as a network of information nodes and cannot easily be modelled within a relational system. Database researchers have recognised that we need a data model that can offer good support for an ordered data structure. Our investigation deals with the mechanism of organising such data.

We have already formulated the underlying structures needed to represent relations with linear orderings imposed on them. As a generalisation of our work, partially ordered relations will also be considered. This extension of the relational data model has several benefits. For example, ordered data relieves application programmers of some burden: it allows more efficient programming, since it is no longer necessary to iterate explicitly over the results of data manipulation with the aid of cursors. One of the nice features of ordered databases is that a conventional relation can be viewed as an equivalence class of ordered relations. This implies that ordered databases allow flexible database design and that they provide physical data independence. We have also formulated an ordered relational algebra and an ordered relational calculus in order to manipulate ordered databases. We are currently investigating the expressive power of these query languages. We are also looking into a category-theoretic justification of our results.

The second aspect of our work in database semantics concerns cardinality constraints (abbreviated to CCs). CCs formalise the idea of restricting the number of tuples (i.e. the cardinality) of projections of a relation. As a simple illustration, a CC can state that the number of lecturers in a relation Lecturer_record is less than the number of students in a relation Student_record. In general, a CC may be interrelational and of the form R[X] ≤ S[Y], where R and S are relation names (which are not necessarily distinct), and where X and Y are sets of attributes. We observe that CCs differ in two respects from functional dependencies (abbreviated to FDs) and some other commonly studied database dependencies. First, there does not exist an Armstrong database for CCs. Second, every database must satisfy a given CC or its reverse, i.e. the database must satisfy either R[X] ≤ S[Y] or S[Y] ≤ R[X]. The interaction between CCs and FDs is surprisingly complex, even in the case of FDs and unary CCs, which have only one attribute on each side of each dependency.
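
A small sketch of checking such a constraint is given below; the relation contents are invented, and the check simply compares the cardinalities of the two projections, which is what a constraint of the form R[X] ≤ S[Y] requires.

    # Sketch: checking an interrelational cardinality constraint R[X] <= S[Y].

    def project(relation, attributes):
        """Distinct projections of a relation (a list of dicts) onto the attributes."""
        return {tuple(t[a] for a in attributes) for t in relation}

    def satisfies_cc(r, x, s, y):
        return len(project(r, x)) <= len(project(s, y))

    lecturer_record = [{"name": "Smith"}, {"name": "Jones"}]
    student_record = [{"id": 1}, {"id": 2}, {"id": 3}]

    # Lecturer_record[name] <= Student_record[id] holds for these instances:
    print(satisfies_cc(lecturer_record, ["name"], student_record, ["id"]))   # True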

Hypertext Databases

Mark Levene and G. Loizou (Birkbeck College)

One of the main unsolved problems confronting Hypertext is the navigation problem, namely the problem of knowing where you are in the database graph representing the structure of a Hypertext database, and of knowing how to get to some other place in that graph that you are searching for. In order to tackle this problem we have introduced a formal model for Hypertext. In this model a Hypertext database consists of an information repository, which stores the contents of the database in the form of pages, and a reachability relation, which is a directed graph describing the structure of the database. The notion of a trail, which is a path in the database graph describing some logical association amongst the pages in the trail, is central to our model.

We have defined a Hypertext query language for our model based on a subset of propositional linear temporal logic, which we claim is a natural formalism on which to base a navigation semantics for Hypertext. The output of a trail query in this language is the set (which may be infinite) of all trails that logically imply the query. We have shown that there is a strong connection between the output of a trail query and finite automata, in the sense that, given a Hypertext database and a trail query, we can construct a finite automaton representing the output of the query, which accepts a star-free regular language. We have also shown that the construction of the finite automaton is exponential in the number of conjunctions in the trail query plus one.
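
To give the flavour of what a trail satisfying a temporal query means, the toy sketch below tests whether a trail visits pages satisfying a sequence of propositions in order, roughly an "eventually p1, then eventually p2" pattern. It is an illustration only and is not the query language or the automaton construction described above; the pages and propositions are invented.

    # Toy sketch: does a trail (a path of pages) satisfy an ordered sequence of
    # propositions? Pages are labelled with the propositions they satisfy.

    def trail_satisfies(trail, propositions, page_attrs):
        i = 0
        for page in trail:
            if i < len(propositions) and propositions[i] in page_attrs[page]:
                i += 1
        return i == len(propositions)

    page_attrs = {"home": {"intro"},
                  "p1": {"databases"},
                  "p2": {"hypertext", "navigation"}}

    print(trail_satisfies(["home", "p1", "p2"],
                          ["intro", "navigation"], page_attrs))   # True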

We have also investigated the computational complexity of model checking which, given a Hypertext database and a trail query, is the problem of deciding whether there exists a trail in the database that logically implies the query. We have shown that although the problem is NP-complete for different subsets of our query language, it is solvable in polynomial time for some significant special cases. Thus the navigation problem can be solved efficiently only in some special cases, and approximate solutions must therefore be sought in general.