Ontology Matching for Semantic Search

Since the Semantic Web is distributive, numerous resource descriptions are introduced where two concepts are equivalent, but are described using different terms. The resolution of such terminology conflicts requires ontology engineering. The word ontology refers to a hierarchical data structure containing the relevant entities and their relationships.

In the field of Artificial Intelligence (AI), ontology applications have been developed for knowledge management, e-Commerce, education, and for new emerging directions like the Semantic Web.

The realization of the Semantic Web will require the construction of ontologies for the various representation languages, query languages, and inference technology that are needed. A semantic search engine would require ontology matching between various sources. It would involve both "one-to-one" mappings of concepts plus complex mappings such as "many-to-one" mappings and attribute mappings.

In this article, we discuss ontology construction, matching, and mapping for support of semantic search engines. We present three types of mapping and an example mapping tool for each. As new tools and algorithms are developed and tested, the best algorithms can be incorporated into search engine for comparison.

Searching for Content

Today, the huge amount of information on the Web inhibits accurate search. Search results can often produce volumes of irrelevant references while failing to find the most desirable relationships. Because Web search engines use keywords they are subject to the two well-known linguistic phenomena that strongly degrade a query's precision and recall: Polysemy (one word might have several meanings) and Synonymy (several words or phrases, might designate the same concept).

There are several characteristics fundamental to search engines performance. It is important to consider useful searches as distinct from fruitless ones. There are three necessary criteria to evaluate searches as useful:

  • Maximum relevant information.
  • Minimum irrelevant information
  • Meaningful ranking, with the most relevant results first.

The first of these criteria - getting all of the relevant information available - is called recall. Without good recall, we have no guarantee that valid, interesting results won't be left out of our result set. We want the rate of false negatives - relevant results that we never see - to be as low as possible.

The second criterion - minimizing irrelevant information so that the proportion of relevant documents in our result set is very high - is called precision. With too little precision, our useful results get diluted by irrelevancies, and we are left with the task of sifting through a large set of documents to find what we want. High precision means the lowest possible rate of false positives.

There is an inevitable tradeoff between precision and recall. Search results generally lie on a continuum of relevancy, so there is no distinct place where relevant results stop and extraneous ones begin.

This is why the third criterion, ranking, is so important. Ranking has to do with whether the result set is ordered in a way that matches our intuitive understanding of what is more and what is less relevant. Of course the concept of 'relevance' depends heavily on our own immediate needs, our interests, and the context of our search. In an ideal world, search engines would learn our individual preferences, so that they could fine-tune any search based on our past interests.

Since the Semantic Web is distributive, there are a lot of resource descriptions where two concepts are equivalent, but they are described using different terms. The Semantic Web should enable information to be exchanged between applications that are built using different terminology for similar concepts (ontologies). Resolving polysemy and synonymy terms is central to information processing across ontologies and requires mapping equivalent terms. A semantic search engine should be able to process ontology mapping automatically.

Semantic Web data expressed in OWL or RDF documents can use ontology matching for search engines. When the user inputs a query, the program transfers it to a machine processing agent which evaluates the similarity between different ontologies and returns the matched item to the user. The semantic search engine will list the most similar concepts to the ontology that is entered as a keyword input. The system should consider both lexical similarities and textual similarities. In order to achieve this goal, it is necessary to compute the lexical similarities of the structures of concepts among ontologies.

For example, suppose we want to relate a postal code and an address. The Web is unable to represent this relationship since the names of the postal code system may be different in different countries and as a result we may not get what we expect.

By contrast, the Semantic Web could map the equivalence of a zip code to an appropriate postal code in another country. However, it is difficult to utilize such data on a large scale. To make the Semantic Web work, well-structured data and rules are necessary for agents to roam the Web.

Semantic Search Algorithm

A semantic search engine uses ontology matching to produce a remarkably useful application. It can discover if two documents are similar even if they do not have any specific words in common and it can reject documents that share only uninteresting words in common.

In broad terms what is involved is an algorithm that forms a Web of documents and words - connecting all documents to all words. Given such a model of words and documents one can then establish values based on the distance of documents from each other. The 'value' of any document to any other document might be designated as a function of the number of connections that must be traversed to establish a connection between documents. If two documents are connected by multiple routes then those documents might have a high degree of correlation.

Some of the preparatory work needed to get documents ready for comparison is very language-specific, such as, stemming. For English documents, we could use an algorithm called the Porter stemmer to remove common endings from words, leaving behind an invariant root form.

The implementation algorithm for semantic search could follow as:

For each document:

1. Stem all of the words and throw away any common 'noise' words:
a) Make a complete list of all the words that appear in the collection
b) Discard articles, prepositions, and conjunctions
c) Discard common verbs (know, see, do, be)
d) Discard pronouns
e) Discard common adjectives (big, late, high)
f) Discard frilly words (therefore, thus, however, albeit, etc.)
g) Discard any words that appear in every document
h) Discard any words that appear in only one document

2. For each of the remaining words perform a ontology matching evaluation:
a) Visit and remember each document that has a direct relationship to this word.
b) Score each document based on a distance function from the original document and the relative scarcity of the word in common.

3. For each of the as-of-yet-unvisited new related documents now being tracked

Recursively perform the same operation as above.

One possible weighting algorithm could work like this: For each increase in distance, divide a baseline score by two. Then the score of each document is equal to the baseline divided by the square root of the popularity of the word.

Overall this algorithm delivers a cheap semantic lookup based on walking a document and word graph. There are many other scoring algorithms that could be used.

The idea is then that the weighting algorithm feeds input into the semantic algorithm which first stems the words appropriately, scores them according to the semantic algorithm and sorts the results into the new rank order reflecting the semantic analysis.

Ontology

The word ontology was originally applied in the field of Philosophy to the concept of existence. It is the theory of objects and their interrelationships. As used in information science, the term ontology frequently refers to a hierarchical data structure containing the relevant entities and their relationships and rules within that domain. An ontology which is not tied to a particular problem domain but attempts to describe general entities is known as a foundation ontology or upper ontology.

The realization of the Semantic Web will require the construction of ontologies for the various representation languages, query languages, and inference technology that are needed.

In the following example is used with permission of X. Wang and is taken from www.charlestoncore.org. We will describe an ontology for a ‘spot.’ This ontology consists of three owl:Classes (spot, ellipse, and point) and six rdf:Properties (shape, center, x-position,y-position, x-radius, y-radius). Together, these vocabularies can be used to describe a spot. Figure 1 organizes the relationships for these elements.


Figure 1 Example Ontology O1 for ‘spot’

Classes
The three OWL classes are:

Spot - Conceptually, a "spot" of a Two Dimensional defined as a closed region on the plane of 2D.

Point – A point is defined as a location on a Cartesian plane. A Point has two attributes; its x-position and y-position on an implicit coordinate system of the plane.

Ellipse - Ellipse here is defined as a circle stretched along either x- or y-axis of a coordinate system. The major and minor axis of an Ellipse shall be parallel to the coordinates of the implicit coordinate system (See Figure 2).

Figure 2 Definition of an Ellipse.

Properties

The six RDF properties are:

Shape – A Spot is defined to be a closed region of a 2D plane, a Spot therefore must assume a shape of certain kind and we require that the shape be an Ellipse. Therefore the domain of shape is Spot and the range of Spot is Ellipse.

Center - The center refers to the center point of the Ellipse. Therefore, it has an rdfs:domain of Ellipse and an rdfs:range of Point.

x-position - An x-position is an owl:Datatype property that has a domain of Point. Its value (of type xsd:double) indicates its distance from the origin on the x-axis of the coordinate system.

y-position - A y-position is a owl:Datatype property that has a domain of Point. Its value (of type xsd:double) indicates its distance from the origin on the y-axis of the coordinate system.

x-radius- A x-radius is a owl:Datatype property that has a rdfs:domain of Ellipse. It refers to the radius that is parallel to the x-axis of the coordinate system (see Figure 2).

y-radius - A y-radius is a owl:Datatype property that has a rdfs:domain of Ellipse. It refers to the radius that is parallel to the y-axis of the coordinate system (see Figure 2)

The OWL file for this ontology example is as follows:

<?xml version="1.0" encoding="iso-8859-1" ?>
<!DOCTYPE rdf:RDF (...)>
<rdf:RDF xmlns="http:// example#"
xmlns:example="http:// example#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:dc="http://purl.org/dc/elements/1.1/"
xml:base="http:// /example">
<owl:Ontology rdf:about="">
<rdfs:isDefinedBy rdf:resource="http:// example/" />
<dc:author>Smith</dc:author>
<dc:title>Example Ontology</dc:title>
<rdfs:comment>This file defines a partial ontology in OWL</rdfs:comment>
<owl:versionInfo> 2005</owl:versionInfo>
</owl:Ontology>
<owl:Class rdf:ID="Spot" />
<owl:Class rdf:ID="Ellipse" />
<owl:Class rdf:ID="Point" />
<owl:ObjectProperty rdf:ID="shape">
<rdfs:domain rdf:resource="#Spot" />
<rdfs:range rdf:resource="#Ellipse" />
</owl:ObjectProperty>
<owl:ObjectProperty rdf:ID="center">
<rdfs:domain rdf:resource="#Ellipse" />
<rdfs:range rdf:resource="#Point" />
</owl:ObjectProperty>
<owl:DatatypeProperty rdf:ID="x-radius">
<rdfs:domain rdf:resource="#Ellipse" />
<rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#double" />
</owl:DatatypeProperty>
<owl:DatatypeProperty rdf:ID="y-radius">
<rdfs:domain rdf:resource="#Ellipse" />
<rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#double" />
</owl:DatatypeProperty>
<owl:DatatypeProperty rdf:ID="x-position">
<rdfs:domain rdf:resource="#Point" />
<rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#double" />
</owl:DatatypeProperty>
<owl:DatatypeProperty rdf:ID="y-position">
<rdfs:domain rdf:resource="#Point" />
<rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#double" />
</owl:DatatypeProperty>
</rdf:RDF>

Ontology describes concepts and relationships with a set of representational vocabulary. The aim of building ontologies is to share and reuse knowledge. Since the Semantic Web is a distributed network, there are different ontologies that describe semantically equivalent things. As a result, it is necessary to map elements of these ontologies if we want to process information on the scale of the Web.

Ontology Matching

An ontology typically provides a vocabulary that describes a domain of interest and a specification of the meaning of terms used in the vocabulary. Depending on the precision of this specification, the notion of ontology encompasses several conceptual models. However, different parties could adopt different ontologies.

The problem of finding the semantic mappings between two given ontologies is called ontology matching. Ontology matching is a promising solution. It finds correspondences between semantically related entities. Thus, matching ontologies enables the data expressed in the matched ontologies to interoperate.

Despite its pervasiveness, today ontology matching is still largely conducted by hand and has become a bottleneck in information management.

String Matching

String matching is widely used in many fields such as text processing, image and signal processing, information retrieval, pattern recognition, and pattern matching in large databases. There are many string matching methods, among which edit distance is a well-known method for measuring the similarities of two strings. Suppose we have two strings S1 and S2, if we use limited steps of edit operations (insertions, deletions, and substitutions of characters), S1 can be transformed to S2 . Such a sequence is called edit sequence. The edit distance is defined as the weight of edit sequence.

String matching can help in ontology matching. The existing ontology files on the Web (see http://www.daml.org/ontologies) show that people usually use similar elements to build ontologies, although the complexity of each ontology may be different, and the terms may vary. This is because there are some properties necessary to describe a concept and people have already established a name.

The value of String Matching lies in that it can estimate the lexical similarity. However, we also need to consider the real meaning of the words and the context. In addition, there are some words that are similar in alphabet form while they have really different meaning such as "too" and "to". Hence, it is not enough if we only use string matching. Other approaches to improve the performance include normalization. Since edit distance does not consider the lengths of the strings compared, which may produce a side effect — a pair of long strings that differ only in one character may get the same edit distance as that of a pair of short strings. For example, suppose two words whose lengths are both 1000 only differ in one character, suppose further another pair of words whose lengths are both 3 also differ in one character. If we use traditional method to compute their edit distance, we will get exactly the same value. However, this is an unfair result since the long strings should get higher value. In this case, we should use an efficient uniform-cost normalized edit distance.
Comparing Ontologies

An ontology can be represented in a taxonomy tree where each node represents a concept with its attributes. Figure 3 shows a different spot (see Figure 1) ontologies in tree form. The concept spot on the left of Figure 1 should be recognizable as an equivalent match to the spot on the left of Figure 3. We can readily see that the only difference between these two figures (ontologies) is the term ‘point’ has been replaced by ‘origin.’


Figure 3 Ontology O2 for ‘spot’

The aim of ontology matching is to map the semantically equivalent elements. This is a one-to-one mapping of the simplest type. We can also map the different types of elements: e.g., a particular relation maps to a particular attribute. Mapping can be more complex if we want to map the combination of some elements to a specific element.

An approach for semantic search can be based on text categorization for ontology mapping than compares each element of an ontology with each element of the other ontology, and then determines a similarity metric on a per pair basis. Matched items are those whose similarity values are greater than a certain threshold.

An ontology can be represented in taxonomy tree form where each node represents a concept with its attributes. Figures 1 and 3, show two different ontologies. For example, the concept spot on the left of Figure 1 has three classes: spot, ellipse, and point. The aim of ontology matching is to map the semantically equivalent elements. For example, “point” maps to “origin” in Figure 3. This is a one-to-one mapping, the simplest type. We can also map the different types of elements, e.g. a particular relation maps to a particular attribute.

Similarity measures play a very significant role in ontology matching. A set of similarity measures which measures similarity between ontologies at two semiotic levels — the lexical level and the conceptual level is different from previous approaches that use the formal structures of ontologies and match the concept nodes. However, all ontologies in real-world not only specify the conceptualization by logical structures, but also refer to terms restricted by human natural languages use. For the two ontologiesO1 and O2, we could compute a similarity measure and use that measure to decide whether the onologies match. We could apply the notion of the joint probability distribution between any two concepts as a similarity measure.

It is important to give a definition of similarity between two concepts because this can make aims clear and facilitate the evaluation process. There are many measures solely based on the joint probability distribution.

Ontology Mapping

Ontology mapping enables interoperability among different sources in the Semantic Web. It is required for combing distributed and heterogeneous ontologies. Ontology mapping transforms the source ontology into the target ontology based on semantic relations at conceptual level. There are three approaches for combing distributed and heterogeneous ontologies:

1. mapping between local ontologies,
2. mapping between integrated global ontology and local ontologies, and
3. mapping on ontology merging, integration, or alignment.

Ontology merge, integration, and alignment can be considered as ontology reuse process.

Ontology merge is the process of generating a single, coherent ontology from two or more existing and different ontologies on the same subject. The original ontologies have similar or overlapping domains but they are different and not revisions of the same ones.

Ontology integration is the process of generating a single ontology from two or more differing ontologies in different subjects. The subjects of the different ontologies may be related.

Ontology alignment creates links between two original ontologies. Ontology alignment is made if the sources become consistent with each other but are kept separate.

Ontology Mapping Tools

In this section (see Choi), we review three types of ontology mapping tools.

1/. Ontology mapping between local ontologies.

Example tool: GLUE

GLUE is a system which semi-automatically creates ontology mapping using machine learning techniques. Given two ontologies, GLUE finds the most similar concept in the other ontology. For similarity measurement between two concepts, GLUE calculates the joint probability distribution of the concepts. GLUE uses a multi-strategy learning approach for finding joint probability distribution.

GLUE has a Content Learner and Name Learner, and a Meta Learner.. The Content Learner exploits the frequencies of words in the textual content of an instance in order to make predictions and uses the Naïve Bayes’s theorem. The Name Learner uses full name of the input instance. The Meta-learner combines the predictions of base learners and assigns weight to base learners based on how much it trusts that learner’s predictions.

2/. Ontology mappings between source ontology and integrated global ontology.

Example Tool LSD:

In LSD (Learning Source Description), Schema can be viewed as ontologies with restricted relationship types. Therefore, the mediated schema can be considered as a global ontology. The LSD system uses a multi-strategy learning two phase approach: training and matching. In the matching phase, prediction combiner combines meta-learner’s prediction and match the schema of new input source to the mediated schema. This process can be considered as ontology mapping between information sources and a global ontology.

3/. Ontology mapping in ontology merging, alignment, and integration.

Example Tool OntoMorph:

OntoMorph provides a powerful rule language for specifying mappings, and facilitates ontology merging and the rapid generation of knowledge base translators. It combines two syntactic rewriting and semantic rewriting. Syntactic rewriting is done through pattern-directed rewrite rules for sentence-level transformation based on pattern matching. Semantic rewriting is done through semantic models and logical inference.

Conclusion

The development of the Semantic Web and semantic search requires much more development in ontology engineering. Areas such as ontology mapping are still in there early stages of development. As new tools and algorithms are developed and tested, the best algorithms can be incorporated into search engine for comparison.

Reference

1/. Alesso, H. P., Developing Semantic Web Services, A. K. Peters, Ltd., 2004.

2/. H. Chalupsky. “Ontomorph: A Translation system for symbolic knowledge” In Principles of Knowledge Rep-resentation and Reasoning, 2000.

3/. Doan, A., Madhavan, J., Domingos, P., Halevy, A., "Learning to Map Between Ontologies on the Semantic Web, May 2002, http://www2002.org/CDROM/refereed/232/

4/. Doan, A., Domingos, P., Halevy, A., “Learning to Match the Schemas of Data Sources: A Multistrategy Approach”, Machine Learning 50 (3): 279-301, March 2003,

5/. Wang, P., “A Search Engine Based on the Semantic Web,” M.S. Thesis, University of Bristol, September 2003.

6/. Shvaiko, P., and Euzenat, J., "Tutorial on Schema and Ontology Matching," PowerPoint Presentation, ESWC’05, 2005.

7/.Shvaiko, P. and Euzenat, J., "Ontology Matching," (see www.ontologymatching.org/index.html) 2005.

8/. Choi, Namyoun, draft, “Ontology mapping and Ontology" 2006.

9/. Wang, X., "Spot example," www.charlestoncore.org, 2003.