Ontology Matching for Semantic Search
Since the Semantic Web is distributive, numerous resource descriptions are
introduced where two concepts are equivalent, but are described using different
terms. The resolution of such terminology conflicts requires ontology engineering.
The word ontology refers to a hierarchical data structure containing the relevant
entities and their relationships.
In the field of Artificial Intelligence (AI), ontology applications have
been developed for knowledge management, e-Commerce, education, and for new
emerging directions like the Semantic Web.
The realization of the Semantic Web will require the construction of ontologies
for the various representation languages, query languages, and inference technology
that are needed. A semantic search engine would require ontology matching
between various sources. It would involve both "one-to-one" mappings
of concepts plus complex mappings such as "many-to-one" mappings
and attribute mappings.
In this article, we discuss ontology construction, matching, and mapping
for support of semantic search engines. We present three types of mapping
and an example mapping tool for each. As new tools and algorithms are developed
and tested, the best algorithms can be incorporated into search engine for
comparison.
Searching for Content
Today, the huge amount of information on the Web inhibits accurate search.
Search results can often produce volumes of irrelevant references while failing
to find the most desirable relationships. Because Web search engines use keywords
they are subject to the two well-known linguistic phenomena that strongly
degrade a query's precision and recall: Polysemy (one word might have several
meanings) and Synonymy (several words or phrases, might designate the same
concept).
There are several characteristics fundamental to search engines performance.
It is important to consider useful searches as distinct from fruitless ones.
There are three necessary criteria to evaluate searches as useful:
- Maximum relevant information.
- Minimum irrelevant information
- Meaningful ranking, with the most relevant results first.
The first of these criteria - getting all of the relevant information available
- is called recall. Without good recall, we have no guarantee that valid,
interesting results won't be left out of our result set. We want the rate
of false negatives - relevant results that we never see - to be as low as
possible.
The second criterion - minimizing irrelevant information so that the proportion
of relevant documents in our result set is very high - is called precision.
With too little precision, our useful results get diluted by irrelevancies,
and we are left with the task of sifting through a large set of documents
to find what we want. High precision means the lowest possible rate of false
positives.
There is an inevitable tradeoff between precision and recall. Search results
generally lie on a continuum of relevancy, so there is no distinct place where
relevant results stop and extraneous ones begin.
This is why the third criterion, ranking, is so important. Ranking has to
do with whether the result set is ordered in a way that matches our intuitive
understanding of what is more and what is less relevant. Of course the concept
of 'relevance' depends heavily on our own immediate needs, our interests,
and the context of our search. In an ideal world, search engines would learn
our individual preferences, so that they could fine-tune any search based
on our past interests.
Since the Semantic Web is distributive, there are a lot of resource descriptions
where two concepts are equivalent, but they are described using different
terms. The Semantic Web should enable information to be exchanged between
applications that are built using different terminology for similar concepts
(ontologies). Resolving polysemy and synonymy terms is central to information
processing across ontologies and requires mapping equivalent terms. A semantic
search engine should be able to process ontology mapping automatically.
Semantic Web data expressed in OWL or RDF documents can use ontology matching
for search engines. When the user inputs a query, the program transfers it
to a machine processing agent which evaluates the similarity between different
ontologies and returns the matched item to the user. The semantic search engine
will list the most similar concepts to the ontology that is entered as a keyword
input. The system should consider both lexical similarities and textual similarities.
In order to achieve this goal, it is necessary to compute the lexical similarities
of the structures of concepts among ontologies.
For example, suppose we want to relate a postal code and an address. The
Web is unable to represent this relationship since the names of the postal
code system may be different in different countries and as a result we may
not get what we expect.
By contrast, the Semantic Web could map the equivalence of a zip code to
an appropriate postal code in another country. However, it is difficult to
utilize such data on a large scale. To make the Semantic Web work, well-structured
data and rules are necessary for agents to roam the Web.
Semantic Search Algorithm
A semantic search engine uses ontology matching to produce a remarkably useful
application. It can discover if two documents are similar even if they do
not have any specific words in common and it can reject documents that share
only uninteresting words in common.
In broad terms what is involved is an algorithm that forms a Web of documents
and words - connecting all documents to all words. Given such a model of words
and documents one can then establish values based on the distance of documents
from each other. The 'value' of any document to any other document might be
designated as a function of the number of connections that must be traversed
to establish a connection between documents. If two documents are connected
by multiple routes then those documents might have a high degree of correlation.
Some of the preparatory work needed to get documents ready for comparison
is very language-specific, such as, stemming. For English documents, we could
use an algorithm called the Porter stemmer to remove common endings from words,
leaving behind an invariant root form.
The implementation algorithm for semantic search could follow as:
For each document:
1. Stem all of the words and throw away any common 'noise' words:
a) Make a complete list of all the words that appear in the collection
b) Discard articles, prepositions, and conjunctions
c) Discard common verbs (know, see, do, be)
d) Discard pronouns
e) Discard common adjectives (big, late, high)
f) Discard frilly words (therefore, thus, however, albeit, etc.)
g) Discard any words that appear in every document
h) Discard any words that appear in only one document
2. For each of the remaining words perform a ontology matching evaluation:
a) Visit and remember each document that has a direct relationship to this
word.
b) Score each document based on a distance function from the original document
and the relative scarcity of the word in common.
3. For each of the as-of-yet-unvisited new related documents now being tracked
Recursively perform the same operation as above.
One possible weighting algorithm could work like this: For each increase
in distance, divide a baseline score by two. Then the score of each document
is equal to the baseline divided by the square root of the popularity of the
word.
Overall this algorithm delivers a cheap semantic lookup based on walking
a document and word graph. There are many other scoring algorithms that could
be used.
The idea is then that the weighting algorithm feeds input into the semantic
algorithm which first stems the words appropriately, scores them according
to the semantic algorithm and sorts the results into the new rank order reflecting
the semantic analysis.
Ontology
The word ontology was originally applied in the field of Philosophy to the
concept of existence. It is the theory of objects and their interrelationships.
As used in information science, the term ontology frequently refers to a hierarchical
data structure containing the relevant entities and their relationships and
rules within that domain. An ontology which is not tied to a particular problem
domain but attempts to describe general entities is known as a foundation
ontology or upper ontology.
The realization of the Semantic Web will require the construction of ontologies
for the various representation languages, query languages, and inference technology
that are needed.
In the following example is used with permission of X. Wang and is taken
from www.charlestoncore.org. We will describe an ontology for a ‘spot.’
This ontology consists of three owl:Classes (spot, ellipse, and point) and
six rdf:Properties (shape, center, x-position,y-position, x-radius, y-radius).
Together, these vocabularies can be used to describe a spot. Figure 1 organizes
the relationships for these elements.
Figure 1 Example Ontology O1 for ‘spot’
Classes
The three OWL classes are:
Spot - Conceptually, a "spot" of a Two Dimensional defined as a
closed region on the plane of 2D.
Point – A point is defined as a location on a Cartesian plane. A Point
has two attributes; its x-position and y-position on an implicit coordinate
system of the plane.
Ellipse - Ellipse here is defined as a circle stretched along either x- or
y-axis of a coordinate system. The major and minor axis of an Ellipse shall
be parallel to the coordinates of the implicit coordinate system (See Figure
2).
Figure 2 Definition of an Ellipse.
Properties
The six RDF properties are:
Shape – A Spot is defined to be a closed region of a 2D plane, a Spot
therefore must assume a shape of certain kind and we require that the shape
be an Ellipse. Therefore the domain of shape is Spot and the range of Spot
is Ellipse.
Center - The center refers to the center point of the Ellipse. Therefore,
it has an rdfs:domain of Ellipse and an rdfs:range of Point.
x-position - An x-position is an owl:Datatype property that has a domain
of Point. Its value (of type xsd:double) indicates its distance from the origin
on the x-axis of the coordinate system.
y-position - A y-position is a owl:Datatype property that has a domain of
Point. Its value (of type xsd:double) indicates its distance from the origin
on the y-axis of the coordinate system.
x-radius- A x-radius is a owl:Datatype property that has a rdfs:domain of
Ellipse. It refers to the radius that is parallel to the x-axis of the coordinate
system (see Figure 2).
y-radius - A y-radius is a owl:Datatype property that has a rdfs:domain of
Ellipse. It refers to the radius that is parallel to the y-axis of the coordinate
system (see Figure 2)
The OWL file for this ontology example is as follows:
<?xml version="1.0" encoding="iso-8859-1" ?>
<!DOCTYPE rdf:RDF (...)>
<rdf:RDF xmlns="http:// example#"
xmlns:example="http:// example#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:dc="http://purl.org/dc/elements/1.1/"
xml:base="http:// /example">
<owl:Ontology rdf:about="">
<rdfs:isDefinedBy rdf:resource="http:// example/" />
<dc:author>Smith</dc:author>
<dc:title>Example Ontology</dc:title>
<rdfs:comment>This file defines a partial ontology in OWL</rdfs:comment>
<owl:versionInfo> 2005</owl:versionInfo>
</owl:Ontology>
<owl:Class rdf:ID="Spot" />
<owl:Class rdf:ID="Ellipse" />
<owl:Class rdf:ID="Point" />
<owl:ObjectProperty rdf:ID="shape">
<rdfs:domain rdf:resource="#Spot" />
<rdfs:range rdf:resource="#Ellipse" />
</owl:ObjectProperty>
<owl:ObjectProperty rdf:ID="center">
<rdfs:domain rdf:resource="#Ellipse" />
<rdfs:range rdf:resource="#Point" />
</owl:ObjectProperty>
<owl:DatatypeProperty rdf:ID="x-radius">
<rdfs:domain rdf:resource="#Ellipse" />
<rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#double"
/>
</owl:DatatypeProperty>
<owl:DatatypeProperty rdf:ID="y-radius">
<rdfs:domain rdf:resource="#Ellipse" />
<rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#double"
/>
</owl:DatatypeProperty>
<owl:DatatypeProperty rdf:ID="x-position">
<rdfs:domain rdf:resource="#Point" />
<rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#double"
/>
</owl:DatatypeProperty>
<owl:DatatypeProperty rdf:ID="y-position">
<rdfs:domain rdf:resource="#Point" />
<rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#double"
/>
</owl:DatatypeProperty>
</rdf:RDF>
Ontology describes concepts and relationships with a set of representational
vocabulary. The aim of building ontologies is to share and reuse knowledge.
Since the Semantic Web is a distributed network, there are different ontologies
that describe semantically equivalent things. As a result, it is necessary
to map elements of these ontologies if we want to process information on the
scale of the Web.
Ontology Matching
An ontology typically provides a vocabulary that describes a domain of interest
and a specification of the meaning of terms used in the vocabulary. Depending
on the precision of this specification, the notion of ontology encompasses
several conceptual models. However, different parties could adopt different
ontologies.
The problem of finding the semantic mappings between two given ontologies
is called ontology matching. Ontology matching is a promising solution. It
finds correspondences between semantically related entities. Thus, matching
ontologies enables the data expressed in the matched ontologies to interoperate.
Despite its pervasiveness, today ontology matching is still largely conducted
by hand and has become a bottleneck in information management.
String Matching
String matching is widely used in many fields such as text processing, image
and signal processing, information retrieval, pattern recognition, and pattern
matching in large databases. There are many string matching methods, among
which edit distance is a well-known method for measuring the similarities
of two strings. Suppose we have two strings S1 and S2, if we use limited steps
of edit operations (insertions, deletions, and substitutions of characters),
S1 can be transformed to S2 . Such a sequence is called edit sequence. The
edit distance is defined as the weight of edit sequence.
String matching can help in ontology matching. The existing ontology files
on the Web (see http://www.daml.org/ontologies) show that people usually use
similar elements to build ontologies, although the complexity of each ontology
may be different, and the terms may vary. This is because there are some properties
necessary to describe a concept and people have already established a name.
The value of String Matching lies in that it can estimate the lexical similarity.
However, we also need to consider the real meaning of the words and the context.
In addition, there are some words that are similar in alphabet form while
they have really different meaning such as "too" and "to".
Hence, it is not enough if we only use string matching. Other approaches to
improve the performance include normalization. Since edit distance does not
consider the lengths of the strings compared, which may produce a side effect
— a pair of long strings that differ only in one character may get the
same edit distance as that of a pair of short strings. For example, suppose
two words whose lengths are both 1000 only differ in one character, suppose
further another pair of words whose lengths are both 3 also differ in one
character. If we use traditional method to compute their edit distance, we
will get exactly the same value. However, this is an unfair result since the
long strings should get higher value. In this case, we should use an efficient
uniform-cost normalized edit distance.
Comparing Ontologies
An ontology can be represented in a taxonomy tree where each node represents
a concept with its attributes. Figure 3 shows a different spot (see Figure
1) ontologies in tree form. The concept spot on the left of Figure 1 should
be recognizable as an equivalent match to the spot on the left of Figure 3.
We can readily see that the only difference between these two figures (ontologies)
is the term ‘point’ has been replaced by ‘origin.’
Figure 3 Ontology O2 for ‘spot’
The aim of ontology matching is to map the semantically equivalent elements.
This is a one-to-one mapping of the simplest type. We can also map the different
types of elements: e.g., a particular relation maps to a particular attribute.
Mapping can be more complex if we want to map the combination of some elements
to a specific element.
An approach for semantic search can be based on text categorization for ontology
mapping than compares each element of an ontology with each element of the
other ontology, and then determines a similarity metric on a per pair basis.
Matched items are those whose similarity values are greater than a certain
threshold.
An ontology can be represented in taxonomy tree form where each node represents
a concept with its attributes. Figures 1 and 3, show two different ontologies.
For example, the concept spot on the left of Figure 1 has three classes: spot,
ellipse, and point. The aim of ontology matching is to map the semantically
equivalent elements. For example, “point” maps to “origin”
in Figure 3. This is a one-to-one mapping, the simplest type. We can also
map the different types of elements, e.g. a particular relation maps to a
particular attribute.
Similarity measures play a very significant role in ontology matching. A
set of similarity measures which measures similarity between ontologies at
two semiotic levels — the lexical level and the conceptual level is
different from previous approaches that use the formal structures of ontologies
and match the concept nodes. However, all ontologies in real-world not only
specify the conceptualization by logical structures, but also refer to terms
restricted by human natural languages use. For the two ontologiesO1 and O2,
we could compute a similarity measure and use that measure to decide whether
the onologies match. We could apply the notion of the joint probability distribution
between any two concepts as a similarity measure.
It is important to give a definition of similarity between two concepts because
this can make aims clear and facilitate the evaluation process. There are
many measures solely based on the joint probability distribution.
Ontology Mapping
Ontology mapping enables interoperability among different sources in the
Semantic Web. It is required for combing distributed and heterogeneous ontologies.
Ontology mapping transforms the source ontology into the target ontology based
on semantic relations at conceptual level. There are three approaches for
combing distributed and heterogeneous ontologies:
1. mapping between local ontologies,
2. mapping between integrated global ontology and local ontologies, and
3. mapping on ontology merging, integration, or alignment.
Ontology merge, integration, and alignment can be considered as ontology
reuse process.
Ontology merge is the process of generating a single, coherent ontology from
two or more existing and different ontologies on the same subject. The original
ontologies have similar or overlapping domains but they are different and
not revisions of the same ones.
Ontology integration is the process of generating a single ontology from
two or more differing ontologies in different subjects. The subjects of the
different ontologies may be related.
Ontology alignment creates links between two original ontologies. Ontology
alignment is made if the sources become consistent with each other but are
kept separate.
Ontology Mapping Tools
In this section (see Choi), we review three types of ontology mapping tools.
1/. Ontology mapping between local ontologies.
Example tool: GLUE
GLUE is a system which semi-automatically creates ontology mapping using
machine learning techniques. Given two ontologies, GLUE finds the most similar
concept in the other ontology. For similarity measurement between two concepts,
GLUE calculates the joint probability distribution of the concepts. GLUE uses
a multi-strategy learning approach for finding joint probability distribution.
GLUE has a Content Learner and Name Learner, and a Meta Learner.. The Content
Learner exploits the frequencies of words in the textual content of an instance
in order to make predictions and uses the Naïve Bayes’s theorem.
The Name Learner uses full name of the input instance. The Meta-learner combines
the predictions of base learners and assigns weight to base learners based
on how much it trusts that learner’s predictions.
2/. Ontology mappings between source ontology and integrated global ontology.
Example Tool LSD:
In LSD (Learning Source Description), Schema can be viewed as ontologies
with restricted relationship types. Therefore, the mediated schema can be
considered as a global ontology. The LSD system uses a multi-strategy learning
two phase approach: training and matching. In the matching phase, prediction
combiner combines meta-learner’s prediction and match the schema of
new input source to the mediated schema. This process can be considered as
ontology mapping between information sources and a global ontology.
3/. Ontology mapping in ontology merging, alignment, and integration.
Example Tool OntoMorph:
OntoMorph provides a powerful rule language for specifying mappings, and
facilitates ontology merging and the rapid generation of knowledge base translators.
It combines two syntactic rewriting and semantic rewriting. Syntactic rewriting
is done through pattern-directed rewrite rules for sentence-level transformation
based on pattern matching. Semantic rewriting is done through semantic models
and logical inference.
Conclusion
The development of the Semantic Web and semantic search requires much more
development in ontology engineering. Areas such as ontology mapping are still
in there early stages of development. As new tools and algorithms are developed
and tested, the best algorithms can be incorporated into search engine for
comparison.
Reference
1/. Alesso, H. P.,
Developing Semantic Web Services, A.
K. Peters, Ltd., 2004.
2/. H. Chalupsky. “Ontomorph: A Translation system for symbolic knowledge”
In Principles of Knowledge Rep-resentation and Reasoning, 2000.
3/. Doan, A., Madhavan, J., Domingos, P., Halevy, A., "Learning to Map
Between Ontologies on the Semantic Web, May 2002, http://www2002.org/CDROM/refereed/232/
4/. Doan, A., Domingos, P., Halevy, A., “Learning to Match the Schemas
of Data Sources: A Multistrategy Approach”, Machine Learning 50 (3):
279-301, March 2003,
5/. Wang, P., “A Search Engine Based on the Semantic Web,” M.S.
Thesis, University of Bristol, September 2003.
6/. Shvaiko, P., and Euzenat, J., "Tutorial on Schema and Ontology Matching,"
PowerPoint Presentation, ESWC05, 2005.
7/.Shvaiko, P. and Euzenat, J., "Ontology Matching," (see www.ontologymatching.org/index.html)
2005.
8/. Choi, Namyoun, draft, Ontology mapping and Ontology" 2006.
9/. Wang, X., "Spot example," www.charlestoncore.org, 2003.