Domain Ontology Learning from Text Documents using Linguistic Patterns and Clustering Technique

Title
Domain Ontology Learning from Text Documents using Linguistic Patterns and Clustering Technique
Author
이크발카심
Alternative Author(s)
Iqbal Qasim
Advisor(s)
이동호
Issue Date
2013-08
Publisher
Hanyang University
Degree
Doctor
Abstract
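A domain ontology is used to resolve the semantic gap among the entities in a particular domain by minimizing conceptual and terminological incompatibilities among them. However, extracting knowledge from text documents and building a domain ontology manually is a very difficult process that requires lengthy collaboration among domain experts. Research on automating knowledge acquisition and ontology construction also faces many difficulties, owing to problems such as the lack of structured knowledge representation and the limitations of natural language processing technology. Therefore, several important problems must be overcome before domain ontologies can serve as an effective tool in real-world settings. The first is acquiring candidate domain terms from text documents and then effectively selecting, from the extracted terms, the key domain concepts needed to learn the domain ontology. The second is extracting taxonomic and non-taxonomic semantic relationships between domain concepts that agree with the actual knowledge of the domain. Recently, various studies have addressed the automatic or semi-automatic construction of domain ontologies. However, these studies have several limitations, such as heavy dependence on a domain-specific corpus and the manual expert effort required to assemble the domain corpus from selected literature. In addition, existing work uses only noun phrases for ontology construction and does not address anaphora resolution for pronouns, so important propositions in the text can be missed. This dissertation proposes a new domain ontology construction method for extracting knowledge from text documents. It differs from previous work in proposing a semi-automatic, domain-independent, unsupervised method based on linguistic patterns and clustering. The following techniques were developed for the construction process: 1) extraction of candidate terms using semantic linguistic patterns, and selection of domain concepts by clustering semantically related concepts; 2) extraction of taxonomic relationships between domain concepts using Hearst patterns; 3) extraction, labeling, and direction assignment of non-taxonomic relationships between domain concepts using linguistic patterns; 4) construction of a domain concept map as an intermediate knowledge representation; 5) construction of the domain ontology from the domain concept map. First, candidate terms are extracted from text documents using typed dependency rules. Next, a final similarity for each pair of candidate terms is computed from their semantic and structural similarity. These pairwise similarities are used as input to the affinity propagation algorithm. Affinity propagation is a data clustering algorithm that repeatedly exchanges messages between data points until a high-quality set of exemplars is found. All exemplar data points (that is, domain terms) found by affinity propagation are used as domain concepts for learning the domain ontology. Taxonomic and non-taxonomic relationships are then assigned between the candidate terms within each cluster to build the concept map. Finally, the domain ontology is obtained by transforming the concept map entities into ontology entities. The performance of the proposed system is verified through a range of experiments. The experiments show that the domain ontologies built with the proposed method agree with ontologies created manually by domain experts, and evaluation by domain experts confirms that the constructed ontologies agree closely with the knowledge of the information systems and academia domains.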
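The taxonomic step in item 2) can be illustrated with a minimal Python sketch of Hearst-style pattern matching. This is not the dissertation's implementation. It covers only two of Hearst's patterns ("X such as A, B and C" and "A, B and other X"), and it assumes the text has already been chunked so that each noun phrase is a single underscore-joined token. The names `hearst_pairs`, `SUCH_AS`, `AND_OTHER`, and the sample sentences are illustrative.

```python
import re

# Assumption: each noun phrase is one token, e.g. "relational_databases".
# The lookahead keeps the coordinators "and" / "or" from being read as nouns.
NP = r"(?!(?:and|or)\b)\w+"
NP_LIST = rf"(?:{NP}, )*{NP}(?:,? (?:and|or) {NP})?"

# "X such as A, B and C": hypernym first, hyponym list second.
SUCH_AS = re.compile(rf"\b({NP}),? such as ({NP_LIST})")
# "A, B and other X": hyponym list first, hypernym second.
AND_OTHER = re.compile(rf"\b((?:{NP}, )*{NP}),? (?:and|or) other ({NP})")


def _items(np_list):
    """Return the noun phrase tokens of a coordinated list."""
    return [t for t in re.findall(r"\w+", np_list) if t not in ("and", "or")]


def hearst_pairs(text):
    """Return sorted (hyponym, hypernym) pairs found in `text`."""
    text = text.lower()
    pairs = set()
    for m in SUCH_AS.finditer(text):
        pairs.update((hypo, m.group(1)) for hypo in _items(m.group(2)))
    for m in AND_OTHER.finditer(text):
        pairs.update((hypo, m.group(2)) for hypo in _items(m.group(1)))
    return sorted(pairs)


if __name__ == "__main__":
    sample = (
        "Storage_systems such as relational_databases, key_value_stores "
        "and file_systems keep data. Ontologies, taxonomies and other "
        "knowledge_models describe a domain."
    )
    for hypo, hyper in hearst_pairs(sample):
        print(f"{hypo} is-a {hyper}")
```

Each returned pair is a candidate is-a edge between two terms. A fuller system would match the patterns over parser output rather than raw tokens, which is why the pre-chunking assumption matters here.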
A domain ontology can bridge the semantic gap among the members of a domain by minimizing conceptual and terminological incompatibilities. However, extracting knowledge from text documents and learning a domain ontology manually is a difficult, controversial, lengthy, and time-consuming task that involves domain experts. Automatic or semi-automatic knowledge acquisition and ontology learning is also non-trivial, because structured knowledge representations are lacking and most of the data in documents is in free-text format. Therefore, several barriers must be overcome before domain ontologies become a practical and useful tool. The first important issue is the acquisition of candidate terms from text documents and the subsequent selection of domain concepts from these extracted terms for domain ontology learning. The second important issue is the extraction of taxonomic and non-taxonomic relationships between domain concepts, which contain the actual context of a domain. Recently, various approaches for automatic or semi-automatic construction of domain ontologies have been proposed. However, these approaches suffer from several limitations, such as heavy dependency on domain-specific corpora and the manual expert effort required to populate the domain corpus from selected literature. Also, these approaches consider only the noun phrases in the text documents and do not resolve anaphora for pronouns, which causes important propositions in the text to be missed and thereby decreases recall.
The proposed system presents a novel approach to domain ontology learning, defining new techniques for knowledge extraction from text documents. Its application of linguistic patterns and a clustering technique to free-text documents, which yields a semi-automatic, domain-independent, and unsupervised approach, distinguishes the proposed system from previous systems. We have developed the following methods for the domain ontology construction process: 1) extraction of candidate terms using semantic linguistic patterns and selection of domain concepts by clustering semantically related concepts, 2) extraction of taxonomic relationships between domain concepts using Hearst's patterns, 3) extraction, labeling, and assignment of direction of non-taxonomic relationships between domain concepts using linguistic patterns, 4) construction of a domain concept map from the extracted knowledge, used as an intermediate-level knowledge representation, and 5) construction of the domain ontology from the constructed concept map. First, we extract candidate terms from documents using typed dependency linguistic rules. Second, Diset similarities are calculated based on the semantic and structural similarity between pairs of extracted candidate terms. We then apply the affinity propagation algorithm, which takes as input the Diset similarities between pairs of extracted candidate terms, treated as data points. Real-valued messages are passed between candidate terms iteratively until a high-quality set of exemplars emerges. All exemplars are considered domain concepts for learning domain ontologies. Then, the extracted relationships are assigned between candidate terms in each cluster to complete the concept map. Finally, the domain ontology is obtained from the constructed concept map by transforming concept map entities into domain ontology entities. The whole methodology has been implemented using different programming tools, providing a scalable solution. We verify the appropriateness of the proposed system through experiments. Our empirical results show that the semi-automatically constructed domain ontologies conform to the outputs generated manually by domain experts, since the degree of difference between them is proportionally small. Also, domain experts have verified that the constructed domain ontologies are highly in accordance with their knowledge and perception of the information system domain and the academia domain.
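The clustering step described in the abstract can be sketched in plain Python as follows. This is a minimal illustration of the standard affinity propagation message updates (responsibilities and availabilities with damping), not the dissertation's code. The Diset similarity is not defined in this record, so the sketch substitutes a toy similarity: each term is given a made-up 1-D coordinate and similarity is the negative squared distance. The function names, the term list, the coordinates, and the preference value of -10 are all assumptions.

```python
def similarity_matrix(coords, preference):
    """Toy stand-in for pairwise term similarity (assumption, not Diset)."""
    n = len(coords)
    S = [[-float((coords[i] - coords[k]) ** 2) for k in range(n)] for i in range(n)]
    for k in range(n):
        S[k][k] = preference  # how willing point k is to be an exemplar
    return S


def affinity_propagation(S, damping=0.5, iterations=200):
    """Cluster n >= 2 points from an n x n similarity matrix S.

    Returns (exemplars, labels); labels[i] is the exemplar index chosen for i.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i,k) = s(i,k) - max over j != k of [a(i,j) + s(i,j)]
        for i in range(n):
            for k in range(n):
                rival = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # a(i,k) = min(0, r(k,k) + sum over j not in {i,k} of max(0, r(j,k)))
        # a(k,k) = sum over j != k of max(0, r(j,k))
        for k in range(n):
            support = [max(0.0, R[j][k]) for j in range(n)]
            total = sum(support) - support[k]
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - support[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # A point is an exemplar when its self-responsibility plus
    # self-availability is positive.
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return [], [None] * n
    labels = [i if i in exemplars else max(exemplars, key=lambda e: S[i][e])
              for i in range(n)]
    return exemplars, labels


if __name__ == "__main__":
    terms = ["table", "database", "query", "lecture", "course", "exam"]
    coords = [0, 1, 2, 10, 11, 12]  # made-up positions, two obvious groups
    exemplars, labels = affinity_propagation(similarity_matrix(coords, -10.0))
    for term, label in zip(terms, labels):
        print(term, "->", terms[label])
```

In this toy setup the exemplar of each group plays the role of a domain concept, and the other terms in its cluster are the candidates that relationships would later be assigned between. Lowering the preference yields fewer exemplars, and raising it yields more.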
URI
https://repository.hanyang.ac.kr/handle/20.500.11754/133209
http://hanyang.dcollection.net/common/orgView/200000422218
Appears in Collections:
GRADUATE SCHOOL[S](대학원) > COMPUTER SCIENCE & ENGINEERING(컴퓨터공학과) > Theses (Ph.D.)
Files in This Item:
There are no files associated with this item.
