Name Disambiguation and Web Resource Discovery for Author Profiling

Title
저자 프로파일링을 위한 이름 명확화 및 웹 정보 탐지 기법
Other Titles
Name Disambiguation and Web Resource Discovery for Author Profiling
Author
신동욱
Alternative Author(s)
Dongwook Shin
Advisor(s)
김정선
Issue Date
2014-02
Publisher
Hanyang University (한양대학교)
Degree
Doctor
Abstract
The main function of scholarly digital libraries is to provide search and browsing of papers that match a user's query. However, owing to the rapid growth in the number of papers and the diverse needs of users, research has been actively conducted on providing a variety of information and library services beyond simply returning search results that match a query. Likewise, many studies in the digital library domain have applied various analysis techniques to search vast amounts of scholarly material more efficiently and to support researchers' work effectively. Because most existing studies use ad hoc approaches, however, they depend on a particular domain or are hard to apply generally to the variety of real-world digital libraries, and they are ill-suited to understanding and modeling researchers' behavior. To address these problems, research is actively under way on author profiling: converting a document-centric bibliographic database into a person-centric form, collecting additional information from the web such as an author's affiliation, position, contact details, and research interests, and integrating the collected information in a structured way. The author profiling problem involves several issues, including uniquely identifying each author by resolving the author name ambiguity that arises when a document-centric bibliographic database is restructured into a person-centric form, and extracting the informative properties that make up an author's profile from distributed resources.
To support efficient author profiling, this dissertation proposes a name disambiguation and web resource discovery system (NDRD: Name Disambiguation and Web Resource Discovery for Author Profiling) that resolves author name ambiguity in digital libraries and discovers information on the web related to each author. The proposed system consists of an author name disambiguation component (GAND: Graph-based Author Name Disambiguation), a graph-based approach to resolving author name ambiguity, and an author-related web resource discovery component (ARD: Author-related Web Resource Discovery), which discovers author-related resources on the web. GAND builds a graph model from the relationships among authors and then resolves author name ambiguity through graph operations, such as vertex splitting and vertex merging, based on co-author relationships. A vertex that is contained in multiple longest cycles that are separated from one another is judged to hold information about several people who share a name, so it is split into multiple vertices according to co-author relationships, which resolves the namesake problem (same name, different people). Vertices with similar names that are connected to a common vertex are judged to be the same author and are merged into a single vertex, which resolves the heteronymous name problem (different names, same person).
ARD consists of an informative web page detection component (IPD: Informative Web Page Detection), which detects informative web pages about an author on the web, and an informative block identification component (IBI: Informative Block Identification), which identifies only the useful informative blocks in a detected page, excluding unnecessary data such as banners, navigation bars, and copyright notices. IPD constructs queries from the author's name and paper titles and sends them to a search engine. It then analyzes the returned search results using simple heuristic features, namely Uniform Resource Locator (URL) features and Hypertext Markup Language (HTML) features, to detect informative web pages. Next, IBI segments each detected page into blocks, which are logical units, based on visual information; extracts the word vector of each block and the occurrence rate of parts of speech within the block as block features; uses a classification model trained on these features to identify only the informative blocks among the blocks of the page; and classifies those informative blocks in detail by content type. Finally, to evaluate the proposed system, this dissertation assesses the efficiency of the name disambiguation technique and of the informative web page detection and informative block identification techniques through performance evaluation and analysis based on various experiments on each subsystem.
The primary task of scholarly digital libraries (DLs), such as CiteSeerX, DBLP, Arnetminer, and Academic Search, is simply to search and browse research papers that match a user's query. Recently, the rapidly growing number of research papers and the ever-changing needs of users have led to a myriad of studies on more diverse library services, such as research paper recommendation, joint research topic discovery, and various advanced search strategies. In the scholarly domain, considerable effort has gone into constructing author profiles and using them, through a variety of mining techniques, to facilitate effective retrieval of papers of interest from the huge volume of research papers and to assist researchers in their work. However, most previous studies on library services represent author profiles in their own ad hoc fashion, such as a vector of relevant terms, and such a profile is insufficient for modeling and understanding an author's behavior. To resolve these problems, author profiling is being actively studied.
Author profiling is a problem characterized by many challenging issues, including the resolution of ambiguities, the identification of information sources about researchers, the automatic extraction of researcher profiles from distributed sources, and the consistency and completeness of information. In this dissertation, we propose Name Disambiguation and Web Resource Discovery for Author Profiling (NDRD), which addresses two key issues in author profiling: the resolution of ambiguities (author name disambiguation) and the identification of information sources about researchers (author-related resource discovery). NDRD comprises two subframeworks: Graph-based Author Name Disambiguation (GAND) and Author-related Web Resource Discovery (ARD). GAND is a graph-oriented approach to the author name ambiguity problem. It constructs a graph model from the extracted information, in which a node denotes an author and an edge indicates a co-author relation. In GAND, the namesake problem is solved by splitting an author node that is involved in multiple non-overlapping cycles of co-authorship, and the heteronymous name problem is handled by identifying nodes that actually represent a single author under different names and merging them into one. ARD, which handles author-related web resource discovery, consists of two subsystems: Informative Web Page Detection (IPD) and Informative Block Identification (IBI). IPD detects informative web pages using heuristic features. It first submits two types of queries to a search engine, built from the researcher's name together with words frequently observed on the main pages of academic homepages and with the titles of papers written by that researcher, and then analyzes the search results using URL naming patterns and HTML title tag usage conventions.
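As a rough illustration of the node-splitting and node-merging idea behind GAND, the following Python sketch works on a small co-author graph. It is not the dissertation's algorithm: the cycle analysis is replaced by a simpler stand-in (connected components of the co-author graph once the ambiguous node is removed), and the name-similarity rule (same surname and first initial), the function names, and the sample author names are all assumptions made for this example.

```python
from collections import defaultdict


def build_coauthor_graph(papers):
    """Build an adjacency map: author name -> set of co-author names.

    `papers` is a list of author-name lists, one list per paper.
    """
    graph = defaultdict(set)
    for authors in papers:
        for a in authors:
            graph[a]  # touch the key so single-author papers still create a node
            for b in authors:
                if a != b:
                    graph[a].add(b)
    return graph


def split_namesakes(graph, name):
    """Group the co-authors of `name` into candidate identities.

    Assumed stand-in for cycle-based splitting: co-authors that remain
    connected to each other after `name` is removed are treated as
    belonging to the same real person behind `name`.
    """
    neighbors = set(graph[name])
    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node == name or node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        groups.append(sorted(component & neighbors))
    return groups


def name_key(name):
    """Assumed similarity rule: same surname and same first initial."""
    parts = name.replace(".", " ").split()
    return (parts[-1].lower(), parts[0][0].lower())


def merge_candidates(graph):
    """Return pairs of similar names that share at least one co-author."""
    names = sorted(graph)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if name_key(a) == name_key(b) and graph[a] & graph[b]:
                pairs.append((a, b))
    return pairs


# Hypothetical data: "J. Kim" publishes with two unrelated co-author groups.
split_graph = build_coauthor_graph([
    ["J. Kim", "A. Lee", "B. Park"],
    ["J. Kim", "B. Park", "C. Choi"],
    ["J. Kim", "X. Wang", "Y. Zhao"],
])
print(split_namesakes(split_graph, "J. Kim"))

# Hypothetical data: two spellings of one author share the co-author "S. Park".
merge_graph = build_coauthor_graph([
    ["M. Lee", "S. Park"],
    ["Minho Lee", "S. Park"],
])
print(merge_candidates(merge_graph))
```

In this sketch each group returned by `split_namesakes` would become its own author node, and each pair returned by `merge_candidates` would be collapsed into a single node.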
IBI is a supervised-learning approach that identifies informative blocks, along with their content types, in a researcher's homepage. IBI first splits a web page into a set of blocks based on visual information and then extracts content features from each block, including its term vector and POS tag occurrence rates. Finally, IBI classifies the blocks into block types based on these features using two supervised learning methods, a Naive Bayes classifier and an SVM. Three experiments were carried out to measure the effectiveness of the proposed approaches: author name disambiguation, informative web page detection, and informative block identification. The experimental results confirm that NDRD provides a simple but efficient solution to author name disambiguation and a promising starting point for author-related web resource discovery.
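The two block features named for IBI, a term vector and the occurrence rate of each POS tag, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the dissertation's feature extractor: the block is assumed to have been segmented and POS-tagged already, the Penn Treebank style tags in the sample are hypothetical input, and `block_features` is a name invented for this example. The classification step (Naive Bayes or SVM) is left out.

```python
from collections import Counter


def block_features(tagged_tokens):
    """Compute features for one page block.

    `tagged_tokens` is a list of (word, pos_tag) pairs, assumed to come
    from an external POS tagger. Returns (term_vector, pos_rate), where
    term_vector counts lowercased words and pos_rate maps each tag to
    its share of the tokens in the block.
    """
    words = [word.lower() for word, _ in tagged_tokens]
    tags = [tag for _, tag in tagged_tokens]
    term_vector = Counter(words)
    total = len(tags)
    pos_rate = {}
    if total:
        pos_rate = {tag: count / total for tag, count in Counter(tags).items()}
    return term_vector, pos_rate


# Hypothetical block from a researcher's homepage, already POS-tagged.
sample_block = [
    ("Research", "NN"),
    ("interests", "NNS"),
    ("include", "VBP"),
    ("data", "NN"),
    ("mining", "NN"),
]
terms, rates = block_features(sample_block)
print(terms)
print(rates)
```

Feature pairs of this kind could then be flattened into one vector per block and passed to whichever classifier is being trained.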
URI
https://repository.hanyang.ac.kr/handle/20.500.11754/130910
http://hanyang.dcollection.net/common/orgView/200000424135
Appears in Collections:
GRADUATE SCHOOL[S] > COMPUTER SCIENCE & ENGINEERING > Theses (Ph.D.)