400 0

Similarity Computation in Citation Networks

Title
Similarity Computation in Citation Networks
Author
마수드
Alternative Author(s)
마수드
Advisor(s)
Sang-Wook Kim
Issue Date
2016-08
Publisher
한양대학교, 대학원 컴퓨터·소프트웨어학과
Degree
Doctor
Abstract
Information networks consist of a large number of interacting physical, conceptual, human or social entities that are interconnected, creating a large and sophisticated network such as World Wide Web, social networks, and citation networks (i.e., extracted from academic literature data). An information network is represented as a graph where nodes correspond to entities and links do relations among entities. As an example, a citation network is represented as a graph (i.e., citation graph) where nodes correspond to scientific papers and links do citation relationships among papers. One of the most important aspects of information network analysis is to measure similarity between entities in the network. In this dissertation, we discuss similarity computation in information networks by mainly focusing on citation graphs. In the first part of the dissertation, we propose a novel method, SimCC (similarity based on content and citations), and its variants to compute the similarity of scientific papers. In contrast to link-based similarity measures and text-based similarity measures, SimCC considers both aspects of scientific papers, content and citation relationships in a citation graph, to compute the similarity. Furthermore, in contrast to hybrid methods, which consider both content and citations by simply combining the contents of papers involved in direct citation relationships, SimCC carefully considers the relation between the contents of papers having direct or indirect citation relationships by proposing the notion of a contribution score. The contribution score is a key factor in SimCC and indicates how much a paper contributes to another paper citing it directly or indirectly. SimCC consists of two orthogonal steps of feature extraction and similarity computation. For feature extraction, SimCC applies a a new RA (relevance and authority) weighting scheme, which enables SimCC to effectively reflects both content and authority of scientific papers simultaneously in similarity computation. For similarity computation, any text-based similarity measures applicable to vectors can be applied. In addition, we propose SimCC-AT (similarity based on content and citations with automatic parameter tuning), a variation of SimCC, which utilizes an automatic weighting scheme based on SVMrank and thus requires only a smaller number of experiments for parameter tuning than SimCC. Also, we propose SimCS (similarity based on contribution scores), another variation of SimCC, which considers the author dominance of papers in computing contribution scores. SimCS does not requires extra parameter tuning and shows higher accuracy in similarity computation than SimCC and SimCC-AT. We conducted extensive experiments on a real-world dataset of scientific papers to evaluate the effectiveness of the aforementioned methods in similarity computation in comparison to those of other existing methods. In the second part of the dissertation, we propose a novel link-based similarity measure, JacSim (Jaccard-based SimRank), for computing the similarity between nodes in a graph. JacSim solves a pairwise normalization problem, a counter-intuitive property of SimRank, in an effective way by combining two different computation manners: Jaccard coefïˇn ˛Acient and pairwise normalization. JacSim points out two problems of existing measures targeted at the pair- wise normalization problem and provides effective solutions to them as redundancy elimination and importance factor assignment. Since weighted graphs are more informative than un-weighted one, in order to take advantage of edge weights in similarity computation, we propose a weighted version of JacSim applicable to weighted graphs. Furthermore, to accelerate JacSim, we provide a linear recursive matrix form for JacSim, which computes the approximate similarity scores and composed of only linear operations. Although the JacSim matrix form computes approximate similarity scores, it provides a reasonable accuracy in similarity computation. We conducted extensive experiments on two real-world datasets, scientific papers and web pages, to evaluate both the effectiveness and efficiency of JacSim in comparison to those of other existing measures.|정보 네트워크란 서로 연결된 물리적, 개념적, 사회적 개체들로 이루어진 구조를 의미하며, 월드 와이드 웹, 소셜 네트워크, 인용 정보 네트워크 등은 대표적인 거대 정보 네트워크이다. 이 때, 인용 정보 네트워크란 학술 논문 데이터로부터 추출된 네트워크를 의미한다. 정보 네트워크는 각 개체들을 노드로, 개체들 간의 관계를 에지로 하는 그래프의 형태로 표현된다. 예를 들어, 인용 정보 네트워크는 학술 논문들을 노드로, 논문들 간의 인용 관계를 에지로 하는 그래프로 표현된다. 정보 네트워크 내 개체들 간의 유사도를 계산할 수 있는 척도에 대한 연구는 정보 네트워크 분석에 대한 주요 연구 주제 중 하나이다. 본 학위 논문에서는 정보 네트워크, 특히 인용 정보 네트워크에서의 유사도 계산에 대해 다루고자 한다. 본 학위 논문의 첫 번째 부분에서는, 내용 정보와 인용 정보를 모두 고려한 유사도 척도인 SimCC (similarity based on content and citations) 와 이를 이용하여 논문 간의 유사도를 계산하기 위한 방법들을 제안한다. 기존의 링크 기반 유사도 척도 또는 내용 기반 유사도 척도들과는 달리, SimCC는 논문의 내용 정보와 인용 정보를 모두 고려하여 유사도를 계산한다. 더 나아가, 기존의 하이브리드 방법들이 내용 정보와 인용 정보를 함께 고려하기 위해, 직접적으로 연결되어 있는 인용 논문들과 연관된 내용 정보를 단순히 결합하는 방식을 사용하였던 것과 달리, SimCC는 공헌 점수라는 개념을 새롭게 제안함으로써, 직접적 또는 간접적인 인용 관계를 갖고 있는 논문의 내용 정보 간의 관계를 신중하게 고려한다. 공헌 점수는 SimCC의 핵심 요소로, 특정 논문이 해당 논문을 인용하는 다른 논문에 직접적 또는 간접적으로 얼마나 공헌하는지를 나타낸다. SimCC는 서로 독립적인 두 단계로 이루어져 있다. 첫 번째 단계는 특징 추출 단계이고 두 번째 단계는 유사도 계산 단계이다. SimCC는 특징 추출을 위해 관련성과 권위를 고려하는 RA (relevance and authority) 가중치 부여 방법을 이용한다. RA 가중치 부여 방법을 통해 SimCC는 내용 정보와 논문의 권위를 동시에 유사도 계산에 효율적으로 반영할 수 있다. 유사도 계산 단계에서는 기존의 벡터 기반 유사도 척도를 활용한다. 또한, 본 학위 논문에서는 자동적인 파라미터 조정이 포함된 SimCC인 SimCC-AT (similarity based on content and citations with automatic parameter tuning) 를 제안한다. SimCC-AT는 $SVM^{rank}$에 기반한 자동적인 가중치 계산 방식을 이용하므로, 파라미터 조정을 위한 실험을 SimCC에 비해 적게 필요로 한다. 또한, 공헌 점수에 기반한 SimCC의 변형 방법인 SimCS (similarity based on contribution scores) 를 제안한다. SimCS는 공헌 점수 계산에 논문의 저자 영향력을 고려한다. SimCS는 추가적인 파라미터 조정을 필요로 하지 않으면서 SimCC나 SimCC-AT에 비해 유사도 계산에서 높은 정확도를 보인다. 본 학위논문에서는 위에서 언급한 방법들의 유사도 계산 효율성을 평가하기 위해 실세계의 논문 데이터에서 기존 방법들과의 비교 실험을 수행하였다. 본 학위 논문의 두 번째 부분에서는 그래프의 노드들 간의 유사도를 계산하기 위한 새로운 링크 기반 유사도 척도인 JacSim (Jaccard-based SimRank) 을 제안한다. JacSim은 서로 다른 두 가지 계산 방식인 자카드 계수와 쌍 정규화를 결합함으로써, SimRank의 직관에 반하는 특성인 쌍 정규화 문제를 효율적으로 해결한다. JacSim은 쌍 정규화 문제를 해결하기 위한 기존 유사도 척도들의 두 가지 문제점을 제시하고, 이를 효과적으로 해결하기 위해 중복 제거와 중요도 점수 조정을 해결방법으로 제시한다. 본 학위 논문에서는 가중치 그래프의 가중치 정보를 유사도 계산에 활용하기 위해, 가중치 그래프에 적용할 수 있는 JacSim의 가중치 버전을 제안한다. 더 나아가, JacSim을 더욱 빠르게 수행하기 위해, 유사도 점수의 근사치를 선형 연산만을 이용하여 계산할 수 있는 JacSim의 선형 재귀 행렬 계산 방식을 제시한다. 이를 통해 합리적 정확도를 갖는 유사도 점수의 근사치를 계산할 수 있다. 마지막으로, JacSim의 유효성과 효율성을 보이기 위해, 실제 논문 데이터와 웹페이지 데이터를 대상으로 기존 방법과의 다양한 비교 실험을 수행하였다.; Information networks consist of a large number of interacting physical, conceptual, human or social entities that are interconnected, creating a large and sophisticated network such as World Wide Web, social networks, and citation networks (i.e., extracted from academic literature data). An information network is represented as a graph where nodes correspond to entities and links do relations among entities. As an example, a citation network is represented as a graph (i.e., citation graph) where nodes correspond to scientific papers and links do citation relationships among papers. One of the most important aspects of information network analysis is to measure similarity between entities in the network. In this dissertation, we discuss similarity computation in information networks by mainly focusing on citation graphs. In the first part of the dissertation, we propose a novel method, SimCC (similarity based on content and citations), and its variants to compute the similarity of scientific papers. In contrast to link-based similarity measures and text-based similarity measures, SimCC considers both aspects of scientific papers, content and citation relationships in a citation graph, to compute the similarity. Furthermore, in contrast to hybrid methods, which consider both content and citations by simply combining the contents of papers involved in direct citation relationships, SimCC carefully considers the relation between the contents of papers having direct or indirect citation relationships by proposing the notion of a contribution score. The contribution score is a key factor in SimCC and indicates how much a paper contributes to another paper citing it directly or indirectly. SimCC consists of two orthogonal steps of feature extraction and similarity computation. For feature extraction, SimCC applies a a new RA (relevance and authority) weighting scheme, which enables SimCC to effectively reflects both content and authority of scientific papers simultaneously in similarity computation. For similarity computation, any text-based similarity measures applicable to vectors can be applied. In addition, we propose SimCC-AT (similarity based on content and citations with automatic parameter tuning), a variation of SimCC, which utilizes an automatic weighting scheme based on SVMrank and thus requires only a smaller number of experiments for parameter tuning than SimCC. Also, we propose SimCS (similarity based on contribution scores), another variation of SimCC, which considers the author dominance of papers in computing contribution scores. SimCS does not requires extra parameter tuning and shows higher accuracy in similarity computation than SimCC and SimCC-AT. We conducted extensive experiments on a real-world dataset of scientific papers to evaluate the effectiveness of the aforementioned methods in similarity computation in comparison to those of other existing methods. In the second part of the dissertation, we propose a novel link-based similarity measure, JacSim (Jaccard-based SimRank), for computing the similarity between nodes in a graph. JacSim solves a pairwise normalization problem, a counter-intuitive property of SimRank, in an effective way by combining two different computation manners: Jaccard coefïˇn ˛Acient and pairwise normalization. JacSim points out two problems of existing measures targeted at the pair- wise normalization problem and provides effective solutions to them as redundancy elimination and importance factor assignment. Since weighted graphs are more informative than un-weighted one, in order to take advantage of edge weights in similarity computation, we propose a weighted version of JacSim applicable to weighted graphs. Furthermore, to accelerate JacSim, we provide a linear recursive matrix form for JacSim, which computes the approximate similarity scores and composed of only linear operations. Although the JacSim matrix form computes approximate similarity scores, it provides a reasonable accuracy in similarity computation. We conducted extensive experiments on two real-world datasets, scientific papers and web pages, to evaluate both the effectiveness and efficiency of JacSim in comparison to those of other existing measures.
URI
https://repository.hanyang.ac.kr/handle/20.500.11754/125629http://hanyang.dcollection.net/common/orgView/200000486505
Appears in Collections:
GRADUATE SCHOOL[S](대학원) > COMPUTER SCIENCE(컴퓨터·소프트웨어학과) > Theses (Ph.D.)
Files in This Item:
There are no files associated with this item.
Export
RIS (EndNote)
XLS (Excel)
XML


qrcode

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

BROWSE