Design and Implementation of the RAKE Algorithm-based Web Crawler Using Real-Time Keyword Extraction

Author
장비
Alternative Author(s)
Zhang, Fei
Advisor(s)
조인휘
Issue Date
2018-02
Publisher
Hanyang University (한양대학교)
Degree
Master
Abstract
Big data is becoming increasingly popular, and big data technology now occupies an important position in people's lives. Data analysis has a profound impact on all industries, and web crawling in particular is increasingly widely used. However, with the rapid development of the Internet, the volume of online information is growing explosively. Because of this volume, the data obtained by a crawler cannot be quickly classified or retrieved, and it is difficult to find the desired data promptly within the mass of collected data. Extracting keywords in real time during crawling is therefore very helpful for data processing.
In this paper, we propose a web crawler system with a keyword extraction function. Existing research on keyword extraction in text mining is mostly based on documents or corpora that have already been collected into a database. The purpose of this paper, by contrast, is to establish a real-time keyword extraction system that extracts the keywords of a web page's text while the page is being crawled and stores them in the database together with the text, so that the needed information can be retrieved better and faster. To this end, we design and implement a crawler that incorporates the RAKE (Rapid Automatic Keyword Extraction) algorithm.
Extraction experiments were carried out on 100 news articles from the science column of YAHOO news. The number of keywords to extract was set to 6 per article, and keywords were also manually annotated for each article. The keywords extracted by the crawler using the original RAKE algorithm and by the crawler using the modified RAKE algorithm were each compared against the manually annotated keywords. The modified algorithm improves the performance of RAKE by increasing the weight of important features (such as nouns appearing in the title). The experimental results show that this method is superior to the existing method and extracts keywords satisfactorily.
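The approach described in the abstract can be illustrated with a minimal Python sketch of RAKE with a title boost. This is not the thesis implementation: the stopword list, the function names, the `title_boost` parameter and its default value are all assumptions made for illustration. The abstract says the weight of nouns appearing in the title is increased; because the sketch uses no part-of-speech tagger, it boosts every non-stopword that appears in the title instead.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (assumption); RAKE normally uses a much larger one.
STOPWORDS = frozenset(
    "a an and are as at be by for from has in is it its of on or "
    "that the this to was were which while with".split()
)


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[^\w\s'-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def word_scores(phrases):
    """Standard RAKE word score: degree(w) / frequency(w)."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    return {word: degree[word] / freq[word] for word in freq}


def extract_keywords(title, body, top_k=6, title_boost=2.0):
    """Return the top_k candidate phrases of body, ranked by boosted RAKE score.

    title_boost is a hypothetical multiplier applied to words that also occur
    in the title; title_boost=1.0 gives plain RAKE scoring.
    """
    phrases = candidate_phrases(body)
    scores = word_scores(phrases)
    title_words = {w for phrase in candidate_phrases(title) for w in phrase}
    ranked = {}
    for phrase in phrases:
        ranked[" ".join(phrase)] = sum(
            scores[w] * (title_boost if w in title_words else 1.0) for w in phrase
        )
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(ranked, key=lambda p: (-ranked[p], p))[:top_k]
```

In a crawler, a function like `extract_keywords` would be called on each page's title and body right after the page is fetched, and the returned phrases stored in the same database record as the text. The default `top_k=6` mirrors the six keywords per article used in the experiments.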
URI
https://repository.hanyang.ac.kr/handle/20.500.11754/68619
http://hanyang.dcollection.net/common/orgView/200000431969
Appears in Collections:
GRADUATE SCHOOL[S](대학원) > COMPUTER SCIENCE(컴퓨터·소프트웨어학과) > Theses (Master)