Detecting Informative Web Page Blocks for Efficient Information Extraction Using Visual Block Segmentation

Title
Detecting Informative Web Page Blocks for Efficient Information Extraction Using Visual Block Segmentation
Author
최중민 (Joongmin Choi)
Issue Date
2007-11
Publisher
IEEE
Citation
2007 International Symposium on Information Technology Convergence (ISITC 2007), pp. 306-310
Abstract
As Web page structures grow more complicated, constructing wrapper induction rules becomes more difficult and time-consuming. The main problem in most wrapper induction methods is the difficulty of discriminating the meaningful blocks that contain the target information from the noise blocks that contain irrelevant information such as advertisements, menus, or copyright statements. To solve this problem, this paper proposes the RIPB (recognizing informative page blocks) algorithm, which detects the informative blocks in a Web page by exploiting a visual block segmentation scheme. RIPB uses a visual page segmentation algorithm to analyze and partition a Web page into a set of logical blocks, groups related blocks with similar structures into block clusters, and then recognizes the informative block clusters by applying heuristic rules to the cluster information. The results of a series of experiments indicate that RIPB helps improve the accuracy of information extraction by allowing the wrapper induction module to focus only on the informative blocks and ignore noise when building extraction rules.
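The pipeline the abstract describes (segment, cluster by structure, filter by heuristics) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Block` fields, the exact-match clustering on `tag_path`, and the thresholds `min_blocks` and `max_link_ratio` are hypothetical and are not taken from the paper, which does not state its heuristic rules in this record. The visual segmentation step itself is not implemented; the blocks are supplied by hand.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A logical page block, as a visual segmenter might emit it (hypothetical shape)."""
    tag_path: tuple      # structural signature, e.g. ("div", "ul", "li")
    text_len: int        # characters of visible text in the block
    link_text_len: int   # characters of that text that sit inside links


def cluster_blocks(blocks):
    """Group blocks whose structural signature is identical."""
    clusters = defaultdict(list)
    for block in blocks:
        clusters[block.tag_path].append(block)
    return dict(clusters)


def is_informative(cluster, min_blocks=2, max_link_ratio=0.5):
    """Assumed heuristic: a repeated structure whose text is mostly not link text."""
    total_text = sum(b.text_len for b in cluster)
    if len(cluster) < min_blocks or total_text == 0:
        return False
    link_ratio = sum(b.link_text_len for b in cluster) / total_text
    return link_ratio <= max_link_ratio


def informative_clusters(blocks):
    """Keep only the clusters that pass the assumed heuristic."""
    clusters = cluster_blocks(blocks)
    return {sig: c for sig, c in clusters.items() if is_informative(c)}


# Hand-made blocks standing in for segmenter output.
blocks = [
    Block(("div", "table", "tr"), 200, 20),   # record-like rows
    Block(("div", "table", "tr"), 180, 15),
    Block(("div", "ul", "li"), 30, 30),       # menu items, all link text
    Block(("div", "ul", "li"), 25, 25),
    Block(("div", "p"), 60, 0),               # single footer paragraph
]
result = informative_clusters(blocks)
```

In this toy input only the table-row cluster survives: the menu cluster is rejected for being entirely link text, and the lone paragraph is rejected for not repeating. A wrapper induction step would then be run over the surviving blocks only.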
URI
https://ieeexplore.ieee.org/document/4410655
https://repository.hanyang.ac.kr/handle/20.500.11754/107275
ISBN
0-7695-3045-1
DOI
10.1109/ISITC.2007.6
Appears in Collections:
COLLEGE OF ENGINEERING SCIENCES[E] > COMPUTER SCIENCE AND ENGINEERING > Articles