Repository at Hanyang University: Entropy-based Valid Token Determination in Open-ended Text Generation



Title: Entropy-based Valid Token Determination in Open-ended Text Generation

Author: 정민지

Advisor(s): Yong Suk Choi

Issue Date: 2024-02

Publisher: Hanyang University Graduate School (한양대학교 대학원)

Degree: Master

Abstract: In open-ended text generation, surface form competition occurs when multiple valid token candidates exist at the same timestep, each with a different probability mass. Sampling methods are therefore used to generate sentences that take a variety of candidates into account. However, although valid tokens are known to sometimes receive small probability mass, existing studies have defined the range of possible tokens (i.e., the sampling candidates) simply as the tokens at the top of the probability distribution. Motivated by the fact that possible tokens are not clearly defined, we investigated the characteristics of timesteps at which a token is valid and timesteps at which it is invalid during the generation process. Based on this investigation, we propose a novel, training-free method that determines whether a token can be placed at the current timestep by calculating the entropy of the probability distribution, and we show that the binary classifier built with this method has good discriminative ability. To demonstrate the usefulness of the method, we apply the binary classifier to Keyword2Text, a lexically constrained text generation model that suffers from surface form competition. The proposed method reduces generation time by an average of 81.92% and improves the quality of the generated text.
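
As an illustration of the kind of computation the abstract refers to, the following minimal Python sketch calculates the Shannon entropy of a next-token probability distribution and turns it into a binary decision. The abstract does not state the thesis's actual decision rule, so the function names (`shannon_entropy`, `entropy_gate`), the threshold value, and the direction of the comparison are all hypothetical and serve only to show the general idea.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    # Zero-probability entries contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_gate(probs, threshold):
    """Hypothetical binary classifier over a timestep's distribution.

    Returns True when the entropy is at or above `threshold`.
    Both the threshold and the direction of the comparison are
    assumptions for illustration, not the rule used in the thesis.
    """
    return shannon_entropy(probs) >= threshold


# Toy distributions over a four-token vocabulary (made-up values).
peaked = [0.97, 0.01, 0.01, 0.01]  # one dominant candidate
flat = [0.25, 0.25, 0.25, 0.25]    # many equally likely candidates

peaked_decision = entropy_gate(peaked, threshold=1.0)
flat_decision = entropy_gate(flat, threshold=1.0)
```

Because the decision depends only on the model's output distribution, a gate of this kind requires no additional training, which matches the training-free property described in the abstract.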

URI: http://hanyang.dcollection.net/common/orgView/200000719606 https://repository.hanyang.ac.kr/handle/20.500.11754/188346

Appears in Collections: GRADUATE SCHOOL[S](대학원) > DEPARTMENT OF INTELLIGENCE AND CONVERGENCE(지능융합학과) > Theses (Master)
