The Korean Association of Language Sciences


pISSN: 1225-2522


Journal of Language Sciences (언어과학), Vol. 26 (2019)
pp. 51-70

DOI : 10.14384/kals.2019.26.1.051

Unsupervised Topic Modeling Using TOEFL11

윤태진

(Associate Professor, Sungshin Women's University)

This paper models topics in the TOEFL essay samples of the TOEFL11 corpus, a collection of 12,100 TOEFL writing samples submitted by test takers from 11 different native-language backgrounds. An unsupervised text-clustering method, Latent Dirichlet Allocation (LDA), was applied to the written samples in order to model their topics automatically. For each of the 11 groups of non-native test takers, 1,100 essays were transformed into a document-term matrix and then fed into the LDA function in R. The number of potential topics was set to 8, matching the number of prompts the test takers had been given. Overall accuracy ranged from 83% to 99%, depending on the native language of the test takers. Further analysis is needed to determine how reliably the unsupervised LDA method can classify written samples into potential topics. Nevertheless, the paper provides empirical evidence that automatic topic modeling can be applied in an unsupervised way even to the writing samples of English learners. (Sungshin Women’s University)
Keywords: TOEFL essays, TOEFL11 corpus, learner corpus, LDA (Latent Dirichlet Allocation), topic modeling, unsupervised learning
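
The abstract describes a three-step pipeline: build a document-term matrix from the essays, fit LDA with a fixed number of topics, and compare the inferred topics with the known prompts. The Python sketch below illustrates that sequence on toy data. It is not the author's code, which used the LDA function in R. Everything named here is an assumption made for illustration: the function names (`build_dtm`, `fit_lda`, `dominant_topics`, `majority_accuracy`), the collapsed Gibbs sampler used for inference, the hyperparameters `alpha` and `beta`, the six invented mini-essays, and the majority-vote accuracy measure, since the abstract does not say how accuracy was computed.

```python
import random
from collections import Counter


def build_dtm(docs):
    """Return (vocab, dtm), where dtm[d][w] counts vocab[w] in document d."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    dtm = [[0] * len(vocab) for _ in docs]
    for d, toks in enumerate(tokenized):
        for tok in toks:
            dtm[d][index[tok]] += 1
    return vocab, dtm


def fit_lda(dtm, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (an assumed inference method).

    Returns theta, where theta[d][k] is the share of topic k in document d.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(dtm), len(dtm[0])
    # Expand each row of counts into a flat list of word ids.
    words = [[w for w, c in enumerate(row) for _ in range(c)] for row in dtm]
    assign = [[rng.randrange(n_topics) for _ in doc] for doc in words]
    doc_topic = [[0] * n_topics for _ in range(n_docs)]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    def update(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    for d, doc in enumerate(words):
        for i, w in enumerate(doc):
            update(d, w, assign[d][i], +1)

    for _ in range(n_iter):
        for d, doc in enumerate(words):
            for i, w in enumerate(doc):
                # Remove the token, resample its topic, then add it back.
                update(d, w, assign[d][i], -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * n_words)
                    for k in range(n_topics)
                ]
                k_new = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k_new
                update(d, w, k_new, +1)

    theta = []
    for d in range(n_docs):
        total = len(words[d]) + alpha * n_topics
        theta.append([(doc_topic[d][k] + alpha) / total for k in range(n_topics)])
    return theta


def dominant_topics(theta):
    """Pick the highest-probability topic for each document."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


def majority_accuracy(topics, prompts):
    """Map each topic to its most frequent prompt and score the agreement.

    This scoring rule is an assumption, not taken from the paper.
    """
    by_topic = {}
    for topic, prompt in zip(topics, prompts):
        by_topic.setdefault(topic, Counter())[prompt] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_topic.values())
    return correct / len(prompts)


# Invented toy essays and prompt labels (not TOEFL11 data).
docs = [
    "travel tour guide group travel trip",
    "tour guide trip group travel guide",
    "trip travel group tour guide tour",
    "young people community help time community",
    "community young people time help people",
    "help community time young people young",
]
prompts = ["P1", "P1", "P1", "P2", "P2", "P2"]

vocab, dtm = build_dtm(docs)
theta = fit_lda(dtm, n_topics=2)  # the paper set the number of topics to 8
topics = dominant_topics(theta)
accuracy = majority_accuracy(topics, prompts)
print(topics, accuracy)
```

With real data, `n_topics` would be 8 and the procedure would be run separately for each native-language group, as the abstract describes. Note that LDA topic indices are arbitrary, which is why some topic-to-prompt mapping step is needed before any accuracy figure can be reported.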
