Journal of Information Science

 

This version was published on April 1, 2008
Journal of Information Science, Vol. 34, No. 2, 213-230 (2008)
DOI: 10.1177/0165551507082592

A comparative study of two automatic document classification methods in a library setting

Joanna Yi-Hang Pong

Run Run Shaw Library, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong

Ron Chi-Wai Kwok

Department of Information Systems, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, isron{at}cityu.edu.hk

Raymond Yiu-Keung Lau

Department of Information Systems, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong

Jin-Xing Hao

Department of Information Systems, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong

Percy Ching-Chi Wong

Department of Information Systems, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong

In current library practice, trained human experts usually catalogue and index documents manually. With the explosive growth in the number of electronic documents available on the Internet and in digital libraries, it is increasingly difficult for library practitioners to categorize both electronic documents and traditional library materials using a manual approach alone. To improve the effectiveness and efficiency of document categorization in the library setting, more in-depth studies of automatic document classification methods for categorizing library items are required. Machine learning research has advanced rapidly in recent years. However, applying machine learning techniques to improve library practice is still a relatively unexplored area. This paper illustrates the design and development of a machine learning based automatic document classification system to alleviate the manual categorization problem encountered within the library setting. Two supervised machine learning algorithms have been tested. Our empirical tests show that supervised machine learning algorithms in general, and the k-nearest neighbours (KNN) algorithm in particular, can be used to develop an effective document classification system to enhance current library practice. Moreover, some concrete recommendations are made on how to apply the KNN algorithm in practice to develop automatic document classification in a library setting. To the best of our knowledge, this is the first in-depth study of applying the KNN algorithm to automatic document classification based on the Library of Congress Classification (LCC) scheme, which is widely used by many large libraries.

Key Words: automatic document classification • text categorization • machine learning • k-nearest neighbours classifier • naive Bayes classifier • library practice
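
For readers unfamiliar with the k-nearest neighbours approach named in the abstract, the general idea can be sketched as follows. This is a minimal illustration only, not the system described in the paper: the whitespace tokenizer, raw term counts, cosine similarity, the value k=3, and the six toy training records are all assumptions made here. The labels "QA" (mathematics) and "Z" (bibliography and library science) are real top-level LCC classes, but the records assigned to them are invented.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two term-count vectors (Counter objects).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=3):
    # Rank the labelled documents by similarity to the query, then take
    # a majority vote among the labels of the k most similar ones.
    q = Counter(tokenize(query))
    scored = sorted(
        ((cosine(q, Counter(tokenize(text))), label) for text, label in training),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Invented toy records, each tagged with a top-level LCC class.
TRAINING = [
    ("linear algebra and matrix computation", "QA"),
    ("probability theory and statistics", "QA"),
    ("algorithms and data structures", "QA"),
    ("library cataloguing and indexing practice", "Z"),
    ("bibliography and library classification schemes", "Z"),
    ("digital libraries and information retrieval", "Z"),
]

predicted = knn_classify("matrix algorithms for statistics", TRAINING, k=3)
print(predicted)
```

In a realistic setting the term weighting (for example TF-IDF), the choice of k, and the text fields used as features would all need to be tuned on catalogued records; none of those choices are specified in the abstract.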

