Journal of Information Science
Identifying synonymous concepts in preparation for technology mining

Cherie Courseault Trumbach

Department of Management, University of New Orleans, New Orleans, USA, ctrumbac{at}uno.edu

Dinah Payne

Department of Management, University of New Orleans, New Orleans, USA

This research demonstrates the development of a 'concept-clumping algorithm' designed to improve the clustering of technical concepts. The algorithm first identifies a list of technically relevant noun phrases from a cleaned, extracted list and then applies rule-based matching to identify synonymous terms based on the words the terms share. An assessment found that the algorithm has an 89–91% precision rate, was successful in moving technically important terms higher in the term frequency list, and improved the technical specificity of term clusters.

Key Words: text mining • data quality • knowledge discovery • term similarity • text cleaning
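
The abstract does not spell out the algorithm's rules, so the following is only a minimal sketch of the general idea of clumping terms by shared words, not the authors' method. It assumes, purely for illustration, that two noun phrases are synonymous when they contain the same words after lowercasing, stopword removal and crude plural stripping. The names `STOPWORDS`, `normalize` and `clump`, and the sample counts, are hypothetical.

```python
from collections import defaultdict

# Hypothetical stopword list; the paper's actual list is not given in the abstract.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the"}


def normalize(term):
    """Reduce a term to an order-independent set of lowercase content words."""
    words = []
    for word in term.lower().replace("-", " ").split():
        if word in STOPWORDS:
            continue
        # Crude plural stripping (an assumption; a real stemmer would do better).
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        words.append(word)
    return frozenset(words)


def clump(term_counts):
    """Merge terms that share all normalized words.

    Returns a dict mapping the most frequent variant of each clump to a
    tuple of (combined frequency, sorted list of variants).
    """
    groups = defaultdict(list)
    for term, count in term_counts.items():
        groups[normalize(term)].append((term, count))

    clumps = {}
    for members in groups.values():
        representative = max(members, key=lambda tc: tc[1])[0]
        total = sum(count for _, count in members)
        clumps[representative] = (total, sorted(t for t, _ in members))
    return clumps


if __name__ == "__main__":
    # Made-up frequencies, for illustration only.
    counts = {
        "fuel cell": 12,
        "fuel cells": 7,
        "cell fuel": 1,
        "text mining": 9,
        "mining of text": 2,
        "solid oxide fuel cell": 4,
    }
    for rep, (total, variants) in sorted(clump(counts).items()):
        print(rep, total, variants)
```

Merging variants sums their frequencies, which is one way a clumped term can move higher in a term frequency list, as the abstract reports. Requiring every word to match is a deliberately strict choice in this sketch; a looser overlap threshold would merge more terms at the likely cost of precision.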


This version was published on December 1, 2007

Journal of Information Science, Vol. 33, No. 6, 660-677 (2007)
DOI: 10.1177/0165551506076401




This article has been cited by other articles:


Y.-H. Hu, Y.-L. Chen, and K. Tang
Mining sequential patterns in the B2B environment
Journal of Information Science, December 1, 2009; 35(6): 677 - 694.

