以語意為基礎之網路犯罪資訊搜尋研究 (A Study of Semantic-Based Search for Internet Crime Information)

On Semantic-Based Intelligent Crime Information Retrieval on the Internet

 

 

Chih-Ping Yen (顏志平)

Central Police University

Computer Center

peter@sun4.cpu.edu.tw

 

Shyong-Jian Shyu (徐熊健)

Ming Chuan University

Associate Professor, Graduate Institute of Information Management

sjshyu@mcu.edu.tw

 

Abstract (translated from Chinese)

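Prosecutors and police agencies usually rely on the search engines of portal sites to gather intelligence on Internet crime. Because the precision and recall of these search engines are low, they often return many irrelevant Web pages, so investigators must spend additional time filtering the results one by one, which is quite inefficient. This paper therefore applies intelligent computational methods to raise precision and recall and so mitigate the problem.

First, we use semantic field theory to organize the synonym relationships between terms into a term database, building a hierarchical structure similar to WordNet, and we use this term database to compare the similarity of Web page contents. We derive five similarity computation methods: "term-frequency weighted similarity", "classified exponential similarity", "classified exponential weighted similarity", "error-corrected similarity", and "term-frequency reweighting". We compare their relative strengths and weaknesses, select the best method, and infer a threshold value.

In addition, we compare our results with those of the 1999 research project by Chih-Cheng Chen (陳志誠), "A High-Precision Crime Information Search System on the Internet" (called the e-Detective system). Using the search for Web pages "selling illegal software" on the Internet as the evaluation case, the experiments show that the best F-measure of the system proposed here reaches 0.5581, whereas the best F-measure of the earlier system is only 0.2376, so the proposed system clearly performs better.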
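The F-measure used for this comparison combines precision and recall into a single score. As a minimal sketch, assuming the balanced form (F1) is the one intended, it could be computed as follows. The function names and the page identifiers are hypothetical and serve only as illustration; they are not data from the experiments.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) for a set of retrieved pages."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the balanced F1 score (an assumption here; the
    abstract does not state which weighting was used).
    """
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Illustrative page identifiers only, not experimental data.
retrieved = {"p1", "p2", "p3", "p4"}
relevant = {"p2", "p4", "p5"}
p, r = precision_recall(retrieved, relevant)
print(p, r, f_measure(p, r))  # precision 0.5, recall 2/3, F1 = 4/7
```

Because the harmonic mean is dominated by the smaller of the two values, a system cannot score well by maximizing recall alone while returning mostly irrelevant pages.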

 

Keywords: search engine, information retrieval, document classification, similarity, precision, recall, semantic field, cybercrime, e-Detective

 

Abstract

We usually search the Web with the help of search engines. Because search results are imprecise, we often face too many recommended pages. Search engines return many irrelevant pages because they simply match the exact search word(s) the user entered. To cope with this problem, we propose determining similarities with the help of a knowledge base associated with a given topic, which should significantly reduce the number of irrelevant pages.

In this research we first apply the theory of semantic fields, in which a term (concept) forms a term database through its relationships to other concepts. Based on the term databases, we propose several models to evaluate the similarity between search concepts and the contents of Web pages: the model of weighted terms (a modified vector space model), the model of classified weighted terms, and the exponential model of classified weighted terms. The last of these is designed based on the Facet Analysis Method. We also evaluate similarity with error correction and term reweighting. The approaches described in this paper are used to construct a search engine that discriminates Web pages advertising pirated compact discs (CDs), which are very difficult to distinguish from pages advertising legitimate CDs. We further determine an adequate threshold of term weights for our search purpose as a trade-off between recall and precision. A comparison of our search results with those of previous work shows the advantage of this approach.
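To make the idea of term-database similarity concrete, the following is a minimal sketch of a weighted-term comparison in the spirit of a vector space model. It is not the authors' model: it uses plain cosine similarity, and the term database, its weights, the function names, and the threshold value are all hypothetical assumptions chosen for illustration.

```python
import math
import re
from collections import defaultdict

# Hypothetical term database: surface term -> (concept, weight).
# Synonyms share a concept, so pages can match without sharing exact words.
TERM_DB = {
    "pirated": ("piracy", 1.0),
    "bootleg": ("piracy", 1.0),
    "copied": ("piracy", 0.8),
    "cd": ("medium", 0.5),
    "disc": ("medium", 0.5),
    "cheap": ("price", 0.6),
    "discount": ("price", 0.6),
}


def concept_vector(text: str) -> dict[str, float]:
    """Map text to a weighted concept vector using the term database."""
    vec: dict[str, float] = defaultdict(float)
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in TERM_DB:
            concept, weight = TERM_DB[token]
            vec[concept] += weight
    return dict(vec)


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_relevant(query: str, page: str, threshold: float = 0.5) -> bool:
    """Accept a page when its similarity to the query reaches the threshold.

    The threshold of 0.5 is an arbitrary illustrative value.
    """
    return cosine(concept_vector(query), concept_vector(page)) >= threshold


print(is_relevant("pirated cd", "Bootleg disc, cheap copied software"))
print(is_relevant("pirated cd", "Weather forecast for the weekend"))
```

Raising the threshold in a sketch like this rejects more pages, which tends to improve precision at the cost of recall; lowering it has the opposite effect. That is the trade-off the threshold selection has to balance.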

 

Keywords: Search Engine, Information Retrieval, Text Classification, Similarity, Precision, Recall, Semantic Field, Cybercrime, e-Detective