An Extraction Algorithm of Chinese HTML Content Based on Similarity
Author(s): 
Pages: 80-84
Year: Issue:  1
Journal: Journal of Southwest University of Science and Technology

Keywords: content similarity; tag similarity; blocking; text mining
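Abstract: Extracting the main content of a web page is an important step in web mining. Traditional extraction methods must first segment the page into blocks and then identify the main-content block. This paper proposes a method that extracts the main content by combining the content similarity and the tag similarity between lines of text. The algorithm avoids the blocking step of traditional extraction algorithms: after normalizing the page, it first extracts the largest text line, then computes the content similarity and tag similarity between each text line and that largest line, and finally combines the two similarities to extract the main content. In experiments on randomly sampled web pages, the test accuracy is close to 95...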
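The pipeline described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not define either similarity measure, so the sketch assumes Jaccard overlap of character bigrams for content similarity, Jaccard overlap of enclosing tag names for tag similarity, and takes the "largest" line to be the longest by character count. The equal weights, the 0.5 threshold, and all names (`LineCollector`, `extract_body`) are hypothetical, and the page normalization step is omitted.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag
SKIP = {"script", "style"}                           # non-visible text


class LineCollector(HTMLParser):
    """Collect one (text, tag_path) pair per non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.lines = []   # collected (text, tag_path) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (SKIP & set(self.stack)):
            self.lines.append((text, tuple(self.stack)))


def bigrams(text):
    """Set of character bigrams (assumed content representation)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Jaccard similarity of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def extract_body(html, w_content=0.5, w_tag=0.5, threshold=0.5):
    """Keep lines whose combined similarity to the longest line is high."""
    parser = LineCollector()
    parser.feed(html)
    parser.close()
    if not parser.lines:
        return []
    # Reference line: the longest text line (assumed meaning of "largest").
    ref_text, ref_path = max(parser.lines, key=lambda line: len(line[0]))
    ref_grams = bigrams(ref_text.lower())
    kept = []
    for text, path in parser.lines:
        content_sim = jaccard(bigrams(text.lower()), ref_grams)
        tag_sim = jaccard(path, ref_path)
        if w_content * content_sim + w_tag * tag_sim >= threshold:
            kept.append(text)
    return kept
```

Calling `extract_body(page_html)` returns the retained text lines in document order. With these assumed weights a line sharing the reference line's full tag path always passes the threshold, so tag similarity dominates; the weights and threshold would need tuning on real pages.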