TEXT MINING: TEXT SIMILARITY MEASURE FOR NEWS ARTICLES BASED ON STRING BASED APPROACH
Keywords:
Text Mining, Cosine Similarity, News Article Dataset.
Abstract
Measuring document similarity plays an important role in text-related research. Its applications include plagiarism detection, document clustering, automatic essay scoring, information retrieval and machine translation. String-Based Similarity, Knowledge-Based Similarity and Corpus-Based Similarity are the three major approaches that most researchers have proposed for the document similarity problem. In this paper, Cosine Similarity, a term-based algorithm within the String-Based Similarity family, is used to measure the similarity between documents. The nouns in the documents are extracted, and the synsets of the context words are also extracted using WordNet. A bigram dataset is then created from the context words. In the proposed method, the similarity between documents is computed with the cosine similarity algorithm on three document sets: the preprocessed dataset, the context word dataset and the bigram dataset. The context word document set gives better similarity results than the bigram and preprocessed document sets.
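As an illustration of the term-based measure named in the abstract, the following is a minimal sketch of cosine similarity over raw term-frequency vectors, applied once to single terms and once to bigrams. It is not the paper's implementation: the whitespace tokenizer, the helper names (tokenize, bigrams, cosine_similarity) and the two sample sentences are assumptions made for this example, and the noun extraction and WordNet synset steps are omitted.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def bigrams(tokens):
    """Join each pair of adjacent tokens into a single bigram term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def cosine_similarity(terms_a, terms_b):
    """Cosine of the angle between two term-frequency vectors."""
    freq_a, freq_b = Counter(terms_a), Counter(terms_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as having no similarity
    return dot / (norm_a * norm_b)


# Hypothetical sample sentences, not taken from the paper's dataset.
doc1 = tokenize("government announces new budget plan")
doc2 = tokenize("government announces new tax plan")

unigram_score = cosine_similarity(doc1, doc2)                  # about 0.8
bigram_score = cosine_similarity(bigrams(doc1), bigrams(doc2))  # about 0.5
```

In this toy case the bigram score is lower than the single-term score because one differing word breaks two adjacent bigrams, which shows how the choice of term set changes the measured similarity between the same two documents.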