Technical Report: UTEP-CS-14-40


In document analysis, an important task is to automatically find keywords that best describe the subject of the document. One of the most widely used techniques for keyword detection is based on the term frequency-inverse document frequency (tf-idf) heuristic. This technique has some explanations, but these explanations are somewhat too complex to be fully convincing. In this paper, we provide a simple probabilistic explanation for the tf-idf heuristic. We also show that the ideas behind this explanation can help us come up with more complex formulas, which will hopefully lead to a more adequate detection of keywords.
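For readers unfamiliar with the heuristic, the following is a minimal sketch of tf-idf keyword ranking in its commonly used textbook form, in which a word's score is its count in the document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the word. This form, the whitespace tokenization, the function name `tfidf_keywords`, and the toy corpus are all assumptions made for illustration; they are not taken from the report, which may use a different variant.

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the words of docs[doc_index] by tf-idf score (illustrative sketch)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents in which each word occurs.
    df = Counter()
    for words in tokenized:
        df.update(set(words))

    # Term frequency: raw word counts within the chosen document.
    tf = Counter(tokenized[doc_index])

    # Score = tf * log(N / df); words found in every document score 0.
    scores = {w: count * math.log(n_docs / df[w]) for w, count in tf.items()}

    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang in the tree",
]
top = tfidf_keywords(docs, 0)
```

In this toy corpus, "the" is the most frequent word in the first document but occurs in every document, so its score is zero, while words unique to the first document rank highest. This is the intuition behind the heuristic: a good keyword is frequent in the document and rare elsewhere.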