- To hell with the pig... I'm going to Switzerland.

Data Mining, Search, and Keyword Cleverness (Tuesday, April 11, 2006)

My latest project has been providing some rudimentary data mining tools for the site. The goal was two-fold. First, I wanted to allow full text search of the site, and second, I wanted to be able to generate keywords for any particular page. This is the first serious application on the site to make direct use of a database. I've got three tables in the database: one for Words, one for Files, and one for Counts, which keeps track of the correlation between words and files. The words table has a row for each unique word on the site, along with the total number of times that word appears. The files table has a row for each unique file. And the counts table has a row for each word in each file, along with the number of times that word appears in that file.
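To make the three-table layout concrete, here is a minimal sketch using Python's built-in sqlite3 module. The post does not say which database or column names are actually used, so the schema, the `tokenize` rule (lowercase runs of three or more letters), and the `index_file` helper are all assumptions for illustration only.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical schema: one row per word, one per file, one per (word, file) pair.
SCHEMA = """
CREATE TABLE words  (id INTEGER PRIMARY KEY,
                     word TEXT UNIQUE NOT NULL,
                     total INTEGER NOT NULL DEFAULT 0);
CREATE TABLE files  (id INTEGER PRIMARY KEY,
                     path TEXT UNIQUE NOT NULL);
CREATE TABLE counts (word_id INTEGER NOT NULL REFERENCES words(id),
                     file_id INTEGER NOT NULL REFERENCES files(id),
                     n INTEGER NOT NULL,
                     PRIMARY KEY (word_id, file_id));
"""

def tokenize(text):
    # Assumed rule: lowercase runs of letters, at least three characters long.
    return re.findall(r"[a-z]{3,}", text.lower())

def index_file(db, path, text):
    """Record one file and its per-word counts, updating site-wide totals."""
    file_id = db.execute("INSERT INTO files (path) VALUES (?)", (path,)).lastrowid
    for word, n in Counter(tokenize(text)).items():
        db.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        db.execute("UPDATE words SET total = total + ? WHERE word = ?", (n, word))
        (word_id,) = db.execute(
            "SELECT id FROM words WHERE word = ?", (word,)
        ).fetchone()
        db.execute(
            "INSERT INTO counts (word_id, file_id, n) VALUES (?, ?, ?)",
            (word_id, file_id, n),
        )

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
index_file(db, "a.html", "the cat saw the dog")
index_file(db, "b.html", "the dog ran")
```

With this layout, the site-wide total for a word is a single lookup in words, and the per-file figure is a single lookup in counts, which is all the search and keyword features need.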

The search page will generate a keyword query that searches the database for files that contain all of the keywords, and then sorts them based on a weighted ranking of the keywords. That is, the first word in the query is weighted 100%, the next term is discounted by 75%, the next one is discounted a bit more, and so on. To bring other results to the top, you can move the more relevant terms to the front of the query.
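The ranking idea can be sketched roughly as follows. The exact discount schedule is not spelled out above, so this version assumes each successive term's weight is the previous one multiplied by 0.75; the `search` function, the `decay` parameter, and the in-memory `pages` dictionary are hypothetical stand-ins for the real database query.

```python
def term_weights(n, decay=0.75):
    # Assumed schedule: 1.0, 0.75, 0.5625, ... (the real discounts may differ).
    return [decay ** i for i in range(n)]

def search(terms, file_counts, decay=0.75):
    """Return paths of files containing every term, best match first."""
    if not terms:
        return []
    weights = term_weights(len(terms), decay)
    results = []
    for path, counts in file_counts.items():
        # Only files that contain all of the keywords qualify.
        if all(t in counts for t in terms):
            score = sum(w * counts[t] for w, t in zip(weights, terms))
            results.append((score, path))
    # Highest score first; ties broken by path for a stable order.
    results.sort(key=lambda r: (-r[0], r[1]))
    return [path for _, path in results]

# Hypothetical per-file word counts.
pages = {
    "a.html": {"database": 4, "search": 1},
    "b.html": {"database": 1, "search": 4},
    "c.html": {"database": 9},
}

by_database = search(["database", "search"], pages)
by_search = search(["search", "database"], pages)
```

Here c.html never appears because it lacks one of the terms, and swapping the order of the two terms swaps the order of a.html and b.html, which is the effect of moving the more relevant term to the front of the query.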

For each page in the results, a snippet of text containing the search terms is generated, along with a list of keywords in the document. The keywords are determined by ranking the ratio of the number of times a word appears in the document, versus the number of times it appears on the whole site. Thus, a word like "the" may appear twenty times in a document, but it only represents one percent of the overall usage of the word "the" on the entire site, so it would be ranked relatively low compared to more interesting terms.
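As a small illustration of that ratio, assuming the per-document and site-wide counts are already available as dictionaries (the `keywords` function and the sample numbers are made up for this sketch):

```python
def keywords(doc_counts, site_totals, top=3):
    """Rank a document's words by their share of site-wide usage."""
    ratio = {word: n / site_totals[word] for word, n in doc_counts.items()}
    # Largest share first; ties broken alphabetically.
    return sorted(ratio, key=lambda w: (-ratio[w], w))[:top]

# Hypothetical counts: "the" is frequent here but far more frequent site-wide.
doc_counts = {"the": 20, "database": 6, "keyword": 3}
site_totals = {"the": 2000, "database": 10, "keyword": 12}

ranked = keywords(doc_counts, site_totals)
```

In this example "the" accounts for only one percent of its site-wide usage, so it lands behind "database" and "keyword" despite appearing far more often in the document.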

Currently, there are 7,675 unique words in the database, including proper names, spelling variations, and technical terms, and excluding words with fewer than three characters. Of these, 3,658 appear only once on the site. This leads to a problem where a word that appears only once will be ranked more highly than a more meaningful term, because the more meaningful term is used on more than one page. The one-off word automatically gets a value of 1.0, while a word appearing on more than one page must score less than 1.0. To combat this, when making the calculation, I simply add one to the site-wide total. For words that only appear once, this reduces their value to 0.5. It turns out that this slight adjustment improves the quality of keywords significantly.
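The effect of adding one to the total is easy to check with a couple of numbers. This is only a sketch of the arithmetic; the `keyword_score` function and its `smoothing` parameter are names invented here.

```python
def keyword_score(in_doc, on_site, smoothing=1):
    """Share of site-wide usage, with `smoothing` added to the site total."""
    return in_doc / (on_site + smoothing)

# A word used once on the whole site versus a word used 6 times in this
# document out of 10 times site-wide (hypothetical counts).
raw_single = keyword_score(1, 1, smoothing=0)     # 1.0
raw_common = keyword_score(6, 10, smoothing=0)    # 0.6
adj_single = keyword_score(1, 1)                  # 0.5
adj_common = keyword_score(6, 10)                 # 6/11, a little above 0.5
```

Without the adjustment the one-off word wins (1.0 against 0.6); with it, the word that is concentrated in this document but also used elsewhere comes out ahead.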

Just for fun, I added a couple of other interesting features. The site will now detect if a user has arrived from an external search engine, and will provide a link to search the site for the terms requested. The search page also has an option to sort results by date, rather than by relevance. The date-sorted results do not look so good right now, because many older posts have been edited recently, which makes them look newer in the results than they really are. But as more content is added to the site, these results, along with the regular search results and keywords, should all improve dramatically.
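One way the search-engine detection could work is to inspect the HTTP Referer header and pull the query out of its URL. The sketch below uses Python's standard urllib.parse; the function name, the `own_host` argument, and the list of query parameter names are assumptions, since each search engine chooses its own parameter and the post does not say which ones are handled.

```python
from urllib.parse import urlparse, parse_qs

# Assumed parameter names; real engines vary, so this list is a guess.
QUERY_PARAMS = ("q", "p", "query")

def referrer_search_terms(referrer, own_host):
    """Return the search terms found in an external referrer URL, if any."""
    if not referrer:
        return []
    url = urlparse(referrer)
    # Ignore missing hosts and visitors arriving from the site itself.
    if not url.hostname or url.hostname == own_host:
        return []
    params = parse_qs(url.query)  # decodes "+" and %-escapes
    for name in QUERY_PARAMS:
        if name in params:
            return params[name][0].split()
    return []

terms = referrer_search_terms(
    "http://www.google.com/search?q=keyword+ranking&hl=en",
    "www.example.com",
)
```

If the returned list is non-empty, the page can offer a link that runs the same terms through the site's own search.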

While sufficient for my own needs, these capabilities could definitely be improved. A user might want to page through the results instead of being limited to the first ten, click on a keyword to search for it, get partial matches when no page contains every search term, or search on word stems so that a query for one word also returns its variations. Likewise, the keyword algorithm could be improved so that it would be possible to generate meaningful titles for blog entries, and not just lists of keywords. I also need to come up with a way to automatically flush missing files from the database, and to refresh files as they change, instead of doing it manually.

—Brian (4/11/2006 2:09 PM)



Disclaimer: Opinions on this site are those of Brian Ziman and do not necessarily
reflect the views of any other organizations or businesses mentioned.