Google Patents: Improved Duplicate File Detection

Posted at 9:41am EST on 05/08/2008

According to a patent that Google filed last month, the company is pursuing better duplicate (and near-duplicate) file detection.

Here is the premise of the patent:

What is claimed is:

1. A method comprising: receiving a voice search query from a user; deriving a plurality of recognition hypotheses from the voice search query, each recognition hypothesis being associated with a weight; constructing a weighted boolean query using the recognition hypotheses and the weights; providing the weighted boolean query to a search system; receiving results for the weighted boolean query in response to providing the weighted boolean query to the search system; determining a quantity of the received results that are related to each recognition hypothesis of the plurality of recognition hypotheses; and discarding recognition hypotheses of the plurality of recognition hypotheses having no results to obtain a refined weighted boolean query.
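The steps in the claim quoted above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not Google's implementation: the `^weight` query syntax, the `toy_search` stand-in for the search system, the in-memory `CORPUS`, and the substring test for whether a result is "related" to a hypothesis are all hypothetical.

```python
import re
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (recognized text, weight)

# Hypothetical in-memory corpus standing in for a real search index.
CORPUS = [
    "How to recognize speech with software",
    "We wreck a nice beach every summer",
    "An unrelated page about file formats",
]


def build_weighted_query(hypotheses: List[Hypothesis]) -> str:
    """Join hypotheses into one boolean OR query (the ^weight syntax is made up)."""
    return " OR ".join(f'("{text}")^{weight:.2f}' for text, weight in hypotheses)


def toy_search(query: str) -> List[str]:
    """Assumed stand-in for the search system: return docs matching any quoted phrase."""
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
    return [doc for doc in CORPUS if any(p in doc.lower() for p in phrases)]


def refine(
    hypotheses: List[Hypothesis],
    search: Callable[[str], List[str]] = toy_search,
) -> Tuple[List[Hypothesis], str]:
    """Run the weighted query, count results per hypothesis, drop those with none."""
    results = search(build_weighted_query(hypotheses))
    kept = []
    for text, weight in hypotheses:
        # "Related" is approximated here by a case-insensitive substring match.
        related = sum(1 for doc in results if text.lower() in doc.lower())
        if related > 0:
            kept.append((text, weight))
    return kept, build_weighted_query(kept)
```

Calling `refine` with a few made-up hypotheses and weights returns only the hypotheses that matched at least one result, plus the refined query built from them.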

So why would this help Google Search? It would help deal with PDF files, zip files, and basically everything else that's non-HTML.

As I see more and more non-HTML pages showing up in search results, this will become an increasingly useful feature to have.

More to come soon from Google Patents!
