Google Patents: Improved Duplicate File Detection

Written by Jonathan Dingman
05/08/2008 9:32 ET - Filed under Search

According to a patent that Google filed last month, the company is pursuing better duplicate (and near-duplicate) file detection.

Here is the premise for the patent:

What is claimed is:

1. A method comprising: receiving a voice search query from a user; deriving a plurality of recognition hypotheses from the voice search query, each recognition hypothesis being associated with a weight; constructing a weighted boolean query using the recognition hypotheses and the weights; providing the weighted boolean query to a search system; receiving results for the weighted boolean query in response to providing the weighted boolean query to the search system; determining a quantity of the received results that are related to each recognition hypothesis of the plurality of recognition hypotheses; and discarding recognition hypotheses of the plurality of recognition hypotheses having no results to obtain a refined weighted boolean query.
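To make the steps in that claim a little more concrete, here is a rough Python sketch of how they might fit together. It is only an illustration, not Google's implementation: the function names, the `^weight` query syntax, the toy corpus, and the substring test used to decide whether a result is "related" to a hypothesis are all assumptions made up for this example.

```python
import re
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (recognized text, weight)


def build_weighted_query(hypotheses: List[Hypothesis]) -> str:
    """Join the hypotheses into one OR query; the ^weight syntax is hypothetical."""
    return " OR ".join(f'("{text}")^{weight:.2f}' for text, weight in hypotheses)


def refine_hypotheses(
    hypotheses: List[Hypothesis],
    search: Callable[[str], List[str]],
) -> Tuple[List[Hypothesis], str]:
    """Run the weighted query, count results per hypothesis, drop the empty ones."""
    results = search(build_weighted_query(hypotheses))
    kept = []
    for text, weight in hypotheses:
        # Assumption: a result is "related" if it contains the hypothesis text.
        related = sum(1 for doc in results if text.lower() in doc.lower())
        if related > 0:
            kept.append((text, weight))
    return kept, build_weighted_query(kept)


# A stand-in search system over a tiny in-memory corpus (purely illustrative).
CORPUS = [
    "How to recognize speech with software",
    "Tips to wreck a nice beach party",
    "Speech recognition basics",
]


def toy_search(query: str) -> List[str]:
    """Return every document containing any quoted phrase; weights are ignored."""
    phrases = re.findall(r'"([^"]+)"', query)
    return [doc for doc in CORPUS if any(p.lower() in doc.lower() for p in phrases)]


hypotheses = [
    ("recognize speech", 0.6),
    ("wreck a nice beach", 0.3),
    ("wreck an ice peach", 0.1),
]
kept, refined_query = refine_hypotheses(hypotheses, toy_search)
print(refined_query)
```

In this toy run the third hypothesis matches nothing in the corpus, so it is dropped and the refined query is built from the remaining two. A real search system would presumably use the weights for ranking, which this sketch does not attempt.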

So why would this help Google Search? It would help deal with PDF files, zip files, and basically everything else that isn't HTML.

As I see more and more non-HTML files showing up in search results, this will become an increasingly useful feature to have.

More to come soon from Google Patents!
