According to a patent Google filed last month, the company is pursuing better duplicate (and near-duplicate) file detection.
Here is the premise of the patent:
What is claimed is:
1. A method comprising: receiving a voice search query from a user; deriving a plurality of recognition hypotheses from the voice search query, each recognition hypothesis being associated with a weight; constructing a weighted boolean query using the recognition hypotheses and the weights; providing the weighted boolean query to a search system; receiving results for the weighted boolean query in response to providing the weighted boolean query to the search system; determining a quantity of the received results that are related to each recognition hypothesis of the plurality of recognition hypotheses; and discarding recognition hypotheses of the plurality of recognition hypotheses having no results to obtain a refined weighted boolean query.
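To make the steps in that claim a bit more concrete, here is a rough Python sketch of the loop it describes: build a weighted boolean query from the hypotheses, run it, count the results related to each hypothesis, and drop the hypotheses with no results. Everything named here is my own assumption for illustration (the `Hypothesis` class, the `^weight` query syntax, the `toy_search` function and its tiny corpus, and the substring test for "related"). The claim does not spell out any of these details, so treat this as a guess at the shape of the method, not Google's implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One recognition hypothesis and its weight (hypothetical structure)."""
    text: str
    weight: float


# Toy stand-in for a search index; purely illustrative data.
CORPUS = [
    "How to recognize speech with a cheap microphone",
    "Tips to recognize speech in noisy rooms",
    "Storms can wreck a nice beach in one night",
]


def build_weighted_query(hypotheses):
    """Join hypotheses into one OR query; the ^weight syntax is made up."""
    return " OR ".join(f'("{h.text}")^{h.weight:g}' for h in hypotheses)


def toy_search(query):
    """Return corpus entries containing any quoted phrase from the query."""
    phrases = re.findall(r'"([^"]+)"', query)
    return [doc for doc in CORPUS
            if any(p.lower() in doc.lower() for p in phrases)]


def refine(hypotheses, search):
    """Run the query, count results per hypothesis, drop the empty ones."""
    results = search(build_weighted_query(hypotheses))
    # "Related to" is approximated by a case-insensitive substring match.
    counts = {h: sum(h.text.lower() in r.lower() for r in results)
              for h in hypotheses}
    kept = [h for h in hypotheses if counts[h] > 0]
    return build_weighted_query(kept), counts


if __name__ == "__main__":
    hyps = [
        Hypothesis("recognize speech", 0.7),
        Hypothesis("wreck a nice beach", 0.2),
        Hypothesis("wreck on ice peach", 0.1),
    ]
    refined, counts = refine(hyps, toy_search)
    print(refined)
    for h in hyps:
        print(h.text, counts[h])
```

In this toy run the third hypothesis matches nothing in the corpus, so it is discarded and the refined query keeps only the first two.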
So why would this help Google Search? It would help with PDF files, zip files, and basically everything else that isn't HTML.
As more and more non-HTML pages show up in search results, this will become an increasingly useful feature to have.
More to come soon from Google Patents!