This article comes directly from Google’s Help for Publishers page, Tips for Successful Crawling; I have added emphasis to the points I think people often overlook.
Tips for Successful Crawling
Below are common crawling issues related to the article body that Googlebot extracts from the HTML page. You can also find some tips on how to help our crawler find your content and include it in Google News.
- If the article body appears to be too long to be a news article, our crawler may not recognize it as an article. This may happen with news articles that contain user-contributed comments below the article, or HTML layouts that contain other material besides the news article itself.
- If the article body appears not to contain punctuated sequences of contiguous words, we won’t be able to include it in Google News. Make sure that the text of your articles is made up of sentences, and that you don’t use frequent < br > tags within your paragraphs.
- If the article body appears to consist of isolated sentences not grouped together into paragraphs, we won’t be able to crawl it. Try formatting your articles into text paragraphs of a few sentences each.
- If the article body constitutes a relatively small fraction of the text on the overall page, we won’t be able to include it in our News index. Consider removing some of the non-article text on the page.
- If the article body appears to contain too few words to be a news article, we won’t be able to include it. This applies to most links that would lead to news briefs or bulletins, rather than full news articles.
- If the article body appears to be empty, we won’t be able to crawl it. Make sure that the full text of each of your articles is available in the source code of your article pages (and not embedded in a JavaScript file, for example).
- If the article body is prevented from being crawled by a robots.txt file or a robots meta tag, Googlebot won’t be able to access your article. Try removing the “noindex” meta tag or checking that your robots.txt file allows “User-agent: Googlebot” access to the file where your news articles are stored.
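The last tip in the list above (robots.txt and the “noindex” meta tag) is easy to check before you publish. What follows is only a minimal sketch using Python’s standard library, not Google’s own logic: the function names `googlebot_allowed` and `has_noindex` are my own, the example URLs are placeholders, and the meta check only looks at tags with name="robots".

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def googlebot_allowed(robots_txt, url):
    """Return True if the given robots.txt text lets Googlebot fetch url."""
    parser = RobotFileParser()
    # Parse the text directly so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html):
    """Return True if the page carries a robots meta tag with noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives
```

You would feed `googlebot_allowed` the contents of your own robots.txt file and the URL of an article, and feed `has_noindex` the HTML source of that article page. If the first returns False or the second returns True, the article body is blocked from crawling.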
Below you can find common crawling issues related to the article title that Googlebot extracts from the HTML page. We also offer you some tips on how to help our crawler find your content and include it in Google News.
- If the title suggests that the article isn’t news-related (for example, “Terms of Service”), we won’t be able to include it in Google News.
- If the title doesn’t appear in other parts of the page, we may not be able to crawl it. To fix this problem you can try this:
  - Set the < title > tag on the HTML page to the title of the article.
  - Make sure the article title is displayed prominently above the text of the article, such as between < h1 > tags, and that no part of it is a hyperlink.
- Ensure that the title is not too long or too short. Currently, the title needs to be between two and 22 words.
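The title tips above can be turned into a rough self-check. This is a sketch under my own assumptions, not a description of how Googlebot works: the names `TitleExtractor` and `title_problems` are hypothetical, I treat “between two and 22 words” as inclusive, and I only require that the < h1 > text appears somewhere inside the < title > text.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text of <title> and <h1>, and notes links inside <h1>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.h1 = ""
        self.h1_has_link = False
        self._inside = None  # "title", "h1", or None

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._inside = tag
        elif tag == "a" and self._inside == "h1":
            self.h1_has_link = True

    def handle_endtag(self, tag):
        if tag == self._inside:
            self._inside = None

    def handle_data(self, data):
        if self._inside == "title":
            self.title += data
        elif self._inside == "h1":
            self.h1 += data


def title_problems(html):
    """Return a list of human-readable problems found with the title."""
    extractor = TitleExtractor()
    extractor.feed(html)
    # Collapse runs of whitespace so line breaks do not affect comparison.
    title = " ".join(extractor.title.split())
    h1 = " ".join(extractor.h1.split())
    problems = []
    if not h1:
        problems.append("no h1 headline found")
    elif h1 not in title:
        problems.append("title tag does not contain the headline")
    if extractor.h1_has_link:
        problems.append("headline contains a hyperlink")
    word_count = len(h1.split())
    if h1 and not 2 <= word_count <= 22:
        problems.append("headline is not between 2 and 22 words")
    return problems
```

An empty list means the page passed these particular checks; it does not guarantee inclusion in Google News.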
If our crawler has trouble finding an article’s date, we may not be able to include it in Google News. Below are some tips for displaying the date correctly in your article HTML.
- If we were unable to determine the publication date of the article, we won’t be able to include it in our News index. Try placing a clear date and time for each of your articles in between the article’s title and the article’s text in a separate line of HTML. This should help our crawler correctly identify the publication date for your article.
- If the date that we determined for this article is more than three days old, our crawler will think the article is too old to be considered news content and won’t include it in Google News. If you suspect this is happening for your new articles, try placing a clear date and time for each of your articles in between the article’s title and the article’s text in a separate line of HTML.
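The three-day rule above is simple date arithmetic. As a small sketch, assuming you already have the publication time as a timezone-aware datetime (the name `is_too_old` and the `now` parameter are mine, and how Google actually measures the cutoff is not documented here):

```python
from datetime import datetime, timedelta, timezone

# Cutoff taken from the tip above: more than three days old is too old.
MAX_AGE = timedelta(days=3)


def is_too_old(published, now=None):
    """Return True if published is more than three days before now.

    Both datetimes must be timezone-aware.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - published > MAX_AGE
```

If a freshly published article comes back as too old, the date shown on the page is probably wrong or ambiguous, which is the case the tip about placing a clear date and time between the title and the text is meant to address.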