
Sunday, 9 August 2009

Optimize your crawling & indexing

Webmaster Level: Intermediate to Advanced

Many questions about website architecture, crawling and indexing, and even ranking issues can be boiled down to one central issue: How easy is it for search engines to crawl your site? We've spoken on this topic at a number of recent events, and below you'll find our presentation and some key takeaways on this topic.

[Embedded presentation: Optimize your crawling & indexing]

The Internet is a big place; new content is being created all the time. Google has a finite number of resources, so when faced with the nearly infinite quantity of content that's available online, Googlebot is only able to find and crawl a percentage of that content. Then, of the content we've crawled, we're only able to index a portion.

URLs are like the bridges between your website and a search engine's crawler: crawlers need to be able to find and cross those bridges (i.e., find and crawl your URLs) in order to get to your site's content. If your URLs are complicated or redundant, crawlers are going to spend time tracing and retracing their steps; if your URLs are organized and lead directly to distinct content, crawlers can spend their time accessing your content rather than crawling through empty pages, or crawling the same content over and over via different URLs.

In the slides above you can see some examples of what not to do—real-life examples (though names have been changed to protect the innocent) of homegrown URL hacks and encodings, parameters masquerading as part of the URL path, infinite crawl spaces, and more. You'll also find some recommendations for straightening out that labyrinth of URLs and helping crawlers find more of your content faster, including:
  • Remove user-specific details from URLs.
    URL parameters that don't change the content of the page, such as session IDs or sort order, can be removed from the URL and put into a cookie. By putting this information in a cookie and 301 redirecting to a "clean" URL, you retain the information and reduce the number of URLs pointing to that same content (sketched in code after this list).
  • Rein in infinite spaces.
    Do you have a calendar that links to an infinite number of past or future dates (each with its own unique URL)? Do you have paginated data that returns a status code of 200 when you add &page=3563 to the URL, even if there aren't that many pages of data? If so, you have an infinite crawl space on your website, and crawlers could be wasting their (and your!) bandwidth trying to crawl it all. Consider these tips for reining in infinite spaces, and see the second sketch below for one way to cap pagination.
  • Disallow actions Googlebot can't perform.
    Using your robots.txt file, you can disallow crawling of login pages, contact forms, shopping carts, and other pages whose sole functionality is something that a crawler can't perform. (Crawlers are notoriously cheap and shy, so they don't usually "Add to cart" or "Contact us.") This lets crawlers spend more of their time crawling content that they can actually do something with. A sample robots.txt follows this list.
  • One man, one vote. One URL, one set of content.
    In an ideal world, there's a one-to-one pairing between URL and content: each URL leads to a unique piece of content, and each piece of content can only be accessed via one URL. The closer you can get to this ideal, the more streamlined your site will be for crawling and indexing. If your CMS or current site setup makes this difficult, you can use the rel=canonical element to indicate the preferred URL for a particular piece of content (shown in the last snippet below).
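
Here's a minimal sketch of the first tip, using Python and Flask (the framework, route, and parameter names are illustrative assumptions; the post doesn't prescribe any particular stack). Parameters that don't affect content are stripped from the URL, preserved in cookies, and the visitor gets a 301 to the clean address:

    from urllib.parse import urlencode

    from flask import Flask, redirect, request

    app = Flask(__name__)

    # Hypothetical parameters that never change the rendered content.
    NON_CONTENT_PARAMS = {"sessionid", "sort"}

    @app.before_request
    def redirect_to_clean_url():
        noisy = NON_CONTENT_PARAMS & set(request.args)
        if not noisy:
            return None  # URL is already clean; serve the page normally
        # Rebuild the query string without the non-content parameters.
        kept = [(k, v) for k, v in request.args.items(multi=True)
                if k not in NON_CONTENT_PARAMS]
        clean = request.path + ("?" + urlencode(kept) if kept else "")
        resp = redirect(clean, code=301)  # permanent redirect
        for name in noisy:
            # Retain the stripped information in a cookie instead.
            resp.set_cookie(name, request.args[name])
        return resp

With this in place, a request for /products?sessionid=abc123&color=blue permanently redirects to /products?color=blue, and the session ID travels along in a cookie instead of multiplying the URLs that point at the same content.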
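
The infinite-spaces tip, sketched the same way: check the requested page number against the real amount of data and return a 404 (instead of an empty 200) for pages that don't exist. TOTAL_PAGES here is a hypothetical stand-in for however you'd count pages in practice:

    from flask import Flask, abort, request

    app = Flask(__name__)

    TOTAL_PAGES = 12  # hypothetical; compute from your actual data

    @app.route("/articles")
    def articles():
        page = request.args.get("page", default=1, type=int)
        if page < 1 or page > TOTAL_PAGES:
            abort(404)  # "no such page", so crawlers stop here
        return f"Articles, page {page} of {TOTAL_PAGES}"

A request for /articles?page=3563 now gets a 404 rather than an empty page, which tells crawlers there's nothing further to explore in that direction.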
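
For the robots.txt tip, the directives themselves are short. The paths below are examples only; substitute wherever your login, contact, and cart pages actually live:

    # Paths are hypothetical; adjust to your site's URLs.
    User-agent: *
    Disallow: /login
    Disallow: /contact
    Disallow: /cart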
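
And for the last tip, indicating the preferred URL is a single link element in the <head> of each duplicate page (the URLs here are placeholders):

    <!-- In the <head> of http://www.example.com/product?sort=price -->
    <link rel="canonical" href="http://www.example.com/product" />

This tells search engines that the parameterized page is a variant of http://www.example.com/product, so indexing signals can be consolidated on the clean URL.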

If you have further questions about optimizing your site for crawling and indexing, check out some of our previous writing on the subject, or stop by our Help Forum.

Posted by Susan Moskwa, Webmaster Trends Analyst