In the world of search engine optimization (SEO), many things are in a constant state of flux - Google changes something in its ranking algorithm, Yahoo changes the way it weights links, and so on. Through all of this, a couple of things aren't rocket science and will remain true until the end of time: first, have the best content you possibly can; second, present that content in the best way possible. This article isn't about making good content - that's your job - but about how to present that content in the best possible way, for both your readers/customers and the search engines.
This article focuses on using the free link checking software AnalogX LinkExaminer. There are other options out there, both free and commercial, but since I wrote LinkExaminer, I'm only going to talk about it. ;) For those unfamiliar with what software like this does, think of it as your own personal mini spider, just like the ones all the big search engines use. You can have it spider your entire site and get a much better feel not only for what's on there (if it's a bigger site), but also for how it looks from a search engine's perspective.
Assuming you've downloaded and are running the latest copy of LinkExaminer, you'll want to tweak a couple of settings to get the most out of it from the SEO perspective. Only a couple of settings are critical at this point, so to the right you'll see a screenshot of what to tweak. The first option to turn on is 'Cache pages', which makes LinkExaminer keep a copy of the HTML source code for every page it spiders. One word of warning: having this turned on causes the program to use A LOT of memory, so be aware of this, especially if your site has a large number of pages (typically more than 100k). The only other option you might want to turn on is 'Find similar pages'; with this enabled, once the spider has completed its scan of the site, it will try to identify pages that look similar to each other. That should be it - feel free to toggle any of the other features as you see fit. None of them should have a substantial impact on the results, but you might be able to tune the scan to catch other things that wouldn't normally be covered.
Gentlemen, start your scanning
Once everything is ready to go, simply enter the URL of your website and start the scan. In this article I'll be jumping around through a couple of my websites, specifically AnalogX and Internet Traffic Report. You'll also notice that in some of the examples I'm running against localhost instead of the actual live webserver; I normally like to tweak things offline and, once I'm satisfied with the changes, push them up to the public webserver. Another advantage of running it locally is that it doesn't put any additional strain on your webserver - although if the impact of scanning your website is profound enough to affect your site's performance, you should probably look into what's causing the problem.
Most link checkers were written, you guessed it, to check links - in particular, broken ones - so this is where we're going to start. While search engines don't have any particular problem with broken links per se, they are an indicator of a site that is being poorly maintained, or perhaps not maintained at all anymore. Keep in mind that broken links happen for a variety of reasons: content moves, sites get updated or migrated and some of the content doesn't come along, etc. There are also two kinds of broken links: those internal to your site (one of your pages linking to another of your pages) and those external to it (such as when you link out to another website you reference). No broken links are good; you probably won't be penalized for a broken external link, since you don't have as much control over it, but you more than likely will be penalized for broken internal links.
Using a tool like this on a regular basis is a very simple way to ensure that everything is working correctly and that nothing has inadvertently moved or been misplaced, so let's walk through how to spot broken links as well as fix them. Here's what they look like in the main window:
As you can see, the red line is a broken link; the HTTP return code is 404, which means "Object not found". You can also see that this is an internal link, so you know it was referenced on one of your own pages; to find out which page, open the 'Link details' window by right-clicking on the row. The link details window gives you all the linking specifics for a particular page: links in and out, at what depth in the site hierarchy they sit, what type of link was used, etc. (we'll go over this window in more detail later in the article). Here it shows a link coming in from only one page (in this case the main page), via an AREA html tag - armed with this information, you simply go to that page and fix the link. It's that simple.
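The two distinctions above (broken vs. working, internal vs. external) are easy to express in code. Here's a minimal Python sketch - the function name and its return shape are my own illustration, not part of LinkExaminer, and the host comparison is deliberately simplified:

```python
from urllib.parse import urlparse

def classify_link(status_code, link_url, site_host):
    """Classify a crawled link the way a link checker would.

    Returns a (broken, internal) pair: broken links are any 4xx/5xx
    response (404 "Object not found", 500, etc.), and a link counts as
    internal when its host matches the site's own host.
    """
    broken = status_code >= 400
    host = urlparse(link_url).netloc
    internal = host == site_host or host == ""   # relative URLs are internal
    return broken, internal
```

A broken internal link (`classify_link(404, "/missing.htm", "analogx.com")`) is the case you'd prioritize fixing, per the penalty discussion above.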
If you're unfamiliar with the term "On-Page SEO", it simply means optimization you do to your own pages, as opposed to optimization from other sources (such as links coming into your site). It's also important to point out that I'm only covering the VERY BASICS of on-page SEO; that being said, these make up the foundation and are things you'll almost ALWAYS want to follow. LinkExaminer has a special column called SEO that points out some of the more common mistakes (the screenshot is just a closeup of the SEO and page Title columns):
As you can see, the actual messages are pretty self-explanatory. The first line with a warning identifies two potential problems: the meta description is too short, and the page title is actually longer than the description. The SEO recommendations are normally based on whatever would make the page most compatible; so if Yahoo displays 115 characters of the title while Google only displays 65, it will warn when a title is more than 65 characters.
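To make these rules concrete, here's a small Python sketch of the kind of checks involved. The exact thresholds (65 title characters, 70 description characters) are illustrative assumptions on my part - the point is that you warn at the strictest engine's limit so the page is compatible everywhere:

```python
def seo_warnings(title, meta_description, max_title=65, min_desc=70):
    """Flag the common on-page problems described above.

    The limits are assumptions for illustration; real display limits
    vary by engine and change over time.
    """
    warnings = []
    if len(title) > max_title:
        warnings.append("title too long")
    if len(meta_description) < min_desc:
        warnings.append("meta description too short")
    if len(title) > len(meta_description):
        warnings.append("title longer than description")
    return warnings
```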
Through the looking glass
Remember how you turned on the cache pages option in the configuration? Now we're going to start working with the data it has gathered; with this information, we can get a much clearer view of how a search engine sees everything. The first view is the raw HTML:
This really isn't anything special - it's the HTML you (should) already be familiar with, with no formatting or special processing done; just a raw dump of the text. To get this view (or any of the other views), select the row you're interested in and right-click on it; this brings up the context menu where you can select the view you're after.
The next view is the parser window:
This view lets you see how the actual HTML parser is decoding the page: how tags are interpreted, text extracted, etc. Of course Google has its own HTML parser, which is not going to work exactly the same way (just like each browser has its own), but this at least lets you see how things are being parsed and whether something isn't being processed correctly. In this view, all the formatting is first removed from the HTML, and then it is reformatted for display in the window - this is why you'll see that certain elements have changed, such as tags now being lowercase even though they were uppercase in the original. There is also a depth value here, which indicates whether or not you're inside another tag; for instance, the meta section is always inside the head section, so its depth is +1. This is helpful for spotting missing close tags, which can sometimes affect rendering - if you scroll to the very last tag (which more than likely is the closing html tag), the depth should be back to 0. The final view is the content window:
This view is probably the closest you'll get to how a search engine sees your page - forget the tags, formatting, etc.; it's just a stream of words. I left in just a bit of formatting to help make it more human-readable, but this gives you a sense of how things are seen. Some aspects that typically aren't visible now are, such as image alt text being used in place of the picture, while other aspects are not represented, such as whether text was inside header tags, was part of the meta tags, etc. Hidden text is excluded, and the title is not included, since it is generally handled in a special way and not as a raw part of the content.
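The behavior just described - markup dropped, alt text substituted for images, the title skipped - can be approximated with Python's standard html.parser module. This is my own rough sketch of the idea, not LinkExaminer's actual engine:

```python
from html.parser import HTMLParser

class ContentView(HTMLParser):
    """Reduce a page to a plain stream of words, roughly the way the
    content view does: alt text stands in for images, while title,
    script, and style contents are skipped."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._skip = 0   # nesting count inside title/script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "script", "style"):
            self._skip += 1
        elif tag == "img":
            alt = dict(attrs).get("alt", "")
            self.words.extend(alt.split())   # alt text replaces the picture

    def handle_endtag(self, tag):
        if tag in ("title", "script", "style"):
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(data.split())
```

Feeding it a page yields just the word stream, with no indication of which words were in headers or meta tags - exactly the flattening described above.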
Duplicate content: users don't like it, search engines don't like it, and it can sometimes be one of the most difficult things to track down as your site grows and ages. LinkExaminer has a special function specifically designed to make this task easier, as well as to give you a better idea of what a search engine might consider duplicate content and what not. The content identification doesn't work by simply scanning the HTML - if it did, then whenever your site design changed, every page would effectively look completely new. Instead, the engine works only with the content portion (similar to the content view) and looks at how similar the keyword density is between pages - this way it gets more of a soft match between them. It also means that if you do simple things like reorder paragraphs or change the order of sentences, the pages will still show as duplicates. Here's an example of what it looks like on my site:
In the above example, I highlighted the first three lines because they're effectively the same link - the only slight difference between the localhost copy and the one on the web is some minor text changes, which is why their percentages are so close. The bulk of the other matches in the screenshot are, ironically, my screenshot pages; that's because there isn't much unique content on each page (there are only so many ways to say "main dialog" and "configuration dialog"), but it is helpful to be aware of what is similar. Now, if I were concerned that search engines might (wrongly) see these as duplicate pages, I could develop a strategy to introduce more variation into them.
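One common way to get the kind of "soft match" described above is cosine similarity over word counts; I'm assuming that approach here purely as an illustration, since the article doesn't spell out LinkExaminer's exact formula:

```python
import math
from collections import Counter

def keyword_similarity(words_a, words_b):
    """Cosine similarity between the keyword-count vectors of two
    pages. Reordering sentences doesn't change the word counts, so
    shuffled near-duplicates still score close to 1.0."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Two pages with the same sentences in a different order score 1.0, while pages sharing no vocabulary score 0.0 - which is why a design change (shared boilerplate stripped out first) doesn't trip the detector.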
The relationship of links and linking is core to both SEO and good web design in general. Regardless of whether or not page caching is enabled, LinkExaminer always maintains the relationship between links on pages; this means both links coming in as well as links going out. To access the link relationship viewer, right-click on the row you're interested in and select it from the menu; you should see something that looks like this:
The top section shows inbound internal links pointing to the page (if you forget which page you're looking at, it's listed in the window title), while the bottom section shows all the links that are actually on the page. The number of hits is how many times a particular page contains a link to this one, and the link type tells you what type of link it was (image, a href, scripting, etc). The depth is the actual depth of the page in the sitemap, which is useful for getting an idea of how deeply a particular page is linked. If you want to check out a particular link, you can right-click on it to either copy it to the clipboard or launch it in the browser.
This view is particularly useful when diagnosing broken links, redirections, etc - since it will show any of the pages actually linking to the broken link. In the case of a redirection, the URL returned from the redirection will be the only link in the bottom section, so you can follow the full redirection path.
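The bookkeeping behind a viewer like this - links in, links out, and sitemap depth - can be sketched as an inverted index plus a breadth-first traversal from the front page. The function below is my own illustration of that idea:

```python
from collections import defaultdict, deque

def build_link_graph(pages, start):
    """Given {page: [outbound links]}, compute inbound links (the
    inverted index) and each page's depth in the sitemap, i.e. its
    BFS distance from the start page."""
    links_in = defaultdict(list)
    for page, outs in pages.items():
        for target in outs:
            links_in[target].append(page)

    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in pages.get(page, []):
            if target not in depth:                # first time seen:
                depth[target] = depth[page] + 1    # one level deeper
                queue.append(target)
    return links_in, depth
```

Outbound links come straight from the page; inbound links only exist once the whole site has been spidered, which is why the tool maintains this relationship for the entire scan.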
Where do sitelinks come from
Most of you are probably familiar by now with what sitelinks are; if not, they are the links to different parts of your site that appear in Google search results, typically under the link and description. You can tweak these via Google's webmaster tools, but have you ever wondered how it determines what a sitelink is? As with many things SEO, there isn't a hard and fast rule, but one thing I have noticed is that when Google has picked sitelinks for any site I've been involved with, they have always been exactly the same as the most internally linked pages on roughly the same level:
So here you see the most internally linked pages - the only ones that are relevant here are the text/html pages, and the ones that are at the lowest depth (in this case 1). So why is this valuable? For starters, when you're designing a new site this gives you a much better idea of how and what Google may pick. Next, you can take care of ranking flowing through pages that don't matter; in the above example, do I really want search.htm to be one of my sitelinks, or to get any search traffic at all apart from users already on the site? Probably not.
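If you want to apply the same heuristic to your own crawl data, it boils down to "most inbound internal links among text/html pages at a given depth". A hypothetical sketch (the data shapes are my own, not an export format):

```python
def sitelink_candidates(inbound_counts, depths, content_types, level=1):
    """Rank likely sitelink candidates: keep only text/html pages at
    the given sitemap depth, ordered by inbound internal link count.
    This mirrors the observation above, not any documented Google rule."""
    pages = [p for p in inbound_counts
             if depths.get(p) == level
             and content_types.get(p) == "text/html"]
    return sorted(pages, key=lambda p: inbound_counts[p], reverse=True)
```

Images and deeply buried pages drop out of the ranking immediately, which matches what the screenshot's filtering by type and depth shows.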
Obey your robot overlords
When a search engine visits a site, it first checks the contents of a file called 'robots.txt' at the top of the domain. This file says which spiders are allowed to crawl the site, as well as which directories they can and can't go into. Some of the more common uses are to exclude things like scripts, calendars, or anything else a spider might encounter that it could get stuck in, or that wouldn't be of value to people trying to find the site. LinkExaminer has an interpreter that allows it to follow the same rules a search engine would, or, if you want it to dig a bit deeper, you can have it ignore the robots rules. Here's what you'll typically see from filtered links:
If you're wondering how it manages to get the Title for a page it didn't download, the answer is that it doesn't - what it displays in the title field (when no title is available) is the text from the inbound link.
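Python's standard library ships a robots.txt interpreter, which is handy for checking by hand which URLs a well-behaved spider would skip; the rules below are a made-up example, not from any real site:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt excluding the classic spider traps
# mentioned above: scripts and calendars.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /calendar/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# can_fetch(user_agent, url) answers: may this spider crawl this URL?
rp.can_fetch("*", "http://example.com/cgi-bin/search.pl")    # filtered out
rp.can_fetch("*", "http://example.com/contents/index.htm")   # allowed
```

A crawler honoring these rules never downloads the filtered pages at all, which is exactly why the tool has to fall back on inbound link text for their titles.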
As a side note, here's a power user tip: when you click on a column header, it sorts the list in either ascending or descending order - nothing too groundbreaking there. But did you know that the sorting is maintained when you click on another column? In the above screenshot I clicked on 'Dynamic' first, sorting the list so I could easily see all the dynamic vs. static links. Then, when I clicked on 'Robots', the list was effectively sorted into four parts: static URLs filtered by robots, dynamic URLs filtered by robots, static URLs not filtered by robots, and dynamic URLs not filtered by robots. This is a great trick when you want to apply a bit more organization to what you're looking at.
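This multi-column trick works because the sort is stable: sorting by a new column preserves the previous ordering among equal keys. The same idea in Python, with made-up rows standing in for the scan results (sort by the secondary key first, then the primary):

```python
# Hypothetical scan rows: is the URL dynamic, and is it robots-filtered?
rows = [
    {"url": "/a.php", "dynamic": True,  "robots": False},
    {"url": "/b.htm", "dynamic": False, "robots": True},
    {"url": "/c.php", "dynamic": True,  "robots": True},
    {"url": "/d.htm", "dynamic": False, "robots": False},
]

rows.sort(key=lambda r: r["dynamic"])                # click 'Dynamic' first
rows.sort(key=lambda r: r["robots"], reverse=True)   # then click 'Robots'

# Python's sort is stable, so within each robots group the earlier
# static-vs-dynamic ordering survives, giving the four groups:
# static filtered, dynamic filtered, static unfiltered, dynamic unfiltered.
```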
Any time you run a scan, it's trivial to also export a report; this is useful since it gives you visibility into how your site is doing over time. In the configuration you can have it automatically add the time and date to the filename, or handle things yourself. The most common log format you'll probably want to export is the HTML report; this report is fully customizable, so check out the program documentation to get a sense of what it does and what is possible.
The reports aren't just for archiving - they contain most of the information you'll need to fix problems (the pages that link to broken pages, redirections, abnormally large pages, etc). Many times I do my first pass using the GUI, export a report, and then continue to work off of it later.
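As a sketch of what such an export might look like - the field names and filename pattern here are hypothetical, not LinkExaminer's actual report format:

```python
import csv
import datetime
import io

def build_report(rows):
    """Build a CSV report (one row per scanned URL) plus a dated
    filename, mirroring the add-date-to-filename option so successive
    scans can sit side by side for comparison over time."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["url", "status", "linked_from"])
    writer.writeheader()
    writer.writerows(rows)
    stamp = datetime.date.today().isoformat()
    return f"scan-{stamp}.csv", buf.getvalue()
```

Because each report includes which pages link to the broken ones, you can work straight off the file later without re-running the scan.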
Hopefully you've found this information helpful in your quest for search engine greatness - in my experience nothing beats the old-fashioned combination of great content and presentation/management. While this won't help you on the content side, LinkExaminer gives you some serious help in making the presentation/management side of things as solid as possible. I would also recommend that you make running the program over your site a regular event - say every couple of months, or sooner depending on how often you update it - to make sure everything is running as well now as when you put it up. Good luck!