Standards Compliant Solution for 302 Web Page Hijacking
The HTTP Content-Location header is a standards-compliant solution for the 302 web page hijacking problem. Depending on how it’s evaluated, it can be effective either as an HTTP header or as an HTML meta tag. If two different locations provide conflicting Content-Location data, priority should be given to the information most likely generated by the person who controls the content.
It’s only effective if the Content-Location header from the domain that actually serves the content overrides any Content-Location header served with a redirect.
For example:
Googlebot finds a 302 redirect on hijacker.com that points to a page on v1.magicbeandip.com. Along with the redirect, hijacker.com sends a Content-Location header pointing to a url on hijacker.com, trying to fool Googlebot into thinking that the content actually belongs to them.
Googlebot follows the redirect to the page on v1.magicbeandip.com. My server actually serves the page content, but also includes a Content-Location header that points to a url on v1.magicbeandip.com. Since my server actually served the content, it’s more likely that my Content-Location header is correct, so my header should be used to determine the canonical url of the content.
Virtually all existing web pages don’t currently have a Content-Location header, so Googlebot can’t allow the canonical url of pages that don’t contain a Content-Location header to be changed by other domains. A Content-Location header should only be able to successfully specify a different canonical url if both urls provide a Content-Location header pointing to the same place. So in order for content from v1.magicbeandip.com to be canonicalized to hijacker.com, both domains have to send Googlebot Content-Location information pointing to the same place.
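The rules above can be sketched in a few lines of Python. This is only an illustration of the proposal, not anything a search engine is known to do. The function and parameter names are hypothetical, and it assumes every Content-Location value is an absolute url.

```python
from typing import Optional
from urllib.parse import urlsplit


def host(url: str) -> str:
    """Return the lowercased hostname of an absolute url."""
    return (urlsplit(url).hostname or "").lower()


def canonical_url(content_url: str,
                  content_claim: Optional[str],
                  redirect_claim: Optional[str]) -> str:
    """Pick the canonical url for a page reached through a 302 redirect.

    content_url    -- url of the server that actually served the page
    content_claim  -- Content-Location sent by that server, or None
    redirect_claim -- Content-Location sent with the redirect, or None
    (all three names are hypothetical, chosen for this sketch)
    """
    # No header from the content's own server: other domains cannot move it.
    if content_claim is None:
        return content_url
    # The serving domain names one of its own urls: its header wins.
    if host(content_claim) == host(content_url):
        return content_claim
    # The claim points at another domain: only honored if both sides agree.
    if redirect_claim == content_claim:
        return content_claim
    return content_url
```

With the example above, `canonical_url("http://v1.magicbeandip.com/mycontent.html", "http://v1.magicbeandip.com/mycontent.html", "http://hijacker.com/stolen.html")` returns the url on v1.magicbeandip.com, because the header from the server that served the content overrides the one sent with the redirect.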
HTML Content-Location Meta Tag
In order to be effective, the Content-Location data from the person that controls the content must be used. Following this train of thought, it’s logical to recognize Content-Location data from an HTML meta tag as well.
<meta http-equiv="Content-Location" content="http://v1.magicbeandip.com/mycontent.html">
Since it’s very easy for the person controlling the content to create a Content-Location meta tag, this information should be given higher priority than a Content-Location header served by either the same domain or a different one. And again, it can’t specify a url on a different domain unless both domains agree. If hijacker.com includes a Content-Location meta tag with a Refresh meta tag, then my domain must provide the same Content-Location information in order for the change to succeed.
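As a rough sketch of that priority, a crawler could read the meta tag out of the page and fall back to the HTTP header only when the tag is missing. The class and function names below are hypothetical. The parser is Python's standard html.parser module.

```python
from html.parser import HTMLParser
from typing import Optional


class ContentLocationMeta(HTMLParser):
    """Collects the first <meta http-equiv="Content-Location"> value."""

    def __init__(self) -> None:
        super().__init__()
        self.location: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching meta tag is kept.
        if tag != "meta" or self.location is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-location":
            self.location = attributes.get("content")


def page_claim(html: str, header_value: Optional[str]) -> Optional[str]:
    """Return the page's Content-Location claim: meta tag first, then header."""
    parser = ContentLocationMeta()
    parser.feed(html)
    return parser.location or header_value
```

The value returned here is only the claim made by one page. Whether a claim that points to another domain is honored still depends on both domains agreeing.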
It Works, But There’s an Easier Way
Specifying the full url with a Content-Location header works, but adds another level of complexity to managing websites. In order to simplify things, I suggest creating a way to specify only the domain the content belongs to, instead of the full url. Determining the correct canonical url would be left up to Googlebot as long as the result is in the specified domain.
For example, you could use an “X-Content-Domain” header and meta tag that would be prioritized the same way I’ve outlined for the Content-Location header. The intent is to give priority to the information most likely provided by the person in control of the content.
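Here is a minimal sketch of how a crawler might apply such a header. The helper names are hypothetical, and it assumes the header carries a bare hostname that must match exactly. Whether subdomains should also match is left open by this proposal.

```python
from typing import Iterable, Optional
from urllib.parse import urlsplit


def in_declared_domain(url: str, declared_domain: str) -> bool:
    """True if the url's hostname equals the declared domain (exact match)."""
    hostname = (urlsplit(url).hostname or "").lower()
    return hostname == declared_domain.strip().lower()


def pick_canonical(candidates: Iterable[str],
                   declared_domain: str) -> Optional[str]:
    """Return the first candidate url inside the declared domain, if any."""
    for url in candidates:
        if in_declared_domain(url, declared_domain):
            return url
    return None
```

Given the candidates `http://hijacker.com/r.php?id=1` and `http://v1.magicbeandip.com/mycontent.html` and an X-Content-Domain of `v1.magicbeandip.com`, only the second url is eligible to be the canonical one.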
If I had to choose one or the other, I think I would choose X-Content-Domain over Content-Location.
June 3rd, 2005 at 8:15 am
Did you pull the X-Content-Domain out of a hat or something? Googling for it provides TWO results, one being this website. It does not exist in the W3C HTTP specification docs either.
YHBH YHL HAND
HopeSeekr of xMule
http://www.incendiary.ws/
http://www.xmule.ws/
June 3rd, 2005 at 10:29 am
Just to avoid confusion, I don’t know of any search engines that currently use “Content-Location” or “X-Content-Domain” to canonicalize urls. This is intended as a proposal.
I suggested using “X-Content-Domain” as an HTTP header extension, as allowed by RFC 2616, to make it simpler for webmasters to manage.
The problem in page hijacking is that the content gets assigned to the wrong domain. Specifying the full url, as required by the “Content-Location” header, is more information than is necessary to solve the problem.
June 3rd, 2005 at 2:32 pm
Oh! I’m sorry. You’re advocating the adoption of a new HTTP header … Unfortunately, this would take months, if not years, to implement widely enough to fix the problem, I’m afraid. In the words of SpongeBob: “Good luck with that.”
-hope-
June 3rd, 2005 at 4:24 pm
Do I think this is the best solution? No.
If Google implemented it, would people start using it? Yes. (Look at the nofollow attribute.)
It would be nice if Google did something to resolve the problem, but I’m not going to hold my breath. At least they can’t claim standards compliance as a reason for avoiding it anymore.
June 8th, 2005 at 10:56 am
There is a best solution…
Create a The search engines would run an MD5 sum on the hostname for any given page and verify that this matches the hash provided in the sitehash. If not, it is not authoritative.