Ever since the internet was first invented, we have known that it's not a very structured place. In fact, the internet is designed to exist without any real structure, so that the way we interact is free from the boundaries and limitations of the world we used to have.
Now you can get information from anywhere, and from anyone, but not only that, the way you get and interact with information can also take any number of different paths.
In many ways, this is also what has made the internet so amazing. Most of the things we use today would never have been possible if it had to be designed with the structures of the old. But it also makes the internet very messy, and one of the many ways this can be seen is with links.
When we link to something, we have a reasonable expectation that it will take our readers to a specific place when they click on it, but... over time links start to decay.
At first, they might just lose part of their initial value. But then, in many cases, the links stop working, or worse, start taking readers somewhere far from where we originally intended them to go.
In some of these cases, people might simply end up on a spam site. In other cases, the links might take people to places where someone is attempting to use them for fraud. But in the worst cases, the links are redirected to places that attempt to cause direct harm: phishing sites that try to steal something from you, or, even worse, malware injection sites, where an expired link is redirected to a page whose scripts try to hack your computer.
None of this is new. This is something we have talked about for ... well ... forever. But so far, none of us has really taken any steps to do something about it.
For instance, back in 2013, The Atlantic reported that 49% of the links cited in US Supreme Court decisions have since stopped working. That's astonishing. Nearly half the links no longer point to where they originally went.
In 2018, the Internet Archive announced that it had built a bot that automatically looked through Wikipedia, and it identified 9 million broken links. The project is ongoing to this day. (Thanks to Peter Nikolow for the heads up).
And last week, BuzzFeed News reported about a 'new' (it's not really new) form of link-jacking, where SEO scammers would buy up domains linked from, for instance, the New York Times, and redirect that traffic to their client's websites. The result being that those clients would get an artificial boost in traffic, and also rank higher in Google search, by using the reputation of the New York Times as a booster. (And thanks to Michelle Manafy for the heads up about that).
An investigation by BuzzFeed News found dozens of examples of the link hijack scheme being used to secure backlinks from at least 10 major news sites, including the New York Times, the Guardian, Forbes, HuffPost, CNN, BBC News, and Bloomberg.
And there are plenty of other examples.
But things are starting to change. The public now demands more structure from the web. They demand that privacy not be left up to each individual site, but instead rest on a common baseline (like GDPR). They demand that everything be not just encrypted, but authenticated in a much more trustworthy way. And they are starting to demand that any site that wants to be reputable must ensure a level of 'verifiability' in what it publishes ... including making sure that links go where they are supposed to.
These are very important trends, but how do you do this? How do we take responsibility for the links we include in our sites, knowing full well that we have no control over any of those external sites?
Well, as BuzzFeed reported when looking specifically at SEO link-jacking, they put the blame almost solely on Google.
Media ethics expert Kelly McBride, senior vice president at the Poynter Institute, said that while publishers do have a small responsibility to safeguard readers from outside threats, link schemes like this one are symptomatic of a larger problem with how Google ranks search results.
And I agree, in the specific case of SEO link-jacks, Google should do something to detect that, and if it detects that a link has been redirected, it should not count that as part of the ranking.
But remember, this is only a tiny part of the overall problem. All the other examples have nothing to do with Google. The links that no longer work, or that send readers to spam sites, or worse, to malware sites, were not created to achieve a higher SEO ranking. So Google can't fix any of that for us.
Instead, as publishers, we need to stop pointing fingers at others, and start to assume that responsibility ourselves. The SEO part is Google's problem, but the rest of it is very much our problem as publishers. And more than that, it reflects on the responsibility and trustworthiness that we want to impress on our own readers.
The real way to stop this problem is for all of us to clean up our links on a continual basis.
So when I read about this last week, I tweeted:
It would take a small amount of code (like 10 lines) for publishers to write a script that periodically checked if links in articles were still going to where they are supposed to.
Regardless of Google/SEO, this would be valuable to prevent sending readers to scam sites.
And, after tweeting that, I set aside a bit of time to write this exact script for my site.
At first, I had no real plan for this. I was merely curious as to the extent of the problem. But once I started experimenting, I was shocked to see how big the problem was.
Since 2005, I have posted 1,026 articles (that are still live on this site). Within those, I have included a total of 7,562 links (about seven per article), and of those, 3,596 links were bad.
That is a lot!
I also decided to segment the data by year, to get a sense of how this changes over time. And the result is this:
This is amazing to look at. As you can see, a staggering 93% of all links that I included in my articles back in 2005 have since decayed so that they no longer get people to where I originally intended.
You can also see how this changes per year. In 2019, there are a few links that have already broken (8 of them to be exact), but you can clearly see how this decay is increasing over time.
What's even more interesting is to look at what type of bad link it is, which I have divided into four categories.
A lot of failures are down to 400-type errors and 500-type errors.
For instance, when a page can no longer be found, you get a '404 Page not found' error, which you have undoubtedly seen hundreds of times. But there are also other types of 400 errors, like a '403 Forbidden' error when you are no longer allowed to see a page. And if Twitter goes down because of some server error, that's a 500 error.
So, in my script, I detected which error it was.
However, the biggest problem wasn't that the page could no longer be found, nor that the server the page was on was having a problem. Instead, remember what I said above about link-jacking?
With link-jacking, the link itself still works. It will still take you to a page; the problem, however, is that you are 'hijacked' to a different place. So I needed a way to detect that. What I did was compare where the link originally pointed with where people actually end up if they click on it. And if that takes people to a different domain (as with SEO link-jacking, and most phishing or malware sites), I flag that as a 'different domain'.
(Note: Another form of redirect is to HTTPS domains, but that is not included here; I will write more about that below.)
Mind you, just because a link is redirected doesn't necessarily mean that something bad is happening. It could simply be that the domain has changed. For instance, if one company merges with another, it's common for the two domains to become one. And as such, all the links to the old domain are simply redirected to the new domain.
There is nothing bad about that; it has nothing to do with spam or malware, although it does often break the link.
So look at this graph again:
By far the biggest problem isn't that a page no longer exists (404 error), but that the link still works, but is now taking people to different places. So, just checking if the link still exists is not enough. You need to also detect this other form.
In my testing, I found that most redirected links have nothing to do with SEO link-jacking, but are instead mostly just spam, or in some cases taken over by bad people.
In fact, the first article I tested, an article from 2005, had five links that now all redirected people to a cryptocurrency malware site. I mean, that was terrible, and as a publisher, it is completely unacceptable that one of my articles inadvertently links to something like that.
So... how do we fix this?
Well, with 3,596 bad links (in my case), the volume is just too high for me to manage this manually. So instead, I decided to write another script that would fix this automatically.
The way it works is very simple. Whenever my first script detects a bad link, my second script kicks in, rewrites the link in my article, and instead takes people to a warning page.
This page looks like this:
You will notice several elements here.
First, it prevents people from clicking on a bad link, and instead tells them that we have detected a problem with it. Mind you, I have no idea what the specific problem might be, because I have no way of knowing whether a redirect is good or bad (again, see the exception about HTTPS below), so it's a generic warning.
I then give people three options. Either just go back to the article, see what the link originally looked like from the Internet Archive, or ... do the stupid thing and go to the link (but don't say I didn't warn you).
But let me first briefly talk about the second option, the Internet Archive. If you don't already know, the Internet Archive is like the library of the internet. It periodically scans almost every site online, takes a snapshot, and stores it.
For instance, if you want to see what the New York Times looked like in 2005, you can just go here.
This is the same thing that they did with Wikipedia, and this is now also what I do. Whenever I discover that a link is bad, I now give people the option of simply seeing what it used to look like from the Internet Archive instead.
It's absolutely brilliant.
And just like this, I have now fixed the problem with link rot and link-jacking on this site.
I am automatically checking every link in all my articles (I have set it to run the check once per month). Whenever it detects a bad link (for whatever reason), another part of the script goes back into that specific article and rewrites that link so that people are instead taken to the warning page, where they are given the option of checking what it originally looked like from the Internet Archive instead.
As an added thing, I also rewrite the link to include rel="nofollow". This tells Google not to follow that link or give it any ranking credit, which means it will no longer contribute to the ranking, and thus also prevents the problem that BuzzFeed News was writing about.
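As a rough illustration of that rewrite (this is a hypothetical sketch: the /warning path is invented, the real logic depends entirely on how your system stores articles, and a production version would URL-encode the query value):

```python
def quarantine_link(article_html: str, bad_url: str) -> str:
    """Point a bad link at a warning page and tag it rel="nofollow".

    Hypothetical sketch only: the /warning path is made up, and the
    query value is not URL-encoded here for the sake of brevity.
    """
    safe_href = 'href="/warning?link=' + bad_url + '" rel="nofollow"'
    return article_html.replace('href="' + bad_url + '"', safe_href)

# Example: rewrite one expired link inside a snippet of article HTML.
html = '<p>See <a href="https://expired.example.com/page">this page</a>.</p>'
print(quarantine_link(html, "https://expired.example.com/page"))
```

The reader now lands on the warning page first, and Google stops crediting the dead destination.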
I'm personally both proud and somewhat ashamed of this. I'm proud because of how well it works, and how I'm now protecting my readers from the dangers of expired links. But I'm also ashamed that it took me this long to actually make this.
As I tweeted, it only takes about 10 lines of code to do all of this ... well, 31 lines of code in my case, because of some complications (more below).
But from start to finish, this entire project took me only about 7-8 hours to code, test, and implement (or in my case, two late evenings).
Nothing about this is hard to do.
So, this is my message to you as publishers. You should do this too!
Taking responsibility for the links in your articles is not just a 'Google problem', it's very much something that affects all of us, and with the public's increasing demand for privacy, security, and site integrity, it has become a much bigger factor in relation to the reputation that we have as publishers.
It might not feel like a sexy thing to talk about, it's not like talking about subscription strategies, or churn management, but it reflects on who you are as a publisher.
You want to give your readers a safe place to be, and it's an even bigger bonus if, by doing this, you can also contribute to making the internet as a whole safer. And it doesn't take that much time to build.
It's something worth doing!
Before I end this, I know that several of my readers would like to know more about the technical side of this, and while this site is not about tech, I'm going to make a slight exception to discuss a few things that are worth remembering if you want to do this yourself.
Now, first of all, I have some bad news. Whenever I do something like this, one of the questions I always get is: How can we do this in WordPress? Or how would you do this in whatever other system you are using?
My answer is that I have no idea. The special thing about this site is that it is built entirely using Python. And no, I'm not using Django or Flask either ... I'm just using 100% raw Python.
As such, the way I have specifically implemented this is likely different from how you would do it.
I'm also not going to show you my code, because that would reveal some of the internal structures and logic of my system as a whole, which I don't want out in the open.
But let me briefly talk about some of the complications I had to fix.
First of all, the way you check whether a link works or not is the simplest thing in the world. What you do is request just the headers of the link (you only want the headers, because you don't want to download the contents of every page).
In Python, this is one line of code. You simply write:
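Assuming the widely used requests library (the author doesn't show his actual code, so this is my sketch of that line, wrapped in a small helper):

```python
import requests

def link_status(url: str) -> int:
    """The whole check in one line: fetch only the headers, return the status.

    Note: requests.head() does not follow redirects by default, so a link
    that has moved reports its 301/302 status, not the destination's 200.
    """
    return requests.head(url, timeout=2).status_code
```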
Seriously, that's it. This gives you an HTTP status code that tells you whether the link is fine or not. If it reports 200, the link is fine. If it's 301 or 302, the link has been redirected somewhere else. If it's in the 400 range, the link is broken, and if it's in the 500 range, the server you are trying to reach has some kind of problem.
It's really that simple.
And so, when I wrote on Twitter that this would only take 10 lines of code, I wasn't kidding. In its purest form, you would loop through your articles, extract every link, request the headers of each one, and mark any link that doesn't report status 200 as bad. And then you would be done.
Well... kind of done.
The problem is that the internet is a mess, so if you just do this, your script will likely break almost instantly as it comes across something that you didn't expect.
For instance, what if the link just never responds? Okay... well... we can add a timeout so that if it hasn't responded within a short period of time (I set this to two seconds), we assume that it's broken and move on.
Another example is what if the link isn't a link? Well... okay, then we need to add some error handling to deal with that.
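Putting those pieces together (the loop, the timeout, and the error handling), a minimal sketch might look like this. Again, this assumes the requests library, and the `links` list is a hypothetical stand-in for links you have extracted from your own articles:

```python
import requests

def find_bad_links(links):
    """Return (url, status) pairs for every link that doesn't answer 200."""
    bad = []
    for url in links:
        try:
            # Headers only, with a two-second timeout so one dead
            # server can't stall the whole run.
            response = requests.head(url, timeout=2)
            if response.status_code != 200:
                bad.append((url, response.status_code))
        except requests.RequestException:
            # No usable response at all: a timeout, a DNS failure,
            # or a "link" that isn't really a link.
            bad.append((url, None))
    return bad
```

How you extract the links in the first place depends entirely on how your own system stores its articles.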
But the biggest problem is the redirects. What do we do about those? Do we just mark them as invalid because they no longer go where we originally found the link? Or do we accept them?
Well, let me give an example.
Back in 2005, I wrote an article where I linked to the Apple iPod. The link at the time was:
Today this link no longer exists, and instead it redirects to this:
You will notice a few changes. First of all, it has changed to an HTTPS-encrypted link, and secondly, the iPod is now the iPod Touch.
So is this a valid link?
Well, yes it is. Sure it has changed, and you might also argue that the iPod it is linking to today is different from the one I wrote about in 2005, but this link is still fine.
But what about this example?
Let's say that your original link looks like this:
But now it's linking to this:
Now we have a problem. Yes, the link still works, but clearly people are now ending up somewhere that we don't want them to go, and that may be harmful to them.
So we need to add some way to detect this.
For every link I check, I don't just check the link I have posted; I also follow it through any redirects to see where people actually end up. Again, this is not hard to do.
We simply tell the system to follow the redirects, then we check where those are going by looking at the 'history', and then check whether the original link and the final link belong to the same domain.
And if it detects that the domain has changed, it classifies that link as compromised. I set it to error status 400, but the number is irrelevant.
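That check could be sketched like this, again assuming the requests library (a real version would probably also want to treat www.example.com and example.com as the same domain):

```python
import requests
from urllib.parse import urlparse

def check_with_redirects(url):
    """Follow redirects and flag links that land on a different domain."""
    response = requests.head(url, timeout=2, allow_redirects=True)
    # response.history holds one response object per redirect hop followed.
    if response.history:
        original_domain = urlparse(url).netloc
        final_domain = urlparse(response.url).netloc
        if original_domain != final_domain:
            # The specific number is irrelevant; it just marks the link bad.
            return 400
    return response.status_code
```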
The final step is then to check the result. If the status of the link is anything other than status 200, mark that link as invalid, and update it in the original article so that people instead see the warning page.
So there are a few complications along the way, but it's not technically difficult to make. As I wrote, this whole project took me only about 7-8 hours to do over two evenings.
I did also add a few extra checks more specific to my site. For instance, I check whether a link is going to an external site, or whether it is just going to my own site, and a few other bits like that.
So, again, I will encourage you to do the same thing. When people visit us as publishers, they should feel safe clicking on any link that we include.
One final note: I'm sure some of you are now thinking that "it would be easier if we just didn't link", but don't think like that. Providing links is a vital part of the value that you provide as a publisher.
It's not about giving credit, it's about helping people connect with the things you write about. If you are talking about a new thing, linking to it is an essential part of the value, and without it your articles become meaningless.
So always link! ... But also take responsibility for making sure that they continue to have value.
As I said earlier, this whole thing is very much on us.
Founder, media analyst, author, and publisher.
"Thomas Baekdal is one of Scandinavia's most sought-after experts in the digitization of media companies. He has made himself known for his analysis of how digitization has changed the way we consume media."
Swedish business magazine, Resumé