RSS feeds are one of the things that I really like. They allow me to get the information I want, the way I want it. There is just one problem: not every site offers an RSS feed, and those that do often do not offer the right content.
I wanted to do something about that.

WEB2RSS is a simple web application that can convert almost any normal website/page into an RSS feed. If you want to receive updates in your feed reader when your favorite site changes, this is the tool for you.
Technically speaking, WEB2RSS requests the page you want, strips it of all non-text elements, and outputs the rest as an RSS file. What you get is the content of any page, without form elements, script blocks, iframes etc.
Some examples: If one of your friends' websites does not offer an RSS feed, now it does. If you are really into the art of "Camera tossing", you can now search for it on Flickr and subscribe to the search results. Or what about being kept up-to-date with the forum you visit frequently, or maybe just a specific topic? What about getting the latest updates to a seminar you are interested in - like the HCI class at Stanford University? Maybe you just want to get the daily comic strip from your favorite cartoonist, or the latest news from your favorite hockey team.
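A minimal sketch of the stripping idea, written in Python for illustration only. The list of skipped tags and the names `SKIPPED_TAGS`, `TextExtractor` and `extract_text` are assumptions made for this example, not the actual WEB2RSS code.

```python
from html.parser import HTMLParser

# Tags whose whole contents are dropped. This list is an assumption based on
# the "form elements, script blocks, iframes etc." mentioned in the text.
SKIPPED_TAGS = {"script", "style", "form", "iframe"}


class TextExtractor(HTMLParser):
    """Collects text that is not nested inside one of the skipped tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped tags we are currently inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every skipped tag.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the readable text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<html><body><p>Hello</p><script>var x = 1;</script>"
    "<form><input name='q'>Search</form><p>World</p></body></html>"
)
print(extract_text(page))  # Hello World
```

A real converter would keep some markup (links, paragraphs, optionally images) rather than reduce everything to plain text; the sketch only shows the "drop the non-text elements" step.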

Or what about getting Google search results as an RSS feed, or finding out when Google publishes something new on Google Labs? Or maybe you just want to know when your competitors publish new content.
You can do all of these things with WEB2RSS.
Using WEB2RSS is pretty simple. Type in the URL you want to use, check the link and... Boom (as Steve Jobs would say), you get an RSS link you can subscribe to.

There are a number of more advanced features that you can use. The easiest one is "Include Images". By default, all images are removed from the feed to prevent layout images from cluttering up the results, a common problem on table-based sites. But you can force it to include images if you want.
The two other advanced settings, "Match" and "Exclude", are much harder to deal with. With "Match" you keep only a specific part of a page: if you want just a specific image, DIV or table, this is what you use. "Exclude" works the same way, except it removes the element you define instead of keeping it.
The hard part of these two settings is that they function using something called "Regular Expressions". If you do not know what these are, my best advice is to not use them. Regular expressions are a very complex beast, but they make WEB2RSS very flexible and powerful.
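To give a feel for what "Match" and "Exclude" patterns look like, here is a small sketch in Python. The function `apply_filters`, the sample page, and the order in which the two patterns are applied are assumptions made for this example; how WEB2RSS applies the patterns internally is not described here.

```python
import re


def apply_filters(html, match=None, exclude=None):
    """Keep only what `match` finds, then delete whatever `exclude` finds."""
    flags = re.IGNORECASE | re.DOTALL  # ignore case, let "." span line breaks
    if match:
        found = re.search(match, html, flags)
        if found:
            # Use the first capture group if one matched, else the whole match.
            html = found.group(1) if found.lastindex else found.group(0)
    if exclude:
        html = re.sub(exclude, "", html, flags=flags)
    return html


page = (
    '<div id="menu">Home | About</div>'
    '<div id="news">Rain <span class="ad">Buy now</span>tomorrow</div>'
)
result = apply_filters(
    page,
    match=r'<div id="news">(.*?)</div>',    # keep the inside of the news DIV
    exclude=r'<span class="ad">.*?</span>',  # drop the advert inside it
)
print(result)  # Rain tomorrow
```

The `.*?` form matches as little as possible, which matters on pages with many DIVs: a plain `.*` would run on to the last closing tag on the page.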
Yes, of course. There are a number of things that can prevent a successful conversion. Like:

It does work very well with simple table-based sites, and - of course - all the sites that are designed using XHTML and CSS.
WEB2RSS is released as a beta primarily because of worries about scalability. I am not sure how the server will react to heavy use of it. I need some real-life usage to know more.
... and let me know what you think. General comments can be posted here, on this page. Use this page for bugs and feature requests.
Thomas Baekdal - Jul. 7, 2006
John, Thanks for stopping by.
1: Strange (about the comments). Could you tell me which browser/OS you are using? I cannot reproduce the problems :(
2: The URL, I admit, is a bit strange - but this is how it is supposed to look (at least for now). The problem is that to make a more "friendly URL" - like "web2rss.baekdal.com/this-or-that/" - I need to save the original URL and all the settings in a database. That is something I want to do, but it increases the complexity of the application.
3: Bloglines... well, it should work (needs more testing to be sure).

4: Hard to tell. The statistics have just come in. About 400 people have tried it so far, which isn't really that many. I cannot see if they have subscribed to a feed. I do not record the actual subscription.
There have been a number of problems. I tested it with about 100 different sites before it was launched. But the number of badly coded sites is staggering.
There have been small problems, like handling incorrect body tags (one site had 4 open/close body tags... and WEB2RSS got confused), and bigger problems, like handling codepage errors (sites sending out UTF-8 content, but including non-UTF characters = crash).
I have added several extra filters since the launch - to remove garbage code or unwanted tag attributes.
5: Check your inbox in 10 minutes
Johan - Aug. 16, 2006
Hello,
just wanted to thank you for this great concept.
It works very nicely on the kmi.be weather service page :p Although I had to do some fiddling around with the regular expressions part.
regards,
Johan
krammer - Sep. 1, 2006
Hi. I'm so interested in that project. I want to know if you have developed it by analysing the websites manually (you have introduced the tags you consider) or you have developed a neural network learning from a database for it.
Thomas Baekdal - Sep. 2, 2006
Krammer,
I am not sure I understand your first question.
As for the second one, it does not use a database at the moment (it might do so in version 2). It simply converts the website into RSS "LIVE".
syed - Oct. 17, 2006
Hello,
Thank you for this excellent tool. Is there a way to adjust table width in the retrieved feed? For some reason, the rendered table in Sage for Firefox is far too small. I tried using the advanced settings in web2rss to modify it, but I guess I was using the wrong regular expressions. The URL I've been trying to use is
http://www.cancer.gov/ncicancerbulletin/
Thanks again.
Syed.
Thomas Baekdal - Oct. 18, 2006
Syed, WEB2RSS strips size and style information - regardless of the settings.
This means that any table will be displayed at its default size, defined by the program you view the RSS in.
PS: I tried to test your example, and it looks fine in my RSS readers (I do not have Sage).
John Rhodes - Nov. 13, 2006
Thomas,
It's been a few months now. How long until this puppy is out of beta? How long until I can license it! ;-)
Thomas Baekdal - Nov. 13, 2006
Hi John,
I must admit that I have not spent much time on this since it was launched (too many things, too many ideas).
I can very quickly release it as final - the main reason for it being in beta was bandwidth issues. As it turned out, that was not a problem after all.
Do you have something specific in mind that you need it to do?
BTW: you can license it anytime :o))
James Mead - Feb. 1, 2007
This is a great service. However, the only item in the feed I created seems to always have a pubDate of 2006-01-01.
This means my reader doesn't realise the web page has changed.
For reference the url is http://www.baekdal.com/web2rss/rss.asp?url=http%3A//www.gner.co.uk&m=%3Cdiv%3EUp%20to%20and%20including%26nbsp%3B%3C/div%3E%3Cdiv%3E%28.*%29%3C/div%3E&ex=&img=0&out=rss
Thanks again for a great service.
James Mead - Feb. 2, 2007
Thanks. In case you hadn't noticed I'm using a regex in my url. Don't know if that makes a difference.
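For readers wondering what is packed into a feed URL like the one above, here is a short Python sketch that decodes it with the standard library. The helper name `decode_feed_url` is made up for this example, and the reading of the parameters (`url` as the source page, `m` as "Match", `ex` as "Exclude", `img` as "Include Images") is an assumption inferred from the settings described in the article.

```python
from urllib.parse import urlsplit, parse_qs

# The feed URL quoted in the comment above, split over lines for readability.
feed_url = (
    "http://www.baekdal.com/web2rss/rss.asp?url=http%3A//www.gner.co.uk"
    "&m=%3Cdiv%3EUp%20to%20and%20including%26nbsp%3B%3C/div%3E"
    "%3Cdiv%3E%28.*%29%3C/div%3E"
    "&ex=&img=0&out=rss"
)


def decode_feed_url(url):
    """Return the query parameters of a feed URL as a plain dict."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps empty settings such as "ex=".
    params = parse_qs(query, keep_blank_values=True)
    return {name: values[0] for name, values in params.items()}


settings = decode_feed_url(feed_url)
for name, value in settings.items():
    print(f"{name} = {value!r}")
```

Decoded this way, the `m` value reads `<div>Up to and including&nbsp;</div><div>(.*)</div>`, which is the regular expression mentioned in the comment.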
Thomas Baekdal - Feb. 8, 2007
James, I have a problem. I cannot determine the exact date when a website changed, and as such I cannot supply it in the RSS.
The problem is that the only way I know to read the date of a page is to look at the modified date in the response headers. Unfortunately, most sites I have tested return the time and date the page was retrieved - not the date it actually changed.
I have done a few tests, and it seems to me that if I change the RSS feed to mimic the modified date provided by the header, the RSS would appear to have been updated all the time (even if none of the content has changed). This would of course make the problem worse.
It would seem that although the current solution isn't perfect, it is at least the best one.
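The header problem can be illustrated with a small Python sketch. It assumes the response headers are already available as a plain dict with exact-case keys; the function names and the five-second tolerance are invented for this example and say nothing about how WEB2RSS itself is written.

```python
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Parse an HTTP date header value; return None if missing or malformed."""
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


def looks_generated_on_request(headers, tolerance_seconds=5):
    """Guess whether Last-Modified just echoes the time of the request.

    Returns True when Last-Modified is missing, unreadable, or within a few
    seconds of the Date header, in which case it is useless as a change date.
    """
    modified = parse_http_date(headers.get("Last-Modified"))
    served = parse_http_date(headers.get("Date"))
    if modified is None or served is None:
        return True
    return abs((served - modified).total_seconds()) <= tolerance_seconds


static_page = {
    "Date": "Thu, 08 Feb 2007 10:00:00 GMT",
    "Last-Modified": "Mon, 01 Jan 2007 09:30:00 GMT",
}
dynamic_page = {
    "Date": "Thu, 08 Feb 2007 10:00:00 GMT",
    "Last-Modified": "Thu, 08 Feb 2007 10:00:00 GMT",
}
print(looks_generated_on_request(static_page))   # False
print(looks_generated_on_request(dynamic_page))  # True
```

Copying the header of a page like `dynamic_page` into the feed would give a new date on every request, which is the "updated all the time" effect described above.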
Anonymous - Feb. 9, 2007
Hai
Nice to know new information through this effort
Dhanapal Andi - Feb. 9, 2007
Hi
Really this is very useful for our project
James Mead - Feb. 9, 2007
Thanks for looking into it, Thomas. It's a shame websites don't use the modified date response header correctly.
Can you explain how the current method works?
Thomas Baekdal - Feb. 9, 2007
Hi James,
WEB2RSS is an amazingly simple application. All it does is this:
1: Fetches the link
2: Filters the content based on your preferences (optional)
3: Removes any styling and any non-RSS compatible content
Now we have the clean data for the RSS feed (in between all of this, there are a number of validation functions to ensure that the content can be transformed to RSS)
4: Creates an RSS page
5: Sets the title to be the same as the page's title
6: Sets the description to be the same as the meta tag's description
7: Adds the clean data to the content area of the RSS page
8: Adds a fixed date (January 1, 2006 - 01:00:00 AM)
Now we have the final RSS page
9: Returns a "Subscribe to..." link that people can subscribe to.
All of this happens every single time you or your feed reader requests a page.
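Steps 4 to 8 above can be sketched in a few lines of Python. This is only an illustration: the function `build_rss`, the choice of a single feed item, and treating the fixed date as UTC are all assumptions, since the steps do not say how the real feed is laid out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime

# Step 8: the same date on every request (time zone assumed to be UTC here).
FIXED_DATE = datetime(2006, 1, 1, 1, 0, 0, tzinfo=timezone.utc)


def build_rss(page_url, page_title, meta_description, clean_content):
    """Wrap already-cleaned page content in a one-item RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")            # step 4
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = page_title              # step 5
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = meta_description  # step 6

    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = page_title
    ET.SubElement(item, "link").text = page_url
    # Step 7: ElementTree escapes the markup, so readers see it as HTML text.
    ET.SubElement(item, "description").text = clean_content
    ET.SubElement(item, "pubDate").text = format_datetime(FIXED_DATE, usegmt=True)
    return ET.tostring(rss, encoding="unicode")


feed = build_rss(
    "http://example.com/",
    "Example page",
    "A page used for this sketch",
    "<p>Hello</p>",
)
print(feed)
```

Because `pubDate` never changes, two requests for an unchanged page produce byte-for-byte identical feeds, and a changed page differs only in its content.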
The feed reader will do the following:
1: The first time it simply displays the RSS feed (like any other feed)
2: The second time it will either do nothing (if the feed hasn't changed), or display it as either updated or as a new feed item if some of the content has changed (this varies greatly from one feed reader to another). And, as you have found, some feed readers do not do anything because the date (which is fixed) is the same.
To fix the problem I would need to create a much more complicated system. I would need to store every single change in a database, with the exact date that it was fetched. I would also need to compare each request with any previous ones in the database, to see if something new has been added (and then create a new database entry).
There are two problems with this:
1: The complexity is more than I have time for.
2: The data storage needs would be enormous, as I would need to store every single page that goes through WEB2RSS. Today it is simply a live converter - WEB2RSS does not store or record any information. It fetches the page, converts it, and outputs it as RSS - then it forgets all about it until you (or your feed reader) make another request.
francesco - May. 17, 2007
ciao!
i'm struggling with some bad-coded sites in order to retrieve some content.
i do agree with you that regex is not for faint hearted people: my heart works fine... my brain a bit less!! :-)
may you put a couple of examples -just to watch out at the syntax ?
grazie!
Daniel Aleksandersen - May. 17, 2007
Too bad it does not work with Michael Shanks' blog over at TVguide.com. :(
Thomas Baekdal - May. 18, 2007
Daniel, I just tried it. I have no problems with it.
BTW: I am curious though. Why? WEB2RSS was created to turn sites without RSS support into an RSS feed. Michael's blog already has an RSS feed:
feed://community.tvguide.com/rssthreads.jspa?forumID=800048552
Francesco. RegEx examples are very site specific. Could you give me a site that you have problems with?
David - Jul. 27, 2007
(rss generating newbie alert - using google reader)
Seems to work ok for blogger.com sites but I can't figure out the required advanced settings incantation for the following discussion forum:
http://www.maxx.co.nz/forum/index.cfm/fuseaction/listings/CFB/1/forum/2
As you can see this df is a table of threads with no preview information. The content of the most recent item is only visible once the link is clicked and the thread is displayed.
Ana - Aug. 24, 2007
This is very similar to Feed43.
Except you don't need to add extra tags to specify what's date, title and content body....
Abhijit Shylanath - Nov. 13, 2007
Hi, Thomas. Great app. I'd been thinking of making one like this, but you beat me to it. :)
Regarding modified-date, will it be too much of a problem to do a quick checksum of the contents/body of each page w2r reads, and store it along with the complete URL (normalized), and UNIX timestamp? Next time it reads the page, if the checksum doesn't match, it can return a time ahead of the previous timestamp, and update the database accordingly.
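The checksum idea in this comment could look roughly like the Python sketch below. It is not part of WEB2RSS: the class `ChangeTracker` is invented for this example, an in-memory dict stands in for the database the comment proposes, and URL normalization is left out.

```python
import hashlib
import time


class ChangeTracker:
    """Remembers a checksum per URL and the time the content last changed."""

    def __init__(self):
        self._seen = {}  # url -> (checksum, unix timestamp of last change)

    def last_changed(self, url, body, now=None):
        """Return the timestamp to publish for this fetch of `url`."""
        now = int(time.time()) if now is None else now
        checksum = hashlib.sha256(body.encode("utf-8")).hexdigest()
        previous = self._seen.get(url)
        if previous and previous[0] == checksum:
            return previous[1]            # unchanged: keep the old timestamp
        self._seen[url] = (checksum, now)  # new or changed: record this fetch
        return now


tracker = ChangeTracker()
first = tracker.last_changed("http://example.com/", "version one", now=100)
same = tracker.last_changed("http://example.com/", "version one", now=200)
changed = tracker.last_changed("http://example.com/", "version two", now=300)
print(first, same, changed)  # 100 100 300
```

Only the checksum and a timestamp are kept per URL, not the page itself, so the storage cost is a few dozen bytes per feed rather than a full copy of every page.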
Abhijit Shylanath - Nov. 13, 2007
Sorry, my e-mail ID was incorrect in the previous comment. Feel free to delete this one.
Thomas Baekdal - Nov. 13, 2007
Abhijit,
Yes, that would be a problem because there is no backend database behind WEB2RSS and thus nowhere to store a checksum.
WEB2RSS is, simply put, a converter. It will convert almost any page into an RSS format, and then forget that it ever happened.
This was done because the database needed to store any previous states would be very big. WEB2RSS currently converts 17 pages every single minute (on average).
Enkla Z - Nov. 20, 2007
Hi Thomas
I would like to convert this to rss:
http://knuff.se/url/enklabloggen.blogspot.com
so that I can make an rss-box/javascript from it ("Who links to me?"), but I don't seem to manage.....
Claes - Apr. 3, 2008
Hello Thomas,
I just stumbled upon your excellent service, which would do so much to make my job easier. However, it did not work on the sites I need... Every day I need to search specific daily journals for certain keywords ("asyl*", "flyktning*") to see whether there has been any news on those subjects. Unfortunately, I couldn't get your program to work on the sites I need the most. The journals in question are http://www.vg.no, http://www.aftonposten.no, http://www.nrk.no, http://www.dn.se, http://www.svd.se, http://www.svt.se, and some Danish journals.
Do you think it might be possible to 'fix' this issue? It would save me, and many others out there I guess, a lot of time!
Best,
Claes
Published: Jul. 4, 2006
in Products

Thomas Baekdal is a Writer, Interaction Designer, Change Advocate and Project Manager.
John S. Rhodes - Jul. 7, 2006
Thomas,
1. Your comments section is acting kind of fruity. The labels next to the fields flash then disappear. I've closed and opened my browser several times, refreshed the page several times, and more. Thought you'd want to know.
2. I created an RSS feed for the WEB2RSS page itself. I figured, if the page changes, I'd like to know about it. Does the URL below actually make sense and should this work this way?
http://www.baekdal.com/web2rss/rss.asp?url=http%3A//www.baekdal.com/web2rss/&m=&ex=&img=0&out=rss
3. You'll be happy to know that this functionality seems to work in one of my favorite RSS tools (Bloglines). Very curious to see if this works; love the idea!
4. How has the service worked so far? Success? Issues? Next steps?
5. If you're interested in a quick interview, definitely throw me an email. I'd like to dig into this a little bit.