Trend Edition

Introducing: WEB2RSS

RSS feeds are one of the things that I really like. They allow me to get the information I want, the way I want it. There is just one problem: not every site offers an RSS feed, and those that do are not always offering the right content.

I wanted to do something about that.

Say hello to WEB2RSS

WEB2RSS is a simple web application that can convert almost any normal website/page into an RSS feed. If you want to receive updates in your feed reader when your favorite site changes, this is the tool for you.

Technically speaking, WEB2RSS requests the page you want, strips it of all non-text elements and outputs the rest into an RSS file. What you get is the content of any page, without form elements, script blocks, iframes etc.
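
The stripping step described above can be sketched roughly as follows. This is only an illustration in Python using the standard library's `html.parser`; it is not the code behind WEB2RSS, and the list of skipped elements (`SKIPPED`) and the names `TextExtractor` and `extract_text` are assumptions made for the example.

```python
from html.parser import HTMLParser

# Elements whose content is dropped entirely. This list is an assumption,
# based on the "form elements, script blocks, iframes etc." mentioned above.
SKIPPED = {"script", "style", "iframe", "form", "select", "textarea", "button"}


class TextExtractor(HTMLParser):
    """Collects the text of a page, ignoring anything inside SKIPPED elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many skipped elements we are currently inside
        self.chunks = []  # text fragments that survive the filtering

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><body><h1>News</h1><script>var x = 1;</script>"
    '<form><input name="q"><button>Go</button></form>'
    "<p>Hello <b>world</b></p></body></html>"
)
text = extract_text(sample)
```

Here `text` holds only the heading and paragraph text; the script block and the form are gone.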

Some examples: If one of your friends' websites does not offer an RSS feed, now it does. If you are really into the art of "Camera tossing" you can now search for it on Flickr and subscribe to the search results. Or what about being kept up-to-date with the forum you visit frequently, or maybe just a specific topic? What about getting the latest updates to a seminar you are interested in - like the HCI class at Stanford University? Maybe you just want to get the daily comic strip from your favorite cartoonist, or the latest news from your favorite hockey team.

Or, what about getting Google search results as an RSS feed, or finding out when Google publishes something new on Google Labs? Or maybe you just want to know when your competitors publish new content.

You can do all of these things with WEB2RSS.

How to use it

Using WEB2RSS is pretty simple. Type in the URL you want to use, check the link and... Boom (as Steve Jobs would say) you get an RSS link you can subscribe to.

There are a number of more advanced features that you can use. The easiest one is "Include Images". By default, all images are removed from the feed to prevent layout images from cluttering up the results, a common problem on table-based sites. But you can force it to include images if you want.

The two other advanced settings are much harder to deal with. They are "Match" and "Exclude". With "Match" you keep only a specific part of a page. If you want to get only a specific image, DIV or table, this is what you use. "Exclude" works the same way, except it removes the element you define instead of keeping it.

The hard part of these two settings is that they work using something called "Regular Expressions". If you do not know what these are, my best advice is to not use them. Regular expressions are a very complex beast, but they make WEB2RSS very flexible and powerful.
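
To give a feel for what "Match" and "Exclude" do, here is a small sketch in Python. It only illustrates the idea: the regular expression dialect WEB2RSS actually uses may differ from Python's, the function name `apply_filters` and the sample page are made up for the example, and falling back to the whole page when nothing matches is an assumption.

```python
import re


def apply_filters(html, match=None, exclude=None):
    """Keep only what `match` finds, then remove whatever `exclude` finds."""
    flags = re.IGNORECASE | re.DOTALL
    if match:
        found = re.search(match, html, flags)
        # Assumption: if nothing matches, fall back to the whole page.
        if found:
            html = found.group(0)
    if exclude:
        html = re.sub(exclude, "", html, flags=flags)
    return html


# A made-up page with a menu, a news block and a footer.
page = (
    '<div id="menu">Home | About</div>'
    '<div id="news"><p>Match report</p><img src="ad.gif"></div>'
    '<div id="footer">Contact</div>'
)

# "Match" keeps the news DIV, "Exclude" then removes the image inside it.
news_only = apply_filters(
    page,
    match=r'<div id="news">.*?</div>',
    exclude=r"<img[^>]*>",
)
```

Note the non-greedy `.*?` in the match pattern: a plain `.*` would run on to the last `</div>` on the page and keep the footer as well.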

Are there any limitations?

Yes, of course. There are a number of things that can prevent a successful conversion, such as:

  1. Sites made entirely using Flash cannot be converted, for the simple reason that the underlying code does not contain any text. No text, nothing to convert.
  2. Sites with a lot of DHTML, AJAX etc. do not work either. In this case the content is often not a part of the main page.
  3. There is a bug in XMLHTTPRequest that prevents it from handling redirects. So if you are trying to convert a page that is redirected to another page, it fails. At the moment this is a problem out of my control.
  4. The last major problem is sites that are so badly coded that WEB2RSS cannot make heads or tails of them.

It does work very well with simple table-based sites, and - of course - all the sites that are designed using XHTML and CSS.

Why Beta?

WEB2RSS is released as a beta primarily because of worries about scalability. I am not sure how the server will react to heavy use of it. I need some real-life usage to know more.

Try it

... and let me know what you think. General comments can be posted here, on this page. Use this page for bugs and feature requests.

Comments

1

John S. Rhodes - Jul. 7, 2006

Thomas,

1. Your comments section is acting kind of fruity. The labels next to the fields flash then disappear. I've closed and opened my browser several times, refreshed the page several times, and more. Thought you'd want to know.

2. I created an RSS feed for the WEB2RSS page itself. I figured, if the page changes, I'd like to know about it. Does the URL below actually make sense and should this work this way?

http://www.baekdal.com/web2rss/rss.asp?url=http%3A//www.baekdal.com/web2rss/&m=&ex=&img=0&out=rss

3. You'll be happy to know that this functionality seems to work in one of my favorite RSS tools (Bloglines). Very curious to see if this works; love the idea!

4. How has the service worked so far? Success? Issues? Next steps?

5. If you're interested in a quick interview, definitely throw me an email. I'd like to dig into this a little bit.

2

Thomas Baekdal - Jul. 7, 2006

John, Thanks for stopping by.

1: Strange (about the comments). Could you tell me which browser/OS you are using? I cannot reproduce the problem :(

2: The URL, I admit, is a bit strange - but this is how it is supposed to look (at least for now). The problem is that to make a more "friendly URL" - like "web2rss.baekdal.com/this-or-that/" - I need to save the original URL and all the settings in a database. That is something I want to do, but it increases the complexity of the application.

3: Bloglines... well, it should work (needs more testing to be sure).

4: Hard to tell. The statistics have just come in. About 400 people have tried it so far, which isn't really that many. I cannot see whether they have subscribed to a feed, since I do not record the actual subscription.

There have been a number of problems. I tested it with about 100 different sites before it was launched, but the number of badly coded sites is staggering.

These range from small problems, like handling incorrect body tags (one site had 4 open/close body tags... and WEB2RSS got confused), to bigger problems, like handling codepage errors (sites sending out UTF-8 content that included non-UTF characters = crash).
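
Because all the settings travel in the query string, a feed URL like the one John posted can be put together by hand. The sketch below is in Python, not whatever the WEB2RSS form itself uses; the parameter names (`url`, `m`, `ex`, `img`, `out`) are taken from the URLs in this thread, while `img=1` for "Include Images" and the function name `build_feed_url` are assumptions.

```python
from urllib.parse import quote, urlencode

BASE = "http://www.baekdal.com/web2rss/rss.asp"


def build_feed_url(page_url, match="", exclude="", include_images=False):
    """Encode the page URL and the advanced settings into a feed URL."""
    params = [
        ("url", page_url),
        ("m", match),      # "Match" regular expression (may be empty)
        ("ex", exclude),   # "Exclude" regular expression (may be empty)
        # Only img=0 appears in this thread; img=1 is an assumption.
        ("img", "1" if include_images else "0"),
        ("out", "rss"),
    ]
    # safe="/" leaves slashes readable, as in the URLs quoted here.
    return BASE + "?" + urlencode(params, safe="/", quote_via=quote)


feed_url = build_feed_url("http://www.baekdal.com/web2rss/")
```

With the default settings, `feed_url` comes out identical to the URL in John's comment.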

I have added several extra filters since the launch - to remove garbage code or unwanted tag attributes.

5: Check your inbox in 10 minutes

3

Johan - Aug. 16, 2006

Hello,

just wanted to thank you for this great concept.

It works very nicely on the kmi.be weather service page :p Although I had to do some fiddling around with the regular expressions part.

regards,

Johan

4

krammer - Sep. 1, 2006

Hi. I'm very interested in this project. I want to know whether you developed it by manually analysing websites (hard-coding the tags you consider), or whether you developed a neural network that learns from a database.

5

Thomas Baekdal - Sep. 2, 2006

Krammer,

I am not sure I understand your first question.

As for the second one, it does not use a database at the moment (it might do so in version 2). It simply parses the website into RSS "live".

6

syed - Oct. 17, 2006

Hello,

Thank you for this excellent tool. Is there a way to adjust table width in the retrieved feed? For some reason, the rendered table in Sage for Firefox is far too small. I tried using the advanced settings in web2rss to modify it, but I guess I was using the wrong regular expressions. The URL I've been trying to use is

http://www.cancer.gov/ncicancerbulletin/

Thanks again.

Syed.

7

Thomas Baekdal - Oct. 18, 2006

Syed, WEB2RSS strips size and style information - regardless of the settings.

This means that any table will be displayed at its default size, defined by the program you view the RSS in.

PS: I tried to test your example, and it looks fine in my RSS readers (I do not have Sage).

8

John Rhodes - Nov. 13, 2006

Thomas,

It's been a few months now. How long until this puppy is out of beta? How long until I can license it! ;-)

9

Thomas Baekdal - Nov. 13, 2006

Hi John,

I must admit that I have not spent much time on this since it was launched (too many things, too many ideas).

I can very quickly release it as final - the main reason for it being in beta was bandwidth concerns. As it turned out, that was not a problem after all.

Do you have something specific in mind that you need it to do?

BTW: you can license it anytime :o))

10

Thomas Baekdal - Nov. 13, 2006

It is out of Beta :o)

11

James Mead - Feb. 1, 2007

This is a great service. However, the only item in the feed I created seems to always have a pubDate of 2006-01-01.

01 Jan 2006 00:00:00 GMT

This means my reader doesn't realise the web page has changed.

For reference the url is http://www.baekdal.com/web2rss/rss.asp?url=http%3A//www.gner.co.uk&m=%3Cdiv%3EUp%20to%20and%20including%26nbsp%3B%3C/div%3E%3Cdiv%3E%28.*%29%3C/div%3E&ex=&img=0&out=rss

Thanks again for a great service.

12

Thomas Baekdal - Feb. 1, 2007

Hi James,

I will look into it.

13

James Mead - Feb. 2, 2007

Thanks. In case you hadn't noticed I'm using a regex in my url. Don't know if that makes a difference.

14

Thomas Baekdal - Feb. 8, 2007

James, I have a problem. I cannot determine the exact date when a website changed, and as such I cannot supply it in the RSS.

The problem is that the only way I know to read the date of a page is to look at the modified date in the response headers. Unfortunately, most sites I have tested return the time and date the page was retrieved - not the actual modification date.

I have done a few tests, and it seems to me that if I change the RSS feed to mimic the modified date provided by the header, the RSS would seem to have been updated all the time (even if none of the content has changed). This will of course make the problem worse.

It would seem that although the current solution isn't perfect, it is at least the best one.
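
The header problem described above can be illustrated with a small check: if a site's `Last-Modified` header is (nearly) the same as its `Date` header, it is probably just reporting the retrieval time. This is a hedged Python sketch, not something WEB2RSS does; the function name, the five-second tolerance and the plain-dict `headers` argument are all assumptions.

```python
from email.utils import parsedate_to_datetime


def trustworthy_last_modified(headers, tolerance_seconds=5):
    """Return the Last-Modified time, or None if it looks like the fetch time.

    `headers` is a plain dict of response headers (exact-case keys assumed).
    """
    raw_modified = headers.get("Last-Modified")
    raw_date = headers.get("Date")
    if raw_modified is None:
        return None
    modified = parsedate_to_datetime(raw_modified)
    if raw_date is not None:
        served = parsedate_to_datetime(raw_date)
        # A modified date equal to the serving time tells us nothing.
        if abs((served - modified).total_seconds()) <= tolerance_seconds:
            return None
    return modified


# A site that stamps every response with "now".
useless = trustworthy_last_modified({
    "Date": "Thu, 08 Feb 2007 10:00:00 GMT",
    "Last-Modified": "Thu, 08 Feb 2007 10:00:00 GMT",
})

# A site that reports a genuine modification date.
useful = trustworthy_last_modified({
    "Date": "Thu, 08 Feb 2007 10:00:00 GMT",
    "Last-Modified": "Mon, 01 Jan 2007 09:30:00 GMT",
})
```

Even with such a check, sites of the first kind still leave the feed without a usable date, which is the situation described above.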

15

Anonymous - Feb. 9, 2007

Hai

Nice to know new information through this effort

16

Dhanapal Andi - Feb. 9, 2007

Hi

Really, this is very useful for our project

17

James Mead - Feb. 9, 2007

Thanks for looking into it, Thomas. It's a shame websites don't use the modified date response header correctly.

Can you explain how the current method works?

18

Thomas Baekdal - Feb. 9, 2007

Hi James,

WEB2RSS is an amazingly simple application. All it does is this:

1: Fetch the link

2: Filter the content based on your preferences (optional)

3: Remove any styling and any non-RSS compatible content

Now we have the clean data for the RSS feed (in between all of this, there are a number of validation functions to ensure that the content can be transformed to RSS)

4: Create an RSS page

5: Set the title to be the same as the page's title

6: Set the description to be the same as the meta tag's description

7: Add the clean data to the content area of the RSS page

8: Add a fixed date (January 1, 2006 - 01:00:00 AM)

Now we have the final RSS page

9: Return a "Subscribe to..." link that people can subscribe to.

All of this happens every single time you or your feed reader requests a page.
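
Steps 4 to 8 above could look roughly like this in Python. It is a sketch only: the real application runs as `rss.asp`, the function name `build_rss` is hypothetical, the single-item RSS 2.0 layout is an assumption, and the fixed date is written the way James's reader reported it (00:00:00 GMT).

```python
import xml.etree.ElementTree as ET

# The fixed date from step 8, in the form James saw it in his reader.
FIXED_PUB_DATE = "Sun, 01 Jan 2006 00:00:00 GMT"


def build_rss(page_url, title, description, clean_content):
    """Wrap already-cleaned page content in a one-item RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title              # step 5
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = description  # step 6

    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = page_url
    ET.SubElement(item, "description").text = clean_content   # step 7
    ET.SubElement(item, "pubDate").text = FIXED_PUB_DATE      # step 8
    return ET.tostring(rss, encoding="unicode")


feed_xml = build_rss(
    "http://www.example.com/",
    "Example page",
    "Description taken from the meta tag",
    "<p>Cleaned page content</p>",
)
```

Since nothing is stored between requests, the same function runs again on every fetch, and the `pubDate` never changes.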

The feed reader will do the following:

1: The first time it simply displays the RSS feed (like any other feed)

2: The second time it will either do nothing (if the feed hasn't changed), or display it as either updated or as a new feed item if some of the content has changed (this varies greatly from one feed reader to another). And, as you have found, some feed readers do not do anything because the date (which is fixed) is the same.

To fix the problem I would need to create a much more complicated system. I would need to store every single change in a database, with the exact date that it was fetched. I would also need to compare each request with any previous ones in the database, to see if something new has been added (and then create a new database entry).

There are two problems with this:

1: The complexity is more than I have time for.

2: The data storage needs would be enormous, as I would need to store every single page that goes through WEB2RSS. Today it is simply a live converter - WEB2RSS does not store or record any information. It fetches the page, converts it, and outputs it as RSS - then it forgets all about it until you (or your feed reader) make another request.

19

francesco - May. 17, 2007

ciao!

I'm struggling with some badly coded sites in order to retrieve some content.

I do agree with you that regex is not for faint-hearted people: my heart works fine... my brain a bit less!! :-)

Could you post a couple of examples - just to get a look at the syntax?

grazie!

20

Daniel Aleksandersen - May. 17, 2007

Too bad it does not work with Michael Shanks' blog over at TVguide.com. :(

21

Thomas Baekdal - May. 18, 2007

Daniel, I just tried it. I have no problems with it.

BTW: I am curious though. Why? WEB2RSS was created to turn sites without RSS support into an RSS feed. Michael's blog already has an RSS feed:

feed://community.tvguide.com/rssthreads.jspa?forumID=800048552

Francesco. RegEx examples are very site specific. Could you give me a site that you have problems with?

22

David - Jul. 27, 2007

(rss generating newbie alert - using google reader)

Seems to work ok for blogger.com sites but I can't figure out the required advanced settings incantation for the following discussion forum:

http://www.maxx.co.nz/forum/index.cfm/fuseaction/listings/CFB/1/forum/2

As you can see this df is a table of threads with no preview information. The content of the most recent item is only visible once the link is clicked and the thread is displayed.

23

Ana - Aug. 24, 2007

This is very similar to Feed43.

http://feed43.com/

Except you don't need to add extra tags to specify what's date, title and content body....

24

Abhijit Shylanath - Nov. 13, 2007

Hi, Thomas. Great app. I'd been thinking of making one like this, but you beat me to it. :)

Regarding modified-date, will it be too much of a problem to do a quick checksum of the contents/body of each page w2r reads, and store it along with the complete URL (normalized), and UNIX timestamp? Next time it reads the page, if the checksum doesn't match, it can return a time ahead of the previous timestamp, and update the database accordingly.
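
A minimal sketch of the checksum idea, in Python, might look like this. It is an illustration of the proposal only: the in-memory dictionary stands in for the database, URL normalization is left out, and the class name `ChangeTracker` and the choice of SHA-256 are assumptions.

```python
import hashlib
import time


class ChangeTracker:
    """Remembers one checksum and one timestamp per URL."""

    def __init__(self):
        # url -> (checksum, timestamp); a stand-in for a database table.
        self._seen = {}

    def last_changed(self, url, body, now=None):
        """Return the UNIX time at which `body` was first seen for `url`."""
        now = time.time() if now is None else now
        checksum = hashlib.sha256(body.encode("utf-8")).hexdigest()
        previous = self._seen.get(url)
        if previous is not None and previous[0] == checksum:
            # Content unchanged: keep reporting the earlier timestamp.
            return previous[1]
        # New URL or changed content: record the new state.
        self._seen[url] = (checksum, now)
        return now


tracker = ChangeTracker()
first = tracker.last_changed("http://www.example.com/", "old content", now=100)
same = tracker.last_changed("http://www.example.com/", "old content", now=200)
changed = tracker.last_changed("http://www.example.com/", "new content", now=300)
```

The returned timestamp only moves forward when the content actually changes, so it could be used as the feed's date. It does, however, require somewhere to keep the checksums between requests.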

25

Abhijit Shylanath - Nov. 13, 2007

Sorry, my e-mail ID was incorrect in the previous comment. Feel free to delete this one.

26

Thomas Baekdal - Nov. 13, 2007

Abhijit,

Yes, that would be a problem because there is no backend database behind WEB2RSS and thus nowhere to store a checksum.

Simply put, WEB2RSS is a converter. It will convert almost any page into an RSS format, and then forget that it ever happened.

This was done because the database needed to store any previous states would be very big. WEB2RSS currently converts 17 pages every single minute (on average).

27

Enkla Z - Nov. 20, 2007

Hi Thomas

I would like to convert this to rss:

http://knuff.se/url/enklabloggen.blogspot.com

so that I can make an rss-box/javascript from it ("Who links to me?"), but I don't seem to manage...

28

Claes - Apr. 3, 2008

Hello Thomas,

I just stumbled upon your excellent service, which would do so much to make my job easier. However, it did not work on the sites I need... Every day I need to search specific daily journals for certain keywords ("asyl*", "flyktning*") to see whether there has been any news on those subjects. Unfortunately, I couldn't get your program to work on the sites I need the most. The journals in question are http://www.vg.no, http://www.aftonposten.no, http://www.nrk.no, http://www.dn.se, http://www.svd.se, http://www.svt.se, and some Danish journals.

Do you think it might be possible to 'fix' this issue? It would save me, and many others out there I guess, a lot of time!

Best,

Claes

 

Published: Jul. 4, 2006
in Products

Thomas Baekdal

Thomas Baekdal is a Writer, Interaction Designer, Change Advocate and Project Manager.
