Tuesday, April 28, 2009

The FDA has an Area 51 on its website??

Here's something you won't see every day...it's the FDA's robots.txt file.

For the uninitiated, robots.txt is a small file placed on a website to tell indexing robots (Googlebot, Yahoo! Slurp, etc.) which pages they may crawl. It basically says "Hey Googlebot, you can index these pages, but stay away from those over there."

Here's the FDA's robots.txt file in its entirety. I've colored the parts that intrigue me.
#robots.txt file for http://www.fda.gov

#Added for Bristol-Myers on Sept 2005
User-agent: vspider
Disallow: /

#For all other crawlers
User-agent: *
Disallow: /scripts/
Disallow: /data/
Disallow: /binn/
Disallow: /cder/test/
Disallow: /opacom/area51/
Disallow: /oashi/aids/listserv/
Disallow: /cdrh/ftparea/cdrh/MDR/coll/mdr/mdrcoll/
Disallow: /foi/warning_letters/d1371b.pdf
Disallow: /foi/warning_letters/archive/
Hit-rate: 30 # wait 30 seconds before starting a new URL request default=30
Visiting-hours: 23:00EDT-05:00EDT #index this site between 11PM - 5AM EDT
Concurrent-hits: 2 # limit concurrent active URLS to 2 for each index server
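
If you want to check how a rule-following crawler would read this file, here's a minimal sketch using Python's standard urllib.robotparser module. It assumes the vspider record sits on two lines (User-agent, then Disallow), which is what the format expects. The helper names build_parser and is_allowed are my own, and the test paths other than the ones listed in the file are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the FDA file quoted above. Assumption: the vspider
# record is written on two lines, as the robots.txt format expects.
FDA_ROBOTS_TXT = """\
#robots.txt file for http://www.fda.gov

#Added for Bristol-Myers on Sept 2005
User-agent: vspider
Disallow: /

#For all other crawlers
User-agent: *
Disallow: /scripts/
Disallow: /data/
Disallow: /binn/
Disallow: /cder/test/
Disallow: /opacom/area51/
Disallow: /oashi/aids/listserv/
Disallow: /cdrh/ftparea/cdrh/MDR/coll/mdr/mdrcoll/
Disallow: /foi/warning_letters/d1371b.pdf
Disallow: /foi/warning_letters/archive/
Hit-rate: 30 # wait 30 seconds before starting a new URL request default=30
Visiting-hours: 23:00EDT-05:00EDT #index this site between 11PM - 5AM EDT
Concurrent-hits: 2 # limit concurrent active URLS to 2 for each index server
"""


def build_parser(text):
    """Parse robots.txt text from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser, agent, path):
    """Return True if `agent` may fetch `path` on www.fda.gov."""
    return parser.can_fetch(agent, "http://www.fda.gov" + path)


parser = build_parser(FDA_ROBOTS_TXT)

# Hypothetical checks: the agents and the /cder/index.html path are examples.
checks = [
    ("vspider", "/"),
    ("Googlebot", "/opacom/area51/"),
    ("Googlebot", "/foi/warning_letters/d1371b.pdf"),
    ("Googlebot", "/cder/index.html"),
]
for agent, path in checks:
    print(agent, path, is_allowed(parser, agent, path))
```

Python's parser only acts on the directives it recognizes, so the Hit-rate, Visiting-hours, and Concurrent-hits lines are skipped here. And of course robots.txt is only a request: it tells polite crawlers what to skip, but it doesn't actually lock anything down.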

1. What's the deal with Bristol-Myers' request to ban vspider? And why did the FDA comply with the request? From what I can tell, vspider is a personal indexing robot that can be used by anyone to index a site. Curious in CT.

2. What's going on in area51 and why can't it be indexed? I tried to look at the contents and got a "denied" error...so perhaps it holds the medical records for the little green men in Nevada.

3. Why block indexing of one specific warning letter (d1371b.pdf)? If you try to go to fda.gov/foi/warning_letters/d1371b.pdf you get a 404 (not found) error, but I have a copy from my own search engine. It's a pretty vanilla warning letter from 1998 sent to Trinity Chemical Corporation. Again, I'd love to hear the rationale behind this decision.

4. Why block indexing of the archived warning letters?