Warning: extremely long post (2,838 words)!
Well, a lot has been happening in the “World of the Open Directory Project” (a.k.a. DMoz) in the last couple of weeks.
First of all, because of server load issues the internal editor forums have been moved to a new server (yippee!) that authenticates against the main server to ensure only valid users can log in. Good idea, but it’s had a few teething troubles (boo!).
Secondly, to help reduce the load on the main part of the ODP, editors have been given a “special” port number on which to connect for editing (hopefully reducing some of the overloading issues on the Apache webserver) – all good. Except if you are behind a corporate firewall that blocks that port number 🙁
Thirdly, the “mirror server” at http://ch.dmoz.org/ (which is hosted by a fellow editor in Zurich, Switzerland) seems to have “taken off” a bit and is being used by a larger number of people now (mainly as it’s a lot faster) – it’s transferring around 19GB of data a month.
Fourthly, the ODP staff members have managed to produce “a” copy of the RDF dump. The RDF dump is one very large file which contains the URLs, titles and descriptions of all of the (nearly) 4 million sites listed in the ODP. Due to a number of technical issues, this dump hadn’t been correctly produced since September last year. The dump is usually downloaded by organisations such as Google to produce localised copies of the ODP (for instance the “PR enhanced” listings at the Google Directory). ODP staff have (this week in fact) managed to produce an RDF dump which is available via rdf.dmoz.org – there’s only one slight problem: it doesn’t contain “catid”s (the unique category identifier numbers), because those numbers got “clobbered” during the technical problems and ODP staff are having to correct the database by hand. Hopefully they’ll be fixed soon – but at least the ODP search has now been updated (since that uses the RDF dump), and there is an RDF dump for others to download and play with (which I’m intending to do this weekend).
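Since I’m planning to play with the dump this weekend anyway, here’s the sort of thing I have in mind: a minimal Python sketch that streams through the RDF file and pulls out the URL, title and description of each listing. The element and attribute names (ExternalPage, Title, Description, the “about” attribute) and the filename content.rdf.u8 are my assumptions about how the dump is laid out, and the real file is huge and not always perfectly clean XML – so treat this as a starting point rather than gospel.

```python
import xml.etree.ElementTree as ET

def list_sites(path):
    """Stream the dump, yielding (url, title, description) for each listing."""
    url = title = desc = None
    for event, elem in ET.iterparse(path, events=("start", "end")):
        tag = elem.tag.rsplit("}", 1)[-1]   # strip any XML namespace prefix
        if event == "start" and tag == "ExternalPage":
            # the listing's URL is (assumed to be) carried in an ...about attribute
            url = next((v for k, v in elem.attrib.items() if k.endswith("about")), None)
        elif event == "end":
            if tag == "Title":
                title = elem.text
            elif tag == "Description":
                desc = elem.text
            elif tag == "ExternalPage":
                yield url, title, desc
                elem.clear()                # keep memory usage flat on a huge file

if __name__ == "__main__":
    for url, title, desc in list_sites("content.rdf.u8"):
        print(url, "-", title)
```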
Fifthly, one of the founding members of the ODP (Rich Skrenta) has recently given a talk about the early years of the ODP to “The Internet Developer Group”. The talk is available for viewing in a series of large JPEG files in an online presentation – but for speed of access, I’ve transcribed them here:
-
Genesis Of The Open Directory Project
Rich Skrenta, January 21, 2003
-
March 1998
- Work project was winding down
- Going up and down Sand Hill Road trying to get a web-calendar startup funded
- Read Danny Sullivan’s report on Yahoo’s listing problems on Search Engine Watch
- (image of Danny Sullivan’s Search Engine Watch’s “Yahoo Special Report” from http://www.searchenginewatch.com/sereport/97/09-yahoo.html)
- (image of Wired News’s “Does Yahoo Still Yahoo?” article from http://www.wired.com/news/print/0,1294,10236,00.html)
-
Idea for GnuHoo
- Yahoo seemed to be ignoring their core asset – the directory
- How could we build a competitor?
- Didn’t want to pay an editorial staff – even a cheap one
- Tequila + Brainstorming = GnuHoo
-
Idea for GnuHoo
- Use volunteer editors to build a web directory like Yahoo’s
- Volunteers would do a better job than paid generalists, since they would be experts about their area & have a personal interest
- Restrict editors to sub-branches of the directory, to limit the harm they could do
-
Original Goals
- Thought if we could reach 1,000 editors the directory would be successful
- Bootstrap problem was key – how to get the first 10,000 sites. The directory had to look “real” from Day 1
- Figured we needed 1M sites for a competitive directory
- Original get-off-the-couch motivational goal: We told ourselves that if we could get a story in Wired out of the effort, it would be worth doing
-
“Seed” Problem
- Needed a hierarchy & 10,000 sites to launch the directory
- Briefly considered Dewey Decimal
- good thing we didn’t, it’s not free
- didn’t seem to fit the web
- Original GnuHoo hierarchy mirrored Usenet
- (shows how various USENET groups mapped to the relevant GnuHoo categories)
- (image of the “Original Homepage Mock-Up”)
-
Category Bootstrapping
- Scanned URLs mentioned in newsgroups to find seed sites for the corresponding directory category
- This yielded something that looked pretty good at a casual glance
- …but a lot of the original seed URLs were bad sites or placed in the wrong category
- The first editor in a category simply had to delete or move the bad entries, which left behind a good category
-
Coding & Launch
- Coded from April-June, 1998
- Perl cgi and flat files
- Simple HTML forms to add/edit/delete websites in the directory
- Web pages served from static HTML files in a directory tree
- HTML files regenerated whenever an edit was made
-
Simple Flat File Format
u: http://www.newhoo.com/
t: NewHoo!
d: The largest human-edited directory of the web
c: Computers/Internet/Web_Directories
-
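(A quick aside from me, not part of Rich’s slides: the flat-file record format above is simple enough to read with a few lines of code. Here’s a rough Python sketch – it assumes, and this is my guess rather than anything stated in the talk, that blank lines separate one record from the next, and the filename category.flat is made up for the example.)

```python
def read_records(path):
    """Yield one dict per u:/t:/d:/c: record."""
    record = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:                 # blank line: assume the record is finished
                if record:
                    yield record
                    record = {}
                continue
            key, _, value = line.partition(": ")
            record[key] = value
        if record:                       # last record, if the file doesn't end with a blank line
            yield record

# e.g. print every listed URL and the category it lives in
for rec in read_records("category.flat"):
    print(rec.get("u"), "->", rec.get("c"))
```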
Minimalist Design
- Minimal locking, last-writer-wins semantics
- flock() only used for category counts
- Write-with-append, rename() only safe operations
- No big database
- A few DBM files for minor stuff
-
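(Another aside from me: the “write-with-append, rename() only safe operations” point is the classic write-a-temp-file-then-rename trick – readers always see either the complete old file or the complete new one, never a half-written page. A rough Python sketch of the idea, and definitely not the actual ODP code, which was Perl:)

```python
import os
import tempfile

def rewrite_page(path, lines):
    """Regenerate a static page (or category file), then rename it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)   # temp file on the same filesystem
    with os.fdopen(fd, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    os.rename(tmp_path, path)   # atomic on POSIX: readers get the old file or the new one
```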
Coding & Launch
- Used publicly-available software for keyword search of the directory; originally Glimpse, later Isearch
- First ran on BSDI, later moved to Linux
- filesystem progression: ufs, ext2, vxfs
- Launched June 5, 1998
- Acquired by Netscape in October, 1998
- (image of the original NewHoo homepage)
- (image of the Wired News “The Distributed Yahoo: ‘NewHoo'” news article from http://www.wired.com/news/print/0,1294,13625,00.html)
-
Early Press was Key to Growth
- About 1% of the visitors to NewHoo applied to become editors
- Some fraction of those would be accepted
- The more traffic we got, the more editors we would get
- We grubbed around for any hits we could in the beginning
- Initial Slashdot, Netly, Wired, Red Herring stories were vital traffic sources
- No matter what the story said, “Just spell our URL right”
- (image of the “About the Open Directory Project” page from http://ch.dmoz.org/about.html)
-
Social Design of NewHoo
- Not a free-for-all links page – every editor had to apply & be approved
- Every edit logged and possible to undo
- Hierarchy of editors, with senior ones keeping an eye on the new ones
- Emergent editing guidelines, enforced with peer review
-
Why Did You Apply to be a NewHoo Editor?
“There is a link to my old warwick uni account that has been dead for two years. As editor I could change it.”
-
Why Did You Apply to be a NewHoo Editor?
“I’m already building Linux indexes and sites, better to have them all nicely integrated in computers/software/linux”
-
Why Did You Apply to be a NewHoo Editor?
“We already maintain a site called CoinLink which lists over 800 coin related sites. We know the coin industry and could easily assist in building and maintaining this section of the index.”
-
Why Did You Apply to be a NewHoo Editor?
“You have no category in Recreation/Collecting that focuses on Christmas ornament collecting. Ornament collecting is one of the fastest growing hobbies. I’ve collected ornaments for 25 years and feel I know many of the “best” web sites dealing with this subject.”
-
Motivations to Edit
- Same urge that makes you straighten a crooked picture you see on the wall
- People were maintaining link lists on their own manually; they could do so more easily with NewHoo’s web forms
- Didn’t need to see the whole directory finished to have their category be useful
- …but knowing they were helping to build the pyramid was a warm fuzzy
-
Directory Editing is Amenable to Incremental Effort
- First editor finds a good site and adds it
- Second fixes a typo in the description
- Third editor moves it to a more appropriate category
- Fourth editor later notices the site moved and fixes the URL
- Not as hard as writing device drivers; many can help
- If you ask too much, results fall off quickly
-
The Free Use License
- Netscape offered the data from the ODP under a free-use license
- Directory data was adopted by Lycos, AltaVista, Google and other search engines
- Only requirement was that the Add URL link point back to dmoz.org
- helped keep dmoz authoritative & prevent forks
-
GnuHoo -> NewHoo -> ODP
- FSF objected to the “Gnu”
- Yahoo objected to the “Hoo”
- Netscape renamed it to the Open Directory Project and hosted it on directory.mozilla.org
- directory.mozilla.org was too long to type, so we shortened it to dmoz.org
-
Robozilla
- Lloyd Tabb wrote a crawler to visit every site in the ODP to see if it was 404/301/302
- Didn’t take action on its own, but alerted editors to potentially bad or moved sites
- Brought bad sites in the ODP down to 0.25%
- Our crawl of Yahoo showed 8% bad links
-
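(One more aside from me: Robozilla is essentially a link-checker – fetch each listed URL, record the HTTP status, and flag the 404s and 301/302 redirects for a human editor to look at. Below is a toy Python sketch of that idea; it’s mine, not Lloyd Tabb’s code, and the example URLs are just placeholders.)

```python
import http.client
from urllib.parse import urlsplit

def status_of(url):
    """Return the raw HTTP status for url, or None if the request fails outright."""
    parts = urlsplit(url)
    conn_class = http.client.HTTPSConnection if parts.scheme == "https" else http.client.HTTPConnection
    conn = conn_class(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().status   # 200, 301, 302, 404, ... exactly as the server sent it
    except (OSError, http.client.HTTPException):
        return None                        # DNS failure, refused connection, timeout, garbage reply
    finally:
        conn.close()

for url in ("http://dmoz.org/", "http://example.com/no-such-page"):
    code = status_of(url)
    if code is None or code in (301, 302) or code >= 400:
        print("flag for an editor to check:", url, code)
```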
“That’s a Problem We Want to Have”
- Design decisions were made in the interest of expediency. Why invest more time in the infrastructure if the site never takes off?
- Still running much of the 1.0 code today, over 4 years later
- Zillions of flat files in a gigantic VXFS filesystem
- Were we wrong? No, I don’t think so
-
The ODP Won
- 55,000 total editors, probably 10,000 active
- 3.4M sites, 460K categories
- Largest human-created taxonomy ever
- Several times larger than competitors
- Cited in 83 academic research papers
(source: citeseer.nj.nec.com)
-
The ODP “Won”
…but directories no longer scale to the web for users:
- small web: use a directory
- big web: use keywords
Everyone uses Google :)
-
“Lost Ark” Ending?
- The traffic & validation provided by Netscape was key to the ODP’s success
- Possible future: lost server in an ops farm
- What new idea can take the ODP to the next level?
And, to round off this super-long post: I’ve been flamed! Yep, the following was posted over at Resource-Zone by user “odpobserver” on 30/01/03 at 08:16 PM:
>>we do not care if Google lists your site because we do or decided to exclude it. All we care about is the content on http://ch.dmoz.org/ .
Beebware,
Let me get this straight. You are an SEO/”unpaid ODP editor” and you dont care that Google uses the ODP. I would suggest that you do care, because that is how you would be successful in fulfilling your services of SEO consultant. Getting your clients in prime ODP categories would greatly enhance your performance in Google!
Just what services do you provide? When a client is unable to achieve ODP listing, do you negotiate behind closed doors with other “unpaid” ODP editors to obtain ODP inclusion? Do you consult the internal ODP forums unavailable to the submitter to obtain information that you provide the submitter for a fee?
Beebware this is a good deal for you, isn’t it. Unpaid volunteer, right!
Your credibility is in severe question when you have such a blatant conflict of interest!
It is not only important to avoid a conflict of interest, it is important to avoid the appearance of a conflict of interest. Like so many “unpaid” ODP editors/SEO consultants, you have avoided neither!
My response to this was basically to reiterate what I said in an earlier blog entry, but in a terse and “pointed” manner (mainly to ensure it was understood) – posted on 30/01/03 at 08:57 PM:
First of all, I’ve been an ODP editor for over 3 years but an SEO for less than a month.
Secondly, my editor logs are open to all editors to view if there are ANY allegations of abuse against me.
Thirdly, I have declared (again, viewable by all editors) any and ALL sites I have ANY connection with (past and present employers, sites I’ve designed, sites I’ve promoted, etc.) so as to be completely open about those connections.
Fourthly, the company that I am employed by does NOT guarantee listings in the ODP and, as I made clear at the interview, I will not compromise my editor position by placing our clients in the ODP. I actually thought I wouldn’t get the job because I wouldn’t compromise my position as an editor – but my new boss was totally understanding and hasn’t once even hinted at anything like that. He did have an enquiry today about why a certain site isn’t listed in the ODP, and I told him that it was likely because it was extremely similar to another site also owned by us. I have not and WILL NOT compromise my editor position. If you have ANY evidence at all that I have, please feel free to report it to a meta editor and they will remove me from the ODP.
Fifthly, a good SEO company can get high rankings in Google WITHOUT an ODP listing. Yes, most people think it is “essential” to be listed in the ODP to get a high “PR” value – but the crux of the matter is, there’s a lot more to it than that. My employer owns and operates around half a dozen sites – only one of those is listed in the ODP (I have not edited _any_ of them except to add a note to indicate that I am affiliated/connected with them) – but all of the sites appear on the first page of results on Google for their targeted key phrases.
Yes, you could argue there _could be_ a conflict of interest: but only if I let there be one. I have, in fact, actually REMOVED one of our clients’ sites from the ODP (as it was a doorway site and should never have been listed in the first place) – I aim (as all editors should) to treat all sites equally: if you feel that I haven’t, then (again) please report it to a meta editor and they will remove my editing rights in the ODP.
Next time you start throwing allegations about, please ensure that you have at least a minimal amount of proof to back them up…
However, the forum moderators decided that the posts were not appropriate to the forum, and so they deleted odpobserver’s post and my reply – all in line with the forum guidelines, which list as off-topic: “Complaints about specific people working or volunteering their time at ODP”, “Discussion of the ways in which ODP runs itself”, and “Discussion of how to use the ODP to optimize search engine rankings and site promotion. This ODP is not a search engine, and we don’t rank or optimize web sites.”
One Comment
That person has been a regular flamer. You’d think they’d have something better to do, wouldn’t you?