MetaChat

17 May 2006

OMG SERIOUS QUESTION! Statistics help needed.
I am writing a Java utility to help us assess the results of a test we are performing and need help figuring out the best way to do so. I'll try to simplify this as much as I can.

We are evaluating several enterprise search/tagging engines. The engines are supposed to be able to process documents and, given a specific "vocabulary" (or taxonomy), tag the documents based on their content. We have subject matter experts going through a number of the documents and hand-tagging them. We have to compare the results of the engines to the manual results of the SMEs.

This would be easy (and, in fact, already done) if it were as simple as making sure the tags assigned by the engines matched the tags assigned by the SMEs. But the SMEs are prioritizing the tags, and they also want to assess how accurate the search engines' relevance measures are.

I'll give an example. A document has, say, ten tags assigned to it, in a particular relevance order. I have to take the set of tags (which can be of any size) returned by the search engine and give some measure (a percentage?) of its accuracy. So I need to compare two ordered sets of arbitrary size. And I don't even know where to begin.

The SME may have assigned tags {1, 2, 3, 4, 5} to the document. So, um, if the search engine returns {1, 2, 3, 4, 5}, that's 100%, I guess. But what about {1, 2, 4, 3, 5}? Is that more or less accurate than {1, 2, 5, 3, 4}? I guess less, but how do I come up with a number to represent that? How about {1, 2, 3, 4, 5, 6} or {1, 2, 3, 4} or {2, 1, 3, 4, 5}? You see what I'm getting at here.
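For concreteness, here is a sketch of one possible measure, in Java since that's what the utility is written in. It multiplies set overlap (Jaccard) by the fraction of shared-tag pairs that keep the same relative order in both lists (a Kendall-tau-style count). The class name and the multiplicative combination are illustrative assumptions, not anything established in the thread:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TagRankScorer {

    /** Returns a score in [0, 1]; 1.0 means identical tags in identical order. */
    public static double score(List<String> reference, List<String> candidate) {
        // Rank (position) of each tag in the candidate list.
        Map<String, Integer> candRank = new HashMap<>();
        for (int i = 0; i < candidate.size(); i++) candRank.put(candidate.get(i), i);

        // Tags present in both lists, kept in reference order.
        List<String> shared = new ArrayList<>();
        for (String tag : reference) {
            if (candRank.containsKey(tag)) shared.add(tag);
        }

        // Set component: |intersection| / |union| (Jaccard), which
        // penalizes both missing and extra tags.
        int union = reference.size() + candidate.size() - shared.size();
        double overlap = union == 0 ? 1.0 : (double) shared.size() / union;

        // Order component: fraction of shared-tag pairs that appear in
        // the same relative order in both lists.
        int k = shared.size();
        if (k < 2) return overlap; // no ordering to compare
        int discordant = 0;
        for (int i = 0; i < k; i++) {
            for (int j = i + 1; j < k; j++) {
                if (candRank.get(shared.get(i)) > candRank.get(shared.get(j))) {
                    discordant++; // this pair is flipped in the candidate
                }
            }
        }
        double order = 1.0 - (double) discordant / (k * (k - 1) / 2);

        return overlap * order; // arbitrary combination; weight to taste
    }

    public static void main(String[] args) {
        List<String> sme = List.of("1", "2", "3", "4", "5");
        System.out.println(score(sme, List.of("1", "2", "3", "4", "5"))); // 1.0
        System.out.println(score(sme, List.of("1", "2", "4", "3", "5"))); // 0.9
        System.out.println(score(sme, List.of("1", "2", "5", "3", "4"))); // 0.8
        System.out.println(score(sme, List.of("2", "1", "3", "4", "5"))); // 0.9
        System.out.println(score(sme, List.of("1", "2", "3", "4")));      // 0.8
    }
}

On the examples above, {1, 2, 4, 3, 5} scores 0.9 and {1, 2, 5, 3, 4} scores 0.8, matching the intuition that the latter is worse; {1, 2, 3, 4} keeps perfect order but loses on overlap instead.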

I imagine there is some statistical algorithm for this, and probably even a statistics Java package already out there that I can use.
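If a ready-made package is the goal, one option (not named anywhere in the thread) is Apache Commons Math, whose rank-correlation classes compute Spearman's rho and Kendall's tau directly, given each tag's position in the two lists:

import org.apache.commons.math3.stat.correlation.KendallsCorrelation;
import org.apache.commons.math3.stat.correlation.SpearmansCorrelation;

public class RankCorrelationDemo {
    public static void main(String[] args) {
        // Each array holds the rank of tags 1..5 in one list.
        // The engine result {1, 2, 5, 3, 4} puts tag 3 at rank 4,
        // tag 4 at rank 5, and tag 5 at rank 3.
        double[] smeRanks    = {1, 2, 3, 4, 5};
        double[] engineRanks = {1, 2, 4, 5, 3};

        System.out.println(new SpearmansCorrelation().correlation(smeRanks, engineRanks)); // 0.7
        System.out.println(new KendallsCorrelation().correlation(smeRanks, engineRanks));  // 0.6
    }
}

Both measures assume the two lists contain the same tags, so missing or extra tags would still need separate handling, as in the overlap term above.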

We are evidently ignoring the fact that the search engines are actually likely to do a BETTER job at tagging the data than the SMEs will. The whole test is kind of backasswards if you ask me, but I gotta do what they ask.
posted by mike9322 17 May | 17:08
A weighted geometric distance algorithm?
posted by flopsy 17 May | 18:15
Factor analysis. Check stats packages. Gives weights for closest polynomial match.

Alternatively, the normalized geometric distance -- adjusted so the standard deviation is one -- would compensate for differing numbers of tags.

1) find the maximum number n of tags.
2) code tags as vectors in n-space. if there are fewer than n tags, set the remaining dimensions to zero.
3) find the pairwise distance between the machine vector and the SME vector as the square root of the sum of the squares of the differences between each dimension's value.
4) find the StDev of the distances.
5) normalize all the distances by the inverse of the StDev.
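A minimal Java sketch of those five steps, with the gaps filled by assumption: each list is coded as a vector of per-position scores (here just raw ranks), short lists are zero-padded to length n, and the population standard deviation is used for the normalization. The class name is made up:

public class NormalizedDistances {

    // Steps 1-3: Euclidean distance between two vectors zero-padded to length n.
    static double distance(double[] machine, double[] sme, int n) {
        double sum = 0;
        for (int i = 0; i < n; i++) {
            double m = i < machine.length ? machine[i] : 0;
            double s = i < sme.length ? sme[i] : 0;
            sum += (m - s) * (m - s);
        }
        return Math.sqrt(sum);
    }

    // Steps 4-5: divide every distance by the standard deviation of all of them.
    static double[] normalize(double[] distances) {
        double mean = 0;
        for (double d : distances) mean += d;
        mean /= distances.length;
        double variance = 0;
        for (double d : distances) variance += (d - mean) * (d - mean);
        double stdev = Math.sqrt(variance / distances.length);
        double[] normalized = new double[distances.length];
        for (int i = 0; i < distances.length; i++) {
            normalized[i] = stdev == 0 ? 0 : distances[i] / stdev;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // One machine-vs-SME distance per document (three made-up documents).
        double[] d = {
            distance(new double[]{1, 2, 3}, new double[]{1, 2, 3, 4}, 4),
            distance(new double[]{2, 1, 3}, new double[]{1, 2, 3}, 3),
            distance(new double[]{1, 2},    new double[]{1, 2},    2)
        };
        for (double x : normalize(d)) System.out.println(x);
    }
}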

How cum I don't have your job? Want mine?

This is what I get for having insomnia at 2:38 in the morning...
posted by warbaby 18 May | 04:40
On further consideration, I hope you have more than one SME rating every article. Otherwise, how are you going to know how the SMEs rank compared to yer mythical hypothetical standard? I'd think that you'd want your machine to rank median or better compared to the SMEs. To be fair, you should throw in some mediocre minds, like say your company's middle-management, as a baseline.
posted by warbaby 18 May | 04:45
Thanks, warbaby. I'm not sure I completely understand, but I have only been up for 10 minutes so I'll give it some time.

And, you are exactly right on the SME thing. That's what I was saying at the end of my first comment - it's an absolutely stupid thing to be testing. SMEs are subjective and prone to moods and on any given day may tag a certain document in different ways. So comparing what the software came up with (objectively) to what the SMEs came up with (subjectively) is RETARDED. Ahem. But I'm just the programmer. And this is government work at its most typical, er, finest.
posted by mike9322 18 May | 06:29
Too bad about not having any control over the experimental design. All the stats in the world can't fix a poorly designed experiment with inadequate controls. This is the story of the soft sciences.

You typically can't get away with it in the hard sciences -- although the occasional N-Ray, polywater and cold fusion does slip through for a while.

Get far enough away from the numbers and things like Intelligent Design become possible.

There is a way of using a group of experts to refine their estimates and provide a confidence level for the probable accuracy, called the Delphi Method.

posted by warbaby 18 May | 09:47
The Delphi Method (using feedback, consensus and iterative polling) is why democracies work and totalitarian systems implode.

It's also why collectivist anarchism (not liberal democracy as Frank Fukuyama claims) occupies the "destination" (misleadingly called the "end" by Frank) of history.

Everybody really does know everything.

Paradoxically, the things "everybody" knows (without reflection) are usually false.
posted by warbaby 18 May | 09:55