27 December 2005

Ask Mecha: I need a free medium-sized (c. 5 million words) corpus of representative American English. I do not need parts of speech tagged, just the text itself. [mi]
(just hit the "more" button when you're writing your post, and put the more inside text after the [:more:] thing that appears)
posted by agropyron 27 December | 03:18
I'm questioning whether Gutenberg texts (all of which predate 1923) will be sufficiently representative. Ideally, the corpus would have been typed using a standard modern American keyboard.

Also, I'd like corpora of C, C++, and Java code. I'm guessing the Linux kernel source will do for the C -- can you suggest large samples of C++ and Java?

Thanks.
posted by orthogonality 27 December | 03:18
Which era, and what quality, of American English?

If it doesn't matter, I'd hit up Project Gutenberg for Mark Twain/Sam Clemens.

Otherwise, newspaper archives. It might take some effort to aggregate it all. You might be able to call up your local paper - if they have an online version, they may have an aggregate of their print edition.
posted by porpoise 27 December | 03:20
What are your resources in the way of web scrapers, or scripting abilities? I'd suggest doing a Google blog search for some common word and slurping up all the resulting blog entries.
posted by agropyron 27 December | 03:21
Whoops - didn't look at preview. Really - whether something is valid depends on what you need the data for and what you're trying to do with it.

Ask mathowie for a text dump from metafilter?
posted by porpoise 27 December | 03:22
Journalistic writing is somewhat stilted compared to generic "American English". Of course, blog entries will be representative of horribly written American English. It all depends on what you're looking for.
posted by agropyron 27 December | 03:23
It should be pretty simple to write a perl script that tries every MeFi thread number up to today's posts, and stores the text of the thread.
posted by agropyron 27 December | 03:25
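A minimal sketch of that spidering loop -- in C++ with libcurl rather than the Perl the comment suggests, to keep all the examples in this thread in one language. The URL pattern and the thread-number range below are placeholders for illustration, not MetaFilter's actual scheme.

    // Sketch only: fetch candidate thread pages by trying sequential IDs.
    #include <curl/curl.h>
    #include <fstream>
    #include <string>

    static size_t write_cb(char* data, size_t size, size_t nmemb, void* userp) {
        static_cast<std::string*>(userp)->append(data, size * nmemb);
        return size * nmemb;
    }

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL* curl = curl_easy_init();
        if (!curl) return 1;
        for (int id = 1; id <= 47000; ++id) {  // hypothetical ID range
            std::string body;
            std::string url = "http://www.metafilter.com/mefi/" + std::to_string(id);
            curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
            curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
            if (curl_easy_perform(curl) == CURLE_OK && !body.empty())
                std::ofstream("thread_" + std::to_string(id) + ".html") << body;
            // Tag-stripping ("No HTML, please", below) would be a separate pass.
        }
        curl_easy_cleanup(curl);
        curl_global_cleanup();
    }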
No HTML, please. Yes, I can spider, and yes, I can remove the goddam tags, but really, isn't there just a nice fat .txt file somewhere?
posted by orthogonality 27 December | 03:59
Why?
posted by Cryptical Envelopment 27 December | 04:31
I need to calculate the frequency of letters and digraphs. Most of what's available only gives digraph frequencies for the alphabet -- I need to include numbers and punctuation too, and I need all n*n digraphs, not just the most frequent ones.

Then I need to throw the corpus at a genetic algorithm, to calculate least-cost graphs from letter to letter.
posted by orthogonality 27 December | 05:02
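A minimal sketch of that counting pass, assuming the corpus arrives as raw bytes on stdin; the output format (one nonzero digraph cell per line) is illustrative:

    // Count unigram and digraph frequencies over all 256 byte values,
    // so digits and punctuation are included, not just A-Z.
    #include <cstdint>
    #include <cstdio>

    int main() {
        static std::uint64_t uni[256] = {0};
        static std::uint64_t di[256][256] = {{0}};
        int prev = EOF, c;
        while ((c = std::getchar()) != EOF) {
            ++uni[c];
            if (prev != EOF) ++di[prev][c];
            prev = c;
        }
        // Emit every nonzero cell of the full matrix, not just the top few.
        for (int a = 0; a < 256; ++a)
            for (int b = 0; b < 256; ++b)
                if (di[a][b])
                    std::printf("%d %d %llu\n", a, b,
                                (unsigned long long)di[a][b]);
    }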
Or I need a new magic keyboard.
posted by orthogonality 27 December | 05:03
OK. And failing that, here is a magic $40,000 so you can go to graduate school. (Again? ;-)
posted by Cryptical Envelopment 27 December | 05:34
Damn. Orthogonality is hardcore. I know where he's going with this. Did you get your joysticks? You're reinventing the wheel - but there's a lot to be said for better mousetraps. Your need-based invention might just prove lucrative or at least useful.

Anecdotally, there was some Palm/Pocket PC writing system from a few-odd years ago that was whole word based.

To use it you navigated through a floating-zooming menu that was arranged via statistical word probability. As you zoomed in you simply navigated to the more correct word and/or letters with the pen. When the correct word was reached you released the pen contact and the word was entered, then you began again for the next word.

Speedwise it was highly competitive with typing, and certainly less work and more comfortable, and amazingly accurate. I remember you never had to zoom very far to find the right word.
posted by loquacious 27 December | 07:46
Here ya go: The Oxford Text Archive. Search for "American" in title.
posted by warbaby 27 December | 10:57
And then there's the Internet Archive: Text Archive

There's got to be a body of research available among the "travesty tree" crowd. A little research in the library would zero in on this and then you could do an "invisible college" search for authors. This seems like the sort of project where calling the authors up on the telephone would yield some very interesting paydirt.

Research in a nutshell:

1) Pose a well-framed question.
2) Consult a likely source.
3) Goto #1 and repeat as necessary.
posted by warbaby 27 December | 11:06
Oh, and always ask another question.
posted by warbaby 27 December | 11:12
How about some Gaddis?
posted by Hugh Janus 27 December | 12:27
loquacious: Yeah, I got a Logitech Xbox-style controller (cheapest available at a local store), and if I don't write the thing as an actual HID driver, the software for MS-Windows will be pretty simple: a call to joyGetPosEx() followed by some processing, followed by calls to SendInput().

(Writing it as a HID driver will be a better implementation, but not necessary for a proof of concept.)

The corpus is needed to figure out which keys to assign to which joystick combinations: we want to minimize the required joystick "travel", so we need to make the keys used in the most common digraphs "close" to each other.

Then we write a simple arcade-style game to train the user on the joystick moves.
posted by orthogonality 27 December | 19:15
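A rough sketch of that proof-of-concept loop on Win32, using the joyGetPosEx()/SendInput() calls named above; map_chord_to_key() is a hypothetical stand-in for whatever table the genetic algorithm produces. Link against winmm.lib and user32.lib.

    #include <windows.h>
    #include <mmsystem.h>

    // Hypothetical stand-in for the GA-optimized chord table: map raw
    // stick positions to a virtual-key code, 0 meaning "no key yet".
    WORD map_chord_to_key(DWORD x, DWORD y) {
        if (x < 0x4000 && y < 0x4000) return 'E';  // e.g. up-left chord -> E
        return 0;
    }

    int main() {
        for (;;) {
            JOYINFOEX ji;
            ji.dwSize  = sizeof(ji);
            ji.dwFlags = JOY_RETURNALL;
            if (joyGetPosEx(JOYSTICKID1, &ji) == JOYERR_NOERROR) {
                WORD vk = map_chord_to_key(ji.dwXpos, ji.dwYpos);
                if (vk) {
                    INPUT in[2] = {};
                    in[0].type = INPUT_KEYBOARD;
                    in[0].ki.wVk = vk;                   // key down
                    in[1] = in[0];
                    in[1].ki.dwFlags = KEYEVENTF_KEYUP;  // key up
                    SendInput(2, in, sizeof(INPUT));
                }
            }
            // Crude polling; a real version would debounce so a held chord
            // fires once, and a HID driver wouldn't poll at all.
            Sleep(10);
        }
    }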
Ortho, are digraphs the best way to determine next characters? Consider: after the context "ano" the most likely next letter might be "t", whereas after "o" alone the most likely next letter might be "s" (I'm making up the relationships, but just suppose these are correct). The full-word n-graph gives much tighter coupling to the next letter than just the digraph.

That's awfully terse, but the entire sequence of the preceding letters contains more information than just the adjacent pairs. This suggests finding least-distance paths through n-space (where n is the maximum word length) will be better at picking the next letter than a simple two-dimensional proximity matrix. This leads to a very sparse data structure (as opposed to a fully populated n x n matrix).

This should make your magic keyboard much smarter and faster...
posted by warbaby 27 December | 21:31
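A sketch of that sparse structure, assuming a hash map keyed by observed contexts so only prefixes actually seen in the corpus take space; the names and the backoff-to-shorter-context rule in predict() are illustrative choices, not something from the thread:

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct NGraphModel {
        int N = 6;  // maximum context length tracked
        // Only contexts actually observed get an entry, unlike a dense
        // matrix over every possible prefix.
        std::unordered_map<std::string,
                           std::unordered_map<char, std::uint32_t>> counts;

        void observe(const std::string& text) {
            for (std::size_t i = 1; i < text.size(); ++i) {
                std::size_t start = i > static_cast<std::size_t>(N) ? i - N : 0;
                for (std::size_t s = start; s < i; ++s)        // credit all
                    ++counts[text.substr(s, i - s)][text[i]];  // lengths 1..N
            }
        }

        // Most likely next character, backing off to shorter contexts.
        char predict(std::string context) const {
            while (!context.empty()) {
                auto it = counts.find(context);
                if (it != counts.end()) {
                    char best = 0;
                    std::uint32_t best_n = 0;
                    for (const auto& [ch, n] : it->second)
                        if (n > best_n) { best = ch; best_n = n; }
                    return best;
                }
                context.erase(0, 1);  // drop the oldest letter and retry
            }
            return 0;
        }
    };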
warbaby: the magic keyboard is the one that broke. This is my attempt to replace it.

Yes, by tracking the last N letters, we can (often) predict the next likely letter. But we don't want to change the movements required to make any particular letter, and prediction is only useful if we do.

What we're trying to do is not predict, but to minimize the amount of work done by the user. We want the most used letters (ETAOIN...) to require the least finger travel, but we also want to minimize travel due to digraphs: if "he" is a frequent digraph, we might want to put "h" close to "e".
posted by orthogonality 28 December | 04:08
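That objective translates directly into a fitness function for the genetic algorithm. In this sketch, travel(), the rest position, and the one-position-per-character layout representation are all assumptions for illustration:

    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct Pos { double x, y; };       // a key's location in "travel" space
    using Layout = std::vector<Pos>;   // one position per character (256)

    double travel(Pos a, Pos b) { return std::hypot(a.x - b.x, a.y - b.y); }

    // uni[c] = frequency of character c; di[a][b] = frequency of digraph
    // ab, both taken from the corpus counts.
    double layout_cost(const Layout& lay,
                       const std::vector<std::uint64_t>& uni,
                       const std::vector<std::vector<std::uint64_t>>& di) {
        const Pos rest{0.0, 0.0};  // assumed center/rest position
        double cost = 0.0;
        for (int c = 0; c < 256; ++c)        // frequent letters (ETAOIN...)
            cost += uni[c] * travel(rest, lay[c]);   // should sit near rest
        for (int a = 0; a < 256; ++a)        // frequent digraphs should be
            for (int b = 0; b < 256; ++b)    // short hops
                cost += di[a][b] * travel(lay[a], lay[b]);
        return cost;  // the GA evolves layouts that minimize this
    }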
Anecdotally, there was some Palm/Pocket PC writing system from a few-odd years ago that was whole word based.

To use it you navigated through a floating-zooming menu that was arranged via statistical word probability. As you zoomed in you simply navigated to the more correct word and/or letters with the pen. When the correct word was reached you released the pen contact and the word was entered, then you began again for the next word.


I've tried that, on my PC. There was a demo somewhere. It was well cool.
posted by stavrosthewonderchicken 28 December | 07:05