Google Developer Day: Theorizing from Data
Posted in GDD07 on May 31st, 2007 at 19:22:39
- Growth of available text corpora: 100 trillion words on the internet.
- LDC N-gram corpus: 13 million unigrams (much more than OED).
- Parametric, semi-parametric, non-parametric, hand-built, blank slate data-derived models.
- What you can do with data: What do you do with a copy of the internet?
- Google Sets, clustering algorithm
- Extraction via regex -> extraction via relations -> extraction via example
- Unsupervised learning/machine reading: determining relationships from text automatically, patterns, extraction of facts based on patterns.
- Learning classes and attributes of classes: Planets and stars. Attributes: size, temperature, density. Other example: Video games, attributes are makers, developers, designers, etc.
- What’s the right source for attributes: documents on the web vs. queries. Documents are informative, but hard to parse. Queries are short and easy to parse. The attributes they yield are different: ‘manufacturer, dose’ from documents; ‘side effects, half life, mechanism of action’ from queries (for the same class).
- Statistical Machine Translation: parallel texts. A web page in English and a web page in German correspond, so you want to align them, but you can’t do that to start, so do it probabilistically until you have some data. Search for the optimal solution: the best English translation of a foreign sentence, scored by the probability of the English coming out against lots of data.
- Showing Arabic -> English translation, Chinese -> English. Arabic: ~one disfluency in the English output per sentence, three for Chinese.
- Take sentence in Chinese, look up in dictionary, look up probability, determine the probability of bigram combinations and pull them out. Use trigram sequence, etc. Enumerate, multiply probabilities, out of all of them, one is most probable.
- Translation Model, Language Model, Decoding Algorithm.
- Translation Model: Counting up parallel corpus counts. Phrase pairs: bring in a linguist? Tried that, and it turned out that it hurt as much as it helped: theories are right most of the time, but there are exceptions, and statistics are better at exceptions and deal with the general data well enough, so statistical is the choice.
- Language Model: Google, 7-grams, 1 trillion words vs. 3-grams + 1 billion words in the past.
- English Model is better: parallel corpus is harder to find. Why not apply the same reasoning to all factors?
- Features that don’t help: parse trees, part of speech tags, WordNet categories, treebank, framebank. Raw counts from data are the only thing that helps. There is useful information there, but we haven’t figured out how to make it help.
- More data helps: Still adding more training data (linear).
- Scaling: how many bits to store probabilities? 64, 32, 16 bits? Trillions of numbers == 8GB vs. 1GB. Empirical answer: the minimal number of bits that doesn’t lose performance. Answer was 4 bits. Almost no difference between 4 bits and 8 bits. The number of bits only matters up to about 4.
- Word alignment needs a lot of memory because it stores all lexical co-occurrences. Store all combinations of words. Possible solutions: stemming (produced -> produce) or possibly truncating (produced -> produc). Not as accurate, but maybe good enough: use the smallest representation that does not hurt alignment. Truncating at 7 characters, not much effect; 6 characters, 5 characters -> better; 4 -> better still. 4 characters works best. Saves a lot of space; re-run the experiment if anything changes. Empirical science.
- Lots of data is important -> Happy user.
- Better models learned from data: lots of writers who do editor jobs -> happy user.
- 3 turning points in history of developing information: Sebastian Brant: “There is nothing nowadays that our children fail to know.” re Gutenberg Press
- Ben Franklin: “Common person whole book: farmers are as intelligent as most gentlemen” re Public Library
- Bill Clinton: “America has changed the way it works learns and communicates” re Internet
- Google is going to continue the trend.
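A note on the scaling bullet above (how few bits can you store a probability in before it hurts?). The notes don’t say how the probabilities were actually quantized, so what follows is only a minimal sketch under my own assumptions: uniform buckets over the log-probability range, synthetic values, and hypothetical helper names (`build_codebook`, `quantize`, `mean_abs_error`). It measures the numeric error of storing each value in a given number of bits, not translation quality, which is what the talk was really measuring.

```python
import math
import random

def build_codebook(log_probs, bits):
    """Return 2**bits evenly spaced levels spanning the observed log-prob range."""
    lo, hi = min(log_probs), max(log_probs)
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    return [lo + i * step for i in range(levels)]

def quantize(value, codebook):
    """Map a log-probability to the index of its nearest codebook level."""
    lo = codebook[0]
    step = codebook[1] - codebook[0]
    index = round((value - lo) / step)
    # Clamp so values outside the codebook range still get a valid index.
    return max(0, min(len(codebook) - 1, index))

def mean_abs_error(log_probs, bits):
    """Average gap between each value and the level it is stored as."""
    codebook = build_codebook(log_probs, bits)
    errors = [abs(lp - codebook[quantize(lp, codebook)]) for lp in log_probs]
    return sum(errors) / len(errors)

# Synthetic stand-in for real n-gram probabilities (an assumption, not talk data).
random.seed(0)
sample = [math.log(random.uniform(1e-6, 1.0)) for _ in range(10_000)]

for bits in (2, 4, 8, 16):
    print(bits, "bits -> mean abs log-prob error:", round(mean_abs_error(sample, bits), 4))
```

Only the small index needs to be stored per n-gram; the codebook itself is shared. The empirical step from the talk would be to plug the quantized values back into the real system and keep shrinking `bits` until the end result gets worse.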
- Questions?
- Q: “The data that has been analyzed the most is stock market data: have you used any of the theories from people of UN and applied them to this field?” A: “Google Hedge Fund might be a good idea”, but it’s always difficult to try to beat the market. Even if you understand it, if everyone else does, it doesn’t help you.
- Q: “Last slide — author -> user, working on any tools assisting authors in what to write based on what people are writing?” A: “People do research before they write things. More informed arguments because they look at that. Looking for better way to close the loop. What are people interested in? Google Trends: live trends.” Q: “Enter body of text, ranking against search queries?” A: “We provide analytics for ‘are people reading this’, but not ‘why’. Interesting area.”
- Q: “You show instances of automated translation with disfluencies — is automated translation close, far away, etc.?” A: “It’s hard to get a lot better. Still growing, but limitations to approach. Have more to do with real world — all we can do is filter through language on the page, and sometimes that’s not enough. In English we have no gender agreement: ‘it’ in English can’t be masculine/feminine easily. ‘Dropped a brick and it broke’: can’t tell which one, because that’s physics, not linguistics. Maybe combination of automatic + post-edit via humans, and have true fluency. 100% automated is tough.”
- Q: “Speculation — Google 411 so Google could gather spoken data. Is that true? Any numbers on non-textual data?” A: “Thought 411 was great, hook up local products. Gather more training data, more data is better for your products because you have that data. Interested in non-textual data, starting to branch out a bit more, various areas, maps, APIs, think that’s very important. Image search for many years would throw away the images and look at text for determining results; now we’re starting to do image analysis to get text on still and video images on YouTube.”
- Q: “Followup: Speech to text. Intuitions as to whether anything here works for speech recognition?” A: “Applicable, but speech is harder to get feedback on where you’re right and wrong. If you had a good source of where you were right/wrong, it would be possible to do that. Translation model and language model would apply if you have the right place to get data; the more data you get, the better you do.”
- Q: “Can you help us with spam? Really big problem — comment spam. Submit every comment, and get back a response.” A: “Interesting, so we should have name (on vacation) for this. We have done nofollow, but it’s still a problem. Intriguing: submit — spam/not spam. I can’t commit, something worth looking into.” (Ed note: This is Akismet. Use it. It rocks.)
- Q: (Missed) A: “What people are clicking on can be a clue to what they’re actually looking for. Do your queries appear on page? In links? In title? Pagerank? Other factors? Are people clicking on this page? Other clues. All that data is useful.”
- Q: “How do you measure the diff between someone finding something and being satisfied?” A: “Keeps me up at night. Worked at a company that got from Amazon. Amazon, you get a credit card purchase, that’s a success — never have that. Show some results, click on result: we don’t know you. Toolbar tests, but mainly learned that no one clicks ‘yes’, only ‘no’. Observe what happens, but not if you liked it. Many millions of examples: don’t trust any one click, can add them all up and hope that they work. At Berkeley, CTO from (couldn’t hear), ‘what we do is measure us against Google’. We’re 100% on that metric. But not good enough.”
- Q: “How can you tell that a website isn’t using machine translation to begin with?” A: “We had that problem. Arabic showed up. Didn’t have many Arabic speakers, so it’s hard to catch. Arabic speakers said ‘That’s just junk machine translation’. Business model around trying to make money off machine translation. Spam detection: good vs. bad. Problem, but we think we can deal with it.”
- Q: “Britannica vs. Joe’s Weblog: How do you take into account slang vs. formal language?” A: “News sites vs. non-news sites is useful. Slang vs. not-slang — just go by the totals.”
- Q: “Would this work with OCR?” A: “OCR is another thing we’re doing — book scanning, ties into all these, OCR systems do a good job, trying to correct things with language models, can improve output of OCR by better model of what makes sense.”
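The “better model of what makes sense” in the OCR answer is the same language-model idea from the translation notes: enumerate candidates, multiply n-gram probabilities, and keep the most probable one. As a rough sketch only, here is a tiny bigram model doing that. The three-sentence corpus, the candidate strings, the add-one smoothing, and the function names are all my own illustrative assumptions, nothing like the 7-gram, trillion-word model described in the talk.

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Count context words and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])            # every token that acts as a context
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams, vocab_size):
    """Sum of log bigram probabilities (same as multiplying the probabilities)."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing so unseen bigrams get a small non-zero probability.
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

def best_candidate(candidates, unigrams, bigrams, vocab_size):
    """Out of all candidates, return the one the model finds most probable."""
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams, vocab_size))

# Toy training text (hypothetical; a real system would count over web-scale data).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat lay on the mat",
]
unigrams, bigrams = train_bigram_counts(corpus)
vocab_size = len({w for s in corpus for w in s.lower().split()} | {"</s>"})

# Same words in different orders, as a decoder or OCR system might propose.
candidates = [
    "the cat sat on the mat",
    "the cat sat on mat the",
    "cat the on sat mat the",
]
print(best_candidate(candidates, unigrams, bigrams, vocab_size))
```

This covers only the language-model half. A translation system would also weigh each candidate by a translation model learned from parallel text, and an OCR system by how well the candidate matches the scanned image.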