- Growth of available text corpora: 100 trillion words on the internet.
- LDC N-gram corpus: 13 million unigrams (much more than OED).
- Parametric, semi-parametric, non-parametric, hand-built, blank slate data-derived models.
- What you can do with data: What do you do with a copy of the internet?
- Google Sets, clustering algorithm
- Extraction via regex -> extraction via relations -> extraction via example
- Unsupervised learning/machine reading: determining relationships from text automatically, patterns, extraction of facts based on patterns.
- Learning classes and attributes of classes: Planets and stars. Attributes: size, temperature, density. Other example: Video games, attributes are makers, developers, designers, etc.
- What’s the right source for attributes: documents on the web vs. queries. Documents are informative, but hard to parse. Queries are short and easy to parse. The attributes from each are different: ‘manufacturer, dose’ from documents; ‘side effects, half life, mechanism of action’ from queries (for the same class).
- Statistical Machine Translation: Parallel texts. Web page in English, web page in German, they correspond: align them, but you can’t do that to start, so do it probabilistically until you have some data. Search for optimal solution: best English translation of a foreign sentence is the one with the highest probability of the English coming out, estimated against lots of data.
- Showing Arabic -> English translation, Chinese -> English. Arabic: ~one disfluency from English per sentence, three in Chinese.
- Take sentence in Chinese, look up in dictionary, look up probability, determine the probability of bigram combinations and pull them out. Use trigram sequence, etc. Enumerate, multiply probabilities, out of all of them, one is most probable.
- Translation Model, Language Model, Decoding Algorithm.
- Translation Model: Counting up parallel corpus counts. Phrase pairs: bring in a linguist? Tried that, turned out that it hurt as much as it helped: theories work most of the time, but there are exceptions, and statistics are better at exceptions and deal with general data well enough, so statistical is the choice.
- Language Model: Google, 7-grams, 1 trillion words vs. 3-grams + 1 billion words in the past.
- English Model is better: parallel corpus is harder to find. Why not apply same reason to all factors?
- Features that don’t help: parse trees, part of speech tags, WordNet categories, treebank, framebank. Raw counts from data are the only thing that helps. Useful information, but we haven’t figured out how to make it help.
- More data helps: Still adding more training data (linear).
- Scaling: how many bits to store probabilities? 64, 32, 16 bits? Trillions of numbers == 8GB vs. 1GB. Empirical answer: the minimal number of bits that doesn’t lose performance. Answer was 4 bits. Almost no difference between 4 bits and 8 bits. Bit counts only matter up to about 4 bits.
- Word alignment needs a lot of memory because it stores all lexical co-occurrences: all combinations of words. Possible solutions: stemming (produced -> produce) or possibly truncating (produced -> produc). Not as accurate, but maybe good enough: use the smallest representation that does not hurt alignment. Truncating at 7 characters: not much effect; 6 characters, 5 characters -> better; 4 -> better still. 4 characters works best. Saves a lot of space; re-run the experiment if it changes. Empirical science.
- Lots of data is important -> Happy user.
- Better models learned from data: lots of writers -> happy user.
- 3 turning points in history of developing information: Sebastian Brant: “There is nothing nowadays that our children fail to know.” re Gutenberg Press
- Ben Franklin: “Common person whole book: farmers are as intelligent as most gentlemen” re Public Library
- Bill Clinton: “America has changed the way it works learns and communicates” re Internet
- Google is going to continue the trend.
- Q: “The data that has been analyzed the most is stock market data: have you used any of the theories from people of UN and applied them to this field?” A: “‘Google Hedge Fund’ might be a good idea, but it’s always difficult to try to beat the market. Even if you understand it, if everyone else does, it doesn’t help you.”
- Q: “Last slide: author -> user. Working on any tools assisting authors in what to write based on what people are writing?” A: “People do research before they write things. More informed arguments because they look at that. Looking for a better way to close the loop. What are people interested in? Google Trends: live trends.” Q: “Enter body of text, ranking against search queries?” A: “We provide analytics for ‘are you reading this’, but not ‘why’. Interesting area.”
- Q: “You show instances of automated translation with disfluencies: is automated translation close, far away, etc.?” A: “It’s hard to get a lot better. Still growing, but there are limitations to the approach. They have more to do with the real world: all we can do is filter through language on the page, and sometimes that’s not enough. In English we have no gender agreement: ‘it’ in English can’t be masculine/feminine easily. ‘Dropped a brick and it broke’: can’t tell which one, because that’s physics, not linguistics. Maybe a combination of automatic + post-edit via humans to get true fluency. 100% automated is tough.”
- Q: “Speculation: Google 411 so Google could gather spoken data. Is that true? Any numbers on non-textual data?” A: “Thought 411 was great, hook up local products. Gather more training data; more data is better for your products because you have that data. Interested in non-textual data, starting to branch out a bit more: various areas, maps, APIs. Think that’s very important. Image search for many years would throw away the images and look at text for determining results; now we’re starting to do image analysis to get text on still and video images on YouTube.”
- Q: “Followup: Speech to text. Intuitions as to whether anything here works for speech recognition?” A: “Applicable, but with speech it’s harder to get feedback on where you’re right and wrong. If you had a good source of where you were right/wrong, it would be possible to do that. Translation model and language model would apply if you have the right place to get data; the more data you get, the better you do.”
- Q: “Can you help us with spam? Really big problem: comment spam. Submit every comment, and get back a response.” A: “Interesting, so we should have name (on vacation) for this. We have done nofollow, but it’s still a problem. Intriguing: submit, get back spam/not spam. I can’t commit, but it’s something worth looking into.” (Ed note: This is Akismet. Use it. It rocks.)
- Q: (Missed) A: “What people are clicking on can be a clue to what they’re actually looking for. Do your queries appear on the page? In links? In the title? PageRank? Other factors? Are people clicking on this page? Other clues. All that data is useful.”
- Q: “How do you measure the diff between someone finding something and being satisfied?” A: “Keeps me up at night. Worked at a company that got from Amazon. At Amazon, you get a credit card purchase, that’s a success; we never have that. Show some results, click on a result: we don’t know you. Toolbar tests, but mainly learned that no one clicks ‘yes’, only ‘no’. Observe what happens, but not whether you liked it. Many millions of examples: don’t trust any one click, but can add them all up and hope that they work. At Berkeley, CTO from (couldn’t hear): ‘what we do is measure us against Google’. We’re 100% on that metric. But not good enough.”
- Q: “How can you tell that a website isn’t using machine translation to begin with?” A: “We had that problem. Arabic showed up. Didn’t have many Arabic speakers, so it was hard to catch. Arabic speakers said ‘That’s just junk machine translation’. There’s a business model around trying to make money off machine translation. Spam detection: good vs. bad. It’s a problem, but we think we can deal with it.”
- Q: “Britannica vs. Joe’s Weblog: How do you take into account slang vs. formal language?” A: “News sites vs. non-news sites is useful. Slang vs. not-slang — just go by the totals.”
- Q: “Would this work with OCR?” A: “OCR is another thing we’re doing — book scanning, ties into all these, OCR systems do a good job, trying to correct things with language models, can improve output of OCR by better model of what makes sense.”
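The translation model / language model / decoding split from the notes above can be sketched in a few lines. This is only a toy illustration, not anything shown in the talk: the French words, the probability tables, the `FLOOR` value and the function names are all invented, and a real decoder searches far more cleverly than brute-force enumeration.

```python
from itertools import permutations, product
from math import prod

# Toy translation model: P(English word | foreign word). Numbers are invented.
TRANSLATION = {
    "maison": {"house": 0.8, "home": 0.2},
    "bleue": {"blue": 0.9, "sad": 0.1},
}

# Toy bigram language model: P(word | previous word). "<s>" marks sentence start.
BIGRAM = {
    ("<s>", "blue"): 0.2,
    ("<s>", "house"): 0.2,
    ("<s>", "home"): 0.2,
    ("<s>", "sad"): 0.1,
    ("blue", "house"): 0.3,
    ("blue", "home"): 0.05,
    ("house", "blue"): 0.01,
    ("sad", "house"): 0.02,
}

FLOOR = 1e-4  # assumed probability for bigrams never seen in the "corpus"


def language_model(words):
    """Multiply bigram probabilities along the candidate sentence."""
    p, prev = 1.0, "<s>"
    for word in words:
        p *= BIGRAM.get((prev, word), FLOOR)
        prev = word
    return p


def decode(foreign):
    """Enumerate every word choice and ordering; keep the most probable."""
    best, best_p = None, 0.0
    options = [TRANSLATION[f].items() for f in foreign]
    for choice in product(*options):
        tm = prod(p for _, p in choice)  # translation model score
        words = [w for w, _ in choice]
        for order in permutations(words):
            score = tm * language_model(order)
            if score > best_p:
                best, best_p = list(order), score
    return best, best_p


best, p = decode(["maison", "bleue"])
print(best, p)
```

With these made-up tables the language model is what pushes "blue house" ahead of "house blue": both orderings get the same translation score, so the bigram counts decide.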
Archive for May, 2007
Some notes from the Python Design Patterns:
- Largely talking about what Facades vs. Adapters are: a Facade creates a simple interface around a very rich interface to limit you to what the client needs. Adapters are well-suited for small-scale use; a Facade is better when you have a large API you want to hide.
- Favor object composition over class inheritance
- Object Wrapping: Law of Demeter: you should only have one dot in your path. Client only talks with the wrapper, which delegates under the covers.
- “Factory” is essentially built into Python
- Examples of adapters: talking about two tools which provide the same functionality, per-class subclassing, passing instances into a wrapper, etc. “Mixins are the smartest usage of multiple inheritance”: inherit from two classes, and override the method you don’t want called to call the one you do want called. “Mix and match.” This is what OpenLayers uses, I think: Schuyler says that anyway.
Speaker seems very intelligent, but a bit dull as a speaker: he’s also probably speaking significantly over my head.
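To pin down the two adapter styles from these notes for myself, here is a rough sketch. None of it is from the talk's slides: `LegacyGeocoder`, `GeocoderAdapter`, `Plain`, `Fancy` and `Mixed` are all hypothetical names I made up.

```python
class LegacyGeocoder:
    """Hypothetical existing class whose method name the client doesn't expect."""

    def lookup_lat_lon(self, name):
        return {"boston": (42.36, -71.06)}.get(name.lower())


class GeocoderAdapter:
    """Object adapter: holds an instance (composition) and delegates to it."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def geocode(self, name):
        # The client only talks to the wrapper; delegation happens under the covers.
        return self._wrapped.lookup_lat_lon(name)


class Plain:
    def render(self):
        return "plain"


class Fancy:
    def render_fancy(self):
        return "fancy"


class Mixed(Plain, Fancy):
    """Inherit from two classes and override the method you don't want
    called so that it calls the one you do want called."""

    def render(self):
        return self.render_fancy()


print(GeocoderAdapter(LegacyGeocoder()).geocode("Boston"))
print(Mixed().render())
```

The first version passes an instance into a wrapper; the second is the per-class subclassing approach, where the adaptation lives in the class hierarchy instead.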
At the Google Developer Day. Sitting in the Google Gears session: it’s pretty frickin cool. I just created an OpenLayers Map, wrote 15 lines of code, and that page will now load, even when I’m offline.
Other things it does:
- WorkerPool — run code in the background in your browser. Demoing finding prime quadruplets: user interface continues to be responsive — can run multiples at the same time, and user experience is not bad. (Oh man, how I wish this were built in by default.)
- Storage — local storage in a sqlite database, including full text search: millions of documents, working on fts3, which will be 10s of millions.
- Local server — caches data for offline use.
There are tons of limitations — very early release — but it’s available for Firefox and IE, and can be built for Safari.
The upcoming week is going to be a ton of fun:
- Where 2.0: 29th-30th. I’ll be presenting OpenLayers at 4:30 on the 29th, and during the conference I’ll also be figuring out how to sanely hand out my limited supply of MetaCarta Labs On A Stick drives.
Where last year was great: it was my first introduction to a lot of geo people, and I’m looking forward to it this year. It’s a great way to get out into the world a bit more and meet the people who actually might have a use for all the code I write.
- Google Developer Day: May 31st. Although I’m not a huge Google person, this should be a good way to try to catch up to the mainstream in web mapping. Google has for a long time set the standard for web mapping, and I’m not convinced yet that this isn’t still the case (No matter how much I love OpenLayers).
- WhereCamp: Jun 2-3. Anselm Hook and Ryan Sarver have put together WhereCamp, the geounconference, for the weekend after Where. Hosted at the Yahoo! campus, this is going to be a great event for all geohackers, and I’m hugely looking forward to it.
I’ve still got to finish my OpenLayers presentation, and finalize the content for these Labs-on-a-Stick drives, but I’m hoping that once I do I can sit back, and hopefully get in some good time doing OpenLayers hacking: Schuyler, myself, Erik and Tim are all going to be in town, so we should be able to move forward on some stuff while we’re all in the same physical location.
Looking forward to meeting anyone and everyone at the conference… feel free to drop a comment if you feel like we should meet up and haven’t already made plans.
A better release announcement, which I wrote over on the Labs Weblog:
MetaCarta Labs is proud to announce the release of FeatureServer, an open source, Python based RESTful feature server.
FeatureServer allows you to store points, lines, and polygons in a number of backends, and get the data out all at once or one piece at a time. You can get the data out as KML, JSON, GeoRSS, GML/WFS or even as HTML.
FeatureServer is primarily designed as a lightweight vector feature storage companion to the new OpenLayers vector capabilities.
A demo of FeatureServer in OpenLayers is available at the FeatureServer Demo page. You can see features from this demo in:
FeatureServer is the first project we’ve released under our copyright-only open source license, a move designed to protect both MetaCarta and project contributors. We’re moving forward with getting this license approved by OSI, and plan to release future open source software under this license.
I’ve been looking forward to this for a long time, and am looking forward to demoing it for people at Where 2.0. I think that FeatureServer really is the first step forward in the world of interacting with geodata on a feature level rather than a giant-database level. Per-feature interaction — view, update, delete — as well as easy-to-use attribute querying are both huge boons when working with lots-o-data: instead of loading 700,000 features, or even just loading features based on a bounding box, you can talk about a feature based on its unique identifier.
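To make the per-feature idea concrete, here is a small in-memory sketch. It is not FeatureServer's code or API: the `FeatureStore` class, its method names and the sample points are hypothetical, and it only handles points. It just shows the three ways of getting at data mentioned above: by unique identifier, by bounding box, and by attribute.

```python
class FeatureStore:
    """Hypothetical in-memory store of point features, keyed by id."""

    def __init__(self):
        self._features = {}
        self._next_id = 1

    def create(self, lon, lat, **attrs):
        fid = self._next_id
        self._next_id += 1
        self._features[fid] = {"id": fid, "geometry": (lon, lat), "properties": attrs}
        return fid

    def get(self, fid):
        # Per-feature access: no need to load everything else.
        return self._features[fid]

    def update(self, fid, **attrs):
        self._features[fid]["properties"].update(attrs)

    def delete(self, fid):
        del self._features[fid]

    def query(self, bbox=None, **attrs):
        """Filter by bounding box (min_lon, min_lat, max_lon, max_lat) and/or attributes."""
        results = []
        for feature in self._features.values():
            lon, lat = feature["geometry"]
            if bbox is not None:
                min_lon, min_lat, max_lon, max_lat = bbox
                if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
                    continue
            if any(feature["properties"].get(k) != v for k, v in attrs.items()):
                continue
            results.append(feature)
        return results


store = FeatureStore()
mit = store.create(-71.09, 42.36, name="MIT")
sf = store.create(-122.42, 37.77, name="SF")
print(store.get(mit))
print(store.query(bbox=(-72, 42, -71, 43)))
print(store.query(name="SF"))
```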
It also makes a really great translator — for example, it makes it possible to load up your Flickr photos as a WFS server which you can drop into OpenLayers. No need to depend on the external interfaces: anything which provides a geospatial data API can be converted to a FeatureServer datasource.
The ability to read in GeoRSS and KML is great from the point of view of an aggregator: simply pipe the output of a curl GeoRSS download into a FeatureServer upload, and you’ve got a GeoRSS post collector.
There’s a ton of possibilities, and I’m looking forward to helping people explore them now that the software is available and open source.
I wonder how many people are going to the MetaCarta Public Sector User Group Meeting today…
Went out for a late night walk tonight: though my path was a bit unclear, I think I made a decently accurate map:
If you hover over the circles, you’ll see points of interest along the route. I recommend zooming into the main MIT building, to see the confused meandering I was forced into.
This demo uses OpenLayers, TileCache, and FeatureServer. It was created entirely via a web interface.
When at the Museum of Science, I noticed that the giant globe in the atrium near the entrance of the museum is copyrighted by Rand McNally. Although I understand that the interpretation of a lot of raw data to make a 3D model of the Earth is definitely a creative work, it was still strange to read ‘Globe copyright Rand McNally…’
I think from now on, I’m not going to try to answer questions on mailing lists which don’t follow the guidelines of “How to Ask Questions the Smart Way”. I spend way too much time trying to extract information that’s needed to solve any problems. Obviously, in some cases, it’s hard to know which information is important, but with two open source projects with growing user populations, it’s very hard for me to take the time to coax every person through getting the information that is important to solving problems.
I think it is a sign of the growth in popularity of the OpenLayers and TileCache projects that I’ve come to this stage — it’s only when projects reach a certain level of maturity that people start asking enough questions that I don’t feel like I can just answer everything. Certainly, for the first 6 months of the OpenLayers project, I never felt that answering questions on the mailing list was something that was a burden to do, but over the past couple months, it’s been an increasing level of traffic from users who haven’t taken the time to investigate the cause of their problems fully. I think the time has come to stop devoting as much energy to solving *every* problem, and instead solve the problems that can’t be solved by other people.
Of course, an additional step is to work to create a forum in which users can offer cash to solve their problems — you put up a question, with a cash bounty on it, and you either increase the amount of money you’re willing to spend to get your question answered, or you answer it yourself.
This seems like the kind of thing that there should already be software to do. I’ve never heard of any, though. I wonder why that is. Probably something obvious I’m missing.