Thursday, June 17, 2:30pm, 2405 Siebel Center

“Learning by Reading: From Information Extraction to Machine Reading”

Dr. Eduard Hovy, Information Sciences Institute / University of Southern California

Creating computer systems that educate themselves by reading text was one of the original dreams of Artificial Intelligence.  Researchers in Natural Language Processing (NLP) have made initial steps in this direction, especially with Information Extraction and Text Mining, which derive information from large sets of data.  Can one, however, build a system that learns by reading just one text, or a small number of texts, about a given topic?

Starting in 2002, three research groups in an experiment called Project Halo manually converted the information in one chapter of a high school chemistry textbook into knowledge representation statements, and then had a knowledge representation system take the US high school standardized (AP) exam.  Surprisingly, all three systems passed, albeit not very well.  Could one do the same, automatically?  In late 2005, DARPA funded several small pilot projects in NLP, Knowledge Representation and Reasoning (KR&R), and Cognitive Science to take up this challenge, which grew into Project Möbius, a collaboration of SRI, USC/ISI, University of Texas at Austin, Boeing, and BBN Inc.  The Möbius prototype learning-by-reading system read paragraph-length Wikipedia-level texts about the human heart and about engines, and built up enough knowledge to apply inferences, to produce its own further reading requests, and to answer unseen questions.  Results were encouraging.  In 2009, DARPA funded a new 5-year program called Machine Reading, which supports three large teams that include many of the top NLP and KR&R research scientists in the USA.

This talk describes the Machine Reading program and provides details about one of the three teams, RACR, which is led by IBM's IE/QA team and includes researchers at USC/ISI, University of Texas at Austin, CMU, and the University of Utah.  The system contains several reading engines that are being composed into a single large framework, with access to a cluster of several thousand computers for large-scale experiments.  The reading engines include traditional Information Extraction engines, parsers, converters to various logical form representations, abstract semantic models of world knowledge, and various kinds of abductive and other reasoning engines.  I will focus on large repositories of background knowledge and the ways they support reading and inference, and describe the experiments currently being done.