Thursday, May 26, 1:00 - 3:00 p.m.
Expert Speaker Series DSSI 2011
Karim Saadah, Yahoo
June 3, 2011; 1:00 - 2:30 p.m.
Apache Hadoop has become the platform of choice for developing large-scale, data-intensive applications. In this tutorial, we will discuss the design philosophy of Hadoop and describe how to design and develop Hadoop applications and higher-level application frameworks that crunch several terabytes of data, using anywhere from four to 4,000 computers. We will discuss solutions to common problems encountered in maximizing Hadoop application performance. We will also describe several frameworks and utilities built on Hadoop that increase programmer productivity and application performance.
In this talk, we look at the challenges of mapping applications into information networks and related issues. In an interconnected world, the evolution of one entity may cause a series of significant value changes in others. For example, the currency inflation of Thailand caused the currencies of other Asian countries to slump, which ultimately led to the financial crisis of 1997. We call such high-impact entities shakers. We will discuss how to discover shakers through a novel concept: constructing a cascading graph that captures the causal relationships among the evolving entities over some period of time, and then inferring shakers from the graph. Next, we consider how a network approach can solve the top-k maximal frequent pattern mining problem more efficiently. This is achieved by building a pattern graph from the transaction database after some fast initial processing. Unlike traditional bottom-up strategies such as level-wise or tree-growth mining, the graph-based method works in a top-down manner. It pulls large maximal cliques from the pattern graph and uses them directly as promising candidates for long frequent patterns. This greatly reduces execution time compared with the traditional bottom-up approaches.
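The top-down, graph-based idea in this abstract can be illustrated with a small Python sketch. This is not the speaker's algorithm; it is a simplified illustration under stated assumptions: the pattern graph links two items when the pair meets the support threshold, maximal cliques are enumerated with a basic Bron-Kerbosch search, and an infrequent candidate is shrunk by one item at a time. The function names (`build_pattern_graph`, `maximal_cliques`, `top_k_maximal_patterns`) and the toy transactions are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def build_pattern_graph(transactions, min_support):
    """Assumed construction: nodes are frequent items, edges are frequent pairs."""
    items = sorted({i for t in transactions for i in t})
    frequent = [i for i in items if support({i}, transactions) >= min_support]
    adj = {i: set() for i in frequent}
    for a, b in combinations(frequent, 2):
        if support({a, b}, transactions) >= min_support:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Basic Bron-Kerbosch enumeration (no pivoting) of maximal cliques."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    if adj:
        expand(set(), set(adj), set())
    return cliques


def top_k_maximal_patterns(transactions, min_support, k):
    """Top-down search: start from large cliques, shrink only when infrequent."""
    transactions = [frozenset(t) for t in transactions]
    adj = build_pattern_graph(transactions, min_support)
    frontier = set(maximal_cliques(adj))
    found = []
    while frontier and len(found) < k:
        size = max(len(c) for c in frontier)
        level = sorted((c for c in frontier if len(c) == size), key=sorted)
        frontier -= set(level)
        for cand in level:
            if any(cand <= f for f in found):
                continue  # covered by a longer frequent pattern, so not maximal
            if support(cand, transactions) >= min_support:
                found.append(cand)
            elif size > 1:
                # Infrequent clique: fall back to its one-item-smaller subsets.
                frontier.update(cand - {i} for i in cand)
    return found[:k]


# Hypothetical toy data for illustration only.
transactions = [
    {"a", "b", "c"}, {"a", "b", "c"},
    {"a", "b", "d"}, {"a", "b", "d"},
    {"c", "d"}, {"c", "d"},
]
patterns = top_k_maximal_patterns(transactions, min_support=2, k=3)
print([sorted(p) for p in patterns])
```

In this toy data every pair of items is frequent, so the pattern graph is a single four-item clique. That clique is not itself a frequent pattern, and the search steps down to its three-item subsets. This illustrates why cliques are only candidates and must still be checked against the transaction database. The shrinking step is exponential in the worst case, so the sketch is suitable only for small inputs.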
June 7, 2011; 1:30 - 2:00 p.m.
People communicate using language, whether spoken, written, or typed. A significant amount of this language describes the world around us, especially the visual world, whether in a physical environment or depicted in images or video. Such visually descriptive language is potentially a rich source of 1) information about the world, especially the visual world, and 2) training data for how people construct natural language to describe imagery. In addition, billions of photographs with associated text are available on the web; examples include web pages, captioned or tagged photographs, and video with speech or closed captioning. In this talk, I will describe several projects related to images, descriptive text, and depiction, including automatically labeling faces in news photographs, discovering visual attribute terms from noisy web collections, and generating simple natural language descriptions for images.
June 17, 2011; 1:00 - 2:30 p.m.
Opinion mining, or sentiment analysis, is the computational study of people’s opinions, appraisals, and emotions toward entities, events, and their attributes. Opinions are important because they are key influencers of our behaviors. Our beliefs and perceptions of reality, and the choices we make, are to a considerable degree conditioned on how others see and evaluate the world. For this reason, when we need to make a decision, we often seek out the opinions of others. This is true not only for individuals but also for organizations. Over the past 10 years, opinion mining has attracted a great deal of attention from both academia and industry because of its many challenging research problems and wide range of applications. In this talk, I will first present an abstraction of the problem, which gives structure to the unstructured text and reveals that opinion mining is a multifaceted problem consisting of many highly interrelated sub-problems. These sub-problems present multiple challenges for NLP, text mining, and machine learning. I will then describe several techniques for dealing with some of the sub-problems.
June 24, 2011; 1:00 - 2:30 p.m.
In this talk, I will introduce the machine learning technologies for query-document matching in search that we have developed at MSRA. I will start by pointing out that query-document mismatch is one of the major challenges for web search, and I will explain how we think the challenge can be addressed. Specifically, it is necessary to perform better query understanding and document understanding, and to conduct query-document matching at the sense, topic, and structure levels. Next, I will describe the machine learning techniques we have developed for query-document matching, including Learning of Query Document Matching Model and Regularized Latent Semantic Indexing. I will also discuss the relationships between our methods and existing methods. Finally, I will outline future research directions for query-document matching.