11:15 AM - 12:30 PM
- An Information Retrieval approach to Spelling Suggestion
An Information Retrieval approach to Spelling Suggestion
Popular search engines such as Google and Yahoo provide suggestions for misspelled queries, which shows that spelling suggestion is a vital feature for web search engines. However, many spelling suggestion techniques today depend on query logs, and not all search engines have query logs large enough to use them. Insufficient query logs cannot be an excuse for absent or poor spelling suggestion. In this presentation we describe the spelling suggestion component built for the web search engine Setooz; it is built from the web document collection alone and requires no training data or human intervention.
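The abstract leaves the method unspecified, so the following is only a toy sketch of a collection-only speller: build the vocabulary from the crawled documents themselves and rank candidates by edit distance, breaking ties by collection frequency (which stands in for the missing query log). All names and the distance cutoff are our own illustrative choices, not the Setooz implementation.

```python
from collections import Counter

def edit_distance(a, b):
    # single-row dynamic-programming Levenshtein distance
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def suggest(query, docs, max_dist=2):
    # vocabulary and term frequencies come from the collection alone
    vocab = Counter(w for d in docs for w in d.lower().split())
    ranked = sorted((edit_distance(query, w), -freq, w)
                    for w, freq in vocab.items()
                    if abs(len(w) - len(query)) <= max_dist)
    return [w for dist, _, w in ranked if dist <= max_dist]
```

Here collection frequency plays the role that query-log frequency plays elsewhere: the documents themselves tell us which spellings are common.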
- Key Phrase Extraction
Key Phrase Extraction
Keyphrases are sequences of words that capture the main topics covered in a document. They are very useful in summarization, text document classification, document clustering, etc. In this presentation I will cover two major issues: (1) a comparative study of techniques for identifying meaningful phrases, such as semantic-relatedness-based techniques, probabilistic techniques, and corpus- or external-knowledge-supported techniques, against our n-gram filtration technique; and (2) a keyphrase extraction technique based on n-gram filtration. Because it makes minimal use of linguistic resources and is unsupervised, it does not require a heavy setup for running the application and opens up possibilities for extension to multilingual environments.
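As a rough illustration of the flavor of an n-gram filtration pipeline (the stopword list, the boundary filter, and the frequency ranking below are our own illustrative choices, not the talk's exact method):

```python
from collections import Counter

# a tiny stopword list; a real system would use a fuller one
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "are", "to", "for"}

def candidate_ngrams(text, max_n=3):
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # filtration step: drop n-grams that start or end with a
            # stopword, since they rarely form meaningful phrases
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            yield " ".join(gram)

def keyphrases(text, top_k=5):
    # rank surviving n-grams by frequency; longer phrases break ties
    counts = Counter(candidate_ngrams(text))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], -len(kv[0])))
    return [gram for gram, _ in ranked[:top_k]]
```

Because nothing here needs a POS tagger or parser, the same pipeline transfers to other languages given only a stopword list.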
- Relationship Extraction using Probabilistic Graphical Models
Relationship Extraction using Probabilistic Graphical Models
Information Extraction helps us create structured data from unstructured text. Here I present some work using graphical models and relational learning, and discuss their advantages. I will also present some max-margin methods that take the structural information of the words into account. These models make the process quicker and more effective, and can be applied over heterogeneous data sources, tables, and documents.
- Progressive Summarization
Progressive Summarization
Text summarization is the process of condensing text to its most essential facts. A well-composed and coherent summary is a solution for most information-overload problems. Progressive summarization is an emerging area within the summarization community, where summaries are generated with a sense of prior knowledge. In this presentation, we introduce progressive summarization, discuss the intricacies involved, and propose a simple and effective solution to the problem.
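The idea of summarizing "with a sense of prior knowledge" can be sketched as novelty-based sentence selection; the Jaccard word overlap and the thresholds below are our own illustrative choices, not the solution proposed in the talk.

```python
def overlap(a, b):
    # Jaccard similarity over word sets
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def progressive_summary(sentences, prior_summary, k=2, novelty=0.5):
    # keep only sentences sufficiently different from everything the
    # reader has already seen in earlier summaries
    chosen = []
    for s in sentences:
        if all(overlap(s, seen) < novelty for seen in prior_summary + chosen):
            chosen.append(s)
        if len(chosen) == k:
            break
    return chosen
```

The prior summary acts as an anti-redundancy filter: sentences restating what the reader already knows are skipped in favor of genuinely new information.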
- Prominence based scoring of speech segments for automatic speech-to-speech summarization
Prominence based scoring of speech segments for automatic speech-to-speech summarization
The talk presents an automatic speech-to-speech summarization system using prominence-based scoring of speech segments. Previous approaches have used acoustic features such as F0, intensity, and duration to train a classifier that labels a speech segment as belonging to the summary or not. However, these approaches require gold-standard human-labelled summaries to train the classifier. There have also been attempts to use lexical features derived from automatic speech recognition (ASR) transcripts in combination with acoustic features to build speech-to-speech summarization systems. It should be noted that ASR systems are not available for all languages (especially low-resource languages). Hence it is important to investigate approaches to speech-to-speech summarization that depend neither on gold-standard human-labelled summaries nor on ASR. It is known that content words (nouns, verbs, etc.) are usually stressed or made prominent. In this work we explore prominence-based features for scoring speech segments to generate automatic speech-to-speech summaries.
In the current work we show that summaries generated by prominence-based scoring of speech segments are as good as summaries generated by a classifier trained on acoustic features such as F0, intensity, and duration. We also propose a method to combine prominence scores with maximum marginal relevance (MMR) scores of speech segments when manual/ASR transcripts are available, and compare the performance of the proposed method with an MMR-based text summarizer.
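The combination step can be sketched as a greedy selection in which each segment's MMR score is interpolated with its prominence score. The interpolation weight `alpha` and all names below are our own illustrative choices, not the exact formulation of the talk.

```python
def summarize(segments, relevance, similarity, prominence,
              k=3, lam=0.7, alpha=0.5):
    """Greedy selection interpolating prominence with MMR.

    relevance[i]     -- similarity of segment i to the whole document
    similarity[i][j] -- pairwise segment similarity
    prominence[i]    -- acoustic prominence score of segment i
    """
    selected, candidates = [], list(range(len(segments)))
    while candidates and len(selected) < k:
        def score(i):
            # MMR: reward relevance, penalize redundancy with picks so far
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            mmr = lam * relevance[i] - (1 - lam) * redundancy
            return alpha * prominence[i] + (1 - alpha) * mmr
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [segments[i] for i in selected]
```

When no transcript is available, the same loop can run on the prominence term alone (`alpha = 1`), which is the transcript-free setting the work targets.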
2:30 PM - 3:30 PM
- Phrase based Query Expansion Technique for Enterprise Search
Phrase based Query Expansion Technique for Enterprise Search
Not Available...
- Enterprise Search
Enterprise Search
We present a system for searching data in an enterprise environment. System performance was improved using several techniques: search based on different roles in the environment, semantic search using existing ontologies, deeper NLP techniques (such as named entities), and document categorization.
- Knowledge Base Population
- Vijay Bharath, Kranthi Reddy
Knowledge Base Population
The aim of the project is to build an automated system for discovering information about named entities and incorporating it into a structured knowledge source. The task can be broken down into two parts.
- Entity Linking: In this task we link named entities from news articles to nodes in a structured knowledge base (e.g., Wikipedia). A node in the knowledge base contains information about a named entity.
- Slot Filling: After linking named entities to nodes in the knowledge base, we extract information from the news article and use it to update the information already present in the node.
Applications: Besides reducing manual effort, high-quality, up-to-date information from news articles can be used to automatically update knowledge sources.
- Distributed Computing for IR
Distributed Computing for IR
This talk will present work done on the Learning Scheduler developed at SIEL. Our scheduler targets heterogeneous clusters running repetitive batch jobs. It uses pattern classification to classify tasks into two sets, good and bad: good tasks are those that do not overload the resources on a machine. From the list of good tasks the scheduler chooses the task with maximum priority, where job priorities are calculated from user-supplied priority functions. The scheduler achieves a user-specified level of utilization, measured as load averages.
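The selection rule described above can be sketched as follows; the task representation and both callbacks are hypothetical stand-ins (the real scheduler learns the overload predictor via pattern classification over resource-usage patterns).

```python
def next_task(tasks, predict_overload, priority):
    # split tasks into good (won't overload the machine) and bad,
    # then pick the good task with maximum user-defined priority
    good = [t for t in tasks if not predict_overload(t)]
    if not good:
        return None  # defer: every candidate would overload the machine
    return max(good, key=priority)
```

The two-stage design keeps the concerns separate: the learned classifier protects machine utilization, while the user-supplied priority function decides ordering among the safe tasks.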
4:30 PM - 5:30 PM
- Learning the Click-Through Rate for Rare/New Ads from Similar Ads
Learning the Click-Through Rate for Rare/New Ads from Similar Ads
Sponsored search has quickly become the largest source of revenue for web search engines. Search engines generate revenue from click/impression events on ads. Clicks on an ad depend heavily on the rank at which the ad is displayed on the search page, and the ordering of ads is based on historical click information. Hence, accurately predicting the click-through rate (CTR) of an ad is of paramount importance for maximizing revenue. We first consider the problem of removing the inherent presentation and position bias from the click-through logs for already established ads. For newly created or rare ads we do not have sufficient historical information to calculate CTR values. We present a model that inherits the click information of rare/new ads from semantically related frequent ads. The semantic features are derived from query-ad click-through graphs and advertisers' account information. We use gradient boosted decision trees (GBDT) as the regression model. Experiments show that the model learned using these features gives very good predictions for the CTR values of the ads; the improvements obtained are significant at the 99% level.
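GBDT fits an ensemble of small regression trees to the residuals of the previous round. The hand-rolled sketch below uses depth-1 trees (stumps) purely for illustration; a production system would use a tuned library implementation, and the semantic features themselves are outside this sketch.

```python
def fit_stump(X, residuals):
    # best single-feature threshold split minimizing squared error
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            left = [r for x, r in zip(X, residuals) if x[f] <= t]
            right = [r for x, r in zip(X, residuals) if x[f] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    return best[1:]

def gbdt_fit(X, y, rounds=50, lr=0.1):
    base = sum(y) / len(y)  # start from the mean CTR
    pred = [base] * len(X)
    trees = []
    for _ in range(rounds):
        # each round fits a stump to what the ensemble still gets wrong
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        f, t, lm, rm = fit_stump(X, residuals)
        trees.append((f, t, lm, rm))
        pred = [p + lr * (lm if x[f] <= t else rm) for x, p in zip(X, pred)]
    return base, trees, lr

def gbdt_predict(model, x):
    base, trees, lr = model
    return base + sum(lr * (lm if x[f] <= t else rm)
                      for f, t, lm, rm in trees)
```

In the paper's setting, a rare ad's feature vector would encode its semantic relatedness to frequent ads, letting the regressor transfer their click behavior.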
- Extracting Subjectivity in Sentiment Classification
Extracting Subjectivity in Sentiment Classification
Sentiment classification focuses on predicting the polarity of a review. Unlike topical classification, which focuses on keywords, sentiment classification focuses on subjectivity. The classification can be binary or multi-class. The key to sentiment classification is extracting subjectivity; however, existing approaches are heavily reliant on language resources such as lexicons, POS taggers, etc. In this presentation, we propose alternative approaches that do not require any such resources for extracting the subjectivity in a review and rating it.
- CLIR
CLIR
Will be updated shortly...
- Learning to Rank Categories for a User Query
Learning to Rank Categories for a User Query
In a Web search engine scenario, knowing the intent behind a user query helps in tasks such as improving the search results, providing ads relevant to the user query, etc. One representation of the user intent is the set of categories to which a user query belongs. In this work, we discuss query categorization, which involves categorizing a query into a ranked list of one or more predefined categories. We view query categorization as a ranking problem, and apply learning-to-rank algorithms to learn such a ranking of categories for a user query. This research is still ongoing, and preliminary results show an improvement in precision over a baseline system.
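A minimal flavor of the setup, assuming a linear scoring model over (query, category) features; the feature vectors, weights, and update rule below are illustrative, not the actual algorithms evaluated in this work.

```python
def rank_categories(features, weights):
    # features: {category: feature vector for the (query, category) pair}
    # weights:  linear model learned by a learning-to-rank algorithm
    score = lambda fv: sum(w * f for w, f in zip(weights, fv))
    return sorted(features, key=lambda c: score(features[c]), reverse=True)

def pairwise_update(weights, fv_better, fv_worse, lr=0.1):
    # perceptron-style pairwise step: if a labelled pair is misordered,
    # nudge the weights toward scoring the better category higher
    score = lambda fv: sum(w * f for w, f in zip(weights, fv))
    if score(fv_better) <= score(fv_worse):
        weights = [w + lr * (b - ws)
                   for w, b, ws in zip(weights, fv_better, fv_worse)]
    return weights
```

Casting categorization as ranking, rather than independent classification, lets the model optimize the ordering of categories directly, which is what the end application consumes.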
- Knowledge Access in Smart Spaces
Knowledge Access in Smart Spaces
This work explores creating smart spaces using Internet tablet devices, focusing specifically on a classroom environment. Primarily, it aims to provide a student with the ability to retrieve, extract, and access information as needed in the context of a classroom, using contextual search, personalization, and summarization technologies, while also integrating teaching and learning aids.
- Cloud Computing
Cloud Computing
This poster presents the overall research activity carried out by the cloud computing group in SIEL. We are primarily focusing our research on three main areas: 1. cloud middleware and services, 2. the role of cloud computing in e-governance and language technologies, and 3. cloud migration. The research on cloud middleware is about improving cloud application stacks such as MapReduce by discovering new resource management and scheduling algorithms, improving the energy efficiency of cloud middleware, and exploring innovative applications of virtualization. We are also working on utilizing cloud computing for better e-governance models and on using cloud offerings to build scalable language technology applications such as multilingual machine translation and cross-language information retrieval. In cloud migration we are working on collaborative approaches to cloud computing, federation of resources, SLA establishment, and cloud interoperability.
- Generating simulated feedback: A prognostic search approach
Generating simulated feedback: A prognostic search approach
Implicit relevance feedback has received wide attention recently as a means to capture the search context and thereby personalize and improve search accuracy. However, such feedback is usually not available to the public, or even to research communities at large, for various reasons. This makes it difficult to experiment with and evaluate web-search research, especially personalization algorithms. We attempt to solve this problem by generating simulated relevance feedback -- artificially generated implicit relevance feedback. We were able to achieve an accuracy of 65% in generating simulated relevance feedback through our approach.
- Semantic similarity of tags in social bookmarking systems
Semantic similarity of tags in social bookmarking systems
Social bookmarking systems are built upon three dimensions: resources, users, and tags. Although several approaches have been proposed to measure the semantic relatedness between tags, they consider only two of the three dimensions. This projection loses valuable information that can be harnessed in multiple applications. We propose a model that addresses all three dimensions, and define a semantic similarity measure based on it. Initial results show a 5% increase in the precision of semantic similarity over a popular similarity measure. We further evaluate our approach in an application, tag clustering, and demonstrate its practical effectiveness.
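One way to keep all three dimensions can be sketched with plain cosine similarity (the representation and measure here are our own illustration, not the proposed model): describe each tag by the (user, resource) pairs it annotates, so neither users nor resources are projected away.

```python
from collections import Counter
from math import sqrt

def tag_vectors(posts):
    # posts are (user, resource, tag) triples; a tag's vector is
    # indexed by (user, resource) pairs, preserving all three dimensions
    vectors = {}
    for user, resource, tag in posts:
        vectors.setdefault(tag, Counter())[(user, resource)] += 1
    return vectors

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

Projecting onto only (resource, tag), by contrast, would treat two users tagging the same page identically, losing exactly the per-user signal this representation keeps.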
- Challenges in Language Identification
Challenges in Language Identification
Due to the diversity of documents on the web, language identification is a vital task for web search engines during crawling and indexing. Among its many challenges, identifying the language of Romanized text remains an unsettled problem: word spellings and sounds vary across dialects.
- CLIR in Wikipedia
CLIR in Wikipedia
We refer to our globally interconnected information infrastructure as the World Wide Web. At present, however, it is far less than that. For someone who reads only English, it is effectively the English-Wide Web; a reader of only Hindi sees only the Hindi-Wide Web. With documents spread across so many languages, it is clearly evident that CLIR is a key factor for universal usability.
In Wikipedia, cross-language search is currently very limited. We would like to bridge the gap between the information available in English and in the language known to the user, expanding the usability of Wikipedia manyfold. We intend to build a Cross Language Information Access system that harnesses the structured Wikipedia data to the maximum extent possible while using the least amount of language-specific resources.
- Indian Language News Clustering
Indian Language News Clustering
News sources on the Internet have increased enormously. In Indian languages, there are news websites that provide news articles in each language, and similar content is widely spread across languages. Multilingual document clustering (MLDC) helps aggregate similar content across languages. This improves the usefulness of the content about a particular topic by including information from documents in other languages.
- CLIA
CLIA
India Search enables users to query in one Indian language and retrieve results in other Indian languages as well as English, in the tourism domain. It also includes features such as summarization, information extraction of selected templates, snippet generation, and snippet translation.
- Enterprise Search
- Prashant, Vinay, Balaji, Kushal
Enterprise Search
We present a system for searching data in an enterprise environment. System performance was improved using several techniques: search based on different roles in the environment, semantic search using existing ontologies, deeper NLP techniques (such as named entities), and document categorization.
- Recognizing Textual Entailment
- Kiran Kumar .N, Santosh GSK, Sudheer .K
Recognizing Textual Entailment
Textual entailment is defined as: “a directional relationship between two text fragments, which we term the Text (T) and the Hypothesis (H)”, where “T entails H if the truth of H can be inferred from T within the context induced by T”. Recognizing textual entailment is one of the most complex tasks in natural language processing (NLP), and progress on this task is key to many applications such as question answering, information extraction, information retrieval, and text summarization. In multi-document summarization, for example, a redundant sentence or expression can be omitted from the summary if it is entailed by other expressions in the summary.