This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7000 comments in Kannada-English, and around 20,000 comments in Malayalam-English.
View Article and Find Full Text PDFWord senses are the fundamental unit of description in lexicography, yet it is rarely the case that different dictionaries reach any agreement on the number and definition of senses in a language. With the recent rise in natural language processing and other computational approaches there is an increasing demand for quantitatively validated sense catalogues of words, yet no consensus methodology exists. In this paper, we look at four main approaches to making sense distinctions: formal, cognitive, distributional, and intercultural and examine the strengths and weaknesses of each approach.
View Article and Find Full Text PDFMachine translation is one of the applications of natural language processing which has been explored in different languages. Recently researchers started paying attention towards machine translation for resource-poor languages and closely related languages. A widespread and underlying problem for these machine translation systems is the linguistic difference and variation in orthographic conventions which causes many issues to traditional approaches.
View Article and Find Full Text PDFBackground: The Salford Lung Study (SLS) programme, encompassing two phase III pragmatic randomised controlled trials, was designed to generate evidence on the effectiveness of a once-daily treatment for asthma and chronic obstructive pulmonary disease in routine primary care using electronic health records.
Objective: The objective of this study was to describe and discuss the safety monitoring methodology and the challenges associated with ensuring patient safety in the SLS. Refinements to safety monitoring processes and infrastructure are also discussed.
BMC Bioinformatics
March 2008
Background: Although there are a large number of thesauri for the biomedical domain many of them lack coverage in terms and their variant forms. Automatic thesaurus construction based on patterns was first suggested by Hearst 1, but it is still not clear how to automatically construct such patterns for different semantic relations and domains. In particular it is not certain which patterns are useful for capturing synonymy.
View Article and Find Full Text PDF