Natural Language Processing (NLP) & Text Mining

The Natural Language Processing and Text Mining Group

The NLP and TM Group has consistently achieved high-quality research outputs, attracted significant funding and trained outstanding PhD students. Its roots lie in the pioneering research in NLP conducted between 1980 and 2000 at the Centre for Computational Linguistics of UMIST (one of the two founding universities of the University of Manchester). Since 2004, the Group has focussed its activities around the interplay of NLP and TM. Its pre-eminence in TM was recognised in 2004 by the award of major funding from JISC/BBSRC/EPSRC to set up the world’s first publicly-funded National Centre for Text Mining (NaCTeM), which immediately became an international centre of text mining expertise.

National Centre for Text Mining (NaCTeM)

Under the leadership of Prof Sophia Ananiadou, NaCTeM’s ethos has always been to drive forward the state of the art in research, with results then being fed into the development of tools, services and resources (annotated corpora, computational lexica) of benefit to the wider research community. NaCTeM researchers have excelled in community shared tasks and challenges, notably in BioCreAtIvE III, IV and V and in BioNLP 2011 and 2013 (for the most complex task of event extraction), and most recently obtained two first places in tasks of the 5th CL-SciSumm Shared Task 2019. Moreover, NaCTeM’s participation in DARPA’s $45m Big Cancer Mechanism initiative, in a consortium led by the University of Chicago, saw it produce in 2015 the top performing system for extracting information to support cancer pathway modelling. NaCTeM’s academic and industrial research projects range over many domains, from biology and biomedicine to biodiversity, toxicology, neuroscience, materials, history, social sciences, insurance, and health and safety in the construction industry, with funding coming from EPSRC, ESRC, MRC, AHRC, Wellcome Trust, NIH, Pacific Life Re, Lloyd’s Register Foundation, AstraZeneca, DARPA, EC Horizon 2020, JST and the cosmetics and extracts industry, among others. Applications arising from such research include Thalia, a semantic search engine over more than 20m biomedical abstracts; Facta+, to find unsuspected associations in the biomedical literature; HoM, allowing semantic search of historical medical and public health archives; and RobotAnalyst, supporting the hitherto laborious screening stage of systematic reviewing through active learning techniques. NaCTeM also collaborates closely with the Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology, Japan. Recently AIRC and NaCTeM obtained funding from the Japan Agency for Medical Research and Development (AMED) for the development of novel biomarkers stratifying cancer patients.

The research group is also involved in the Discovery Safety Programme, funded by Lloyd’s Register Foundation in collaboration with the Health and Safety Executive (HSE), developing new methodologies to analyse and aggregate data to prevent harm in the workplace, based on information extraction, summarisation, crowdsourcing and machine learning.

The research group leads the UK healthcare text analytics network (Healtex), is part of the Farr Institute’s Health eResearch Centre (HeRC) and has pioneered the creation of the ACL SIGBIOMED special interest group featuring the BioNLP workshops since 2002. 

Part of the research group has also delved into text mining applied to social sciences. Our work on social media analytics underpinned by text mining techniques (e.g., text classification, sentiment analysis, topic modelling, named entity recognition) has been providing insights into the social "pulse" on issues ranging from customer satisfaction, through to fair work and human rights. Additionally, we seek to enhance civic engagement with our work on the text mining-based analysis of Parliamentary data (e.g., UK Hansard archives). 


Prof. Sophia Ananiadou

Sophia is the Deputy-Director of the Data Science Institute and the Director of NaCTeM. She also leads the Text Mining Research Group, and has led the development of the text mining tools and services currently used in NaCTeM with the aim of providing scalable text mining services: information extraction, intelligent searching, association mining, etc. Her main contributions are in the area of natural language processing, and in particular computational terminology and biomedical text mining. The work in computational terminology and term recognition led to the development of the C-value method for automatic term recognition, which has been adopted as a standard method internationally.

Software Development Highlights:

She has been a principal investigator on a number of projects, such as EMPATHY, which aimed to support metabolic pathway model curation through the integration of text mining methodologies into a pathway reconstruction platform. In collaboration with the University of Liverpool and the National Institute for Health and Care Excellence, she was the principal investigator for Mining for Public Health, which aims to conduct novel research in text mining and machine learning to transform the way in which evidence-based public health reviews are conducted.

John McNaught

John is an honorary lecturer in the School of Computer Science and the Deputy Director of NaCTeM. He has worked on machine translation (MT) aspects, specifically on MT software design, on sublanguage-based MT, and on computational dictionaries. Multilingual issues and sublanguage concerns also led him to develop strong interests in computational terminology and the representation of special knowledge. He also became involved in various language engineering standardisation initiatives such as EAGLES and ISLE. These focused particularly on issues of reusability of language resources such as text corpora and electronic dictionaries, and design for reuse. Recently, the reusability issue has been exercising researchers in the context of semantic web ontologies.

Prof Jun-ichi Tsujii

Before taking up the post of Director at the Artificial Intelligence Research Center (AIRC), AIST, Jun-ichi Tsujii was a professor at the University of Tokyo in Japan as well as a professor at the University of Manchester in the UK. He was appointed as the first director of the National Centre for Text Mining (NaCTeM) in the UK in 2005, and he is now the scientific advisor of NaCTeM and a part-time professor at the School of Computer Science. His research achievements include the development of the HPSG-based parser (Enju), its application to pathway extraction from text, and construction of the GENIA corpus. The GENIA corpus has been used as one of the gold standard corpora for tasks in Bio Text mining such as event extraction, named entity recognition, and pathway extraction.

He has received a number of awards, including the IBM Faculty Award (2005), the Achievement Award of the Japan Society for Artificial Intelligence (2008) and Fellowship of the Information Processing Society of Japan (2010). He received the Medal of Honor with Purple Ribbon from the Japanese government for his contributions to Bio Text mining and Machine Translation (2010), the Funai Achievement Award (2014) and the Okawa Prize (2016), and was named an ACL Fellow (2014). He was President of the ACL (Association for Computational Linguistics) in 2006, and is a permanent member and chair of the ICCL (International Committee on Computational Linguistics).

Dr Riza Batista-Navarro

Riza is a Lecturer at the School of Computer Science of the University of Manchester, and a member of the NLP & Text Mining research group. She obtained her PhD in Computer Science from the University of Manchester, with Biomedical Text Mining as her specialisation. During her time at Manchester, Riza has conducted research into using Natural Language Processing to extract meaningful information from numerous scientific documents. Along with using NLP in her work, she optimises machine learning algorithms to derive meaning from texts. 

She will soon begin research as part of a Newton Fund project on social media analytics in the Philippines. It will aim to automatically analyse social media posts to detect and extract the emotions of the poster and find any potential distress. She will work alongside two Universities and an NGO within the Philippines to assist in identifying those who might be experiencing mental distress so that the correct assistance can be provided for them.

Professor Goran Nenadic

Goran’s research focuses on unstructured data science, specifically on making sense of large-scale free-text data by combining rule-based and data-intensive approaches. His work mainly aims at engineering deep features to train machine-learning algorithms to process free-text documents. He is based in the School of Computer Science, and affiliated with the Manchester Institute of Biotechnology (MIB) and The Farr Institute’s Health eResearch Centre (HeRC). His recent and current research projects (funded by NIHR, EPSRC, BBSRC, Wellcome, AZ, Pfizer) include large-scale extraction and curation of biomedical information from the literature (including processing table data) and understanding patient free-text data. He has worked with a number of local hospitals, charities and industry on unlocking evidence contained in clinical narratives and healthcare social media in many areas, including oncology, stroke, rheumatology, veterinary science, mental health, radiology, complementary medicine and pain management. His team has also worked on semi-automated anonymisation of clinical free text and has taken a major part in many clinical text mining challenges. Goran leads the UK healthcare text analytics network (Healtex), which has been funded by EPSRC to identify the main challenges in processing healthcare free text. He is also the Editor-in-Chief of the Journal of Biomedical Semantics.

In addition to biomedicine, Goran is also interested in integrative data analytics that combines multi-modal data streams to uncover new patterns in other domains (e.g. Connected Health Cities focusing on linking wellbeing and citizen data; the EPSRC-funded HOME-Offshore project to analyse renewable energy data signals).