- Project Title: From natural language understanding to content recommendations
- 20 word summary: Build a recommender system that derives semantics from textual interactions between humans and smart assistants and uses them to surface relevant BBC content
- Length of internship project: 3 months (June-September 2019)
- Stipend: £1500/month (£4500 total)
Research/Activity to be undertaken
We want to understand what users mean when they request content from smart assistants (for example, Voice-controlled agents) and how those requests translate into surfacing relevant BBC content. With this project we aim to develop a recommender system that extracts semantics from short text input and uses this information to search and surface content from a database of millions of items. Natural language understanding methods, especially entity recognition, will be used to process textual input. Content-based approaches will be used to link the semantics derived from text requests to properties of items in the database. As part of this project, a benchmark dataset of example user requests and their expected content results will also be created. This is part of a greater effort at BBC Datalab to develop machine learning systems that focus on and empower the user.
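To make the intended pipeline concrete, the following is a minimal Python sketch and not the project's actual design. It assumes a hand-written phrase lookup in place of a trained entity recogniser, an invented three-item catalogue in place of the BBC database, and Jaccard overlap between request entities and item tags as the content-based score. All names (`GAZETTEER`, `CATALOGUE`, `extract_entities`, `recommend`, `hit_rate_at_k`) are hypothetical.

```python
import re

# Hypothetical gazetteer mapping surface phrases to (entity type, canonical value).
# A real system would use a trained entity recogniser; this lookup is only a stand-in.
GAZETTEER = {
    "david attenborough": ("person", "David Attenborough"),
    "football": ("topic", "football"),
    "news": ("genre", "news"),
    "documentary": ("genre", "documentary"),
    "podcast": ("format", "podcast"),
}

# Hypothetical catalogue; ids, titles and tags are invented for illustration.
CATALOGUE = [
    {"id": "item-1", "title": "Ocean documentary",
     "tags": {"David Attenborough", "documentary", "nature"}},
    {"id": "item-2", "title": "Football news podcast",
     "tags": {"football", "news", "podcast"}},
    {"id": "item-3", "title": "Evening news bulletin",
     "tags": {"news"}},
]


def extract_entities(request):
    """Return the canonical entity values found in a short text request."""
    # Lowercase, replace punctuation with spaces, and collapse whitespace.
    text = " ".join(re.sub(r"[^a-z0-9 ]+", " ", request.lower()).split())
    padded = f" {text} "  # padding gives whole-word phrase matching
    return {
        value
        for phrase, (_entity_type, value) in GAZETTEER.items()
        if f" {phrase} " in padded
    }


def recommend(request, catalogue=CATALOGUE, k=2):
    """Rank items by Jaccard overlap between request entities and item tags."""
    entities = extract_entities(request)
    scored = []
    for item in catalogue:
        overlap = len(entities & item["tags"])
        if overlap:
            score = overlap / len(entities | item["tags"])
            scored.append((score, item["id"]))
    # Highest score first; ties are broken by item id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _score, item_id in scored[:k]]


def hit_rate_at_k(benchmark, k=2):
    """Fraction of (request, expected item id) pairs found in the top k."""
    hits = sum(1 for request, expected in benchmark
               if expected in recommend(request, k=k))
    return hits / len(benchmark)
```

For example, `recommend("Play a David Attenborough documentary")` returns `["item-1"]` under these toy assumptions. The `hit_rate_at_k` helper illustrates one way a dataset of requests and expected results could be used for benchmarking; other ranking metrics would be equally valid choices.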
How does this project fit with BBC’s strategy
Products served by voice-controlled agents are increasingly in demand, and Datalab is planning to work closely with the Voice team in the coming months. This project is an opportunity to explore what content recommendations for Voice could look like and to identify the biggest challenges in this domain. As this is a relatively new research area, any developments and datasets created through this project will be used to benchmark future models. Moreover, text mining approaches developed for this project could also benefit other BBC products, such as recommender systems for News and online search and discovery products. This project will also be an opportunity to test and give feedback on the Machine Learning platform recently developed by Datalab to facilitate data science work in the BBC. Finally, it is an opportunity to learn at a small scale before we take over recommendations for complex systems that serve millions of users.
This project fits into the Universal Recommendations initiative and aligns with the OKRs of Datalab. Building a recommender system for Voice-controlled agents increases the exposure of BBC content beyond TV and online products and offers personalised experiences to our audiences. This is in line with the goal to reinvent the BBC for a new generation and increase engagement with BBC content.
- Python programming skills
- Research experience on the topics of Natural Language Understanding, Machine Learning, Information Retrieval, Recommender Systems
- Knowledge of cloud computing services (GCP) and large data processing is a plus
- Good communication and presentation skills
Knowledge sharing opportunity
- Increase our knowledge of natural language understanding, including state-of-the-art and benchmarking approaches
- Extend our experience working with text-based recommender systems
- Test and improve our recently developed machine learning frameworks
- Understand the challenges of recommender systems serving Voice-controlled agents before applying knowledge to large-scale projects
Expected Outputs & Key Deliverables
- The expected outcome of this project is a recommender system suitable to serve content through Voice-controlled agents.
- Besides the model itself and the datasets created for benchmarking, the intern is also expected to deliver a presentation and a blog post explaining how the model works. The presentation will be used for internal communication and documentation, whereas the blog post will be shared with external communities.
- Depending on interest and time from both sides, the blog post could be extended into a research publication submitted to a conference or journal on the topic of recommender systems and natural language processing (e.g. RecSys, ACM Multimedia, or ACM SIGIR conferences).