What is Reinforcement Learning and what is it capable of?

[Image: reinforcement learning and AI. Credit: CIO.com]

Insilico Medicine created a drug in just 21 days: what usually takes eight years was reduced to three weeks with reinforcement learning. But how?

“We’ve got AI strategy combined with AI imagination,” Insilico CEO Alex Zhavoronkov told Forbes. The Hong Kong-based medicine company recently published research claiming that its GENTRL system could identify potential treatments for fibrosis in just 21 days. That’s a level of efficiency that any industry dreams of, let alone healthcare.

Zhavoronkov reportedly became interested in Ian Goodfellow’s work in machine learning. This informed the direction of the company, which went on to research and develop a reinforcement learning AI capable of creating a drug in just three weeks.

The traditional process of developing drug candidates takes over eight years and costs millions of dollars; Insilico’s method, by comparison, costs approximately $150,000 to implement. Developing drugs requires molecules to be screened: Insilico’s vision was that if a machine could do this, it would save a lot of time and effort all round.

Insilico Medicine is not the only example of what Zhavoronkov describes as a marriage between imagination and strategy. AlphaGo Zero successfully taught itself to improve at the game of Go, combining a neural network with a search algorithm to predict moves. In the paper ‘Reinforcement learning-based multi-agent system for network traffic signal control’, researchers tested multi-agent reinforcement learning for a more efficient traffic light system.

Even Twitter is set to use reinforcement learning to cut down on fake news.

How does reinforcement learning work?

Reinforcement learning is a seriously powerful AI method, and it’s quite independent in comparison to supervised learning. Unlike supervised learning, it doesn’t need labelled input/output pairs: the focus is instead on striking a balance between exploring for new data and exploiting what has already been learned.

Consider Pac-Man, for a minute. In the iconic 80s arcade game, the titular character has to collect dots, avoid ghosts and select rewards that flash up on the screen.

Pac-Man is in a perpetual battle of exploration and exploitation. He can choose to exploit the small dots near to him to rack up points and even aim for the bigger dots if they are near to him. However, should he explore the maze a little further, he can pick up even more points from eating the ghosts when he’s energised: this is a risky strategy as, for a while, he’s chasing his predator and could be in danger when the energiser wears off.

[Image: The game of Pac-Man has similarities with the essentials of reinforcement learning. Credit: KnowYourMeme]

This is an example of the exploitation/exploration trade-off: the idea that a gamble to explore may reward you more. It’s a cornerstone of computer science philosophy.

Supervised learning relies on the data provided: in reinforcement learning, the AI has to pick up the data itself as it goes, rather like Pac-Man has to eat his way through flashing dots. So the actions your AI takes, like Pac-Man, inform the data that gets collected: sometimes it’s worth considering new actions to gather new data – exploring – whereas other times, an AI will exploit the data it has.

To exploit or explore

Choosing whether to exploit or explore randomly is not the most efficient way to produce results. Wouldn’t it be better if an AI could be more accurate – more greedy, in fact – and find the highest value of an action without having to explore so much?

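Decision problems of this kind are usually formalised as a Markov Decision Process, in which an agent repeatedly observes a state, chooses an action and receives a reward.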

Say the AI is faced with a choice among a number (k) of different actions. After each choice, depending on the action, the AI may get a reward. The AI’s aim is to receive the biggest reward possible. This is what’s known as the k-armed bandit problem, a reference to slot machines and a continuation of the arcade theme. The AI keeps pulling on the lever to maximise its jackpot, so to speak.
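
To make the slot-machine picture concrete, here is a minimal sketch of a k-armed bandit in Python. The class name `KArmedBandit`, the Gaussian payouts and the fixed seed are illustrative assumptions, not something taken from Insilico or any particular library.

```python
import random


class KArmedBandit:
    """A toy k-armed bandit: each arm pays a noisy reward around a hidden mean.

    Assumption for illustration: rewards are Gaussian with unit variance.
    """

    def __init__(self, k, seed=0):
        self.rng = random.Random(seed)
        # Hidden from the agent: the true average payout of each arm.
        self.true_means = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, arm):
        """Pull one lever and return a noisy reward for that arm."""
        return self.rng.gauss(self.true_means[arm], 1.0)


bandit = KArmedBandit(k=5)

# Pulling the same lever many times reveals its average payout.
rewards = [bandit.pull(arm=2) for _ in range(1000)]
average = sum(rewards) / len(rewards)
print(f"observed average for arm 2: {average:.2f}")
print(f"hidden true mean for arm 2: {bandit.true_means[2]:.2f}")
```

The agent never sees `true_means`; it only sees the rewards that come back from `pull`, which is why it has to estimate.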


So, if we knew the value of each of the k actions, we could always select the action with the highest value. In practice we don’t know the action values, but we can estimate them. At any one time, at least one action must have the greatest estimated value.
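
One common way to write this down, using notation that is an assumption here rather than the article’s own (R for reward, A for the chosen action, t for the time step), is to define the true value of an action, estimate it as the average reward seen so far for that action, and pick the action with the highest estimate:

```latex
q_*(a) = \mathbb{E}\left[ R_t \mid A_t = a \right],
\qquad
Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \,\mathbf{1}[A_i = a]}{\sum_{i=1}^{t-1} \mathbf{1}[A_i = a]},
\qquad
A_t^{\text{greedy}} = \arg\max_{a} Q_t(a)
```

Here q_*(a) is the unknown true value and Q_t(a) is the estimate, which is only defined once action a has been tried at least once.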

These are what are known as “greedy actions”: when you select one of them, you are exploiting your current knowledge of the values of the actions. If you choose to gamble and go “non-greedy”, you are exploring. Exploitation maximises the expected reward on the next step, but exploration may produce greater reward in the long run. Exploration is necessary because we can never be sure how precise our action-value estimates are.
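
A simple way to mix greedy and non-greedy choices is the epsilon-greedy rule: explore at random with a small probability, otherwise take the greedy action. The sketch below is one possible version, assuming Gaussian rewards and made-up payout means; the function name `epsilon_greedy` and its parameters are illustrative.

```python
import random


def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Play a bandit with the epsilon-greedy rule and return what was learned.

    true_means is hidden from the decision rule; it is only used to
    generate rewards (assumed Gaussian with unit variance).
    """
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k  # current estimated value of each action
    counts = [0] * k       # how often each action has been taken
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: pick any action at random
        else:
            # Exploit: pick the action with the highest estimated value.
            arm = max(range(k), key=lambda a: estimates[a])

        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the running average reward for this action.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, counts, total_reward


true_means = [0.2, 0.5, 1.5, 0.0]  # hypothetical payouts; arm 2 is best
estimates, counts, total = epsilon_greedy(true_means)
print("pulls per arm:", counts)
print("estimated values:", [round(q, 2) for q in estimates])
```

With `epsilon=0` the rule never explores, so it can lock onto a mediocre arm whose early estimate happened to look good. A small amount of exploration is what lets the estimates for the other arms improve.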

Exploration and exploitation revolve around reward and regret; this is true of computer science, ordering something new from the menu or leaving a job you’re happy in for more money. An AI wants to maximise cumulative reward and minimise total regret.

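We want algorithms that keep regret as low as possible, so that the average regret per decision shrinks towards zero over time: deep neural nets help here because they can approximate the extremely complex functions involved.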
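
Regret can be computed directly in a toy setting where the true payouts are known. The sketch below assumes regret means the gap between the best arm’s average payout and the payout of the arm actually chosen, summed over time; the function name and the numbers are illustrative.

```python
def cumulative_regret(true_means, chosen_arms):
    """Return the running total of regret after each choice.

    Regret for one step is the expected reward of the best arm minus
    the expected reward of the arm that was actually chosen.
    """
    best = max(true_means)
    running_total = 0.0
    history = []
    for arm in chosen_arms:
        running_total += best - true_means[arm]
        history.append(running_total)
    return history


true_means = [0.2, 0.5, 1.5, 0.0]  # hypothetical payouts; arm 2 is best
choices = [0, 1, 2, 2, 3, 2]       # a made-up sequence of decisions
regret = cumulative_regret(true_means, choices)
print([round(r, 2) for r in regret])
```

Choosing the best arm adds nothing to the total, so the curve goes flat whenever the agent exploits correctly and steps up whenever it picks a worse arm.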

Reinforcement learning is entering the fray

Supervised learning is still the dominant technique in artificial intelligence. Examples of big companies employing reinforcement learning are still pretty rare but are growing steadily: reinforcement learning has long been an academic research subject, shunned in favour of more straightforward frameworks.

If reinforcement learning sounds complex, that’s because it is: very. It demands an enormous skillset, enormously complex algorithms and accurate simulations of real-world environments.

The crux of reinforcement learning is an accessible one, though: a dilemma similar to the ones we face as individuals in our everyday lives. Do we stick or twist? It’s a question we ask ourselves regularly, yet until now, very few have been willing to invest in what has long been seen as a risky technique.

Insilico Medicine is just one recent example of how reinforcement learning can lead to incredible new discoveries. Just like with the technique itself, the journey will be formative. Reinforcement learning may be a complex topic, only just stepping into its spotlight, but with risk, there always comes a lot of reward.


Luke Conrad

Technology & Marketing Enthusiast
