What is Reinforcement Learning and what is it capable of?

Credit: CIO.com

Insilico Medicine identified potential new drugs in just 21 days: what usually takes eight years was reduced to three weeks with reinforcement learning. But how?

“We’ve got AI strategy combined with AI imagination,” Insilico CEO Alex Zhavoronkov told Forbes. The Hong Kong-based medicine company recently published research claiming that its GENTRL system could identify potential treatments for fibrosis in just 21 days. That’s a level of efficiency that any industry dreams of, let alone healthcare.

Zhavoronkov reportedly became interested in Ian Goodfellow’s work in machine learning. This informed the direction of the company, which went on to research and develop a reinforcement learning AI capable of identifying potential drugs in just three weeks.

The traditional process to develop drug candidates takes over eight years. It also costs millions of dollars, whereas Insilico’s method costs approximately $150,000 to implement. Developing drugs means screening molecules: Insilico’s vision was that if a machine could do this, it would save a lot of time and effort all round.

Insilico Medicine is not the only example of what Zhavoronkov describes as a marriage between imagination and strategy. AlphaGo Zero successfully taught itself to improve at the game of Go, combining a neural network with a search algorithm to predict moves. In the paper ‘Reinforcement learning-based multi-agent system for network traffic signal control’, researchers tested multi-agent reinforcement learning for a more efficient traffic light system.

Even Twitter is set to use reinforcement learning to cut down on fake news.

How does reinforcement learning work?

Reinforcement learning is a seriously powerful AI method and it’s quite independent in comparison to supervised learning. Unlike supervised learning, you needn’t present labelled input/output pairs: the focus is instead on striking a balance between exploring for new data and exploiting the data already gathered.

Consider Pac-Man, for a minute. In the iconic 80s arcade game, the titular character has to collect dots, avoid ghosts and select rewards that flash up on the screen.

Pac-Man is in a perpetual battle of exploration and exploitation. He can choose to exploit the small dots near to him to rack up points and even aim for the bigger dots if they are near to him. However, should he explore the maze a little further, he can pick up even more points from eating the ghosts when he’s energised: this is a risky strategy as, for a while, he’s chasing his predator and could be in danger when the energiser wears off.

The game of Pac-Man has similarities with the essentials of reinforcement learning. / Credit: KnowYourMeme

This is an example of the exploitation/exploration trade-off: the idea that a gamble to explore may reward you more. It’s a cornerstone of computer science philosophy.

Supervised learning relies on the data provided: in reinforcement learning, the AI has to pick up the data itself as it goes, rather like Pac-Man has to eat his way through flashing dots. So the actions your AI takes, like Pac-Man, inform the data that gets collected: sometimes it’s worth considering new actions to gather new data – exploring – whereas other times, an AI will exploit the data it has.

To exploit or explore

Choosing whether to exploit or explore randomly is not the most efficient way to produce results. Wouldn’t it be better if an AI could be more accurate – more greedy, in fact – and find the highest value of an action without having to explore so much?

Problems like this are formally described as Markov Decision Processes. The simplest case involves a single choice, repeated over and over.

Say the AI is faced with a choice among a number (k) of different actions. After each choice, depending on the action, the AI may get a reward. The AI’s aim is to collect the biggest total reward possible. This is what’s known as the k-armed bandit problem, a reference to slot machines and a continuation of the arcade theme. The AI keeps pulling on the lever to maximise its jackpot, so to speak.
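To make the slot-machine picture concrete, here is a minimal Python sketch of a k-armed bandit. The class name KArmedBandit, the normally distributed payouts and the fixed seed are illustrative assumptions, not details of any system mentioned in this article.

```python
import random


class KArmedBandit:
    """A toy slot machine with k levers; each lever pays a noisy reward."""

    def __init__(self, k, seed=0):
        self.rng = random.Random(seed)
        # Hidden average payout of each lever (the AI never sees these directly).
        self.true_means = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, action):
        # Reward = the lever's hidden average plus some random noise.
        return self.rng.gauss(self.true_means[action], 1.0)


bandit = KArmedBandit(k=5)

# Pull the same lever many times: the average reward approaches its hidden mean.
rewards = [bandit.pull(action=2) for _ in range(1000)]
average = sum(rewards) / len(rewards)
print(f"observed average: {average:.2f}, hidden mean: {bandit.true_means[2]:.2f}")
```

A single pull says very little about a lever, which is why the AI has to keep sampling before it can trust what it has seen.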


So, if we can work out the value of each of the k actions, we can always select the action with the highest value. It’s fair to assume we don’t know the true action values, but we can estimate them. At any one time, at least one action must have the greatest estimated value.

These are what are known as “greedy actions”: when you select one of these actions, you are exploiting your current knowledge of the values of the actions. If you choose to gamble and go “non-greedy”, this is exploring. Exploitation maximises the expected reward of the next step, but exploration may produce greater reward in the long run. Exploration is necessary because we can never be sure how precise action-value estimates are.
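One common way to mix greedy and non-greedy choices is the epsilon-greedy rule: explore with a small probability, otherwise take the greedy action. The sketch below is a minimal illustration under assumed settings (three actions, noisy rewards, a 10% exploration rate); the function name and every number in it are hypothetical.

```python
import random


def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k  # current estimated value of each action
    counts = [0] * k       # how many times each action has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(k)  # explore: pick any action at random
        else:
            # exploit: pick the action with the highest estimated value
            action = max(range(k), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        # Running average: move the estimate a little towards the new reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


# Hidden average rewards, assumed for illustration; action 1 is the best one.
estimates, counts = epsilon_greedy([0.2, 1.0, -0.5])
print("estimated values:", [round(v, 2) for v in estimates])
print("times chosen:", counts)
```

Setting epsilon to zero makes the AI purely greedy, which risks locking on to whichever action happened to look good early on.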

Exploration and exploitation revolve around reward and regret; this is as true of computer science as it is of ordering something new from the menu or leaving a job you’re happy in for more money. An AI wants to maximise cumulative reward and minimise total regret.
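Regret can be put into numbers: for each choice, it is the gap between the average reward of the best available action and the average reward of the action actually taken. The short sketch below adds those gaps up; the function name and the reward figures are made up for illustration.

```python
def total_regret(true_means, actions):
    """Sum, over every choice made, of how much worse it was than the best action."""
    best = max(true_means)
    return sum(best - true_means[a] for a in actions)


# Hidden average rewards of three actions (assumed values); action 1 is best.
true_means = [0.2, 1.0, -0.5]

regret_best = total_regret(true_means, [1] * 10)          # always the best action
regret_mixed = total_regret(true_means, [0, 1, 2, 1, 1])  # two exploratory detours

print(regret_best)             # no regret when every choice is the best one
print(round(regret_mixed, 2))  # the two detours cost about 0.8 and 1.5
```

In practice the AI does not know the hidden averages, so regret is a yardstick for judging a strategy rather than something it can compute while it plays.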

We want algorithms that drive regret towards zero: deep neural nets help here because they can approximate the extremely complex functions involved.

Reinforcement learning is entering the fray

Supervised learning is still the dominant technique in artificial intelligence. Examples of big companies employing reinforcement learning are still pretty rare but are growing steadily: reinforcement learning has long been an academic research subject, shunned in favour of more straightforward frameworks.

If reinforcement learning sounds complex, that’s because it is: very. It demands an enormous skillset, gargantuanly complex algorithms and accurate simulations of real-world environments.

The crux of reinforcement learning is an accessible one, though: a dilemma similar to the ones we face as individuals in our everyday lives. Do we stick or twist? It’s a question we ask ourselves regularly, yet until now, very few have been willing to invest in what has long been seen as a risky technique.

Insilico Medicine is just one recent example of how reinforcement learning can lead to incredible new discoveries. Just like with the technique itself, the journey will be formative. Reinforcement learning may be a complex topic, only just stepping into its spotlight, but with risk, there always comes a lot of reward.

Luke Conrad

Technology & Marketing Enthusiast
