Training Deep Learning AI to Predict microRNA-Gene Interactions

Apr 3, 2025 | Life Sciences & Biology, Medical & Health Sciences

Non-coding microRNAs (miRNAs) have important regulatory functions but are also implicated in various diseases. Mr Seoung-Won Yoon, PhD candidate at Chungnam National University, Republic of Korea, is training deep learning AI models to predict miRNA-gene associations. His research has implications for understanding disease pathogenesis, particularly cancer, and repurposing drugs for untreatable diseases.

The Diverse miRNAome

Think of RNA, and you probably think of messenger RNA (mRNA), transfer RNA (tRNA), or perhaps even ribosomal RNA. However, RNA’s scope extends far beyond protein synthesis. Cells contain a plethora of microRNA (miRNA) molecules – small non-coding single-stranded RNAs, around 22 nucleotides long. miRNAs regulate gene expression by hybridising to mRNA gene transcripts, usually to the mRNA’s 3’ untranslated region (3’-UTR). This leads to gene silencing via various mechanisms – repressing translation, promoting mRNA deadenylation, or triggering mRNA cleavage.

The miRNAome is turning out to be more diverse than was ever thought possible. At least 2,000 human miRNAs have been identified as being important to survival. miRNAs have been found to be pivotal to a host of biological processes, including cell division, differentiation and death, nervous system development, immunity, and signal transduction. Conversely, miRNAs have been implicated in cellular dysfunction, leading to diseases such as cancers. Despite the physiological significance of miRNAs, their interactions with gene mRNA transcripts are not fully known. Hoping to elucidate these interactions is Mr Seoung-Won Yoon, a graduate student at Chungnam National University, South Korea.

Probing the miRNAome with Deep Learning AI

Mr Yoon is a PhD candidate in Professor Kyuchul Lee’s group at Chungnam’s Department of Computer Science and Engineering. The group has previously applied advanced computational approaches to various real-world applications, including databases and the internet-of-things. In recent years, they have turned their computational expertise to bioinformatics applications.

Because miRNAs and gene transcripts cannot be observed directly, wet lab experiments on their interactions are complicated, time-consuming, and expensive, which makes them inadequate for studying these interactions. Instead, Mr Yoon and the group are deploying deep learning models to predict human miRNA-gene associations. Given the diversity of the miRNAome and putative gene targets, data-heavy computational methods are needed. Deep learning lends itself well to this, as it can identify sequence and mechanistic features of miRNAs and genes, and predict relationships.

Machine learning (ML) is a branch of AI with enormous potential in the biological sciences. In ML, algorithms learn from data and generalise to unseen data. Artificial neural networks (NNs) are a type of ML model consisting of a network of connected nodes. In NNs, a signal is taken in via an input layer, passed through hidden layers, and then out through an output layer. A common analogy is that of the brain, with nodes representing neurons, and links between nodes representing synapses. Deep learning refers to an advanced type of NN with multiple hidden layers. Just like a brain, it’s capable of advanced learning.

Positive Training

Much like our own brains, AI must be trained for optimum ‘cognitive’ performance – no mean feat, as massive datasets are needed! To train their deep learning models, Mr Yoon and the group drew on three data sources. First, they used a dataset of proven positive miRNA-gene associations taken from miRTarBase, a curated database of over half a million miRNA-gene association pairs. After filtering out duplicates, 380,634 pairs remained. They also used two additional resources, miRBase and biomaRt. miRBase contains sequence and annotation information for miRNAs. biomaRt provides open access to gene sequences from the European Bioinformatics Institute. From the 380,634 pairs, miRNAs and genes lacking sequence information in miRBase or biomaRt were excluded. This left 2,656 miRNA and 14,319 gene sequences, giving 38,031,264 possible miRNA-gene pairs (2,656 × 14,319), of which 358,864 are known positive miRNA-gene relationships.
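The arithmetic behind these figures can be checked with a short sketch. The counts are those quoted above; the helper function, the variable names, and the toy records are illustrative assumptions, not the group's code or real database entries.

```python
# Counts reported in the article.
N_MIRNAS = 2_656
N_GENES = 14_319
N_POSITIVE = 358_864

# In principle, every miRNA can be paired with every gene.
candidate_pairs = N_MIRNAS * N_GENES
positive_fraction = N_POSITIVE / candidate_pairs


def keep_pairs_with_sequences(pairs, mirna_seqs, gene_seqs):
    """Drop duplicate pairs and pairs lacking a sequence (hypothetical helper)."""
    seen = set()
    kept = []
    for mirna, gene in pairs:
        if (mirna, gene) in seen:
            continue  # duplicate record
        seen.add((mirna, gene))
        if mirna in mirna_seqs and gene in gene_seqs:
            kept.append((mirna, gene))
    return kept


# Toy records, invented purely for illustration.
toy_pairs = [("miR-a", "GENE1"), ("miR-a", "GENE1"),
             ("miR-b", "GENE2"), ("miR-c", "GENE1")]
toy_mirna_seqs = {"miR-a": "UGAGGUAG", "miR-b": "UAGCUUAU"}  # miR-c has no sequence
toy_gene_seqs = {"GENE1": "ATGGCC", "GENE2": "ATGTTT"}
toy_kept = keep_pairs_with_sequences(toy_pairs, toy_mirna_seqs, toy_gene_seqs)

print(candidate_pairs)  # 38031264
print(toy_kept)         # [('miR-a', 'GENE1'), ('miR-b', 'GENE2')]
```

Known positives make up less than 1% of all possible pairs, which is why the choice of negative examples matters so much in the steps that follow.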

The next step was ‘data embedding’. This involves extracting and vectorising sequence features of the 2,656 miRNAs and 14,319 genes. Each data element was embedded in 64 dimensions.
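To give a feel for what vectorising a sequence means, here is a minimal sketch. The article does not say how the group's embedding was computed, so this uses a simple stand-in: counting overlapping 3-letter fragments (3-mers). With four bases there are 4³ = 64 possible 3-mers, so the vector has 64 dimensions by construction only. The function name and the example sequence are assumptions.

```python
from itertools import product

ALPHABET = "ACGT"
K = 3  # 4**3 = 64 possible 3-mers (an assumed stand-in, not the group's method)
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}


def embed_sequence(seq):
    """Return a 64-dimensional vector of normalised 3-mer frequencies."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA letters alike
    counts = [0.0] * len(KMER_INDEX)
    for start in range(len(seq) - K + 1):
        kmer = seq[start:start + K]
        if kmer in KMER_INDEX:  # skip windows with ambiguous bases such as N
            counts[KMER_INDEX[kmer]] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# A made-up 22-nucleotide sequence, used only to exercise the function.
vector = embed_sequence("UGAGGUAGUAGGUUGUAUAGUU")
print(len(vector))  # 64
```

Learned embeddings, as used in deep learning, capture far richer features than fragment counts, but the principle is the same: each sequence becomes a fixed-length list of numbers that a model can compare and combine.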

Balancing the Positive with the Negative

When training deep learning models, it’s beneficial to have a balanced dataset – not only positive data but also negative data (pairs with no interaction). As negative data do not exist ready-made, they must be constructed. Negative data may be generated randomly, but this is not the most robust way. It’s better to generate negative data methodically using sophisticated criteria. To generate the negative data, the 358,864 positive relationships were removed from the 38,031,264 possible pairs. Further pairs were filtered out using ‘distance’ criteria (Euclidean distance, cosine similarity, and Mahalanobis distance) – this works because the embedded data exist in vector space. After filtering, 4,932,554 negative candidates remained. From these, 358,864 negative pairs were randomly selected to exactly balance the positive pairs (1:1 ratio). This yielded 717,728 data elements (358,864 + 358,864), each represented in 124 dimensions. According to the group, this is the largest sequence dataset constructed to date for miRNA-gene associations.
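The negative-sampling procedure can be sketched as follows. The article names the distance measures but not how they were applied, so the rule here is an assumption: a candidate is kept only if it is far from every known positive pair. The thresholds and toy vectors are also assumptions, and Mahalanobis distance is left out because it needs a covariance matrix.

```python
import math
import random
from itertools import product


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sample_negatives(mirna_vecs, gene_vecs, positives, min_dist, max_cos, seed=0):
    """Pick as many negatives as positives from pairs unlike every positive."""
    pos_vecs = [mirna_vecs[m] + gene_vecs[g] for m, g in positives]
    candidates = []
    for m, g in product(sorted(mirna_vecs), sorted(gene_vecs)):
        if (m, g) in positives:
            continue  # known interactions can never be negatives
        vec = mirna_vecs[m] + gene_vecs[g]  # concatenate the two embeddings
        if all(euclidean(vec, p) >= min_dist and cosine_similarity(vec, p) <= max_cos
               for p in pos_vecs):
            candidates.append((m, g))
    rng = random.Random(seed)
    return rng.sample(candidates, min(len(positives), len(candidates)))


# Toy 2-dimensional embeddings, invented for illustration.
mirna_vecs = {"miR-a": [1.0, 0.0], "miR-b": [0.0, 1.0], "miR-c": [0.9, 0.1]}
gene_vecs = {"GENE1": [1.0, 0.0], "GENE2": [0.0, 1.0]}
positives = {("miR-a", "GENE1")}

negatives = sample_negatives(mirna_vecs, gene_vecs, positives,
                             min_dist=1.0, max_cos=0.6)
print(len(negatives))  # 1
```

In this toy example the pair (miR-c, GENE1) can never be chosen as a negative, because its vector lies almost on top of the known positive. Filtering of this kind reduces the risk of labelling an undiscovered true interaction as a negative.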

Unidirectional and Bidirectional Deep Learning

Mr Yoon and the group investigated two different types of deep learning model – long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM). Both are recurrent NNs (RNNs), a type of NN suitable for sequential data. LSTMs have a feature known as cell state that allows them to ‘remember’ earlier inputs and use them to predict future data points. Traditional LSTMs process data unidirectionally, front-to-back. In contrast, Bi-LSTMs process data both forwards and backwards.
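The difference between the two models can be illustrated with a bare-bones LSTM written from scratch. This is a teaching sketch under simplifying assumptions (random untrained weights, no bias terms, a single layer, toy inputs), not the group's architecture.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm(input_size, hidden_size, seed):
    """Random weights for the four gates: input, forget, candidate, output."""
    rng = random.Random(seed)
    width = input_size + hidden_size
    return {
        gate: [[rng.uniform(-0.5, 0.5) for _ in range(width)]
               for _ in range(hidden_size)]
        for gate in ("i", "f", "g", "o")
    }


def lstm_step(weights, x, h, c):
    """One time step: update the cell state c and the hidden state h."""
    xh = x + h  # concatenate the input with the previous hidden state

    def pre(gate):
        return [sum(w * v for w, v in zip(row, xh)) for row in weights[gate]]

    i = [sigmoid(v) for v in pre("i")]    # how much new information to write
    f = [sigmoid(v) for v in pre("f")]    # how much old cell state to keep
    g = [math.tanh(v) for v in pre("g")]  # candidate values to write
    o = [sigmoid(v) for v in pre("o")]    # how much of the cell to expose
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def run_lstm(weights, sequence, hidden_size):
    """Read the sequence front-to-back and return the final hidden state."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for x in sequence:
        h, c = lstm_step(weights, x, h, c)
    return h


def run_bilstm(fwd_weights, bwd_weights, sequence, hidden_size):
    """Read the sequence in both directions and join the two summaries."""
    forward = run_lstm(fwd_weights, sequence, hidden_size)
    backward = run_lstm(bwd_weights, sequence[::-1], hidden_size)
    return forward + backward


HIDDEN = 3
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-dimensional inputs
fwd = make_lstm(input_size=2, hidden_size=HIDDEN, seed=1)
bwd = make_lstm(input_size=2, hidden_size=HIDDEN, seed=2)
uni = run_lstm(fwd, sequence, HIDDEN)
bi = run_bilstm(fwd, bwd, sequence, HIDDEN)
print(len(uni), len(bi))  # 3 6
```

The bidirectional version needs a second set of weights and a second pass over the sequence, so it roughly doubles the computation per layer for a summary twice as long.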

The group tested an LSTM with three layers and a Bi-LSTM with two layers. The 717,728 data elements were divided into training data and test data in an 8:2 ratio, and fed into both models. Which model performs better in predicting miRNA-gene associations? In principle, this should be the Bi-LSTM, as the bidirectional information flow provides a richer representation. However, this comes at a computational cost, and the group found that the Bi-LSTM was slower to train than the LSTM. They therefore judged the simpler yet faster LSTM to be more appropriate for the miRNA-gene dataset and selected it as their deep learning model.
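An 8:2 split simply means shuffling the labelled examples and holding back one fifth for testing. A minimal sketch, assuming a plain random shuffle with a fixed seed (the article does not describe how the split was drawn), with made-up example records:

```python
import random


def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold back a fraction of them for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Ten made-up labelled pairs: label 1 = interaction, 0 = no interaction.
examples = [(f"pair-{i}", i % 2) for i in range(10)]
train, test = split_train_test(examples)
print(len(train), len(test))  # 8 2
```

Keeping the test examples completely out of training is what makes the final score an honest estimate of how the model handles pairs it has never seen.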

Assessing Model Performance

A metric known as the area under the receiver operating characteristic curve (AUC) may be used to assess the performance of deep learning models. This weighs the rate of ‘true positives’ against the rate of ‘false positives’ across all classification thresholds. Using a statistical method called K-fold cross-validation, the group determined that the LSTM model’s average AUC was 0.98 – close to 1.0 (the maximum), indicating very good generalisation performance. Finally, they validated the model, confirming that it is able to uncover novel miRNA-gene association pairs not present in their positive training dataset.
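Both ideas can be sketched in a few lines. The AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which the function below computes directly (a slow but transparent pairwise comparison). The labels, scores, and the choice of five folds are illustrative assumptions; the article does not state the value of K.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_examples))
        yield train, test
        start += size


# Made-up labels and model scores for six pairs.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.4]
auc = roc_auc(labels, scores)  # 8 of 9 positive-negative pairs are ranked correctly

folds = list(k_fold_indices(n_examples=10, k=5))
print(round(auc, 3), len(folds))
```

In cross-validation, the model is trained and scored once per fold, each time testing on a different slice of the data, and the scores are then averaged. An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.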

Predicting BRCA2-associated miRNAs

The influence of miRNAs on cancer is a lively field of research. Variants or mutations of certain genes are implicated in various cancers. How do miRNA-gene interactions contribute to this? This is an important research question for Mr Yoon. BRCA2 is a human gene encoding a protein that repairs DNA replication errors and regulates the cell cycle. BRCA2 mutations can lead to malignant cells with unrepaired DNA damage, and are implicated in breast, ovarian, and prostate cancers. The group used their deep learning model to predict miRNAs that associate with BRCA2. Notably, among the top 10 predicted candidates were miRNAs with known associations with prostate, ovarian, breast, and cervical cancers.

Going forward, Mr Yoon and the group want to further train the deep learning model for better prediction of miRNA-gene associations, and understand how these translate to disease phenotypes. Beyond BRCA2, they will focus on incurable and intractable diseases, deploying deep learning approaches to elucidate the pathogenetic entities involved. They hope to apply their deep learning insights to repurposing drugs to treat currently incurable diseases.

REFERENCE

https://doi.org/10.33548/SCIENTIA1185

MEET THE RESEARCHERS


Mr Seoung-Won Yoon
Chungnam National University, Yuseong-Gu, Daejeon, Republic of Korea

Mr Seoung-Won Yoon is a PhD Candidate at Chungnam National University, Republic of Korea, under the supervision of Professor Kyuchul Lee in the Computer Science & Engineering Department. Mr Yoon’s research involves developing deep learning models, primarily for bioinformatics and drug repurposing. He has so far participated in more than 10 nationally funded projects. His work on deep learning has included model development for pancreatic cancer-related genes, miRNA and mRNA prediction, and predicting the relationships between bio-genetic data. Additionally, he has conducted research projects in risk prediction, evaluating the performance of user movement path predictions, and assessing similarity patterns in user data. Furthermore, he has worked on an internet-of-things-based wearable flu vaccine project aimed at preventing the spread of influenza.

CONTACT

E: yoonenoch11@gmail.com


Professor Kyuchul Lee
Chungnam National University, Yuseong-Gu, Daejeon, Republic of Korea

Professor Kyuchul Lee received his Bachelor’s degree in Computer Science from Seoul National University in 1984 and a PhD in the same field from the same university in 1990. He has been a faculty member at Chungnam National University since 1989 and has also served as a visiting researcher at IBM Almaden Center and a visiting professor at Syracuse University. At the time of writing, he has led 144 national research projects, published 56 international and 125 domestic journal papers, given 97 international and 206 domestic conference presentations, and holds 46 patents and intellectual property records. Additionally, he has facilitated 12 technology transfers. As an advisor in the Data Artificial Intelligence Research Lab (formerly Database Systems Lab), he has mentored over 110 Master’s and PhD students, developing core technologies in the fields of databases and AI. His research spans multimedia data, XML, semantic web, IoT, AI, and big data. Through courses like File Processing, Database Systems, and Web Programming, he has equipped students with the knowledge to excel in their professional careers.

FUNDING

National Research Foundation of Korea (NRF)

Korean Government (MSIT)

FURTHER READING

S Yoon, I Hwang, J Cho, et al., miGAP: miRNA–Gene Association Prediction Method Based on Deep Learning Model, Applied Sciences, 2023, 13(22), 12349. DOI: https://doi.org/10.3390/app132212349

Creative Commons Licence (CC BY 4.0)

This work is licensed under a Creative Commons Attribution 4.0 International License.

What does this mean?

Share: You can copy and redistribute the material in any medium or format

Adapt: You can change, and build upon the material for any purpose, even commercially.

Credit: You must give appropriate credit, provide a link to the license, and indicate if changes were made.
