Training Deep Learning AI to Predict microRNA-Gene Interactions

Apr 3, 2025 | Life Sciences & Biology, Medical & Health Sciences

Non-coding microRNAs (miRNAs) have important regulatory functions but are also implicated in various diseases. Mr Seoung-Won Yoon, PhD candidate at Chungnam National University, Republic of Korea, is training deep learning AI models to predict miRNA-gene associations. His research has implications for understanding disease pathogenesis, particularly cancer, and repurposing drugs for untreatable diseases.

The Diverse miRNAome

Think of RNA, and you probably think of messenger RNA (mRNA), transfer RNA (tRNA), or perhaps even ribosomal RNA. However, RNA’s scope extends far beyond protein synthesis. Cells contain a plethora of microRNA (miRNA) molecules – small non-coding single-stranded RNAs, around 22 nucleotides long. miRNAs regulate gene expression by hybridising to mRNA gene transcripts, usually to the mRNA’s 3’-UTR. This leads to gene silencing via various mechanisms – repressing translation, mRNA deadenylation, or activating mRNA cleavage.

The miRNAome is turning out to be more diverse than was ever thought possible. At least 2,000 human miRNAs have been identified as being important to survival. miRNAs have been found to be pivotal to a host of biological processes, including cell division, differentiation and death, nervous system development, immunity, and signal transduction. Conversely, miRNAs have been implicated in cellular dysfunction, leading to diseases such as cancers. Despite the physiological significance of miRNAs, their interactions with gene mRNA transcripts are not fully known. Hoping to elucidate these interactions is Mr Seoung-Won Yoon, a graduate student at Chungnam National University, South Korea.

Probing the miRNAome with Deep Learning AI

Mr Yoon is a PhD candidate in Professor Kyuchul Lee’s group at Chungnam’s Department of Computer Science and Engineering. The group has previously applied advanced computational approaches to various real-world applications, including databases and the internet-of-things. In recent years, they have turned their computational expertise to bioinformatics applications.

As miRNAs and gene transcripts cannot be observed directly, wet lab experiments are inadequate for studying their interactions, being complicated, time-consuming, and expensive. Instead, Mr Yoon and the group are deploying deep learning models to predict human miRNA-gene associations. Given the diversity of the miRNAome and putative gene targets, data-heavy computational methods are needed. Deep learning lends itself well to this, as it can identify sequence and mechanistic features of miRNAs and genes, and predict relationships.

Machine learning (ML) is a branch of AI with enormous potential in the biological sciences. In ML, algorithms learn patterns from the data they are given and generalise to data they have not seen. Artificial neural networks (NNs) are a type of ML model consisting of a network of connected nodes. In NNs, a signal is taken in via an input layer, passed through hidden layers, and then out through an output layer. A common analogy is that of the brain, with nodes representing neurons, and links between nodes representing synapses. Deep learning refers to an advanced type of NN with multiple hidden layers. Just like a brain, it’s capable of advanced learning.
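To make the layered picture concrete, here is a minimal sketch of a signal passing from an input layer through two hidden layers to an output node. The weights, layer sizes, and activation functions are hand-picked for illustration and are assumptions, not values from the group's model; in a real network they would be learned during training.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One layer: each node sums its weighted inputs, adds a bias, then activates."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(signal, layers):
    """Pass the signal through every layer in order (input -> hidden -> output)."""
    for weights, biases, activation in layers:
        signal = dense(signal, weights, biases, activation)
    return signal

# Hypothetical toy network: 2 inputs -> 2 hidden nodes -> 2 hidden nodes -> 1 output.
toy_layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0], relu),     # hidden layer 1
    ([[1.0, 1.0], [-1.0, 0.5]], [0.0, 0.0], relu),     # hidden layer 2
    ([[1.0, -2.0]], [0.0], sigmoid),                   # output layer
]

output = forward([1.0, 2.0], toy_layers)
print(output)
```

Having more than one hidden layer is what makes this toy network 'deep'; the sigmoid on the output node squeezes the result into the range 0 to 1 so it can be read as a score.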

Positive Training

Much like our own brains, AI must be trained for optimum ‘cognitive’ performance – no mean feat, as massive datasets are needed! To train their deep learning models, Mr Yoon and the group used three datasets. Firstly, they used a dataset of proven positive miRNA-gene associations. These were taken from miRTarBase, a curated database of over half a million miRNA-gene association pairs. After filtering out duplicates, 380,634 pairs remained. They used two additional databases, miRBase and biomaRt. miRBase contains sequence information and annotation information for miRNAs. biomaRt is an open dataset of gene sequences from the European Bioinformatics Institute. Of the 380,634 pairs, miRNAs and genes lacking sequence information on miRBase or biomaRt were excluded. This resulted in 2,656 miRNA and 14,319 gene sequences, giving a total of 38,031,264 possible miRNA-gene pairs (2,656 × 14,319), of which 358,864 were positive miRNA-gene relationships.
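The counts quoted above can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are those given in the text.

```python
# Counts taken from the text above.
n_mirnas = 2_656
n_genes = 14_319
n_positive = 358_864

# Every miRNA paired with every gene.
n_candidate_pairs = n_mirnas * n_genes

# Pairs with no recorded positive association.
n_unlabelled = n_candidate_pairs - n_positive

# Share of all possible pairs that are known positives.
positive_fraction = n_positive / n_candidate_pairs

print(n_candidate_pairs, n_unlabelled, round(positive_fraction * 100, 2))
```

Known positives make up just under 1% of all possible pairs, which is why the choice of negative examples matters so much for training.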

The next step is ‘data embedding’. This involves extracting and vectorising sequence features of the 2,656 miRNAs and 14,319 genes. Each data element was embedded in 64 dimensions.
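The article does not say which sequence features were extracted, so the sketch below is only a stand-in to show what 'vectorising a sequence' can look like. It assumes a simple 3-mer frequency encoding, chosen because the four bases give exactly 4 × 4 × 4 = 64 possible 3-mers; it should not be read as the group's embedding method. The function name and the example sequence are hypothetical.

```python
from itertools import product

ALPHABET = "ACGU"
# All 64 possible 3-mers in a fixed order, so each one owns a vector position.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=3)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def embed_sequence(sequence):
    """Return a 64-dimensional vector of 3-mer frequencies (illustrative only)."""
    seq = sequence.upper().replace("T", "U")  # treat DNA and RNA letters alike
    counts = [0] * len(KMERS)
    for start in range(len(seq) - 2):
        kmer = seq[start:start + 3]
        if kmer in KMER_INDEX:  # skip windows with ambiguous bases such as N
            counts[KMER_INDEX[kmer]] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]

# An arbitrary 22-nucleotide RNA string used purely as an example input.
mirna_vector = embed_sequence("UGAGGUAGUAGGUUGUAUAGUU")
print(len(mirna_vector))
```

Whatever features are used, the point is the same: sequences of different lengths end up as fixed-length vectors that can be compared and fed to a model.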

Balancing the Positive with the Negative

When training deep learning models, it’s beneficial to have a balanced dataset – not only positive data but also negative data (with no interaction). As negative data do not exist in nature, they must be curated. Negative data may be generated randomly, but this is not the most robust way. It’s better to generate negative data methodically using sophisticated criteria. To generate the negative data, the 358,864 positive relationships were first removed from the 38,031,264 possible pairs. Further pairs were filtered out using ‘distance’ criteria (Euclidean distance, cosine similarity, and Mahalanobis distance) – this works as the embedded data exist in vector space. After filtering, 4,932,554 negative candidates were obtained. From these, 358,864 negative pairs were randomly selected to exactly balance the positive pairs (1:1 ratio). This yielded 717,728 data elements (358,864 + 358,864), each representing 124 dimensions. This is the largest sequence dataset ever constructed for miRNA-gene associations.
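The selection procedure described above can be sketched in three steps: drop the known positives, filter the remainder by distance, then sample as many negatives as there are positives. The article does not give the thresholds or say what the distances were measured against, so this sketch assumes that each pair has a vector and that candidates must lie far from the centroid of the positive pairs. It uses only Euclidean distance and cosine similarity; Mahalanobis distance is left out for brevity. All names, thresholds, and toy vectors are hypothetical.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_negatives(vectors, positives, min_dist, max_cos, seed=0):
    """Pick len(positives) negative pairs that lie far from the positive centroid.

    vectors: dict mapping (mirna, gene) -> embedded vector for that pair.
    Raises ValueError if too few candidates survive the filters.
    """
    positive_set = set(positives)
    dims = len(next(iter(vectors.values())))
    centroid = [
        sum(vectors[p][d] for p in positive_set) / len(positive_set)
        for d in range(dims)
    ]
    candidates = [
        pair for pair, vec in vectors.items()
        if pair not in positive_set                      # step 1: drop positives
        and euclidean(vec, centroid) >= min_dist         # step 2: distance filters
        and cosine_similarity(vec, centroid) <= max_cos
    ]
    # Step 3: random sample giving a 1:1 ratio with the positives.
    return random.Random(seed).sample(sorted(candidates), len(positive_set))

# Toy two-dimensional vectors for six miRNA-gene pairs.
toy_vectors = {
    ("m1", "g1"): [1.0, 0.0], ("m2", "g2"): [0.9, 0.1],
    ("m1", "g2"): [0.95, 0.05], ("m1", "g3"): [-1.0, 0.0],
    ("m2", "g1"): [0.0, 1.0], ("m2", "g3"): [-0.5, -0.5],
}
toy_positives = [("m1", "g1"), ("m2", "g2")]
negatives = select_negatives(toy_vectors, toy_positives, min_dist=0.5, max_cos=0.5)
print(negatives)
```

In the toy data, the pair ("m1", "g2") sits right beside the positives and is filtered out, which illustrates why distance-based filtering is more cautious than picking negatives purely at random.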

Unidirectional and Bidirectional Deep Learning

Mr Yoon and the group investigated two different types of deep learning model – long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM). Both are recurrent NNs (RNNs), a type of NN suitable for sequential data. LSTMs have a feature known as cell state that allows them to ‘remember’ earlier inputs when predicting later data points. Traditional LSTMs process data unidirectionally, front-to-back. In contrast, Bi-LSTMs process data both forwards and backwards.
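The difference between the two directions can be seen in a stripped-down sketch. It runs a single-unit LSTM cell, with the standard input, forget, and output gates and a cell state, over a short sequence, then repeats the pass in reverse for the bidirectional case. The weights are arbitrary placeholders and the single-unit size is an assumption made for readability; the group's models are multi-layer networks trained on real data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_pass(xs, w):
    """Run a one-unit LSTM over xs and return the hidden state after each step."""
    h, c = 0.0, 0.0  # hidden state and cell state (the 'memory')
    hidden_states = []
    for x in xs:
        i = sigmoid(w["wi"] * x + w["ui"] * h + w["bi"])    # input gate
        f = sigmoid(w["wf"] * x + w["uf"] * h + w["bf"])    # forget gate
        o = sigmoid(w["wo"] * x + w["uo"] * h + w["bo"])    # output gate
        g = math.tanh(w["wg"] * x + w["ug"] * h + w["bg"])  # candidate memory
        c = f * c + i * g        # keep part of the old memory, add part of the new
        h = o * math.tanh(c)     # expose part of the memory as output
        hidden_states.append(h)
    return hidden_states

def bilstm_pass(xs, w_forward, w_backward):
    """Pair each position's forward state with its backward state."""
    forward = lstm_pass(xs, w_forward)
    backward = lstm_pass(xs[::-1], w_backward)[::-1]
    return list(zip(forward, backward))

# Placeholder weights: every input/recurrent weight 0.5, every bias 0.
weights = {name: 0.5 for name in ("wi", "ui", "wf", "uf", "wo", "uo", "wg", "ug")}
weights.update({name: 0.0 for name in ("bi", "bf", "bo", "bg")})

seq_a = [1.0, 0.0, 0.0]
seq_b = [1.0, 0.0, 5.0]  # differs from seq_a only at the last position
bi_a = bilstm_pass(seq_a, weights, weights)
bi_b = bilstm_pass(seq_b, weights, weights)
print(bi_a[0], bi_b[0])
```

At the first position, the forward states for the two sequences are identical because nothing later has been seen yet, while the backward states differ because they have already 'read' the end of the sequence. This is the richer context a Bi-LSTM provides, at the cost of running two passes.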

The group tested an LSTM with three layers and a Bi-LSTM with two layers. The 717,728 data elements were divided into training data and test data in an 8:2 ratio, and fed into both models. Which model has the better performance in predicting miRNA-gene associations? In principle, this should be the Bi-LSTM, as the bidirectional information flow provides a richer representation. However, this takes up computational power, and the group found that the Bi-LSTM had a slower training time than the LSTM. They therefore considered the simpler yet faster LSTM to be more appropriate for the miRNA-gene dataset and selected it as their deep learning model.
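An 8:2 split is straightforward to sketch: shuffle the data, then cut it four-fifths of the way through. The helper below is hypothetical, and the shuffling seed and the rounding of the cut point are assumptions, since the article does not describe how the split was carried out.

```python
import random

def split_train_test(items, train_parts=8, test_parts=2, seed=0):
    """Shuffle items and split them in a train_parts:test_parts ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]

# Small demonstration on ten items.
train, test = split_train_test(range(10))

# The same ratio applied to the dataset size quoted in the text.
n_total = 717_728
n_train = n_total * 8 // 10   # rounded down
n_test = n_total - n_train

print(len(train), len(test), n_train, n_test)
```

Shuffling before cutting matters here because the dataset was assembled as a block of positives followed by a block of negatives; an unshuffled cut would leave the test set almost entirely one class.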

Assessing Model Performance

A metric known as the area under the receiver operating characteristic curve (AUC) may be used to assess the performance of deep learning models. The receiver operating characteristic curve plots the true positive rate against the false positive rate across decision thresholds, so the AUC reflects how well a model finds ‘true positives’ while avoiding ‘false positives’. Using a statistical method called K-fold cross-validation, the group determined that the LSTM model’s average AUC was 0.98 – close to 1.0 (the maximum), indicating a very good generalisation performance. Finally, they validated the model, confirming that it is able to uncover novel miRNA-gene association pairs not present in their positive training dataset.
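Both ideas can be sketched briefly. AUC can be read as the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counting as half. K-fold cross-validation divides the data into K parts and uses each part once for validation while training on the rest. The functions below are illustrative, the value of K is an assumption (the article does not state it), and the pairwise AUC loop is written for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked in the right order."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))

def k_fold_indices(n_items, k):
    """Yield (train_indices, validation_indices); each item is validated once."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for i, validation in enumerate(folds):
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation

# Toy scores: one positive is ranked below one negative.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
folds = list(k_fold_indices(10, k=5))  # k=5 is a hypothetical choice
print(auc, len(folds))
```

In practice a model would be trained on each training portion and its AUC measured on the matching validation portion; the figure reported is the average across the K runs.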

Predicting BRCA2-associated miRNAs

The influence of miRNAs on cancer is a lively field of research. Variants or mutations of certain genes are implicated in various cancers. How do miRNA-gene interactions contribute to this? This is an important research question for Mr Yoon. BRCA2 is a human gene encoding a protein that repairs DNA replication errors and regulates the cell cycle. BRCA2 mutations can lead to malignant cells with unrepaired DNA damage, implicated in breast, ovarian, and prostate cancers. The group used their deep learning model to predict miRNAs that associate with BRCA2. Curiously, among the top 10 predicted candidates were miRNAs with known associations with prostate, ovarian, breast, and cervical cancers. 

Going forward, Mr Yoon and the group want to further train the deep learning model for better prediction of miRNA-gene associations, and understand how these translate to disease phenotypes. Beyond BRCA2, they will focus on incurable and intractable diseases, deploying deep learning approaches to elucidate the pathogenetic entities involved. They hope to apply their deep learning insights to repurposing drugs to treat currently incurable diseases.

REFERENCE

https://doi.org/10.33548/SCIENTIA1185

MEET THE RESEARCHERS


Mr Seoung-Won Yoon
Chungnam National University, Yuseong-Gu, Daejeon, Republic of Korea

Mr Seoung-Won Yoon is a PhD Candidate at Chungnam National University, Republic of Korea, under the supervision of Professor Kyuchul Lee in the Computer Science & Engineering Department. Mr Yoon’s research involves developing deep learning models, primarily for bioinformatics and drug repurposing. He has so far participated in more than 10 nationally funded projects. His work on deep learning has included model development for pancreatic cancer-related genes, miRNA and mRNA prediction, and predicting the relationships between bio-genetic data. Additionally, he has conducted research projects in risk prediction, evaluating the performance of user movement path predictions, and assessing similarity patterns in user data. Furthermore, he has worked on an internet-of-things-based wearable flu vaccine project aimed at preventing the spread of influenza.

CONTACT

E: yoonenoch11@gmail.com


Professor Kyuchul Lee
Chungnam National University, Yuseong-Gu, Daejeon, Republic of Korea

Professor Kyuchul Lee received his Bachelor’s degree in Computer Science from Seoul National University in 1984 and a PhD in the same field from the same university in 1990. He has been a faculty member at Chungnam National University since 1989 and has also held the positions of visiting researcher at IBM Almaden Center and visiting professor at Syracuse University. At the time of writing, he has successfully led 144 national research projects, published 56 international and 125 domestic journal papers, given 97 international and 206 domestic conference presentations, and holds 46 patents and intellectual property records. Additionally, he has facilitated 12 technology transfers. As an advisor in the Data Artificial Intelligence Research Lab (formerly Database Systems Lab), he has mentored over 110 Master’s and PhD students, developing core technologies in the fields of databases and AI. His research spans multimedia data, XML, semantic web, IoT, AI, and big data. Through courses like File Processing, Database Systems, and Web Programming, he has equipped students with the knowledge to excel in their professional careers.

FUNDING

National Research Foundation of Korea (NRF)

Korean Government (MSIT)

FURTHER READING

S Yoon, I Hwang, J Cho, et al., miGAP: miRNA–Gene Association Prediction Method Based on Deep Learning Model, Applied Sciences, 2023, 13(22), 12349. DOI: https://doi.org/10.3390/app132212349

Creative Commons Licence (CC BY 4.0)

This work is licensed under a Creative Commons Attribution 4.0 International License.

What does this mean?

Share: You can copy and redistribute the material in any medium or format

Adapt: You can change, and build upon the material for any purpose, even commercially.

Credit: You must give appropriate credit, provide a link to the license, and indicate if changes were made.
