How big data can help us understand mental illness

Big data could lead to personalised medicine.

Although mental health is currently one of the top global health priorities, there has been no major progress in treating psychiatric disorders since the 1960s. The available therapies are ineffective, and prophylaxis is practically non-existent. To find a way out of this deadlock, psychiatry needs to consider different research approaches.

One of the major obstacles to moving psychiatry forward is the enormous inter-individual variability. Patients with broadly similar symptoms receive the same diagnosis, although the symptoms might have very different biological backgrounds. This is comparable to placing migraine, hypertension and brain cancer in one diagnostic category because all of them are characterised by headache. As a result, we are currently trying, blindly, to use the same treatment for a number of different diseases. Although DSM-5* was intended to reflect the biology of the diseases, it remains largely based on clinical symptoms. This is not going to change any time soon, because the mechanisms underpinning mental illnesses are still largely unknown. However, as long as psychiatric research relies on a symptom-based diagnostic system, our studies will remain inconclusive.

A potential way to end this impasse is to move the focus from diseases to symptoms in order to stratify patients into more homogeneous categories. We could achieve this by applying big data approaches pioneered by the rare disease field. Automated computer analyses of big clinical datasets can reveal tendencies that would otherwise remain undetected, while reducing the time and effort required. By transforming how we classify psychiatric disorders, big data can help us avoid a lengthy process of therapeutic trial and error, and enable more personalised treatment. In fact, this approach could push forward not only psychiatry, but also other medical fields, such as diabetes and cancer.
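To make the idea of symptom-based stratification concrete, here is a minimal sketch in Python. It is not a method described in this article: the patient records, the similarity measure (Jaccard overlap between symptom sets) and the 0.5 threshold are all assumptions chosen for illustration, echoing the headache analogy above.

```python
def jaccard(a, b):
    """Overlap between two symptom sets: shared symptoms / all symptoms."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stratify(patients, threshold=0.5):
    """Greedily group patients whose symptom sets resemble a group's first member.

    `patients` maps a patient id to a set of symptoms. The threshold is an
    arbitrary illustrative value, not a clinically validated one.
    """
    groups = []  # each group is a list of patient ids; index 0 is the representative
    for patient_id, symptoms in patients.items():
        for group in groups:
            if jaccard(symptoms, patients[group[0]]) >= threshold:
                group.append(patient_id)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([patient_id])
    return groups


# Hypothetical records: all three patients share "headache",
# yet their full symptom profiles differ.
patients = {
    "p1": {"headache", "nausea", "aura"},
    "p2": {"headache", "nausea"},
    "p3": {"headache", "high blood pressure"},
}
groups = stratify(patients)
```

Under these assumptions, p1 and p2 end up together while p3 forms its own group, even though a diagnosis based on headache alone would have lumped all three together. Real stratification would use far richer data and more robust clustering methods.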

The existing data, such as medical records kept by healthcare providers, are a valuable source of information that could be analysed using big data approaches. However, this requires revamping the way in which information is collected and stored, because automated analysis depends on the data being machine-readable. To help researchers and healthcare providers store data in an optimal way, the FAIR (Findable, Accessible, Interoperable and Reusable) data principles (1) provide guidance on good data stewardship practice. One way to make data FAIR is to use ontologies. Ontologies are hierarchically organised terminology lists that allow information to be encoded in a standardised way. For example, OMIM (2), which is an ontology for genetic disorders, encodes schizophrenia as 181500. The Human Phenotype Ontology (HPO) (3), on the other hand, focuses on symptoms: for example, “Hallucinations”, encoded as HP:0000738, belongs to the wider category “Behavioural abnormality”, encoded as HP:0000708. The advantage of using ontologies is that the codes are the same even when the medical records are written in different languages or use different words to describe the same symptom.
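The following Python sketch illustrates how an ontology keeps codes stable across languages and lets a program move from a specific term to its wider category. Only the two HPO codes come from the text above. The synonym table is hypothetical, and the direct parent link is a simplification, since the real HPO may place intermediate terms between the two.

```python
# Toy ontology fragment. Codes are taken from the article; the structure
# is simplified for illustration and is not the official HPO hierarchy.
LABELS = {
    "HP:0000738": "Hallucinations",
    "HP:0000708": "Behavioural abnormality",
}

PARENT = {
    "HP:0000738": "HP:0000708",  # Hallucinations -> Behavioural abnormality
    "HP:0000708": None,          # treated as the top of this toy fragment
}

# Hypothetical synonym table: different words and languages, same code.
SYNONYMS = {
    "hallucinations": "HP:0000738",
    "halucynacje": "HP:0000738",    # Polish
    "alucinaciones": "HP:0000738",  # Spanish
}


def encode(term):
    """Return the ontology code for a term, or None if it is unknown."""
    return SYNONYMS.get(term.strip().lower())


def ancestors(code):
    """Walk up the hierarchy and list every wider category of a code."""
    chain = []
    parent = PARENT.get(code)
    while parent is not None:
        chain.append(parent)
        parent = PARENT.get(parent)
    return chain


code = encode("Halucynacje")
wider = [LABELS[c] for c in ancestors(code)]
```

Because every record ends up with the same code, a query for the wider category can also retrieve patients whose notes mention only the more specific symptom, whatever language the note was written in.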

Medical records typically contain clinical descriptions in sentences rather than in lists, so they cannot be readily analysed by computers. Translating thousands of medical records into ontology codes by hand would be infeasible, but several existing tools help to automate this process. For example, Health 29 Phenotyper (4) extracts HPO terms from electronic medical records in just a few clicks. The tool recognises medical terms in several languages. In the case of paper medical records, a photo of the text can be processed automatically by optical character recognition software to create a digital copy.
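As a rough idea of what such extraction involves, the sketch below matches a free-text note against a small lexicon and returns the corresponding codes. It is a simplified illustration, not how Health 29 Phenotyper works internally: the lexicon and the function name `extract_codes` are hypothetical, and only the two HPO codes mentioned in this article are used.

```python
import re

# Hypothetical lexicon mapping surface phrases to ontology codes.
LEXICON = {
    "hallucination": "HP:0000738",
    "hallucinations": "HP:0000738",
    "behavioural abnormality": "HP:0000708",
    "behavioral abnormality": "HP:0000708",
}

# Try longer phrases first so that multi-word terms are matched whole.
_PHRASES = sorted(LEXICON, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in _PHRASES) + r")\b",
    re.IGNORECASE,
)


def extract_codes(note):
    """Return the distinct codes mentioned in a note, in order of appearance."""
    found = []
    for match in _PATTERN.finditer(note):
        code = LEXICON[match.group(1).lower()]
        if code not in found:
            found.append(code)
    return found


note = "Patient reports hallucinations at night. Behavioral abnormality noted by family."
codes = extract_codes(note)
```

A plain dictionary lookup like this ignores negation ("denies hallucinations"), misspellings and context, which is one reason dedicated tools are needed in practice.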

We should pay attention not only to the format but also to the content of medical records. Detailed and precise information, e.g. “auditory hallucinations”, will provide better results than just broad categories, e.g. “psychosis”. It is important to record all symptoms, even when they seem irrelevant, because further analysis may reveal unknown comorbidities. Based on comprehensive data, artificial intelligence is already able to assist clinicians in making diagnoses and even to correct them when mistakes occur. Large-scale analysis of medical records can also help discover previously overlooked prodromal symptoms. Since many psychiatric diseases have a developmental component, early identification of individuals at risk would be key for developing disease prevention strategies.

Although the examples above focus on phenotypic information, big data approaches can be applied to any data type used in clinical practice, including blood tests, brain scan images, genome sequencing data and even comprehensive records of patients’ environmental exposure history. By combining different data types, we can take a more holistic approach to diagnosis and treatment. Joint analyses could let us, for example, discover new gene-environment associations and identify early disease biomarkers.

Health data are sensitive, and their large-scale analysis raises several ethical, legal and social issues. It is important to remember that patient benefit is the main priority of all studies, and patient communities should actively participate in determining what data are used, how and for what purposes. Together, patients, clinicians and data experts can steer psychiatry in a new direction that might result in important discoveries, better diagnosis and new treatments.

*Diagnostic and Statistical Manual of Mental Disorders, 5th Edition, published in 2013

© Dorota Badowska


1. Wilkinson MD et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data, 2016; 3: 160018

2. OMIM – An Online Catalog of Human Genes and Genetic Disorders:

3. Human Phenotype Ontology:

4. Health 29 Phenotyper:

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
