The AI and data solutions helping to fight the coronavirus
Nearly 400 Ericsson data scientists, data engineers, data visualizers and other volunteers recently took part in The White House Office of Science and Technology Policy’s COVID-19 Open Research Dataset Challenge. Hear from some of the participants and discover the innovative solutions developed over the past four weeks.
The COVID-19 Open Research Dataset Challenge (CORD-19) aimed to gather the nation’s AI experts and, in less than one month, develop tools and techniques that let the scientific and medical research community efficiently and accurately mine the ever-growing body of medical journal research for answers to the challenging questions posed by the COVID-19 pandemic. Late last week, leveraging automation and AI techniques, each Ericsson team completed and submitted a solution for all nine tasks included in the CORD-19 challenge.
Bhavika Jalli is a GAIA Data Scientist in Santa Clara, California. She has worked at Ericsson for one and a half years. She holds a Master’s Degree in Electrical and Computer Engineering from the University of Michigan, Ann Arbor.
Ericsson has provided us with the platform to participate in the COVID-19 Open Research Dataset Challenge (CORD-19) hosted on Kaggle. The task assigned to our team was to answer the question “What has been published about medical care?”, making it easy for healthcare providers to extract relevant information from the given corpus.
Our task was further divided into 16 subtasks, each with a specific question about available medical care, such as “Resources to support skilled nursing facilities and long-term care facilities.” I am one of the Data Scientists on the team and volunteered as the lead Data Scientist for the task to help coordinate everyone’s efforts in developing a solution.
One of the challenges with this task was the need to precisely answer the input question while also providing a link to the relevant document(s). First, we noticed that the questions asked were short and quite general in nature. To address this, synonyms were added to the questions using the WordNet corpus available within NLTK; only words tagged as verbs, adjectives, or common nouns by Parts of Speech (POS) tagging were expanded. Second, we realized that the answer to these questions might not be the primary focus of an article. To highlight specifically relevant blocks of text, we trained both a BERT transformer and Word2Vec embeddings on the CORD-19 corpus, enabling us to present highlights from each article.
Ilyas Habeeb is a GAIA Data Engineer in Santa Clara, California. He has worked at Ericsson for nine months. He holds a Master’s Degree in Computer Science from New York University.
Our team focused on task 6 in the competition: “What do we know about non-pharmaceutical interventions (NPIs)?” The objective of the task was to find NPIs from the research papers that could help combat the spread of COVID-19. Being a Data Engineer, I was responsible for data collection and processing.
Before starting any Data Science task, it’s imperative to gather data and present it in a way that makes it easier for a Data Scientist to perform machine learning magic. As a first step, to ensure that we leverage the latest version of the frequently updated data set on Kaggle, it was necessary to automate this data extraction. Next, the data set – comprising more than 50,000 journal articles – needed to be parsed from its original JSON structure into a CSV file to facilitate the analysis steps.
Then, the text had to be processed, which included removing non-English papers, stop words, and punctuation, and performing tokenization and lemmatization. This step is necessary to remove noisy text and to enhance the performance of the model by improving the data quality.
As a final step, the data was merged with its metadata to provide access to relevant information such as authors, URL of the paper, and the publishing date.
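The parse-then-merge steps above can be condensed into a short sketch. It assumes the CORD-19 layout of one JSON document per paper plus a separate metadata table keyed on a paper id; the field and column names below mirror that schema but are illustrative rather than the team’s exact code.

```python
# Flatten raw paper JSON into a table, then merge in metadata
# (authors, URL, publish date) as described above.
import json
import pandas as pd

def parse_papers(json_blobs):
    """Turn raw paper JSON strings into one DataFrame row per paper."""
    rows = []
    for blob in json_blobs:
        paper = json.loads(blob)
        rows.append({
            "paper_id": paper["paper_id"],
            "title": paper["metadata"]["title"],
            "body_text": " ".join(p["text"] for p in paper["body_text"]),
        })
    return pd.DataFrame(rows)

raw = ['{"paper_id": "p1", "metadata": {"title": "A study"},'
       ' "body_text": [{"text": "Findings on transmission."}]}']
papers = parse_papers(raw)

# Left-join the metadata so every parsed paper keeps its row even when
# metadata is missing.
metadata = pd.DataFrame([{"paper_id": "p1", "authors": "Doe, J.",
                          "url": "https://example.org/p1",
                          "publish_time": "2020-03-01"}])
merged = papers.merge(metadata, on="paper_id", how="left")
```

From here, `merged.to_csv(...)` produces the flat file the analysis steps consume.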
Our team was able to build a machine/deep learning framework which included:
- A SciBERT model to extract the 10 journal papers most relevant to the task
- SciBERT + CountVectorizer to further rank the paragraphs of these 10 papers and isolate those that are most relevant
- SQuADBERT to summarize these paragraphs
- BLEU to improve the quality of the text answer
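The ranking stage of the framework above can be sketched with a simple vector-space stand-in: here TF-IDF cosine similarity plays the role that SciBERT embeddings played in the actual solution, so the example shows the retrieval pattern rather than the team’s model.

```python
# Rank documents against a task query by cosine similarity,
# as a lightweight stand-in for SciBERT-based ranking.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_documents(query, documents, top_k=10):
    """Return the top_k (document, score) pairs most similar to the query."""
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(documents)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(documents[i], float(scores[i])) for i in ranked]

docs = [
    "Mask wearing as a non-pharmaceutical intervention against spread.",
    "A study of antiviral drug efficacy in mice.",
    "School closures and other non-pharmaceutical interventions.",
]
top = rank_documents("non-pharmaceutical interventions", docs, top_k=2)
```

In the full pipeline the same ranking idea is applied twice: once over whole papers, then again over the paragraphs of the winning papers before summarization.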
Dr. Jing Hu is a GAIA Data Scientist in Santa Clara, California. She has worked at Ericsson for 10 months, and holds a PhD in Electrical and Computer Engineering from the University of Florida.
Dr. Forough Yaghoubi is a GAIA Data Scientist in Stockholm, Sweden. She has worked at Ericsson for 10 months, and holds a PhD in Communication and Network Systems from KTH Royal Institute of Technology.
Dr. Serveh Shalmashi is a GAIA Senior Data Scientist in Stockholm, Sweden. She has worked at Ericsson for 3 years, and holds a PhD in Communication Systems from KTH Royal Institute of Technology.
Our task was, “What do we know about vaccines and therapeutics?” To determine the important embedded key phrases in a subtask such as, “efforts to develop prophylaxis clinical studies and prioritize in healthcare workers”, we filtered the two primary phrases, prophylaxis and healthcare workers. We also had to identify and preserve the ‘semantic relationship’ between them, ensuring we addressed preventative methods for health workers, as opposed to the general population.
In pre-processing for data cleaning, we removed all non-English papers, papers published before December 2019, and papers shorter than 150 words, as well as any paper that did not mention COVID-19, SARS-CoV-2, or a variant of those terms. This step reduced the corpus to a subset of 3,800 of the 50,000+ papers. We then analyzed each question and extracted the key phrases and lists of related phrases through vocabulary analysis.
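The filtering step above can be sketched as a single boolean mask over a papers table; the column names and the exact keyword list are illustrative assumptions, not the team’s code.

```python
# Keep only English, post-November-2019, >=150-word papers that
# mention COVID-19 / SARS-CoV-2 or a naming variant.
import pandas as pd

COVID_TERMS = ("covid-19", "sars-cov-2", "2019-ncov", "ncov-2019")

def filter_papers(df):
    """Apply the language, date, length and keyword filters in one pass."""
    text = df["body_text"].str.lower()
    mask = (
        (df["language"] == "en")
        & (pd.to_datetime(df["publish_time"]) >= "2019-12-01")
        & (df["body_text"].str.split().str.len() >= 150)
        & text.str.contains("|".join(COVID_TERMS), regex=True)
    )
    return df[mask]

ok_text = " ".join(["word"] * 149 + ["covid-19"])
sample = pd.DataFrame([
    {"language": "en", "publish_time": "2020-01-15", "body_text": ok_text},
    {"language": "en", "publish_time": "2018-05-01", "body_text": ok_text},
    {"language": "de", "publish_time": "2020-01-15", "body_text": ok_text},
])
kept = filter_papers(sample)
```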
Next, Named Entity Recognition (NER) was used as a means of information extraction to locate key named entities in the corpus of each paper. Each key entity was defined based on the subtask question, such as drugs or Personal Protective Equipment (PPE). We used spaCy as a rule-based phrase matcher, which ensured the semantic relevance among detected phrases and sentences. Last, in post-processing, the relevant sentences were clustered based on similarity using BERT word embeddings and NLTK clustering. We then used the Gensim summarizer to remove duplicate information and provide a coherent summary of each cluster.
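The rule-based matching step can be sketched with spaCy’s `PhraseMatcher`; the entity labels and patterns below are illustrative examples, not the team’s actual rule set, and a blank English pipeline is enough since only the tokenizer is needed.

```python
# Rule-based phrase matching for question-specific entity types
# (drugs, PPE, ...) using spaCy's PhraseMatcher.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; no trained model required
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching

# One pattern list per entity type of interest.
matcher.add("DRUG", [nlp.make_doc(t) for t in ["remdesivir", "hydroxychloroquine"]])
matcher.add("PPE", [nlp.make_doc(t) for t in ["face mask", "n95 respirator"]])

doc = nlp("Trials of remdesivir continued while N95 respirator supplies fell.")
matches = [(nlp.vocab.strings[match_id], doc[start:end].text)
           for match_id, start, end in matcher(doc)]
```

Sentences containing these matches are the ones carried forward to the clustering and summarization stages.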
Zeinab Sobhani is a GAIA Data Scientist in Montréal, Canada. She has worked at Ericsson for four months. She holds a master’s degree in Electrical Engineering from McGill University.
My role in the task focused on preparing the raw, unstructured research papers and converting them into clean, structured data, since inconsistent data leads to false conclusions. The degree to which the data is cleaned has a high impact on performance and results. Combining my own ideas with those of my team and other data scientists, I defined a single, unified approach.
As a first step, the raw data – originally in a JSON format – was turned into a tabular format to provide easy access to the required information from the papers. Next, cleaning the content of the papers resulted in retaining the most important keywords, while removing uninformative and repetitive words. This is a very important step that greatly improves the performance of the models. The steps of the cleaning process are as follows:
- Discard non-English articles (less than five percent)
- Change words to lower case
- Remove stop words (for example, “the” and “it”) and unwanted characters (reference symbols, webpage links, and so on)
- Lemmatization: grouping together inflected forms of a word so they can be analyzed as a single item, as identified by the word's lemma (dictionary form)
Dr. Aydin Sarraf is a GAIA Data Scientist in Montréal, Canada. He has worked at Ericsson for more than a year. He holds a PhD in Mathematics from The University of New Brunswick.
As a data scientist on my task, I attempted to answer the question, “What do we know about diagnostics and surveillance?” by applying machine learning techniques to leverage the COVID-19 Open Research Dataset. One of the technical problems in this research challenge was proper data pre-processing and cleaning. For example, the well-known Gensim library provides a neat function named preprocess_string(). Under the hood, this function uses a number of default filters, including strip_numeric(). However, this is not an ideal filter to use when COVID-19 is one of the crucial keywords. Furthermore, while it is true that lemmatization produces more accurate tokens than stemming, it is computationally intensive to the point that I found it unfeasible for the large dataset of this research challenge.
Venkata Snehith Reddy Govinda Reddy is a GAIA Data Scientist in Santa Clara, California. He has worked at Ericsson for fourteen months. He holds a Master’s Degree in Computer Software Engineering from Troy University.
My task was related to vaccines and therapeutics. As an initial approach, I collected the list of drugs from Wikipedia’s List of Drugs by web scraping with the Python package Beautiful Soup, categorizing the drugs into Antibiotics, Antivirals, and All Other Drugs. Driving for further insights, we focused our investigations on the potential single-drug and combination-drug interventions discussed in the papers across different contexts (sentences). Incorporating this additional information into our model – the sentences that discussed drugs, the types of drugs, and the list of drugs mentioned in each sentence – enabled us to generate a context-based polarity score for each drug so that we could emphasize it within the overall dataset. My lack of pharmaceutical domain knowledge made interpreting the results challenging: for example, it was often unclear whether a flagged drug such as insulin was a genuine anomaly, since its use spans multiple treatment applications.
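The scraping step can be sketched as below. It runs on an inline HTML snippet rather than the live Wikipedia page, and the markup structure and category sets are illustrative assumptions; the actual categorization relied on the page’s own sections.

```python
# Extract drug names from list markup and bucket them into
# Antibiotics / Antivirals / Other, as described above.
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a href="/wiki/Amoxicillin">Amoxicillin</a></li>
  <li><a href="/wiki/Remdesivir">Remdesivir</a></li>
  <li><a href="/wiki/Insulin">Insulin</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
drug_names = [a.get_text(strip=True) for a in soup.select("ul li a")]

# Toy category sets for illustration only.
ANTIBIOTICS = {"Amoxicillin"}
ANTIVIRALS = {"Remdesivir"}
categorized = {
    name: ("Antibiotic" if name in ANTIBIOTICS
           else "Antiviral" if name in ANTIVIRALS
           else "Other")
    for name in drug_names
}
```

The resulting dictionary is what lets sentences in the corpus be tagged with the drugs, and drug types, they mention.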
Dr. Saman Bashbaghi is a GAIA Data Scientist in Montreal, Canada. He has worked at Ericsson for nearly two years. He holds a Ph.D. in Applied Artificial Intelligence from Ecole de Technologie Superieure (ETS) Montreal.
The objective of this effort has been to leverage machine learning and natural language understanding to design an effective question-answering (Q-A) system that can act as an intelligent search engine, finding the most relevant answers to questions on ethical and social considerations. To that end, the CORD-19 dataset is used to create a knowledge base: features are extracted from its contents (the abstracts and texts of the published articles) and used to train a Q-A system via pre-processing, modeling and retrieval modules. During the design phase, LDA topic modeling was applied to select several topics pertinent to the set of questions, which provides insight into the full perspective of the dataset through the most frequent keywords in each topic. Furthermore, the Doc2Vec approach was employed to convert the text into meaningful feature vectors that maintain the spatiotemporal relations of the sentences within each document. Since the abstract of each article is its most informationally dense part, the abstracts were used as training data, improving retrieval performance (lower computational cost) while training a lighter model. Finally, during the operational phase, the trained model was used to retrieve the most relevant answers (summarized/prominent portions of the articles) and generate relevancy scores for the given questions using the ball-tree unsupervised nearest-neighbor technique.
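The retrieval stage can be compressed into a short sketch. TF-IDF vectors stand in here for the Doc2Vec embeddings used in the actual system, so the example shows the embed-then-ball-tree-search pattern rather than the production model.

```python
# Embed abstracts, index them with a ball tree, and retrieve the
# nearest abstracts (with distances) for a question.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

abstracts = [
    "Ethical considerations for allocating scarce ventilators.",
    "Social distancing compliance across communities.",
    "Genomic sequencing of coronavirus samples.",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(abstracts).toarray()  # ball_tree needs dense input

index = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(vectors)

query = vectorizer.transform(["ethical allocation of ventilators"]).toarray()
distances, indices = index.kneighbors(query)
answers = [abstracts[i] for i in indices[0]]
```

The distances returned by `kneighbors` play the role of the relevancy scores: smaller distance, more relevant abstract.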
Ericsson thanks all teams from various organizations that have taken part in the CORD-19 Challenge. This collective effort enhances the efficiency of medical researchers in their tireless efforts to expedite solutions for the many facets of the COVID-19 pandemic.
Read Paul McLachlan’s blog post, How can AI and data science help to fight the coronavirus?
To learn more about Ericsson’s response to the coronavirus pandemic, read a blog post by our CEO: How we are responding to the coronavirus at Ericsson
We will also keep sharing information about our approach to the coronavirus and the measures we take on our coronavirus FAQ page.