The government wants researchers to use it to develop new text and data mining techniques that can help scientists answer key questions about the origins, transmission, and potential treatment of COVID-19. These questions have been published on Kaggle, a machine learning community owned by Google. They include:
What is known about transmission, incubation, and environmental stability?
What do we know about COVID-19 risk factors?
What do we know about virus genetics, origin, and evolution?
What has been published about ethical and social science considerations?
What do we know about diagnostics and surveillance?
What do we know about non-pharmaceutical interventions?
What has been published about medical care?
What has been published about information sharing and inter-sectoral collaboration?
What do we know about vaccines and therapeutics?
The entire COVID-19 Open Research Dataset (CORD-19) has been made available on Semantic Scholar, a free, nonprofit academic search engine. The collection will be updated whenever new research is published in archival services and peer-reviewed publications.
“Decisive action from America’s science and technology enterprise is critical to prevent, detect, treat, and develop solutions to COVID-19,” said Michael Kratsios, the USA’s Chief Technology Officer.
Building the dataset
CORD-19 was constructed through a collaboration between a range of organizations after a request was made by the White House Office of Science and Technology Policy. The Chan Zuckerberg Initiative gave access to pre-publication content, while the National Library of Medicine (NLM) provided literature content. Microsoft provided literature curation tools that collated the research, before researchers from the Allen Institute for AI transformed the content into machine-readable form. All of this was coordinated by Georgetown University’s Center for Security and Emerging Technology.
A number of other organizations have previously published scholarly articles on coronavirus that can be analyzed with AI, such as Chinese scientific journal database Chongqing VIP Information, which has made its academic papers free during the pandemic. However, the White House claims that CORD-19 is the world’s most extensive machine-readable coronavirus literature collection available for data and text mining.
Researchers can check it out for themselves and submit their answers on Kaggle. And if the prospect of saving the world isn’t enough temptation to contribute, Kaggle is also offering a $1,000 prize for the best answer to each question.