Business Summary
With the outbreak of the Covid-19 epidemic, it became obvious that epidemiological data are not uniformly recorded and made available.
Each country has its own data sources. Some of the data are hosted on GitHub (Johns Hopkins University, OpenZH for Switzerland, Statistikat for Austria), while others are made available through other channels (e.g. the data of the Robert Koch Institute are hosted on ArcGIS).
The data structures are even less consistent. Although many data sources offer their data as CSV files, the files do not have a uniform structure.
The Robert Koch Institute, for example, provides a single file that is extended daily with new and updated data. This file contains the newly reported cases for every German 'Landkreis' and every day.
Johns Hopkins University, on the other hand, provides a CSV file for each day with the total number of reported cases for each country, summed up from the beginning of the Covid-19 outbreak.
Statistikat data are prepared in yet another way: for each day, the data are written to several CSV files and packed into ZIP archives.
The goal of the Applides Epidemic application is a uniform preparation of these data, so that they can be used for comparative statistics.
System Architecture
The system architecture describes the runtime environments of the components involved. The following graphic gives an overview of the components and their environments:
- Ubuntu V-Server
  The Epidemic Adapters run on the Ubuntu V-Server at Strato. They send the epidemic data in a uniform format to a Kafka Topic. The Epidemic Processor listens to this topic and prepares the data.
- Epidemic Adapter
  An Epidemic Adapter reads epidemiological data from a specific data source. The task of the adapters is to read and interpret the different formats of the various data sources and transform them into a uniform format. Each adapter sends the data to a Kafka Topic.
- Kafka Topic
  We use Kafka as a messaging system to provide all interested parties with the uniform epidemiological data. Currently, the Epidemic Processor is the only Kafka listener.
- Epidemic Processor
  The Epidemic Processor listens for new messages on the Kafka Topic. Its task is to summarise epidemiological data by region, to aggregate them for superordinate regions and then to save them into a MongoDB document.
- MongoDB Atlas
  The regionally summarised epidemiological data are stored in a MongoDB database, which is hosted on MongoDB Atlas.
- MongoDB
  The data of each region (e.g. of a country or a county) are stored in a separate MongoDB document.
- Heroku
  On Heroku we host the Spring Boot applications, which provide REST services with epidemiological statistics.
- Epidemic Statistic
  The documents of MongoDB contain basic epidemiological data, such as the number of confirmed Covid-19 cases for the days of a year. The 'Epidemic Statistics' component uses the basic data to calculate statistics, such as an estimate of the rate of reproduction, and delivers the statistics through a Spring Boot REST service.
- Firebase
  We host the Angular Epidemic UI at Google Firebase.
- Epidemic UI
  Epidemic UI is an Angular web app that retrieves statistical data from Epidemic Statistic REST. It presents statistical charts and tables for the different regions and from different data sources.
Component Architecture
The component architecture describes the component responsibilities, underlying data structures and their dependencies as well as main implementation technologies or frameworks used.
Epidemic Adapters
Responsibility
Each Covid-19 adapter is responsible for transforming the data from one data source into uniform data.
An adapter looks for new data on its data source every 30 minutes. Once it encounters new data, it fetches them from its data source, transforms them into the uniform format and puts them onto the Kafka topic.
Some data origins already provide their data on a daily basis (e.g. ECDC, RKI),
i.e. the data denote the new cases (confirmed, death or recovered) for each day.
Other data origins provide cumulative data starting from the beginning of the Covid-19 epidemic
(e.g. JOHNS_HOPKINS, STATISTIKAT). For those origins, the adapter must calculate the daily data
as the difference between the cumulative data of the current day and those of the previous day.
The adapter should not aggregate data to superior regions or calculate any statistics.
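The difference calculation for origins that report running totals can be sketched as follows. This is a minimal illustration, not the adapters' actual code: the class and method names are hypothetical, and it assumes one total per consecutive day, with the first total treated as that day's value.

```java
import java.util.Arrays;

public class DailyDifference {

    /**
     * Converts cumulative totals (one entry per consecutive day) into daily values.
     * The first day has no predecessor, so its total is used as its daily value
     * (an assumption of this sketch).
     */
    public static double[] toDaily(double[] cumulative) {
        double[] daily = new double[cumulative.length];
        for (int i = 0; i < cumulative.length; i++) {
            double previous = (i == 0) ? 0.0 : cumulative[i - 1];
            daily[i] = cumulative[i] - previous;
        }
        return daily;
    }

    public static void main(String[] args) {
        double[] cumulative = {10, 15, 15, 22};
        // prints [10.0, 5.0, 0.0, 7.0]
        System.out.println(Arrays.toString(toDaily(cumulative)));
    }
}
```

If a source later corrects a total downwards, the difference becomes negative; whether such values should be passed on unchanged is a design decision this sketch leaves open.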
Uniform data structure
The uniform data sent by each adapter are very simple:
public class EpidemicData {
    /**
     * The epidemic data id
     */
    private EpidemicDataId dataId;
    /**
     * The epidemic data value
     */
    private double data;
}
The dataId uniquely identifies reported data values.
Two values are considered to be reported by the same data source,
for the same day, the same region and the same data type when their
epidemic data ids are equal.
import java.time.LocalDate;

public class EpidemicDataId {
    /**
     * This enumeration represents supported data origins (i.e. data sources),
     * e.g. JOHNS_HOPKINS.
     */
    private EEpidemicDataOrigin origin;
    /**
     * The data are reported for the region with this regionId.
     */
    private String regionId;
    /**
     * The data are reported for this day.
     */
    private LocalDate day;
    /**
     * This enumeration represents supported data types,
     * e.g. CONFIRMED_CASES.
     */
    private EEpidemicDataType dataType;
}
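Id equality over all four fields could be implemented roughly as shown below. This is a sketch under assumptions, not the project's code: the nested `DataId` class, the shortened enumerations and the region id "DE" are hypothetical stand-ins for `EpidemicDataId`, `EEpidemicDataOrigin`, `EEpidemicDataType` and real region ids.

```java
import java.time.LocalDate;
import java.util.Objects;

public class EpidemicDataIdSketch {

    // Hypothetical, shortened enumerations; the real ones may hold more values.
    public enum Origin { JOHNS_HOPKINS, RKI }
    public enum DataType { CONFIRMED_CASES, RECOVERED_CASES }

    /** Simplified stand-in for EpidemicDataId with value-based equality. */
    public static final class DataId {
        private final Origin origin;
        private final String regionId;
        private final LocalDate day;
        private final DataType dataType;

        public DataId(Origin origin, String regionId, LocalDate day, DataType dataType) {
            this.origin = origin;
            this.regionId = regionId;
            this.day = day;
            this.dataType = dataType;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof DataId)) return false;
            DataId other = (DataId) o;
            // Two ids are equal only if origin, region, day and data type all match.
            return origin == other.origin
                    && Objects.equals(regionId, other.regionId)
                    && Objects.equals(day, other.day)
                    && dataType == other.dataType;
        }

        @Override
        public int hashCode() {
            return Objects.hash(origin, regionId, day, dataType);
        }
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2020, 4, 1);
        DataId first = new DataId(Origin.JOHNS_HOPKINS, "DE", day, DataType.CONFIRMED_CASES);
        DataId second = new DataId(Origin.JOHNS_HOPKINS, "DE", day, DataType.CONFIRMED_CASES);
        DataId nextDay = new DataId(Origin.JOHNS_HOPKINS, "DE", day.plusDays(1), DataType.CONFIRMED_CASES);
        System.out.println(first.equals(second));  // true
        System.out.println(first.equals(nextDay)); // false
    }
}
```

Overriding `hashCode` together with `equals` keeps the id usable as a key in hash-based collections.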
Dependencies
The Adapters only depend on Apache Kafka.
Adapter Implementation
Currently, all adapters are implemented as scheduled Spring Boot Applications running on an Ubuntu V-Server hosted by Strato.
Future adapter implementations are not restricted to Spring Boot; they can be implemented using any technology that can send messages to Kafka topics.
List of Covid-19 Adapters
| Origin | Description | URL |
|---|---|---|
| | The Johns Hopkins University (USA) collects Covid-19 data for several countries of the world as well as detailed data for the United States of America. Johns Hopkins University provides confirmed, death and recovered cases. | |
Apache Kafka
Responsibility
We use Apache Kafka to store and deliver unified epidemic data. Kafka should not store epidemic data with the same id more than once, because we are only interested in the most recent value of reported data.
System Environment
Apache Kafka runs on the same machine as the Covid-19 adapters, an Ubuntu V-Server hosted by Strato.
The topic on which unified Covid-19 data are queued is configured to use compaction, so that only the most recent value is retained for epidemic data with the same id.
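The effect of compaction can be pictured with a plain map in which only the latest value per key survives. The sketch below is a simulation in plain Java, not Kafka code. It assumes that the epidemic data id is used as the message key (compaction works per key); the string key format and the class names are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompactionSketch {

    /** A keyed message, as it would be appended to the topic. */
    public static final class Message {
        final String key;
        final double value;

        public Message(String key, double value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Keeps only the most recent value per key, the state a compacted topic eventually converges to. */
    public static Map<String, Double> compact(List<Message> log) {
        Map<String, Double> latest = new LinkedHashMap<>();
        for (Message message : log) {
            latest.put(message.key, message.value); // a later message overwrites an earlier one
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Message> log = Arrays.asList(
                new Message("RKI|region-1|2020-04-01|CONFIRMED_CASES", 12),
                new Message("RKI|region-2|2020-04-01|CONFIRMED_CASES", 3),
                new Message("RKI|region-1|2020-04-01|CONFIRMED_CASES", 14)); // corrected value
        System.out.println(compact(log));
    }
}
```

In this example the corrected value 14 replaces the earlier value 12 for region-1. Real compaction runs in the background, so a consumer may still see older values for a while before they are removed.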
Dependencies
Apache Kafka has no dependencies on other epidemic system components.
Epidemic Processor
Responsibility
For Covid-19 statistics it is convenient to collect all data for a region in one data structure, called Epidemic Region Data.
The main task of the Epidemic Processor is to create, update and store Epidemic Region Data based on arriving Kafka messages.
Epidemic Region Data
One Epidemic Region Data document collects the data of all types for one origin, one region and one year. When the region has subregions, the document also references the corresponding subregion data.
import java.util.Map;

public class EpidemicRegionDataEntity {
    /**
     * Identifier for Epidemic Region Data. We have one Epidemic Region Data
     * for each origin, each region and each year.
     */
    private EpidemicRegionId id;
    /**
     * Map of data per data type, e.g. CONFIRMED_CASES.
     */
    private Map<EEpidemicDataType, EpidemicDataEntity> datas;
    /**
     * Map of subregion data per subregion.
     */
    private Map<String, EpidemicRegionDataEntity> subRegionDatas;
}
The Epidemic Data for each data type has the following structure:
public class EpidemicDataEntity {
    /**
     * The data type of this data, e.g. RECOVERED_CASES.
     */
    private EEpidemicDataType dataType;
    /**
     * Recorded data states for each day of the year.
     * A state is UNKNOWN if the corresponding value was neither PROVIDED nor AGGREGATED.
     */
    private EEpidemicDataState[] dataStates;
    /**
     * Recorded data for each day of the year.
     */
    private double[] datas;
}
Dependencies
The Epidemic Processor depends on Apache Kafka, where it listens for incoming epidemic data messages.
It also depends on MongoDB, where it stores processed epidemic documents.
Processor Implementation
The Epidemic Processor is implemented as a Spring Boot application that listens to the Kafka topic and runs on an Ubuntu V-Server hosted by Strato.
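The two core steps of the processor, placing a reported value at its day within the year and aggregating subregions into a superordinate region, can be sketched as follows. This is an illustration under assumptions rather than the actual implementation: it assumes that the per-day arrays are indexed by day of year minus one, it ignores the data states, and the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;

public class RegionAggregationSketch {

    /** Array index for a day: 1 January maps to index 0 (an assumption of this sketch). */
    public static int indexOf(LocalDate day) {
        return day.getDayOfYear() - 1;
    }

    /** Sums the daily values of all subregions into a new array for the superordinate region. */
    public static double[] aggregate(List<double[]> subRegionDatas, int daysInYear) {
        double[] total = new double[daysInYear];
        for (double[] subRegion : subRegionDatas) {
            for (int i = 0; i < daysInYear; i++) {
                total[i] += subRegion[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int days = LocalDate.of(2020, 1, 1).lengthOfYear(); // 366, because 2020 is a leap year
        double[] regionA = new double[days];
        double[] regionB = new double[days];

        LocalDate day = LocalDate.of(2020, 3, 1);
        regionA[indexOf(day)] = 5; // value reported for subregion A
        regionB[indexOf(day)] = 7; // value reported for subregion B

        double[] parent = aggregate(Arrays.asList(regionA, regionB), days);
        System.out.println(parent[indexOf(day)]); // 12.0
    }
}
```

A full implementation would also have to maintain the data states, for example by marking summed values as AGGREGATED.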
MongoDB Storage
Responsibility
MongoDB is responsible for the storage of Epidemic data documents.
System Environment
The epidemic MongoDB runs on MongoDB Atlas.
Dependencies
MongoDB has no dependencies on other epidemic system components.
Epidemic Statistic
Responsibility
The Epidemic Statistic component is responsible for providing basic and computed epidemic statistics.
It builds basic timeseries based on stored Epidemic Region Data. Basic timeseries contain provided or aggregated data, namely confirmed, recovered and death cases.
It also computes statistics based on these basic timeseries.
Timeseries
A timeseries represents epidemic data for a year, a region and an origin.
public class TimeSeries implements Comparable<TimeSeries> {
    /**
     * The origin that provided the data,
     * or the origin of the data on which derived data are based.
     */
    private final EEpidemicDataOrigin origin;
    /**
     * The data type, e.g. REPRODUCTION_RATE.
     */
    private final EEpidemicDataType type;
    /**
     * The data states for each day of the year.
     */
    private EEpidemicDataState[] dataStates;
    /**
     * The data for each day of the year.
     */
    private double[] data;
}
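A derived timeseries such as REPRODUCTION_RATE could be computed from the daily confirmed cases as sketched below. The document does not state which estimation method the component uses; the sketch assumes a simple ratio of the new cases in two consecutive 4-day windows (a generation time of four days), and the class and method names are hypothetical.

```java
public class ReproductionRateSketch {

    /** Assumed generation time in days; an assumption of this sketch. */
    static final int WINDOW = 4;

    /**
     * Estimates the reproduction rate for the day at index t as
     * (new cases in the last WINDOW days) / (new cases in the WINDOW days before).
     * Returns NaN when there is not enough history or the earlier window has no cases.
     */
    public static double estimate(double[] dailyCases, int t) {
        if (t < 2 * WINDOW - 1 || t >= dailyCases.length) {
            return Double.NaN;
        }
        double recent = 0;
        double earlier = 0;
        for (int i = 0; i < WINDOW; i++) {
            recent += dailyCases[t - i];
            earlier += dailyCases[t - WINDOW - i];
        }
        return earlier == 0.0 ? Double.NaN : recent / earlier;
    }

    public static void main(String[] args) {
        double[] daily = {10, 10, 10, 10, 12, 12, 12, 12};
        System.out.println(estimate(daily, 7)); // 48 / 40 = 1.2
    }
}
```

Days for which no estimate is possible are returned as NaN here; in a timeseries they would more naturally be marked with the state UNKNOWN.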
Epidemic Statistic Implementation
Epidemic Statistic runs as a Spring Boot application on Heroku.
Dependencies
Epidemic Statistic depends on MongoDB, from which it reads basic epidemic data.
Epidemic UI
Responsibility
The Angular UI presents epidemic timeseries in diagrams and tables.
Dependencies
The Angular UI depends on the Epidemic Statistic REST service.
UI Implementation
Epidemic UI is implemented in Angular and runs on Google Firebase.
TBD