Using the Algorand Blockchain to Track COVID-19
Overview
Our response and resilience to the COVID-19 pandemic will define our time for generations to come. Today, individuals, researchers and institutions around the world are applying their skills to contain and understand COVID-19. To aid these efforts, Algorand is building the first global repository of self-reported COVID-19 data that is freely accessible, updatable and real time.
IReport-Covid leverages the Algorand blockchain to give anyone the means to inform their community and the world at large of their experience with COVID-19 in a timely and anonymized way. By collecting a limited scope of data directly from individuals, IReport-Covid can support future research into pandemics and demonstrates the potential of blockchain to transform how we respond to crises as they emerge. In this solution I will demonstrate how to scrape and process IReport-Covid survey data from the Algorand blockchain in Python and give tips on analyzing the data.
Set Up
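- You will need a Python IDE and Python 3 (the algosdk package targets Python 3).
- Install the algosdk module (published on PyPI as py-algorand-sdk). The math and time modules are part of the Python standard library and need no installation.
- Install other modules as needed (pandas and numpy are recommended; datetime is part of the standard library)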
- You will also need your own PureStake API key. The PureStake API provides free, easy-to-use access to the Algorand TestNet and MainNet.
- Sign up for a PureStake developer account
- Generate your API Key (5000 requests per day, 5 requests a second limit)
- Download the py_algorand.py script, or git clone the repo linked at the top of this solution, to your local system.
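Before going further, it can help to confirm that everything in the list above is importable. Here is a minimal sketch using only the standard library; missing_modules is a hypothetical helper name of my own, not something provided by py_algorand.py:

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# algosdk is provided by the py-algorand-sdk package;
# math, time and datetime ship with Python itself.
required = ["algosdk", "pandas", "numpy", "math", "time", "datetime"]
print("Missing modules:", missing_modules(required) or "none")
```

Anything reported as missing can then be installed before running the rest of the notebook.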
Getting Started
After confirming the py_algorand.py script is located in your working directory, load the following in your Python IDE of choice (I use Jupyter):
# imports
import algosdk
import math
import pandas as pd
import numpy as np
from datetime import datetime
# import scraper class from script
from py_algorand import Algorand_IReportScrape
# optional configs for dataframe display in notebook
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# your PureStake API key
YOUR_API_KEY = ‘xxxxxxxxxxxxxxxxxxxxxxxxxxxx’
The py_algorand.py script scrapes survey transactions in bulk from the Algorand MainNet via the PureStake API. It is largely transcribed from the JavaScript solution detailed in the official IReport-Covid repo.
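I won't reproduce py_algorand.py here, but to give a feel for what bulk scraping under the PureStake rate limit (5 requests a second) involves, here is a rough sketch. It splits a range of rounds into batches with math.ceil and spaces out calls with time.sleep. The names round_batches and throttled_map, the batch size of 1000 rounds and the fake_fetch stand-in are all assumptions made for illustration; the actual script may be organized quite differently.

```python
import math
import time

def round_batches(first_round, last_round, batch_size=1000):
    """Split the inclusive range [first_round, last_round] into (start, end) batches."""
    total = last_round - first_round + 1
    n_batches = math.ceil(total / batch_size)
    return [
        (first_round + i * batch_size,
         min(first_round + (i + 1) * batch_size - 1, last_round))
        for i in range(n_batches)
    ]

def throttled_map(fetch, batches, max_per_second=5):
    """Call fetch(start, end) for each batch, pausing so calls stay under the rate limit."""
    min_interval = 1.0 / max_per_second
    results = []
    for start, end in batches:
        began = time.monotonic()
        results.append(fetch(start, end))
        elapsed = time.monotonic() - began
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
    return results

# Stand-in for a real API call: it only reports which rounds it was asked for.
def fake_fetch(start, end):
    return {"first": start, "last": end}

batches = round_batches(5650000, 5652499)
print(throttled_map(fake_fetch, batches))
```

Replacing fake_fetch with a function that queries the API for one batch of rounds would keep the request rate within the free-tier limit.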
Now that you have your environment configured, let’s move on to getting the data!
Scraping IReport-Covid survey transactions
Running:
covidData_scraper = Algorand_IReportScrape(YOUR_API_KEY)
should give output similar to the sample in the original post, which is not reproduced here.
Finally, to store the survey transactions dictionaries in a local variable:
txns = covidData_scraper.get_txns()
and confirm the number of transactions with len(txns)
Processing Tips
Here is an example survey transaction dictionary contained in our txns variable:
{'tx': 'UGY5YWRRYLLCDC5SD2BIAUEBXKZROG4N56VT6QRZBZD566RKEHVA', 'fee': 1000, 'from': 'COVIDR5MYE757XMDCFOCS5BXFF4SKD5RTOF4RTA67F47YTJSBR5U7TKBNU', 'type': 'pay', 'round': 5650155,
'noteb64': 'gqFk3gARol90pnJlcG9ydKJfdqExp2NvbnNlbnTDomdhN6JnY6JVU6JncqJOWaJnc6Fmo2d6cKMxMDCibXr/onEyw6NxZG7Do3Fkc6oyMDIwLTAzLTEwonFsAaJxegGic3r/onR0/6J0ev+hc9ksRHNvUU1GY0V3V0l0b1RzRStFMVRtcjJiK2w4U3RWVWlka1dwSVh6S0tLOD0=',
'payment': {'to': 'COVIDR5MYE757XMDCFOCS5BXFF4SKD5RTOF4RTA67F47YTJSBR5U7TKBNU', 'amount': 0, 'torewards': 0, 'closerewards': 0}, 'genesisID': 'mainnet-v1.0', 'last-round': 5651153, 'first-round': 5650153, 'fromrewards': 386000,
'genesishashb64': 'wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8='}
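Before decoding, a quick sanity check on the scraped list can be useful. The sketch below keeps only payment transactions that were sent to the report address and carry a note. It assumes every entry has the same shape as the example above; is_survey_txn and the demo list are hypothetical and exist only for illustration.

```python
# Address taken from the 'to' field of the example transaction above
REPORT_ADDRESS = "COVIDR5MYE757XMDCFOCS5BXFF4SKD5RTOF4RTA67F47YTJSBR5U7TKBNU"

def is_survey_txn(txn, address=REPORT_ADDRESS):
    """True for a payment transaction sent to `address` that carries a non-empty note."""
    payment = txn.get("payment", {})
    return (
        txn.get("type") == "pay"
        and payment.get("to") == address
        and bool(txn.get("noteb64"))
    )

# Made-up entries standing in for the scraper output
txns_demo = [
    {"tx": "TX_A", "type": "pay", "noteb64": "gqFk", "payment": {"to": REPORT_ADDRESS}},
    {"tx": "TX_B", "type": "pay", "payment": {"to": REPORT_ADDRESS}},  # no note
    {"tx": "TX_C", "type": "pay", "noteb64": "gqFk", "payment": {"to": "SOME_OTHER_ADDRESS"}},
]
survey_txns = [t for t in txns_demo if is_survey_txn(t)]
print("{} of {} demo transactions look like survey reports".format(len(survey_txns), len(txns_demo)))
```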
We want to decode the encoded noteb64 value in this dictionary for every transaction.
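To make that concrete, here is a minimal sketch that decodes the noteb64 value of the example transaction above. It assumes the msgpack package is available (it is installed as a dependency of algosdk) and passes raw=False so that keys and strings come back as str instead of bytes. decode_note is a name I made up for this example.

```python
import base64

# noteb64 value copied from the example transaction above
EXAMPLE_NOTEB64 = "gqFk3gARol90pnJlcG9ydKJfdqExp2NvbnNlbnTDomdhN6JnY6JVU6JncqJOWaJnc6Fmo2d6cKMxMDCibXr/onEyw6NxZG7Do3Fkc6oyMDIwLTAzLTEwonFsAaJxegGic3r/onR0/6J0ev+hc9ksRHNvUU1GY0V3V0l0b1RzRStFMVRtcjJiK2w4U3RWVWlka1dwSVh6S0tLOD0="

def decode_note(noteb64):
    """Base64-decode a transaction note and unpack the MessagePack payload inside it."""
    import msgpack  # imported here so the rest of the module loads without it
    return msgpack.unpackb(base64.b64decode(noteb64), raw=False)

if __name__ == "__main__":
    note = decode_note(EXAMPLE_NOTEB64)
    survey = note["d"]  # "d" holds the survey answers, "s" a separate string value
    print(survey["gc"], survey["gr"], len(survey))
```

For this transaction the survey part contains 17 answered fields, including a country code of "US" and a region code of "NY".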
Here is an example decoded note dictionary:
{ "d": { "_t": "report", "_v": "1", "consent": true, "ga": 20, "gc": "US", "gr": "MD", "gs": "m", "gzp": "212", "m2": true, "mz": 1, "q1": true, "q2": true, "q4": true, "qdn": true, "qds": "2020-03-11", "qz": 1, "s4": true,
"sde": "2020-03-23", "sds": "2020-03-18", "sz": 1, "tt": 2, "tz": -1 }, "s": "6dcVNOgtO+4EPnfCaSfBI1dzW8BoNhKYd+dK1vfpWUA=" }
To create an easily interpretable data structure we will form a single pandas dataframe containing all the survey data we have. We will leverage an attribute array to map the key of every possible (key,value) pair in a survey dictionary to an interpretable column name. This array is defined as:
headers = [
'a',
'_t',
'_v',
# general demographic data
'gc', # string, country code (see Location Data section below)
'gr', # string, region code (see Location Data section below)
'gzp', # string, 3-digit zip code (US only)
'ga', # integer, age group, if present must be in 1,11,21,31,41,51,56,61,66,71,76,81,85
'gs', # string, gender, if present must be 'm','f'
# symptoms
'sz', # integer, is symptomatic, no-answer=0/no=-1/yes=1
's1', # boolean, fever
's2', # boolean, cough
's3', # boolean, difficulty breathing
's4', # boolean, fatigue
's5', # boolean, sore throat
'sds', # date, when symptoms started, yyyy-mm-dd
'sde', # date, when symptoms ended, yyyy-mm-dd
'sdn', # boolean, still symptomatic
# tested
'tz', # integer, tested, no-answer=0/no=-1/yes=1
'tt', # integer, tried to get tested, no=-1, yes=1, yes but was denied=2
'td', # date, test date, yyyy-mm-dd
'tr', # integer, test results, -1=negative,1=positive,2=waiting for result
'tl', # integer, test location, 1=Dr office/2=Hospital/3=Urgent care/4=Ad-hoc center/5=Other
# medical care
'mz', # integer, received care, no-answer=0/no=-1/yes=1
'm1', # boolean, doctor's office
'm2', # boolean, walk-in clinic
'm3', # boolean, virtual care
'm4', # boolean, hospital/ER
'm5', # boolean, other
'mh', # integer, hospitalized, no-answer=0/no=-1/yes=1
'mhs', # date, when admitted, yyyy-mm-dd
'mhe', # date, when discharged, yyyy-mm-dd
'mhn', # boolean, still in hospital
# quarantine
'qz', # integer, was quarantined, no-answer=0/no=-1/yes=1
'q1', # boolean, due to symptoms
'q2', # boolean, voluntarily
'q3', # boolean, personally required
'q4', # boolean, general quarantine
'qds', # date, when quarantine started, yyyy-mm-dd
'qde', # date, when quarantine ended, yyyy-mm-dd
'qdn', # boolean, still quarantined
'ql', # integer, left quarantine temporarily no-answer=0/no=-1/yes=1
'consent' # boolean, user's consent, mandatory, must be 'true'
]
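To show how a decoded note lines up with this array, here is a small sketch. It uses a shortened subset of headers and a few fields from the example decoded note shown earlier; note_to_row and the placeholder transaction id are hypothetical. Questions the respondent did not answer simply end up as None.

```python
# Subset of the headers array above, kept short for the example
headers_subset = ["a", "gc", "gr", "gs", "sz", "s1", "s4", "tz", "consent"]

# A few survey fields taken from the example decoded note shown earlier
note_data = {"consent": True, "gc": "US", "gr": "MD", "gs": "m", "s4": True, "sz": 1, "tz": -1}

def note_to_row(note_data, tx_id, headers):
    """Return one row with a value for every header; unanswered questions become None."""
    row = {key: None for key in headers}
    row.update({k: v for k, v in note_data.items() if k in headers})
    row["a"] = tx_id  # 'a' holds the transaction id
    return row

row = note_to_row(note_data, "EXAMPLE_TX_ID", headers_subset)
print(row)
```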
Now we are ready to decode our survey transaction notes and store them:
import base64
import msgpack  # installed as a dependency of algosdk

# collect one row per transaction and build the dataframe once at the end
# (DataFrame.append was removed in pandas 2.0)
rows = []
# iterate through transactions
for i, tx_dict in enumerate(txns):
    # stage print as transactions are processed
    if i % 1000 == 0:
        print("{} transactions decoded".format(i))
    # decoding notes; raw=False returns keys and strings as str rather than bytes
    tx_code = tx_dict['tx']
    encoded_note = tx_dict['noteb64']
    decoded_note = msgpack.unpackb(base64.b64decode(encoded_note), raw=False)
    decoded_note_data = decoded_note['d']
    # start with every header set to None, then fill in the answers that were given
    cleaned_note_data = {key: None for key in headers}
    cleaned_note_data.update(decoded_note_data)
    cleaned_note_data['a'] = tx_code
    rows.append(cleaned_note_data)
# columns are sorted by key, matching the order of the cols list below
data_df = pd.DataFrame(rows, columns=sorted(headers))
To confirm that the notes were processed successfully we can check the first few rows of our new dataframe with data_df.head()
and then map the columns of our dataframe to more interpretable names. The names in cols follow the alphabetical order of the original keys ('_t', '_v', 'a', 'consent', 'ga', and so on), so the assignment below assumes the dataframe columns are sorted the same way:
cols = ["_t","_v","tx_id","consent","age_group","country_code","region_code","gender","3_dig_zip","doctors_office","walk_in_clinic","virtual_care","hospital_or_ER","other","hospitalized","when_discharged","still_in_hospital",
"when_admitted","received_care","symptom_quarantine","voluntary_quarantine","personally_required_quarantine","general_quarantine","when_quarantine_ended","still_in_quarantine","when_quarantine_started","left_quarantine_temporarily",
"was_quarantined","fever","cough","difficulty_breathing","fatigue","sore_throat","when_symptoms_ended","still_symptomatic","when_symptoms_started","is_symptomatic","test_date","test_location","test_results","tried_to_get_tested",
"tested"]
A quick check that should output True:
print(len(cols)==len(headers)==len(data_df.columns))
If it does, apply the new names:
data_df.columns = cols
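Assigning names by position only works if the column order is exactly right. A more defensive option is to build an explicit key-to-name mapping and rename by key. The sketch below pairs the survey keys, sorted alphabetically, with the names in cols; key_to_name is a name I chose for this example.

```python
# Survey keys from the headers array (comments omitted)
headers = [
    "a", "_t", "_v", "gc", "gr", "gzp", "ga", "gs",
    "sz", "s1", "s2", "s3", "s4", "s5", "sds", "sde", "sdn",
    "tz", "tt", "td", "tr", "tl",
    "mz", "m1", "m2", "m3", "m4", "m5", "mh", "mhs", "mhe", "mhn",
    "qz", "q1", "q2", "q3", "q4", "qds", "qde", "qdn", "ql",
    "consent",
]

# Interpretable names, listed in the alphabetical order of the keys they describe
cols = [
    "_t", "_v", "tx_id", "consent", "age_group", "country_code", "region_code",
    "gender", "3_dig_zip", "doctors_office", "walk_in_clinic", "virtual_care",
    "hospital_or_ER", "other", "hospitalized", "when_discharged",
    "still_in_hospital", "when_admitted", "received_care", "symptom_quarantine",
    "voluntary_quarantine", "personally_required_quarantine", "general_quarantine",
    "when_quarantine_ended", "still_in_quarantine", "when_quarantine_started",
    "left_quarantine_temporarily", "was_quarantined", "fever", "cough",
    "difficulty_breathing", "fatigue", "sore_throat", "when_symptoms_ended",
    "still_symptomatic", "when_symptoms_started", "is_symptomatic", "test_date",
    "test_location", "test_results", "tried_to_get_tested", "tested",
]

# Pair each key with its name; sorted() reproduces the alphabetical key order
key_to_name = dict(zip(sorted(headers), cols))

for key in ["a", "gzp", "mhs", "qds", "tz"]:
    print(key, "->", key_to_name[key])
```

With this mapping, data_df.rename(columns=key_to_name) gives the same names regardless of the order the columns happen to be in.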
Finally we are ready to store our dataframe as a csv for further processing and analysis. We can achieve this with:
date = str(datetime.today().strftime('%Y-%m-%d'))
data_df.to_csv('covidData'+date+'.csv',index=False)
Here I store the dataframe with a date included in the file name to preserve the uniqueness of data stored in future notebook runs and eventually compare transaction data over time.
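As one example of comparing runs over time, the sketch below finds the transaction ids that appear in a newer snapshot but not in an older one. It uses only the standard library and assumes the CSV has a tx_id column (as it would after the rename above); the function names and the tiny made-up snapshots are for illustration only.

```python
import csv
import os
import tempfile

def read_tx_ids(csv_path, id_column="tx_id"):
    """Return the set of transaction ids stored in one CSV snapshot."""
    with open(csv_path, newline="") as handle:
        return {row[id_column] for row in csv.DictReader(handle)}

def new_tx_ids(older_csv, newer_csv, id_column="tx_id"):
    """Transaction ids present in the newer snapshot but not in the older one."""
    return read_tx_ids(newer_csv, id_column) - read_tx_ids(older_csv, id_column)

# Demo with two tiny made-up snapshots written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    older = os.path.join(tmp, "covidData2020-04-01.csv")
    newer = os.path.join(tmp, "covidData2020-04-02.csv")
    for path, ids in [(older, ["TX1", "TX2"]), (newer, ["TX1", "TX2", "TX3"])]:
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["tx_id", "country_code"])
            writer.writerows([tx, "US"] for tx in ids)
    added = new_tx_ids(older, newer)
    print(sorted(added))
```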
Next Steps
After examining features of the survey data you might want to further process and transform the data to answer questions like:
- How many men reported leaving quarantine temporarily in New York, USA?
- How many women reported having a fever from those that reported being symptomatic?
- How many individuals reported receiving virtual care?
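Questions like these translate fairly directly into pandas filters. Here is a sketch on a few made-up rows that use the renamed columns; the values are invented purely for illustration and say nothing about the real survey data.

```python
import pandas as pd

# Made-up rows for illustration only; real data would come from data_df
toy_df = pd.DataFrame({
    "country_code": ["US", "US", "US", "CA"],
    "region_code": ["NY", "NY", "MD", "ON"],
    "gender": ["m", "f", "m", "m"],
    "left_quarantine_temporarily": [1, 1, -1, 1],
    "virtual_care": [None, True, None, None],
})

# Men in New York, USA who reported leaving quarantine temporarily
men_left_ny = toy_df[
    (toy_df["country_code"] == "US")
    & (toy_df["region_code"] == "NY")
    & (toy_df["gender"] == "m")
    & (toy_df["left_quarantine_temporarily"] == 1)
]
print("Men who left quarantine temporarily in NY:", len(men_left_ny))

# Respondents who reported receiving virtual care (missing answers do not count)
virtual_care_count = int(toy_df["virtual_care"].eq(True).sum())
print("Reported virtual care:", virtual_care_count)
```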
I would suggest the following as a start:
- Ensure that each row (survey response) has a consent value of 1.0, indicating that consent was given by the respondent. This should be the case for every row.
- Drop columns “_t”, “_v” and “consent”
- Transform boolean columns (such as “doctors_office”, “voluntary_quarantine” and “fever”, to name a few) to numeric columns.
- Impute missing state abbreviations in rows that have 3-digit zip code values by using another dataset with state and corresponding zip code information
- Remove rows or columns that contain little to no relevant information for the topic of exploration
- One-hot-encode columns with 2 or more values or non-numeric values such as “gender”, “age_group”, “test_location” and “test_results”, to name a few, as described in the ‘headers’ attribute array
- Group the data by country or subset the data to countries with states and then group by country and state.
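Several of the steps above can be sketched in a few lines of pandas. The rows below are made up, and zip3_to_state is a hypothetical two-entry lookup standing in for a real zip code dataset. Note that this sketch counts a missing boolean answer as 0, which is a simplifying assumption you may not want in your own analysis.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "_t": ["report"] * 3,
    "_v": ["1"] * 3,
    "consent": [1.0, 1.0, 1.0],
    "country_code": ["US", "US", "US"],
    "region_code": ["NY", None, "MD"],
    "3_dig_zip": ["100", "100", "212"],
    "gender": ["f", "m", None],
    "fever": [True, None, True],
})

# Hypothetical lookup; a real one would come from an external zip code dataset
zip3_to_state = {"100": "NY", "212": "MD"}

# 1. every row should carry consent
all_consented = bool((toy_df["consent"] == 1.0).all())

# 2. drop bookkeeping columns
clean_df = toy_df.drop(columns=["_t", "_v", "consent"])

# 3. boolean column to numeric (missing answers become 0 in this sketch)
clean_df["fever"] = clean_df["fever"].eq(True).astype(int)

# 4. fill in missing states from the 3-digit zip code
clean_df["region_code"] = clean_df["region_code"].fillna(
    clean_df["3_dig_zip"].map(zip3_to_state)
)

# 5. one-hot-encode a categorical column
clean_df = pd.get_dummies(clean_df, columns=["gender"])

# 6. group by country and state
by_state = clean_df.groupby(["country_code", "region_code"]).size()
print(all_consented)
print(by_state)
```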
Now you can create charts like this
1237 surveys with information…
- From 1224 valid responses: 61.11% reported being quarantined
- From those that reported being quarantined 68.72% left temporarily
- From 156 people who are symptomatic:
- 5.13% got care at the doctor’s office
- 1.92% got care at a walk in clinic
- 6.41% got virtual care
- 3.21% got care at the hospital/ER
- 46.79% reported a fever
- 67.95% reported a cough
- 34.62% reported difficulty breathing
- 67.95% reported fatigue
- 58.97% reported sore throat
or these
From 615 surveys in the US…
- 596 received no care [96.91%]
- 17 received care [2.76%]
- 590 not tested [95.93%]
- 21 tested [3.41%]
- 318 female [51.71%]
- 286 male [46.50%]
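The percentages in summaries like this are just each count divided by the number of surveys in the group. A small sketch using the US figures above; share is a hypothetical helper name:

```python
def share(count, total):
    """Format a count together with its percentage of total."""
    return "{} [{:.2f}%]".format(count, 100 * count / total)

us_surveys = 615
us_counts = [
    ("received no care", 596),
    ("received care", 17),
    ("not tested", 590),
    ("tested", 21),
]
for label, count in us_counts:
    print(label, share(count, us_surveys))
```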
Final Thoughts
From understanding the evolution of individuals’ symptoms in real time to assessing the relationship between certain behaviors in a community and the spread of COVID-19, and even detecting emerging hotspots with little to no lag time, we are scratching the surface of what is possible with the Algorand blockchain and community tracking applications like IReport-Covid.
Coming together to ensure free access to self-reported, anonymized data like this is a powerful thing that can help us keep our community informed and enable much needed research into a pandemic like COVID-19. I can’t wait to see what else is possible, and if anyone is interested in working on a project involving IReport-Covid, feel free to contact me!