Methodology
A. Data Collection
Identifying Sources of Data
The Wilson Center developed a list of potential data sources and prioritized the inclusion of web pages that maximize the volume, relevance, and accuracy of records. Priority was given to the datasets of government bodies, independent research organizations, and NGOs because of their role as aggregators of publicly available data and the transparency of their methods. Additionally, as the owners of public infrastructure, governments can be assumed to hold the most authoritative datasets on public projects.
For the current release of AII (December 2020), the largest sources of data were:
- the United Nations Code for Trade and Transport Locations
- the Global Power Plant Database (World Resources Institute)
- the Arctic Marine and Aviation Transportation Infrastructure Initiative (Arctic Council, under an initiative of the Governments of the United States and Iceland)
- the Interagency Electronic Reporting System for Commercial Fishery Landings (State of Alaska)
- the Federal Agency of Sea and River Transport (Government of the Russian Federation)
Because coverage is currently limited to publicly available data, this list is not exhaustive. When possible, the websites of private infrastructure owners were also consulted, as they are the most authoritative sources for privately owned projects.
Extracting Data
Once a data source has been earmarked for inclusion in AII, its data must be extracted. The extraction method depends on the format in which a web page stores its data. Data used to build AII is stored in a variety of formats, ranging from .csv and .doc files to raw HTML. Where files are publicly available, they are downloaded in their raw form for further processing; where data is stored in HTML, web scraping is employed.
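As a minimal illustration of this step (a sketch, not the actual AII scripts), a publicly posted file can be retrieved in raw form with Python’s requests library; the URL and file names below are placeholders:

```python
import requests

# Placeholder URL; actual AII sources vary by provider.
SOURCE_URL = "https://example.org/arctic_ports.csv"

# Retrieve the file in its raw form for later processing.
response = requests.get(SOURCE_URL, timeout=30)
response.raise_for_status()  # abort unless the server answered successfully

with open("arctic_ports_raw.csv", "wb") as f:
    f.write(response.content)
```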
Web scraping, a routine part of internet operations, allows users to extract the portions of a web page’s HTML that contain the desired data in its raw form. Web scraping can be performed in many programming languages, but for the purposes of AII, scraping was performed in Python.
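For illustration, a library such as BeautifulSoup can isolate the table cells that hold the desired records; the HTML fragment below is a toy stand-in for a real source page, and the tag structure is an assumption:

```python
from bs4 import BeautifulSoup

# Toy HTML fragment standing in for a scraped source page.
html = """
<table id="projects">
  <tr><th>Name</th><th>Type</th></tr>
  <tr><td>Example Port</td><td>Port</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only rows that contain data cells (<td>), skipping the header row.
records = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in soup.select("table#projects tr")
    if row.find_all("td")
]
print(records)  # [['Example Port', 'Port']]
```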
In accordance with industry standards, all data extraction and web scraping followed the hypertext transfer protocol (HTTP), and scraping proceeded only when the data source’s server returned a status code of 200 (OK). Users also followed the scraping rules set out in each server’s robots.txt file, taking care not to exceed a server’s request limits.
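In Python, this etiquette can be expressed with the standard library’s robots.txt parser and a status-code check before any page is parsed. The sketch below is illustrative, and its target URLs are hypothetical:

```python
import time

import requests
from urllib.robotparser import RobotFileParser

# Hypothetical target; substitute the data source being scraped.
BASE_URL = "https://example.org"
PAGE_URL = BASE_URL + "/infrastructure"

# Honor the server's robots.txt before issuing any scraping request.
robots = RobotFileParser()
robots.set_url(BASE_URL + "/robots.txt")
robots.read()

if robots.can_fetch("*", PAGE_URL):
    # Respect any declared crawl delay to stay under request limits.
    time.sleep(robots.crawl_delay("*") or 1)
    response = requests.get(PAGE_URL, timeout=30)
    # Proceed with scraping only on HTTP status code 200.
    if response.status_code == 200:
        html = response.text
```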
B. Data Preparation
Data in its raw form may contain errors or be incompatible with the structure of AII, so it must be processed before integration. AII data is processed in three interconnected stages: cleaning, transformation, and validation.
Cleaning and Transformation
Though they are two distinct stages, cleaning and transformation often occur at the same time. Cleaning refers to identifying and correcting or removing erroneous records: those that are incomplete, inaccurate, irrelevant, or duplicated.
Transformation, in contrast, converts raw data into a desired format. Because AII is ultimately intended for public use, Wilson Center staff aim to convert raw data into formats more easily understood by human users.
Given that each AII data source contains unique records and encoding methods, cleaning and transformation were tailored to each source.
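The exact operations differ from source to source, but a typical cleaning-and-transformation pass might resemble the following pandas sketch; the file, column names, and status encoding are invented for illustration:

```python
import pandas as pd

df = pd.read_csv("arctic_ports_raw.csv")

# Cleaning: remove duplicated records and rows missing required fields.
df = df.drop_duplicates()
df = df.dropna(subset=["name", "latitude", "longitude"])

# Transformation: replace machine-oriented codes with labels that human
# users can read at a glance. The mapping here is illustrative only.
status_labels = {"1": "Operational", "2": "Under construction", "3": "Planned"}
df["status"] = df["status"].astype(str).map(status_labels)

df.to_csv("arctic_ports_clean.csv", index=False)
```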
Validation
Data validation ensures the quality of records that have undergone cleaning and transformation. Whenever possible, AII records were cross-referenced with data from multiple sources to confirm their validity.
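As one way to picture this cross-referencing, records from two independent sources can be joined on a shared field and coordinate disagreements flagged for manual review; the files, join key, and tolerance below are assumptions rather than AII’s actual parameters:

```python
import pandas as pd

# Hypothetical prepared files describing the same facilities.
primary = pd.read_csv("arctic_ports_clean.csv")
secondary = pd.read_csv("arctic_ports_secondary.csv")

# Cross-reference the two sources on a shared identifier.
merged = primary.merge(secondary, on="name", suffixes=("_a", "_b"))

# Flag records whose coordinates disagree by more than ~1 km for review.
lat_off = (merged["latitude_a"] - merged["latitude_b"]).abs() > 0.01
lon_off = (merged["longitude_a"] - merged["longitude_b"]).abs() > 0.01
print(merged.loc[lat_off | lon_off, "name"].tolist())
```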
C. Data Reporting
Once AII records were prepared for integration, they were combined into a single .csv file using a Python script. The data from the file was then uploaded to the online AII portal, where users may interact with it using a search function or download the file itself.
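That combining step can be as simple as concatenating the prepared per-source files, along the lines of the sketch below (the directory layout and file names are assumed, not taken from the actual AII script):

```python
import glob

import pandas as pd

# Gather every prepared per-source file (hypothetical naming scheme).
frames = [pd.read_csv(path) for path in sorted(glob.glob("prepared/*.csv"))]

# Combine the sources into the single file served on the AII portal.
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("aii_combined.csv", index=False)
```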
If users notice any missing or erroneous records, they are free to contribute to the data reporting process and submit projects for inclusion or correction. For more information about decisions to include and exclude certain records from the reporting process, please see the criteria section.
D. Limitations
There are several limitations to using the methodology described in this document. It is unlikely that all data will be captured by this methodology due to the following factors:
- not all infrastructure data is publicly available
- search engines may return incomplete data due to deficiencies and biases in algorithms, indices, and/or queries
- data collected from third-parties may be erroneous
The Wilson Center will make efforts to counter these limitations by:
- consulting governmental and nongovernmental entities directly to gain access to data that may not have appeared in search results or public web pages
- using multiple data sources to “plug” each other’s gaps
- cleaning and validating third-party data