The population of the United States grew from an estimated 5.3 million in 1800 to 309 million people in 2010 (1). On the basis of the definitions from the Census Bureau, the share of the U.S. population living in urban areas grew from 6 to 81% over this period. Urbanization occurred through population growth and the transformation of physical landscapes and ecological systems into developed land. Thus, researchers typically measure these changes through either population- or land-based methods [e.g., (2–6)]. While these two perspectives paint different but complementary pictures of urbanization, they are also sensitive to the scale of measurement. Moreover, because consistent and detailed historical information on local land use and local population change is lacking, our knowledge of the historical development of the United States is far from complete. Advancing such knowledge would greatly improve our understanding of the broad impacts of urbanization and allow for refined projections of demographics and the built environment.
The absence of detailed historical population data before the mid-20th century severely constrains any population-based assessment of urban processes. Although the U.S. Census records are made publicly available after a period of 72 years, spatially registering and encoding these data are resource-intensive. While researchers have begun to transcribe and extract these data for fine-scale analysis [e.g., (7–9)], publicly available historical population data are accessible only at coarse spatial resolution [e.g., county boundaries; (10)]. This coarse resolution in combination with boundary changes over time (fig. S1) poses a major barrier to studying historical urbanization in the United States using census data [e.g., (11, 12)].
Studying urbanization from a land perspective typically includes land use or land cover data, or, more recently, settlement layers that provide consistent spatial data on the timing, location, and nature of land use. Although many historical maps contain detailed information on land use over long time periods, their extraction at fine resolution is prohibitively costly because of the volume, complexity, and low quality of such graphical documents (13). Most prior efforts to characterize historical fine-grained settlement or land cover changes rely on remote sensing imagery, which is constrained to the post-1970 era of satellite technology [e.g., (14–17)]. Such historical satellite-derived data are usually coarsely classified, provide limited information on the specific characteristics of built-up land, and are often less accurate for rural areas (18, 19).
In this study, we present a new means of understanding the speed, spread, and nature of urbanization in the United States from 1810 to 2015. We use gridded settlement layers from the Historical Settlement Data Compilation for the United States [HISDAC-US; (20)], which is derived from property records compiled in the Zillow Transaction and Assessment Dataset (ZTRAX). HISDAC-US describes the built environment of most of the conterminous United States back to 1810 at fine temporal (5 years) and spatial (250 m) granularity using different settlement measures. These measures include the number of built-up property records, which can refer to individual properties or units within built-up properties (BUILD) in a grid cell in a given year and the built-up intensity (BUI), or the sum of gross indoor area of all built-up properties. We also extracted for each grid cell the first built-up year (FBUY), which is the earliest construction year on record. For larger analytical units, such as counties, we derived the built-up area (BUA), or the number of grid cells overlapping with one or more built-up properties in a given year.
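The settlement measures defined above can be illustrated with a minimal sketch. It assumes a hypothetical input of property records given as tuples of projected coordinates (in meters), built year, and indoor area; the record layout and the function names are illustrative assumptions and do not reproduce the actual HISDAC-US processing chain.

```python
from collections import defaultdict

CELL_SIZE_M = 250  # grid resolution stated in the text


def cell_of(x_m, y_m, cell_size=CELL_SIZE_M):
    """Map projected coordinates (meters) to a (col, row) grid cell index."""
    return (int(x_m // cell_size), int(y_m // cell_size))


def settlement_layers(records, year):
    """Aggregate records built in or before `year` into per-cell measures.

    Each record is a hypothetical tuple (x_m, y_m, year_built, indoor_area).
    Returns BUILD (record count per cell), BUI (summed indoor area per cell),
    FBUY (earliest built year per cell), and BUA (number of built-up cells).
    """
    build = defaultdict(int)
    bui = defaultdict(float)
    fbuy = {}
    for x_m, y_m, year_built, indoor_area in records:
        if year_built > year:
            continue  # property did not yet exist in the requested year
        cell = cell_of(x_m, y_m)
        build[cell] += 1
        bui[cell] += indoor_area
        fbuy[cell] = min(fbuy.get(cell, year_built), year_built)
    bua = len(build)  # cells overlapping one or more built-up properties
    return dict(build), dict(bui), fbuy, bua
```

Calling `settlement_layers(records, 1910)` and `settlement_layers(records, 2010)` on the same records yields two snapshots whose difference describes the change in each measure over the intervening century.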
The principal goal of this analysis is to foreground the value of these novel data in providing insight into long-term settlement and urban development. Building on Leyk and Uhl (20) and other ongoing efforts (9), we leverage the HISDAC-US data to undertake an unprecedented multiscale analysis of the history of U.S. urban development and settlement. These new data can be leveraged to explore and characterize fundamental processes of urban growth through measurement of changes in the built environment, potentially providing insights into the fundamental drivers of development patterns. We anticipate that these measures and insights will provide vast new opportunities to study and understand the history of U.S. urbanization from a land-based perspective.
In addition to new land-based assessments of urban change and development, these novel data also unlock new opportunities to model the spatial distribution of population in the past. Our motivation in this regard is rooted in recent research. First, while recent work shows relationships between historical population counts and built-up property attributes, this analysis is confined to the national scale and lacks the spatial detail necessary for understanding variation and change (21). Second, data on developed or built-up land are often used as the main ancillary variable in population modeling using dasymetric refinement approaches. This refinement method is a form of areal interpolation that makes use of relationships between the target variable (population) and the ancillary variables used for subunit estimation [e.g., (22–25)]. Third, parcel data combined with population and road network data were used in recent efforts to study long-term urbanization processes within U.S. cities (26), but such data have not been available for the entire nation to date. Last, researchers applied similar principles of land availability and suitability to disaggregated historical census summary statistics and created fine-resolution population distributions (27). However, these approaches lack robust testing and validation for the years before 2000. Given this body of research, we argue that data products such as the HISDAC-US provide unique opportunities to model not only changes in the built environment but also, potentially, fine-grained historical population estimates. Progress in this area could unlock new possibilities for the spatiotemporal analysis of urbanization in the United States, combining both land- and population-based perspectives.
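The principle behind dasymetric refinement can be sketched in a few lines. The example below assumes the simplest possible variant, in which the population of a source unit (e.g., a county) is distributed to grid cells in proportion to a single ancillary weight such as BUILD; the function name and inputs are hypothetical, and published approaches (22–25) typically use richer weighting schemes.

```python
def dasymetric_allocate(unit_population, cell_weights):
    """Distribute a unit's population to cells in proportion to their weights.

    cell_weights maps a cell identifier to an ancillary value (e.g., BUILD).
    Cells with zero weight receive no population; the allocated values sum
    to unit_population whenever the total weight is positive.
    """
    total_weight = sum(cell_weights.values())
    if total_weight == 0:
        # No ancillary signal: nothing can be allocated in this sketch.
        return {cell: 0.0 for cell in cell_weights}
    return {
        cell: unit_population * weight / total_weight
        for cell, weight in cell_weights.items()
    }
```

For instance, a unit of 1000 people with two cells holding 3 and 1 built-up property records would receive 750 and 250 people, respectively, under this proportional rule.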
While our main findings confirm well-known broad diffusion patterns of urbanization, HISDAC-US settlement layers enable us to identify detailed building trajectories as well as expansion and densification patterns at various spatial scales. The fine granularity of the data is depicted by maps of the FBUY (Fig. 1) and the number of built-up property records (BUILD; Fig. 2, B to F) from 1810 to 2010. Finer-scale data break down broader national and regional development trends and describe, for example, local processes of expanding urban and suburban areas or infilling in built-up places during different time periods. We also demonstrate relationships between settlement measures and population growth. We estimate that, on average, each additional built-up property at the county-level is associated with around 2 to 2.25 additional people with some regional variation. This finding is notable as there are, at present, no reliable estimates of long-term, fine-resolution population growth for the United States. Thus, we argue and demonstrate that the novel HISDAC-US data provide an unprecedented opportunity to study and understand long-term urbanization and settlement processes at fine spatial and temporal granularity from the beginning of the 19th century to today.
Taking a land perspective on urbanization: Where, when, and how much land was built-up?
Using the fine temporal and spatial granularity as well as different built characteristics, we elucidate new spatiotemporal settlement patterns. With these patterns, we draw a detailed picture of the evolution of built-up land use in the conterminous United States from 1810 to 2015.
Mapping the earliest recorded built-up properties (FBUY) within boundaries of varying spatial scale, we find that urban development trends are strongly dependent on the size of the spatial unit used (broad national to local; Fig. 1). By using the contemporary county boundaries of the 2010 decennial census as consistent mapping units, we observe two primary, well-known national trends (Fig. 1, A and B). First, we find trends of urban development diffusing westward from Northeastern and coastal Southern states into the interior of the United States, including the eastern parts of Texas, Kansas, and Arkansas. These trends unfold later in the Appalachian Mountains and parts of Florida, likely because of the rough topography and limited habitability of these areas. Second, while buildings in counties in the central and western states of the United States tend to be newer than their Eastern counterparts, there are isolated counties across the western seaboard and interior regions (e.g., Denver and Wichita Falls; Fig. 1, D and E) that experienced particularly early waves of development relative to the rest of their respective states. In many instances, these nodes of early development predate the demarcation of these regions as U.S. states.
By assigning FBUY values to smaller spatial units such as individual grid cells of specific size (e.g., 2,500-m resolution, Fig. 1C; or 250-m resolution, Figs. 1D and 2A), we are able to assess local settlement trends within consistent spatial units that break down the county-level patterns. For example, early settlement and growth along Colorado’s Front Range emanates from a number of isolated centers, with Denver being the largest (Fig. 1, D and E, left). Also, earliest records of development in more rural settings of the state (in mountainous areas or in the plains) appear spatially related to streams and topographic conditions that facilitated development, livelihood, access to water, and transportation. In contrast, new development in Kansas spreads as a broader national pattern of westward expansion, rather than as sprawl from discrete larger urban hubs (Fig. 1, D and E, middle). In Ohio, urban centers begin to overlap over time as they expand into one another (Fig. 1, D and E, right). These patterns illustrate the opportunities provided by such multiresolution data for detecting local-to-regional scale settlement and land development trends over long time periods.
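Moving FBUY between resolutions can be sketched as follows. We assume here that the FBUY of a coarser cell is the minimum FBUY of the finer cells it contains, which follows from the definition of FBUY as the earliest built year on record; the dictionary-based grid and the function name are illustrative assumptions.

```python
def coarsen_fbuy(fbuy_fine, factor=10):
    """Aggregate fine-grid FBUY values to a coarser grid.

    fbuy_fine maps (col, row) indices of the fine grid to the earliest built
    year in that cell. A factor of 10 turns 250-m cells into 2,500-m cells.
    The coarse value is the minimum (earliest) year among contained cells.
    """
    coarse = {}
    for (col, row), year in fbuy_fine.items():
        key = (col // factor, row // factor)
        coarse[key] = min(coarse.get(key, year), year)
    return coarse
```

The same minimum rule would apply when assigning FBUY to irregular units such as counties, given a lookup from each cell to its containing unit.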
We used the built year of each property in combination with building attributes to compute time series of various settlement variables at different resolutions to more holistically measure local and regional development trends. As discussed above, these settlement measures include the total number of built-up properties (BUILD), the BUI of land derived from the sum of indoor floor area of existing built-up properties, and the number of grid cells built-up within a chosen unit or BUA (see Materials and Methods for details). We extracted these variables within consistent spatial units across time periods to generate long-term trajectories (e.g., fig. S2 at the state-level) and multitemporal spatial distributions (e.g., fig. S3 at the county-level) to characterize variation in settlement patterns over time. We illustrate county-level estimates of BUILD and its change every 5 years between 1810 and 2015 and spatial clusters for each point in time (movie S1). However, the full details of local settlement processes can only be uncovered at the finest granularity.
Taking Rockingham County, NH and the areas surrounding New Hampshire and Massachusetts as an example region, we trace spatial distributions of BUILD at the finest spatial resolution of 250 m over five points in time (1810, 1860, 1910, 1960, and 2010; Fig. 2, B to F). This analysis allows us to track the number of built-up property records at the grid cell-level and better understand local urban growth processes. In this particular case, the cities of Manchester, NH; Newburyport, MA; Amesbury, MA; and Portsmouth, NH grew as separate small urban hubs until 1860. Fewer built-up properties were established in rural parts of the area along roads during the early and mid-1800s. By 1910, Manchester grew substantially, in area and density, while the port cities developed at slower rates. This trend continued, and by 1960, low-density settlement in rural areas had expanded along roads to increasingly connect higher-density urban hubs. Furthermore, during this time period, development increased rapidly along the coastline. Last, by 2010, the area had experienced intensified sprawl in its southern and coastal regions, a continued expansion of urban hubs, and increasing densification in the South, which grew into a larger urban and suburban conglomerate. Such subcounty, temporal settlement patterns have the potential to yield vast new insight into the geographical unfolding and intensity of local urban development processes.
Land-based measures of change characterize types of urban development at varying scales
Fine-scale settlement layers provide unique opportunities to distinguish between land-based processes of urbanization such as expansion and densification. Expansion refers to the amount (or proportion) of new developed area over time, and densification is the ratio of the change in BUI to the change in BUA over time (see details in Materials and Methods).
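As a minimal sketch of these two definitions, assuming BUA and BUI totals are available for a spatial unit at the start and end of a period, expansion and densification could be computed as shown below. The function name and the handling of periods without newly developed area are our assumptions; the exact formulation is given in Materials and Methods.

```python
def expansion_and_densification(bua_start, bua_end, bui_start, bui_end):
    """Return (expansion, densification) for one unit and one period.

    expansion     = change in built-up area (number of cells), i.e., delta BUA
    densification = delta BUI / delta BUA, or None when no new area appeared
                    (the ratio is undefined in that case; this is an
                    assumption of the sketch).
    """
    delta_bua = bua_end - bua_start
    delta_bui = bui_end - bui_start
    densification = delta_bui / delta_bua if delta_bua != 0 else None
    return delta_bua, densification
```

A unit that grows from 100 to 120 built-up cells while its BUI rises from 5000 to 8000 has an expansion of 20 cells and a densification of 150 units of indoor area per newly built-up cell.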
Coarser-scale, county-level maps of expansion and densification reveal notable regional variation (see fig. S4 and movie S1 for a complete sequence of those maps and their spatial cluster maps). These results complement the observed regional settlement patterns but provide more details about the underlying processes of urban growth, often a function of time, infrastructure, and access to technology. Maps of peak timing of densification and expansion (fig. S5) reveal that both processes are temporally associated and vary regionally. For example, along the coastlines of the Southeast and the Southwest of the United States, the vast majority of counties have expansion peaks earlier than densification, indicative of land expansion maxima followed by maximum infilling in already built-up areas. We found the opposite process in the noncoastal Northeast, the Midwest, and parts of the Mountain West. In these areas, development and peak densification occurred over the early to mid-1900s, and expansion—often in the form of sprawl—subsequently unfolded and peaked during the second half of the 20th century.
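The comparison of peak timing can be sketched for a single county as follows, assuming each process is available as a mapping from year to value. The function names and the tie handling are illustrative assumptions.

```python
def peak_year(series):
    """Return the year with the maximum value (first such year on ties)."""
    return max(series, key=series.get)


def peak_order(expansion, densification):
    """Report which process peaks first in a county's time series."""
    year_exp = peak_year(expansion)
    year_den = peak_year(densification)
    if year_exp < year_den:
        return "expansion_first"
    if year_den < year_exp:
        return "densification_first"
    return "simultaneous"
```

Under this sketch, a county in which expansion peaks in 1970 and densification in 1990 would be labeled `expansion_first`, the pattern reported above for many Southeastern and Southwestern coastal counties.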
We assessed city-level measures of expansion and densification for San Francisco, CA; Atlanta, GA; and Boston, MA (Fig. 3A). In Boston, both measures trended gradually upward over time, with the exception of the period from 1920 to 1950, which saw a rapid rise and decline in both expansion and densification; after 2000, the two measures diverged (expansion declining and densification rising). Atlanta and San Francisco exhibit more notable variation. In the sprawling city of Atlanta, densification has remained modest (with some recent increases), but expansion has markedly increased since the mid-20th century. Over the past decade, Atlanta had decreasing expansion and increasing densification. For San Francisco, in contrast, we find the opposite pattern: Expansion remained relatively low over time, but densification continued to rise sharply. These trends are consistent with the widely held view of Atlanta as a sprawling metropolitan region and the greater compactness of San Francisco.
From these trajectories, we can track the development of a city at fine temporal resolution over 200 years and visualize accompanying spatial change patterns at the grid cell-level. By assessing the change in BUA (∆BUA; i.e., locations that were developed during a given time period) and the change in BUI (∆BUI; i.e., interior area added per grid cell during a given time period) in detailed maps (Fig. 3, B and C, respectively), we gain insight into the development mechanisms generating differences across cities. The spatial patterns suggest that San Francisco (Fig. 3B, left) developed under topographic constraints allowing limited new development and creating notable changes in density over the past 100 years. Atlanta (Fig. 3B, middle), in contrast, had a massive increase in developed area since the 1960s and developed into one of the most sprawling cities in the United States with low building density. Last, the spatial patterns for Boston (Fig. 3B, right) are illustrative of a city with an early-developing and high-density urban core. Continued new development and increases in density were more balanced in Boston over the past century than in the other two cities. Thus, these novel data products provide vast new opportunities for measuring and testing proximate patterns and determinants of urban spatial development (e.g., topographical influences on land use). To further illustrate the dynamism of these data products, we developed fine-grained distributions of BUI with a temporal resolution of 5 years for these three cities, as well as Los Angeles, CA; Dallas–Fort Worth, TX; and Philadelphia, PA between 1810 and 2015 (movie S2).
Dissecting and measuring forms of growth at fine scale in urban and rural areas
Across the conterminous United States, we find that land-based expansion and densification show converging and diverging trends over recent decades, particularly in more developed counties. We created trends of densification (Fig. 4A) and absolute expansion (Fig. 4B) for counties in two strata, which we refer to as rural and urban, over time. The rural stratum is composed of counties that have less than the 66th percentile of BUI across all counties (using the 2010 census boundaries), calculated individually for each year. The urban stratum is defined by counties with BUI values greater than the 66th percentile. This stratification allows us to assess how settlement in relatively more and less developed places changed over time.
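The stratification rule can be sketched for a single year as follows. The sketch assumes a nearest-rank percentile and assigns counties exactly at the threshold to the rural stratum; both choices, as well as the function names, are our assumptions, because the text does not specify the percentile method or the treatment of ties.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile of a collection of numbers (an assumed method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def stratify(bui_by_county, pct=66):
    """Split counties into rural and urban strata for one year.

    bui_by_county maps a county identifier to its BUI in that year.
    Urban counties have BUI strictly above the threshold; all others,
    including ties, are treated as rural in this sketch.
    """
    threshold = percentile_nearest_rank(bui_by_county.values(), pct)
    rural = {c for c, v in bui_by_county.items() if v <= threshold}
    urban = {c for c, v in bui_by_county.items() if v > threshold}
    return rural, urban, threshold
```

Because the threshold is recomputed for each year, a county can move between strata over time, so `stratify` would be called once per 5-year time step.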
The two strata have different trajectories for both measures with significantly higher values in the urban stratum. For urban counties, both measures have an increasing trend up to the 1930s (Fig. 4, A and B). After 1940, expansion increases markedly until the early 2000s but decreases notably during the past decade (Fig. 4B). Densification has varying trends since 1930: It levels off for a short time, increases between 1940 and 1960, then decreases until the 1980s, and since then, increases sharply until 2010 (Fig. 4A). The rural stratum shows continuous increases in both measures, steepest for densification between 1910 and 1960 and for expansion between the 1940s and 1980s, somewhat temporally offset to densification. Both measures remained relatively constant between 1980 and 2010. Counties in the urban stratum have significant variability indicating wide ranges of expansion and densification values, likely found in different regions. In general, we find compelling differences in comparing the two strata that appear to characterize the rural-urban divide in the development of the conterminous United States.
To better understand the observed growth patterns, we spatially decomposed the trends of built-up interior area (BIA), which is the BUI aggregated across the whole United States, within both rural and urban counties into different types of growth. The different growth categories include midrange expansion (i.e., the appearance of newly built-up cells), peripheral growth (i.e., in previously built-up cells at the edge of larger BUAs), and internal growth (i.e., in previously built-up cells in inner parts of BUAs; Fig. 4C). The resulting trends illustrate the magnitudes of BIA across and within strata (Fig. 4D). By 2010, BIA in the urban stratum is roughly 10 times greater than in the rural counties. Within the urban stratum, the dominant type of growth has been peripheral growth followed by midrange expansion. These two types of growth are very similar in the rural stratum. Internal growth has the lowest values of BIA, but its proportion has been notably higher in the urban stratum in the past.
To examine the relationships and changes between different types of growth across each stratum, we computed ratios of changes of BIA in previously built-up cells (i.e., peripheral and internal growth) to all changes in BIA (in previously and newly built-up cells; Fig. 4E). For urban counties, we see a steep increase of the proportion of previously built-up land to overall growth until a peak in the early 1930s, when approximately 85% of new growth happened as either peripheral or internal growth. This percentage declined to approximately 62% in 2010, likely as a result of increased expansion (newly built-up land, often in the form of sprawl). Peripheral growth, which is higher than that of internal growth, has a peak around 1900 at 55% and since declined to 42%. In contrast, internal growth increased steeply until it reached a peak in the early 1930s at 40% and declined until 2000 to 25%. During the past decade, internal growth shows a slight uptick, which corresponds to increasing densification, seen in urban counties. In rural counties, the proportion of internal and peripheral growth combined increases steeply to approximately 60% in the 1950s and since then shows varying trends between 55 and 65%. As internal growth never exceeded 20%, most of these trends are driven by peripheral growth.
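The growth categories and the ratio described above can be sketched on a grid as follows. The sketch assumes that a previously built-up cell counts as internal when all eight of its neighbors were already built-up at the start of the period and as peripheral otherwise; this neighborhood rule and all names are illustrative assumptions and may differ from the decomposition shown in Fig. 4C.

```python
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def classify_growth(built_before, added_area):
    """Sum added indoor area by growth type for one period.

    built_before: set of (col, row) cells built-up at the period start.
    added_area:   dict mapping (col, row) to indoor area added in the period.
    """
    totals = {"new": 0.0, "peripheral": 0.0, "internal": 0.0}
    for (col, row), area in added_area.items():
        if (col, row) not in built_before:
            kind = "new"  # newly built-up cell (expansion)
        elif all((col + dc, row + dr) in built_before for dc, dr in NEIGHBORS):
            kind = "internal"  # fully surrounded by built-up cells
        else:
            kind = "peripheral"  # built-up cell at the edge of a built-up area
        totals[kind] += area
    return totals


def share_previously_built(totals):
    """Ratio of growth in previously built-up cells to all growth."""
    overall = sum(totals.values())
    if overall == 0:
        return None  # no growth in the period
    return (totals["peripheral"] + totals["internal"]) / overall
```

Applying `share_previously_built` to each 5-year step within a stratum would yield a trajectory of the kind plotted in Fig. 4E.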
The main trends in rural and urban counties converge over time, indicating that by 2010, the share of growth occurring in previously built-up cells, relative to all growth, is very similar in both strata (between 60 and 62%). This convergence also indicates that during the past seven decades, the proportion of growth due to expansion has been increasing in urban counties and slightly decreasing in rural counties. We expect significant regional variability in these patterns when they are evaluated for different geographic units (e.g., states or counties), and we expect trajectories to deviate when different criteria are used to define the urban and rural strata.
Supporting a population perspective of urban development: Settlement as a reliable predictor of historical and contemporary population
We conclude our results by using a panel analysis approach to illustrate that built characteristics can meaningfully capture human settlement and urbanization patterns (28). This method serves as a test for whether the settlement layers can support population modeling for the study of urban development. In this analysis, we predicted population counts from the decennial censuses of 1860, 1910, 1960, and 2010 by the number of built-up property records (BUILD) observed in the HISDAC-US data; we tested all land-use types together and residential land use only. We relied on BUILD because it has the highest overall correlation with population counts over time in comparison to other settlement measures (Fig. 5). Through this analysis, we attempted to accomplish two objectives. First, we examined how much of the temporal variation in population can be explained using BUILD. Second, we estimated the number of people associated with each additional built-up property in a county.
On the basis of the R² values for a pooled ordinary least squares (OLS) regression model of all counties from 1860 (Table 1), BUILD based on all land-use types explains almost 93% of the variation in population across counties over time (column 1; R² = 0.926). This result holds even when we restrict the sample to counties with consistent boundaries through time (<10% change in area measures; column 2; R² = 0.898; see Materials and Methods for details). As these models include no other control variables, we conclude that BUILD appears to be highly effective in characterizing county-level changes in population over time. We suspect that much of the variation across these models is a function of changes in household size and the distribution of dwelling units by size over time and space. The estimates from the standard OLS models suggest that, on average, one unit increase in BUILD is associated with an increase of around 2.6 to 2.7 people. There are, however, many difficult-to-observe reasons for why counties with more or fewer built-up properties differ in population (e.g., many coastal cities have both economic opportunity and high-density building stock due to land constraints).
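A pooled OLS model with a single regressor can be sketched as follows. The sketch assumes an intercept term and pools all county-year observations into two flat lists; the function name and the synthetic inputs are illustrative and do not reproduce the estimates in Table 1.

```python
def ols_simple(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared). Here x would hold BUILD and y
    population for all pooled county-year observations.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared
```

In this formulation, the slope is interpreted as the average number of people associated with one additional built-up property record, and R² is the share of variation in population explained by BUILD alone.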
We also assess the robustness of our results to omitted variables by presenting more conservative estimates when regressing changes in population on changes in BUILD (columns 3 to 5 of Table 1). We ran a least squares dummy variable (LSDV) model with county-level fixed effects (column 3) in which the model variation comes from population changes within counties over time [within estimator; (29)], revealing a consistent and significant relationship of around 2.2 people for each additional built-up property within a county. We ran a generalized least squares (GLS) estimator (column 4) and controlled for potential decadal trends in population and BUILD (column 5), producing generally consistent estimates. Thus, our analyses suggest that, on average, an additional built-up property in each county is roughly associated with a 2.2- to 2.25-person increase in the total population. Although the quality of HISDAC-US data is considerably poorer before 1860, our analyses using earlier starting points yield very similar results (table S1). Results were very similar for BUILD based on all land-use types (β = 2.246, R² = 0.873), as well as residential land-use types only (β = 2.246, R² = 0.875). We examined the effect of regional variation by running the same GLS estimator shown in column 5 of Table 1 for the four regions Northeast, South, Midwest, and West (table S2). Coefficients vary between 2.029 and 2.537, indicating low levels of regional variation in the statistical relationship at the county level. While these results are robust and provide strong indication of predictive power of BUILD for population at the county level, the observed effects of spatial and temporal variability have to be further investigated, particularly at finer spatial scales.
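The within estimator underlying the fixed-effects model can be sketched as follows. For a single regressor, demeaning BUILD and population within each county and regressing the demeaned values gives the same slope as an LSDV model with county dummies. The function name and the synthetic panel are illustrative assumptions; the sketch omits standard errors, the GLS variant, and decadal controls.

```python
from collections import defaultdict


def within_estimator(panel):
    """Estimate the BUILD coefficient using only within-county variation.

    panel is a list of (county_id, build, population) observations.
    County-specific means are removed from both variables, which absorbs
    any time-invariant county characteristic (the fixed effect).
    """
    by_county = defaultdict(list)
    for county, build, population in panel:
        by_county[county].append((build, population))

    sxy = 0.0
    sxx = 0.0
    for observations in by_county.values():
        mean_b = sum(b for b, _ in observations) / len(observations)
        mean_p = sum(p for _, p in observations) / len(observations)
        for b, p in observations:
            sxy += (b - mean_b) * (p - mean_p)
            sxx += (b - mean_b) ** 2
    return sxy / sxx
```

Two synthetic counties with very different baseline populations but the same per-property increment return that shared increment, illustrating how level differences between counties drop out of the estimate.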
Fine-scale spatial and temporal data improve our understanding of long-term settlement patterns
Settlement patterns can only be fully understood from a multiscale perspective (30) that characterizes local, regional, and national patterns of urban development and land-use change. Through our unique data products with unprecedented temporal coverage and fine spatial and temporal resolution, we are able to provide new multiscale depictions of historical settlement. From these depictions, we identify time periods of slow or fast growth and characterize different urban processes that are only discoverable at very fine scales. Our results document and analytically evaluate regional and local patterns depicting rural-urban transformations, urban expansion and peripheral growth, as well as densification and infilling processes.
We envision that our new measures on when structures were built (FBUY), the number of built-up property records (BUILD), the BUI, and the BUA at a given point in time as well as derived process measures such as expansion or densification will enable new opportunities to answer scientifically and theoretically grounded questions in urban research. For example, in ongoing projects, we have started to deploy these data to better understand the changes in the built environment that unfolded in U.S. cities related to residential segregation, postwar suburbanization, and the more recent resurgence of central city areas (9). Thus, we see enormous potential in these data for examining landscape evolution, fragmentation, and the role of technology, economic, and social forces in shifting the contours of urban development.
Methodologically, the use of gridded settlement time series allows researchers to conduct their analyses consistently with studies that apply remote sensing images [e.g., (31, 32)] and extract urban or developed land within any spatial unit. However, the HISDAC-US data are less limited temporally than remote sensing products that cover time periods of no more than three to four decades. Furthermore, the HISDAC-US layers are more accurate (20), richer in attribution related to the built environment, and cover a time period of more than 150 years for most of the conterminous United States. By tackling critical process questions in urbanization, we use the new HISDAC-US data-derived measures of development to connect data-scientific analysis of large spatiotemporal data and substantive inquiry in urban geography, demography, and land use science.
Detailed built environment attributes enable holistic examination of settlement and urban development
Temporal trajectories of different settlement measures within spatial units of interest, such as counties, cities, or tracts, provide a detailed picture of the complexity of long-term development in the United States. This knowledge of development fuels our understanding of when, where, and how quickly humans have urbanized the country. Evaluating the interrelationship between different settlement measures is essential to understanding how the nature of urban development has differed across time and space. We demonstrate that settlement is difficult to describe in either univariate or linear terms, and different development attributes follow timelines that vary across urban strata and regions, which are likely dictated by existing infrastructure, technology, and the developability of land (such as in coastal ecosystems). Complementing other findings [e.g., (33)], we also demonstrate that processes such as densification and expansion are interrelated temporally. However, we found that the synchrony between the peaks of those processes varies greatly across regions and cities, pointing to different forms of historical settlement and urban development. These types of development vary markedly between coastal and interior areas, northern and southern regions, and with topographic constraints and environmental conditions. With these insights, researchers can draw an unprecedented picture of the nature and timing of rural, suburban, and urban development in the United States at varying scales. The advances in our understanding of settlement processes have the potential to inform ongoing discussions about the spread and compactness of urban areas (2, 34, 35).
Following the paradigm of “people are where people build,” this study demonstrates an effective way to estimate historical population at fine spatiotemporal granularity
There is a common understanding in the fields of rural studies, urban geography, and demography that the built environment is related to population and other demographic attributes (21, 27). These insights provide the basis for a population-based perspective on urban development assessments. Our panel analysis results demonstrate that historical settlement layers in HISDAC-US (20) are associated with population at relatively fine spatial granularity (i.e., counties). Such results are important in two distinct but related ways.
First, the predictive power of the population models indicates that the settlement-population relationship is highly robust. These models enable us to build county-level population data over more than 150 years at fine temporal resolution. These data help overcome the dependence on traditional decadal census surveys [e.g., (36)] and may support the creation of future population assessments to improve population projections. Such model outcomes can be used to create time series of consistent population estimates (e.g., within contemporary county boundaries from the 2010 census) to perform unprecedented temporal analysis. Using these analytical innovations, demographers and urban modelers can study demographic processes related to rural-urban transitions over long time periods at meaningful spatial scales and inform population projections.
Second, the robust settlement-population linkages indicate the potential for reproducing such population models within different spatial units including census units of finer spatial granularity (e.g., census tract boundaries of 2010) or alternative geographic units. For example, researchers might need to estimate population and its changes within certain land cover classes or zones of high vulnerability to natural or industrial hazards. The fine resolution of the settlement layers makes it possible to model population at fine scales using attributes such as the number of built-up property records or BUI allocated to such alternative analytical units. Such advances will greatly benefit research on coupled socio-environmental systems and improve our understanding of existing interrelationships and processes. However, variance in the relationship between population and built environment attributes across time and space requires further investigation. While our comparison of county-level relationships by region produces quite consistent results, we have yet to investigate the stationarity of these relationships at finer spatial scales. We suspect that land-use type will play a particularly crucial role in inferring small-area population quantities from built-environment data.
Novel and extensive spatial information necessitates serious investigation into uncertainty and potential data limitations
While the use of such novel data layers opens unprecedented research opportunities, it is important to inform data users about existing uncertainties. In Leyk and Uhl (20), some of these temporal, positional, and thematic uncertainties are reported, assessed, and measured in detail. For our analysis, the settlement layers were systematically corrected on the basis of focal raster operations and adjusted using census data (see Materials and Methods). However, while these adjustments reduce some of the inherent bias and result in population models with high predictive power, the reported missingness in the original ZTRAX data will still potentially cause underestimation of settlement and has to be considered for critical use of the data products in subsequent analyses. We expect these issues to diminish as Zillow continues to update its database, but certain data gaps will always remain. Furthermore, it is important to note that temporal information, such as the FBUY, does not necessarily indicate the year of the first settlement but represents the earliest built year on record in the ZTRAX database for currently existing buildings. Thus, we may not know about earlier built units that have been demolished and rebuilt (or not rebuilt) or that still exist but lack built-year records (Fig. 1B). This uncertainty varies across regions and can be addressed by sensitivity analysis and detailed case studies where high-quality data are available.
Future steps to leverage these new opportunities will explore the creation of settlement estimates at finer spatial and temporal resolution as well as the inclusion of demographic variables and ancillary data for alternative geographies to fully use the potential of these data layers for improved fine-scale urban and population modeling. Of particular interest is the estimation of alternative demographic and housing-related attributes to create a more insightful picture of the human-built environment and its population. These fundamental components will enable the research community to advance research and theory on urban studies, land use science, natural hazards, landscape ecology, and other interdisciplinary pursuits (9, 37, 38). Using the attribute richness of the settlement data, researchers can explore questions of great societal importance at spatial and temporal scales relevant to the operational scale of urban and human-environmental processes including local rural-urban transitions, changes in ecological services, and trends in land fragmentation.
MATERIALS AND METHODS
Settlement and census data
We use the ZTRAX to derive data products that can be used for the extraction of settlement measures at different points in time. ZTRAX is a geocoded housing and property-level database based on existing cadastral data sources that contains more than 374 million data records for approximately 200 million parcels in over 3100 counties in the United States (https://zillow.com/ztrax). Zillow Group is an online real estate database company that was founded in 2006. We extracted attributes such as the land-use class, the construction year of the structure on a parcel, and geolocation information (e.g., an approximate location for an address point) to create time series of raster layers. The workflow for creating the spatiotemporal database model, an SQLite database with spatial query extension, and the data products used in this study are described in full detail in Leyk and Uhl (20). The data layers are collected in the HISDAC-US, which is organized as a collection of datasets at the Harvard Dataverse repository (https://dataverse.harvard.edu/dataverse/hisdacus). First, we produced a series of semi-decadal raster layers representing the BUI, the sum of gross indoor area of all built-up properties in a grid cell (250 m by 250 m) in a given year between 1810 and 2010. Second, for the same time period, we also produced a series of semi-decadal raster layers representing the number of built-up property records (BUILD) in a cell in a given year. Third, we built a composite raster layer that indicates for each raster cell the first year a built unit was established (FBUY). Last, we derived the BUA as the number of grid cells in a spatial unit of interest (e.g., counties) with at least one built unit in a given year. The spatial resolution of all raster layers is 250 m, and the temporal resolution available in HISDAC-US is 5 years.
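As a rough illustration of how FBUY and BUA relate to the underlying property records, the following minimal Python sketch derives both from a small list of records. The record layout (a grid-cell identifier plus an optional built year) and the function names are assumptions made for illustration only; the actual HISDAC-US workflow is described in Leyk and Uhl (20). The sketch also assumes that a cell counts as built up from its earliest recorded built year onward.

```python
def first_built_up_year(records):
    """Return {cell: earliest built year} from (cell, built_year) pairs.

    Records whose built year is None (missing) are skipped, so cells that
    contain only such records do not appear in the result.
    """
    fbuy = {}
    for cell, year in records:
        if year is None:
            continue
        if cell not in fbuy or year < fbuy[cell]:
            fbuy[cell] = year
    return fbuy


def built_up_area(fbuy, year):
    """Count grid cells with at least one built unit on record by `year`."""
    return sum(1 for first_year in fbuy.values() if first_year <= year)


# Hypothetical records: (row, col) cell identifier and built year.
records = [
    ((0, 0), 1885),
    ((0, 0), 1990),
    ((0, 1), 1950),
    ((1, 1), None),   # record without a built year
    ((1, 0), 2004),
]
fbuy = first_built_up_year(records)
bua_by_year = {year: built_up_area(fbuy, year) for year in (1900, 1950, 2010)}
```

In this toy example, the cell that holds only a record without a built year never contributes to BUA, which is the kind of underestimation the correction procedure in this section is meant to mitigate.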
HISDAC-US also contains uncertainty layers at the pixel and county levels (20) that the data user is urged to use for the assessment of positional, temporal, and thematic uncertainty. First, in some counties, a proportion of the records lack a construction year. Also, in some instances, the year refers to the most recently built unit, and it remains unknown whether there was a structure before; in other cases, several built years are given, indicating, for example, the very first year and the most recent one. Second, the land-use class attributes vary across counties and states but have been generalized and consolidated to some degree, making them more comparable across the nation. Third, latitude/longitude coordinates are missing for a portion of the records; these records still indicate the county but cannot be located at fine scale. The geolocation records represent approximations for the corresponding address, and thus, there is inherent positional uncertainty that needs to be addressed.
Census data and boundary files at the county-level were collected from the National Historical Geographical Information System [NHGIS; (10)]. We used the contemporary county boundaries (2010 census) to extract ZTRAX measures at different points in time. To build our population models at the county-level, we used nominal population statistics (persons count) in 1810, 1860, 1910, 1960, and 2010 and the corresponding time-specific county boundaries from the NHGIS website (fig. S1) as well as the number of housing units in 2010 for our correction procedure, as described below.
Data correction and geoprocessing
We extracted the settlement measures (BUILD, BUI, and BUA) from the raster time series within contemporary county boundaries (2010 census) using zonal statistics geoprocessing functions to create settlement measures for different points in time within consistent spatial units. To mitigate some of the data quality issues, particularly the missingness of built-year records as described above, we applied a spatiotemporal correction procedure to improve county-level settlement measures at different points in time as follows. We carried out this procedure for all variables using built-up properties of all land-use types together and for the BUILD variable based on residential land-use type only to test both corrected data versions in the population model.
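The zonal extraction step can be sketched as follows. This is a simplified, pure-Python stand-in for the zonal statistics geoprocessing functions mentioned above, not the tooling used in the study: it assumes that the county raster and the BUILD and BUI rasters are already aligned on the same 250-m grid and are small enough to hold as nested lists. The function name and the toy rasters are hypothetical.

```python
from collections import defaultdict


def zonal_settlement_measures(zones, build, bui):
    """Aggregate cell values per zone (e.g., county) identifier.

    zones, build, bui: equally shaped 2D lists (rows of grid cells).
    Returns {zone_id: {"BUILD": sum of records, "BUI": sum of intensity,
                       "BUA": number of cells with at least one record}}.
    """
    out = defaultdict(lambda: {"BUILD": 0, "BUI": 0.0, "BUA": 0})
    for zone_row, build_row, bui_row in zip(zones, build, bui):
        for zone_id, n_records, intensity in zip(zone_row, build_row, bui_row):
            if zone_id is None:  # cell outside any zone
                continue
            measures = out[zone_id]
            measures["BUILD"] += n_records
            measures["BUI"] += intensity
            if n_records > 0:
                measures["BUA"] += 1
    return dict(out)


# Toy 2 x 3 grid covering two hypothetical counties, "A" and "B".
zones = [["A", "A", "B"],
         ["A", "B", "B"]]
build = [[2, 0, 1],
         [1, 0, 3]]
bui = [[300.0, 0.0, 120.0],
       [90.0, 0.0, 410.0]]
measures = zonal_settlement_measures(zones, build, bui)
```

Repeating this for each raster in the time series yields one BUILD, BUI, and BUA value per county and year within fixed 2010 boundaries.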
We assumed that records in the database without a built year exist at present if they indicate the presence of a built-up property (i.e., in 2015, which is the most recent year in the currently available ZTRAX database) and the likelihood of the actual built year is the same across all years.
For each county, we computed the proportion of missing built-year records (TMiss) in 2015
$$\mathrm{TMiss} = \frac{\mathrm{SumBYMiss}}{\mathrm{SumBuilt}_{2015} + \mathrm{SumBYMiss}} \tag{1}$$
where SumBYMiss is the sum of missing built-year records and SumBuilt2015 is the sum of built-up properties in 2015 with built-year records. Depending on the magnitude of TMiss (i.e., TMiss < 50%, TMiss > 50%, TMiss = 100%; these thresholds can vary as needed), we corrected the contemporary and earlier county-level settlement measures. Of the 3108 counties in 2010, 1636 counties had less than 25% TMiss; 2201 counties had less than 50% TMiss. The spatial and statistical distributions of county-level TMiss are shown in fig. S6.
First, for counties with TMiss < 50% (or another user-defined threshold), relative changes in BUILD, BUI, and BUA were considered reliable. Thus, assuming that records without built-year information existed in 2015, a corrected BUILD2015,corr per county was calculated as
$$\mathrm{BUILD}_{2015,\mathrm{corr}} = \mathrm{BUILD}_{2015} + \mathrm{SumBYMiss} \tag{2}$$
A correction factor was calculated as
$$c_{\mathrm{BUILD}} = \frac{\mathrm{BUILD}_{2015,\mathrm{corr}}}{\mathrm{BUILD}_{2015}} \tag{3}$$
Then, each value of the county-level BUILD time series was multiplied with cBUILD, resulting in a corrected BUILD time series while preserving relative changes between years as observed in the uncorrected data.
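The correction for reliable counties amounts to a single rescaling of the time series, which the following minimal Python sketch illustrates. The function names and the example numbers are hypothetical. The denominator used for TMiss (records with plus records without a built year) is an assumption consistent with TMiss being a proportion that can reach 100%.

```python
def missing_share(sum_by_miss, sum_built_2015):
    """Share of 2015 records that lack a built year (assumed denominator:
    records with a built year plus records without one)."""
    return sum_by_miss / (sum_built_2015 + sum_by_miss)


def correct_build_series(build_by_year, sum_by_miss, ref_year=2015):
    """Rescale a county's BUILD series so that the reference-year value
    includes the records without a built year.

    Every year is multiplied by the same factor, so relative changes
    between years stay exactly as in the uncorrected series.
    """
    build_ref = build_by_year[ref_year]
    factor = (build_ref + sum_by_miss) / build_ref
    corrected = {year: value * factor for year, value in build_by_year.items()}
    return corrected, factor


# Hypothetical county: 800 records with a built year, 200 without.
build = {1900: 100, 1950: 400, 2015: 800}
t_miss = missing_share(sum_by_miss=200, sum_built_2015=800)
corrected, c_build = correct_build_series(build, sum_by_miss=200)
```

Here the county falls below the 50% threshold (TMiss is 0.2), the correction factor is 1.25, and the ratio between any two years is unchanged after correction.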
To correct BUI, for each county, the average BUI per built-up property in 2015 was calculated as
$$\overline{\mathrm{BUI}}_{2015} = \frac{\mathrm{BUI}_{2015}}{\mathrm{BUILD}_{2015}} \tag{4}$$
and then multiplied with the corrected BUILD value in 2015, resulting in the adjusted county-level BUI in 2015
$$\mathrm{BUI}_{2015,\mathrm{corr}} = \overline{\mathrm{BUI}}_{2015} \times \mathrm{BUILD}_{2015,\mathrm{corr}} \tag{5}$$
In analogy to Eq. 3, a correction factor cBUI was calculated and then applied to the whole BUI time series for each county. The BUA time series layers (with value 1 for grid cells with one or more records that had a built year and value 0 for all other cells) were corrected slightly differently. For each county in 2015, we created another binary layer, BUA0, with value 1 for those grid cells that contained at least one record without a built year and value 0 for all other cells. We then calculated the area of the spatial union of BUA2015 and BUA0, which results in the corrected 2015 BUA
$$\mathrm{BUA}_{2015,\mathrm{corr}} = \mathrm{area}\left(\mathrm{BUA}_{2015} \cup \mathrm{BUA}_{0}\right) \tag{6}$$
Earlier BUA layers were then corrected using a correction factor cBUA, calculated in analogy to cBUILD and cBUI.
Second, for counties where TMiss > 50%, changes in BUILD, BUI, and BUA were not considered reliable. As before, BUILD in 2015 was corrected by SumBYMiss. We then derived relative change estimates in BUILD between different points in time based on the five nearest counties where TMiss < 50%. These average regional gradients of BUILD were used to retrospectively extrapolate BUILD to earlier points in time. To correct BUI in these unreliable counties, we interpolated the average BUI values per built-up property found in the five nearest counties where TMiss < 50%, multiplied them with the corrected BUILD values, and extrapolated the resulting BUI values to earlier data layers in the time series while preserving the average relative changes in the five reliable neighboring counties. Similar to the reliable counties above, the BUA in 2015 was corrected by the spatial union of BUA in 2015 and BUA0. These corrected values were then extrapolated retrospectively while preserving the relative change between years derived from BUA gradients within neighboring counties where TMiss < 50%.
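The retrospective extrapolation for unreliable counties can be sketched as below. This is one possible reading of the "average regional gradients", not a confirmed implementation: it assumes that the gradient between two consecutive time points is the arithmetic mean of the neighbors' ratios (earlier value divided by later value) and that at least one neighbor has a positive later value. The function name and the numbers are hypothetical, and only two neighbors are used to keep the example short, whereas the text uses five.

```python
def backcast_from_neighbors(value_ref, neighbor_series, years):
    """Extrapolate a corrected reference-year value to earlier years.

    value_ref: corrected value in the last year of `years` (e.g., 2015).
    neighbor_series: list of {year: value} dicts for reliable neighbors.
    years: time points in ascending order; the last one is the reference.
    """
    estimate = {years[-1]: value_ref}
    # Walk backward in time: (second-to-last, last), then the pair before.
    for earlier, later in zip(reversed(years[:-1]), reversed(years[1:])):
        ratios = [s[earlier] / s[later] for s in neighbor_series if s[later] > 0]
        mean_ratio = sum(ratios) / len(ratios)
        estimate[earlier] = estimate[later] * mean_ratio
    return estimate


years = [1900, 1950, 2015]
neighbors = [
    {1900: 100, 1950: 200, 2015: 400},
    {1900: 50, 1950: 200, 2015: 800},
]
# Corrected 2015 BUILD of the unreliable county (hypothetical).
estimate = backcast_from_neighbors(1600, neighbors, years)
```

The same routine could be applied to BUI and BUA once their 2015 values have been corrected, because only the relative changes are borrowed from the neighbors.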
Once the above correction steps were finalized, BUILD, BUI, and BUA values were estimated for those counties where there was no information at all. Using the corrected time series resulting from the steps above, BUILD, BUI, and BUA for each year were interpolated using the corresponding values from the nearest five counties where TMiss < 50%.
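For counties without any information, the text does not specify the interpolation method. The sketch below therefore makes two explicit assumptions: neighbors are selected by straight-line distance between county centroids in a projected coordinate system, and the interpolated value is the unweighted mean of the neighbors' values. The data structure, field names, and function names are hypothetical.

```python
import math


def nearest_reliable(target_xy, counties, k=5, threshold=0.5):
    """Return the k nearest counties whose TMiss is below the threshold."""
    reliable = [c for c in counties if c["t_miss"] < threshold]
    reliable.sort(key=lambda c: math.dist(target_xy, c["xy"]))
    return reliable[:k]


def interpolate_value(target_xy, counties, year, k=5):
    """Unweighted mean of the BUILD values of the k nearest reliable counties."""
    neighbors = nearest_reliable(target_xy, counties, k)
    return sum(c["build"][year] for c in neighbors) / len(neighbors)


counties = [
    {"id": "a", "xy": (0, 0), "t_miss": 0.1, "build": {2015: 100}},
    {"id": "b", "xy": (1, 0), "t_miss": 0.7, "build": {2015: 999}},  # unreliable
    {"id": "c", "xy": (3, 0), "t_miss": 0.2, "build": {2015: 300}},
    {"id": "d", "xy": (10, 0), "t_miss": 0.0, "build": {2015: 5000}},
]
# k=2 keeps the example small; the procedure in the text uses five neighbors.
neighbor_ids = [c["id"] for c in nearest_reliable((1, 1), counties, k=2)]
value_2015 = interpolate_value((1, 1), counties, 2015, k=2)
```

County "b" is closest to the target but is excluded because its TMiss exceeds the threshold, so the estimate is based on "a" and "c" only.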
Last, we further adjusted the corrected and extrapolated settlement measures BUILD and BUI using the number of housing units published by the U.S. Census in 2010 as follows. First, for each county in 2010, we used the difference between census housing unit counts and BUILD to adjust BUILD in 2010. Then, we adjusted BUILD for the whole time series while preserving the relative changes between years. Using these adjusted values of BUILD in each year, we adjusted the BUI time series proportionally. The BUA time series could not be corrected using census data, because there is no reference information on the spatial distribution of census housing unit counts within counties and thus no BUA-compatible measure.
Expansion and densification calculation
We used the extracted settlement measures to derive variables that indicate more implicitly the process of change. We calculated relative and absolute expansion as the proportion and absolute value of new developed area, respectively
$$\mathrm{Expansion}_{\mathrm{rel}} = \frac{\mathrm{BUA}_{t1} - \mathrm{BUA}_{t0}}{\mathrm{BUA}_{t0}} \tag{7}$$
$$\mathrm{Expansion}_{\mathrm{abs}} = \mathrm{BUA}_{t1} - \mathrm{BUA}_{t0} \tag{8}$$
where BUAt0 and BUAt1 are the BUA estimates for the beginning and ending year, respectively. This measure was used to evaluate the amount of change in developed area over a given number of years, reflecting how much development has been added, absolutely and proportionally to the initial condition, respectively.
We also calculated densification, which is the change in BUI over the change in BUA
$$\mathrm{Densification} = \frac{\mathrm{BUI}_{t1} - \mathrm{BUI}_{t0}}{\mathrm{BUA}_{t1} - \mathrm{BUA}_{t0}} \tag{9}$$
where BUIt0 and BUIt1 are the built-up intensities for the beginning and ending year of the considered time period, respectively. This measure quantifies the increase in BUI in proportion to newly developed areas over a given number of years.
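A short worked example may help to make the three measures concrete. The function names and the county values below are hypothetical; the formulas follow the verbal definitions above (change in BUA relative to the initial BUA, change in BUA, and change in BUI over change in BUA).

```python
def relative_expansion(bua_t0, bua_t1):
    """New developed area as a proportion of the initial developed area."""
    return (bua_t1 - bua_t0) / bua_t0


def absolute_expansion(bua_t0, bua_t1):
    """New developed area in grid cells."""
    return bua_t1 - bua_t0


def densification(bui_t0, bui_t1, bua_t0, bua_t1):
    """Change in built-up intensity per unit of newly developed area."""
    return (bui_t1 - bui_t0) / (bua_t1 - bua_t0)


# Hypothetical county between two time points t0 and t1.
bua_t0, bua_t1 = 200, 260            # built-up cells
bui_t0, bui_t1 = 50_000.0, 80_000.0  # summed gross indoor area

rel = relative_expansion(bua_t0, bua_t1)
absolute = absolute_expansion(bua_t0, bua_t1)
dens = densification(bui_t0, bui_t1, bua_t0, bua_t1)
```

In this example, the county adds 60 cells (a 30% expansion) and gains 500 units of indoor area per newly developed cell. The ratios are undefined when the initial BUA is zero or when BUA does not change, so such cases would need separate handling.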
We created the maps of the local indicators of spatial association [LISA; (39)] to identify statistically significant spatial clusters in the county-level distribution of the target variable (e.g., change in BUILD; 999 permutations; P < 0.05). A hot spot is a statistically significant high-high (HH) cluster, i.e., a high value that is surrounded by other places with high values to constitute a statistically significant group of counties of higher values. Accordingly, a cold spot [low-low (LL) cluster] indicates a low value surrounded by other low values. Counties labeled with HL and LH represent statistically significant outliers from the spatial distribution.
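The cluster labels can be illustrated with a compact sketch of a local Moran statistic with conditional permutations. Several details are assumptions rather than reported settings: the sketch uses row-standardized binary weights from neighbor lists, requires every unit to have at least one neighbor, and computes a two-sided pseudo P value from absolute values. Only the number of permutations (999) and the significance level (P < 0.05) are taken from the text. The function name and the toy data are hypothetical.

```python
import random


def local_moran(values, neighbors, permutations=999, seed=0):
    """Return (local I, pseudo p-value, quadrant) for each unit.

    values: list of numbers, one per unit (e.g., change in BUILD).
    neighbors: list of neighbor-index lists, one per unit.
    """
    rng = random.Random(seed)
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]        # deviations from the mean
    m2 = sum(d * d for d in z) / n        # second moment

    results = []
    for i in range(n):
        k = len(neighbors[i])
        lag = sum(z[j] for j in neighbors[i]) / k   # mean neighbor deviation
        observed = z[i] * lag / m2
        others = z[:i] + z[i + 1:]                  # unit i is held fixed
        extreme = 0
        for _ in range(permutations):
            sample = rng.sample(others, k)          # random "neighbors"
            simulated = z[i] * (sum(sample) / k) / m2
            if abs(simulated) >= abs(observed):
                extreme += 1
        p_value = (extreme + 1) / (permutations + 1)
        quadrant = ("H" if z[i] > 0 else "L") + ("H" if lag > 0 else "L")
        results.append((observed, p_value, quadrant))
    return results


# Six units along a line; each unit neighbors the adjacent units.
values = [10, 9, 8, 1, 2, 1]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
lisa = local_moran(values, neighbors)
quadrants = [quadrant for _, _, quadrant in lisa]
```

A unit would be mapped as a hot spot (HH), cold spot (LL), or outlier (HL or LH) only if its pseudo P value falls below 0.05; with so few units, the toy example is meant to show the labeling logic rather than meaningful significance levels.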
Panel analysis allows us to control for individual-unit heterogeneity, that is, for unmeasured variables that may explain differences across counties (e.g., cultural or architectural differences) and for variables that change over time but not across counties (e.g., policies, technological advancement, or regulations). This way, panel analysis makes it possible to detect and measure effects that cannot be observed in either the modeling of cross-sectional data or purely descriptive time series analysis (28). To examine the relationship between the number of built-up property records (residential and all land use, corrected) as predictor, and population (person counts) as the outcome variable within an entity (county), we used two-way fixed-effects panel models to account for such forms of heterogeneity. Thus, we include county and time period fixed effects to help account for this bias, and assess the net effect of the predictors on the outcome variable by allowing the model intercept to vary across the spatial units as well as over time. The equation for the (time and entity) fixed-effects regression model is
$$Y_{it} = \beta_0 + \beta_1 X_{1,it} + \dots + \beta_k X_{k,it} + \gamma_2 E_2 + \dots + \gamma_n E_n + \delta_2 T_2 + \dots + \delta_t T_t + u_{it} \tag{10}$$
where Yit is the dependent variable with i = entity and t = time, β0 is the intercept, Xk,it are the independent variables with coefficients βk, uit is the error term, En is the county n [n – 1 entities included as binary (dummies) in the model] with the coefficient γn for the binary regressors (entities), and Tt is the binary variable (dummy) for time (there are t – 1 time periods) with coefficient δt for the binary time regressors.
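To make the role of the fixed effects tangible, the sketch below estimates the slope of a two-way fixed-effects model with a single predictor on a balanced panel. Instead of adding dummy variables, it uses the within transformation (subtracting entity and period means and adding back the grand mean), which yields the same slope as the dummy-variable formulation for balanced panels. This is an illustrative stand-in, not the Stata or Python code used in the study; the function name and the synthetic data are hypothetical.

```python
def two_way_within_slope(y, x):
    """Slope of y on x with entity and period fixed effects.

    y, x: dicts keyed by (entity, period); the panel must be balanced,
    i.e., every entity is observed in every period.
    """
    entities = sorted({e for e, _ in y})
    periods = sorted({t for _, t in y})

    def demean(v):
        grand = sum(v.values()) / len(v)
        e_mean = {e: sum(v[e, t] for t in periods) / len(periods)
                  for e in entities}
        t_mean = {t: sum(v[e, t] for e in entities) / len(entities)
                  for t in periods}
        return {(e, t): v[e, t] - e_mean[e] - t_mean[t] + grand
                for e, t in v}

    yd, xd = demean(y), demean(x)
    return sum(xd[k] * yd[k] for k in xd) / sum(xd[k] ** 2 for k in xd)


# Synthetic panel: population = 2.5 * BUILD + county effect + period effect.
years = [1810, 1860, 1910]
build_rows = {"A": [10, 20, 40], "B": [5, 6, 30], "C": [100, 120, 125]}
county_effect = {"A": 50.0, "B": -20.0, "C": 300.0}
period_effect = {1810: 0.0, 1860: 15.0, 1910: 40.0}

x = {(c, years[j]): float(v)
     for c, row in build_rows.items() for j, v in enumerate(row)}
y = {(c, t): 2.5 * x[c, t] + county_effect[c] + period_effect[t]
     for c, t in x}
slope = two_way_within_slope(y, x)
```

Because the synthetic outcome is built from a known slope plus additive county and period effects, the estimator recovers that slope (2.5) up to floating-point error, whereas a pooled regression without fixed effects would generally not.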
We compared OLS-based balanced panels with LSDV- and GLS fixed-effects models to better understand the impact of fixed effects on the estimators’ predictive power. We included all counties in the balanced panel that remained sufficiently compatible over time, i.e., counties whose areas do not change more than 10% compared to the contemporary county boundaries over the entire time period. All settlement variables were tested, but because of multicollinearity, only one of them could be used at a time. Data extraction, analysis, and statistical modeling were carried out in Python and Stata; geoprocessing steps were carried out using Feature Manipulation Engine (FME) and the ArcGIS 10.6 ArcPy Python package as well as NumPy, Pandas, and Matplotlib.