| Title: | Economic Entity Identifier Standardization |
|---|---|
| Description: | Provides utility functions for standardizing economic entity (economy, aggregate, institution, etc.) name and id in economic datasets such as those published by the International Monetary Fund and World Bank. Aims to facilitate consistent data analysis, reporting, and joining across datasets. Used as a foundational building block in the 'EconDataverse' family of packages (<https://www.econdataverse.org>). |
| Authors: | L. Teal Emery [cre], Christopher C. Smith [aut], Christoph Scheuch [ctb], Teal Insights [cph] |
| Maintainer: | L. Teal Emery <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.3.9000 |
| Built: | 2026-06-07 07:52:56 UTC |
| Source: | https://github.com/teal-insights/r-econid |
This function allows users to extend the default entity patterns with a custom entry.
    add_entity_pattern(
      entity_id,
      entity_name,
      entity_type,
      aliases = NULL,
      entity_regex = NULL
    )
| Argument | Description |
|---|---|
| entity_id | A unique identifier for the entity. |
| entity_name | The standard (canonical) name of the entity. |
| entity_type | A character string describing the type of entity ("economy", "organization", "aggregate", or "other"). |
| aliases | An optional character vector of alternative names identifying the entity. If provided, these are automatically combined with the pipe operator ("\|") to build the regular expression used to match the entity. |
| entity_regex | An optional custom regular expression pattern. If supplied, it overrides the regex automatically constructed from aliases. |
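To make the alias mechanism concrete, here is a minimal base R sketch of how several names can be joined with "|" into a single pattern. The helper `build_entity_regex()` is hypothetical, and including the id and name as alternatives, lower-casing the terms, and anchoring the pattern are all assumptions made for illustration; the package's internal construction may differ.

```r
# Hypothetical helper (not part of econid): join an id, a name, and optional
# aliases into one regex using the alternation operator "|".
# Assumes the terms contain no regex metacharacters that need escaping.
build_entity_regex <- function(entity_id, entity_name, aliases = NULL) {
  terms <- unique(tolower(c(entity_id, entity_name, aliases)))
  # Anchor the alternation so that only whole-string matches count
  paste0("^(", paste(terms, collapse = "|"), ")$")
}

asean_regex <- build_entity_regex(
  "ASN",
  "Association of Southeast Asian Nations",
  aliases = c("ASEAN")
)

# Case-insensitive matching against a few candidate strings
candidates <- c("asean", "ASN", "Asia")
matches <- grepl(asean_regex, candidates, ignore.case = TRUE)
```

In this sketch the first two candidates match and "Asia" does not, because the pattern is anchored at both ends.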
Custom entity patterns can be added at the top of a script (or
interactively). They are stored separately from the built-in patterns and
appended to them whenever the patterns are retrieved with
list_entity_patterns(). This lets users register alternative names
(aliases) for entities that appear in their economic datasets.
Custom patterns persist only for the current R session.
Returns NULL. As a side effect, the custom pattern is stored in an
internal tibble for the current session.
    add_entity_pattern(
      "ASN",
      "Association of Southeast Asian Nations",
      "economy",
      aliases = c("ASEAN")
    )
    patterns <- list_entity_patterns()
    print(patterns[patterns$entity_id == "ASN", ])
A dataset containing patterns for matching entity names. This dataset is accessible through list_entity_patterns.
    entity_patterns
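A data frame with the following columns:

| Column | Description |
|---|---|
| entity_id | Unique identifier for the entity |
| entity_name | Entity name |
| iso3c | ISO 3166-1 alpha-3 code |
| iso2c | ISO 3166-1 alpha-2 code |
| entity_type | Type of entity ("economy", "organization", "aggregate", or "other") |
| entity_regex | Regular expression pattern for matching entity names |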
Data manually prepared by L. Teal Emery.
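As an illustration of how a table of regular expressions like this can drive name matching, the following base R sketch looks up each input string against a toy pattern table. The table contents and the helper `match_entity_id()` are hypothetical; they are not copied from the entity_patterns dataset and do not reproduce the package's matching logic.

```r
# Toy pattern table; ids, names, and regexes are made up for illustration
toy_patterns <- data.frame(
  entity_id = c("USA", "FRA"),
  entity_name = c("United States", "France"),
  entity_regex = c("^(usa|us|united.?states)$", "^(fra|fr|france)$"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the entity_id of the first pattern that
# matches each input string, or NA when nothing matches
match_entity_id <- function(x, patterns) {
  vapply(x, function(value) {
    hit <- which(vapply(
      patterns$entity_regex,
      function(rx) grepl(rx, value, ignore.case = TRUE),
      logical(1)
    ))
    if (length(hit) == 0) NA_character_ else patterns$entity_id[hit[1]]
  }, character(1), USE.NAMES = FALSE)
}

ids <- match_entity_id(
  c("United States", "united.states", "France", "NotACountry"),
  toy_patterns
)
```

Taking the first matching pattern is a simplification; a real implementation also has to decide what to do when several patterns match the same value.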
This function returns a tibble containing regular expression patterns for
identifying economic entities. It combines the patterns from the built-in
entity_patterns dataset with any custom patterns stored in the
.econid_env environment.
    list_entity_patterns()
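A data frame with the following columns:

| Column | Description |
|---|---|
| entity_id | Entity id |
| entity_name | Entity name |
| iso2c | ISO 3166-1 alpha-2 code |
| iso3c | ISO 3166-1 alpha-3 code |
| entity_type | Entity type |
| entity_regex | Regular expression pattern for matching entity names |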
    patterns <- list_entity_patterns()
This function resets all custom entity patterns that have been added during the current R session.
    reset_custom_entity_patterns()
Invisibly returns NULL.
    add_entity_pattern("EU", "European Union", "economy")
    reset_custom_entity_patterns()
    patterns <- list_entity_patterns()
    print(patterns[patterns$entity_id == "EU", ])
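The session-scoped behavior of add_entity_pattern(), list_entity_patterns(), and reset_custom_entity_patterns() can be pictured with a small environment-backed store. The sketch below is a simplified stand-in written in base R: `.toy_env`, `toy_add()`, `toy_list()`, and `toy_reset()` are hypothetical names, and the real package may organize its internal state differently.

```r
# Hypothetical stand-in for a package-private environment holding custom rows
.toy_env <- new.env(parent = emptyenv())
.toy_env$custom <- data.frame(
  entity_id = character(),
  entity_name = character(),
  stringsAsFactors = FALSE
)

# Toy "built-in" table that is never modified
toy_defaults <- data.frame(
  entity_id = "USA",
  entity_name = "United States",
  stringsAsFactors = FALSE
)

# Store one custom row in the environment; returns NULL invisibly
toy_add <- function(entity_id, entity_name) {
  new_row <- data.frame(
    entity_id = entity_id,
    entity_name = entity_name,
    stringsAsFactors = FALSE
  )
  .toy_env$custom <- rbind(.toy_env$custom, new_row)
  invisible(NULL)
}

# Listing appends the custom rows to the defaults on every call
toy_list <- function() rbind(toy_defaults, .toy_env$custom)

# Resetting keeps the column structure but drops all custom rows
toy_reset <- function() {
  .toy_env$custom <- .toy_env$custom[0, , drop = FALSE]
  invisible(NULL)
}

toy_add("EU", "European Union")
n_after_add <- nrow(toy_list())
toy_reset()
n_after_reset <- nrow(toy_list())
```

Because the custom rows live only in an in-memory environment, they disappear when the R session ends, which matches the persistence described for the real functions.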
Standardizes entity identifiers (e.g., name, ISO code) in an economic data frame by matching them against a predefined list of regex patterns, then adds columns containing the standardized identifiers to the data frame.
    standardize_entity(
      data,
      ...,
      output_cols = c("entity_id", "entity_name", "entity_type"),
      prefix = NULL,
      fill_mapping = NULL,
      default_entity_type = NA_character_,
      warn_ambiguous = TRUE,
      overwrite = TRUE,
      warn_overwrite = TRUE,
      .before = NULL
    )
| Argument | Description |
|---|---|
| data | A data frame or tibble containing entity identifiers to standardize. |
| ... | Columns containing entity names and/or IDs. These can be specified using unquoted column names (e.g., entity, code, as in the examples). |
| output_cols | Character vector specifying desired output columns. Options are "entity_id", "entity_name", "entity_type", "iso3c", "iso2c". Defaults to c("entity_id", "entity_name", "entity_type"). |
| prefix | Optional character string to prefix the output column names. Useful when standardizing multiple entities in the same dataset (e.g., "country", "counterpart"). If provided, output columns will be named prefix_entity_id, prefix_entity_name, etc. (with an underscore automatically inserted between the prefix and the column name). |
| fill_mapping | Named character vector specifying how to fill missing values when no entity match is found. Names should be output column names (without prefix), and values should be input column names (from the columns passed in ...). |
| default_entity_type | Character or NA; the default entity type to use for entities that do not match any of the patterns. Options are "economy", "organization", "aggregate", "other", or NA_character_. Defaults to NA_character_. This argument is only used when "entity_type" is included in output_cols. |
| warn_ambiguous | Logical; whether to warn about ambiguous matches. |
| overwrite | Logical; whether to overwrite existing entity_* columns. |
| warn_overwrite | Logical; whether to warn when overwriting existing entity_* columns. Defaults to TRUE. |
| .before | Column name or position to insert the standardized columns before. If NULL (default), columns are inserted at the beginning of the data frame. Can be a character vector specifying the column name or a numeric value specifying the column index. If the specified column is not found in the data, an error is thrown. |
A data frame with standardized entity information merged with the input data. The standardized columns are placed directly to the left of the first target column.
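The naming rule described for the prefix argument can be sketched in a few lines of base R. The helper `prefixed_names()` is hypothetical and only mirrors the documented rule (prefix, underscore, column name); it is not a function exported by the package.

```r
# Hypothetical helper mirroring the documented naming rule:
# prefix + "_" + output column name; a NULL prefix leaves names unchanged
prefixed_names <- function(output_cols, prefix = NULL) {
  if (is.null(prefix)) output_cols else paste(prefix, output_cols, sep = "_")
}

default_cols <- c("entity_id", "entity_name", "entity_type")

plain_cols <- prefixed_names(default_cols)
counterpart_cols <- prefixed_names(default_cols, prefix = "counterpart")
```

Distinct prefixes keep the output of successive standardize_entity() calls on the same data frame from colliding.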
    # Standardize entity names and IDs in a data frame
    test_df <- tibble::tribble(
      ~entity,         ~code,
      "United States", "USA",
      "united.states", NA,
      "us",            "US",
      "EU",            NA,
      "NotACountry",   NA
    )
    standardize_entity(test_df, entity, code)

    # Standardize with fill_mapping for unmatched entities
    standardize_entity(
      test_df, entity, code,
      fill_mapping = c(entity_id = "code", entity_name = "entity")
    )

    # Standardize multiple entities in sequence with a prefix
    df <- data.frame(
      country_name = c("United States", "France"),
      counterpart_name = c("China", "Germany")
    )
    df |>
      standardize_entity(country_name) |>
      standardize_entity(counterpart_name, prefix = "counterpart")
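The effect of fill_mapping on unmatched rows can be illustrated with a small base R sketch. The data frame `matched` is a made-up stand-in for the state after pattern matching, and the loop shows one plausible way to copy input values into missing output cells; it is an assumption for illustration, not the package's implementation.

```r
# Toy state after a matching step: the second row found no match,
# so its output columns are NA (values are made up for illustration)
matched <- data.frame(
  entity = c("United States", "NotACountry"),
  code = c("USA", "XYZ"),
  entity_id = c("USA", NA),
  entity_name = c("United States", NA),
  stringsAsFactors = FALSE
)

# Names are output columns, values are the input columns to copy from
fill_mapping <- c(entity_id = "code", entity_name = "entity")

# For each mapped output column, fill only the missing cells
for (out_col in names(fill_mapping)) {
  in_col <- fill_mapping[[out_col]]
  missing <- is.na(matched[[out_col]])
  matched[[out_col]][missing] <- matched[[in_col]][missing]
}
```

Rows that already have a match keep their standardized values; only the NA cells are replaced by the corresponding input values.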