To view the full wiki, click here: Full healthyR.data Wiki
healthyR.data is a comprehensive R package that provides healthcare administrative datasets and tools for accessing CMS (Centers for Medicare & Medicaid Services) hospital data. The package serves two primary purposes:
- Built-in Healthcare Data: Provides a rich, realistic administrative dataset (healthyR_data) with 187,721 rows covering hospital visits, patient demographics, charges, payments, and quality metrics
- CMS Data Access: Offers a suite of functions to fetch, download, and work with current CMS hospital data, including quality measures, outcomes, and provider information
Whether you’re testing healthcare analytics functions, teaching health informatics, or conducting research, healthyR.data provides the data infrastructure you need.
The healthyR_data dataset includes:
- Patient Information: Medical Record Numbers (MRN), visit IDs, and visit dates
- Financial Data: Charges, payments, adjustments, and amounts due
- Clinical Metrics: Length of stay, service lines, readmission flags
- Quality Indicators: Expected vs actual length of stay, outlier flags, readmission expectations
- Payer Information: Insurance classifications and payer groupings
The package provides multiple ways to access current CMS hospital data:
- Meta Data Functions: Search and explore available CMS datasets
  - get_cms_meta_data() - Search CMS data catalog
  - get_provider_meta_data() - Search provider data
- Data Download Functions: Fetch current hospital data
  - current_hosp_data() - Download all current hospital data
  - fetch_cms_data() - Fetch specific CMS datasets
  - fetch_provider_data() - Fetch provider data via API
- Specific Hospital Data Functions: Get targeted datasets
  - current_asc_data() - Ambulatory Surgery Center data
  - current_hcahps_data() - Hospital Consumer Assessment of Healthcare Providers and Systems
  - current_hai_data() - Healthcare-Associated Infections
  - current_readmission_data() - Hospital readmissions
  - And 20+ more specific data extraction functions
- Utility Functions: Validate inputs and look up documentation
  - is_valid_url() - Validate URLs before data fetching
  - current_hosp_data_dict() - Get data dictionaries
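The validate-then-fetch pattern suggested by is_valid_url() and the fetch functions can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: looks_like_http_url() and safe_fetch() are hypothetical helpers, the URL is a placeholder, and the fetch function is passed in as an argument so the sketch runs without network access.

```r
# Hypothetical helper: a rough syntactic check, standing in for whatever
# validation is_valid_url() actually performs.
looks_like_http_url <- function(url) {
  is.character(url) && length(url) == 1L && !is.na(url) &&
    grepl("^https?://[^[:space:]/]+", url)
}

# Hypothetical wrapper: skip malformed URLs and turn fetch errors into NULL.
safe_fetch <- function(url, fetch_fn) {
  if (!looks_like_http_url(url)) {
    warning("Skipping fetch: not an http(s) URL")
    return(NULL)
  }
  tryCatch(
    fetch_fn(url),
    error = function(e) {
      warning("Fetch failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Stub fetcher so the example runs offline; in practice a function such as
# fetch_cms_data() could be passed instead.
stub_fetch <- function(url) data.frame(source = url, rows = 0L)

result  <- safe_fetch("https://example.org/data.json", stub_fetch)
skipped <- suppressWarnings(safe_fetch("not a url", stub_fetch))
```

Returning NULL on failure lets a calling script check is.null(result) and continue with other datasets instead of stopping at the first unreachable link.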
Install the released version from CRAN:
install.packages("healthyR.data")

Install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("spsanderson/healthyR.data")

library(healthyR.data)
library(dplyr)
# Load the built-in dataset
df <- healthyR_data
# Explore the data structure
glimpse(df)
#> Rows: 187,721
#> Columns: 17
#> $ mrn <chr> "86069614", "60856527", "80673110", "55897373…
#> $ visit_id <chr> "3519249247", "3602225015", "3125290892", "38…
#> $ visit_start_date_time <dttm> 2010-01-04 05:00:00, 2010-01-04 05:00:00, 20…
#> $ visit_end_date_time <dttm> 2010-01-04, 2010-01-04, 2010-01-04, 2010-01-…
#> $ total_charge_amount <dbl> 25983.88, 22774.05, 10690.45, 8788.02, 7325.1…
#> $ total_amount_due <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 201.52, 20.00, …
#> $ total_adjustment_amount <dbl> -20799.61, -12978.37, -7596.09, -7663.57, -60…
#> $ payer_grouping <chr> "Medicare B", "Medicare HMO", "HMO", "Medicar…
#> $ total_payment_amount <dbl> -5184.27, -9795.68, -3094.36, -1124.45, -1269…
#> $ ip_op_flag <chr> "O", "O", "O", "O", "O", "O", "O", "O", "O", …
#> $ service_line <chr> "General Outpatient", "General Outpatient", "…
#> $ length_of_stay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ expected_length_of_stay <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ length_of_stay_threshold <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ los_outlier_flag <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ readmit_flag <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ readmit_expectation <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
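In the first three rows shown in the glimpse() output, charges plus adjustments plus payments add up to the amount due. The sketch below checks that relationship on those three rows only, copied by hand into a small data frame. Whether the identity holds for every row of healthyR_data is an assumption the output alone does not confirm.

```r
# First three rows, copied from the glimpse() output above
visits <- data.frame(
  total_charge_amount     = c(25983.88, 22774.05, 10690.45),
  total_adjustment_amount = c(-20799.61, -12978.37, -7596.09),
  total_payment_amount    = c(-5184.27, -9795.68, -3094.36),
  total_amount_due        = c(0.00, 0.00, 0.00)
)

# Assumed identity: charges + adjustments + payments = amount due
visits$computed_due <- with(
  visits,
  total_charge_amount + total_adjustment_amount + total_payment_amount
)

# Compare with a half-cent tolerance to absorb floating-point error
visits$balanced <- abs(visits$computed_due - visits$total_amount_due) < 0.005

visits[, c("computed_due", "total_amount_due", "balanced")]
```

The same three lines of arithmetic can be run against df itself to see how many visits fail to balance.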
# Analyze service lines by patient type
df %>%
  count(ip_op_flag, service_line) %>%
  arrange(ip_op_flag, desc(n)) %>%
  rename(count = n) %>%
  head(10)
#> # A tibble: 10 × 3
#> ip_op_flag service_line count
#> <chr> <chr> <int>
#> 1 I Medical 64435
#> 2 I Surgical 14916
#> 3 I COPD 4398
#> 4 I CHF 3871
#> 5 I Pneumonia 3323
#> 6 I Cellulitis 3311
#> 7 I Major Depression/Bipolar Affective Disorders 2866
#> 8 I Chest Pain 2766
#> 9 I GI Hemorrhage 2404
#> 10 I MI 2253

# Analyze charges and payments by payer type
df %>%
  group_by(payer_grouping) %>%
  summarise(
    visits = n(),
    avg_charge = mean(total_charge_amount, na.rm = TRUE),
    avg_payment = mean(abs(total_payment_amount), na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(visits)) %>%
  head(10)
#> # A tibble: 10 × 4
#> payer_grouping visits avg_charge avg_payment
#> <chr> <int> <dbl> <dbl>
#> 1 Medicare A 52622 68452. 11861.
#> 2 Medicaid HMO 25484 37285. 5575.
#> 3 Blue Cross 24357 31561. 10374.
#> 4 Medicare B 22563 16136. 2531.
#> 5 Medicare HMO 18997 55526. 8443.
#> 6 HMO 17444 31407. 9405.
#> 7 Medicaid 8777 49428. 7602.
#> 8 Commercial 6567 35300. 12506.
#> 9 Self Pay 3649 24998. 662.
#> 10 Compensation 2502 40101. 6413.

# Examine length of stay outliers
df %>%
  filter(ip_op_flag == "I") %>% # Inpatient only
  group_by(service_line) %>%
  summarise(
    total_visits = n(),
    avg_los = mean(length_of_stay, na.rm = TRUE),
    outlier_rate = mean(los_outlier_flag, na.rm = TRUE) * 100,
    readmit_rate = mean(readmit_flag, na.rm = TRUE) * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(total_visits)) %>%
  head(10)
#> # A tibble: 10 × 5
#> service_line total_visits avg_los outlier_rate readmit_rate
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 Medical 64435 5.72 0.205 12.8
#> 2 Surgical 14916 9.35 0.436 10.8
#> 3 COPD 4398 5.28 0.0910 19.6
#> 4 CHF 3871 6.42 0.103 21.1
#> 5 Pneumonia 3323 5.89 0.120 14.2
#> 6 Cellulitis 3311 4.78 0.242 9.09
#> 7 Major Depression/Bipolar Affe… 2866 10.4 0.105 5.58
#> 8 Chest Pain 2766 2.05 0.145 8.28
#> 9 GI Hemorrhage 2404 5.86 0.416 14.6
#> 10 MI 2253 4.96 0.266 14.4

library(healthyR.data)
# Search for datasets about hospital readmissions
meta_data <- get_cms_meta_data(
  .keyword = "readmission",
  .data_version = "current"
)
# View available datasets
meta_data %>%
  select(title, modified, media_type) %>%
  head()

# Get metadata for a specific dataset
cms_meta <- get_cms_meta_data(
  .title = "Unplanned Hospital Visits",
  .data_version = "current",
  .media_type = "API"
)
# Extract the data link
data_link <- cms_meta$data_link[1]
# Fetch the actual data
hospital_data <- fetch_cms_data(data_link)
glimpse(hospital_data)

# Download all current hospital data files (requires user to select directory)
all_hosp_data <- current_hosp_data()
# The result is a list of tibbles, one for each data file
names(all_hosp_data)
# Extract specific datasets
asc_data <- current_asc_data(
  all_hosp_data,
  .data_sets = c("Facility", "State")
)

# Search for provider datasets
provider_meta <- get_provider_meta_data(.keyword = "hospital")
# Fetch provider data using an identifier
provider_data <- fetch_provider_data("069d-826b", .limit = 100)
glimpse(provider_data)

healthyR.data is ideal for:
- Healthcare Analytics: Test and develop healthcare analytics functions with realistic data
- Education: Teach health informatics and data analysis courses
- Research: Prototype healthcare research analyses before working with protected data
- Package Development: Test healthcare R packages (like the healthyR package)
- Quality Improvement: Analyze hospital quality metrics and performance indicators
- Financial Analysis: Study healthcare billing, payments, and reimbursement patterns
- Benchmarking: Compare your data against national hospital data from CMS
The healthyR_data dataset contains 187,721 rows and 17 variables:
| Variable | Description |
|---|---|
| mrn | Medical Record Number (unique patient identifier) |
| visit_id | Unique hospital visit identifier |
| visit_start_date_time | Visit start date and time |
| visit_end_date_time | Visit end date and time |
| total_charge_amount | Total charges for the visit (USD) |
| total_amount_due | Amount still owed for the visit (USD) |
| total_adjustment_amount | Total adjustments to the account (USD) |
| payer_grouping | Insurance classification |
| total_payment_amount | Total payments received (USD) |
| ip_op_flag | Patient type (I=Inpatient, O=Outpatient) |
| service_line | Hospital service line |
| length_of_stay | Total days admitted to hospital |
| expected_length_of_stay | Expected days for admission |
| length_of_stay_threshold | LOS threshold for outlier classification |
| los_outlier_flag | Binary indicator if visit exceeded LOS threshold |
| readmit_flag | Binary indicator if readmitted within 30 days |
| readmit_expectation | Expected readmission rate from benchmark |
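The relationship between length_of_stay, length_of_stay_threshold, and los_outlier_flag described in the table can be sketched on invented data. Two assumptions are made here and are not confirmed by the package documentation: a visit counts as an outlier only when its stay is strictly above the threshold, and a visit with no threshold is flagged 0 (the example output earlier shows 0 flags next to NA thresholds). The toy values are made up for illustration.

```r
# Toy inpatient visits; values are invented for illustration
toy <- data.frame(
  visit_id = c("A", "B", "C", "D"),
  length_of_stay = c(3, 12, 7, 5),
  length_of_stay_threshold = c(10, 10, 7, NA)
)

# Assumption: outlier means LOS strictly greater than the threshold
exceeds <- toy$length_of_stay > toy$length_of_stay_threshold

# Assumption: a missing threshold yields a flag of 0 rather than NA
toy$los_outlier_flag <- ifelse(is.na(exceeds), 0, as.numeric(exceeds))

# Percentage of visits flagged as LOS outliers
outlier_rate <- mean(toy$los_outlier_flag) * 100

toy
outlier_rate
```

Only visit B exceeds its threshold here, so one of the four toy visits is flagged.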
- R >= 4.1.0
- Dependencies: dplyr, rlang, utils, janitor, httr2, stringr, tidyr, stats
If you encounter a bug or have a feature request:
- Report issues on GitHub
- Check the function reference for detailed documentation
Contributions are welcome! Please feel free to submit a Pull Request. For major changes:
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Please make sure to update tests as appropriate and follow the existing code style.
- healthyR - Hospital data analysis workflow tools
- healthyverse - Meta-package for healthcare analytics
MIT License - see LICENSE.md for details
Steven P. Sanderson II, MPH
Email: spsanderson@gmail.com
ORCID: 0009-0006-7661-8247
If you use this package in your research, please cite:
citation("healthyR.data")

Note: The built-in healthyR_data dataset contains synthetic/de-identified data for demonstration and testing purposes. When working with CMS data functions, you’re accessing real, publicly available CMS hospital data.
