Bloatectomy: a method for the identification and removal of duplicate text in the bloated notes of electronic health records and other documents.

Bloatectomy takes a list of notes, a single file (.docx, .txt, .rtf, etc.), or a single string and marks the duplicated text, which can then be highlighted, bolded, or removed. It outputs both the marked text and the tokens.

https://github.com/MIT-LCP/bloatectomy

Install via Conda

conda install -c summerkrankin bloatectomy

Install via pip

python3 -m pip install bloatectomy

Roselie Bright (epidemiologist at the Food and Drug Administration), Kate Dowdy (data scientist at Booz Allen Hamilton), and I developed Bloatectomy as part of our work on Project Shakespeare (Roselie Bright’s FDA project, for which Booz Allen Hamilton provided data science support).

Duplicated sentences (“note bloat”) in unstructured electronic healthcare records hamper scientific research. Existing methods did not meet our needs, so we adapted the LZW compression algorithm into a new method and designed parameters that allow customization for varying data and research needs. The result is the Bloatectomy package, which identifies duplicate sentences in unstructured healthcare notes (or other documents), marks them for manual review, and removes them for statistical analysis.
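To give a feel for the idea, here is a minimal sketch of dictionary-based duplicate detection, loosely in the spirit of the way LZW builds a dictionary of sequences it has already seen. It is not the package's implementation and does not call Bloatectomy; the function names `find_duplicates` and `remove_duplicates`, the sentence-splitting rule, and the sample note are all assumptions made for illustration.

```python
import re


def find_duplicates(text):
    """Return (sentence, is_duplicate) pairs in document order."""
    # Assumed tokenization: split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    seen = set()  # dictionary of sentences encountered so far
    marked = []
    for sentence in sentences:
        # Normalize case and whitespace so trivial variations still match.
        key = " ".join(sentence.lower().split())
        marked.append((sentence, key in seen))
        seen.add(key)
    return marked


def remove_duplicates(text):
    """Keep only the first occurrence of each sentence."""
    return " ".join(s for s, dup in find_duplicates(text) if not dup)


# Hypothetical sample note, invented for this example.
note = (
    "Patient is stable. Continue current medications. "
    "Patient is stable. Follow up in two weeks."
)
print(remove_duplicates(note))
```

Keeping the `(sentence, is_duplicate)` pairs separate from the removal step mirrors the two uses described above: the marked pairs can be reviewed by hand, while the de-duplicated text can go straight into analysis.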

The package allows a high level of customization in the length and type of duplicates it detects (via regular expressions), and it could also be used for plagiarism detection or other text pre-processing for natural language processing (NLP). The package is available for use and can be adapted to other settings. Please contribute to our code or let us know if you have any questions.
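The sketch below illustrates why a regular expression is a convenient knob for this kind of customization: changing the delimiter pattern changes what counts as a token, and therefore what counts as a duplicate. It uses only the standard library and is not Bloatectomy's API; the function `dedupe`, its `delimiter` and `min_length` parameters, and the sample text are hypothetical.

```python
import re


def dedupe(text, delimiter, min_length=1):
    """Split text on the `delimiter` regex and drop repeated tokens.

    Repeats shorter than `min_length` characters are kept, so short
    boilerplate is not treated as bloat.
    """
    seen, kept = set(), []
    for token in re.split(delimiter, text):
        token = token.strip()
        if not token:
            continue
        key = token.lower()
        if len(token) >= min_length and key in seen:
            continue  # repeated token that is long enough to remove
        seen.add(key)
        kept.append(token)
    return kept


# Hypothetical sample text, invented for this example.
text = "BP 120/80. Stable.\nBP 120/80. Improving.\nBP 120/80. Stable.\n"

by_line = dedupe(text, r"\n")                      # whole lines as tokens
by_sentence = dedupe(text, r"(?<=[.!?])\s+")       # sentences as tokens
long_only = dedupe(text, r"(?<=[.!?])\s+", min_length=8)

print(by_line)
print(by_sentence)
print(long_only)
```

With line-level tokens only the exact repeated line is dropped; with sentence-level tokens the repeated blood pressure reading is dropped as well. Raising `min_length` keeps short repeats such as "Stable." while still removing the longer one.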

Our use case was the MIMIC-III Critical Care Database, and there is an example Jupyter notebook if you would like to see how we concatenated notes and used the package for multiple records. The Python examples show the simpler use case of a single document or string of text.

For details about how the package works and our reasons for developing it, read the paper here https://github.com/MIT-LCP/bloatectomy/blob/master/bloatectomy_paper.pdf

To acknowledge use of the software, please cite the DOI provided via Zenodo:

Summer K. Rankin, Roselie Bright, & Katherine Dowdy. (2020, June 26). Bloatectomy (Version v0.0.12). Zenodo. http://doi.org/10.5281/zenodo.3909030

or

@software{summer_k_rankin_2020_3909030,
  author       = {Summer K. Rankin and Roselie A. Bright and Kate Dowdy},
  title        = {Bloatectomy},
  month        = jun,
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.0.12},
  doi          = {10.5281/zenodo.3909030},
  url          = {https://doi.org/10.5281/zenodo.3909030}
}