---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# clinspacy
<!-- badges: start -->
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
<!-- badges: end -->
The goal of clinspacy is to perform biomedical named entity recognition, Unified Medical Language System (UMLS) concept mapping, and negation detection using the Python spaCy, scispacy, and medspacy packages.
## Installation
You can install the CRAN version of clinspacy with:
```
install.packages('clinspacy')
```
You can install the GitHub version of clinspacy with:
```
remotes::install_github('ML4LHS/clinspacy', INSTALL_opts = '--no-multiarch')
```
## How to load clinspacy
```{r}
library(clinspacy)
```
## Initiating clinspacy
*Note: the very first time you run `clinspacy_init()` or `clinspacy()` after installing the package, you may receive an error stating that `spaCy` was unable to be imported because it was not found. Restarting your R session should resolve the issue.*
Initiating clinspacy is optional. If you do not initiate the package using `clinspacy_init()`, it will be initiated automatically without the UMLS linker. The UMLS linker takes up ~12 GB of RAM, so it is off by default; to use it, initiate clinspacy with the `use_linker` argument set to `TRUE`. The linker can also be added later by reinitiating with `use_linker = TRUE`.
```{r}
clinspacy_init() # This is optional! The default functionality is to initiate clinspacy without the UMLS linker
```
## Named entity recognition (without the UMLS linker)
The `clinspacy()` function can take a single string, a character vector, or a data frame. It can output either a data frame or a file name.
### A single character string as input
```{r}
clinspacy('This patient has diabetes and CKD stage 3 but no HTN.')
clinspacy('HISTORY: He presents with chest pain. PMH: HTN. MEDICATIONS: This patient with diabetes is taking omeprazole, aspirin, and lisinopril 10 mg but is not taking albuterol anymore as his asthma has resolved. ALLERGIES: penicillin.', verbose = FALSE)
```
### A character vector as input
```{r}
clinspacy(c('This pt has CKD and HTN', 'Pt only has CKD but no HTN'),
          verbose = FALSE)
```
### A data frame as input
```{r}
data.frame(text = c('This pt has CKD and HTN', 'Diabetes is present'),
           stringsAsFactors = FALSE) %>%
  clinspacy(df_col = 'text', verbose = FALSE)
```
### Saving the output to file
The `output_file` can then be piped into `bind_clinspacy()` or `bind_clinspacy_embeddings()`. This saves a lot of time because you can try different strategies of subsetting in both of these functions without needing to re-process the original data.
```{r}
if (!dir.exists(rappdirs::user_data_dir('clinspacy'))) {
  dir.create(rappdirs::user_data_dir('clinspacy'), recursive = TRUE)
}

mtsamples = dataset_mtsamples()

mtsamples[1:5,]

clinspacy_output_file =
  mtsamples[1:5, 1:2] %>%
  clinspacy(df_col = 'description',
            verbose = FALSE,
            output_file = file.path(rappdirs::user_data_dir('clinspacy'),
                                    'output.csv'),
            overwrite = TRUE)

clinspacy_output_file
```
## Binding named entities to a data frame (without the UMLS linker)
Negated concepts, as identified by the medspacy cycontext `is_negated` flag, are ignored by default and do not count towards the frequencies. However, you can change the subsetting criteria with the `subset` argument.
Note that you now need to re-provide the original dataset to the `bind_clinspacy()` function.
```{r}
mtsamples[1:5, 1:2] %>%
  clinspacy(df_col = 'description', verbose = FALSE) %>%
  bind_clinspacy(mtsamples[1:5, 1:2])
```
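To make the effect of a subset string concrete, here is a minimal base R sketch on made-up data. It is not the package's implementation: `toy_output`, its `doc_id` column, and the `count_entities()` helper are hypothetical names introduced only for illustration, and the real `clinspacy()` output has more columns than shown here.

```r
# Toy stand-in for clinspacy() output: one row per detected entity.
# `doc_id` is a hypothetical document identifier; all values are made up.
toy_output <- data.frame(
  doc_id       = c(1, 1, 1, 2, 2),
  entity       = c("diabetes", "CKD", "HTN", "CKD", "HTN"),
  is_negated   = c(FALSE, FALSE, TRUE, FALSE, TRUE),
  is_uncertain = c(FALSE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: evaluate a subset string against the columns,
# keep the matching rows, then count entities per document.
count_entities <- function(output, subset_string = "is_negated == FALSE") {
  keep <- eval(parse(text = subset_string), envir = output)
  kept <- output[keep, , drop = FALSE]
  # Keep every document as a row, even if all of its entities were dropped.
  table(factor(kept$doc_id, levels = unique(output$doc_id)), kept$entity)
}

# Negated entities dropped (mirrors the default behavior described above).
default_counts <- count_entities(toy_output)

# Stricter criteria: drop uncertain entities as well.
strict_counts <- count_entities(
  toy_output,
  "is_uncertain == FALSE & is_negated == FALSE"
)

default_counts
strict_counts
```

In this toy data, HTN is negated in both documents, so it never appears as a column, and the stricter criteria additionally remove the uncertain CKD mention from the second document.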
### We can also store the intermediate result so that bind_clinspacy() does not need to re-process the text.
```{r}
clinspacy_output_data =
  mtsamples[1:5, 1:2] %>%
  clinspacy(df_col = 'description', verbose = FALSE)

clinspacy_output_data %>%
  bind_clinspacy(mtsamples[1:5, 1:2])

clinspacy_output_data %>%
  bind_clinspacy(mtsamples[1:5, 1:2],
                 cs_col = 'entity')

clinspacy_output_data %>%
  bind_clinspacy(mtsamples[1:5, 1:2],
                 subset = 'is_uncertain == FALSE & is_negated == FALSE')
```
### We can also re-use the output file we had created earlier and pipe this directly into bind_clinspacy().
```{r}
clinspacy_output_file

clinspacy_output_file %>%
  bind_clinspacy(mtsamples[1:5, 1:2])

clinspacy_output_file %>%
  bind_clinspacy(mtsamples[1:5, 1:2],
                 cs_col = 'entity')

clinspacy_output_file %>%
  bind_clinspacy(mtsamples[1:5, 1:2],
                 subset = 'is_uncertain == FALSE & is_negated == FALSE')
```
## Binding entity embeddings to a data frame (without the UMLS linker)
With the UMLS linker disabled, 200-dimensional entity embeddings can be extracted from the scispacy Python package. For this to work, you must set `return_scispacy_embeddings` to `TRUE` when running `clinspacy()`. It's also a good idea to write the output directly to file because the embeddings can be quite large.
```{r}
clinspacy_output_file =
  mtsamples[1:5, 1:2] %>%
  clinspacy(df_col = 'description',
            return_scispacy_embeddings = TRUE,
            verbose = FALSE,
            output_file = file.path(rappdirs::user_data_dir('clinspacy'),
                                    'output.csv'),
            overwrite = TRUE)

clinspacy_output_file %>%
  bind_clinspacy_embeddings(mtsamples[1:5, 1:2])
```
## Adding the UMLS linker
The UMLS linker can be turned on (and off) even if `clinspacy_init()` has already been called. The first time you turn it on, it takes a while because the linker needs to be loaded into memory. On subsequent removal and addition, this occurs much more quickly because the linker is only removed/added to the pipeline and does not need to be reloaded into memory.
```{r}
clinspacy_init(use_linker = TRUE)
```
## Named entity recognition (with the UMLS linker)
By turning on the UMLS linker, you can restrict the results by semantic type. In general, restricting the result in `clinspacy()` is not a good idea because you can always subset the results later within `bind_clinspacy()` and `bind_clinspacy_embeddings()`.
```{r}
clinspacy('This patient has diabetes and CKD stage 3 but no HTN.')

clinspacy('This patient with diabetes is taking omeprazole, aspirin, and lisinopril 10 mg but is not taking albuterol anymore as his asthma has resolved.',
          semantic_types = 'Pharmacologic Substance')

clinspacy('This patient with diabetes is taking omeprazole, aspirin, and lisinopril 10 mg but is not taking albuterol anymore as his asthma has resolved.',
          semantic_types = 'Disease or Syndrome')
```
## Binding UMLS concept unique identifiers to a data frame (with the UMLS linker)
With the UMLS linker enabled, `bind_clinspacy()` binds one column per UMLS concept unique identifier (CUI) that scispacy identifies with 99% confidence, and the values in each column are the frequencies of that concept. Negated concepts, as identified by the medspacy cycontext `is_negated` flag, are ignored by default and do not count towards the frequencies. However, this behavior can be changed using the `subset` argument.
Note that by turning on the UMLS linker, you can restrict the results by semantic type.
```{r}
clinspacy_output_file =
  mtsamples[1:5, 1:2] %>%
  clinspacy(df_col = 'description',
            return_scispacy_embeddings = TRUE, # only so we can retrieve these below
            verbose = FALSE,
            output_file = file.path(rappdirs::user_data_dir('clinspacy'),
                                    'output.csv'),
            overwrite = TRUE)

clinspacy_output_file %>%
  bind_clinspacy(mtsamples[1:5, 1:2])

clinspacy_output_file %>%
  bind_clinspacy(
    mtsamples[1:5, 1:2],
    subset = 'is_negated == FALSE & semantic_type == "Diagnostic Procedure"'
  )
```
## Binding concept embeddings to a data frame (with the UMLS linker)
The default embeddings are from the scispacy Python package. If you want to use the cui2vec embeddings (only available with the linker enabled), you need to set the `type` argument to `'cui2vec'`. Up to 500-dimensional embeddings can be returned.
Note that by turning on the UMLS linker, you can restrict the results by semantic type (with either type of embedding).
### Scispacy embeddings (with the UMLS linker)
With the UMLS linker enabled, you can restrict by semantic type when obtaining scispacy embeddings.
Note: The mean embeddings may be slightly different than if the linker was disabled because entities may be captured twice (as entities may map to multiple concepts). Thus, if you do not need to restrict by semantic type, the recommended setting is to turn the UMLS linker off by re-running `clinspacy_init(use_linker = FALSE)` (note that `use_linker = FALSE` is the default in `clinspacy_init()`).
```{r}
clinspacy_output_file %>%
  bind_clinspacy_embeddings(mtsamples[1:5, 1:2])

clinspacy_output_file %>%
  bind_clinspacy_embeddings(
    mtsamples[1:5, 1:2],
    subset = 'is_negated == FALSE & semantic_type == "Diagnostic Procedure"'
  )
```
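The note above about mean embeddings shifting when an entity is captured twice can be illustrated with a small base R sketch. The 3-dimensional vectors and entity names below are made up (the scispacy embeddings are 200-dimensional), and the sketch assumes a plain unweighted mean over rows; it does not call clinspacy or reproduce its internals.

```r
# Made-up 3-dimensional embeddings, one row per entity.
emb <- rbind(
  chest_pain = c(1, 0, 0),
  aspirin    = c(0, 1, 0)
)

# Linker off: each entity contributes one row to the mean.
mean_without_linker <- colMeans(emb)

# Linker on: suppose "aspirin" maps to two UMLS concepts,
# so its embedding appears twice before averaging.
emb_linked <- emb[c("chest_pain", "aspirin", "aspirin"), ]
mean_with_linker <- colMeans(emb_linked)

mean_without_linker
mean_with_linker
```

The duplicated row pulls the mean towards the entity that maps to more than one concept, which is why the two settings can give slightly different document-level embeddings.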
### Cui2vec embeddings (with the UMLS linker)
These are only available with the UMLS linker enabled.
```{r}
clinspacy_output_file %>%
  bind_clinspacy_embeddings(mtsamples[1:5, 1:2],
                            type = 'cui2vec')

clinspacy_output_file %>%
  bind_clinspacy_embeddings(
    mtsamples[1:5, 1:2],
    type = 'cui2vec',
    subset = 'is_negated == FALSE & semantic_type == "Diagnostic Procedure"'
  )
```
## UMLS CUI definitions
```{r}
cui2vec_definitions = dataset_cui2vec_definitions()
head(cui2vec_definitions)
```