Subtracting a data column from another yields a timedelta column with a very fine grained resolution. If all you need is the difference in days, the first instinct is to use dt.days
npr['diff_since_last'].dt.days
But this was quite slow, and it turns out that the following does the same and it is a lot quicker:
npr['diff_since_last']/np.timedelta64(1, 'D')
260 times faster!
It is a bit courious since one would think dt.days could use division, but there is probably a good reason for it. Stil, in many cases it seems division is all you need and with repeated use in large dataframes it becomes almost a requirement.
Here are some stats (dataframe with 1.6 million observations):
% timeit a= npr['diff_since_last']/np.timedelta64(1, 'D')
36.5 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit a=npr['diff_since_last'].dt.days
9.54 s ± 36.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
len(npr)
Out[69]: 1632061
Health registry research with Python
Sunday, 9 September 2018
Common temporal query statements in health research
We would like to have a query language that can easily express and answer the following:
- How many individuals received pharmaceutical A before B?
(How many IBD patients were treated with steroids before biologics?)
- How many individuals experienced B within x days after A?
- How many hip patients were readmitted to hospital within 30 days after surgery?
These can be made more complex:
- (A and B) before (C or D)
We may also add qualifiers:
- How many patients always get an X within 100 days of receiving Y?
(Do the patients always receive a checkup within 100 days)
- How many patients always get at X within 100 days of receiving Y at least 50% of the time?
(Do the patients always receive a checkup within 100 days)
- How many patients receive every pharmaceutical A within 100 days from pharmaceutical B
- How many individuals received at least five events with pharmaceutical A before receiving at least two events with pharmaceutical B?
One interesting source:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC61469/
- How many individuals received pharmaceutical A before B?
(How many IBD patients were treated with steroids before biologics?)
- How many individuals experienced B within x days after A?
- How many hip patients were readmitted to hospital within 30 days after surgery?
These can be made more complex:
- (A and B) before (C or D)
We may also add qualifiers:
- How many patients always get an X within 100 days of receiving Y?
(Do the patients always receive a checkup within 100 days)
- How many patients always get at X within 100 days of receiving Y at least 50% of the time?
(Do the patients always receive a checkup within 100 days)
- How many patients receive every pharmaceutical A within 100 days from pharmaceutical B
- How many individuals received at least five events with pharmaceutical A before receiving at least two events with pharmaceutical B?
One interesting source:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC61469/
Subscribe to:
Posts (Atom)
Difference in days between two date columns: Dividing by np.timedelta64(1, 'D') is way faster than dt.days
Subtracting a data column from another yields a timedelta column with a very fine grained resolution. If all you need is the difference in d...
-
We would like to have a query language that can easily express and answer the following: - How many individuals received pharmaceutical A ...
-
Subtracting a data column from another yields a timedelta column with a very fine grained resolution. If all you need is the difference in d...