Sunday 9 September 2018

Difference in days between two date columns: Dividing by np.timedelta64(1, 'D') is way faster than dt.days

Subtracting a data column from another yields a timedelta column with a very fine grained resolution. If all you need is the difference in days, the first instinct is to use dt.days

npr['diff_since_last'].dt.days

But this was quite slow, and it turns out that the following does the same and it is a lot quicker:

npr['diff_since_last']/np.timedelta64(1, 'D')

260 times faster!

It is a bit courious since one would think dt.days could use division, but there is probably a good reason for it. Stil, in many cases it seems division is all you need and with repeated use in large dataframes it becomes almost a requirement.


Here are some stats (dataframe with 1.6 million observations):

% timeit a= npr['diff_since_last']/np.timedelta64(1, 'D')
36.5 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit a=npr['diff_since_last'].dt.days
9.54 s ± 36.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

len(npr)
Out[69]: 1632061



Common temporal query statements in health research

We would like to have a query language that can easily express and answer the following:

- How many individuals received pharmaceutical A before B?
(How many IBD patients were treated with steroids before biologics?)

- How many individuals experienced B within x days after A?
- How many hip patients were readmitted to hospital within 30 days after surgery?

These can be made more complex:

- (A and B) before (C or D)

We may also add qualifiers:

- How many patients always get an X within 100 days of receiving Y?
(Do the patients always receive a checkup within 100 days)

- How many patients always get at  X within 100 days of receiving Y at least 50% of the time?
(Do the patients always receive a checkup within 100 days)

- How many patients receive every pharmaceutical A within 100 days from pharmaceutical B

- How many individuals received at least five events with pharmaceutical A before receiving at least two events with pharmaceutical B?

One interesting source:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC61469/


Difference in days between two date columns: Dividing by np.timedelta64(1, 'D') is way faster than dt.days

Subtracting a data column from another yields a timedelta column with a very fine grained resolution. If all you need is the difference in d...