Health registry research with Python: Difference in days between two date columns: Dividing by np.timedelta64(1, 'D') is way faster than dt.days

Sunday, 9 September 2018

Difference in days between two date columns: Dividing by np.timedelta64(1, 'D') is way faster than dt.days

Subtracting one date column from another yields a timedelta column with a very fine-grained resolution. If all you need is the difference in days, the first instinct is to use dt.days:

npr['diff_since_last'].dt.days

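But this was quite slow, and it turns out that the following gives the same number of days and is a lot quicker. The one difference is that the result is a float, so partial days show up as fractions, whereas dt.days returns whole days only: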

npr['diff_since_last']/np.timedelta64(1, 'D')

About 260 times faster!

It is a bit curious, since one would think dt.days could use division, but there is probably a good reason for it. Still, in many cases division is all you need, and with repeated use on large dataframes it becomes almost a requirement.
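To make the relationship between the two expressions concrete, here is a minimal sketch on a tiny made-up dataframe. The column names admission and previous and the dates are hypothetical, not taken from the registry data. It illustrates that the two agree for whole-day gaps, that division keeps partial days as fractions, and that flooring the float result recovers what dt.days gives.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the npr dataframe used in this post
df = pd.DataFrame({
    "admission": pd.to_datetime(
        ["2018-01-01 00:00:00", "2018-01-10 12:00:00", "2018-03-01 00:00:00"]
    ),
    "previous": pd.to_datetime(
        ["2017-12-25 00:00:00", "2018-01-01 00:00:00", "2018-02-27 18:00:00"]
    ),
})
df["diff_since_last"] = df["admission"] - df["previous"]

# Whole days only (integer day component of each timedelta)
days_int = df["diff_since_last"].dt.days

# Float days: partial days are kept as fractions
days_float = df["diff_since_last"] / np.timedelta64(1, "D")

# Flooring the float result gives the same whole days as dt.days
days_floored = np.floor(days_float).astype("int64")

print(days_int.tolist())      # [7, 9, 1]
print(days_float.tolist())    # [7.0, 9.5, 1.25]
print(days_floored.tolist())  # [7, 9, 1]
```

If the date columns hold dates without a time of day, every gap is a whole number of days and the float result can be cast to an integer directly.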

Here are some stats (dataframe with 1.6 million observations):

%timeit a=npr['diff_since_last']/np.timedelta64(1, 'D')
36.5 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit a=npr['diff_since_last'].dt.days
9.54 s ± 36.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

len(npr)
Out[69]: 1632061
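The numbers above come from the registry data, which cannot be shared. For anyone who wants to try the comparison themselves, here is a rough sketch on synthetic data using the standard timeit module instead of the %timeit magic. The size n, the random gaps and the variable names are assumptions made for illustration, and the size of the gap will likely depend on the pandas version and the machine, so do not expect exactly the ratio reported here.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # hypothetical size; the post used about 1.6 million rows
rng = np.random.default_rng(0)

# Synthetic timedelta column: gaps of 0 to 999 whole days
diff = pd.Series(pd.to_timedelta(rng.integers(0, 1000, size=n), unit="D"))

repeats = 5
t_div = timeit.timeit(lambda: diff / np.timedelta64(1, "D"), number=repeats)
t_days = timeit.timeit(lambda: diff.dt.days, number=repeats)

print(f"division: {t_div / repeats * 1e3:.2f} ms per call")
print(f"dt.days:  {t_days / repeats * 1e3:.2f} ms per call")

# Sanity check: with whole-day gaps both approaches give the same values
by_division = diff / np.timedelta64(1, "D")
by_days = diff.dt.days
same_values = bool((by_division == by_days).all())
print(same_values)
```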
