Pandas Multilevel Index to the rescue

As a quick follow-up to my last post on speeding up pandas, I ran into another interesting problem. I have the log data I mentioned yesterday: every time a student runs their code, that run is logged along with any error messages or a simple success indicator. I also have a code table that contains the actual code for every run, again with an indication of success or any error messages. For this particular application I wanted to join those two tables into one, so that the log of each run would also contain the code. This matters because unittest results are logged too, and I wanted the code for each unit test run to be included as well.

What you would really like to tell pandas is: join these two data frames on the user id, the id of the coding problem, and the closest timestamp. Unfortunately the timestamps are not guaranteed to match exactly, so joining on them directly fails in many cases, and the merge method doesn't give you anything like an approximate match.

The solution is to add a column to the log that contains the closest timestamp in the code table for that student and problem. Then we can use this new column to do our merge.
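To make the end goal concrete, the eventual merge looks roughly like this. The column names ('Anon Student Id', 'Problem Name', 'closest_time' in the log and 'sid', 'acid', 'timestamp' in the code table) come from the code later in this post; the how='left' choice is my assumption.

# Sketch of the final merge once the closest_time column exists
merged = useinfo_w_answers.merge(
    code,
    left_on=['Anon Student Id', 'Problem Name', 'closest_time'],
    right_on=['sid', 'acid', 'timestamp'],
    how='left')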

So here is a function that we can apply to every row in the log table that will search out the closest timestamp.

def find_closest(sid, acid, time, code):
    """
    Given an sid, acid, time triple find the closest match in time from the code DF
    """
    poss = code[(code.sid == sid)
                & (code.acid == acid)
                & (code.timestamp > time - pd.Timedelta(1, 'minute'))
                & (code.timestamp < time + pd.Timedelta(1, 'minute'))].sort_values('timestamp')
    curr_best = pd.Timedelta(1, 'minute')
    best_match = np.nan
    for ix, row in poss.iterrows():
        diff = abs(time - row.timestamp)
        if diff < curr_best:
            curr_best = diff
            best_match = row.timestamp
    return best_match

Again this was incredibly slow! I was able to achieve a pretty big speedup by eliminating the loop as follows:

def find_closest(sid, acid, time, code):
    """
    Given an sid, acid, time triple find the closest match in time from the code DF
    """
    # td_1min is pd.Timedelta(1, 'minute'), computed once instead of on every call
    tmin = time - td_1min
    tmax = time + td_1min
    poss = code.query("sid == @sid and acid == @acid and timestamp > @tmin and timestamp < @tmax")
    if len(poss) > 0:
        poss['tdiff'] = (poss.timestamp - time).abs()
        # return the timestamp of the row with the smallest time difference
        return poss.loc[poss.tdiff.idxmin()].timestamp
    return np.nan

This was much faster thanks to the vectorized computation of the time differences, (poss.timestamp - time).abs(). But there was still a ridiculous amount of time spent in the query, essentially building a new little data frame for each of the 8 million rows.

Here is where the MultiIndex saved the day! By taking the code data frame and creating a three-level index with code.set_index(['sid', 'acid', 'timestamp']), I now have those mini data frames pre-computed, one time. I've never been very comfortable with multi-level indexes until the lightbulb went on in this project. Once you have a multi-level index you can just grab the rows with code.loc[(sid, acid)]; no need for an expensive query when you can look up what you want from the index. Notice that I only used the first two levels of the multi-level index. This gives you exactly the rows you want, indexed by the timestamp, which opens the door to using the get_loc method. Amazingly, get_loc lets you do an approximate index lookup! You can specify the previous, next, or nearest index to use when there is no exact match.
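To make this concrete, here is roughly how the pre-indexed frame (called poss in the final version below) can be built. The sort_index call and the some_sid/some_acid names are my additions for illustration; sorting the MultiIndex is what lets pandas do these partial lookups efficiently.

# Build the three-level index once; sorting it keeps the .loc lookups fast
# (partial lookups on an unsorted MultiIndex can trigger a PerformanceWarning)
poss = code.set_index(['sid', 'acid', 'timestamp']).sort_index()

# Grab all rows for one student/problem pair; the result is indexed by timestamp
candidates = poss.loc[(some_sid, some_acid)]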

Here is where I ended up:

def find_closest(sid, acid, time, poss):
    """
    Given an sid, acid, time triple find the closest match in time from the code DF
    """
    try:
        candidates = poss.loc[(sid, acid)]
        idx = candidates.index.get_loc(time, method='nearest')
        return candidates.iloc[idx].name
    except:
        # no rows for this (sid, acid) pair, or the lookup otherwise failed
        return pd.NaT

Since candidates is indexed by timestamp and get_loc returns a positional index, we use candidates.iloc[idx].name, where name is the value of the row's (non-positional) index, i.e. the timestamp. Confusing for sure!
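If the iloc/name dance reads as opaque, an equivalent way to get the same timestamp is to index into the timestamp index directly (my variant, not what the original code does):

candidates.index[idx]   # same value as candidates.iloc[idx].name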

Here is how I call this inside an apply:

useinfo_w_answers['closest_time'] = useinfo_w_answers.progress_apply(
    lambda row: find_closest(row['Anon Student Id'],
                             row['Problem Name'],
                             row['Time'],
                             poss),
    axis=1)
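One small note: progress_apply is not plain pandas; it comes from tqdm's pandas integration, which you enable once before using it (assuming tqdm is installed):

from tqdm import tqdm
tqdm.pandas()   # registers progress_apply on DataFrames and Series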

Unfortunately, for an 8 million row dataset this still takes about 30 minutes, but that is way better than the previous best time of more than 50 hours. Luckily these notebooks are mostly run a few times a year, not every day. I would love to hear from you if you have better ideas on how to speed this up further.