It's interesting to me to see pandas used in this application. I'd be curious to see a more fully featured implementation.
I'm a scientist by profession and I've been working on building out several different generalized data processing pipelines for some specific problems in my sub-field, to make gathering and formatting in-situ data easier and more standardized/open/version-controlled. It's going great, worlds better than the smattering of MATLAB code strewn across the hard drives in the lab, written in a non-collaborative manner and shared by email...
... but. I'll admit, I've run into a lot of footguns in the pandas API in terms of efficiency. You'll do something in what seems to be the logical way, or in a way that the API funnels you towards (like the groupby calls in the OP), and you'll quickly realize that if you're working on large-ish tables (>10 GB in memory) it was the stupid way to do things. In terms of readable code to share with colleagues, pandas can't be beat, but things get wonky when you reach significant complexity, and I would be surprised if it made any sense to use in a 'real' recommendation engine when considering developer productivity.
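A classic example of the groupby footgun (a minimal sketch with made-up data, not the OP's code): `.apply` with a Python lambda drops back into the interpreter once per group, while the built-in aggregations and `transform` stay vectorized and give the same answer.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Footgun: a Python callback runs once per group -- painful at >10 GB.
slow = df.groupby("group")["value"].apply(lambda s: s.mean())

# Same result from the built-in, vectorized aggregation:
fast = df.groupby("group")["value"].mean()

# For per-row results aligned to the original frame, use transform
# rather than apply + merge:
df["group_mean"] = df.groupby("group")["value"].transform("mean")

print(fast.to_dict())  # {'a': 2.0, 'b': 4.0}
```

Both paths agree on small data; the difference only bites once the table stops fitting comfortably in cache, which is exactly when you've already shared the slow version with colleagues.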
Look into Dask if you are attempting to process an entire data table that's larger than memory.
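Dask mirrors the pandas API for this (roughly `dask.dataframe.read_csv(...).groupby(...).sum().compute()`), but the core idea, processing one chunk at a time and combining partial results, also works in plain pandas when the reduction is simple. A minimal sketch with toy in-memory data standing in for a large file:

```python
import io
import pandas as pd

# Toy stand-in for a CSV too large to load at once.
csv = "user,score\n" + "\n".join(f"u{i % 3},{i}" for i in range(10))

# Stream the file in chunks and fold partial group sums together.
sums = None
for chunk in pd.read_csv(io.StringIO(csv), chunksize=4):
    part = chunk.groupby("user")["score"].sum()
    sums = part if sums is None else sums.add(part, fill_value=0)

print(sums.to_dict())  # {'u0': 18, 'u1': 12, 'u2': 15}
```

Dask handles the chunking, scheduling, and more complex operations (joins, non-trivial groupbys) for you, which is where hand-rolled chunking stops being fun.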
For your last paragraph, you're conflating the need to share code with the need to build a robust, scalable service. Most research code is only needed for the paper and rarely touched again.