I spent a year in a role where 50% of my duties was writing sql reports. These r...

chubot · on Feb 13, 2019

Once your SQL gets into 500-1000 lines, and hours of runtime, I would suggest using data frames instead (in R or Python).

I wrote this post to introduce the idea:

What Is a Data Frame? (In Python, R, and SQL) https://www.oilshell.org/blog/2018/11/30.html

It's often useful to treat SQL as an extraction/filtering language, and then use R or Pandas as a computation/reporting language.

I think of it as separating I/O and computation. SQL does enough filtering to cut the data down to a reasonable size, and maybe some preliminary logic. And then you compute on that smaller data set in R or Pandas -- iterating in SECONDS instead of hours. The code will likely be shorter as well, so it's a win-win (see examples in my blog post).

I can't think of many situations where hours of runtime is "reasonable" for an SQL query. In 2 hours you could probably do a linear scan over every table in most production databases 10-100 times.

For example, if your database is 10 GB, you could cat all of its files in less than 5 minutes (probably much less on a modern SSD). In 2 hours, you can do a five minute operation 24 times. I can't think of many reports that should take longer than 24 full passes over all the data in the database (i.e. pretending that you're not bothering to use indices in the most basic way). If it takes longer than that, the joins should be expressible with something orders of magnitude more efficient.

I've mainly worked with big data frameworks, but I think that almost any SQL database (sqlite, MySQL, Postgres) should be able to do a scan of a single table with some trivial predicates within 10x the speed of 'cat' (should be within 2x really). They can probably do better on realistic workloads because of caching.

busterarm · on Feb 13, 2019

This isn't meant to offend, rather as a point of consideration, but seeing your example use case being 10GB and then talk about big data frameworks makes it hard for me to take this advice seriously.

I might reach for that kind of tooling at the hundreds of TB to PB scale, but in our production applications we have _tables_ that are multiple terabytes. SQL is just fine.

Yes, we also have have queries that run in the timescale of hours and they are always of the reporting/scheduled task variety and absolutely vital to our customers. Long running reporting queries are pretty acceptable (and pretty much the norm since forever) outside of the tech industry and your customers won't balk at it.

chubot · on Feb 13, 2019

It seems like you misunderstood what I wrote. I'm saying you should consider using R or Python if your reports are taking a long time, not big data frameworks.

Big data was a reference to thinking about the problem in terms of the speed of the hardware. If it's 1000x slower than what the hardware can do, that's a sign you're using the wrong tool for the job.

Getting within 10x is reasonable, but not 100x or 1000x, which is VERY COMMON in my experience. These two situations are very common:

1) SQL queries that are orders of magnitude slower than a simple offline computation in Python or R (let alone C++). The underlying cause is usually due to bad query planning of joins / lack of indices.

You might not have the ability to add indices easily, and even if you did, that has its drawbacks for one-off queries.

2) You need to do some computation that's awkward inside SQL. Statistics beyond basic aggregations, iterative computations (loops), and tree structures are common problems.

bonesss · on Feb 13, 2019

ETL pipelines that hop between different kinds of storage/computing platforms to exploit local maxima, like you're pointing out with R and SQL working in concert, is pretty common in companies working their way up to BigData and academia.

From the Enterprise side I think too many developers have an unfounded expectations around data storage technology. There's this unchallenged belief that monolithic datastorage that will solve thier problems across the entire time/storage/complexity spectrum. By bringing multiple tools to bear, instead, you end up with more purpose built storage but far less domain impedence.

Slapping a denormalized NoSQL front-end for webscale onto a legacy RDBMS can be a cheap win/win to maximize the capabilities of both. SQL + R is oodles better than R or SQL in isolation.

hobs · on Feb 13, 2019

Your advice is pretty good, but I would definitely say that SQL doesnt have a strong relationship between lines of code and runtime, you can line break (and many do) your wide table's select into many columns and get there pretty quick.

If you are writing SQL regularly, understanding the basics of how the queries you write is not that hard for you engine of choice, and everyone should be required to understand the basics of reading an execution plan so they can find the right inflection points between data gathering and processing.

I regularly sigh write and maintain SQL procedures that are >10k LOC, and their runtime never would exceed minutes, much less hours.

max76 · on Feb 13, 2019

If catting 10gb of files takes 5 minutes than catting 500gb of files takes 250 minutes. This is, of course, an over estimation of how long it takes to look at data on a disk. It's also the most trivial thing you can do with data, read it once.

I think we have a misunderstanding about the scale of data.

EDIT: Reading your post you mention that dataframes stores data in memory. Working with data in ram would provide a significant speedup. It just wasn't possible.

chubot · on Feb 13, 2019

I'm not denying that you could have a query that takes hours and is within 10x the speed it should be, just saying that it's very common to have inefficient SQL queries for reporting. I would say it's almost the "default" state unless you do a bit of work.

Some more detail here: https://news.ycombinator.com/item?id=19150971

There are many interesting queries that don't touch all the data in the database. The 'cat' thing was basically assuming the worst case, e.g. as a sanity check.

dwd · on Feb 13, 2019

Knowing how to optimise SQL (and also database indexes) is a valuable skill.

Reducing a highly used query's execution time by several orders of magnitude can be quite gratifying.

faceplanted · on Feb 13, 2019

Stick the --+ORDERED flag on a random select sometime and look at what happens to the estimated query cost to see how easy it would be to fuck this up if you had to make all the optimisation choices yourself.

DrPhish · on Feb 13, 2019

I had a similar role where I was writing boring LOB apps in a very gross language, but since we were using an SQL backend for all the data, I instead challenged myself to using bare templates in the actual programming language and writing all the extraction, transforms and logic into SQL selects and (when unavoidable) programmatic bits.

I also learned a crapload about obscure SQL since I would go to extreme lengths to achieve this. There was a lot of meta-SQL programming, where I would use SQL to generate more SQL and execute that within my statement, sometimes multiple layers deep. It was beautiful in its own way, expanding out in intricate patterns.

veritas3241 · on Feb 13, 2019

You might be interested in this project then https://www.getdbt.com/ it's a SQL compiler and executor which has many features like you're talking about (macros to generate SQL and so on). It makes it pretty easy to build up a complex DAG of SQL queries and transformations.

max76 · on Feb 13, 2019

I never personally wrote meta-SQL but it was used at that office and I read it. I didn't see beauty. When I read the code and understood it I felt terror.

justinclift · on Feb 13, 2019

Any recommendations of a place for sharing "extremely advanced" SQL skills?

Asking from wanting to make use of such a place, and haven't seen anything like it. So, probably need to bootstrap one instead (etc).

andy_boot · on Feb 13, 2019

I made this: https://www.windowfunctions.com

CRConrad · on Feb 14, 2019

Are you asking for “sharing skills” as in something like StackOverflow, but for SQL?

I would think there already is a Stack Exchange for SQL, possibly several (for different RDBMSes/ dialects); go have a look there, if this is what you meant.

veritas3241 · on Feb 13, 2019

Are you asking for examples of advanced SQL skills?

In my experience, if you can grok lateral joins (aka cross apply), recursive CTEs, window functions, and fully understand all the join types, that's a gold star for understanding SQL!

SamuelAdams · on Feb 13, 2019

I would add indexes to that list. Knowing the different types and how they impact performance can be very valuable.

veritas3241 · on Feb 13, 2019

I didn't add indexes mainly b/c for analytic warehouses that are columnar, indexes are less important / meaningless. Partitions though, that's important!

saint_abroad · on Feb 13, 2019

> Most people are perfectly happy to let the orm do all the work, and never care to dig into the data directly.

ORM all-too-often defines data structures from code.

Linus Torvalds wrote,

> I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.

https://lwn.net/Articles/193245/

weka · on Feb 13, 2019

> I've moved to writing backend code. I'm surprised most of my peers cannot write anything more complicated than a join.

I'd wager more don't even know what a join is.

Scaevolus · on Feb 13, 2019

Do you do much with Excel's SQL connectors? Sheets populates with database results are powerful and reasonably user friendly.

hobs · on Feb 13, 2019

A thing I find even more user friendly is Import-Excel https://github.com/dfinke/ImportExcel

It's a powershell module that allows you to easily dump things directly to excel files, does pretty decent datatables, multi-tabs, etc.

bas · on Feb 13, 2019

From a performance perspective, most ORMs are trash if used naively.