Common Mistakes and Missed Optimization Opportunities in SQL (hakibenita.com)
179 points by haki on Nov 22, 2019 | hide | past | favorite | 66 comments


While counting a column will not include NULL values, what about counting joined tables?

  SELECT a.id, COUNT(b.*) FROM a JOIN b ON b.a_id = a.id GROUP BY a.id
is not permitted in Postgres.

Sure, I could just use COUNT(b.a_id) since that's what I join on, but a more complicated example might not allow for that, for instance if it were a virtual table.


I'm sorry, if you want to count NULL rows in b.* , how would it ever be different from just COUNT( * )? Maybe I'm misunderstanding what you're after?


You're right, it's a bad example; imagine I'm joining two tables:

  SELECT a.id, COUNT(b.*), COUNT(c.*)
    FROM a JOIN b ON b.a_id = a.id JOIN c ON c.a_id = a.id
    GROUP BY a.id
I want to know how many occurrences a_id has in both tables b and c. Again, in this simple example I could just count on b.a_id and c.a_id, respectively, but imagine if b and c were complex virtual tables:

  JOIN (SELECT NULL AS foo, 1 AS bar 
        UNION SELECT 1 AS foo, NULL AS bar) b ON b.foo = a.id OR b.bar = a.id
This would be useful when aggregating data where, essentially, there are two ways to join the data with the main table, and both columns can be null.

Of course, in this example you could count with COUNT(b.foo) + COUNT(b.bar), but that's a bit awkward; or you could count on a column in table b that you know is never null. But what if you don't know of one? And still have table c next to it?

Yes, in all cases there would be a way out. In the extreme case, you could wrap it in a virtual table where you add a column that is always 0 (not null), so you can count on that. It would just be neat if b.* were possible.


I might misunderstand, but I think the join just isn't the right approach for that sort of child record counting, i.e. counting records from two or more independent child tables associated with your rows (if that is what you're wanting).

You're grouping and counting after the three-way join. That join will produce all combinations of child records between the two child tables associated with any given parent row (almost never what is wanted). So any given non-null thing you're counting from one child record will appear multiple times: once for each child record in the other table associated with the parent row.

I think you just want to use correlated subqueries to count the child records:

  SELECT a.id,
         (SELECT COUNT(whatever) FROM child1 c1 WHERE c1.a_id = a.id),
         (SELECT COUNT(whatever) FROM child2 c2 WHERE c2.a_id = a.id)
    FROM a ...

TLDR: You almost never want to join independent children to a common parent, use independent correlated subquery expressions instead.
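A quick sketch of both behaviours (SQLite via Python here purely for illustration; the table and column names are made up):

```python
import sqlite3

# Parent row 1 has two rows in child1 and three rows in child2.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE child1 (a_id INTEGER);
    CREATE TABLE child2 (a_id INTEGER);
    INSERT INTO a VALUES (1);
    INSERT INTO child1 VALUES (1), (1);
    INSERT INTO child2 VALUES (1), (1), (1);
""")

# Counting after a three-way join: each child1 row pairs with each
# child2 row, so both counts come back as 2 * 3 = 6.
joined = con.execute("""
    SELECT a.id, COUNT(c1.a_id), COUNT(c2.a_id)
      FROM a
      JOIN child1 c1 ON c1.a_id = a.id
      JOIN child2 c2 ON c2.a_id = a.id
     GROUP BY a.id
""").fetchone()
print(joined)  # (1, 6, 6) -- inflated, not the 2 and 3 we wanted

# Correlated subqueries count each child table independently.
correlated = con.execute("""
    SELECT a.id,
           (SELECT COUNT(*) FROM child1 c1 WHERE c1.a_id = a.id),
           (SELECT COUNT(*) FROM child2 c2 WHERE c2.a_id = a.id)
      FROM a
""").fetchone()
print(correlated)  # (1, 2, 3)
```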


Except that's potentially super slow if the optimiser does not realise what to do. In its default state, it will make two table lookups for each row in table a. So that's 1 + N*2 lookups compared to 3 lookups in my example.

For little data that's probably fine, but for a big database it will be slow. However, the optimiser may be able to handle that? I know Sybase's and MSSQL's had trouble with it, but I've heard Postgres's might be able to.


First, correctness comes first. Second, correlated subqueries in selects really aren't noticeably slow at all in my experience (Postgres, Oracle) when they're really needed, meaning the alternatives are as slow or slower. Especially compared to the alternative of actually computing a cartesian product across independent children of any size. I don't think the optimizer could help much with avoiding that cartesian product being actually realized, either: the grouping after that join is going to be a big sort in general, with results that depend on the distribution of child values on a row-by-row basis. But in any case it would give the wrong answers for both children regardless of speed and memory used!


I guess I didn't make myself clear either.

If there's an inner join, then there is a matching row. It will be counted by a normal COUNT( * ). On an outer join, columns in a 'missing' row will be represented as NULL.

You're saying you wish there were two NULLs, a 'missing value in table' null and a 'missing row in join' null, and that you could count the first one?
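For what it's worth, the outer-join case is easy to see in miniature (SQLite via Python for illustration; hypothetical tables):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (a_id INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (1);   -- a.id = 2 has no match in b
""")

# COUNT(*) counts the NULL-padded row from the outer join;
# COUNT(b.a_id) skips it because the column is NULL there.
rows = con.execute("""
    SELECT a.id, COUNT(*), COUNT(b.a_id)
      FROM a LEFT JOIN b ON b.a_id = a.id
     GROUP BY a.id
     ORDER BY a.id
""").fetchall()
print(rows)  # [(1, 1, 1), (2, 1, 0)]
```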


No, not that complex. Just that I want to be able to specify which of my joined tables I am counting on in a null-safe way, in case the joining table could have null values in all its columns.


Any reason a regular count( * ) wouldn't work?

  SELECT a.id, count( * ) FROM ...


It would; I realised after reading the replies that I should have used the two-joined-tables scenario, as per my reply to your sibling.


Some of these have been learned through trial and error over the years, but a few were new and great to know.

On a related note, is the MCSE the gold standard for SQL education? I've been looking for a way to brush up and formalize my SQL skills.


Would someone please confirm whether this article is misrepresenting a subquery as an inline CTE? It is my understanding that as of PostgreSQL 12, a programmer denotes a CTE as "AS MATERIALIZED", "AS NOT MATERIALIZED", or neither and allows the default behaviour: the CTE subquery will be inlined by default if its result is used only once.

for reference: https://sudonull.com/posts/998-Important-changes-in-the-CTE-...

Generally speaking, some clarification would be helpful!


I think both the article and you are correct. The article is worded a little oddly, though, and ignores the fact that in Postgres 12 CTEs that are referenced multiple times are MATERIALIZED by default.

Before Postgres 12, CTEs were always materialized, so you did not get the query optimization benefits of CTEs acting like inline subqueries.

Since Postgres 12, CTEs default to NOT MATERIALIZED if referenced only once, or MATERIALIZED if referenced more than once. You can override this via MATERIALIZED or NOT MATERIALIZED when defining the CTE.

Their example is showing that you can let Postgres (before 12) optimize a CTE for you by writing it as an inline subquery instead of a CTE:

  SELECT *
  FROM (
    SELECT *
    FROM sale
  ) AS inlined
  WHERE created_by_id = 1

But with Postgres 12, their "don't" example would result in an index scan without refactoring to the "do" example. Basically, their advice on do vs don't applies before Postgres 12.

https://www.postgresql.org/docs/12/queries-with.html is pretty thorough on this


Thanks for confirming


This is a pretty trivial list. Useful for beginners I guess.

I seriously take issue with "Reference Column Position in GROUP BY and ORDER BY" though. If it is restricted to ad-hoc (AKA messing-about) queries I'd be fine with it, but it won't be. Just don't do it.


It's especially egregious in the ORDER BY, since there you have the option of using column aliases.


Are you saying you can't use column aliases in group by? What version of Postgres are you using? I just tried it in 11.5 and it worked:

    # select cust_id as c, sum(avail_balance) as b from account group by c order by b;
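SQLite accepts the same thing, for what it's worth (Python here just to run it; made-up data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE account (cust_id INTEGER, avail_balance REAL);
    INSERT INTO account VALUES (1, 100), (1, 50), (2, 30);
""")

# SQLite, like Postgres, resolves output-column aliases in
# GROUP BY and ORDER BY; other engines (MSSQL, Oracle) do not.
rows = con.execute("""
    SELECT cust_id AS c, SUM(avail_balance) AS b
      FROM account
     GROUP BY c
     ORDER BY b
""").fetchall()
print(rows)  # [(2, 30.0), (1, 150.0)]
```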


I was thinking of Oracle, where aliases are evaluated at column projection time, so after grouping but before ordering.


Interesting. It doesn't work in MSSQL, and I understand that's correct (i.e. it isn't allowed) per the standard.


Huh - I guess I never thought about it. It makes sense to disallow it, though - column aliases are there to rename complex expressions, which you probably _shouldn't_ be grouping on anyway.


> which you probably _shouldn't_ be grouping on anyway.

This is frequently unavoidable, though. Or more precisely: it could be avoided with a sane database design, but the databases on which I have to work for my day job are the precise opposite of "well-designed", so grouping on complex expressions is unfortunately an inevitability.


That's true - I can definitely imagine having to group on something like "concat(lastname, ', ', firstname)".


I think it's for other reasons (and grouping on expressions is quite reasonable anyway).

It's (IIRC!) something to do with the situation of

  select x + y as x 
  from ...
  group by x
which x are we talking about? (Logically that example is crap because only the alias x makes sense, but something like that anyway).


I always, always forget what column aliases I can use where. Thanks for the reminder.


It's useful for people new to Postgres, since many of these things are particular to Postgres.


Not really. Most is pretty well standard SQL (CTE optimisation fence pre PG12 being one exception, and there are a couple more, but really it's mostly standard stuff).


The :: syntax for CAST() is also a Postgres-ism.


Maybe that is what it is now, but I still have muscle memory using it in Informix.


Maybe. I come from the Oracle world and the majority of these don't apply to Oracle.


I'd never run across coalesce before. I usually end up doing nested NVL calls if I'm trying to find the first non-null in a series of expressions (I'm on Oracle, btw). I've now added this function to my toolbox.


Coalesce and NVL are synonyms for each other.


They don't seem to be in Oracle. Giving more than two parameters to NVL gives me an error, but it works fine with COALESCE. Granted, they are basically the same thing if you give each of them two parameters.
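Same story in SQLite, which pairs a variadic COALESCE with a strictly two-argument IFNULL (its NVL analogue). A quick check via Python, just for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# COALESCE takes any number of arguments and returns the first
# non-null one; IFNULL takes exactly two, like Oracle's NVL.
row = con.execute(
    "SELECT COALESCE(NULL, NULL, 'fallback'), IFNULL(NULL, 'x')"
).fetchone()
print(row)  # ('fallback', 'x')
```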


I would add to the common mistakes (these should be generic, but I have more experience with SQL Server):

Not indexing: most often, tables are unindexed or poorly indexed.

Implicit conversion: it can generate a lot of IO, lead to poor performance, or just prevent index use.

SQL functions: sorry, but they are most often crap and useless; better to inline them or use a TVF, and no, it's not code logic duplication.

Read uncommitted: unless you enjoy missing rows, reading rows multiple times, or reading half of a value (page splits and/or LOB values).


> SQL functions: sorry, but they are most often crap and useless; better to inline them or use a TVF, and no, it's not code logic duplication.

Functions have helped me tremendously in SQL server, but you do have to know the issues, which can be taken advantage of to some degree.

Code reuse is the obvious use case, but the lack of inlining prior to SQL Server 2019 meant you could reduce performance compared to a hand-inlined CASE statement or whatever. Hopefully, now that functions can be inlined in 2019, this will be a non-issue going forward.

They are an optimization barrier, which can be a good thing. I have used this to my advantage to stabilize tricky queries that were using views for code reuse. The performance becomes consistent and predictable rather than going pathological on some databases, even though it may be slightly slower on others.


Can an MS SQL Server expert explain why creating table variables inside sprocs is so popular? I've done lots of Postgres & MySQL (and, long ago, lots of Oracle), but now I find myself maintaining some MS SQL Server projects, and I see `DECLARE @intermediateResults` all over the place. I thought it was just the past developers being more familiar with imperative style than SQL, but it seems to be part of the broader MS SQL Server culture, and somewhere I came across a passing reference to read locking being a reason, but it was too brief for me to understand. Is there any reason to do this so often in MS SQL Server vs other RDBMSes?


A great thing about table variables is that you can pass them as parameters to stored procedures, which gives a straightforward way to have procs do set-based operations, avoiding the need to call procs in a loop.

I find this approach allows me to write smaller procs that become reusable in different contexts, effectively composable.

Be careful: above a thousand records or so, table variables perform poorly, so I usually dump the data into a temp table inside the proc, or into a permanent table with a key unique to this execution of the proc (this could be the SPID).


I use them all the time as well.

Basically, they are just very convenient compared to temp tables (they are essentially function/procedure-scoped temp tables rather than connection-scoped) and, yes, they allow you to go for an imperative execution you can count on. They also allow you to share results across statements, not just within the same statement like a CTE.

I haven't used Postgres very much, but SQL's declarative style is great until it isn't. When the optimizer is being dumb, it's really nice to have the tools to go imperative and just tell it what to do in what order, like, you know, a normal programming language.

Maybe it's because many SQL Server programmers are also .NET programmers and are used to just throwing results in a list or dictionary to process data. This is partially why I use them: I let SQL do its thing, but when it doesn't, I can just stuff the results into a table variable and do the next step with a separate isolated statement or cursor or whatever thing you're not supposed to do in SQL to actually get the job done.

This goes for query hints too, which seems to be a big no no in the PG world.


SQL functions can be inlined in Postgres. And Postgres does not support read uncommitted to begin with.

But I agree that implicit conversion is the root of a lot of evil.


> Read uncommitted: unless you enjoy missing rows, reading rows multiple times, or reading half of a value (page splits and/or LOB values).

Please elaborate.


https://www.mssqltips.com/sqlservertip/6072/sql-server-noloc...

The first time I experienced the LOB issue was in 2009. At that time, I didn't know how to reliably reproduce the issue due to my lack of knowledge (which also triggered a will to deep-dive into SQL Server internals). Not my article, nor his, but more generally: Paul White / SQL Kiwi has published a load of great articles on internals, maybe some of the best technical articles I have read.


In the "Avoid Transformations on Indexed Fields" part, I fail to understand how the example can work if you're applying the timezone computation on the right-hand side.

I'm not familiar with MS SQL (I've only worked with MySQL / PostgreSQL); can someone explain to me how it works?


Your only failure is that it's just wrong. It's about the same as trying to change "if((a + b) > c)" to "if(a > (c + b))". If this weren't time zones, it'd obviously be "if(a > (c - b))", because you have to balance the equation by applying the same operation to both sides. But because this is dealing with timezones, the offset of "b" is different depending on the value of "a", so we won't know what to subtract from "c" to get the right comparison. So the right transformation for this "gotcha" is not even possible.


I think the advice would still work, but you'd need to switch from "named" timezones to a numeric offset, for example replacing `PST` with `-08:00`, and then apply the opposite conversion on the right side (as you and I suggested).


I don't think it works the way the author expects it to, as the math is not correct. Think about the comparison `a+1 < 2`. To remove the +1, you need to change it to `a < 2-1`, not `a < 2+1`; the operation needs to be transformed into its opposite, which in this case would imply shifting the timezone in the opposite direction.

If you are asking about the timezone shift applied to a date, I think the engine converts the date to 00:00:00 timestamp and then does the timezone conversion.


I think the advice is correct, but the examples are not. When the transformation switches to the other side of the comparison it has to be inverted.


In regards to formatting SQL, I used to do it the way shown, but a coworker formatted the columns in the select with the commas in front. This seemed strange to me until I tried it. I realized that it solved the problem where a query would be changed and the last item in the select list removed, but the trailing comma not removed; or a new item would be added to the end of the select list without a comma being added to the end of the previous last item.

  SELECT
    col1
    ,col2
    ,COUNT(col3)
  FROM
    t1
    JOIN t2 ON t1.pk = t2.fk
  WHERE
    col1 = col2
    AND col3 > col4
  GROUP BY
    col1
    ,col2
  HAVING
    COUNT(col3) > 1


I hate, hate, hate the way it looks, but unfortunately it is objectively superior. This is why I wish more languages went out of their way to make their syntax tolerant of optional trailing commas.


I don't bother with this but instead try to keep column names alphabetical. I try to do the same thing in other languages with property names, etc. This gives greater stability in version control, as it removes the temptation to change the order for subjective reasons. Even better if it can be automated with linting.

This "fix" only helps with the case where new items are inserted at the end. By using alphabetic ordering you increase the chances that a new item will be inserted at the beginning or the middle, in which case it makes no difference where you put the commas.

It only really helps at all because inserting new items at the end is common, whereas removing items from the start is rare, at least in SQL (all it really does is displace the dangling comma problem from the former case to the latter). However, always inserting new items at the end tends to lead to unintuitive ordering, which in turn leads to additional VCS churn when that becomes seen as technical debt.

Commas at the start does give objectively better readability in one sense: it ensures they are aligned vertically. That makes it easier to spot errors at a glance. You might call this "ease of formal reviewability".

However, in practice it seems to be worse for general readability, for the entirely subjective reason you already pointed to. Since code is typically read more often than it is written, it's important to optimise for that first.

Since an error here will always cause a hard fail, it falls into a different category than say, omitting braces in a C-style if statement, which can introduce subtle bugs through bad merges. In the latter case, ease of formal reviewability has to take precedence over subjective aesthetics.


This also avoids the majority of merge conflicts. Which is an even better reason to do it.


Well, this simply moves the problem of "commenting out" one column from the end of the list to the start of the list. I find myself doing both equally frequently.


  where 1=1
    and first = condition


Similarly, by having the AND start in each subclause instead of being at the end of the previous clause, it's easier to "--" comment out specific clauses during development. Again, harder to read, but easier to work with.


For whatever reason, I actually find the leading and/or to be more readable than the trailing and/or, as long as there are more than two. The leading commas look too noisy to me, though.


Personally I find that one easier to read and tend to write e.g.:

  WHERE foo
  AND   bar
Or with long, complex ones:

  WHERE
      foo
  AND
      bar


Interesting. Can you please provide an example of what you mean?


  WHERE
    T.first_name = 'Ender' AND
    T.last_name = 'Wiggins'

versus

  WHERE
    T.first_name = 'Ender'
    AND T.last_name = 'Wiggins'

you could also do

  WHERE 1=1
    AND T.first_name = 'Ender'
    AND T.last_name = 'Wiggins'


Thanks a lot, I thought it might have been WHERE 1=1.

I will now use that to easily remove filters.

An extension for the SELECT, based on the very insightful top comment, could be:

  SELECT NULL
    ,something1
    ,something2


My OCD simply refuses to allow me to write it that way. I also tend to write SQL in lowercase which is apparently semi-controversial.


My (diagnosed) OCD compels me to use leading commas. I also use "WHERE 1=1" so each where constraint begins with AND/OR. I tend to write a lot of "investigatory" SQL that requires frequent commenting. I rarely need to comment out the first column or constraint, so commenting in general is much cleaner & faster.

I did not learn SQL this way. I was shown these tricks and their adoption was near immediate BECAUSE it just made more sense to my brain.

This isn't to devalue/argue against your comment. I just find it interesting that the common denominator for OCD-ish compulsions is that the "right" way for an individual OCD brain is usually specific to that individual brain.


If you write a lot of SQL, writing it in lowercase is OK. That's what I do anyway. If I insert simple prepared statements into code, I usually write keywords with uppercase, but larger code blocks, I write in lowercase. I also tend to write in much fewer lines than the above example, and indent FROM and other clauses further in than the SELECT keyword (so I know they belong to that SELECT statement).


I wish I could use ON for the selection criteria of the first table instead of a WHERE clause:

  SELECT A.value, B.value
    FROM tableA A ON A.id = 77
    JOIN tableB B ON B.id = A.bId


> 2019-22-11: Fixed the examples in the "Faux Predicate" section after several keen eyed readers noticed it was backwards.

What abomination of a date format is this? I can only assume this is a bug, a typo, or an easter egg for those paying attention. Please let it be one of those. The last thing the world needs is people pushing yet another crazy date format into use.


It looks like a typo. Hebrew date style is yyyymmdd[1].

[1] https://www.ibm.com/support/knowledgecenter/en/SSS28S_8.1.0/...


Edit: “Don’t use an ORM” should be point 1


The opposite. Point 1 should be: don't use an ORM unless you don't know SQL. But you should know SQL, so don't use an ORM.

An ORM only works until the point where you need to join tables. As soon as that's needed the ORM just causes you endless trouble.


Edited, I meant don't use one.



