While counting columns will not include NULL columns, how about counting joined tables?
SELECT a.id, COUNT(b.*) FROM a JOIN b ON b.a_id = a.id GROUP BY a.id
is not permitted in Postgres.
Sure, I could just use COUNT(b.a_id) since that's what I join on, but a more complicated example might not allow for that, for instance if it were a virtual table.
You're right, it's a bad example. Imagine I'm joining two tables:
SELECT a.id, COUNT(b.*), COUNT(c.*)
FROM a JOIN b ON b.a_id = a.id JOIN c ON c.a_id = a.id
GROUP BY a.id
I want to know how many occurrences a_id has in both table b and c. Again in this simple example, I could just count on b.a_id and c.a_id, respectively, but imagine if b and c were complex virtual tables:
JOIN (SELECT NULL AS foo, 1 AS bar
UNION SELECT 1 AS foo, NULL AS bar) b ON b.foo = a.id OR b.bar = a.id
This would be useful if we are aggregating data together, where essentially, there are two ways to join the data with the main table, and both columns can be null.
Of course, in this example you could count with COUNT(b.foo) + COUNT(b.bar), though that's a bit awkward, or count on a column in table b that you know is never null. But what if there is no such column? And table c is still sitting next to it?
Yes, in all cases, there would be a way out. In the extreme case, you could wrap it in a virtual table, where you add a column that is just always 0 (not null), so you can count on it. It would just be neat if b.* was possible.
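For what it's worth, a less awkward workaround in Postgres might be an aggregate FILTER clause, which counts only rows matching a predicate. This is just a sketch against the foo/bar example above (with a LEFT JOIN so non-matches can appear at all):

SELECT a.id,
       COUNT(*) FILTER (WHERE b.foo IS NOT NULL OR b.bar IS NOT NULL)
FROM a
LEFT JOIN b ON b.foo = a.id OR b.bar = a.id
GROUP BY a.id

Still not as tidy as COUNT(b.*) would be, but it avoids the extra wrapper table.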
I might misunderstand, but I think a join just isn't the right approach for that sort of child-record counting, i.e. counting records from two or more independent child tables associated with your rows (if that is what you're after).
You're grouping and counting after the three-way join. That join produces all combinations of child records between the two child tables for any given parent row (almost never what is wanted). So any non-null thing you're counting from one child table will appear multiple times: once for each child record in the other table associated with the parent row.
I think you just want to use correlated subqueries to count the child records:
select a.id,
  (select count(whatever) from child1 c1 where c1.a_id = a.id),
  (select count(whatever) from child2 c2 where c2.a_id = a.id)
from a
...
TLDR: You almost never want to join independent children to a common parent, use independent correlated subquery expressions instead.
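To make the inflation concrete (table and column names here are hypothetical): if a parent row has 2 rows in child1 and 3 in child2, the join form returns 6 for both counts, while the correlated form returns 2 and 3 as intended.

-- Join form: each count is inflated to 2 * 3 = 6 for that parent row
SELECT a.id, COUNT(c1.id), COUNT(c2.id)
FROM a
JOIN child1 c1 ON c1.a_id = a.id
JOIN child2 c2 ON c2.a_id = a.id
GROUP BY a.id;

-- Correlated form: counts each child table independently
SELECT a.id,
       (SELECT COUNT(*) FROM child1 c1 WHERE c1.a_id = a.id),
       (SELECT COUNT(*) FROM child2 c2 WHERE c2.a_id = a.id)
FROM a;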
Except that's potentially super slow if the optimiser doesn't realise what to do. In its default state, it will make two table lookups for each row in table a. So that's 1 + 2N lookups, compared to 3 lookups in my example.
For small data sets that's probably fine, but for a big database it will be slow. However, the optimiser may be able to handle it? I know Sybase's and MSSQL's had trouble with it, but I've heard Postgres's might be able to.
First, correctness first. Second, correlated subqueries in selects really aren't noticeably slow at all in my experience (Postgres, Oracle) when they're really needed, meaning the alternatives are as slow or slower, especially the alternative of actually realizing a cartesian product across independent children of any size. I don't think the optimizer could help much with avoiding that cartesian product being realized, either: the grouping after that join is going to be a big sort in general, with results that depend on the distribution of child values on a row-by-row basis. But in any case it would give the wrong answers for both children, regardless of speed and memory used!
If there's an inner join, then there is a matching row. It will be counted by a normal COUNT(*).
On an outer join, columns in a 'missing' row will be represented as NULL.
You're saying you wish there were two NULLs, a 'missing value in table' null and a 'missing row in join' null, and that you could count the first kind?
No, not that complex. Just that I want to be able to specify which of my joined tables I am counting on in a null-safe way, in case the joining table could have null values in all its columns.
Would someone please confirm whether this article is misrepresenting a subquery as an inline CTE? It is my understanding that as of PostgreSQL 12, a programmer denotes a CTE as "AS MATERIALIZED", "AS NOT MATERIALIZED", or neither, allowing the default behavior: the CTE subquery will be inlined if its result is used only once.
I think both the article and you are correct; the article is worded a little oddly, though, and ignores the fact that in Postgres 12, CTEs that are referenced multiple times are MATERIALIZED by default.
Before Postgres 12, CTEs were always materialized, so you did not get the query-optimization benefits of CTEs acting like inline subqueries.
Since Postgres 12, CTEs default to NOT MATERIALIZED if referenced only once, or MATERIALIZED if referenced more than once. You can override this with MATERIALIZED or NOT MATERIALIZED when defining the CTE.
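Concretely, the Postgres 12+ syntax looks like this (reusing the article's sale table; on pre-12 versions the keyword isn't accepted at all):

WITH s AS NOT MATERIALIZED (
    SELECT * FROM sale
)
SELECT * FROM s WHERE created_by_id = 1;

With NOT MATERIALIZED, the planner can push the created_by_id predicate into the CTE and use an index; writing AS MATERIALIZED instead forces the pre-12 optimization-fence behavior.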
Their example is showing that you can let Postgres (before 12) optimize a CTE for you by writing it as an inline subquery instead of a CTE:
SELECT *
FROM (
    SELECT *
    FROM sale
) AS inlined
WHERE created_by_id = 1
But with Postgres 12 their "don't" example would result in an index scan without refactoring to the "do" example. Basically their advice on do vs don't applies to before Postgres 12.
This is a pretty trivial list. Useful for beginners I guess.
I seriously take issue with "Reference Column Position in GROUP BY and ORDER BY" though. If it is restricted to ad-hoc (AKA messing-about) queries I'd be fine with it, but it won't be. Just don't do it.
Huh - I guess I never thought about it. It makes sense to disallow it, though - column aliases are there to rename complex expressions, which you probably _shouldn't_ be grouping on anyway.
> which you probably _shouldn't_ be grouping on anyway.
This is frequently unavoidable, though. Or more precisely: it could be avoided with a sane database design, but the databases on which I have to work for my day job are the precise opposite of "well-designed", so grouping on complex expressions is unfortunately an inevitability.
Not really. Most of it is pretty much standard SQL (the CTE optimisation fence pre-PG12 being one exception, and there are a couple more, but really it's mostly standard stuff).
I'd never run across coalesce before. I usually end up doing nested NVL calls if I'm trying to find the first non-null in a series of expressions (I'm on Oracle, btw). I've now added this function to my toolbox.
They don't seem to be the same in Oracle. Giving more than two parameters to NVL gives me an error, but it works fine with COALESCE. Granted, they are basically the same thing if you give each exactly two parameters.
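For anyone following along, here are the two forms side by side (column names are made up):

-- Nested NVL (Oracle; NVL takes exactly two arguments)
SELECT NVL(NVL(home_phone, work_phone), 'unknown') FROM contacts;

-- COALESCE (standard SQL; any number of arguments, returns the first non-null)
SELECT COALESCE(home_phone, work_phone, cell_phone, 'unknown') FROM contacts;

One subtle difference worth knowing in Oracle: COALESCE stops evaluating arguments at the first non-null, whereas NVL evaluates both of its arguments regardless.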
> SQL functions: sorry, but they are most often crap and useless; better to inline or use a TVF, and no, it's not code-logic duplication.
Functions have helped me tremendously in SQL server, but you do have to know the issues, which can be taken advantage of to some degree.
Code reuse is the obvious use case, but the lack of inlining before SQL Server 2019 meant a function could hurt performance compared to a hand-inlined CASE statement or the like. Now that scalar functions can be inlined in 2019, hopefully this will be a non-issue going forward.
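For reference, SQL Server 2019 inlines eligible scalar UDFs automatically, and you can control it per function with the INLINE option. A sketch (function and parameter names are invented):

CREATE FUNCTION dbo.DiscountedPrice(@price MONEY, @pct DECIMAL(5,2))
RETURNS MONEY
WITH INLINE = ON  -- use INLINE = OFF to opt a function out of inlining
AS
BEGIN
    RETURN @price * (1 - @pct / 100);
END;

Before 2019, a scalar UDF like this would be called row by row; with inlining, the expression is folded into the calling query's plan.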
They are an optimization barrier, which can be a good thing. I have used this to my advantage to stabilize tricky queries that were using views for code reuse. The performance becomes consistent and predictable rather than going pathological on some databases, even though it may be slightly slower on others.
Can an MS SQL Server expert explain why creating table variables inside sprocs is so popular? I've done lots of Postgres & MySQL (and long ago lots of Oracle), but now I find myself maintaining some MS SQL Server projects, and I see `DECLARE @intermediateResults` all over the place. I thought it was just the past developers being more familiar with imperative style than SQL, but it seems to be part of the broader MS SQL Server culture, and somewhere I came across some passing reference to read locking being a reason, but it was too brief for me to understand. Is there any reason to do this so often in MS SQL Server vs other RDMBSes?
A great thing about Table Variables is that you can pass them as parameters to stored procedures, which gives a straight forward way to have procs do set based operations, avoiding the need to call procs in a loop.
I find this approach allows me to write smaller procs that become reusable in different contexts, effectively composable.
Be careful: beyond a thousand records or so, table variables perform poorly, so I usually dump the data into a temp table inside the proc, or into a permanent table with a key unique to that execution of the proc (the SPID works for this).
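A sketch of the pattern described above (T-SQL; type, proc, and column names are invented). Passing a table into a proc requires a user-defined table type and a READONLY parameter:

CREATE TYPE dbo.IdList AS TABLE (id INT PRIMARY KEY);
GO
CREATE PROCEDURE dbo.ProcessIds
    @ids dbo.IdList READONLY
AS
BEGIN
    -- set-based work over the whole batch instead of calling the proc in a loop
    SELECT t.id FROM @ids AS t;
END;
GO
DECLARE @batch dbo.IdList;
INSERT INTO @batch (id) VALUES (1), (2), (3);
EXEC dbo.ProcessIds @ids = @batch;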
Basically they are just very convenient compared to temp tables (they are essentially function/procedure-scoped temp tables, rather than connection-scoped), and yes, they let you count on an imperative execution order. They also let you share results across statements, not just within the same statement like a CTE.
I haven't used Postgres very much, but SQL's declarative style is great until it isn't. When the optimizer is being dumb, it's real nice to have the tools to go imperative and just tell it what to do, in what order, like in a normal programming language.
Maybe it's because many SQL Server programmers are also .NET programmers and are used to just throwing results in a list or dictionary to process data. This is partially why I use them: I let SQL do its thing, but when it doesn't, I can just stuff results into a table variable and do the next step with a separate isolated statement, or a cursor, or whatever thing you're not supposed to do in SQL to actually get the job done.
This goes for query hints too, which seems to be a big no no in the PG world.
The first time I experienced the LOB issue was in 2009. At the time I didn't know how to reliably reproduce it, due to my lack of knowledge, which also triggered a desire to dive deep into SQL Server internals. (Not my article, nor his, but more generally: Paul White / SQL Kiwi has published a load of great articles on internals, maybe some of the best technical articles I have read.)
In the "Avoid Transformations on Indexed Fields" part, I fail to understand how the example can work if you're applying the timezone computation on the right-hand side.
I'm not familiar with MS SQL (I've only worked with MySQL / PostgreSQL); can someone explain to me how it works?
Your only failure is that it's just wrong. It's about the same as trying to change "if((a + b) > c)" to "if(a > (c + b))". If this weren't time zones, it'd obviously be "if(a > (c - b))", because you have to balance the equation by applying the same operation to both sides. But because this is dealing with time zones, the offset of "b" is different depending on the value of "a", so we won't know what to subtract from "c" to get the right comparison. So the right transformation for this "gotcha" is not even possible.
I think the advice will still work, but you'd need to switch from "named" time zones to a fixed numeric offset, so for example replace `PST` with `-08:00`, and then apply the opposite conversion on the right side (as you and I suggested).
I don't think it works the way author expects it to work, as the math is not correct. Think about `a+1 < 2` comparison. To remove +1, you need to change it to `a < 2-1`, not to `a < 2+1`; the operation needs to be transformed to the opposite one, which in this case would imply shifting the timezone in the opposite direction.
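In Postgres terms, the sargable version keeps the indexed column untouched and applies the opposite-direction conversion to the constant instead. A sketch (table, column, and zone are illustrative; created_at is assumed to be timestamptz):

-- Don't: the expression on the column defeats a plain index on created_at
SELECT * FROM events
WHERE created_at AT TIME ZONE 'America/Los_Angeles' >= '2019-11-01';

-- Do: convert the literal the other way instead
SELECT * FROM events
WHERE created_at >= '2019-11-01 00:00'::timestamp AT TIME ZONE 'America/Los_Angeles';

Note that AT TIME ZONE flips direction depending on the input type: applied to a timestamptz it yields a local timestamp, and applied to a plain timestamp it yields a timestamptz, which is exactly the "opposite conversion" being described.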
If you are asking about the timezone shift applied to a date, I think the engine converts the date to 00:00:00 timestamp and then does the timezone conversion.
In regards to formatting SQL, I used to do it the way shown, but a coworker formatted the columns in the select list with the commas in front. This seemed strange to me until I tried it. I realized it solves the problem where a query is changed and the last item in the select list is removed but the trailing comma is not, or a new item is added to the end of the list without a comma after the previous last item.
I hate hate hate the way it looks, but unfortunately it is objectively superior. This is why I wish more languages went out of their way to make their syntax tolerant of optional trailing commas.
I don't bother with this but instead try to keep column names alphabetical. I try to do the same thing in other languages with property names, etc. This gives greater stability in version control, as it removes the temptation to change the order for subjective reasons. Even better if it can be automated with linting.
This "fix" only helps with the case where new items are inserted at the end. By using alphabetic ordering you increase the chances that a new item will be inserted at the beginning or the middle, in which case it makes no difference where you put the commas.
It only really helps at all because inserting new items at the end is common, whereas removing items from the start is rare, at least in SQL (all it really does is displace the dangling comma problem from the former case to the latter). However, always inserting new items at the end tends to lead to unintuitive ordering, which in turn leads to additional VCS churn when that becomes seen as technical debt.
Commas at the start does give objectively better readability in one sense: it ensures they are aligned vertically. That makes it easier to spot errors at a glance. You might call this "ease of formal reviewability".
However, in practice it seems to be worse for general readability, for the entirely subjective reason you already pointed to. Since code is typically read more often than it is written, it's important to optimise for that first.
Since an error here will always cause a hard fail, it falls into a different category than say, omitting braces in a C-style if statement, which can introduce subtle bugs through bad merges. In the latter case, ease of formal reviewability has to take precedence over subjective aesthetics.
Well, this simply moves the problem of "commenting out" one column from the end of the list to the start. I find myself doing both about equally often.
Similarly, by having the AND start in each subclause instead of being at the end of the previous clause, it's easier to "--" comment out specific clauses during development. Again, harder to read, but easier to work with.
For whatever reason, I actually find the leading AND/OR more readable than the trailing one, as long as there are more than two clauses. The leading commas look too noisy to me, though.
My (diagnosed) OCD compels me to use leading commas. I also use "WHERE 1=1" so each where constraint begins with AND/OR. I tend to write a lot of "investigatory" SQL that requires frequent commenting. I rarely need to comment out the first column or constraint, so commenting in general is much cleaner & faster.
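The style in question, for anyone who hasn't seen it (table and column names are made up); any line except the first can be commented out with "--" without touching its neighbours:

SELECT id
     , name
--   , legacy_code
     , created_at
FROM customer
WHERE 1=1
  AND status = 'active'
--AND region = 'EU'
;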
I did not learn SQL this way. I was shown these tricks and their adoption was near immediate BECAUSE it just made more sense to my brain.
This isn't to devalue/argue against your comment. I just find it interesting that the common denominator for OCD-ish compulsions is that the "right" way for an individual OCD brain is usually specific to that individual brain.
If you write a lot of SQL, writing it in lowercase is OK. That's what I do anyway. If I insert simple prepared statements into code, I usually write keywords with uppercase, but larger code blocks, I write in lowercase. I also tend to write in much fewer lines than the above example, and indent FROM and other clauses further in than the SELECT keyword (so I know they belong to that SELECT statement).
> 2019-22-11: Fixed the examples in the "Faux Predicate" section after several keen eyed readers noticed it was backwards.
What abomination of a date format is this? I can only assume this is a bug, a typo, or an easter egg for those paying attention. Please let it be one of those. The last thing the world needs is people pushing yet another crazy date format into use.