mytherin's comments | Hacker News

One of the DuckDB maintainers here. To clarify - the UI is not built into the DuckDB release. It is an extension that is downloaded and installed like any other extension. This extension happens to be developed by MotherDuck. We collaborated with them to streamline the experience - but fundamentally the extension is not distributed as part of DuckDB and works similarly to other extensions.

To be specific, the work we did was:

* Add the -ui command to the shell. This executes a SQL query (CALL start_ui()). The query that gets executed can be customized by the user through the .ui_command option - e.g. by setting .ui_command my_ui_function().

* The ui extension is automatically installed and loaded when the start_ui function is executed - similar to other trusted extensions we distribute. The automatic install and load can be disabled through configuration (SET autoinstall_known_extensions=false, SET autoload_known_extensions=false) and is also disabled when SET enable_external_access=false.
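
Concretely, a sketch of the commands and settings mentioned above (in the DuckDB shell, the -ui flag runs the same CALL):

  -- launch the UI manually; this auto-installs and loads the ui extension
  -- on first use
  CALL start_ui();

  -- opt out of automatic installation/loading of known extensions
  SET autoinstall_known_extensions = false;
  SET autoload_known_extensions = false;

  -- disabling external access also disables the auto-install
  SET enable_external_access = false;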


The nature of the UI as an extension is somewhat hard to understand, since its installation method differs from that of other extensions, even core ones. Some extensions autoload, some require an INSTALL query, and this one has its own special built-in query. Its user experience at least makes it feel more ingrained than other extensions.

Then there's the (to me) entirely new feature of an extension providing an HTTP proxy for an external web service. This part could have been explained more prominently.

Edit: the OP states that there is a "built-in local UI for DuckDB" and that a "full-featured local web user interface is available out-of-the-box". These statements make me think this feature comes with the release binary, not that it's an extension.

To clarify my point: for me the issue is not the possible confusion about what this plugin does or how, but what this collaboration means for the future of DuckDB's no-cost and commercial usage.


I agree that in certain places the blog post seems to hint that this functionality is fully baked in - we've adjusted the blog post to be more explicit about the fact that this is an extension.

We have collaborated with MotherDuck on streamlining the experience of launching the UI through auto-installation, but the DuckDB Foundation still remains in full control of DuckDB and the extension ecosystem. This has no impact on that.

For further clarification:

* The auto-installation mechanism is identical to that of other trusted extensions - the auto-installation is triggered when a specific function is called that does not exist in the catalog - in this case the `start_ui` function. See [1]. The query I mentioned just calls that function. The only special feature here is the addition of the CLI flag (and what that flag executes is user-configurable).

* The HTTP server is necessary for the extension to function, as the extension needs to communicate with the browser. The server is open-source as part of the extension code [2]. The server (1) fetches web resources (javascript/css) from ui.duckdb.org, and (2) communicates with localhost to coordinate the UI with DuckDB. Outside of these, the server doesn't interface with any other external web services.

[1] https://github.com/duckdb/duckdb/blob/main/src/include/duckd...

[2] https://github.com/duckdb/duckdb-ui


Ok, thank you for the explanation.

I realized that the extension provides an HTTP API to DuckDB. Is this perhaps to become the official way to use DuckDB through HTTP? For me this is much more interesting than one particular UI.

I went looking and found that there's a community extension with similar functionality: https://duckdb.org/community_extensions/extensions/httpserve...

An official, supported HTTP API with stable schema versioning would be a nice addition.


You can query the metadata of a Parquet file using DuckDB's parquet_schema and parquet_metadata functions [1].
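
For example, assuming a local file named example.parquet:

  -- inspect the schema of the Parquet file
  SELECT * FROM parquet_schema('example.parquet');

  -- inspect row group and column chunk metadata
  SELECT * FROM parquet_metadata('example.parquet');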

[1] https://duckdb.org/docs/data/parquet/metadata


Thanks for trying it out!

Could you perhaps open an issue [1] or share the file with us so we could investigate the problem?

[1] https://github.com/duckdb/duckdb/issues


I tried to do "select * from ... limit 1" from a 1.7GB JSON file (array of objects), and I had to increase maximum_object_size to 1GB to make it not throw an error. But DuckDB then consumed 8GB of RAM and sat there consuming 100% CPU (1 core) forever - I killed it after about 10 minutes.
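
Roughly what I ran (a sketch from memory; big.json stands in for my actual file):

  -- raise the per-object limit to ~1GB to avoid the size error
  SELECT *
  FROM read_json_auto('big.json', maximum_object_size = 1000000000)
  LIMIT 1;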

Meanwhile, doing the same with jq ("jq '.[0]'") completed in 11 seconds and consumed about 2.8GB of RAM.

I love DuckDB, but it does seem like something isn't right here.


Perhaps have a look at this article [1].

[1] https://www.vantage.sh/blog/querying-aws-cost-data-duckdb


Features were stripped from Google Image Search because of a lawsuit [1]. Yandex, Bing and others can offer a better image search because they are not big/important enough players in the US search engine market to be worth suing. If Google were to be dethroned, the company taking over its current dominant position would also be sued and would likely have to make the same changes.

[1] https://www.lifehacker.com.au/2020/01/what-happened-to-googl...


The changes from the lawsuit, AFAIK (and the article seems to agree), were removing "view image" and "search by image" from results. Google still had decent reverse image search back then, but now it might as well just throw your input image into Stable Diffusion.


Although true, I don’t consider “lawsuit” an excuse; I think the problem goes deeper, to Google employees focusing on ML problems for career clout rather than customer solutions.


We (the DuckDB team) are very happy working together with MotherDuck in a close partnership [1].

[1] https://duckdblabs.com/news/2022/11/15/motherduck-partnershi...


Most implementations of SQL are not dynamically typed - they are statically typed. There is an explicit compilation phase (`PREPARE`) that compiles the entire plan and handles any type errors.

For example, this query throws a type error during query compilation when run in Postgres, without executing anything or reading a single row of the input data:

   CREATE TABLE varchars(v VARCHAR);
   PREPARE v1 AS SELECT v + 42 FROM varchars;

   ERROR:  operator does not exist: character varying + integer
   LINE 1: PREPARE v1 AS SELECT v + 42 FROM varchars;
                               ^
   HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

The one notable exception to this is SQLite, which has per-value typing rather than per-column typing. As such, SQLite is dynamically typed.
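
To illustrate SQLite's per-value typing (a sketch; note that SQLite's newer STRICT tables behave differently):

  -- SQLite: the declared column type does not constrain individual values
  CREATE TABLE t(v VARCHAR);
  INSERT INTO t VALUES ('hello');  -- stored as TEXT
  INSERT INTO t VALUES (42);       -- stored as INTEGER, not rejected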


That is strong typing. Python, a dynamically typed language, exhibits the same quality:

    >>> v = ""
    >>> v + 42
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: can only concatenate str (not "int") to str

Types are most certainly not static (at least not in Postgres or most other implementations):

    CREATE TABLE varchars(v VARCHAR);
    ALTER TABLE varchars ALTER COLUMN v TYPE INTEGER USING v::integer;
    PREPARE v1 AS SELECT v + 42 FROM varchars;


Types are static within the context of a query, not necessarily within the context of the lifetime of a database. Types are determined and propagated at query compile time. In a query, the identifier "v" refers to a VARCHAR column (as determined by the table definitions). At query compile time an error is thrown when a type violation is detected.

It is distinct from Python because in Python individual objects have types - and type resolution is done strictly at run-time. There is no compile time type checking because types do not exist at compile time at all. Note how in my example I am preparing (i.e. compiling) a query, whereas in your example you are executing the statement.

If you refrain from changing the types of columns in a database and prepare your queries you can detect all type errors before having to process a single row. That is not possible in a dynamically typed language like Python, and is very comparable to the guarantees that a language like Typescript offers you.


> Types are static within the context of a query

Types in Python are also static within the context of a statement. That's not particularly meaningful, though, as programs don't run as individual statements in a vacuum. Just like SQL, the state that has been built up beforehand is quite significant to what each individual statement ends up meaning.

> types do not exist at compile time at all.

Same goes for SQL. Given the previous program, it is impossible to determine what the type of v is until you are ready to execute the SELECT statement. If the CREATE TABLE statement fails, v will not be the VARCHAR you think it is. If someone else modifies the world between your CREATE TABLE statement and your SELECT statement, v also will not be the VARCHAR you think it is. All you can do is throw the code at the database at runtime and hope it doesn't blow up.

> whereas in your example you are executing the statement.

What do you think "" + 42 executes to, then? The thing is, it doesn't execute because it doesn't satisfy the type constraints. It fails before execution. If it were a weakly typed language, like Javascript, then execution may be possible. Javascript produces "42".

> and is very comparable to the guarantees that a language like Typescript offers you.

Not at all. Something like Typescript provides guarantees during development. SQL, being a dynamically typed language, cannot know the types ahead of time – they don't exist until the program has already begun execution – and will throw errors at you in runtime after you've moved into production if you've encountered a problem.


> Same goes for SQL. Given the previous program, it is impossible to determine what the type of v is until you are ready to execute the SELECT statement. If the CREATE TABLE statement fails, v will not be the VARCHAR you think it is. If someone else modifies the world between your CREATE TABLE statement and your SELECT statement, v also will not be the VARCHAR you think it is. All you can do is throw the code at the database at runtime and hope it doesn't blow up.

If you are working in a hypothetical adversarial world where people are changing data types out from under you then that might happen. That will also happen with any other shared data source, regardless of what language you might be using. If you are sharing a set of JSON documents and people start altering documents and removing keys then your Typescript program that ingests those documents will also start failing.

That is not a problem of the language but a fundamental problem of having a shared data source - changing to a different language will not solve that problem.

I think you are conflating SQL the language with a relational database management system. A relational database management system models a shared data source - and a shared data source can be modified by others. That introduces challenges no matter what language you are using.

> What do you think "" + 42 executes to, then? The thing is, it doesn't execute because it doesn't satisfy the type constraints. It fails before execution. If it were a weakly typed language, like Javascript, then execution may be possible. Javascript produces "42".

It is fundamentally different. Given a Python program, you cannot know whether or not it will produce type errors at run-time without executing the program. Type checking is done at run-time as part of the execution of the "+" operator. In SQL, type checking is done as a separate compilation step that can be performed without requiring any data. In Python the object has a type. In SQL the variable has a type.

Given a fixed database schema and a set of queries, you can compile the queries and know whether or not they will produce type errors. You cannot do this with Python.

> Not at all. The benefits of something like Typescript is that the guarantees are provided during development. SQL cannot know the types ahead of time – they don't exist until the program has begun execution – and will throw errors at you in runtime after you've moved into production if you've encountered a problem. That's exactly when you don't want to be running into type problems, and why we're largely moving away from dynamically typed languages in general.

SQL can provide the same guarantees as Typescript. However, unlike Typescript in which the variables are fixed, SQL always deals with processing data from an external data source - the tables. The data sources (i.e. tables) can be changed. That is not a problem a language can solve.


> Given a Python program, you cannot know whether or not it will produce type errors at run-time without executing the program.

Like you mentioned earlier, if you refrain from changing types, a compiler can theoretically determine Python types as well. The Typescript type checker when run on Javascript code is able to do this. But that puts all the onus on you to be super careful instead of letting the language hold your hand.

> SQL can provide the same guarantees as Typescript.

1. It cannot, as it does not provide guarantees about its memory model. Declaring a variable as a given type does not guarantee the variable will be created with that type. Typescript does guarantee that when you declare a type, that's what the variable type is going to be.

2. As SQL is dynamically typed, allowing you to change the type of a variable mid-execution, what type a variable is depends on when that statement is up for execution. And as SQL is not processed sequentially, at least not in a meaningful sense, it is impossible for a compiler to determine when in the sequence it will end up being called upon.


Again, you are still conflating tables in a relational database with variables in Typescript. While I can understand the confusion - given that SQL makes it so natural to interact with tables - they are not equivalent.

SQL as a language is statically typed. It has variables that are statically typed [1], much like Typescript. All columns in a query have fixed types, and there is a separate compilation phase that resolves types of parameters - much like any other statically typed language.

SQL generally operates on tables - and tables model a persistent, shared data source. The Typescript equivalent to a table is JSON files on disk. There is a fundamental need to be able to change data sources as time goes on and business requirements change. That is why SQL supports modifying shared data sources, and supports operations like adding columns, changing column types and dropping columns. SQL deals with the fact that changes can be made to tables by rebinding (recompiling) queries when a data source is changed.

Any hypothetical language that will replace SQL will have to deal with the very same realities of people wanting to modify shared data sources. A different language cannot solve this problem because there is a fundamental need to change how data is stored at times.

Perhaps what you are looking for is better tooling around this problem. SQL can be used to power such tooling, because of its strong static type system that works alongside the persistent shared data source. For example - you could check if all your queries will still successfully compile after changing the type of a column.
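
A minimal sketch of such a check, reusing the varchars example from earlier in the thread:

  -- change the column's type ...
  ALTER TABLE varchars ALTER COLUMN v TYPE INTEGER USING v::integer;
  -- ... then recompile every known query; any type errors surface here,
  -- before a single row of data is processed
  DEALLOCATE v1;
  PREPARE v1 AS SELECT v + 42 FROM varchars;  -- now INTEGER + INTEGER: compiles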

[1] https://www.postgresql.org/docs/current/plpgsql-declarations...


> Again, you are still conflating tables in a relational database with variables in Typescript.

No, we're only talking about SQL here, and specifically to the example you gave. It demonstrated strong typing, but not static typing. I am not sure where you think Typescript comes into play with respect to the core discussion.

> SQL as a language is statically typed.

As before, SQL defines the ALTER statement. SQL is no doubt strongly typed. Once you declare a type it will enforce that type until such time as it is explicitly changed (SQLite notwithstanding), but it can be changed. To be static means that it cannot change.

> SQL generally operates on tables - and tables model a persistent, shared data source. The Typescript equivalent to a table is JSON files on disk.

That's but an implementation detail. Much of the beauty of SQL, the language, is that it abstracts the disk and other such machine details away. Conceptually all you have is a machine that has functions that operate on sets (or vectors, perhaps, since SQL deviates from the relational model here) of tuples. Perhaps your struggle here is that you are hung up on specific implementations rather than the concepts expressed in these languages?

> Any hypothetical language that will replace SQL will have to deal with the very same realities of people wanting to modify shared data sources.

That's not a reasonable assumption. In practice you don't want arbitrary groups of people modifying a shared data source. You want one master operator that maintains full control of that data source, extracting the information other people need on their behalf. The n number of users feeding a virtual machine code fragments to build up an entire application is an interesting concept, but one that I am not sure has proven to be successful. It turns out we've learned a lot about software development since the 1970s. These days we usually build a program that sits in front of the SQL program to act as the master source in order to hide this grievous flaw, but a hypothetical SQL replacement can address it directly.

> SQL can be used to power such tooling, because of its strong static type system

Let's go back to your original example:

    CREATE TABLE varchars(v VARCHAR);
    ALTER TABLE varchars ALTER COLUMN v TYPE INTEGER USING v::integer;
    PREPARE v1 AS SELECT v + 42 FROM varchars;

What type is v? Well, it could either be a VARCHAR or an INTEGER. It depends on when the machine gets the `PREPARE v1 AS SELECT v + 42 FROM varchars;` statement. SQL makes no claims about order of operations, so PREPARE could come at any point. Therefore it is impossible for such tooling to figure out what type v is. If SQL were statically typed you could derive more information, but as it is dynamically typed...

Now, you said before that each statement when observed in isolation is statically typed. While I suppose that is true to the extent that the type won't randomly change in the middle of the execution of that statement, the statement alone doesn't provide type information either. Parsing `PREPARE v1 AS SELECT v + 42 FROM varchars;` in isolation, we still don't know what v is.


C/C++ code can certainly be compiled down to WASM, but you cannot interface with the operating system as you would in a normal C/C++ program. To get around that restriction postgres-wasm ships an entire Linux distribution that is run inside the browser. This comes with an immense performance penalty.

To get an impression of the performance penalty, just run the following query:

  SELECT SUM(i) FROM generate_series(0, 1000000, 1) tbl(i);

This simple query completes in 100ms locally on my laptop, but takes 17265ms in postgres-wasm. That is a slowdown of 170x.

Now that is not WASM's fault - when running the same query in duckdb-wasm [1] on my laptop the query takes 10ms using WASM, and 5ms when run locally, a slowdown of only a factor of 2. But in order to achieve those results we did have to adapt the DuckDB codebase to compile natively to WASM. That is absolutely possible but it does take engineering effort - particularly when it comes to larger, older projects that are not designed from the ground up with this in mind.

[1] https://shell.duckdb.org


Thank you for these details. Always suspected these claims but hadn’t dug deep enough.

It seems like claims that "X can now run in WASM" should come with a disclaimer ("includes Linux").


Because DuckDB supports ACID transactions [1], data is loaded in an all-or-nothing manner. As the load was interrupted due to the system running out of memory, the table is expected to be empty.
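
A sketch of what that looks like in practice (data.csv is a hypothetical file whose load is interrupted part-way):

  CREATE TABLE t (i INTEGER);
  COPY t FROM 'data.csv';    -- suppose this load fails part-way (e.g. out of memory)
  SELECT count(*) FROM t;    -- returns 0: no partial rows are visible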

[1] https://en.wikipedia.org/wiki/ACID


CREATE INDEX currently has the restriction that the index must fit in memory [1]. As the data is already sorted, creating an index is not necessary anyway. The min/max indexes created automatically by the system are sufficient to complete the query in a few milliseconds.

  D CREATE TABLE passwords (hash TEXT, count INT);
  D COPY passwords FROM '~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt' (SEPARATOR ':');
  D .timer on
  D SELECT * FROM passwords WHERE hash=upper('5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8');
  ┌──────────────────────────────────────────┬─────────┐
  │                   hash                   │  count  │
  │                 varchar                  │  int32  │
  ├──────────────────────────────────────────┼─────────┤
  │ 5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8 │ 9545824 │
  └──────────────────────────────────────────┴─────────┘
  Run Time (s): real 0.005 user 0.007455 sys 0.000584

[1] https://duckdb.org/docs/sql/indexes


Based on the headline, it must be under 1 ms. Love the table.

