
Most columnar stores I'm aware of are hybrid, so all columns of a row are still colocated in the same page.


Mine is not. If a table has 4 columns (e.g. name, address, phone, email), then all the names are stored separately from the addresses. Likewise, all the phone numbers are stored separately from the emails. Each column is de-duplicated, so it is very easy to find out how many times each value occurs in a column (e.g. there are 1,234,567 rows in the table where name = 'John').
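
To make the layout concrete, here is a minimal Python sketch of a fully decomposed, de-duplicated column store. It is an illustration under assumptions, not the actual implementation described above: the class names `DedupedColumn` and `ColumnTable`, and the choice to map each distinct value to a list of row ids, are hypothetical.

```python
from collections import defaultdict


class DedupedColumn:
    """One column stored on its own (hypothetical sketch).

    Each distinct value is kept once and mapped to the ids of the rows
    that hold it.
    """

    def __init__(self):
        self.rows_by_value = defaultdict(list)

    def append(self, row_id, value):
        self.rows_by_value[value].append(row_id)

    def count(self, value):
        # The count is the length of the row-id list; no row scan is needed.
        return len(self.rows_by_value.get(value, ()))


class ColumnTable:
    """A table whose columns are stored independently of one another."""

    def __init__(self, column_names):
        self.columns = {name: DedupedColumn() for name in column_names}
        self.row_count = 0

    def insert(self, **values):
        row_id = self.row_count
        # Each value goes into its own column structure.
        for name, column in self.columns.items():
            column.append(row_id, values[name])
        self.row_count += 1
        return row_id

    def count_where(self, column_name, value):
        return self.columns[column_name].count(value)
```

With this structure, `count_where("name", "John")` touches only the name column and never reads addresses, phones, or emails.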


The downside of that design is that projecting a row requires random I/O across more pages (at least one per column), which also means more evictions from the in-memory buffer and worse cache efficiency. Apache Arrow, Parquet, Redshift, Bigtable/Spanner, and Snowflake are all hybrid columnar, for example: you get good row data locality while still being able to exploit SIMD/vectorized ops.
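
The page-count argument can be sketched with a toy model in Python. Everything here is an assumption made for illustration: fixed-size pages holding `values_per_page` values, a decomposed layout with one page sequence per column, and a hybrid layout in which a page holds every column for a contiguous range of rows. Real formats differ in detail.

```python
def pages_for_row_decomposed(row_id, num_columns, values_per_page):
    """Pages read to project one row when each column has its own pages."""
    # The row's value lives at the same offset in every column,
    # but each column keeps it on a different page.
    page_no = row_id // values_per_page
    return {(column, page_no) for column in range(num_columns)}


def pages_for_row_hybrid(row_id, num_columns, values_per_page):
    """Pages read to project one row when a page holds all columns
    for a contiguous range of rows (toy hybrid layout)."""
    # The page's capacity is shared across the columns.
    rows_per_page = values_per_page // num_columns
    return {("rowgroup", row_id // rows_per_page)}


decomposed = pages_for_row_decomposed(2500, num_columns=4, values_per_page=1000)
hybrid = pages_for_row_hybrid(2500, num_columns=4, values_per_page=1000)
print(len(decomposed), len(hybrid))
```

In this model, projecting a full row reads one page per column in the decomposed layout and a single page in the hybrid layout, which is the locality trade-off described above.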



