Sql – PostgreSQL: Defining a primary key on a large database


I am planing a database to store lots of text. (blog posts, news articles, etc.) The database needs to have the title, content (50k characters max), date, link and language fields. The same content can't occur on one link. Old content (older then 30 days, for example) will be deleted.

Now, the problem is the primary key. I could just set a automatically incrementing (SERIAL type) field and use it as a primary key. But, it seems stupid and a waste of disc space, because the field wouldn't serve any purpose but to be a primary key. (and the field could eventually run out, or not?) And there's always the other performance issue: the content of each new row inserted needs to be checked for duplicates. So the other solution for the primary key I've come up with would be to calculate a sha256 hash of content+link value and then put that in a new 'hash' column and use that as a primary key. Two birds with one stone. Of course, the problem with that are hash collisions. Is it a big threat?

I don't have any experience with PostgreSQL, and very little experience with DBMS's generally, so I would appreciate a second opinion before a create a database with the performance characteristics of a snail on the highway (horrible comparison).

Please help me out here if you have any experience with large databases. Is setting a 64 character string as a primary key field a good idea in my situation? (because I'm under the impression that generally this is avoided)

Best Solution

Just did this exact test for a rather medium-large DB (200GB+), bigserial won by quite a large margin. It was faster to generate, faster to join, less code, smaller footprint. Because of the way postgres stores it, a bigint is negligible compared to a normal int. You'll run out of storage space from your content long before you ever have to worry about overflowing the bigint. Having done the computed hash vs bigint - surrogate bigint all the way.