In a table without a clustered index (a heap table), data pages are not linked together - so traversing pages requires a lookup into the Index Allocation Map.
A clustered table, however, has it's data pages linked in a doubly linked list - making sequential scans a bit faster. Of course, in exchange, you have the overhead of dealing with keeping the data pages in order on INSERT
, UPDATE
, and DELETE
. A heap table, however, requires a second write to the IAM.
If your query has a RANGE
operator (e.g.: SELECT * FROM TABLE WHERE Id BETWEEN 1 AND 100
), then a clustered table (being in a guaranteed order) would be more efficient - as it could use the index pages to find the relevant data page(s). A heap would have to scan all rows, since it cannot rely on ordering.
And, of course, a clustered index lets you do a CLUSTERED INDEX SEEK, which is pretty much optimal for performance...a heap with no indexes would always result in a table scan.
So:
For your example query where you select all rows, the only difference is the doubly linked list a clustered index maintains. This should make your clustered table just a tiny bit faster than a heap with a large number of rows.
For a query with a WHERE
clause that can be (at least partially) satisfied by the clustered index, you'll come out ahead because of the ordering - so you won't have to scan the entire table.
For a query that is not satisified by the clustered index, you're pretty much even...again, the only difference being that doubly linked list for sequential scanning. In either case, you're suboptimal.
For INSERT
, UPDATE
, and DELETE
a heap may or may not win. The heap doesn't have to maintain order, but does require a second write to the IAM. I think the relative performance difference would be negligible, but also pretty data dependent.
Microsoft has a whitepaper which compares a clustered index to an equivalent non-clustered index on a heap (not exactly the same as I discussed above, but close). Their conclusion is basically to put a clustered index on all tables. I'll do my best to summarize their results (again, note that they're really comparing a non-clustered index to a clustered index here - but I think it's relatively comparable):
INSERT
performance: clustered index wins by about 3% due to the second write needed for a heap.
UPDATE
performance: clustered index wins by about 8% due to the second lookup needed for a heap.
DELETE
performance: clustered index wins by about 18% due to the second lookup needed and the second delete needed from the IAM for a heap.
- single
SELECT
performance: clustered index wins by about 16% due to the second lookup needed for a heap.
- range
SELECT
performance: clustered index wins by about 29% due to the random ordering for a heap.
- concurrent
INSERT
: heap table wins by 30% under load due to page splits for the clustered index.
>>> ["foo", "bar", "baz"].index("bar")
1
Reference: Data Structures > More on Lists
Caveats follow
Note that while this is perhaps the cleanest way to answer the question as asked, index
is a rather weak component of the list
API, and I can't remember the last time I used it in anger. It's been pointed out to me in the comments that because this answer is heavily referenced, it should be made more complete. Some caveats about list.index
follow. It is probably worth initially taking a look at the documentation for it:
list.index(x[, start[, end]])
Return zero-based index in the list of the first item whose value is equal to x. Raises a ValueError
if there is no such item.
The optional arguments start and end are interpreted as in the slice notation and are used to limit the search to a particular subsequence of the list. The returned index is computed relative to the beginning of the full sequence rather than the start argument.
Linear time-complexity in list length
An index
call checks every element of the list in order, until it finds a match. If your list is long, and you don't know roughly where in the list it occurs, this search could become a bottleneck. In that case, you should consider a different data structure. Note that if you know roughly where to find the match, you can give index
a hint. For instance, in this snippet, l.index(999_999, 999_990, 1_000_000)
is roughly five orders of magnitude faster than straight l.index(999_999)
, because the former only has to search 10 entries, while the latter searches a million:
>>> import timeit
>>> timeit.timeit('l.index(999_999)', setup='l = list(range(0, 1_000_000))', number=1000)
9.356267921015387
>>> timeit.timeit('l.index(999_999, 999_990, 1_000_000)', setup='l = list(range(0, 1_000_000))', number=1000)
0.0004404920036904514
Only returns the index of the first match to its argument
A call to index
searches through the list in order until it finds a match, and stops there. If you expect to need indices of more matches, you should use a list comprehension, or generator expression.
>>> [1, 1].index(1)
0
>>> [i for i, e in enumerate([1, 2, 1]) if e == 1]
[0, 2]
>>> g = (i for i, e in enumerate([1, 2, 1]) if e == 1)
>>> next(g)
0
>>> next(g)
2
Most places where I once would have used index
, I now use a list comprehension or generator expression because they're more generalizable. So if you're considering reaching for index
, take a look at these excellent Python features.
Throws if element not present in list
A call to index
results in a ValueError
if the item's not present.
>>> [1, 1].index(2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: 2 is not in list
If the item might not be present in the list, you should either
- Check for it first with
item in my_list
(clean, readable approach), or
- Wrap the
index
call in a try/except
block which catches ValueError
(probably faster, at least when the list to search is long, and the item is usually present.)
Best Solution
Your questions:
Is FK-index going to eat up too much space/time as table grows few million records?
No worries, here, at least not a concern "as the table grows". Both space and time requirements will grow linearly with regards to the number of records added.
(well technically not quite, if you cross boundaries that introduce an extra level in the tree, but typically a database with readily million of records, tree depth is readily where it is supposed to be)
In that case, is it worth to go for an FK-index "each time?"
Typically yes, but it is indeed a case-by-case situation. One think to consider too, rather than plain FK index are indexes that include additional columns and may be used both for searching and to cover [parts of] the select list. Again deciding on such alternative (or additional indexes) is a case-by-case, sorry ;-) ...
In what case should I NOT apply FK or Index it or do neither of it (of course I can handle a LOT from the app)
Of course all such cases, exclude ones where it is important that referential integrity be intrinsically supported by the dbms itself (Such integrity can alternatively be managed at the level of the application / processes which Insert and Delete rows in the database)
Any other tricky to speedup JOIN or other such time-consuming lookups?
When it comes to moving the data around, for example when significant amount of data is added, etc. It is often a worthwhile strategy to drop the indexes (or some of them), do the CUD (INSERT / UPDATE / DELETE) operations, and then re-create the indexes. Of course this is not always possible, if the database is concurrently searched during the updates etc.
Also watch for the FILL_FACTOR associated with indexes, as a judicious choice for these keep the index fragmentatation to a minimum (at the cost of consuming, up from a bit more space) at least between scheduled maintenance of the indexes