I love your question. You're very articulate about what you do not yet understand. You just need something to tie things together. My recommendation is that you read "How XSLT Works", a chapter I wrote to address exactly the questions you're asking. I'd love to hear if it ties things together for you.
Less formally, I'll take a stab at answering each of your questions.
> - In what order do the templates execute, and
> - When they execute, do they match on (a) the original source XML, or (b) the current output of the XSLT to that point?
At any given point in XSLT processing, there are, in a sense, two contexts, which you identify as (a) and (b): where you are in the source tree, and where you are in the result tree. Where you are in the source tree is called the current node. It can change and jump all around the source tree, as you choose arbitrary sets of nodes to process using XPath. However, conceptually, you never "jump around" the result tree in the same way. The XSLT processor constructs it in an orderly fashion; first it creates the root node of the result tree; then it adds children, building the result in document order (depth-first). [Your post motivates me to pick up my software visualization for XSLT experiments again...]
The order of template rules in a stylesheet never matters. You can't tell, just by looking at the stylesheet, in what order the template rules will be instantiated, how many times a rule will be instantiated, or even whether it will be at all. (`match="/"` is an exception; you can always know that it will get triggered.)
> I am assuming that Template #1 will execute first. I don't know why I assume this -- is it just because it appears first in the document?
Nope. It would be called first even if you put it last in the document. Template rule order never matters (except under an error condition, when you have more than one template rule with the same priority matching the same node; even then, it's optional for the implementor and you should never rely on such behavior). It gets called first because the first thing that always happens whenever you run an XSLT processor is a virtual call to `<xsl:apply-templates select="/"/>`. The one virtual call constructs the entire result tree. Nothing happens outside it. You get to customize, or "configure", the behavior of that instruction by defining template rules.
> Will Template #2 execute? It matches a node in the source XML, but by the time we get to this template (assuming it runs second), the "firstName" node will not be in the output tree.
Neither Template #2 nor any other template rule will get triggered unless you have an `<xsl:apply-templates/>` call somewhere in the `match="/"` rule. If you don't have any, then no template rule other than `match="/"` will get triggered. Think of it this way: for a template rule to get triggered, it can't just match a node in the input. It has to match a node that you elect to process (using `<xsl:apply-templates/>`). Conversely, it will continue to match the node as many times as you choose to process it.
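For example, here is a minimal sketch in which the `match="/"` rule elects to process certain nodes, so the second rule does get triggered (the element names `person`, `firstName`, and `fullName` are assumptions taken from your question, not part of any real schema):

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Triggered by the processor's virtual call to
       <xsl:apply-templates select="/"/> -->
  <xsl:template match="/">
    <result>
      <!-- Elect to process these source nodes; without this
           instruction, no other template rule would ever fire -->
      <xsl:apply-templates select="person/firstName"/>
    </result>
  </xsl:template>

  <!-- Triggered once for each node the rule above elects to process -->
  <xsl:template match="firstName">
    <fullName><xsl:value-of select="."/></fullName>
  </xsl:template>

</xsl:stylesheet>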
> Would [the `match="/"` template] pre-empt all other templates from executing, since there is nothing to match on after that first template is complete?
That rule preempts the rest by not including `<xsl:apply-templates/>` anywhere in it. There are still plenty of nodes that could be processed in the source tree. They're always all there, ripe for the picking; process each one as many times as you want. But the only way to process them using template rules is to call `<xsl:apply-templates/>`.
> To this point, I've been concerned with later templates not executing because the nodes they have operated on do not appear in the output, but what about the inverse? Can an "earlier" template create a node that a "later" template can do something with?
It's not that an "earlier" template creates a new node to be processed; it's that an "earlier" template in turn processes more nodes from the source tree, using that same instruction (`<xsl:apply-templates/>`). You can think of it as calling the same "function" recursively, with different parameters each time (the nodes to process, as determined by the context and the `select` attribute).
In the end, what you get is a tree-structured stack of recursive calls to the same "function" (`<xsl:apply-templates/>`). And this tree structure is isomorphic to your actual result. Not everyone realizes this or has thought about it this way; that's because we don't have any effective visualization tools...yet.
> Template #1 creates a new node called "fullName". Template #2 matches on that same node. Will Template #2 execute because the "fullName" node exists in the output by the time we get around to Template #2?
Nope. The only way to do a chain of processing is to explicitly set it up that way. Create a variable, e.g., `$tempTree`, that contains the new `<fullName>` element, and then process it, like this: `<xsl:apply-templates select="$tempTree"/>`. To do this in XSLT 1.0, you need to wrap the variable reference in an extension function (e.g., `exsl:node-set()`), but in XSLT 2.0 it will work just as is.
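A minimal sketch of that two-stage setup in XSLT 1.0, using EXSLT's `node-set()` (the element names are again assumptions, and the second stage is deliberately trivial):

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:exsl="http://exslt.org/common">

  <!-- Stage 1: build a temporary tree in a variable -->
  <xsl:variable name="tempTree">
    <fullName>
      <xsl:value-of
          select="concat(/person/firstName, ' ', /person/lastName)"/>
    </fullName>
  </xsl:variable>

  <xsl:template match="/">
    <!-- Stage 2: explicitly process the temporary tree;
         exsl:node-set() converts the result tree fragment to a
         node-set (unnecessary in XSLT 2.0) -->
    <xsl:apply-templates select="exsl:node-set($tempTree)/fullName"/>
  </xsl:template>

  <!-- This rule now gets triggered for the constructed node -->
  <xsl:template match="fullName">
    <greeting>Hello, <xsl:value-of select="."/>!</greeting>
  </xsl:template>

</xsl:stylesheet>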
Whether you're processing nodes from the original source tree or in a temporary tree that you construct, either way you need to explicitly say what nodes you want to process.
What we haven't covered is how XSLT gets all its implicit behavior. You must also understand the built-in template rules. I write stylesheets all the time that don't even include an explicit rule for the root node (`match="/"`). Instead, I rely on the built-in rule for root nodes (apply templates to children), which is the same as the built-in rule for element nodes. Thus I can ignore large parts of the input, let the XSLT processor automatically traverse it, and only when it comes across a node I'm interested in will I do something special. Or I could write a single rule that copies everything recursively (called the identity transform), overriding it only where necessary, to make incremental changes to the input. After you've read "How XSLT Works", your next assignment is to look up the "identity transform".
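To give you a head start on that assignment, this is the canonical identity transform rule; by itself it copies the input unchanged, and any more specific rule you add alongside it overrides it for the nodes that rule matches:

<xsl:template match="@* | node()">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>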
> I realize that I'm deeply ignorant about the "zen" of XSLT. To date, my stylesheets have consisted of a template matching the root node, then are completely procedural from there. I'm tired of doing this. I would rather actually understand XSLT correctly, hence my question.
I applaud you. Now it's time to take the "red pill": read "How XSLT Works".
We start by answering the first question:
Question 1

Why do I get `ValueError: Index contains duplicate entries, cannot reshape`?

This occurs because pandas is attempting to reindex either a `columns` or `index` object with duplicate entries. There are various methods that can perform a pivot, and some of them are not well suited to cases where the keys being pivoted on contain duplicates. For example, consider `pd.DataFrame.pivot`. I know there are duplicate entries that share the `row` and `col` values:
df.duplicated(['row', 'col']).any()
True
So when I `pivot` using

df.pivot(index='row', columns='col', values='val0')
I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:
df.set_index(['row', 'col'])['val0'].unstack()
Here is a list of idioms we can use to pivot:
pd.DataFrame.groupby + pd.DataFrame.unstack

- Good general approach for doing just about any type of pivot.
- You specify all the columns that will constitute the pivoted row levels and column levels in one groupby. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to use for the aggregation. Finally, you unstack the levels that you want to be in the column index.

pd.DataFrame.pivot_table

- A glorified version of groupby with a more intuitive API. For many people, this is the preferred approach, and it is the approach intended by the developers.
- Specify the row levels, column levels, values to be aggregated, and function(s) to perform the aggregations.

pd.DataFrame.set_index + pd.DataFrame.unstack

- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the groupby paradigm, we specify all the columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or the column levels are not unique, this method will fail.

pd.DataFrame.pivot

- Very similar to set_index in that it shares the duplicate-key limitation. The API is very limited as well: it only takes scalar values for index, columns, and values.
- Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate, and if either the rows or the columns are not unique, this method will fail.

pd.crosstab

- This is a specialized version of pivot_table and, in its purest form, is the most intuitive way to perform several tasks.

pd.factorize + np.bincount

- This is a highly advanced technique that is very obscure but very fast. It cannot be used in all circumstances, but when it can be and you are comfortable using it, you will reap the performance rewards.

pd.get_dummies + pd.DataFrame.dot

- I use this for cleverly performing cross tabulation.
Examples
What I'm going to do for each subsequent question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.
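The setup of the sample DataFrame df belongs to the original question and isn't reproduced above. Here is a minimal sketch of a frame with the same layout (string columns key, row, item, col and numeric columns val0, val1); the values are random, so the outputs below won't match it verbatim:

import numpy as np
import pandas as pd

# random frame with the same column layout as the original df
np.random.seed(0)
n = 20
df = pd.DataFrame({
    'key':  np.random.choice([f'key{i}' for i in range(3)], n),
    'row':  np.random.choice([f'row{i}' for i in range(5)], n),
    'item': np.random.choice([f'item{i}' for i in range(3)], n),
    'col':  np.random.choice([f'col{i}' for i in range(5)], n),
    'val0': np.random.rand(n).round(2),
    'val1': np.random.rand(n).round(2),
})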
Question 3

How do I pivot `df` such that the `col` values are columns, `row` values are the index, the mean of `val0` makes up the values, and missing values are `0`?
pd.DataFrame.pivot_table

- fill_value is not set by default, so I tend to set it appropriately; in this case I set it to 0. Notice I skipped question 2, as its answer is the same as this one without the fill_value.
- aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='mean')
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='mean').fillna(0)
Question 4

Can I get something other than `mean`, like maybe `sum`?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='sum')
col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='sum').fillna(0)
Question 5

Can I do more than one aggregation at a time?

Notice that for `pivot_table` and `crosstab` I needed to pass a list of callables. On the other hand, `groupby.agg` is able to take strings for a limited number of special functions. `groupby.agg` would also have taken the same callables we passed to the others, but it is often better to use the string function names, as there are efficiencies to be gained.
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc=[np.size, np.mean])
size mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65
row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25
row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00
row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
Question 6
Can I aggregate over multiple value columns?
pd.DataFrame.pivot_table
- We pass values=['val0', 'val1'], but we could've left that off completely.
df.pivot_table(
values=['val0', 'val1'], index='row', columns='col',
fill_value=0, aggfunc='mean')
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46
pd.DataFrame.groupby
df.groupby(['row', 'col'])[['val0', 'val1']].mean().unstack(fill_value=0)
Question 7
Can I subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00
pd.DataFrame.groupby
df.groupby(
['row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)
Question 8
Can I subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index=['key', 'row'], columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00
pd.DataFrame.groupby
df.groupby(
['key', 'row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)
pd.DataFrame.set_index
This works because the set of keys is unique for both rows and columns:
df.set_index(
['key', 'row', 'item', 'col']
)['val0'].unstack(['item', 'col']).fillna(0).sort_index(axis=1)
Question 9
Can I aggregate the frequency with which the columns and rows occur together, aka "cross tabulation"?
pd.DataFrame.pivot_table
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
pd.crosstab
pd.crosstab(df['row'], df['col'])
pd.factorize
+ np.bincount
# get integer factorization `i` and unique values `r`
# for column `'row'`
i, r = pd.factorize(df['row'].values)
# get integer factorization `j` and unique values `c`
# for column `'col'`
j, c = pd.factorize(df['col'].values)
# `n` will be the number of distinct row labels
# `m` will be the number of distinct column labels
n, m = r.size, c.size
# `i * m + j` is a clever way of counting the
# factorization bins assuming a flat array of length
# `n * m`. Which is why we subsequently reshape as `(n, m)`
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
pd.DataFrame(b, r, c)
col3 col2 col0 col1 col4
row3 2 0 0 1 0
row2 1 2 1 0 2
row0 1 0 1 2 1
row4 2 2 0 1 1
pd.get_dummies

In newer versions of pandas, get_dummies returns boolean columns by default, so cast to integers for counting:

pd.get_dummies(df['row'], dtype=int).T.dot(pd.get_dummies(df['col'], dtype=int))
col0 col1 col2 col3 col4
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
Question 10
How do I convert a DataFrame from long to wide by pivoting on ONLY two columns?
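The frame for this question, df2, isn't constructed above; its contents can be read off the printed output below, so a matching definition is:

import pandas as pd

df2 = pd.DataFrame({
    'A': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
    'B': [0, 11, 2, 11, 10, 10, 14, 7],
})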
DataFrame.pivot
The first step is to assign a number to each row - this number will be the row index of that value in the pivoted result. This is done using GroupBy.cumcount:
df2.insert(0, 'count', df2.groupby('A').cumcount())
df2
count A B
0 0 a 0
1 1 a 11
2 2 a 2
3 3 a 11
4 0 b 10
5 1 b 10
6 2 b 14
7 0 c 7
The second step is to use the newly created column as the index to call `DataFrame.pivot`.

df2.pivot(index='count', columns='A', values='B')
# equivalent shorthand on older pandas versions, where positional
# arguments were still allowed: df2.pivot(*df2)
A a b c
count
0 0.0 10.0 7.0
1 11.0 10.0 NaN
2 2.0 14.0 NaN
3 11.0 NaN NaN
DataFrame.pivot_table
Whereas DataFrame.pivot only accepts columns, DataFrame.pivot_table also accepts arrays, so the GroupBy.cumcount result can be passed directly as the index without creating an explicit column.
df2.pivot_table(index=df2.groupby('A').cumcount(), columns='A', values='B')
A a b c
0 0.0 10.0 7.0
1 11.0 10.0 NaN
2 2.0 14.0 NaN
3 11.0 NaN NaN
Question 11

How do I flatten the MultiIndex to a single index after pivoting?

If the columns are of type object with string entries, join them:

df.columns = df.columns.map('|'.join)

Otherwise, format them:

df.columns = df.columns.map('{0[0]}|{0[1]}'.format)
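For instance, assuming the two-level columns produced by the pivot in question 6, either line turns a column label such as ('val0', 'col0') into the single string 'val0|col0':

# pivot from question 6, which produces two-level columns
# like ('val0', 'col0'), ('val0', 'col1'), ...
wide = df.pivot_table(
    values=['val0', 'val1'], index='row', columns='col',
    fill_value=0, aggfunc='mean')

# flatten ('val0', 'col0') -> 'val0|col0'; works here because
# both levels hold strings
wide.columns = wide.columns.map('|'.join)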