Python data structure recommendation

python

I currently have a structure that is a dict: each value is a list of numeric values. Each of these numeric lists has what (to borrow a SQL idiom) you could call a primary key made up of three values: a year, a player identifier, and a team identifier. This tuple is the key for the dict.

So you can get a unique row by passing in a value for the year, player ID, and team ID like so:

statline = stats[(2001, 'SEA', 'suzukic01')]

Which yields something like

[305, 20, 444, 330, 45]

I'd like to alter this data structure so that it can be quickly summed by any one of these three keys: you could easily get the total for a given index in the numeric lists by passing in ONE of year, player ID, or team ID, plus the index. I want to be able to do something like

hr_total = stats[year=2001, idx=3]

Where that idx of 3 corresponds to the third column in the numeric list(s) that would be retrieved.

Any ideas?

Best Solution

Read up on Data Warehousing. Any book.

Read up on Star Schema Design. Any book. Seriously.

You have several dimensions: Year, Player, Team.

You have one fact: score

You want to have a structure like this.

You then want to create a set of dimension indexes like this.

import collections
# One index per dimension: maps a dimension value to the list of facts that carry it.
years = collections.defaultdict(list)
players = collections.defaultdict(list)
teams = collections.defaultdict(list)

Your fact table can be a collections.namedtuple. Alternatively, you can use a class, something like this.

class ScoreFact(object):
    def __init__(self, year, player, team, score):
        self.year = year
        self.player = player
        self.team = team
        self.score = score
        # Register this fact under each of its dimension values.
        years[self.year].append(self)
        players[self.player].append(self)
        teams[self.team].append(self)
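The namedtuple alternative mentioned above could be sketched roughly as follows. The names StatFact and add_fact are hypothetical (they are not from the answer), and because a namedtuple has no custom constructor here, the registration in the dimension indexes is assumed to happen in a small helper function instead.

```python
import collections

# Hypothetical names for illustration: StatFact and add_fact.
StatFact = collections.namedtuple("StatFact", ["year", "player", "team", "score"])

# Same dimension indexes as before: dimension value -> list of facts.
years = collections.defaultdict(list)
players = collections.defaultdict(list)
teams = collections.defaultdict(list)

def add_fact(year, player, team, score):
    """Build one immutable fact row and register it under each dimension value."""
    fact = StatFact(year, player, team, score)
    years[year].append(fact)
    players[player].append(fact)
    teams[team].append(fact)
    return fact

# The row from the question; the score here is the whole numeric list.
add_fact(2001, "suzukic01", "SEA", [305, 20, 444, 330, 45])
```

The same fact object is shared by all three indexes, so no row data is copied; only references are stored in each list.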

Now you can find all items in a given dimension value. It's a simple list attached to a dimension value.

years['2001'] is the list of all scores for the given year.

players['suzukic01'] is the list of all scores for the given player.

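The same goes for teams. You can simply use sum() over the score attribute to add them up, for example sum(x.score for x in years['2001']) when each score is a single number. A multi-dimensional query is something like this.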

[x for x in players['suzukic01'] if x.year == '2001']
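Putting the pieces together for the dict from the question, a minimal end-to-end sketch might look like this. It assumes integer years and the (year, team, player) key order shown in the question's example, and it stores the whole numeric list as the score. The helper column_total and every data row except the first are made up for illustration.

```python
import collections

years = collections.defaultdict(list)
players = collections.defaultdict(list)
teams = collections.defaultdict(list)

class ScoreFact(object):
    def __init__(self, year, player, team, score):
        self.year = year
        self.player = player
        self.team = team
        self.score = score  # here: the full numeric list for the row
        years[year].append(self)
        players[player].append(self)
        teams[team].append(self)

# First row is from the question; the other two rows are invented sample data.
stats = {
    (2001, 'SEA', 'suzukic01'): [305, 20, 444, 330, 45],
    (2001, 'SEA', 'example01'): [100, 5, 200, 150, 10],
    (2002, 'NYA', 'example01'): [50, 2, 90, 70, 4],
}

# Load the existing dict into facts; the key order is (year, team, player).
for (year, team, player), numbers in stats.items():
    ScoreFact(year, player, team, numbers)

def column_total(facts, idx):
    """Hypothetical helper: sum one column (zero-based idx) across a list of facts."""
    return sum(fact.score[idx] for fact in facts)

hr_total = column_total(years[2001], 3)        # 330 + 150 = 480
team_total = column_total(teams['SEA'], 0)     # 305 + 100 = 405

# Two dimensions at once: start from one index, filter on the other.
both = [x for x in players['example01'] if x.year == 2001]
both_total = column_total(both, 3)             # 150
```

Starting a multi-dimensional query from the smallest index list keeps the filtering step cheap, since only that list is scanned.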