Transform¶
Pivot by a single column¶
The Table.pivot() method is a general process for grouping data by row and, optionally, by column, and then calculating some aggregation for each group. Consider the following table:
| name | race | gender | age | 
|---|---|---|---|
| Joe | white | female | 20 | 
| Jane | asian | male | 20 | 
| Jill | black | female | 20 | 
| Jim | latino | male | 25 | 
| Julia | black | female | 25 | 
| Joan | asian | female | 25 | 
In the very simplest case, this table can be pivoted to count the number occurences of values in a column:
transformed = table.pivot('race')
Result:
| race | pivot | 
|---|---|
| white | 1 | 
| asian | 2 | 
| black | 2 | 
| latino | 1 | 
Pivot by multiple columns¶
You can pivot by multiple columns either as additional row-groups, or as intersecting columns. For example, given the table in the previous example:
transformed = table.pivot(['race', 'gender'])
Result:
| race | gender | pivot | 
|---|---|---|
| white | female | 1 | 
| asian | male | 1 | 
| black | female | 2 | 
| latino | male | 1 | 
| asian | female | 1 | 
For the column, version you would do:
transformed = table.pivot('race', 'gender')
Result:
| race | male | female | 
|---|---|---|
| white | 0 | 1 | 
| asian | 1 | 1 | 
| black | 0 | 2 | 
| latino | 1 | 0 | 
Pivot to sum¶
The default pivot aggregation is Count but you can also supply other operations. For example, to aggregate each group by Sum of their ages:
transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'))
| race | male | female | 
|---|---|---|
| white | 0 | 20 | 
| asian | 20 | 25 | 
| black | 0 | 45 | 
| latino | 25 | 0 | 
Pivot to percent of total¶
Pivot allows you to apply a Computation to each row of aggregated results prior to returning the table. Use the stringified name of the aggregation as the column argument to your computation:
transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'), computation=agate.Percent('sum'))
| race | male | female | 
|---|---|---|
| white | 0 | 14.8 | 
| asian | 14.8 | 18.4 | 
| black | 0 | 33.3 | 
| latino | 18.4 | 0 | 
Note: actual computed percentages will be much more precise.
It’s helpful when constructing these cases to think of all the cells in the pivot table as a single sequence.
Denormalize key/value columns into separate columns¶
It’s common for very large datasets to be distributed in a “normalized” format, such as:
| name | property | value | 
|---|---|---|
| Jane | gender | female | 
| Jane | race | black | 
| Jane | age | 24 | 
| … | … | … | 
The Table.denormalize() method can be used to transform the table so that each unique property has its own column.
transformed = table.denormalize('name', 'property', 'value')
Result:
| name | gender | race | age | 
|---|---|---|---|
| Jane | female | black | 24 | 
| Jack | male | white | 35 | 
| Joe | male | black | 28 | 
Normalize separate columns into key/value columns¶
Sometimes you have a dataset where each property has its own column, but your analysis would be easier if all properties were stored together. Consider this table:
| name | gender | race | age | 
|---|---|---|---|
| Jane | female | black | 24 | 
| Jack | male | white | 35 | 
| Joe | male | black | 28 | 
The Table.normalize() method can be used to transform the table so that all the properties and their values share two columns.
transformed = table.normalize('name', ['gender', 'race', 'age'])
Result:
| name | property | value | 
|---|---|---|
| Jane | gender | female | 
| Jane | race | black | 
| Jane | age | 24 | 
| … | … | … |