pyhdfe.create
pyhdfe.create(ids, cluster_ids=None, drop_singletons=True, compute_degrees=True, degrees_method=None, residualize_method=None, options=None)
Initialize an algorithm for absorbing fixed effects.
By default, simple de-meaning is used for a single fixed effect, and non-accelerated de-meaning is used for more than one dimension. This is the most conservative and simplest algorithm for fixed effect absorption. If it is taking a long time, consider switching to a faster residualize_method and using different options.
When an algorithm is initialized, by default, singletons are dropped and degrees of freedom are computed. If either behavior isn’t needed, or if degrees of freedom computation is taking a long time, consider using a more conservative degrees_method or disabling these behaviors with drop_singletons and compute_degrees.
Warning
This function assumes that all of your data have already been cleaned. For example, it will not drop observations with null values.
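Because null values are not dropped automatically, some cleaning step has to run before the identifiers are passed in. The following is a minimal sketch of one such step using only the standard library. The helper names is_missing and drop_incomplete_rows are hypothetical and not part of pyhdfe, and the sketch assumes that nulls appear only as None or as a float NaN.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (the only null markers handled here)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete_rows(ids, matrix):
    """Keep only observations whose identifiers and values are all present."""
    kept_ids, kept_matrix = [], []
    for id_row, value_row in zip(ids, matrix):
        if any(is_missing(v) for v in id_row) or any(is_missing(v) for v in value_row):
            continue  # skip the whole observation if anything is missing
        kept_ids.append(list(id_row))
        kept_matrix.append(list(value_row))
    return kept_ids, kept_matrix


# Hypothetical data: rows are observations, columns of ids are fixed effect dimensions.
ids = [["a", 1], ["a", 2], [None, 1], ["b", 2]]
matrix = [[1.0], [float("nan")], [3.0], [4.0]]
clean_ids, clean_matrix = drop_incomplete_rows(ids, matrix)
```

The identifiers and the data matrix are filtered together so that their rows stay aligned.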
- Parameters
ids (array-like) – Two-dimensional array of fixed effect identifiers. Columns are fixed effect dimensions and rows are observations. Identifiers can be integers, strings, or other hashable data types. Columns after the first should have more than one unique value.
cluster_ids (array-like, optional) – Two-dimensional array of cluster group identifiers, which if specified will be used when computing degrees of freedom. If a fixed effect (i.e., a column in ids) is nested within a cluster (i.e., a column of this matrix), it will not contribute towards degrees of freedom used by the fixed effects. For more information, see Correia (2015).
drop_singletons (bool, optional) – Whether to drop singleton groups or observations in ids when initializing the algorithm. Singleton groups are fixed effect groups with only one observation. By default, singletons are dropped. When dropped, the number of singleton groups is equal to the number of rows in ids minus Algorithm.observations. For more information about singletons and why they are typically dropped, see Correia (2015).
compute_degrees (bool, optional) – Whether to compute the number of degrees of freedom used by the fixed effects. By default, degrees of freedom are computed.
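To make the idea of singletons concrete, here is a rough standard-library sketch of singleton removal. It assumes that removal is repeated until no group with a single observation remains, since dropping one observation can turn another group into a singleton. The function name drop_singleton_rows is hypothetical, and the text above does not state that pyhdfe proceeds in exactly this way.

```python
from collections import Counter


def drop_singleton_rows(ids):
    """Return indices of rows that survive repeated singleton removal."""
    keep = list(range(len(ids)))
    dimensions = len(ids[0]) if ids else 0
    while True:
        singleton_rows = set()
        for d in range(dimensions):
            # Count the observations in each group of dimension d.
            counts = Counter(ids[i][d] for i in keep)
            singleton_rows.update(i for i in keep if counts[ids[i][d]] == 1)
        if not singleton_rows:
            return keep
        keep = [i for i in keep if i not in singleton_rows]


# Hypothetical firm and worker identifiers.
ids = [
    ["f1", "w1"],
    ["f1", "w1"],
    ["f2", "w1"],  # becomes a singleton firm once the row below is dropped
    ["f2", "w2"],  # worker w2 appears only once
]
kept = drop_singleton_rows(ids)
singletons = len(ids) - len(kept)
```

In this example the first pass removes the last row, and the second pass removes the third row, so the number of dropped rows equals the original row count minus the remaining observations.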
degrees_method (str, optional) – How to compute or approximate the number of degrees of freedom used by the fixed effects that aren’t nested within any cluster_ids. The following methods are supported:
'none' (default for one dimension) - Assume there are no redundant fixed effects. This method is exact for one dimension (i.e., for one column in ids). It provides the most conservative upper bound for multiple dimensions but requires no additional computation.
For one dimension this method simply counts the number of fixed effect levels (i.e., the number of distinct values in ids). Each dimension after the first contributes its number of levels minus one.
'pairwise' (default for multiple dimensions) - Apply the algorithm of Abowd, Creecy, and Kramarz (2002) to each pair of fixed effect dimensions. This method is exact for two dimensions. It provides a smaller upper bound for more than two dimensions but can be computationally expensive.
For one dimension this method is the same as 'none'. However, the second dimension contributes its number of levels minus the number of connected components in the bipartite graph formed by the two dimensions. Each dimension after the second contributes its number of levels minus the maximum number of connected components in the bipartite graphs that it forms with prior dimensions. This is the method used by reghdfe.
'exact' - Apply numpy.linalg.matrix_rank() to dummy variables constructed from ids. This method is exact for any number of dimensions but is typically computationally infeasible. It is meant to be a benchmark.
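The counting rules for 'none' and 'pairwise' can be illustrated for the two-dimensional case with a small union-find sketch. This is only an illustration of the arithmetic described above, not the library's implementation. The function names count_components, degrees_none, and degrees_pairwise_two are hypothetical.

```python
def count_components(first, second):
    """Count connected components of the bipartite graph linking two id columns."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in zip(first, second):
        # Tag nodes by column so equal labels in different columns stay distinct.
        root_a, root_b = find((0, a)), find((1, b))
        if root_a != root_b:
            parent[root_a] = root_b
    return len({find(node) for node in list(parent)})


def degrees_none(columns):
    """First dimension: its levels. Each later dimension: its levels minus one."""
    levels = [len(set(column)) for column in columns]
    return levels[0] + sum(count - 1 for count in levels[1:])


def degrees_pairwise_two(first, second):
    """Two dimensions: the second contributes its levels minus the component count."""
    return len(set(first)) + len(set(second)) - count_components(first, second)


# Hypothetical data: firms f1 and f2 share workers, while f3 is disconnected from them.
firms = ["f1", "f1", "f2", "f2", "f3", "f3"]
workers = ["w1", "w2", "w2", "w1", "w3", "w4"]
upper_bound = degrees_none([firms, workers])
pairwise = degrees_pairwise_two(firms, workers)
```

Here the graph has two connected components, so the pairwise count (3 + 4 - 2) is one lower than the conservative bound (3 + 4 - 1).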
residualize_method (str, optional) – Type of algorithm to initialize. The following methods are supported:
'within' (default for one dimension) - Within transform. Matrix columns are de-meaned within each fixed effect group (i.e., each unique value in ids). This algorithm only works for a single fixed effect dimension (i.e., one column in ids).
'map' (default for multiple dimensions) - Method of alternating projections applied to fixed effect absorption by Guimarães and Portugal (2010), Gaure (2013a), Gaure (2013b), and Correia (2017), among others. Matrix columns are iteratively de-meaned until convergence. This method works for any number of fixed effect dimensions but will be slower than 'within' for one dimension. Variations on this method are used by lfe and reghdfe.
'lsmr' - LSMR method of Fong and Saunders (2011). This implementation is taken from scipy.sparse.linalg.lsmr() and modified for simultaneous iteration over multiple matrix columns and custom convergence criteria. Matrix columns are iterated on until convergence. This method works for any number of fixed effect dimensions but will be slower than 'within' for one dimension. This is the method used by FixedEffectModels.jl.
'sw' - Method of Somaini and Wolak (2016). This non-iterative method only works for two dimensions (i.e., two columns in ids). To minimize memory usage, the first dimension of fixed effects should have fewer levels than the second dimension (i.e., the first column in ids should have fewer unique values than the second column). This is the method used by res2fe.
'dummy' - Matrix columns are replaced by residuals from regressions on dummy variables constructed from ids. This method works for any number of dimensions but is typically computationally infeasible. It is meant to be a benchmark.
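The 'within' and 'map' descriptions can be sketched in plain Python for a single column of data. This is a simplified illustration of non-accelerated alternating projections, with a stopping rule based on the largest change between iterations. It is not the library's code, and the names demean_within and residualize_map are hypothetical.

```python
from collections import defaultdict


def demean_within(values, groups):
    """Subtract each group's mean from its members (the within transform)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for value, group in zip(values, groups):
        totals[group] += value
        counts[group] += 1
    return [value - totals[group] / counts[group] for value, group in zip(values, groups)]


def residualize_map(values, id_columns, tol=1e-8, iteration_limit=1_000_000):
    """Cycle through the dimensions, de-meaning until the largest change is below tol."""
    current = list(values)
    for _ in range(iteration_limit):
        updated = current
        for groups in id_columns:
            updated = demean_within(updated, groups)
        if max(abs(new - old) for new, old in zip(updated, current)) < tol:
            return updated
        current = updated
    raise RuntimeError("Failed to converge within the iteration limit.")


# Hypothetical balanced two-way example: two firms observed in two years.
firms = ["a", "a", "b", "b"]
years = [1, 2, 1, 2]
residuals = residualize_map([1.0, 2.0, 3.0, 5.0], [firms, years])
```

With one dimension, a single call to demean_within is already exact, which is why an iterative method is only needed for two or more dimensions.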
options (dict, optional) – Configuration options for the chosen method. The 'within', 'sw', and 'dummy' methods do not support any configuration options. The following options are supported by both 'map' and 'lsmr':
iteration_limit : (int, optional) - Maximum number of iterations, after which an exception will be raised if the algorithm has not converged. By default, the maximum number of iterations is 1000000.
tol : (float, optional) - Common convergence criterion based on the differences between two iterations’ residualized matrices. By default, algorithms will converge when the maximum absolute value of these differences is less than 1e-8. Convergence based on this criterion can be disabled by setting this value to 0.
converged : (callable or None, optional) - Custom convergence criterion, which should be a function of the form converged(last_matrix, matrix) -> bool that accepts the last iteration’s residualized last_matrix and the current iteration’s residualized matrix. It should return a boolean indicating whether the routine has converged. When a custom convergence criterion is used, tol is ignored.
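As an illustration of the converged(last_matrix, matrix) -> bool form, here is one possible custom criterion based on relative rather than absolute change. The function name, its threshold, and the relative rule are assumptions chosen for this sketch. It iterates row by row, so it works on nested lists and should also work on two-dimensional arrays.

```python
def relative_change_converged(last_matrix, matrix, threshold=1e-10):
    """Return True once the largest change is small relative to the largest entry."""
    largest_change = 0.0
    largest_value = 0.0
    for last_row, row in zip(last_matrix, matrix):
        for last_value, value in zip(last_row, row):
            largest_change = max(largest_change, abs(value - last_value))
            largest_value = max(largest_value, abs(value))
    # Guard against a scale of zero by never dividing and by flooring the scale at 1.
    return bool(largest_change <= threshold * max(largest_value, 1.0))


# Options dictionary using keys documented above. tol is omitted because it is
# ignored when a custom criterion is supplied.
options = {
    "iteration_limit": 10_000,
    "converged": relative_change_converged,
}
```

A dictionary like this would be passed as the options argument together with a residualize_method of 'map' or 'lsmr'.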
The following options are supported only by 'map':
transform : (str, optional) - Transform operator \(T\) that determines the order of projections \(P_1, P_2, \dots, P_n\) for each of the \(n\) columns of fixed effects in ids. The following transforms are supported:
'kaczmarz' (default) - Kaczmarz or von Neumann-Halperin operator \(T = P_n \cdots P_1\), which is asymmetric and hence does not support 'cg' acceleration.
'symmetric' - Symmetric Kaczmarz operator \(T = P_n \cdots P_1 \cdots P_n\).
'cimmino' - Symmetric Cimmino operator \(T = (P_1 + \cdots + P_n) / n\).
acceleration : (str, optional) - Method used to accelerate fixed point iteration. The following methods are supported:
'none' (default) - Simple non-accelerated fixed point iteration.
'gk' - Line search method of Gearhart and Koshy (1989) applied to fixed effect absorption by Gaure (2013a).
'cg' - Conjugate gradient method described by Hernández-Ramos, Escalante, and Raydan (2011). This method is not supported by the asymmetric 'kaczmarz' transform.
acceleration_tol : (float, optional) - Acceleration method-specific tolerance for when to stop accelerating the convergence of a vector and switch to simple iteration.
For 'gk', each vector’s convergence is accelerated only when the sum of squared residuals relative to the sum of squared vector values is greater than this value, which is by default 1e-16.
For 'cg', each vector’s convergence is accelerated up until the first time that its sum of squared residuals is greater than this value.
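Since 'cg' acceleration cannot be combined with the default 'kaczmarz' transform, it can help to check a 'map' options dictionary before using it. The helper below is a hypothetical pre-check written for this sketch and is not part of pyhdfe. It only encodes the one incompatibility and the two defaults stated above.

```python
SYMMETRIC_TRANSFORMS = {"symmetric", "cimmino"}


def check_map_options(options):
    """Fill in the documented defaults and reject 'cg' with a non-symmetric transform."""
    transform = options.get("transform", "kaczmarz")
    acceleration = options.get("acceleration", "none")
    if acceleration == "cg" and transform not in SYMMETRIC_TRANSFORMS:
        raise ValueError("'cg' acceleration requires the 'symmetric' or 'cimmino' transform.")
    return {"transform": transform, "acceleration": acceleration}


# A combination that the text above allows.
map_options = check_map_options({"transform": "cimmino", "acceleration": "cg"})
```

Requesting 'cg' without also choosing a transform falls back to 'kaczmarz' and is therefore rejected by this check.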
The following options are supported only by 'lsmr':
residual_tol : (float, optional) - Convergence criterion S2 from Fong and Saunders (2011) based on Stewart’s backward error estimate. This is by default 1e-8. Convergence based on this criterion can be disabled by setting this value to 0.
condition_limit : (float, optional) - Maximum estimated condition number of the matrix of fixed effects. For higher estimated condition numbers, an exception will be raised. By default, the maximum estimated condition number is 100000000.
- Returns
Initialized Algorithm for absorbing fixed effects. Class attributes contain information about the number of observations, the number of fixed effect dimensions, and if computed, the number of singletons and degrees of freedom used by the fixed effects.
- Return type
Algorithm
Examples