pyhdfe.create

pyhdfe.create(ids, cluster_ids=None, drop_singletons=True, compute_degrees=True, degrees_method=None, residualize_method=None, options=None)

Initialize an algorithm for absorbing fixed effects.

By default, simple de-meaning is used for a single fixed effect, and non-accelerated de-meaning is used for more than one dimension. This is the simplest and most conservative algorithm for fixed effect absorption. If absorption is slow, consider switching to a faster residualize_method and adjusting options.

When an algorithm is initialized, by default, singletons are dropped and degrees of freedom are computed. If either behavior isn’t needed, disable it with drop_singletons or compute_degrees. If computing degrees of freedom is slow, consider a cheaper, more conservative degrees_method.

Warning

This function assumes that all of your data have already been cleaned. For example, it will not drop observations with null values.
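Since null values are not handled for you, one option is to filter them out before initializing an algorithm. The following is a minimal sketch, assuming NumPy is available and that identifiers are null when they are None or NaN; the helper name drop_null_rows is hypothetical and not part of pyhdfe.

```python
import numpy as np


def drop_null_rows(ids, matrix):
    """Return ids and matrix restricted to rows without null values."""
    ids = np.asarray(ids, dtype=object)
    matrix = np.asarray(matrix, dtype=float)

    # An identifier is treated as null if it is None or NaN (NaN != NaN).
    ids_ok = np.array(
        [all(v is not None and v == v for v in row) for row in ids], dtype=bool
    )
    # A matrix row is kept only if all of its values are finite.
    matrix_ok = np.isfinite(matrix).all(axis=1)

    keep = ids_ok & matrix_ok
    return ids[keep], matrix[keep]


ids = [["a", 1], ["a", 2], [None, 1], ["b", 2]]
matrix = [[1.0], [np.nan], [3.0], [4.0]]
clean_ids, clean_matrix = drop_null_rows(ids, matrix)
print(clean_ids.tolist())     # [['a', 1], ['b', 2]]
print(clean_matrix.tolist())  # [[1.0], [4.0]]
```

The cleaned identifiers can then be passed as ids, with the same rows kept in any matrix that is residualized later.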

Parameters
  • ids (array-like) – Two-dimensional array of fixed effect identifiers. Columns are fixed effect dimensions and rows are observations. Identifiers can be integers, strings, or other hashable data types. Columns after the first should have more than one unique value.

  • cluster_ids (array-like, optional) – Two-dimensional array of cluster group identifiers, which if specified will be used when computing degrees of freedom. If a fixed effect (i.e., a column in ids) is nested within a cluster (i.e., a column of this matrix), it will not contribute towards degrees of freedom used by the fixed effects. For more information, see Correia (2015).

  • drop_singletons (bool, optional) – Whether to drop singleton groups or observations in ids when initializing the algorithm. Singleton groups are fixed effect groups with only one observation. By default, singletons are dropped. When dropped, the number of singleton groups is equal to the number of rows in ids minus Algorithm.observations. For more information about singletons and why they are typically dropped, see Correia (2015).

  • compute_degrees (bool, optional) – Whether to compute the number of degrees of freedom used by the fixed effects. By default, degrees of freedom are computed.

  • degrees_method (str, optional) –

    How to compute or approximate the number of degrees of freedom used by the fixed effects that aren’t nested within any cluster_ids. The following methods are supported:

    • 'none' (default for one dimension) - Assume there are no redundant fixed effects. This method is exact for one dimension (i.e., for one column in ids). It provides the most conservative upper bound for multiple dimensions but requires no additional computation.

      For one dimension this method simply counts the number of fixed effect levels (i.e., the number of distinct values in ids). Each dimension after the first contributes its number of levels minus one.

    • 'pairwise' (default for multiple dimensions) - Apply the algorithm of Abowd, Creecy, and Kramarz (2002) to each pair of fixed effect dimensions. This method is exact for two dimensions. It provides a smaller upper bound for more than two dimensions but can be computationally expensive.

      For one dimension this method is the same as 'none'. However, the second dimension contributes its number of levels minus the number of connected components in the bipartite graph formed by the two dimensions. Each dimension after the second contributes its number of levels minus the maximum number of connected components in the bipartite graphs that it forms with prior dimensions. This is the method used by reghdfe.

    • 'exact' - Apply numpy.linalg.matrix_rank() to dummy variables constructed from ids. This method is exact for any number of dimensions but is typically computationally infeasible. It is meant to be a benchmark.
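The three counting rules above can be compared on a small two-dimensional example. This is an illustrative sketch of the counting logic as described here, not pyhdfe's implementation; the data and the helper connected_components are made up for the illustration, and a simple union-find stands in for the Abowd, Creecy, and Kramarz (2002) algorithm.

```python
import numpy as np


def connected_components(first, second):
    """Count connected components of the bipartite graph linking two dimensions."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Each observation is an edge between one level of each dimension.
    for a, b in zip(first, second):
        parent[find(("first", a))] = find(("second", b))

    return len({find(node) for node in list(parent)})


firms = ["f1", "f1", "f2", "f3", "f3"]
workers = ["w1", "w2", "w2", "w3", "w4"]
levels = [len(set(firms)), len(set(workers))]  # [3, 4]

# 'none': all levels of the first dimension, then levels minus one.
degrees_none = levels[0] + (levels[1] - 1)

# 'pairwise': the second dimension loses one level per connected component.
components = connected_components(firms, workers)
degrees_pairwise = levels[0] + (levels[1] - components)

# 'exact': rank of the stacked dummy variables.
codes = [np.unique(column, return_inverse=True)[1] for column in (firms, workers)]
dummies = np.hstack(
    [(c[:, None] == np.arange(c.max() + 1)).astype(float) for c in codes]
)
degrees_exact = int(np.linalg.matrix_rank(dummies))

print(degrees_none, degrees_pairwise, degrees_exact)  # 6 5 5
```

Here the graph splits into two components, so 'none' overstates the degrees of freedom by one while 'pairwise' matches the rank-based count, as expected for two dimensions.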

  • residualize_method (str, optional) –

    Type of algorithm to initialize. The following methods are supported:

    • 'within' (default for one dimension) - Within transform. Matrix columns are de-meaned within each fixed effect group (i.e., each unique value in ids). This algorithm only works for a single fixed effect dimension (i.e., one column in ids).

    • 'map' (default for multiple dimensions) - Method of alternating projections applied to fixed effect absorption by Guimarães and Portugal (2010), Gaure (2013a), Gaure (2013b), and Correia (2017), among others. Matrix columns are iteratively de-meaned until convergence. This method works for any number of fixed effect dimensions but will be slower than 'within' for one dimension. Variations on this method are used by lfe and reghdfe.

    • 'lsmr' - LSMR method of Fong and Saunders (2011). This implementation is taken from scipy.sparse.linalg.lsmr() and modified for simultaneous iteration over multiple matrix columns and custom convergence criteria. Matrix columns are iterated on until convergence. This method works for any number of fixed effect dimensions but will be slower than 'within' for one dimension. This is the method used by FixedEffectModels.jl.

    • 'sw' - Method of Somaini and Wolak (2016). This non-iterative method only works for two dimensions (i.e., two columns in ids). To minimize memory usage, the first dimension of fixed effects should have fewer levels than the second dimension (i.e., the first column in ids should have fewer unique values than the second column). This is the method used by res2fe.

    • 'dummy' - Matrix columns are replaced by residuals from regressions on dummy variables constructed from ids. This method works for any number of dimensions but is typically computationally infeasible. It is meant to be a benchmark.
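To make the relationship between these methods concrete, the sketch below implements a within transform, a plain (non-accelerated) alternating projections loop built from it, and a dummy variable benchmark. It is a simplified illustration written with NumPy under the assumption of integer identifiers, not pyhdfe's code; the function names are hypothetical, and the tol and iteration_limit arguments only mirror the options of the same name.

```python
import numpy as np


def demean(matrix, codes):
    """Within transform: subtract group means from each column of matrix."""
    counts = np.bincount(codes)
    sums = np.zeros((counts.size, matrix.shape[1]))
    np.add.at(sums, codes, matrix)
    return matrix - (sums / counts[:, None])[codes]


def alternating_projections(matrix, ids, tol=1e-8, iteration_limit=1_000_000):
    """De-mean by each dimension in turn until successive iterates agree."""
    codes = [np.unique(column, return_inverse=True)[1] for column in ids.T]
    for _ in range(iteration_limit):
        last_matrix = matrix
        for dimension_codes in codes:
            matrix = demean(matrix, dimension_codes)
        if np.abs(matrix - last_matrix).max() < tol:
            return matrix
    raise RuntimeError("Failed to converge within the iteration limit.")


def dummy_residuals(matrix, ids):
    """Benchmark: residuals from least squares on dummy variables."""
    codes = [np.unique(column, return_inverse=True)[1] for column in ids.T]
    dummies = np.hstack(
        [(c[:, None] == np.arange(c.max() + 1)).astype(float) for c in codes]
    )
    coefficients = np.linalg.lstsq(dummies, matrix, rcond=None)[0]
    return matrix - dummies @ coefficients


rng = np.random.default_rng(0)
ids = np.column_stack([rng.integers(0, 5, size=60), rng.integers(0, 4, size=60)])
matrix = rng.normal(size=(60, 2))

residuals = alternating_projections(matrix, ids)
benchmark = dummy_residuals(matrix, ids)
print(np.abs(residuals - benchmark).max())  # should be close to zero
```

With a single column in ids, the loop reduces to one pass of demean followed by a pass that changes nothing, which is why 'within' is the cheaper choice for one dimension.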

  • options (dict, optional) –

    Configuration options for the chosen method. The 'within', 'sw', and 'dummy' methods do not support any configuration options. The following options are supported by both 'map' and 'lsmr':

    • iteration_limit : (int, optional) - Maximum number of iterations, after which an exception will be raised if the algorithm has not converged. By default, the maximum number of iterations is 1000000.

    • tol : (float, optional) - Common convergence criterion based on the differences between two iterations’ residualized matrices. By default, algorithms will converge when the maximum absolute value of these differences is less than 1e-8. Convergence based on this criterion can be disabled by setting this value to 0.

    • converged : (callable or None, optional) - Custom convergence criterion, which should be a function of the form converged(last_matrix, matrix) -> bool that accepts the last iteration’s residualized last_matrix and the current iteration’s residualized matrix. It should return a boolean indicating whether the routine has converged. When a custom convergence criterion is used, tol is ignored.

    The following options are supported only by 'map':

    • transform : (str, optional) - Transform operator \(T\) that determines the order of projections \(P_1, P_2, \dots, P_n\) for each of the \(n\) columns of fixed effects in ids. The following transforms are supported:

      • 'kaczmarz' (default) - Kaczmarz or von Neumann-Halperin operator \(T = P_n \cdots P_1\), which is asymmetric and hence does not support 'cg' acceleration.

      • 'symmetric' - Symmetric Kaczmarz operator \(T = P_n \cdots P_1 \cdots P_n\).

      • 'cimmino' - Symmetric Cimmino operator \(T = (P_1 + \cdots + P_n) / n\).

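    • acceleration : (str, optional) - Method used to accelerate fixed point iteration. This section refers to two methods, 'gk' and 'cg'. The 'cg' method requires a symmetric transform, so it cannot be combined with 'kaczmarz'.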

    • acceleration_tol : (float, optional) - Acceleration method-specific tolerance for when to stop accelerating the convergence of a vector and switch to simple iteration.

      For 'gk', each vector’s convergence is accelerated only when the sum of squared residuals relative to the sum of squared vector values is greater than this value, which is by default 1e-16.

      For 'cg', each vector’s convergence is accelerated up until the first time that its sum of squared residuals is less than this value.

    The following options are supported only by 'lsmr':

    • residual_tol : (float, optional) - Convergence criterion S2 from Fong and Saunders (2011) based on Stewart’s backward error estimate. This is by default 1e-8. Convergence based on this criterion can be disabled by setting this value to 0.

    • condition_limit : (float, optional) - Maximum estimated condition number of the matrix of fixed effects. For higher estimated condition numbers, an exception will be raised. By default, the maximum estimated condition number is 100000000.
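As an illustration of how the options above fit together, the sketch below builds an options dictionary for the 'map' method with a custom convergence function. The option names come from this page; the specific values, the relative-change rule inside converged, and the helper create_map_algorithm are assumptions chosen for the example.

```python
import numpy as np


def converged(last_matrix, matrix):
    """Custom criterion: largest change is tiny relative to the largest value."""
    scale = max(float(np.abs(matrix).max()), 1.0)
    return bool(np.abs(matrix - last_matrix).max() < 1e-10 * scale)


# Options for residualize_method='map'. Because 'converged' is supplied, 'tol'
# would be ignored, so it is left out.
options = {
    'iteration_limit': 10_000,
    'transform': 'symmetric',  # a symmetric transform; 'kaczmarz' does not support 'cg'
    'acceleration': 'cg',
    'converged': converged,
}


def create_map_algorithm(ids):
    """Hypothetical helper that initializes a 'map' algorithm with these options."""
    import pyhdfe  # imported lazily so the definitions above run without pyhdfe

    return pyhdfe.create(ids, residualize_method='map', options=options)


print(converged(np.zeros((3, 1)), np.zeros((3, 1))))  # True
print(converged(np.zeros((3, 1)), np.ones((3, 1))))   # False
```

Passing the same dictionary with residualize_method='lsmr' would not be appropriate, since transform and acceleration are described as 'map'-only options.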

Returns

Initialized Algorithm for absorbing fixed effects. Class attributes contain information about the number of observations, the number of fixed effect dimensions, and if computed, the number of singletons and degrees of freedom used by the fixed effects.

Return type

Algorithm

Examples