How to handle intercept term? #13

Open
cicdw opened this issue Jan 26, 2017 · 3 comments

@cicdw
Collaborator

cicdw commented Jan 26, 2017

What is the best way to handle intercepts?

Right now, the algorithms assume the user creates a column of 1s in their dask array, à la statsmodels. However, it is sometimes convenient to have a fit_intercept option similar to scikit-learn's. Setting this option to True would require a step that appends a column of 1s to the user-supplied dask array, which is not as simple as the corresponding numpy case.

@mrocklin

@mrocklin
Member

If the question is how to concatenate a column of ones, then the answer is to use the da.concatenate function:

In [2]: import dask.array as da

In [3]: x = da.random.random((5, 2), chunks=(2, 2))

In [4]: o = da.ones((x.shape[0], 1), chunks=(x.chunks[0], (1,)))

In [5]: z = da.concatenate([x, o], axis=1)

In [6]: z.compute()
Out[6]: 
array([[ 0.16174789,  0.06872224,  1.        ],
       [ 0.01018076,  0.68570003,  1.        ],
       [ 0.31238221,  0.91503403,  1.        ],
       [ 0.90225416,  0.04750495,  1.        ],
       [ 0.98440154,  0.22888387,  1.        ]])
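For the fit_intercept option raised in the original question, the concatenation above could be wrapped in a small helper. This is only a minimal sketch: the name `add_intercept` is hypothetical and not part of the library, and it assumes the input is a 2-D dask array.

```python
import dask.array as da


def add_intercept(X):
    """Return X with a trailing column of ones (hypothetical helper)."""
    # Reuse the row chunking of X so both arrays line up block by block.
    ones = da.ones((X.shape[0], 1), chunks=(X.chunks[0], (1,)), dtype=X.dtype)
    return da.concatenate([X, ones], axis=1)


x = da.random.random((5, 2), chunks=(2, 2))
z = add_intercept(x)
print(z.shape)  # (5, 3)
```

Note that the result has two chunks along the column axis (the original columns and the new column of ones), which matters for any code that processes the array block by block.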

@cicdw
Collaborator Author

cicdw commented Feb 22, 2017

A naive attempt at using this to add an intercept makes admm choke:

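import numpy as np
import dask.array as da

# make_y and admm are this project's own functions (imports not shown)
X = da.random.random((100, 2), chunks=(50, 2))
y = make_y(X, beta=np.array([-1.0, 2]), chunks=(50,))
# Column of ones with the same row chunking as X
o = da.ones((X.shape[0], 1), chunks=(X.chunks[0], (1,)))
z = da.concatenate([X, o], axis=1)
admm(z, y)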
...
ValueError: shapes (50,1) and (3,) not aligned: 1 (dim 1) != 3 (dim 0)

Traceback

  File "algorithms.py", line 199, in wrapped
    return func(beta, X, y) + (rho / 2) * np.dot(beta - z + u,
  File "families.py", line 17, in pointwise_loss
    Xbeta = X.dot(beta)

This could be an issue with how admm assumes the data is chunked. Or is there another way we should handle intercepts?

@mrocklin
Member

These lines are problematic:

XD = X.to_delayed().flatten().tolist()
yD = y.to_delayed().flatten().tolist()
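To see why flattening the delayed blocks goes wrong here, it may help to inspect the chunk layout of the concatenated array. The sketch below rebuilds the array from the failing example (without `make_y` or `admm`) and lists the shape of every block. The `(50, 1)` blocks it reports are the column of ones standing alone, which matches the `shapes (50,1) and (3,) not aligned` error.

```python
import dask.array as da

X = da.random.random((100, 2), chunks=(50, 2))
o = da.ones((X.shape[0], 1), chunks=(X.chunks[0], (1,)))
z = da.concatenate([X, o], axis=1)

# Two row chunks and two column chunks: the ones column is its own chunk.
print(z.chunks)  # ((50, 50), (2, 1))

# to_delayed() yields one Delayed object per block, so flattening a 2x2
# grid of blocks gives four pieces rather than one per row chunk.
blocks = z.to_delayed().flatten().tolist()
shapes = [b.compute().shape for b in blocks]
print(shapes)  # [(50, 2), (50, 1), (50, 2), (50, 1)]
```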

I recommend first rechunking these arrays so that they have only a single chunk along the column axis:

In [8]: z
Out[8]: dask.array<concate..., shape=(100, 3), dtype=float64, chunksize=(50, 2)>

In [9]: z.rechunk((None, z.shape[1]))
Out[9]: dask.array<rechunk..., shape=(100, 3), dtype=float64, chunksize=(50, 3)>
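Putting the pieces together, a rough check that the rechunked array yields one full-width block per row chunk might look like the following. The coefficient vector `beta` is made up purely for illustration, and the per-block dot product only stands in for whatever the solver computes on each block.

```python
import numpy as np
import dask.array as da

X = da.random.random((100, 2), chunks=(50, 2))
o = da.ones((X.shape[0], 1), chunks=(X.chunks[0], (1,)))
z = da.concatenate([X, o], axis=1)

# Keep the row chunking (None) and merge all columns into a single chunk.
z = z.rechunk((None, z.shape[1]))
print(z.chunks)  # ((50, 50), (3,))

# Each delayed block now spans all three columns.
blocks = z.to_delayed().flatten().tolist()
beta = np.array([-1.0, 2.0, 0.5])  # illustrative values only
products = [b.compute().dot(beta) for b in blocks]
print([p.shape for p in products])  # [(50,), (50,)]
```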
