In R, linear least squares models are fitted via the `lm()`

function. Using the formula interface we can use the `subset`

argument to select the data points used to fit the actual model, for example:

```
lin <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
linm <- lm(y ~ x, data = lin, subset = 2:4)
```

giving:

```
R> linm
Call:
lm(formula = y ~ x, data = lin, subset = 2:4)
Coefficients:
(Intercept) x
-1.633 1.500
R> fitted(linm)
2 3 4
-0.1333333 1.3666667 2.8666667
```

As for the double log, you have two choices I guess; i) estimate two separate models as we did above, or ii) estimate via ANCOVA. The log transformation is done in the formula using `log()`

.

Via two separate models:

```
logm1 <- lm(log(y) ~ log(x), data = dat, subset = 1:7)
logm2 <- lm(log(y) ~ log(x), data = dat, subset = 8:15)
```

Or via ANCOVA, where we need an indicator variable

```
dat <- transform(dat, ind = factor(1:15 <= 7))
logm3 <- lm(log(y) ~ log(x) * ind, data = dat)
```

You might ask if these two approaches are equivalent? Well they are and we can show this via the model coefficients.

```
R> coef(logm1)
(Intercept) log(x)
-0.0001487042 -0.4305802355
R> coef(logm2)
(Intercept) log(x)
0.1428293 -1.4966954
```

So the two slopes are -0.4306 and -1.4967 for the separate models. The coefficients for the ANCOVA model are:

```
R> coef(logm3)
(Intercept) log(x) indTRUE log(x):indTRUE
0.1428293 -1.4966954 -0.1429780 1.0661152
```

How do we reconcile the two? Well the way I set up `ind`

, `logm3`

is parametrised to give more directly values estimated from `logm2`

; the intercepts of `logm2`

and `logm3`

are the same, as are the coefficients for `log(x)`

. To get the values equivalent to the coefficients
of `logm1`

, we need to do a manipulation, first for the intercept:

```
R> coefs[1] + coefs[3]
(Intercept)
-0.0001487042
```

where the coefficient for `indTRUE`

is the difference in the mean of group 1 over the mean of group 2. And for the slope:

```
R> coefs[2] + coefs[4]
log(x)
-0.4305802
```

which is the same as we got for `logm1`

and is based on the slope for group 2 (`coefs[2]`

) modified by the difference in slope for group 1 (`coefs[4]`

).

As for plotting, an easy way is via `abline()`

for simple models. E.g. for the normal data example:

```
plot(y ~ x, data = lin)
abline(linm)
```

For the log data we might need to be a bit more creative, and the general solution here is to predict over the range of data and plot the predictions:

```
pdat <- with(dat, data.frame(x = seq(from = head(x, 1), to = tail(x,1),
by = 0.1))
pdat <- transform(pdat, yhat = c(predict(logm1, pdat[1:70,, drop = FALSE]),
predict(logm2, pdat[71:141,, drop = FALSE])))
```

Which can plot on the original scale, by exponentiating `yhat`

```
plot(y ~ x, data = dat)
lines(exp(yhat) ~ x, dat = pdat, subset = 1:70, col = "red")
lines(exp(yhat) ~ x, dat = pdat, subset = 71:141, col = "blue")
```

or on the log scale:

```
plot(log(y) ~ log(x), data = dat)
lines(yhat ~ log(x), dat = pdat, subset = 1:70, col = "red")
lines(yhat ~ log(x), dat = pdat, subset = 71:141, col = "blue")
```

For example...

This general solution works well for the more complex ANCOVA model too. Here I create a new pdat as before and add in an indicator

```
pdat <- with(dat, data.frame(x = seq(from = head(x, 1), to = tail(x,1),
by = 0.1)[1:140],
ind = factor(rep(c(TRUE, FALSE), each = 70))))
pdat <- transform(pdat, yhat = predict(logm3, pdat))
```

Notice how we get all the predictions we want from the single call to `predict()`

because of the use of ANCOVA to fit `logm3`

. We can now plot as before:

```
plot(y ~ x, data = dat)
lines(exp(yhat) ~ x, dat = pdat, subset = 1:70, col = "red")
lines(exp(yhat) ~ x, dat = pdat, subset = 71:141, col = "blue")
```

# Distribution Fitting with Sum of Square Error (SSE)

This is an update and modification to Saullo's answer, that uses the full list of the current `scipy.stats`

distributions and returns the distribution with the least SSE between the distribution's histogram and the data's histogram.

## Example Fitting

Using the El Niño dataset from `statsmodels`

, the distributions are fit and error is determined. The distribution with the least error is returned.

### All Distributions

### Best Fit Distribution

### Example Code

```
%matplotlib inline
import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
from scipy.stats._continuous_distns import _distn_names
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')
# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
"""Model data by finding best fit distribution to data"""
# Get histogram of original data
y, x = np.histogram(data, bins=bins, density=True)
x = (x + np.roll(x, -1))[:-1] / 2.0
# Best holders
best_distributions = []
# Estimate distribution parameters from data
for ii, distribution in enumerate([d for d in _distn_names if not d in ['levy_stable', 'studentized_range']]):
print("{:>3} / {:<3}: {}".format( ii+1, len(_distn_names), distribution ))
distribution = getattr(st, distribution)
# Try to fit the distribution
try:
# Ignore warnings from data that can't be fit
with warnings.catch_warnings():
warnings.filterwarnings('ignore')
# fit dist to data
params = distribution.fit(data)
# Separate parts of parameters
arg = params[:-2]
loc = params[-2]
scale = params[-1]
# Calculate fitted PDF and error with fit in distribution
pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
sse = np.sum(np.power(y - pdf, 2.0))
# if axis pass in add to plot
try:
if ax:
pd.Series(pdf, x).plot(ax=ax)
end
except Exception:
pass
# identify if this distribution is better
best_distributions.append((distribution, params, sse))
except Exception:
pass
return sorted(best_distributions, key=lambda x:x[2])
def make_pdf(dist, params, size=10000):
"""Generate distributions's Probability Distribution Function """
# Separate parts of parameters
arg = params[:-2]
loc = params[-2]
scale = params[-1]
# Get sane start and end points of distribution
start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)
# Build PDF and turn into pandas Series
x = np.linspace(start, end, size)
y = dist.pdf(x, loc=loc, scale=scale, *arg)
pdf = pd.Series(y, x)
return pdf
# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())
# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind='hist', bins=50, density=True, alpha=0.5, color=list(matplotlib.rcParams['axes.prop_cycle'])[1]['color'])
# Save plot limits
dataYLim = ax.get_ylim()
# Find best fit distribution
best_distibutions = best_fit_distribution(data, 200, ax)
best_dist = best_distibutions[0]
# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u'El Niño sea temp.\n All Fitted Distributions')
ax.set_xlabel(u'Temp (°C)')
ax.set_ylabel('Frequency')
# Make PDF with best params
pdf = make_pdf(best_dist[0], best_dist[1])
# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label='PDF', legend=True)
data.plot(kind='hist', bins=50, density=True, alpha=0.5, label='Data', legend=True, ax=ax)
param_names = (best_dist[0].shapes + ', loc, scale').split(', ') if best_dist[0].shapes else ['loc', 'scale']
param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_dist[1])])
dist_str = '{}({})'.format(best_dist[0].name, param_str)
ax.set_title(u'El Niño sea temp. with best fit distribution \n' + dist_str)
ax.set_xlabel(u'Temp. (°C)')
ax.set_ylabel('Frequency')
```

## Best Solution

You could be looking at

`approx()`

and`approxfun()`

... or I suppose you could fit with`lm`

for linear or`lowess`

for non-parametric fits.