Multiple Regression - Dummy variables and interactions - example in Excel

Multiple Regression - Dummy variables and interactions - example in Excel


In this video, I present an example of a multiple regression analysis of website visit duration data using both quantitative and qualitative variables. Variables used include gender, browser, mobile/non-mobile, and years of education. Gender and mobile each require a single dummy variable, while browser requires several dummy variables. I also present models that include interactions between the dummy variables and years of education to analyze intercept effects, slope effects, and fully interacted models. In short, I cover:

- multiple category qualitative variables
- dummy variables
- intercept effects
- slope effects
- dummy interactions

I hope you find it useful! Please let me know if you have any questions!
--Dr. D.
Closed Caption:

hey everybody hope you're doing well as
dr. D it's been a while since I've
talked at all about regression so I
wanted to to brush up on my own
regression skills and talk to you guys a
little bit about time something that
seems to be pretty popular people seem
pretty interested in this so we're going
to do today we're going to do an example
of multiple regression with dummy
variables and I'm sure why
if you guys would post in the comments
why this seems to be so popular
I'd love to know but this seems to be
really popular
maybe it's just because it comes up a
lot and it does come up a lot dummy
variables are really useful
I'm so just a quick recap with multiple
regression when we use multiple
regression where we have a linear model
usually although we can vary that we can
use a more complicated model and we have
some variable Y that were interested in
and we think that it is constructed by a
linear combination of some variables X
that are modified by these coefficients
so b 2 x 2 plus dot dot plus beta K X K
and then we have some error term right
that creates a noise it makes this look
like the real world looks right
so these betas these are coefficients
this one's the intercept and then all
the rest of these are you know they're
there they're multipliers that affect
the influence of x1 on why
right and this is really useful and
powerful because if we take the
derivative of Y with respect to X 1 beta
1 is what you know d 1 dy DX 1 equals
beta 1 which if you took calculus you
remember that if you haven't taken us a
long time you want you meet my not but
that's the slope term for x1 right
that's the effect that x1 has and we can
do some other stuff with with with
derivatives to try to understand the
influence of X on Y but when we have
dummy variables
these are going to be these can be
categorical variables and that's what
makes dummy variable so powerful writing
is that we can treat them instead of
treating them like a quantitative
variable to get larger or smaller on a
continuum
they're just like switches that we flip
on and off again so before I get more
into the theory side of it only pull up
the data that i generated is fake data
I'm just to give you a heads up so you
know i'm not selling company secrets or
anything but its website data a
generator right so we have visitors to a
website and we have the duration of
their visit in seconds and i made up
numbers that I thought seemed reasonable
using a process that maybe I'll outline
another video and then we had them fill
out a survey for example that tells us
their gender the browser that they're
using we can pull this directly from
their visit a lot of times
in addition whether they're browsing
with a mobile browser or not and then
how many years of education they have I
just wanted to pick a few variables that
might be useful and interesting and in
determining the duration of of a visitor
I'm age might also be interesting I just
wanted didn't want to get it make it too
complicated and the first thing we can
do is we can just say okay well is the
duration of a visit and this will not
require dummy variables but we can say
is the duration of a visit related to
the years of education
so let me keep this window open maybe so
we can see it as we go along
so let's just say we want to let x1 the
years of education and we just think
that the amount of browsing time depends
on the years of education and so we're
going to let y equal our duration the
duration of the visit
well then once we've done that we can
build a model right the model is that
why is going to be an intercept
plus the influence of years of education
plus some error term
ok now that's the model in the
population these are great beta is right
and we
that's what we really want to know we
really want to know or maybe want to
know the intercept mostly usually want
to know what this value is we can't know
what that value is though
so what we can do is we can build an
estimated model I I mean sorry the necks
of linear regression equation which is
going to be
the expected value of y equals beta 0
plus beta 1 X 1 and then we can estimate
this within our samples and so when we
estimate this will get Y hat equals b0
plus b1 x 1 and we go from Greek -
- Roman because at this point we are
looking at our sample we're not looking
at a population so we're using our Roman
letters here
I'm side we want to estimate this right
now the way that we do this with for
linear equations we could technically do
it manually but it's not super easy so
the way that I'm going to do is I'm
going to go back to our website data and
i'm going to use our data analysis
toolpak which if you haven't installed
it you don't have it it's really
powerful
it's free i believe if you have excel
already I'm so i highly recommend
installing it
I show you where where it is it's in the
data tab if you click on data analysis i
will give you a bunch of stuff here
usually i think it starts out selected
something else but regression is one
that I use frequently so that's
highlighted because i was just using a
minute ago
and normally these will not be filled in
at all but since I've been playing with
this a little bit these are all filled
in so let me get back to what it looks
like when it's when you'll see it
so first we have to choose our wire
range that's where are our left hand
side variable in strength to the
dependent variable and a nice this case
i want to select this whole column you
can see that i have 10,000 observation
so that's that's a pretty big data set
I do have labels right that first one
has the label on it and now i want to
put my ex in this case i only have one X
so all i need is this year's of
Education column right click that now i
want to put this in a new worksheet and
what i'll do is i'll call this duration
equals and it won't let me do equals I
think so i'll just use a -
education ok so i'll give me a new
regression output table the whole video
on how to read regression output cables
but but just so we can take a look at
this actually won't neck and just need
this left half over here we have the R
squared that says that eleven percent of
the variation in duration can be
accounted for
just by the years of schooling so that
accounts for some of this stuff and then
down here we have a p-value that's on
the intercept here we have years of
education our p value here is 1.6 times
10 to the negative 2 under 60 second
that's very unlikely
and that if you recall is a into p value
on the hypothesis that beta 0 is equal
to 0 beta 1 is equal to 0 and without
very low p value we can reject that it
clearly beta 1 does not equal 0
it's a very low standard error it looks
like right here it says that our
estimate b1 equals 10 . 39 roughly 386 I
guess what that says is for every year
of schooling persons web site visit is
10 seconds longer
now there's a fair bit of variation but
at the same time with 10,000 data points
we can we can pretty precisely estimate
that so 10 . 386 is our guest and so
that's a very simple model and we've
done a good job of it but what if we
want to know about browsers right what
if we want to know if different people
using different browsers have different
visit times right
but one thing we can do is to what we
want to do that is we want to multiply a
number by our browser but we can't write
because you can't multiply a number by
internet explorer it doesn't mean
anything so what we usually do in this
case i have four possibilities built in
here
one is internet explorer 1 is a chrome
one is
safari and one is firefox and so those
are the four options that i have could
put another one or whatever i'm just but
I just to make up the date i figured to
choose these four and what we want to do
that is we want to basically separate
our data into four subsets that's the
way that way we do this and we can build
these dummy variables such that it
equals 1
if I wonder if the user is using
internet explorer an equal 0 otherwise
and so let's define this as d1 right
that's equals one in this case will
define this one is d to this one is d3
and then we always want to have one
category omitted and so in this case I'm
going to omit firefox meaning that they
will all equal 0 if firefox if we're not
using or if they're using firefox the
reason is that we don't want these days
we don't want to have always to add up
to one or else it'll interfere with our
intercept term so we can define the
dummies these ways the this way i have
another video on kind of the theory of
them are variable so you can feel free
to take a look at that
but this is how we will use our dummy
variables so in Excel the easy way to do
this is to copy your category name right
there
so I've internet explorer I have chrome
and then I have safari and what I want
to do is wherever this is equal to
Internet Explorer I want a one and
whatever it's not equal internet
explorer 10 so if i do equals if now I
have c2 equals f 1 and 1 otherwise 0 now
it's a formula works for right there but
i want to be able to use it over and
over again right
and so I want to keep my browser column
locked so i'm going to put a dollar sign
there to convert that to an absolute
reference and in this case I want to
keep my first column our first row
locked
that way when i get the browser roast
column locked means that it will slide
down right that's a lot to this column
it will slide up and down the rows and
over here keeping this row locked means
that it will slide across the test the
comparison will slide across but it
won't go up and down
so if i copy if I feel these down you
can see that wherever it is equal to
Internet Explorer it's a one
everywhere else it's a zero and then if
I've copy these to the right you can see
that it now it's doing that for Chrome
as well and then first far over here and
that firefox again is emitted variable
so wherever we see firefox and for some
reason the formula used me that unlikely
000 from those are all zeros which is
what we want
ok ok so now let's see let's build a
more complicated model let's say we
don't want to know how a duration of
visit
why depends on not only education right
which we can still including this model
at x1
but also our dummy variables B to D 1 2
3 D 2 plus beta 4 d 3 plus epsilon so
now this is a larger model and more
complicated model take the expected
value again to get rid of that epsilon
and then we can get our estimated
regression equation y equals bait b0
plus b1 x 1 plus B to D 1 plus B 3 D 2
plus B for d3
ok so we have all of these in our model
what we now on its estimates for all of
these right we want estimates for all of
the bees across here so the
interpretation of them is going to be
somewhat straight forward particularly
for this one that's a slope term but now
for dummies
the way that this model is set up we
have an offset right so this is our
intercept right
and now Wendy 1 is equal to 1 meaning
that our users using internet explorer
then these are going to both be equal to
0 right so with ie
this will be equal to 1 and this will be
feasible to be equal to 0 but that's
this is going to be equal to 1 which
means that it's just going to shift for
every user by this be - amount right
and so in essence we have a different
internet explorer intercept right the
intercept for inter net explorer
is going to be y hat equals B 0 + b2 and
then we're going to have plus b1x one
right and so that's our estimated
regression equation given that we know
the person is using internet explorer
alternatively if we have somebody who we
know is using chrome then this is going
to be 0 and this is going to be 1 and
this is going to be 0 and so they're y
hat is going to be beta 0 plus B 3
equals B 3 plus b1x one
so this slope stays the same across
different users right but now we are
offsetting by different amounts right
and so the same thing for the intercept
for fire for safari is going to be be 0
plus before and then for firefox the
intercept is just going to be be 0 so
you can think of the b2 b3 before I sort
of the bonus amount of time positive or
negative that a visitor spends on our
site
if they're using internet explorer
instead of firefox frame
ok now let's we can estimate this
equation just bite retelling excelled at
we now have these dummy variables as our
exes
so now we're going to do data analysis
again regression are wise to stay the
same but now our exes are a little bit
different in addition two years of
education we now want include are three
dummy variables
ok now we have this entire subset of
data as our regressors star regresar
data now I have duration is is a
function of Education and browser and
click OK
here you can see my r squared still
eleven percent not sure that was really
super useful and sure enough when we
look at the P the P values
none of these seem to be statistically
significant now that's not to say that
we don't have . estimates in our data it
looks like safari users on average use
their browser about four and a half
seconds longer than firefox users what
chrome users used it about 1.3 seconds
longer than firefox users but those are
not so far away from zero that that we
can say that they're statistically
significantly different from zero
ok let's look at another dummy variable
in our website data we also have gender
right so let's let's define a new
variable will call it
let's call it female and this is going
to be equal to if this equals F and it's
one otherwise it's 0
ok now we have another dummy variable
this one's a simpler dummy it's just
true or false right we only need one in
this case and so in our model we can
just define this as maybe we can call
this D Sub F because it's a female dummy
maybe we can even refer to the rest of
them as d sub IE d sub chrome and d-sub
safari now allow us to kind of read it a
little more easily and now we have a lot
of even bigger model
beta 0 plus beta 1 x years of education
just called xx1 again plus beta2 times
internet explorer plus beta three times
chrome plus beta four times safari plus
beta 5 times female plus epsilon so
there's an even bigger model is a while
by the way we can estimate this again
and we'll go through the same process
turn these into bees and now we're
trying to find an estimate of the bees
to see if gender makes a statistically
significant difference on the duration
that people spend so let's go to data
analysis
regression just turn this - I and that
should cover our entire data set
now we have duration as a function of
Education browser and female
click OK takes a minute but there we go
and what do we have here will now is we
have an intercept years of education
still pretty stable tea - 10.3 6214 not
a lot of movement here now we see that
female actually has this is a
significant as a significant difference
right
that's positive what that means is that
females on average spend 8.88
controlling for years of education
browser right spend 8 . 7 9 seconds
longer on the site and men do and that's
significantly different from zero at any
level of significance right that p-value
is very very low so we can say that
females seem to like our website more
than males do
all right now the last thing i want to
show you a couple things we can do with
a with with interactions to move up
again
let's see ok let's do you work with some
interactions here let's say maybe we can
simplify our date error or model a
little bit maybe not maybe we could do a
really complicated model first let's say
that instead of internet explorer chrome
and safari we want to interact with
education
let's say why would we want to do this
maybe we think that instead of there
being a data structure that looks like
this slope for enter you know where this
is our education and this is our browser
our adoration right
duration of browsing the model we built
it looks like this with three parallel
lines or four parallel lines based on
browser usage but the slopes are the
same regardless of which browser they're
using right the relationship between
duration and education is the same
now another way to look at this might be
to say maybe the inner steps are the
same right
but the slopes are different right so
maybe they started the same place but if
you have higher education and use
explorer then you're the night then that
education will really drive up your
amount of browsing alternatively mind
might say that a safari it grows at a
slower level in and some other maybe
firefox the more energy you use firefox
the more education use actually you have
actually the lower your duration of the
visit
this relationship is when we modeled and
this is the one that you model where you
just have your dummies kind of added
linearly we can we can also model this
third one or this other one this one is
a is modeled as an interaction effect
where we have now we still have an
intercept intersect right yeah and
intercept term but now instead of beta 1
x
well we can still be 1 times X 1 and it
will be for firefox now instead of beta
2 x RI internet explorer dummy
we have our internet explorer dummy
times rx1r education and then we'll have
beta three times are dummy for chrome
times our education and then be four
times are dummy for safari times our
education right
the fact that this has is that now let's
say we know that
d i equals 1 we know that we have
internet explorer user
what is the effect of exon r of x on
I'm sorry this should be one what is the
effect of X on Y
well the intercept is still the same as
it is isn't beta 0 now so that's being
held constant for everybody right
beta 0 right here but the derivative of
Y with respect to X 1
depends on beta right and on the dummies
in this case we know that d i equals 1
then it's going to be beta 1 plus beta 2
is Ben 1 plus beta 2 x your dummy right
alternatively the slope term for Chrome
users is going to be beta 1 plus beta 3
and so now we are allowing this slope to
adjust how do we do this well
instead of using our exes here let's a
let's create kind of a new data just
years
let's call this a joke so I want this to
be years of education
I want to copy that all the way down and
that's just a the same variable excel
requires you to use rectangular
regressors shapes rectangles matrices
array so when I stick with that now
we're going to do educate cross IE going
to this
I stick with them dollar sign k 2 loops
washing k2 times IE internet explorer
and then we have and you and your ex
chrome and then fifty
x safari and those will be 0
everywhere that too but now you can see
that instead of having a dummy variable
which is just ones and zeros
I have a variable that takes the value
of education or takes the value of zero
now when I do a regression it's going to
estimate this something that looks like
this
I think the way that I constructed the
data that should show no real strong
relationship either but we can see so
now these I don't gender in my model
again this time but I do have something
like this so let's go with this
enter and then we'll call this education
browse cross magic go
ok takes a minute
yeah none of these are statistically
significant but you can see these terms
are very different education term is
still the same
now education these interactions tell us
that for every year of Education safari
users use it for . 3 4 seconds longer
than firefox users
so that's how the slopes different by
very much not enough to conclude that
they're different from zero but if we do
instead of a browser on interaction
let's do a magic
let's do a gender interaction so let's
do educate cross female
and let's steer clear of the browsers
let's just do the simple model here
aggression
now i'm going to take these two and now
that is a female cross education
click OK
this is statistically significantly
different from zero and before we saw
that women
what women do they browse the eight
seconds more
eight seconds longer here we see that
actually this you know in this model it
has to do with their education level
right so it's . 66 seconds per year of
Education in addition to you know more
more than men
now we can include the full interaction
model and so this model
just to give you kind of an idea of what
it's doing for men and women we had the
two models one of which looked at a
relationship that had the same slope but
different intercepts and then this one
has the same intercept
but different slopes right so this in
this case females that they're they're
browsing increases faster with respect
to education than males and so what we
can do finally as we can do a full
interaction model where we allow these
two basically just be two different
lines that have different functions the
way we do this is by including both a
dummy term and intercept term right so
beta 0 plus beta 1 x 1 plus beta 2 d
female plus beta 3d female times x1 and
now you can see that if we have
somebody's a male both of these will be
0 right and so it's just this simple
regression but if we have somebody is a
female both of these are equal to one or
this is equal to 1 and this is equal to
one and then this one becomes equal to
education which means that for females
we are actually estimating a totally
different equation which is basically to
put groups together
beta 0 plus beta 2
that's our intercept plus and then
distributing beta 1 plus beta 3 times X
1 and so it
we have a you think of beta 2 as the
intercept modification for females and
beta 3 is the slope modification for
female modification for three females so
let's build that structure into this
data
the easy way to do that one easy way to
do that is just insert here and then set
this equal to one over here so you can
see that we have a dummy variable as
well as an interaction variable data
analysis regression
now this data we wanted to go from here
to here and then all the way down
this is going to be education female and
family female cross education
click ok and here we see that it looks
like it's the education that seems to be
driving it
females this is not significant
education is still significant right -
9.5 six seconds per year of Education
for men
right because that's what that is and
for women it's going to be that turns
into like 11 seconds per female right
nine point five times per year of
Education and that significant at where
the two percent level anything above the
one percent level really but now the
female intercept difference is not
statistically significant women on
average controlling for the interaction
effect of education
look at this website for 12 seconds
shorter but that there's too much
dispersion around that to really really
be confident about that measure that
measurement we can build one more model
maybe build the biggest the big big big
model we have a really a lot of
observations here so there's a lot we
can do we didn't include mobility at all
do you want to include that I don't know
let's see
so here we have years of education
internet Chrome Safari female full
interactions the kitchen sink model
let's do education across mobile to
mobile is built in here already as a as
a dummy variables to education across
female may be less education across
browsers first
from me to you
yes safari and then female and it's
going to be really big model but it
should give us all the full interactions
there we go education across female
education cross chrome education across
fari mobile
there we go ok so did an analysis
regression we have our same inputs now
our exes are going to be starting with
mobile all the way across to all of the
interactions and all the dummies and all
of we now can call this the duration
equals everything
click OK it's going to take a minute
because that's kind of a long aggression
but you can see here that what we have
all sorts of stuff so years of education
are significant mobile device that kind
of matters maybe education cross mobile
does seem to matter for every year of
Education mobile users use this for
three seconds less education cross
female also seems to matter so we have
to interaction effects going on here yet
so this is how you use dummy variables
there's like clearly there's a lot you
can do you could introduce
nonlinearities if you really wanted to
write you can do that pretty easily by
squaring things or by Reese entering
them
r squared on this is still still not
great right we only explain eighteen
percent even at the best so maybe we
need to collect age or maybe we need to
collect something else to to really
explain what's going on here
and what we do we can explain some stuff
here education seems to matter that
seems that be moderate i moderated
somewhat by mobile device use and gender
but yeah that's how you use dummy
variables written just the one thing to
keep in mind i think when you're doing
this stuff is to always when you're
building your model right
be careful what you're looking for and
always think about what the expected
value of y is given the dummies you're
using right because you're switching
dummies on your switching dummies off
and and that changes both the intercept
and the slope depending on how this is
structured
you can use dummies to modify square
right so if you think education is a
nonlinear effect you can take years of
education squared or recenter that and
square it and then interact with mobile
device
something like that would allow it to
you know go up and then tail off or
something like that for mobile users or
whatever but yeah I hope that you found
this useful to it certainly there's a
lot you can do with a dummy variables if
you have any questions or would like to
see anything more done on this subject
please let me know and i'll be happy to
take a look and I'm really grateful for
you taking the time thanks guys

Video Length: 30:31
Uploaded By: Jason Delaney
View Count: 156,361

Related Software Products
Multiple Regression Analysis and Forecasting
Multiple Regression Analysis and Forecasting

Published By:
Business Spreadsheets

Description:
The Excel Multiple Regression Analysis and Forecasting template enables the confident identification of value drivers and forecasting of business plan or scientific data. The multiple regression process utilizes commonly employed statistical measures to test the validity of the analysis and results are summarized in textual form to be easily understood and interpreted. Feature selection automatically identifies the best variables for predictive analytics and accurate forecasting results. br ...


Related Videos
How to Run a Multiple Regression in Excel 2007
How to Run a Multiple Regression in Excel 2007

I take some pulse rate data and relate it to incline and walking speed on a treadmill via a multiple regression analysis. In addition to showing you how to get the multiple regression output, I interpret the results. hr / bClosed Caption:/b hello this is a second video in a series on multiple regression in this video I'll show you how to run the multiple regression using Excel 2007 we're using the same data is in the first video pulse ...
Video Length: 08:22
Uploaded By: ProfTDub
View Count: 155,149

regression analysis 2 in Excel with extended multiple regression analysis example
regression analysis 2 in Excel with extended multiple regression analysis example

We take the model that was presented in the first video and run the linear regression in excel. We look at the p values and do hypothesis testing. We then extend the model to a multiple regression model. hr / bClosed Caption:/b font color="#CCCCCC"ok next/fontfont color="#E5E5E5" next I'd like to continue with/font our example problem and insert our data font color="#E5E5E5"recall that we had hours of instruction/font and units produced in ...
Video Length: 09:55
Uploaded By: pomscm
View Count: 109,100

Copyright © 2025, Ivertech. All rights reserved.