EF11: Digital Experimentation and Peer Effects with Dean Eckles

Speaker 1:                           Welcome to Economic Frontiers. Today our guest is Dean Eckles, who is an assistant professor of marketing at MIT's Sloan School of Management. Dean is an expert in experimentation, causal inference, and peer effects, and is doing some really innovative work. Really excited to have this conversation. Welcome to the show, Dean.

Dean Eckles:                       Yeah glad to be here.

Speaker 1:                           Why don't we get started with a little bit about your background. One of the interesting things about you is that you don't have a PhD in marketing or economics, but in something else. Why don't you tell us how you got involved in this field.

Dean Eckles:                       Yeah, sure. Actually, a lot of my initial interest in getting involved in research came from the perspective of human computer interaction. That's how I started working in the internet industry. There are a lot of connections between thinking about human computer interaction, HCI, and how online markets and services work. That's really what led me in this direction. My PhD is in communication, and I also have master's degrees in cognitive science, with a focus on human computer interaction, especially computers trying to persuade people and bring about behavior change, and in statistics. That's my mix of skills. Actually, before I joined MIT I worked at Facebook, during my PhD and then for three years afterwards. A lot of the experience I have running randomized experiments, and thinking about the tools to run randomized experiments, is from that time at Facebook.

Speaker 1:                           Yeah that's fascinating. One of the things that you might be able to tell us a little bit about is whether the human computer interaction viewpoint on persuasion is different than the marketing viewpoint on it. What approaches did the two fields take?

Dean Eckles:                       Yeah, I think that's an interesting question. One of the things that I learned studying with two of my mentors, B. J. Fogg and Clifford Nass, was to think about the variables involved in the design of technology. I think a lot of times when people think about technology, they think about monolithic technologies. What does it mean for TV to enter the market? What does it mean for mobile, location based ads to enter the market? What happened when radio entered the market? These get treated as monolithic technologies.

                                                 Instead, maybe you can think about some of the specific design choices and characteristics of those technologies, many of which might be under your control as a designer. Part of it is that the HCI perspective is often a lot more design oriented and sees the technology as pretty plastic, as something that you as a designer or manager might consider moving somewhere else in that design space. That design oriented perspective is one of the things that I took from that.

Speaker 1:                           Yeah, I think that's a very general theme: people who work at companies have the ability to change things, whereas academics can look back at history and try to figure out what happened. From that perspective it seems that the big change is TV or radio. But while you're at a company, let's say [inaudible 00:03:40] Facebook, Facebook is already there, and frankly the effect of Facebook might be of academic interest, but the practical interest is how to change it in order to improve whatever your objective function is.

Dean Eckles:                       Yeah, and I think especially now we're in this world of internet services and more and more things being in software, the idea that software is eating the world. That makes all of this so much more plastic, and so as a designer it is reasonable to consider, wait, how could I change how this communication technology works? How could I change how this market works? As a relatively open ended question. Whereas, yeah, there is something to the fact that once standards were agreed on for broadcast television, there are some constraints involved there. As a social scientist you're only going to be able to study actually existing television-like technologies, not some of the alternative variations that you're considering. With software, with internet services, it's really so much more of an open field.

Speaker 1:                           Starting with that background, one of your research and life interests is in causal inference. How does this intersect with this theme that you've been thinking about?

Dean Eckles:                       When you're thinking about a design oriented perspective on technology, it's often about trying different designs and seeing how they do with respect to some objective function. For decision makers who are designers or product managers in the internet industry, one of the main tools they have for figuring out what they should do next is trying different things. Usually what they want to know is what would happen if we launched this new redesign of our service. We changed the Facebook home page, we changed how the Yahoo homepage works, we changed how people write reviews on Airbnb. What would happen if we rolled out that change to everyone?

                                                 That's a question really about counterfactuals. What would happen if we launch this change versus what would happen if we don't launch this change? Or maybe we have a much wider design space that we're exploring. What would happen if we used this marketing copy or that marketing copy or some third set of marketing copy? There are a lot of these questions that are really about counterfactual policies that we could implement, counterfactual designs that we could roll out to our users. Those sorts of questions are causal questions. What would happen if we did X?

                                                Now luckily, one of the natural ways to answer these is through randomized experiments, and randomized experiments or AB tests are super easy to do a lot of the time in internet services. That's one of the main tools that we have.

Speaker 1:                           I'm going to ask you the straw man question which I'm sure many managers have also asked you, which is okay, so this sounds great. I'm on board with causal inference, but what is there to say about it, let's just run the AB tests and go forward. Why is this an entire research area? What are the key choices one makes when trying to design for causal inference?

Dean Eckles:                       Yeah, so I think actually there's a lot of truth to that, which is that once you have some of the tools set up to do rapid AB testing, to do rapid experimentation on your customers or on your partners or on the market that you're running, that it can be pretty easy a lot of the time. A lot of it becomes super routinized, and you're churning out AB test after AB test over a short period of time and making lots of decisions using them.

                                                I think in some ways it can become really easy and routinized if you have the right tool sets and the right culture around that. A lot of that is not so much just about the details of causal inference as part of econometrics or applied statistics, but about software engineering from the perspective of trying to make reaching the right decisions the easiest thing to do. Having good defaults for all the people who are designing these experiments who may not know a lot about statistics.

                                                 A lot of times actually it can become quite easy through tooling. That's part of the perspective behind an open source tool that we released with some of my colleagues at Facebook; Eytan Bakshy is the person who leads that project. A tool called PlanOut, which is a framework for running and deploying randomized experiments in all kinds of settings. It's used at Facebook, but also at a number of other companies.

                                                 Our perspective there was really this idea that there's this quote from Sir Ronald Fisher, a pioneer in statistics, that to consult the statistician after the experiment has been conducted is to merely ask him to do a post mortem. He can tell you what the experiment died of. One of the things about experimentation in the internet industry is there's so much of it happening. There are way more experiments, way more experimenters, than there are statisticians. The goal with tooling is often to try to build the good advice of statisticians into the very tools, so that the process itself pushes people towards running AB tests in the right way, and avoiding a lot of the pitfalls that often do come up in that area.
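
                                                 For a concrete flavor of what that kind of tooling looks like, here is a minimal sketch in the style of the open source PlanOut Python reference implementation. The experiment class and parameter names are made up for illustration; the pattern of subclassing SimpleExperiment and assigning parameters with operators like UniformChoice follows the project's documented usage, but treat the details as an approximation rather than a definitive example.

```python
from planout.experiment import SimpleExperiment
from planout.ops.random import UniformChoice

class SignupButtonExperiment(SimpleExperiment):
    # Hypothetical experiment: which button design drives more signups?
    def assign(self, params, userid):
        # Assignments are deterministic functions of (experiment, parameter, unit),
        # so the same user always sees the same variant.
        params.button_color = UniformChoice(
            choices=['blue', 'green', 'orange'], unit=userid)
        params.button_text = UniformChoice(
            choices=['Sign up', 'Join now'], unit=userid)

# Fetching a parameter also logs an exposure record by default, which is the
# kind of "good default" that bakes statisticians' advice into the tooling.
exp = SignupButtonExperiment(userid=42)
print(exp.get('button_color'), exp.get('button_text'))
```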

Speaker 1:                           What are those pitfalls? What mistakes would someone make if they're not aware of proper experimental practice?

Dean Eckles:                       Yeah, so I think some of the pitfalls are really boring ones. People think, oh yeah, we need to assign people to groups A and B. Let's just do it haphazardly. We'll assign all the people with odd user IDs to treatment and all the people with even user IDs to control. That will be as good as random. That often turns out not to be true. Your user IDs might not really be as good as random, and all of a sudden you have an important bias in the results of your experiment. There are a lot of really nitty-gritty things like that, like making sure that the pseudo random number generation that's behind the random assignment in your AB testing is actually valid. There's those kinds of things.
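
                                                 As a contrast with the odd/even ID approach, here is a minimal sketch (not any particular company's implementation) of the salted-hash style of assignment that AB testing tools typically use. Hashing mixes the bits of the ID so assignment doesn't inherit whatever structure the raw IDs have, and a per-experiment salt keeps different experiments' assignments independent.

```python
import hashlib

def assign_bucket(experiment_salt: str, user_id: int, n_buckets: int = 2) -> int:
    """Deterministically map a user to a bucket for one experiment."""
    digest = hashlib.sha1(f"{experiment_salt}.{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

# The same user always lands in the same bucket for a given experiment...
assert assign_bucket("new_homepage", 12345) == assign_bucket("new_homepage", 12345)
# ...but can land in different buckets across experiments with different salts.
print(assign_bucket("new_homepage", 12345), assign_bucket("checkout_flow", 12345))
```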

                                                 There are some that start to involve a little bit more related to causal inference. Making sure that the units that you're randomizing are also the units that you care about the outcomes for. If I'm going to randomly assign users to different experiences, then I want to look at user level outcomes. If I'm going to randomly assign advertisers to different conditions, then I want to look at advertiser level outcomes. Sometimes we want to randomize smaller pieces than that. Maybe we want to randomize pairs of user and ad, so that each user might experience ads in different designs. We could see the same user many times, and then we look at outcomes at the ad level.

                                                 Being able to think carefully about what the units are that you're trying to randomize, and what that tells you about the decisions you're trying to make. That's a source of a lot of problems in some cases. In the big picture, one of the ways that that can be a problem is that often the units that we really care about are all of our customers at once, because our customers are interacting with each other. When we're saying, oh, what would happen if we rolled out some redesign of a network product like Facebook, or what would happen if, as you know, we were to redesign how reviews work on Airbnb? The answers to those questions involve a lot of your customers interacting with each other.

Speaker 1:                           Yeah. I completely concur that actually thinking through designing the proper experiment to answer the question you're interested in is complicated. Oftentimes you have to make trade-offs in terms of statistical power versus properly ensuring that the various people aren't interacting in the wrong ways and breaking the experimental conditions. Setting good defaults does seem to be really important.

                                                The other thing I wanted to ask you about the company's perspective on experimentation is perhaps the role of thinking about experimenting to learn behavior versus experimenting to test policy. Can you talk a little bit about that?

Dean Eckles:                       Yeah, I think that's a good question. A lot of routine AB testing is essentially bake-offs, right? We have one or more new ideas, we want to see which one of them is best and whether any of them are better than the status quo, and then that corresponds to some kind of really often minimal policy, which is hey, let's just launch to everyone the thing that looks best. A lot of AB testing is just about hey, let's bake off some policies, some designs against each other. In that way it can sometimes be sort of [atheoretical 00:13:13], and that's great actually, that you can make decisions in a lot of these cases without having a lot of theory. But there are other cases, especially if you're getting at things that are pretty core to your business, or where, as in the social sciences, we're trying to really understand how people are making decisions.

                                                 When we care about what factors are affecting their behavior, then we care about designing experiments that are not just getting at some particular set of designs that we're baking off against each other right now to see which one is best; we want to learn about which factors specifically are affecting people's behavior. That may involve doing other sorts of experiments in which we actually consider alternatives that we think in advance are not better than the status quo. Where we actually might try things that we think might not be as good as what we're already doing, whereas normally in a bake-off we wouldn't do that.

                                                 Maybe I can give an example [crosstalk 00:14:08] kind of. A lot of Facebook's ad units feature social cues. You might see an ad that says your friend likes this page, and then there's an ad from that page. For a lot of reasons, derived from many theories in the social, behavioral, and cognitive sciences, we'd think that this could make people attend to these ads more, make the ads more effective. If you wanted to actually learn about how big that social influence effect is from this social information about your peer, one way to learn about that is through doing a randomized experiment where you change which cues are present in the ad. Maybe, actually, if you have multiple friends who could be shown, you could decide how many to show or which to show.

                                                 Some of the easiest ways to learn about how important these social cues are involve not showing social information that is available. Maybe there's an ad that I can see that says Andrey Fradkin likes MIT Sloan, and we could decide not to show Andrey Fradkin likes MIT Sloan, and just show the ad from MIT Sloan. Our expectation in advance is that that ad without Andrey's name is going to be worse: it's going to attract less of my attention, I'm going to think it's less relevant, I'm less likely to click on it, I'm less likely to convert after clicking on it. In advance, we think that that alternative would be worse than the version where we do show the social cue, but by experimenting with removing that social cue, we learn about how important social influence is overall in the ecosystem of advertising, and that could allow us to do things like both test theories in the social sciences, but also do things like allocate resources in an organization.

                                                How much time should we spend on designing how all this social information is collected and displayed? The answer to that is going to depend on how important social information is in this current regime, in the status quo where we're using it in this particular way.

Speaker 1:                           Yeah, that makes a ton of sense. As another example, you can think about how a lot of companies will randomize their ranking results in order to learn things like how large position effects are. What is the difference, for example, if you're displayed as the number one link versus the number two link in Google search results? You can run an experiment and learn that, but if you don't run an experiment, then you always have confounding factors, so it would be hard to infer what that effect is. And if you find that it's a huge effect, then that should probably result in you allocating more resources to search ranking algorithms, and it might also actually be relevant not just for the company's resource allocation, but for the company's communication with the rest of the world.

                                                 If a company can credibly say that hey, the number one ad slot is really way better than the number two ad slot, that's going to affect how the advertisers bid, and it's going to affect the bottom line of the company through this alternative channel. I think ... Actually, an interesting area with regards to social networks themselves is, do the users, do we know what happens when we like a page for example? Who sees it? Which of our friends see it? What do they see about it? So on and so forth. Perhaps that's something that's under explored in this setting. Do you know anything about these types of effects?

Dean Eckles:                       Yeah. I think one interesting point there, in general in a lot of social media, is the notion that our audience is often invisible to a large degree. When we post something, we only know who our audience is from some of the traces that they leave on our posts. Some of the feedback that they might give us, whether they're liking our posts, commenting on it, or mentioning it to us in person.

                                                Generally social media platforms, whether it's Facebook or Twitter or others, if I post something I don't get to know whether you saw it unless you take some action on it. Actually, some of my colleagues at Facebook published a paper about quantifying the invisible audience on Facebook, where they say let's ask people how big they think their audience was for a particular post, and see how that compared to their real audience size. See what signals they're using to figure out what their audience might be.

                                                 They found that actually people were dramatically underestimating their audience size for posts, probably because they were also overestimating the feedback rate. I might think that, hey, 20% of people who see my post are going to like it, and so if I have a certain number of likes, then I'm going to use that to scale up to my audience size, right? Actually, people are liking things at a much lower rate than people were estimating, and so that means they're underestimating their audience size.
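
                                                 To make that scaling-up logic concrete, here is a toy calculation with made-up numbers: overestimating the feedback rate translates one-for-one into underestimating audience size.

```python
# Hypothetical numbers, purely to illustrate the inference people seem to make.
likes_received = 10

assumed_like_rate = 0.20  # "20% of people who see my post will like it"
actual_like_rate = 0.05   # people actually give feedback at a much lower rate

perceived_audience = likes_received / assumed_like_rate  # 50 viewers
actual_audience = likes_received / actual_like_rate      # 200 viewers

# Overestimating the feedback rate by 4x means underestimating the audience by 4x.
print(perceived_audience, actual_audience)
```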

                                                 Just in general in social media, even not considering the downstream consequences for things like advertising, often who my audience is is somewhat uncertain. I can limit my audience using privacy settings, but who the effective audience is depends on other people's consumption behaviors, which I don't get to observe.

Speaker 1:                           Yeah. That is interesting. One thing that I've always wondered is to what extent informational interventions would affect people's behavior if you told them hey, if you write a post, a hundred people are actually going to see it, or 500 people are going to see it. Depending on what that number is for you, is that going to affect whether you write that post in the first place or not? I want to move on a little bit. This is a natural place to get into what you can do in a so called big data environment that you can't do in a small data environment.

                                                 Let me preface this by saying that there is a parallel movement in experimentation going on in the development economics community, where they try to evaluate various interventions to help poor people, but that requires giving money or goods, which may or may not be expensive but cost some money, to individuals in the developing world, and so the sample sizes of the experiments that one can run to evaluate a particular intervention are going to be small. Even a thousand observations is, I think, a pretty big experiment, whereas on Facebook, I don't know what the official number of users is currently, but certainly hundreds of millions of people can be in your experiment if you would like them to be. What does that open up for you?

Dean Eckles:                       Yeah, I think that's a really interesting contrast, and that's one of the examples that I like to bring up when I talk about doing randomized experiments in the internet industry, just how much easier it is compared with really the incredible amount of work that some of our colleagues here at MIT, like in the Poverty Action Lab, go through to run these randomized experiments to evaluate development interventions. Even just the process of randomization, how you conduct that in the field in a way that sometimes needs to be public and credible, that it really is random, can be really complicated, whereas in the internet industry a lot of times all you have to do is compute a hash function of some ID, and all of a sudden there's a random assignment. Whether it's big data or small data per se, it's really nice how quickly you can iteratively do randomized experiments. That's a big difference.

                                                Even when I'm working with smaller startup companies, they don't have these sample sizes of hundreds of millions of people, but they still have the ability to quickly experiment on maybe a smaller sample.

Speaker 1:                           Yeah that makes a lot of sense. We don't think about it when we learn statistical theory, but it's easy to screw up a practical detail of an experiment, and if you only have one shot, it's game over. You're not going to get very much useful stuff out of it no matter how much you squeeze it. [crosstalk 00:23:04]

Dean Eckles:                       I think even if you don't screw up a practical detail, a lot of times you only realize that the experiment didn't answer certain questions that you really cared about until after you've run the experiment. Then you iterate. A single experiment is rarely definitive, both because circumstances can be changing, et cetera, but also because even the experiment itself will generate new questions. You might try a couple variations and then realize, oh we should've had a fourth variation here. That would have really clarified matters for us. A lot of times you want to be able to iterate. Some of that has to do with how quickly you can design the new interventions, how quickly you can randomize people to be exposed to these new interventions.

                                                Some of it is also about the scale on which you look at your outcomes. A lot of what is impressive about some of the work in development, in political science and economics, is the long term scale of some of the outcomes that they look at in these experiments where they track people over multiple years and look at how much food they're able to consume, educational outcomes, over a longer term.

                                                 One of the things that happens a lot in the internet industry, for better or worse, is looking at much shorter term outcomes, so that people can make a decision about launching some treatment in an AB test. Should we launch that? Was that good? Should we follow up with another experiment? They can make those decisions in maybe a week's time because they're looking at much shorter term outcomes. Things like conversion rates, overall visitation, engagement metrics that they think they can get meaningful signal on in a short period of time. Sometimes they're wrong; maybe there are a lot of novelty effects and what happens in the long term can be very different from what happens in the short term. I think on the whole they're able to move a lot faster, both because of the ease of implementation of the interventions, the ease of randomization, and then because the outcomes can be measured on a much shorter time scale.

Speaker 1:                           Yeah, that's certainly an advantage, but of course I think as you were hinting at, there's some danger in hitting a local optimum if you will. If there are two modes of operating your online platform, and you're in one mode and you're changing things on the margin. An algorithm here. An email campaign here. You're not going to move so far away maybe from where you started off, when really the big thing would be to completely redesign your site for example. How does one square the large changes with experimentation as opposed to the smaller changes?

Dean Eckles:                       I think that's a great point. A lot of times it's easy to think about AB testing in the context of hill climbing. Basically we are at some point in this design space, or at some point in this fitness landscape, and we're going to try to make small changes to improve things. That's pretty easy to imagine and pretty easy to implement. A lot of times, especially for firms like startups where their business might really be in flux or there are parts of their product that they should be open to dramatically changing, I think they need to take much larger steps. That's often advice that I've given to startup companies: hey, a lot of these experiments that you're running seem like tiny tweaks for where you are in your lifecycle as a company and where your product is. You should be taking much larger steps.

                                                 One of the things that experimentation should make you comfortable doing is trying these larger steps, precisely because we can see whether they're good or not. Whereas in the absence of AB testing, in the absence of randomized experiments, you'd often have to make a big jump based on intuition, what folks at Microsoft call the HiPPO, the highest paid person's opinion, or based on market research that would often be done on a different population than who your actual customers are.

                                                 With AB testing we often do get to know whether these larger steps are worthwhile. People should make those larger steps. That also comes back to the question of the difference between testing a policy and trying to learn about behavior. One of the reasons that we do experiments to try to learn about the factors that affect individuals' behaviors is because that could help us make bigger redesigns. To invest the resources to redesign how a platform or a product works, rather than trying to test some radically different policy right away.

                                                 A lot of times people think about AB testing or experimentation as just randomized evaluation. This is actually often how it's talked about in development econ and political science, which I think makes sense in that context. It's very expensive to do, and you need to have a program that really has had a lot of work go into it already. That puts randomized evaluation really late in the design process, right? In this confirmatory stage. I think there's a lot of room for randomized experiments much earlier in the design process, informing how to allocate resources across different areas you could think about. You have an interaction designer who could be working on many different things. What should they work on? Past randomized experiments might tell you what they should actually spend their time thinking about redesigning.

Speaker 1:                           Yeah that makes a lot of sense. I want to move on into one of your topic areas, and specifically you're very interested in peer effects. I was wondering if you could tell us about some of that research and what you've learned so far.

Dean Eckles:                       Yeah. When we say peer effects, we mean any effects of the behavior of an individual's peers on that individual's behavior. Usually we're going to consider all these things pretty broadly. When we say peers, that's not just your school chums, that's anybody that we might think of as being a network neighbor. It could be your friends on Facebook, it could be people that you're in a running group with in a running app. It could be your family or other kin. We broadly consider other people who you're somehow connected to. What's the effect of their behavior on yours? And really, if you look at basically any field in the social sciences, they have multiple theories about why we should expect substantial peer effects in almost everything.

Speaker 1:                           What are those theories?

Dean Eckles:                       Right. In economics, one reason that we would expect the behavior of peers to affect yours is that it reveals information about those different behaviors. If I'm trying to decide whether to adopt a product, seeing a peer adopt the product is informative to me in a couple of ways. They may have some private information that led them to adopt the product. They know whether it's a good product or not. Also, if I get to see them after they've chosen to adopt it, I might see whether they're having a good time with this product, for example. That's sometimes called information interactions or expectations interactions.

                                                 Also, I might be trying to explicitly coordinate with my peers. If the product we're trying to decide whether to adopt is a fax machine, a fax machine isn't very useful if I don't have anyone to fax. Facebook isn't very useful if I don't have anybody posting content showing up in my newsfeed. Snapchat is not very useful if I don't have anyone to snap. In those cases, it really should be that I'm trying to coordinate the adoption of, say, a communication technology with my peers. That's usually included under the umbrella of preference interactions.

                                                 That's the view from economics, but you have similar stories that come from other fields as well. Psychology has different typologies of social influence as well. Basically, any social scientist that you're going to talk to is going to tell you to expect peer effects in many settings. In fact, any lay person that you're going to talk to is going to say, oh yeah, there should be a lot of influence or contagion in the adoption of all kinds of behaviors.

Speaker 1:                           Okay. Let's say I was a very naïve person and I was just looking at the data, and I saw that, for example, people whose friends are obese also happen to be obese. Can we say that that is due to a peer effect, or not? What are the complexities in trying to make that inference?

Dean Eckles:                       Right. Though we expect peer effects almost everywhere, we also expect some other processes almost everywhere that can produce some of the same patterns that peer effects do. In particular, these processes can also cause us to observe behaviors being correlated in the network and in time. If we observe one person adopting a product after their friends have adopted it, did their friends cause them to? One of the other explanations is that, well, actually, people who are friends, or are family members, or coworkers, or follow each other on Twitter, those people are similar to each other, often in ways that we don't observe. That often gets lumped under the term homophily, or love of the same, which is captured by the aphorism birds of a feather flock together.

                                                The idea is that people who are peers, who are connected in a network are often similar in a whole host of ways. A lot of times we don't get to observe them. The reason that we see that adoptions are correlated in the network can sometimes be because there are peer effects, and sometimes because there's this homophily factor as well causing similar people to adopt. I think most likely, most of the time it's both of them. In the behavioral sciences there are no zero effects. It's often just a matter of how important are these different factors? Peer effects versus things like homophily.

                                                 It can get even more complicated than that, because a lot of times when we observe people adopting a product or a fashion, or expressing a particular opinion, it's not just that there are some fixed characteristics of those people that are correlated in the network; they might also be exposed to external factors. I think Max Weber has this famous example: you see a crowd of people, and a bunch of them are putting up their umbrellas, and it's almost as if a wave of people putting up their umbrellas is sweeping through the crowd. You might look at this and say, okay, there are actually huge peer effects in putting up your umbrella. People are looking around them and seeing that people are putting up their umbrellas, and then they put up their umbrella. But actually, of course, what's happening is that the edge of the rain front is sweeping through the crowd and exposing some people to the rain sooner than others.

                                                Now of course there could still be peer effects in putting up your umbrellas, because if you're in a crowd you don't want to block everyone's view if you're the only guy with an umbrella, but a lot of it is driven by this external factor. If you weren't observing the rain sweeping through the crowd, you might conclude that this is all about peer effects. This is umbrella contagion. We have to be really careful about trying to distinguish those things.

Speaker 1:                           Yeah, and this seems like oftentimes maybe an academic distinction, but it actually has really practical implications. Going back to the obesity example, the types of interventions that you might be thinking about in terms of helping people lose weight, the social interventions would seem more promising if we had conclusive evidence that what's going on is that I eat a lot because my friends eat a lot, rather than that we both don't care about our health and that's why we're friends. I think that's true, not just in this health example, but in a lot of examples on the internet as well.

Dean Eckles:                       Yeah, and actually maybe I'll say a little bit more about some of the internet examples. We already talked a little bit about the idea that when I choose to adopt a communication technology, I care about who else has adopted it because I want to communicate with them. Also that when I post something on some service like Facebook or Twitter, part of the value that I'm going to get is the idea that other people see my posts. That I have an audience, but I only get to find out about that audience often through their actions. The fact that they give me some sort of feedback in the form of likes or comments.

                                                 That's a context where we should really care about peer effects in the continued use of these communication technologies. If somebody posts something on Facebook or Twitter and they get more or less feedback, how does that affect their decision to continue broadcasting, to continue posting on that service? That's a case where you really want to know how large the effects in that virtuous cycle are. That could change a lot of your policy.

                                                You want to know, is it just that people in the network who post a lot and give their friends a lot of feedback, that they're just doing that because they're similar, or is that actually sustaining their use of this communication technology? Is that actually keeping this whole service attractive and interesting to the people who are using it. That's a case where you'd really want to distinguish between things like homophily and whether there are actually these peer effects that are driving your whole business essentially.

Speaker 1:                           You've written some papers about this. What have you found? Is there a brief synopsis I guess?

Dean Eckles:                       Yeah. This is definitely still an area we're working on, but maybe one of the things to highlight is just thinking about how would you actually go about learning about this? We've talked a little bit about doing randomized experiments. Here, the question is something like, if we give Andrey a little bit more feedback on his post, how is that going to affect whether he chooses to post again and whether he gives other people more feedback. Whether he logs in more often to his social media account. How would we actually do a randomized experiment with that? That's one of the big challenges in peer effects, is that the treatments are what your friends do, but we usually don't get to randomize people to what their friends do. Their friends just do whatever they want to do. That presents a real challenge for running randomized experiments.

                                                 One approach is what we already talked about a little bit before: the idea that sometimes you control a non-deterministic mechanism for the peer effects. The reason that you find out what your friends are doing is because it's communicated to you at the top of a social ad on Facebook. Then the experimenter can choose not to show that social information in the ad, and that'd be a way of learning about how big those peer effects are, at least via that channel. In other settings we don't have that opportunity, because the mechanism is deterministic. You post something, and somebody comments on it, and sort of the quality of service guarantee from somebody like Facebook or Twitter or LinkedIn is that you then get to see that comment, right? How can we learn about cases where you get more or less feedback?

                                                 The main strategy that we've used to study that, in one of our papers that's out, is what we call a peer encouragement design, where we randomly assign people's peers to an encouragement to engage with that focal individual. We slightly nudge your friends to give you a little bit more feedback, just by subtly changing the salience of giving you feedback. For example, by making the text box for writing a comment on your post open by default, or closed by default. That causes you to receive a slightly different amount of feedback on your post, and that can allow us to learn about the effects of receiving additional feedback.

                                                 That's the method: using these small, lightweight nudges to a behavior of interest among people's peers to learn about the effect of their peers' behavior on their own behavior. That comes back to this idea of big data versus small data. A lot of the nudges that we can use there are really, really tiny nudges. I often describe them as sort of a feather nudge. That wouldn't work in a smaller dataset, right? The only reason you're able to learn about peer effects by using these small peer encouragements is because you have potentially huge sample sizes. That is definitely a big difference: with big data you can often detect effects of much smaller interventions.
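
                                                 To illustrate the estimation logic of a peer encouragement design under simplifying assumptions, here is a sketch of the generic instrumental variables idea: the random encouragement of a person's peers shifts the feedback they receive only slightly, but dividing the effect of the encouragement on the outcome by its effect on feedback recovers the effect of feedback itself. This is a simulated toy example with made-up numbers, not the estimator or data from the actual paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # feather-light nudges need very large samples

# Hypothetical data-generating process, for illustration only.
encouraged = rng.integers(0, 2, n)  # peers' comment box open (1) or closed (0)
feedback = rng.poisson(2.0 + 0.05 * encouraged)       # tiny first-stage effect
future_posts = 0.3 * feedback + rng.normal(0, 1, n)   # each comment causes 0.3 more posts

# Intention-to-treat effect of the encouragement on the outcome.
itt = future_posts[encouraged == 1].mean() - future_posts[encouraged == 0].mean()
# First-stage effect of the encouragement on feedback received.
first_stage = feedback[encouraged == 1].mean() - feedback[encouraged == 0].mean()

# Wald / IV estimate of the effect of feedback on future posting: a small ITT
# divided by a small first stage recovers the sizable second-stage effect.
print(itt / first_stage)  # close to 0.3
```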

Speaker 1:                           Just to maybe frame this, what do you need the big data for? You can be in one of several situations. One is you can have a huge effect, and you're going to be able to detect a huge effect without that much data. That could be one of two types of effects, actually, in your design. It could be that the encouragement is huge. In one of the papers that I wrote, we studied an experiment where people were offered a monetary incentive to write a review, and that's a pretty big incentive that had a pretty large effect. Then there's the second part of that, which is, what is the effect of that behavior on the outcome of interest? In your case that would be the subsequent engagement of the user who got commented on. You kind of need either one or both of those to be large; otherwise you need to be in the land of an enormous amount of data, because you're not going to be able to detect the effect.

Dean Eckles:                       Yeah. I think actually, unless the first part of that, the effect of the encouragement on the behavior that you're trying to encourage, is large, you really need to have a lot of data. I think that's an important distinction to make, because it could be that, say in your example, we're interested in seeing what happens if we get somebody to write an additional review. We could use either a big nudge like a monetary incentive, or a small nudge like sending them one more reminder email, right? Either way, the effect we're interested in is mainly not the effect of just this incentive, but the effect of them writing the review, for example on how many reviews they write in the future, or the effect of them writing the review on the host that they're writing the review for. Or any number of things.

                                                 That second stage effect is what we really care about. That effect could be quite large, even if our nudge is really tiny. A lot of times people think, oh well, with big data the only advantage is that you find these tiny meaningless effects that are not important for your business, okay? But if you're using the big data to try out these small nudges as part of an encouragement design, then the effects that you're interested in are not the effects of the nudge, but the carry-on effects of the nudged behavior on something else, and that effect can be quite large even if the first effect is quite small.

Speaker 1:                           Yeah, that makes perfect sense. One way to think about it intuitively is that if your nudge is very small, then your effective sample size is just the people that got nudged. If only a hundred out of your hundred million people were affected by this collapsing of the comment section, then you're just not going to be able to detect the subsequent effect, even if it's large enough to be of interest to you as an experimental designer.

Dean Eckles:                       Right, right. That's basically what happens in that study. Our first stage effects, of encouraging people's friends to give them more feedback mainly by opening and closing comment boxes, those effects are small, but the effects of receiving additional feedback on outcomes like choosing to post again, or giving other people feedback, some generalized reciprocity, those effects are actually quite substantial. We do find really good evidence for this sort of virtuous cycle of posting, receiving feedback, giving feedback, posting again in social media, but we use a feather touch to learn that.

Speaker 1:                           I want to get into a little bit of the technical issues. One issue that I'm just thinking about right now, is suppose that there are substitution effects. Let's say that you randomize at a post specific level, and so some posts seem more attractive to comment on than others, and so given that I have a limited amount of time, I'm not sitting all day on Facebook hopefully. I can only write one comment at a time. If I write a comment on your post, that means that I'm not writing a comment on some other post. Does that pose problems for your experimental design and how do you think about that?

Dean Eckles:                       Yeah, I think that's a great question. That goes back to some of the different goals that we might have in running experiments. Whether we're trying to really try out and evaluate a particular policy, or whether we're trying to estimate some particular effects that could be useful for designing future policies. I think if you were trying to figure out, oh, should we launch this policy of opening or closing the text field for commenting for all posts? Then that kind of experiment is not going to be informative about that, right? Because yes, there are going to be these big substitution effects that you described. On the other hand, maybe our goal is really to learn about the effect of marginal feedback. Feedback that barely occurs or doesn't occur depending on small factors changing. We want to learn about the effect of that marginal feedback on continued posting, continued engagement. We're just trying to estimate that particular effect, and we might use that model then to design a subsequent policy.

                                                 For example, we might learn that some feedback is more valuable than other feedback. Some people maybe need more feedback than others, or some people are more responsive to additional feedback than others. For example, we may learn about exactly how much diminishing returns there are in receiving feedback. Getting your first like on a post is maybe more important than getting your 125th like. Learning about those effects would then be used to maybe design a new policy of targeting some of these nudges. Targeting some of the nudges in the network. It's not necessarily that the experiment is directly informative about a particular policy already; it's that it allows you to estimate some effects and calibrate a model that might be used to design new policies.

Speaker 1:                           Yeah, and the new policies don't have to be nudges. In your example, if it's truly the case that only the first comment matters ... Is that something you found, by the way?

Dean Eckles:                       We do find evidence that's consistent with diminishing returns. It's not just that only the first comment matters, but really what we find is that everything is pretty homogeneous on a log-log scale. The main result in that paper is that if we give you 10% more feedback, you give other people 1% more likes, you give other people 1% more comments, and you produce slightly less than 1% more posts. Actually, on that multiplicative scale, there's not a lot of heterogeneity. That suggests this idea that basically there are these multiplicative effects, and thus there are some potentially diminishing returns to additional feedback.
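
                                                 As a back-of-the-envelope illustration of what a roughly constant effect on the log-log (elasticity) scale implies, using only the magnitudes just quoted:

```python
import numpy as np

# "10% more feedback received -> about 1% more posts produced" corresponds,
# in a constant-elasticity (log-log) model, to an elasticity of roughly:
elasticity = np.log(1.01) / np.log(1.10)
print(f"elasticity of posting with respect to feedback: {elasticity:.2f}")  # ~0.10

# An elasticity well below 1 implies diminishing absolute returns: going from
# 1 to 2 pieces of feedback is a 100% increase, while going from 124 to 125
# is under 1%, so the first like moves the prediction far more than the 125th.
print((2 / 1) ** elasticity - 1)      # ~7% more posts from the first extra like
print((125 / 124) ** elasticity - 1)  # ~0.08% more posts from the 125th like
```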

Speaker 1:                           I see. One practical thing that might come out of this is that how much you distribute a given post to the rest of the network is going to vary. If you think it's really important to get this individual to comment, then you'll distribute it, but of course that has a cost, which is that other people's posts aren't getting shown as much. There are these trade-offs, and this helps us think about such trade-offs.

Dean Eckles:                       Yeah exactly. There are definitely a bunch of trade offs here which have to do with essentially, in something like a feed, the height of each post really affects all the subsequent posts. You really are trying to trade off, okay if we make each post bigger, if we expand some of the existing feedback, if we open up the text box where you could write a comment, et cetera, that's pushing the rest of the posts down. Those substitution effects are a key part of any policy that you'd want to consider.

Speaker 1:                           Another issue that I want to bring up with these encouragement designs is the issue of local average treatment effects, or heterogeneity. Let's say you have this nudge, and just 0.1% of people were actually affected, that is, responded to this nudge by writing a comment that they otherwise wouldn't have. That's a very small percentage of your population, so to what extent can you learn something general from the behavior of these people, who are almost outliers by definition?

Dean Eckles:                       Yeah, so this has been a big controversy in causal inference and econometrics recently. When you have these kinds of inducements, or experiments with non-compliance, you get to learn about the effect on the people who are induced to comply. Who are induced into the behavior by your encouragement. But how much do you care about that? Have you sort of just settled for learning about something that's not what you wanted to learn about in the first place, when really what you wanted to learn about was an effect that would average over everyone, not just those people who would be induced into the behavior by your nudge?

                                                I actually don't agree with that mainly. I would say in many cases learning about the effects of these marginal behaviors is more important than learning about averages over the whole population. Thinking about the case of receiving feedback. You can make a post, and then you receive say some number of likes. Some of those likes are going to occur basically under any reasonable policy that we're considering. Right?

Speaker 1:                           Yeah, like if you got married or something.

Dean Eckles:                       Right. Right, or just that I log into Facebook and you're at the top of my feed because we have high tie strength from a lot of past interactions. You post something that I'm really interested in, so then I'm going to like it. Under all of the reasonable policies that we're considering, that like is probably going to occur. It seems not particularly useful, not particularly practically oriented, to try to know what the effect of that like is, the one that's always going to occur.

                                                Or there's the people who are your friends who basically don't really like anything at all on Facebook. Then we could ask, oh what would be the effect of them liking your post? That's maybe also not going to happen under any of our reasonable policies, whereas a bunch of these behaviors, in this case feedback that's going to occur or not depending on small changes to the design, to circumstances, those are the things that we're actually going to affect when we make changes. That seems really policy relevant to me. The effects of marginal feedback or the effects of marginal behavior are often super relevant to decision makers, maybe even more relevant than averages over the whole population.

Speaker 1:                           Yeah, but I would say that there's a tension here. I agree on the decision maker part which is that you care about the margin because that's who you can affect in some sense, but for science, for social science, you don't necessarily care about the margin. I mean if the people who got additional comments because of your experiment are really weird, then you can't generalize from that population to everyone else, and to the extent that science is about learning general things, then maybe this ... You just have to be humble about what you've learned in some sense right?

Dean Eckles:                       Yeah. I think that is an important contrast. I think the other big issue is that the relevant margins can be very different for different nudges. The people who maybe respond to your monetary incentive to write a review are potentially different than the people who respond to other sorts of nudges. Then if those people are not similar and the effects of their behaviors on other outcomes are not similar, then what I said is not going to really be true. There's so many possible margins that we can't just settle for learning about only one of them.

Speaker 1:                           Yeah. One example that springs to mind is, from Netflix actually where when they put the photo corresponding to a movie, they will customize it sometimes to the person. If you've watched Kevin Spacey movies before, for House of Cards they'll put Kevin Spacey on the cover, but if you on the other hand like shows about politics, they might put something related to politics on there. These interventions are going to affect different types of people potentially.

Dean Eckles:                       Right, and that's going to be a very different type of nudge than the choice of whether to recommend House of Cards in position one or position 10.

Speaker 1:                           Yeah, yeah, exactly. A couple more things that I want to ask you about: one of your passions is to conduct inference in a way that's less parametric. Can you explain a little bit about what randomization inference is and what bootstrapping is?

Dean Eckles:                       Yeah, yeah. A lot of the statistical inference that we usually do works via an imagined series of experiments, or an imagined series of data collections, where we care about what happens asymptotically as the sample size that we have goes to infinity, and then we use theory about that as an approximation to what's happening in our actual dataset, which is always finite. An alternative to that, especially in randomized experiments, is to focus on the fixed population that's involved in our randomized experiment. We have some fixed set of people, or fixed set of villages, say, or fixed set of departments in a company that are in our randomized experiment, and what is random is their assignment to treatment or control. We essentially just have this vector that assigns everyone to treatment or control, and we know exactly how that vector was assigned. We know how everyone was randomized to treatment and control, because we did it.

                                                This allows us to have a huge amount of certainty about the actual probability distribution for this one variable, and then we can consider just this finite population of the people who are actually involved in our experiment. That can give us a lot of leverage actually. It allows us to do statistical inference without making really many assumptions at all, especially for things like testing the null hypothesis that our treatment had no effects.

                                                 Often what we do in that case is we have our actual experiment that we conducted. We'll do something like compute the difference in outcomes. The difference in, say, revenues between treatment and control. We say, okay, treatment had more revenues than control. On average five dollars more in treatment than control. Did our treatment do anything? How could we tell? One way is to say, actually, under the null hypothesis that our treatment had no effect, then no matter how we assigned treatment, all the outcomes would have been the same. That means that we can actually re-sample or permute the treatment vector that we have.

                                                We just keep randomly imagining that we conducted a different experiment, and reassign people to treatment and control, and look at what the difference between treatment and control would be in that artificial experiment. We repeat that a number of times, and we say does our observed difference between treatment and control, does that look extreme or unusual compared with the distribution of differences we would have observed if treatment had no effect at all? If it's extreme, then we can confidently reject the null hypothesis that there were no effects.

                                                 Essentially that whole machinery doesn't require any of the normal parametric assumptions or asymptotics that go into statistical inference, where we say, okay, we're going to assume that this is approximately normally distributed, or at least that our test statistic, like this difference in means, is normally distributed. We don't need any of that.
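
                                                 Here is a minimal sketch of that procedure for a simple two-arm experiment, as a generic Fisherian randomization test rather than code from any particular study. In general the re-randomization step should mirror however treatment was actually assigned; a plain permutation matches complete randomization.

```python
import numpy as np

def randomization_test(outcomes, treated, n_draws=10_000, seed=0):
    """Fisherian randomization test of the sharp null of no effect for any unit.

    outcomes: array of observed outcomes (e.g. revenue per user)
    treated:  boolean array, the treatment assignment as actually randomized
    """
    rng = np.random.default_rng(seed)
    observed = outcomes[treated].mean() - outcomes[~treated].mean()

    null_diffs = np.empty(n_draws)
    for i in range(n_draws):
        # Under the sharp null the outcomes are fixed; only the assignment is
        # random, so we re-draw assignments the way we originally randomized.
        reassigned = rng.permutation(treated)
        null_diffs[i] = outcomes[reassigned].mean() - outcomes[~reassigned].mean()

    # Two-sided p-value: how extreme is the observed difference compared with
    # the distribution we would see if treatment had no effect at all?
    p_value = np.mean(np.abs(null_diffs) >= abs(observed))
    return observed, p_value
```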

                                                 That can be a really powerful tool that can be applied in cases ... Yeah, in cases where we don't even know what the relevant asymptotics would be, which is especially true for people in networks.

Speaker 1:                           Yeah. I think this is a very good quest to go on, because at least some of the recent work on this topic has suggested that a lot of the assumptions that people typically make when they're testing whether an experiment had an effect or when computing the uncertainty regarding that effect, those assumptions are wrong. Coming up with an alternative way of learning about the data in that case is really important.

Dean Eckles:                       Yeah I think actually there's a recent paper in an economics journal, reanalyzing a whole bunch of economics papers in this way right?

Speaker 1:                           Yeah, yeah exactly.

Dean Eckles:                       Right, and they find that a number of the results that were statistically significant, and that were key results in the original papers, aren't so significant anymore if you drop the parametric assumptions and use Fisherian randomization inference in just the way that we described. Many of the results, or the conclusions, in the literature were really sensitive to the choice of those parametric assumptions. That's worrying.

Speaker 1:                           Yeah, agreed. All right, so we're almost out of time. One last question I'd like to ask you, oriented towards the academic audience that's listening: do you have any advice for academics on working with companies? Very broad question, I guess.

Dean Eckles:                       Yeah. When I was working at Facebook we collaborated with many academics. I had the pleasure of having multiple faculty members as my visitors there, and there was other useful collaboration. I've sometimes thought about things from that perspective. I think one big issue is that a lot of times academics are working on a slightly different time schedule than businesses, especially smaller internet businesses that can move so quickly. As we discussed, they can do an experiment and look at the results in a week's time.

                                                 That's usually not how academics operate. Being willing to say, oh, okay, there's this period of time when I can dedicate my attention mainly to what's going on in your business, so we can be synced up on approximately the same schedule, that's going to be key to getting into somebody's process for iterating on doing experiments, iterating on research. Otherwise it's often the case that academics are solving problems on just a totally different time schedule than is relevant to the business, and I think that doesn't really help establish their credibility and usefulness, and it often means that they're not able to get as much help from the company in implementing some of the research ideas that they want to implement.

Speaker 1:                           Yeah, I can second that. I think this time scale is actually maybe the key challenge, other than getting your foot in the door. Because academics are working on many projects at once. More senior academics especially aren't even doing the analysis for the projects that they're working on. They don't really even have the ability to devote, let's say, a month to work on just one project. Creating those opportunities and having an institutional setting that is understanding of that I think is really important.

Dean Eckles:                       Yeah I think one other comment that I would make is a lot of times outside researchers just assume that there's a magic dataset. A magic comma separated file or database table that corresponds exactly to their research question, and that all that needs to happen in order for them to do useful research for academic purposes and to help the business would be to have that CSV and analyze it. That's usually not the case. Usually the data has been created for some other specific business purposes, and maybe is not in exactly the format you want. Also, I think that kind of thinking is really limiting because often the best data is the data that you create yourself, whether that's by running a randomized experiment so that one of the columns in that CSV is your randomization, or whether that's because you've gotten involved in actually how the outcomes that you care about, or the exposures you care about are logged. Actually how everything is instrumented.

                                                This idea that it's just like oh there's some CSV out there that's the magic one that I want is both unrealistic and really limiting compared with what can actually be possible if you can intervene or you can measure new things.

Speaker 1:                           I completely agree with that. Although, I will say that this depends very much on your arrangement. You're very much thinking about the case where a person is literally coming in to sit in the company and work with a person there, but oftentimes the company doesn't have the resources to even dedicate a person to you, or they don't have dedicated researchers, and in that case your ability to influence the company to run experiments or to instrument things is quite limited. In those cases, I would actually say that one of the important things, going back to your first point, is to show something useful quickly in order to gain trust. It's a very iterative process. You ultimately want to get to the position where you can have productive input on experiments and instrumentation and other things that you may care about, but that's always a long-run goal rather than a short-run goal, in my experience.

Dean Eckles:                       Yeah, I think there's a lot of truth to that, though I'd also say especially for small fast moving internet companies, the work required for them to run a new randomized experiment that does sort of something that would be useful for your research and useful for them, that effort is often smaller than the effort required to construct some historical data set that is this magical CSV that you were thinking of. Because their archiving might not be that great. Formats of things have changed over time, but they're running AB tests all the time probably, and so running yet another one might actually be a lot easier than grabbing historical data. Having some appreciation of what is easy, what's hard given their data infrastructure, given their experimentation infrastructure, can be useful.

Speaker 1:                           Yeah, although I will say also one more thing which is that it depends very much on what the experiment is. It might be easy to change the copy on the site, or the color of a button. It might be more difficult to do an experiment on prices for example, or on some other major part of the website which the users have expectations about.

Dean Eckles:                       Yeah definitely. That comes back to what we were talking about earlier, is that some experiments tweak or change parameters that already exist, whereas other experiments, if you're trying to take a huge step in the design space, the work is not necessarily in setting up the experiment. The work is in designing that alternative that is very far from where you are now. That might require a lot of design and engineering work.

Speaker 1:                           Well all right, thanks so much for joining me, and I've learned a lot and I hope our listeners have as well.

Dean Eckles:                       It was my pleasure.