Hao Li likes to push the cutting edge of 3D reconstruction and computer graphics. Hao is an Assistant Professor of Computer Science at the University of Southern California, where he directs the Vision and Graphics Lab. He is also co-founder and CEO of Pinscreen, a startup currently in stealth mode. In 2013, MIT Technology Review named Hao one of its 35 Innovators Under 35.
His background is in creating special effects for the movie industry, and his research now focuses on improving the virtual reality experience, including understanding facial expressions under the VR headset and converting 2D video to 3D. It's fascinating stuff.
Here are some other things we talk about:
-How did you get interested in graphics?
-How do you take a project from beginning to end?
-How far along is the facial performance sensing tech?
-How does your facial sensing tech work?
David Kruse: Hey everyone. Welcome to another episode of Flyover Labs. Today we are lucky enough to have Hao Li with us. Hao is a Computer Science Professor at USC and Co-Founder of Pinscreen, and he's worked on some amazing projects in his past. His experience is around creating special effects for the movie industry, and now his research is more around improving the virtual reality experience, including understanding facial patterns under the VR headset. It's pretty fascinating stuff. So I invited Hao on the show to learn more about his background, how he thinks and approaches projects, and what he's really interested in now. So Hao, thanks for coming on the show.
Hao Li: Well, thank you for inviting me.
David Kruse: Definitely. So let's start off with your background – you've got a really interesting one. Can you tell us a little bit about it before we dive into what you are working on?
Hao Li: Yeah, sure. I'm a computer scientist; I studied computer science in Germany. I'm actually German. I did my PhD in Switzerland at ETH Zurich, spent some time as a postdoc at Columbia and Princeton, went to ILM for a year, and then I started as an assistant professor at USC.
David Kruse: Got you, okay. And so how did you make it over to the United States? What was your first visit here?
Hao Li: Yeah, when I was a PhD student I had a summer internship at Stanford; that was pretty cool, and I made a lot of connections there. Later on, while I was still doing my PhD, I did another internship at ILM. So that was kind of my first contact with the U.S. When I was about to graduate, I was thinking about whether I wanted to go into industry or stay a bit longer in academia, and I went to give talks at a couple of schools in the U.S. I was so impressed by the students and the research environment that I thought I had to spend at least one or two years as a postdoc there, just to see how things work here. I think I learned a lot, and since then, it's pretty hard to go back.
David Kruse: Really, that's interesting. So why were you so impressed, you know, compared to Europe?
Hao Li: Yeah, I think the entire work attitude is quite different. It's a lot more competitive here. The students and professors are a lot more hands-on. The research projects are a lot riskier and less incremental, and I think that's just super exciting. It really allows you to do big things and sometimes think outside the box. I think that's part of the culture here.
David Kruse: Interesting, okay. And how did you get interested and involved in computer graphics and then computer vision?
Hao Li: Yeah, so I was always amazed by the effects – that was back in the 90's, when everyone saw really realistic, photoreal effects in movies for the first time – and that got me hooked. When I was studying computer science, computer graphics reached a level where it was already really, really good, and then the question was: what is left to do? What are the interesting problems? Since then I've been looking a lot into 3D reconstruction problems, which is more of a computer vision topic. And yeah, since then I've always been working at the boundary between graphics and vision.
David Kruse: Interesting, okay. And do you remember one of the first projects that you worked on around computer graphics?
Hao Li: Yeah, yeah, absolutely. My first undergrad project was building a 3D scanner. That was a long time before the Kinect and all that stuff came out. The idea was to build a 3D scanner that isn't based on expensive equipment, but on things that are easily available, like a projector and an SLR camera. I built a triangulation system that lets you put any object in the scene and visualize it in 3D. That was just for capturing static objects, and then very quickly you wonder how you can capture dynamic objects and make sense out of them. That was basically the topic of my entire PhD.
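The projector-plus-camera triangulation Hao describes can be sketched roughly like this – a toy example with an idealized calibration; the function name and the specific geometry are illustrative assumptions, not details from the interview:

```python
import numpy as np

def triangulate(cam_origin, cam_dir, plane_point, plane_normal):
    """Intersect a camera pixel's viewing ray with a projector light plane.

    In a structured-light scanner each projected stripe defines a plane in
    space; the 3D surface point lies where the camera ray crosses it.
    """
    cam_dir = np.asarray(cam_dir, dtype=float)
    cam_dir = cam_dir / np.linalg.norm(cam_dir)
    # Solve for t in: dot(n, (origin + t * dir) - plane_point) = 0
    t = np.dot(plane_normal, plane_point - cam_origin) / np.dot(plane_normal, cam_dir)
    return cam_origin + t * cam_dir

# Toy setup: camera at the origin, light plane x = 1 (normal along X).
point = triangulate(np.array([0.0, 0.0, 0.0]),
                    np.array([1.0, 0.0, 1.0]),
                    np.array([1.0, 0.0, 0.0]),
                    np.array([1.0, 0.0, 0.0]))
# point is (1, 0, 1): one unit to the side and one unit deep.
```

A real scanner sweeps many stripes and decodes which stripe hit which pixel, but each recovered 3D point comes down to a ray–plane intersection like this one.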
David Kruse: Got you. Wow! So did you get it to work by the end?
Hao Li: Yeah, of course.
David Kruse: Of course.
Hao Li: Yeah, so in undergrad we built this static scanner and captured a lot of interesting 3D objects that we could visualize in 3D; that was really cool. But later on you want to see what happens if you capture an entire full-body person, or a 3D face; you want the temporal information too. That leads to – it's around the time the Kinect came out, and you had all these ideas for 3D scanners. This is still an active research area; a lot of people are trying to create 3D point-cloud videos for VR visualization, going beyond the existing 2D videos of things.
David Kruse: Yes, definitely.
Hao Li: Right.
David Kruse: And could you give maybe an overview of some of your favorite projects in the past, just so people get a feel for what you have done?
Hao Li: Yeah. So I think the most interesting part is during my PhD. The first project I worked on assumed you have a real-time 3D scanner – basically a Kinect, though this was actually before the Kinect came out. It's a device that can capture a 3D environment from a single point of view, and instead of getting RGB colors you get a 2.5D depth map. The problem with this information is that it looks pretty cool and you get a 3D point cloud, but the computer cannot make any sense out of it. The second problem is that the information is incomplete: you only see things from a single view. In order to make 3D content interesting, you need to be able to visualize it from many different views; you need to be able to see everything. So this 3D – or, if you look at the 4D reconstruction problem – requires you to solve another problem in computer vision called the correspondence problem: how can you relate different shapes that have been captured either from different viewpoints or at adjacent times in a sequence, and how can you align them to each other? One of the milestones I achieved in my PhD was being able to estimate those correspondences using a continuous optimization. So what I mean is that, given a sequence of 3D scans, you can try to assemble them together. What you can do with that is obtain a complete model even though you only see one view at a time. Another thing you can do with it is track the surface of a three-dimensional object, which is very interesting, because traditionally if you want to track the motion of a person you would need to put markers on the subject, whereas here you get a dense surface trajectory on any human, and the person can wear anything – jeans, a dress. So you can do all these kinds of new things.
What was really impactful was that you can actually do very realistic facial tracking, and that was one of the things that had a great impact in the visual effects industry, where people try to capture realistic facial expressions. For these kinds of things you need a markerless solution that can capture really dense information about the surface.
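The alignment idea behind assembling scans can be illustrated with a classic ICP-style step – a minimal sketch assuming point-to-point correspondences and purely rigid motion; Hao's actual work handles non-rigid deformation, which this toy version does not:

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst,
    given point-to-point correspondences (Kabsch / Procrustes)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_d - R @ mu_s

def icp_step(src, dst):
    """One iteration of iterative closest point: guess correspondences by
    nearest neighbour, then rigidly align (brute force, fine for a toy)."""
    idx = np.argmin(((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1), axis=1)
    R, t = best_rigid_transform(src, dst[idx])
    return src @ R.T + t
```

Iterating `icp_step` until the points stop moving is the textbook way to align two partial scans; the "continuous optimization" for deforming surfaces that Hao mentions generalizes this far beyond a single rigid transform.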
David Kruse: Interesting. And so can you walk us through an example – are you reconstructing or taking a 3D image of an entire room, or would it be more like an object, like a ball?
Hao Li: My research focuses a lot on humans. There is a lot of work on 3D reconstruction – a lot of people work on trying to reconstruct an entire room, and there are related problems there – but my focus is more on capturing humans: the bodies and faces of humans, including their hair.
David Kruse: Got you. And so if you have a 3D image of, let's say, the front of a person and the sides and the back, can you start putting together a full model, or what do you need in order to put the full 360 model together?
Hao Li: Yeah, yeah. So the challenge here is – first of all, if you capture people from different views and your acquisition setting is uncalibrated, you don't know from which angle each view has been captured, so you need to be able to align them together. The interesting problem with the human body is that the person is not fully static. It might be moving because it's alive, or because the person is doing a certain performance. So you want to be able to find these correspondences, yeah.
David Kruse: Interesting, okay. And what's the challenge around hair? I know you have done some projects around that.
Hao Li: Yeah, yeah. So for hair, one aspect of capturing it is trying to get the geometry right, and the second is trying to capture the animation. Both problems are extremely ill-posed. The problem is that human hair has a different structure than the skin or the garments, because it's not a surface. It has really intricate, very complex structure – little strands, curly hair, a lot of occlusion. And you only see the outside of the hair; you don't see the volumetric structure inside. But if you want to capture something that is useful, in the sense that you can reanimate it or simulate it, you need to know what's inside. The usual process for an artist to create hair is to do it semi-manually, combined with some hair generation tools. What we try to do is automate this process by capturing the hair from different views, or even from just a single photograph, and reconstructing the hair model.
David Kruse: And is this a 2D photograph that you use, or would it be 3D, I would assume?
Hao Li: We can reconstruct 3D hair models from a single 2D image.
David Kruse: Really? Wow!
Hao Li: Right. The way to think about this is that you can give an artist a reference picture of yourself or another person, and he can interpret the entire geometric structure just by looking at it. He can use his imagination to fill in all the unseen regions. What we are trying to do is create data-driven learning algorithms that can do exactly the same thing.
David Kruse: Wow! That's clever. Okay, I never thought about it, but that's a good way to think about it – how an artist would do it. Interesting. And how do you figure out what projects to tackle and work on? There must be so many different potential projects. Why did you get involved with hair? What are you interested in now?
Hao Li: Yeah, I mean, hair is an important part of the human body. I'm interested in the human body because it has an extremely wide range of applications, right. So it's application-driven: if you can capture a human body you can create avatars, you can change the way people communicate in the digital world. We are always trying to bridge the physical and the digital world, and the human body is extremely difficult to digitize. Hair is an important part of it – we can't leave it out, right. So that's basically how we choose which problems to work on. On the algorithmic side, we are exploring specific types of algorithms to solve this type of application. The algorithms we are looking at are mostly data-driven and deep learning methods, where you use a lot of data to train a system that can infer these types of geometric shapes. The reason is that a lot of these things are very difficult to simulate, and a lot of them require artistic or manual work, and we are trying to automate that process.
David Kruse: Yeah, wow. And like you said, with VR and AR and the special effects, it's going to be important to automate a lot of this, because you can't do it all manually…
Hao Li: Exactly. I mean, it's not limited to AR or VR, right. That's an important application, an important platform for deploying visual content nowadays. But in a very general sense, what I'm trying to do is focus on content creation – content creation that is deployable, usable by anyone, or even by no one: it just creates itself.
David Kruse: Interesting. So with your algorithms, could you take a 2D picture of almost any object and create a 3D model, or are they finely tuned to specific objects?
Hao Li: Not yet.
David Kruse: Okay.
Hao Li: Not yet, but that's the way to go, right. There is a lot of research going on in this direction. Very often people constrain themselves to a very specific area – for example, static objects like furniture; furniture is important for robots navigating indoors. What I focus on are dynamic objects with really complex deformations, and the human is one example of that.
David Kruse: Yeah, so you like tough challenges. Human hair is a little bit tougher than a table, I would imagine.
Hao Li: Well, you know, it sounds tough, but there are things we can actually exploit, right. There is a lot of human data out there, compared to, for example, animals. For 3D animals there is a lot less, simply because people haven't captured them. But if you just look at the human face, there is so much data on Facebook, and that actually helps a lot with these problems. In computer vision a lot of people are focused on facial recognition, and one reason why it's so advanced is that there is so much data out there.
David Kruse: Yeah, definitely. And okay, one technology that's kind of near and dear to my heart – I like meeting people face to face, and with virtual reality I think that could be possible, but the problem is you can never see somebody's face. So could you tell everyone a little bit more about your facial performance sensing technology and how it works?
Hao Li: Right, yeah. So the idea came from a conversation with Oculus Chief Scientist Michael Abrash, when he came to visit us at USC and gave a talk. We were talking about how we want to track faces and all that, and we had this conversation: okay, so how do you track a face that is occluded? And we said, how about we just work on that as a research project. So the thing you want to do is capture as much as possible. What we did is build a prototype system where anything that isn't occluded – the lower part of the face, including the mouth and some parts of the cheek – is captured using an external camera mounted on the headset. For the region inside, we were thinking it's very dark and very constrained, so we used contact sensors there to capture the facial expressions around the eyes. And later on we have a new work that is going to be presented at SIGGRAPH Asia later this year – it just got accepted – where we changed the way we acquire this a little bit. The mouth region is still captured using a camera, only a 2D camera this time, and the eye region is captured using integrated cameras. So we have these eye-gaze tracking cameras inside the HMD, and they see enough of the region around the eye that we can reconstruct realistic, full facial expressions.
David Kruse: Wow! But how does the camera see inside the headset? It's not even lit up.
Hao Li: Yeah, it uses an IR camera and some IR illumination. So even inside a completely dark HMD you can actually see the region around the eye and how it deforms. You can't really see the eyeball, but based on how the eye region moves you can infer how the eyebrow is moving. The reason is that there are just a few muscles around the eyes, and if there is a way to tell how these muscles are moving, you can make the face move. But we are not explicitly simulating muscles; we're using a much simpler model that is based on linear blendshapes.
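A linear blendshape model of the kind Hao mentions adds weighted expression offsets to a neutral face mesh. A minimal sketch – the mesh, the offsets, and the expression names are all hypothetical toy data:

```python
import numpy as np

def blend(neutral, deltas, weights):
    """Linear blendshape model: the animated face is the neutral mesh plus
    a weighted sum of per-expression vertex offsets (brow raise, smile, ...)."""
    return neutral + np.tensordot(weights, deltas, axes=1)

# Toy face: 3 vertices, two hypothetical expression blendshapes.
neutral = np.zeros((3, 3))
deltas = np.array([
    [[0.0, 0.1, 0.0]] * 3,   # hypothetical "brow raise" offsets per vertex
    [[0.1, 0.0, 0.0]] * 3,   # hypothetical "smile" offsets per vertex
])
face = blend(neutral, deltas, np.array([0.5, 1.0]))  # half brow raise, full smile
```

Tracking then reduces to estimating that small weight vector per frame, which is why the model is so much cheaper than simulating muscles.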
David Kruse: Got you, interesting. Yeah, it sounds like a lot of the work is around – I don't know if estimating is the right word – estimating geometric shapes and their deformations. Interesting, okay. How far along is that project with the facial sensing under the VR headset?
Hao Li: Yeah, I mean, I think it works. It's a good solution. I haven't put out the video, but I can send you some examples later.
David Kruse: Yes, I need to see that video. Like I said, everyone hates conference calls because there are always technology issues. But if everyone had your headset – oh man, that would be a good deal.
Hao Li: Yeah, I would say the limit right now is that we can create characters that are Pixar-like, a little animation-like. For very realistic faces, I still think it's very difficult to create the animation.
David Kruse: Okay. So having my actual face is probably down the road a ways, but a decent avatar – you can probably do that right now.
Hao Li: Exactly, exactly. We have ways to build an avatar from a single image. But having something that is photoreal, that is indistinguishable from the real person, is still very difficult to do from a single image. Even with a high-end capture setting, people are still struggling to do it in an automatic way. It still involves a lot of artist work; especially for the animation, getting everything right is still very difficult.
David Kruse: Got you, okay, that makes sense. And can you give an example of how you look at the person's eye under the VR headset? Do you have points that you're tracking, so that if a point moves you move it on the avatar, or how do you set it up?
Hao Li: You mean how do I animate the avatar from the camera?
David Kruse: Yes, right – how do you take what you have learned from the person's expression and then transfer it to the avatar?
Hao Li: Yeah, so the traditional approach – what most people do for facial tracking – is to find robust features on the face and track those features over different frames, usually with very robust computer vision algorithms and a model that can infer 2D or 3D motion of the face. One thing that we developed here is that, instead of exclusively trying to find specific facial features, like lip contours, we take the entire image and use deep learning to infer, from the whole image, the parameters of a complex face model. That's our latest finding, and the important thing to note here is that we finally have a way to create motions that are very difficult to see. For example, when you have a conversation your lips take really complex shapes. You might be biting the lower lip, so the lower lip can't even be seen. Existing methods would try to track the lower lip even though it's not seen, so you get something that's wrong. Whereas what we do is take every pixel of the image and map that directly to a complex facial expression.
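The idea of regressing from the whole image straight to model parameters, with no explicit feature tracking in between, can be illustrated with a toy regressor. Here a linear least-squares fit stands in for the convolutional network, and the frames and coefficients are entirely synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data: 200 flattened "mouth camera" frames of 64 pixels,
# each paired with 5 expression coefficients generated by a hidden linear map.
images = rng.normal(size=(200, 64))
coeffs = images @ rng.normal(size=(64, 5))

# Fit one weight matrix mapping every pixel directly to the coefficient
# vector - the linear analogue of the image-to-parameters network.
W, *_ = np.linalg.lstsq(images, coeffs, rcond=None)

def predict(frame):
    """Map a whole frame straight to expression coefficients."""
    return frame @ W
```

The point of the sketch is the shape of the mapping, not its power: a real system replaces the single matrix with a deep network so it can handle lighting, occlusion, and the genuinely non-linear pixel-to-expression relationship.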
David Kruse: Interesting, okay. Wow! And you mentioned neural networks – what types of algorithms are you using for a lot of your research?
Hao Li: Yeah, so the type of neural networks that we use are convolutional neural networks, and this is something that is really changing the entire field of computer vision right now. What you are able to do with them is classify things much more accurately than with any hand-crafted feature descriptor designed in the classic way. The only limitation is that you need a lot of training data in order to get an accurate inference model. The training data can either be simulated, or we can find a very elegant way of collecting it. In our case, for the HMD project, we used sound. We asked people to speak specific sentences and used the audio signal to dynamically time-warp all of them, so that we could label the data. So for these networks, especially convolutional networks, you need a lot of training data in order to learn the model.
David Kruse: And what was that project for, the one where you're using sound?
Hao Li: That's for the HMD project. At test time we track the face without sound, but during training we actually use the sound of the person speaking. We ask people to say specific sentences where we know what they correspond to. We use what are called Harvard sentences, which have a balanced distribution of phonemes, and that actually helps us produce really realistic speech animations.
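The audio-based labeling Hao describes relies on dynamically time-warping a spoken sentence onto a reference recording. A minimal 1-D sketch of dynamic time warping – toy sequences, not the actual audio features:

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping cost between two 1-D sequences.

    This is the alignment trick: a spoken sentence can be warped onto a
    reference recording so each frame inherits the reference's labels,
    even when the speakers talk at different speeds.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-cost table
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, or step in both.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A signal aligns perfectly (zero cost) with a slowed-down copy of itself.
x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0])
# dtw(x, y) == 0.0
```

Backtracking through the table `D` yields the frame-to-frame alignment path, which is what lets every video frame be labeled from the reference.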
David Kruse: Interesting – and the HMD, that's the head-mounted display? Okay, got you, all right. Interesting. I hadn't thought of it, but it makes a lot of sense that you want to use voice so you can mirror that on the avatar. Interesting. Okay, so how many samples do you need?
Hao Li: Right, right. It depends. Usually we don't know exactly what the right number is, but after data augmentation we have around 0.5 million frames. That's actually quite a lot; it might work with less, and it might work better with more. But the training takes a while – if it's just fine-tuning, it takes a day; if you are training something from scratch, it could take up to a week.
David Kruse: Wow, okay yeah.
Hao Li: Right.
David Kruse: And it probably depends on how many GPUs you have running.
Hao Li: Yeah, right now I'm just using one good GPU, an NVIDIA card, and then yeah.
David Kruse: Nice. So can you share anything on Pinscreen, your startup? I know it's in stealth mode, so it's fine if you can't.
Hao Li: Yeah, at Pinscreen we have been showing a few features at a couple of conferences. The feature is basically building virtual avatars from a single image. What we can do now is take any image – you can take a picture, or take an image from the internet; it doesn't have to be captured under controlled conditions, it's just whatever image – and build a 3D avatar from it, including the hair. That is something that nobody else really has. And we can animate it using a simple webcam, so we can create realistic facial expressions on an avatar that's been reconstructed from a single image. But that's just one feature of what we are building at Pinscreen. Pinscreen is going to be a social media platform. It's not going to be the same thing as Masquerade or Snapchat lenses, but it's something I think is more fun – something complementary to what people have done before, so definitely something new.
David Kruse: Interesting. Well, it seems timely with AR and Pokemon Go, so maybe it’s good timing, what you are working on.
Hao Li: Yeah, yeah correct, correct.
David Kruse: People are finally realizing the uniqueness and the power of AR. Okay, well – when do you think you will launch? Do you have any idea with Pinscreen?
Hao Li: Hopefully very soon. We are crunching right now. Hopefully in a month or two we might have something.
David Kruse: Interesting, okay. Well I’m excited to check it out, that’s for sure.
Hao Li: Yeah.
David Kruse: And so I'm curious – we are nearing the end of the interview, unfortunately – but in the next three to five years, where do you want your research to be around tracking humans in motion and capturing all that digitally?
Hao Li: Yeah, I mean, on the lower level, what I hope I can achieve is building entire human bodies from very few views, and having a deeper understanding of what we capture – there must be a purpose for whatever we capture. We want to be able to understand what people are actually doing. Right now we are still scratching the surface with low-level problems like 3D reconstruction, but we want to get a lot more out of it. We want to be able to understand what a person's intention is. You can imagine that someday you will have self-driving cars that have social skills, maybe, right. They might be able to interact with pedestrians, and the question is whether you go the way of traditional computer vision techniques, where the car just localizes a bounding box and says "this is a human," or whether it can actually perceive humans as well as a human would. If you look at a person and their actions, the thought process, the reasoning process behind them is extremely complex, and I hope I can bridge this whole idea of 3D reconstruction and temporal facial reconstruction closer to the whole idea of AI and reasoning.
David Kruse: Interesting. Well, that will be brilliant. And I mean it seems like it could be quite helpful in robotics too, like just train robots.
Hao Li: Absolutely, yeah. Absolutely. I mean, you want to have robots that can interact with people, robots that can assist in all kinds of stuff – rescue missions, all kinds of things which aren't possible right now. So it's going to be an important step toward that direction. I think AR and VR is an interesting platform – actually an important platform – but it's not limited to this. I think the real challenge is how we can use this type of technology to improve AI, communication, etcetera.
David Kruse: Oh yeah, exactly, great. Better conference calls – most people want that, and I certainly do, so…
Hao Li: True, true.
David Kruse: I know, it sounds very exciting. And of course there is also a lot to look at in medical applications, probably. Yeah, interesting, okay.
Hao Li: Yeah, hopefully our algorithms can find uses in other areas too, right.
David Kruse: Yes, yes. Yeah, I mean, I know companies that are working on tracking people's gaits for medical reasons, or just the number of steps. You could take that to a whole other level, like, hey, you're…
Hao Li: Hopefully.
David Kruse: Yes, down the road, down the road, interesting. And then the last question I have for you is a little more on the personal level. On your LinkedIn, somebody gives you a recommendation and says that you're kind of crazy, but in a really good way. And I love that – that's the best recommendation I could think of. So I'm curious about that from your standpoint; you know, I've always liked to be a little crazy, because that's what I think makes life interesting.
Hao Li: I don’t think I’m trying to be crazy, I think – I just try to be myself, right, so…
David Kruse: All right, fair enough. So I was curious why he said you were crazy. Maybe it's because you just have…
Hao Li: I think that guy is crazy.
David Kruse: Yeah, fair enough – and it could just be because, I mean, you have lots of good ideas, and you can tell you are quite passionate about what you are doing, and you keep…
Hao Li: I mean, it's a very fun field, and the important thing is that this field is really changing a lot at this moment, right. The computer vision community is changing; there are a lot of advances happening. All of that together makes what we're working on super cool, actually.
David Kruse: Oh yeah, yeah. And there are so many intersections with different industries and fields – you're in a very interesting space. And I think that just about does it for the interview, which is too bad; I could talk for a while. But this was fascinating, and I really appreciate you telling us more about your research and your background. I know I've learned a lot, and I hope everyone has too.
Hao Li: Cool, thank you.
David Kruse: And thanks everyone for listening to another episode. I appreciate it and we’ll see you next time. Bye.
Hao Li: Yeah, see you. Thank you. Bye.