Casual Talk, Space, and Bitemporal Tables
Jul 26, 2021
Jordy and Antonio meet after a vacation hiatus or two. They talk space, life, and are even able to sneak in some data science.
If you’d like to see our show continue, reach out to the hosts.
3 Projects That Can Make You a Household* Name
Jun 08, 2021
*Household name is a bit of exaggeration – but certainly a known name in the ML community.
Today we are joined by Alex Beutel, KDD Cup co-chair and senior staff research scientist & tech lead / manager at Google, to discuss the three projects for this year’s KDD Cup. The diversity in these projects, and in the possible ways to solve them, is amazing. Even if you didn’t get a chance to officially enter the competition this year, think about how you would solve them, and then listen to how the winners of the contest were able to achieve top scores.
These are the three projects that we discuss during the show:
“… KDD 2021, the longest running and largest interdisciplinary conference on data science where the biggest names in the industry and academia come together to drive innovation in AI, machine learning, computer vision and more. Originally planned to take place in Singapore, the conference will be 100% virtual this year. As an added bonus, organizers will offer key content twice, on eight-hour intervals to ensure equitable access across time zones.”
Okay, I borrowed that last part from one of their press releases but if you don’t know KDD, you better ask somebody.
‘Data Science’ is an evolution. What does it mean now in the Workplace? w/ Erin Stanton
May 31, 2021
We didn’t always have ‘Data Science’ but we’ve had data forever. We’re joined by Erin Stanton to learn about the evolution of Data Science in her work and personal life.
Erin Stanton has 13+ years of experience in big data and data science and she currently runs the Global Client Support organization for Virtu Analytics. Erin is known for her big energy which she brings to everything she does and has been more recently championing potential machine learning and AI techniques to answer client questions within her day job following her recent completion of her Masters in Data Science from UC Berkeley.
What are NFTs and why do you care? (Fixed episode link)
Mar 29, 2021
Shain and Calvin join us to tell us all about NFTs. NFT stands for non-fungible token – meaning each one represents a unique item – unlike currency, which is fungible and can be aggregated or divided.
NFTs could become a bigger industry than digital currencies, so will you be an early adopter or a latecomer?
I don’t get the vaccine rollout … is using tech for a vaccine appt immoral …
Mar 09, 2021
Antonio and Jordy are back at it. No guests, just the two hosts catching up after 3 months of relative silence. Antonio explains to Jordy why he is befuddled by the vaccine rollout … in addition, he asks whether the script he wrote to find a vaccine appointment is immoral …
We’d like to hear from you so please reach out.
Can you remember this? We talk memory with John Graham, Grandmaster of Memory, and USA Memory Champion
Dec 28, 2020
What would you remember if you had better memory? In this episode we speak to the 2018 USA Memory Champion and Grandmaster of Memory, John Graham aka memoryjohn.com.
As imposters, we have lots of questions. John answers them with the exactness that you’d expect from someone who continues to train their memory. John is a great coach and he’ll even run us through a mental exercise where we memorize something – listen and see if you can follow along.
If you’re interested in learning more, check out John’s website. You can also join Jordy and Antonio as they read Moonwalking with Einstein. Imagine what’s possible if you applied these techniques in 2021.
Math in Data Science with Professor Margot Gerritsen
Dec 14, 2020
Today we are joined by Professor Margot Gerritsen from Stanford University to talk to us about Math in Data Science, Diversity in Data Science, and some of the ideas behind a Growth Mindset.
She is co-founder and co-director of the global Women in Data Science (WiDS) conference, reaching more than 120,000 participants annually in more than 60 countries and inspiring thousands of women around the world to pursue careers in STEM or data science.
Margot recommended the book Mindset by Carol S. Dweck, Ph.D. Antonio is already a few chapters in. Here’s the Goodreads link for those of you interested in learning more.
Additional details about Margot:
Margot Gerritsen is a prominent Stanford professor who radiates a passion for the computational sciences, including data science. She is professor of energy resources engineering at Stanford, the former director of the Stanford Institute for Computational and Mathematical Engineering, and until recently, the senior associate dean at Stanford’s School of Earth, Energy & Environmental Sciences. Margot is also the Chair of the Board of Trustees of SIAM. She has spent nearly her entire lifetime using mathematics to solve a wide variety of complex problems, including reservoir modeling, coastal ocean flows, and sail design for America’s Cup yachts.
Discovering interesting rules: Association Rule Learning
Dec 01, 2020
We’ve been doing this podcast for over three years and I’m surprised that we had never come across and discussed the topic of association rules. Association rule learning is a data mining technique that allows you to find rules in your data. It’s pretty intuitive when you see it, but we were surprised by how hard it was to talk about during the show.
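Since this one is easier to see than to say, here is a minimal sketch of the three numbers that usually come up with association rules: support, confidence, and lift. The shopping baskets and the function names are made up for illustration; they are not from the episode or from any particular library.

```python
# Hypothetical shopping baskets, invented purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rule_metrics(antecedent, consequent, baskets):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), baskets)
    confidence = both / support(antecedent, baskets)
    # Lift above 1 means the consequent shows up more often alongside the
    # antecedent than it does across all baskets.
    lift = confidence / support(consequent, baskets)
    return both, confidence, lift


rule_support, rule_confidence, rule_lift = rule_metrics(
    {"diapers"}, {"beer"}, transactions
)
print(f"support={rule_support:.2f} "
      f"confidence={rule_confidence:.2f} lift={rule_lift:.2f}")
```

Real association rule miners (Apriori, for example) add a search over many candidate itemsets; this sketch only scores one rule that we picked by hand.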
Some resources that we encountered to get our head around the topic:
We know that talking about some of these topics is difficult and sometimes it’s hard to follow and sometimes we don’t actually convey the message easily. We hope that our conversations spark some ideas, ignite you to try some of the code yourself, or do your own research.
Please email us if our shows have spoken to you in one way or another. Also, the first person to email us will get a Data Science Imposters Podcast face mask.
Stock Market: Investing and Data Science
Nov 17, 2020
Jordy and Antonio talk about the stock market. They focus on the data behind the stock market movements and also how you can get started investing in the stock market. Whether you have never traded in the market, are a buy-and-hold investor, or are an avid investor – you’ll enjoy this spirited conversation that could have gone on much longer.
We were excited to do this episode and when we were done, we realized that there’s so much more to talk about but we won’t know if we should or shouldn’t do a follow-up episode unless we hear from our listeners. We look forward to your feedback.
How Emerging Data Scientists Can Path Creative Careers (And How Employers Can Help Get Them There)
Nov 02, 2020
This episode is the panel discussion that we moderated two weeks ago for the Tom Tom Cities Rising Summit (https://www.tomtomfoundation.org/cities-rising-summit).
Here’s the description provided for the event: Join regional data science leaders for a discussion that will touch on best practices for college graduates to mid-career professionals who are seeking rewarding employment opportunities, as well as insights into how organizations can build the data science team that best suits their needs.
Listen to the perspective of these fantastic panelists:
Miriam Friedel – Director of Software Engineering at Capital One
Kerry Guerrero – Senior Data Scientist at S&P Global
Renee Teate – Director of Data Science at HelioCampus
Subscribe, tell your friends, and if you’d like us to moderate an event at your conference, please reach out.
Is Feature Engineering a low hanging fruit?
Oct 19, 2020
In this episode, Antonio explains to Jordy what he knows about feature engineering from work, Kaggle projects, and general research. Antonio talks about featuretools which he was able to use recently. Feature Engineering appears to be an area that could really enhance machine learning in significant ways. What do you think? Do we need experts to tell us what’s important or is there another way?
Can Team Sports Take Advantage of Reinforcement Learning?
Oct 05, 2020
In team sports, you have players – sometimes a couple, sometimes many players – all going for a goal. This is very different than games or sports with one player. The idea of collaboration makes everything harder.
In this episode, Edward Rusu tells us all about Multiagent Reinforcement Learning (aka MARL). Edward is a researcher and developer. He’s very excited about the work he’s doing and cannot wait to share it with you all. We enjoyed our time with Edward and he’s always up for connecting to anyone interested in the topic. If you’re interested, feel free to reach out to him via LinkedIn https://www.linkedin.com/in/edward-rusu/
This is our second show about reinforcement learning. We strongly believe that hearing about a topic makes it easier to relate to and will also remove some of the fear of trying out the techniques yourself. Whether you are an avid developer or just come here to learn a bit more, do not be afraid to jump in and try new things.
If you’ve gotten all the way down here, drop us an email, give us a nice review, send us an advertiser, or just smile that you read this.
Do computers learn using positive reinforcement or negative reinforcement?
Sep 22, 2020
Nothing feels more like Artificial Intelligence than when a computer learns by itself through repeated simulations. Computers can now master games by simply playing the game over and over against itself. This is reinforcement learning.
Today we have David Stroud on the show to explain a bit of reinforcement learning to us. He’ll give us the foundation needed to follow and understand Reinforcement Learning. David Stroud is a Reinforcement Learning Researcher and Lecturer of Information Systems and Quantitative Methods at Troy University.
After listening to the episode, do you think computers learn using positive or negative reinforcement? We want to hear your thoughts.
Did AI write this blog or was it Liam?
Sep 08, 2020
Liam Porr wrote ‘Feeling Unproductive? Maybe you should stop overthinking’. The post had thousands of visitors and made it to the top of some blogging site. Liam did come up with the title and some of the text but he didn’t do it alone. Liam had help from GPT-3. GPT-3 is the latest readily available natural language processing technology.
Should we be worried that our podcast will soon be obsolete? NLP to sound and we’re done.
A/B Testing, Data Science, & more with Eric Schles
Aug 24, 2020
We asked Eric Schles to come on the show to explain A/B Testing to us. He didn’t stop there. He explained a few other things along the way and as you expect with someone with his vast experience in technology and data science, our conversation took some twists and turns.
Data Analytics and Predictions made easier
Jul 28, 2020
We connect with Shanif Dhanani, Co-founder and CEO of Apteo, to learn how he and his team are making data analytics and predictions easier. If you’re looking to do this in your organization, like we’ve done, you will want to listen and hear about some of the hurdles and how they are being solved.
Big Data through the eyes and mouth of Arun Murthy, co-founder of Hortonworks & CPO of Cloudera
Jul 13, 2020
In this episode, we get to talk to an insider of the big data movement. Arun Murthy is one of the founders of Hortonworks and has spent well over a decade in the big data industry. We ask Arun to walk us through what has changed in the industry, his journey, what is next at Cloudera, and what that brings for enterprise users.
Subscribe, join our mailing list, and send us a note to say hello.
Police Encounters: A data project w/ D. Brian Burghart
Jun 15, 2020
According to FatalEncounters.org, there have been 934 police-involved deaths between January 1, 2020 and June 12, 2020.
How does a lifelong, award-winning journalist, the former editor/publisher of the Reno News & Review and a former journalism instructor at the University of Nevada, Reno create one of the leading data sources for US fatal encounters with the police? This is our conversation today.
Fatal Encounters is a 501(c)3 public charity. If you’re interested in donating your time or money, please visit https://fatalencounters.org/donate/
Post-edit Corrections: “The Geezer grant was for $12,000, not $17,000 and the university I was doing the AI stuff with was UMass, not Harvard.”
Broadcast Radio-inspired AI technology personalizes song transitions w/ Zack Zalon
May 28, 2020
“The future of streaming music is more than just playlists of songs with long gaps of silence,” says Co-Founder and CEO of Super Hi-Fi, Zack Zalon. We are excited to have Zack on the show to tell us a bit about his company, data, and technology he uses.
Like our show? Subscribe. Write a review. Tell us: (antonio|jordy)@datascienceimposters.com, on twitter @dsimposters, or on our reddit channel r/dsimposters
The OODA loop & the Onion Diagram in Data Science w/ David Purdy
May 18, 2020
We’re still in the midst of COVID-19. We wanted to see how someone like David Purdy would approach this problem from a data and data science perspective. He didn’t let us down, introducing high-level concepts like the OODA loop, developed by Colonel John Boyd for the combat operations process, and his own Onion Diagram, which describes the layers within data science and strategy.
David Purdy has been at Uber, Goldman Sachs, worked on autonomous vehicles, and advises startups on data science and strategy. David is an old friend of the show and has one of the most downloaded episodes on Building a Machine Learning Platform: https://datascienceimposters.com/2017/11/26/building-a-machine-learning-platform-interview-with-dr-david-purdy/
Customer Analytics w/ Raymond Collins
May 04, 2020
How do you think about your customers? Not in business? How are businesses thinking about you?
Customer Analytics is interesting because the techniques can be applied to a number of different disciplines.
Raymond Collins is a friend of the show and has 12 years of experience leading data science, customer analytics, and data strategy initiatives.
The Data Science Imposters Podcast is a show for all of us on data, technology, and science. The show has been running for over 3 years.
Supply Chain and Data Science with Tim Gagnon
Apr 20, 2020
In this episode, we discuss supply chain and data science with Tim Gagnon. Tim is the Vice President of Analytics and Data Science at C.H. Robinson and has recently been tapped to lead the newly launched Robinson Labs.
We are joined by four people with four different perspectives about product management. We interviewed them separately and then combined their responses to seven questions about their field.
Soumeya Benghanem is the head of product management at Strategy & Action.
Clement Kao is a product manager at a Financial Technology company based in San Francisco. He is also the Co-Founder of Product Manager HQ.
Alex Santos is a product manager at a Marketing Technology company in New York City.
Lot Serebour is a technology consultant with prior product management experience.
Data Science and Politics
Mar 23, 2020
Today we have a guest that will help us understand data science in politics.
Dr. Andrew Therriault is a Data Science and Strategy Consultant. He teaches a course entitled ‘Data Science for Political Campaigns’ at Harvard. Prior to this, he served as the Chief Data Officer for the City of Boston and the Director of Data Science for the Democratic National Committee.
Scrum Mastering
Mar 09, 2020
On today’s show, we are joined by Hector Gutierrez and Heath Herel to discuss what it means to be a Scrum Master.
Hector is an old friend of the show. Over the course of his 15+ year career, Hector has had the opportunity to work on IT integration projects spanning all aspects of his respective companies.
Heath has been a Scrum Master for 10 years, in several industries. He is the author of Practical Scrum: A User’s Guide, just published in January 2020.
Data Science Buffet
Feb 24, 2020
In this buffet, you get to hear us talk about data science jobs, Tesla removing features, Apple slowing down phones, the terror of interviewing for jobs, and dynamic programming.
We hope you are full after this meal.
Data Science Retreat, Social Good, and ISAs with Jose Quesada @quesada
Feb 11, 2020
Jose Quesada (@quesada) is the CEO and Founder of Data Science Retreat (@datascienceret) in Berlin, Germany. DSR is one of the top market leading, bootcamp-style schools in Europe.
Jose has mentored over 200 projects and has a talent for choosing a project. He gives us his thoughts on what constitutes a successful Machine Learning or Artificial Intelligence project.
Jose introduces us to the idea of income sharing agreements and is a huge proponent of them. He makes compelling arguments in their favor.
Jose makes an even bigger argument about how AI, ML, and Data Science can help humanity. He paints a picture of progress and innovation that would make any pessimist reconsider their stance.
In this episode, we are joined by Kelsey Hightower who schools us on Kubernetes. Kubernetes is the conductor to the orchestra of containers (or something like that).
Kelsey is one of the real people on Twitter answering questions and having conversations. His handle is @kelseyhightower. That’s not his only qualification: he’s also the author of the tutorial Kubernetes The Hard Way which can be found here: https://github.com/kelseyhightower/kubernetes-the-hard-way
Kelsey’s analogies and his clear and concise explanations make it easy for us, data science imposters, to understand and appreciate the power of containers and Kubernetes. We hope you do too. Also, sign up to our newsletter if you’d like to hear some additional content related to this show.
License Plates for Space Satellites – with John Lee @jclee20100
Jan 08, 2020
On this episode, we talk with John Lee, @jclee20100, about ELROI (www.el-roi.space).
Whether it classifies as fully functioning satellites or space trash, there’s a good amount of material orbiting our earth. Locating your own satellite becomes a challenge. Listen to find out how ELROI does it.
By the way, these satellites are getting smaller allowing more corporations or even individuals to launch their own. Hey, we might have to send one up to space when we get that syndication deal.
2019 is a wrap.
Jan 02, 2020
In 2019, our podcast hit major milestones.
* 32 new episodes
* 100,000 downloads
* Sponsored by the University of San Francisco www.usfca.edu/dsi
* Most importantly, we met and interviewed some amazing people: James Savage; Haftan Eckholdt; Noemi Derzsy; Prem Ganeshkumar; Maria Victoria Arroyo; Brian de Heus; Miguel Garcia; Kenneth Reitz; Alaa Moussawi; Elissa Shevinsky; Leigha Jarett; Vikal Kapoor; Bruno Gonçalves; Christian Gonzalez; Reshama Shaikh; Jon Krohn, Ph.D.; Claire McKay Bowen; Mihai Maruseac; John C. Lee; Eric Wong; Peter Lorentzen; Kaitlyn Wojtaszek; Stefan Woort-Menker
Email me at antonio@datascienceimposters.com if you’re interested in sponsoring us, have some ideas for the show, or would like to join us for a chat.
Thank you all for your continued support.
#podcast #2019highlights #reflections
Privacy and Computer Science with Dr. Mihai Maruseac
Dec 17, 2019
In this episode, we have Dr. Mihai Maruseac give us a different perspective about Differential Privacy. His perspective comes from Computer Science instead of Statistics.
Like all of our episodes, we jump around from privacy to other topics to Boston and back.
Want to share your thoughts? Email us at antonio@datascienceimposters.com or jordy@datascienceimposters.com
Fight Artificial Intelligence with this …
Dec 02, 2019
Today on the show, we have Eric Wong who tells us about how you can fight some of those machine learning / artificial intelligence programs with a little of their own medicine.
Eric is a PhD candidate at Carnegie Mellon University and his work is focused on solving the problem of adversarial examples. I’d say he was on the machine’s side if you ask me.
Differential Privacy? w/ Dr. Claire McKay Bowen
Nov 18, 2019
How private is that survey data? Differential Privacy allows us to quantify this and other questions about data and its privacy.
On this episode, we are joined by Dr. Claire McKay Bowen. She is currently a postdoctoral researcher in the Statistical Sciences Group at the Los Alamos National Laboratory studying methods of data privacy, specifically differentially private data synthesis methods.
Please see the article that Dr. Claire McKay Bowen wrote with Dr. Joshua Snoke from RAND Corp.
Deep Learning ‘Spoken’ with Dr. Jon Krohn
Nov 04, 2019
We’ve done our best as imposters to talk about machine learning and in this episode we have an expert join us. Dr. Jon Krohn is the Chief Data Scientist at untapt and is the author of Deep Learning Illustrated.
Jon provided us with a ton of insights about the space and also about what deep learning is not. We think you’ll learn, laugh, and sometimes be surprised in this show.
Want to support our podcast? Two things anyone can do: subscribe and give us a review on Apple iTunes.
Mentoring with Reshama Shaikh
Oct 21, 2019
In this episode, we talk about mentoring with Reshama Shaikh. She explains some of the do’s and don’ts of mentoring.
Reshama is an independent data scientist/statistician and MBA with skills in Python, R and SAS. She is an organizer of the meetup groups NYC Women in Machine Learning and Data Science and NYC PyLadies.
If you’d like to learn more about Reshama or her work, including the post about mentoring, check out her blog.
Did you get my message? Messaging and Queues
Oct 07, 2019
Antonio asserts that this is one of those topics that is rarely talked about, and even more rarely taught in colleges and universities, but is a staple of any enterprise system.
In this episode, Antonio tries to explain the concept to Jordy. Don’t worry, he’ll quiz him later.
Our sponsor is the University of San Francisco. To learn more about their Master’s in Applied Economics, go to usfca.edu/dsi
How does Manufacturing evolve? With software and data, of course.
Sep 30, 2019
We know nothing about manufacturing. Okay, maybe Jordy knows a bit more than Antonio but we, collectively, still know nothing.
In this episode, we talk with Adam Montoya and Christian Gonzalez who give us a lesson in manufacturing and also tell us about Bright Machines. Bright Machines’ tagline is Software-Defined Manufacturing and we want to know what that is.
If you’re down here, we know you’re still listening. We are still actively looking for a venue for a one-day data science conference. Email us if you’d like to partner with us.
Starting an Online Company: Talk with Moonllight.com Founders
Sep 23, 2019
In this episode, we talk to Stefan Woort-Menker and Kaitlyn Wojtaszek, founders of Moonllight.com, about their journey to start an online company.
This is an episode about the challenges of getting your product to market, how the idea was formed, the pitfalls to avoid, and the hype versus reality.
I would say more but then you’d never listen to the episode.
We are working hard to create these episodes and provide appropriate content. We’d love to hear from you about how we are doing and would appreciate any leads for potential advertisers. Email us at Antonio or Jordy @ datascienceimposters.com – Thanks for the support.
NVIDIA takes the driver’s seat
Sep 16, 2019
Rules first. Here we go …
What does the manufacturer of video cards have to do with autonomous or driver-less cars? Find out from our next guest, Nico Koumchatzky.
Nico, Director of AI Infrastructure, helps us navigate through our understanding of this technology. He’s even got advice for those of you starting your career in data science, artificial intelligence, or a related field.
Looking forward to hearing from you on this question: What happens to the passenger side front seat when cars are completely autonomous? Do they become a relic of the past?
Complex Systems Do Not Add Up
Aug 26, 2019
The Complex Systems Society has this on their website ‘The most famous quote about Complex Systems comes from Aristotle who said that “The whole is more than the sum of its parts”‘
In this episode, we invite Dr. Bruno Gonçalves to explain it to us, the data science imposters. Bruno gives us a nice mix of examples blended with the academic references associated with them.
If you deal with data that has relationships, is interconnected, or networked in some way you need to understand the concepts of Complex Systems.
The Art of Digital War: Enter the Blockchain
Jul 30, 2019
Would you give up Control for Convenience? Blockchain will protect you.
Vikal Kapoor joins us on this conversation. Vikal is an experienced entrepreneur whose latest venture as CEO of Dapps is solving customer experience problems using Blockchain technology. Specifically, they are building customer relationship management (CRM) software on top of an Enterprise Blockchain computing platform.
Vikal reminds us that for every convenience that we are afforded by technology, we may be losing a bit of control – control of ourselves and our data. Our Twitter poll taken by 12 people shows that more would give up control for convenience. We think that percentage is much higher (our poll was just very limited).
Vikal paints Blockchain as the hero we didn’t know we needed. Is he right? Only time will be able to tell.
Democratize your data. Do this first to get real results…. (w/ Leigha Jarett)
Jul 15, 2019
Your company wants to catch up with the buzz of data science, machine learning, or artificial intelligence. What do you do first? Hire a PhD in Physics to build artificial intelligence algorithms? Hire a team of consultants?
Leigha Jarett joins us to give us her view. As an employee of Looker, she’s had this conversation with her clients many times. Democratize your data.
As always, we get to know our interviewee and hopefully Leigha teaches Antonio how to perform a hockey stop because we’re tired of him crashing to stop.
In this data science podcast, we mix technology, data science, AI, and a bunch of other data in a casual conversation.
How do I become a serial tech entrepreneur? (with Elissa Shevinsky @ElissaBeth)
Jul 01, 2019
Elissa Shevinsky, CEO of Faster Than Light, gives us some insight about entrepreneurship in technology. As a successful entrepreneur, she’s seen a few things and has gained much knowledge on entrepreneurship which she shares on this episode.
Join us as she walks us through the journey of starting multiple technology companies.
Give us 1 star.
Jun 17, 2019
Antonio and Jordy talk about the company in California that’s so fed up with Yelp that they have requested 1 star reviews – they’ve almost made it mandatory given that they give 50% off their otherwise very expensive pizza pies.
Their conversation then devolves into other tangents.
What’s a p-value and can you trust it?
Jun 10, 2019
The notes below are an inside look at how we structured this week’s episode …
If your p-value is higher than .05, you can’t publish research.
Antonio: What is a p-value?
Jordy: The p-value, or calculated probability, is the probability of finding the observed, or more extreme, results when the null hypothesis (H0) of a study question is true.
Antonio: Okay, so if you’re like me … not a statistician … you want to have simpler language even if the explanation is longer. I reread that and think man … what is this null hypothesis? That’s really what got me hung up. How about you Jordy?
Jordy: ….
Antonio: So, the null hypothesis is that whatever you are trying to test has no significant difference from the population. Your hypothesis, whatever you are trying to prove, is called the alternative hypothesis.
Okay, so let’s make up an example. I’ll use Penn State’s List of 7 steps.
1. Define null hypothesis
So, let’s make up a null hypothesis – College freshmen students gain an average of 10 lbs during their first year of college. Let’s say we have the standard deviation of 3 lbs. Our null hypothesis is that there will be no significant difference between this population and our sample.
2. Define alternative hypothesis
Our alternative hypothesis is that giving students an electronic scale at the beginning of their college year will impact their weight by the end of the year.
3. Set probability / alpha
.05 – 5% ?
Why would you make this number lower than 5% ? What if you get it wrong?
4. Collect Data
Experimental or Observational
5. Calculate the test statistic
Okay, I think this is where the statistician magic happens the most. Essentially, the test statistic compares the data that we collected or observed with our overall population.
I say that the most magic happens here because you need to know a bit about the distribution of the data and the pluses and minuses of each statistic. You then need to plug in the data to the formula – even I can do that part.
6 / 7 – Based on that measure – which is sometimes just how many standard deviations our data is from the population – we then need to measure the likelihood of that.
Now you have the likelihood on it and then you compare if that’s lower than your alpha value. If it is, you can now reject the Null Hypothesis.
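To make steps 5 through 7 concrete, here is a rough sketch using the freshman weight example. The null mean of 10 lbs, the standard deviation of 3 lbs, and the alpha of .05 come from the notes above; the sample mean and sample size are numbers we invented, and a two-sided z-test is just one reasonable choice of test statistic, not necessarily the one a statistician would pick for a real study.

```python
from math import sqrt
from statistics import NormalDist

# From the notes above.
null_mean = 10.0       # average freshman weight gain under the null, in lbs
population_sd = 3.0    # standard deviation, in lbs
alpha = 0.05

# Hypothetical sample of students who were given a scale (made-up numbers).
sample_mean = 9.0
sample_size = 50

# Step 5: the test statistic. The z-score says how many standard errors
# the sample mean sits from the null mean.
standard_error = population_sd / sqrt(sample_size)
z = (sample_mean - null_mean) / standard_error

# Step 6: the p-value. Two-sided, because the alternative hypothesis only
# says the scale will "impact" weight, not in which direction.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 7: compare against alpha.
reject_null = p_value < alpha
print(f"z={z:.3f} p={p_value:.4f} reject null: {reject_null}")
```

With these made-up numbers the p-value lands below .05, so we would reject the null hypothesis. Change the sample mean to 9.5 and the conclusion flips, which is a good way to get a feel for how sensitive the answer is.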
….
It sounds decent to me so why is there such an issue with p-values? Well, I think when people are given one measure for success, they’ll figure out how to beat it, fudge it, or bend the rules a bit.
Inflation bias, also known as “p-hacking” or “selective reporting,” is the misreporting of true effect sizes in published studies (Box 1). It occurs when researchers try out several statistical analyses and/or data eligibility specifications and then selectively report those that produce significant results [12–15].
Common practices that lead to p-hacking include:
conducting analyses midway through experiments to decide whether to continue collecting data [15,16];
recording many response variables and deciding which to report postanalysis [16,17],
deciding whether to include or drop outliers postanalyses [16],
excluding, combining, or splitting treatment groups postanalysis [2],
including or excluding covariates postanalysis [14],
and stopping data exploration if an analysis yields a significant p-value [18,19].
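To see why one of these practices matters, here is a small simulation of "recording many response variables and deciding which to report". Every variable is pure noise, so any significant result is a false positive. The sample size, the count of 20 variables, the number of trials, and the function names are all our own assumptions for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

STANDARD_NORMAL = NormalDist()


def null_p_value(rng, n=30):
    """Two-sided z-test p-value for a sample where the null is really true."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = fmean(sample) * sqrt(n)  # known sd of 1, so standard error is 1/sqrt(n)
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


def false_positive_rate(num_variables, trials=1000, alpha=0.05, seed=0):
    """Share of simulated studies whose best p-value falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # The p-hacker measures several variables and reports only the best.
        best = min(null_p_value(rng) for _ in range(num_variables))
        if best < alpha:
            hits += 1
    return hits / trials


honest = false_positive_rate(num_variables=1)
hacked = false_positive_rate(num_variables=20)
expected_hacked = 1 - (1 - 0.05) ** 20  # chance at least one of 20 is "significant"
print(f"one variable: {honest:.3f}")
print(f"best of 20:   {hacked:.3f} (theory: {expected_hacked:.3f})")
```

With a single variable the false positive rate stays near the 5% that alpha promises. Picking the best of 20 independent noise variables pushes it to well over half, even though nothing real is going on.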
According to one paper we found,
The Extent and Consequences of P-Hacking in Science, while p-hacking is probably common, its effect seems to be weak relative to the real effect sizes being measured.
270 authors work to repeat 100 experiments. ‘Even with all the extra steps taken to ensure the same conditions of the original 97 studies only 35 of the studies replicated (36.1%), and if they did replicate their effects were smaller than the initial studies effects.’
Climate Change – how can we use data?
Jun 03, 2019
Two days after recording this, we saw the article ‘Trump Administration Hardens Its Attack on Climate Science’ – https://www.nytimes.com/2019/05/27/us/politics/trump-climate-science.html
Climate Change is a topic that requires arguments based on data not rhetoric. Science not politics. Facts not opinions.
During our conversation, we referenced the following:
Here’s our simple, straightforward guide to starting a podcast. Yes, you can do it.
For our hardcore data science listeners, we’ll be back next week.
How does New York City use data?
May 13, 2019
Dr. Alaa Moussawi joins us to talk about how the NYC Council is leveraging data and how this data is helping New Yorkers along the way.
NYC makes their data available so you too can explore it and answer the questions that you care about – https://opendata.cityofnewyork.us/
To learn more about Alaa’s Data Science Team, visit https://council.nyc.gov/data/
Predicting crimes without bias and prejudice – is it possible?
May 06, 2019
Predictive policing companies have received some unwanted attention. Police departments are minimizing their use, if not eliminating these programs altogether.
Predictive policing aims at leveraging more data when determining crime areas, identifying perpetrators, and allocating resources.
What do you think? Tell us on Twitter @dsimposters or email us at jordy@datascienceimposters.com
DSI Week in Data Science-y News with Miguel Garcia
Apr 29, 2019
Our friend, Miguel Garcia, pops in on this episode. We always love having Miguel on the show. This is especially true today when we get his take on the latest news events we are covering in this episode.
While we have Miguel on he talks to us about his role as a Senior Solutions Engineer at Looker. Miguel also tells us about his journey from a risk analyst to his new role and what helped him make that transition.
We discuss the Professional Engineer’s Creed and whether the Data Science industry needs a similar oath.
Interview with Kenneth Reitz
Apr 22, 2019
Our agenda with Kenneth was to have an honest, unscripted conversation and we succeeded.
If you’ve never heard of Kenneth, you may have heard of the Python tools that he has helped to author: requests and pipenv. (According to https://pypi.org/project/requests/2.9.1/, Requests gets 43 million downloads per month.) Or you’ve read his book The Hitchhiker’s Guide to Python. Or you’ve listened to his song – don’t worry, we have a sample of one of his tracks – ‘Push and Pull’ – at the end of the episode.
Listen to the episode then check out his site: kennethreitz.org
Containers & Japan with Brian de Heus
Apr 15, 2019
Brian de Heus is the Chief Technology Officer at Adgorithmics in Tokyo, Japan. We leveraged social media to reach out to him and ask him to join our show remotely. The first challenge was getting over the 14 hour time difference. The second hurdle was making sure that we stayed on topic since Antonio has always wanted to visit Japan and here was Brian telling him how great it was.
Brian explains to us what his company does, the problem that existed for his company, and how he solved it with containers. That’s really how you should approach technology. You have a problem, you look for a solution – new or old technology, and you apply it. Don’t find a technology and then look for a problem – we’ve said it before and now we’re writing it down.
Listen to this episode to learn more about Container technology, Brian, Adgorithmics, and living in Japan. We hope you enjoy and please send us a note, give us a review, or send us a tweet. Until next time …
Economists do Data Science Differently w/ Dr. Peter Lorentzen
Apr 08, 2019
If all you know about economics is ‘supply and demand’, you may be in for a good awakening with our next interviewee. Dr. Peter Lorentzen is a professor in the Economics Department at the University of San Francisco and he is starting a new graduate program in Applied Economics.
With the recent teams of economists that firms like Uber, Apple, and other Silicon Valley employers are hiring, we wanted to get more details. After listening to Peter talk to us about the vision for his program, we fully understand the motivation of these companies. These companies are not the first to hire economists; Bell Labs has done this previously. According to Peter, the requirements to obtain an economics degree have not changed on the way in, and USF is trying to change the skill sets economists get on the way out.
If you’re interested in reading more, you should also read these articles:
1. Why Tech Companies Hire So Many Economists – 2019 https://hbr.org/2019/02/why-tech-companies-hire-so-many-economists
In 2011, Marc Andreessen wrote a fantastic article, Why Software Is Eating The World. This motivated our title for this podcast. In this podcast we talk about software bugs that have fatal or major financial consequences, and explore why these bugs happen so frequently.
Which software are you worried about in your life? Your car? Your plane? Your money stored in digital form? Were you worried about Y2K or thinking about the Epoch Bug?
Public Service Announcement: If you haven’t read Marc’s article, please do.
GDPR, huh, what is it good for? Absolutely something?
Mar 25, 2019
If you’re outside of the EU, you may not know about GDPR. In this episode, we invite our friend Maria to discuss the General Data Protection Regulation. We discuss the What? Who? When? Where? We also talk about the business impact of these rules, what it means to implement them, and what is still to come. If you’re in the USA, you should know that some rules are coming to California too – we should all think about the importance of these rules – as individuals and as business owners.
Natural Language Processing with Prem Ganeshkumar
Mar 11, 2019
How do you make a computer program understand a language, a natural language, like English? Can we create a program that can create sentences, paragraphs, articles, or stories?
In this episode, we explore the basics of Natural Language Processing from Prem Ganeshkumar who is a Lead Natural Language Processing Research Engineer at Agolo in New York City.
After this episode, you should be able to pass our Data Science Imposters Natural Language Processing Quiz – https://docs.google.com/forms/d/e/1FAIpQLSdmgFmnoLfyxfu0tc_xjk4HS4u3ZxaAi4X-XxuoYfABLPkWmg/viewform
Fight or Flight?
Mar 04, 2019
At Antonio’s latest visit to the Motor Vehicle Commission, he finds out two things. First, sometimes it’s hard to change data once a mistake occurs. Second, he will take flight before a fight.
We then talk about Amazon and the second headquarters that never was. Apparently, Amazon feels like Antonio – they also took flight.
With or without Amazon, NYC’s tech scene continues to flourish.
An interview with Dr. Noemi Derzsy
Feb 19, 2019
Dr. Noemi Derzsy is currently a Senior Inventive Scientist at AT&T Labs within the Data Science and AI Research organization. In addition to getting to know what her position entailed, we wanted to know how she got there (especially given her Physics background), how she thought about the data science discipline, and where she saw the future of the field.
Dr. Derzsy gives us her take on data engineering vs. data science teams. Antonio and Jordy seemed to agree. She gives advice to those looking for opportunities in the field. She uses her words more carefully than we do and used the term ‘geographically isolated’ for one of the places that we three had in common. We could not agree more with her assessment.
We are in awe with how much Dr. Derzsy has done in the field thus far and will certainly keep tuned for newer developments.
Ep. 49 – Interview with Dr. Haftan Eckholdt – Chief Data Officer & Chief Science Officer at Understood.org
Feb 01, 2019
In this episode, we interview Dr. Haftan Eckholdt – Chief Data Officer & Chief Science Officer at Understood.org. Understood’s goal is to help the millions of parents whose children, ages 3–20, are struggling with learning and attention issues.
Haftan takes us through his journey from rigging computers for distributed computing to generating models associated with the insurance industry and then onto technology companies (including Plated and Audible).
Haftan has plenty of ideas about how to integrate data and technology in his latest endeavor. He discusses direct and indirect data and examples of how data can be leveraged in positive ways. We certainly look forward to hearing about his success.
Cheap Noodles, Beers, and Bayesian Econometrics with Jim Savage
Jan 17, 2019
Jim Savage is the Head of Data Science at @lendableinc. He joins us on the show to take us to school about statistics, his company, the work he’s been doing, and cheap noodles.
Jordy and Antonio have Jim make sense of this blog post for them: https://khakieconomics.github.io/2017/01/01/Building-useful-models-for-industry.html
Along the way, we learn how Jim’s life really revolves around pizza. (Please note that he had pizza for dinner). He also talks about commitment and how that translates to short term goals (30 days) and long term goals (6 years).
We made a new friend and we hope you enjoy the conversation as much as we did.
S.O.S. – 5G is coming
Jan 08, 2019
As data science imposters, we get to think about the history of things that came before 5G or whatever new whizzbang technology. In this episode, we talk about the origins of Morse code and related items.
Did you know that you could play songs using an old phone? Check out this site for some ideas: http://www.yak.net/carmen/phone_songs.html
We are excited about some of our future interviewees, we’d love to hear from you – rate us, like us, and keep listening.
Happy Holidays!
Dec 25, 2018
Holiday wishes from the DSI crew.
Logic, data, beers, and free talk
Dec 17, 2018
Fueled by a few cold IPA beers, a microphone, and a quiet room provided by a friend of the show – the Data Science Imposters grab the mics and have a good conversation. Join the conversation on Twitter @dsimposters
How much would you pay for the first auctioned AI-generated portrait? Make your own with GANs
Nov 26, 2018
Can a computer be artistic? If so, how valuable can their pieces of art become? The Obvious Group certainly thinks so and they have made their first sale of a computer generated portrait for $432,000. They’ve done this by doing some great promotion but also leveraging code published on the internet built on GANs (generative adversarial networks).
No.43 – Prisoner’s Dilemma and the Nash Equilibrium
Oct 29, 2018
Have you seen the movie A Beautiful Mind? Have you ever faced jail time unless you cooperate against an accomplice?
Well, let’s focus on the movie first – there’s a scene where five guys, one of them being John Nash, are at a bar when a group of women walk in …
In the movie, the focus of the men is on the blonde – one of the women.
They then go on to quote Adam Smith – “In competition, individual action serves the common good” – we will return to that in a second. Here’s a bit about Adam Smith from Adamsmith.org if you have never heard of him – ‘Adam Smith (1723-1790) was a Scottish philosopher and economist who is best known as the author of An Inquiry into the Nature and Causes of the Wealth Of Nations (1776)…’
In the movie, John Nash claims that Adam Smith needs revision. He says it’s what’s best for the individual and the group. He then lays out that if they all went for the blonde, they’d all get rejected and having a chance with her friends would then be impossible since no one wants to be second choice.
So, is the movie trying to depict the Nash equilibrium and, if so, does it do so correctly?
Okay, that’s not fair – read the definition from Wikipedia, ‘In game theory, the Nash equilibrium, named after the late mathematician John Forbes Nash Jr., is a proposed solution of a non-cooperative game involving two or more players in which each player is assumed to know the equilibrium strategies of the other players, and no player has anything to gain by changing only their own strategy.’
What does that mean to you?
Okay, let’s see if this helps. The prisoner’s dilemma –
There are two people convicted of a crime – let’s say Antonio and Jordy. You’ll normally see this example in a table format so imagine this in your head.
Jordy has two choices and Antonio has two choices: keep quiet or betray their partner. A person who confesses against a partner who stays quiet gets out scot-free. If they both confess, they both get five years. If one confesses and the other doesn’t, then the one who kept quiet gets 10 years and the one who confessed gets nothing.
So, what do you do?
Nash showed for the first time in his dissertation, Non-cooperative games (1950), that Nash equilibria must exist for all finite games with any number of players. Only 27 pages!!!
I like to think of it as the point when all players have played their best strategy knowing what other players can do.
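To make the "no player gains by changing only their own strategy" idea concrete, here is a minimal sketch that checks every pair of choices in the prisoner’s dilemma described above. The sentences for confessing come from the text; the sentence when both keep quiet is not given there, so the one year each used below is purely an assumption for illustration, as are the function names.

```python
from itertools import product

QUIET, BETRAY = "quiet", "betray"
ACTIONS = (QUIET, BETRAY)

# Years in prison for (Jordy, Antonio), keyed by (Jordy's action, Antonio's action).
# Lower is better. The (quiet, quiet) entry is an assumed value: the episode notes
# do not say what happens when both keep quiet.
YEARS = {
    (QUIET, QUIET): (1, 1),
    (QUIET, BETRAY): (10, 0),
    (BETRAY, QUIET): (0, 10),
    (BETRAY, BETRAY): (5, 5),
}


def is_nash(profile):
    """True if neither player can cut their own sentence by switching alone."""
    for player in (0, 1):
        current = YEARS[profile][player]
        for alternative in ACTIONS:
            deviated = list(profile)
            deviated[player] = alternative
            if YEARS[tuple(deviated)][player] < current:
                return False
    return True


def nash_equilibria():
    """Check all four action pairs and keep the stable ones."""
    return [profile for profile in product(ACTIONS, repeat=2) if is_nash(profile)]


if __name__ == "__main__":
    print(nash_equilibria())
```

With these numbers the only stable pair is the one where both betray, even though both keeping quiet would leave each of them better off. That gap between the stable outcome and the jointly better outcome is what makes it a dilemma.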
Adam Smith does write ‘It is not from the benevolence of the butcher, the brewer, or the baker that we expect our dinner, but from their regard to their own interest.’ – Wealth of Nations
Cooperative vs. non cooperative games – in cooperative games there is some sort of binding agreement.
So, let’s return to the Beautiful Mind movie. What scenario would be a Nash Equilibrium?
#gametheory
Entrepreneurship and Deep Space with Jake Kramer of Fed Tech
Oct 22, 2018
Jake Kramer joins Jordy and Antonio to display his love for deep space, give background and context of his company, share his thoughts on entrepreneurship, and wow them with some of the technologies he’s been able to explore. Jake Kramer is a managing partner at Fed Tech, a unique private venture program, funded by federal agencies and corporate partners, that connects entrepreneurs to technologies developed across DoD, NASA, DOE and other laboratories.
Jake received his MBA in Entrepreneurial Management from the Wharton School of the University of Pennsylvania and his BBA in Computer Information Systems from Hofstra University.
Most importantly, Jake listens to Data Science Imposters and you should too.
Check out Fed Tech’s website – https://www.fedtech.io
Summer Break is Over! We’re Back!
Sep 24, 2018
Jordy and Antonio catch up after a long hiatus. Come catch up with us.
Jordy even commits to reading, nay listening, to a book for our next episode. Antonio discusses the latest book that he finished, ‘When’ by Daniel Pink.
Welcome back, we’ve missed you.
Reading Rainbowish Episode.
Apr 22, 2018
The data science imposters have been doing these book reviews for a while now and Jordy was finally able to find time to read and discuss a book. We are all starting to doubt he was ever on the show Reading Rainbow.
Jordy tells us about:
Guns, Germs, and Steel: The Fates of Human Societies by Jared Diamond
Antonio and Jordy talk commuting in the NYC metro area; that’s where all the reading time comes in.
Antonio discusses:
Black Edge: Inside Information, Dirty Money, and the Quest to Bring Down the Most Wanted Man on Wall Street by Sheelah Kolhatkar
Web of data, companies, private information, Facebook, and those friends of yours that signed up for that questionnaire
Apr 09, 2018
In this episode, we discuss what we have found out about the Facebook ‘data breach’, Cambridge Analytica, and the connections from companies to people to political parties to Facebook and ultimately to you (or one of your friends).
Education, Technology, and Analytics with Josh Powe
Mar 27, 2018
Josh Powe, CEO and co-founder of LinkIt!, joins us this week and he takes us back to school (I know, bad dad joke). He gives us an understanding of his industry, how technology has evolved in K-12 education, and he tells us what we may expect to see in the coming years.
LinkIt! is a robust K-12 assessment platform with powerful reporting and longitudinal data analysis tools.
The name of the IBM machine that beat Ken Jennings on live TV
Mar 19, 2018
‘Who is Watson?’ … That’s correct for 200. In this episode, we talk about Watson and Jeopardy as we review “Final Jeopardy: Man vs. Machine and the Quest to Know Everything” by Stephen Baker. We also have some fun by throwing in a few Jeopardy answers throughout the show.
Before we get into that, we talk about March Madness!!! It may be too late for this year, but next year we need to get those data science models in place to make our bracket picks. Is the stock market easier to predict than the NCAA championship?
Antonio finds this article which explains information gain and entropy but has a hard time explaining it to Jordy. Check out the article and let us know if you could do better: saedsayad.com/decision_tree.htm
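If you want to try explaining it yourself, here is a minimal sketch of the two quantities, using the standard definitions: entropy measures how mixed a set of labels is, and information gain is how much that entropy drops after splitting on a feature. The function names and the toy weather rows are hypothetical and are not taken from the linked article.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature_index):
    """Entropy of the labels minus the weighted entropy after splitting on one feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


if __name__ == "__main__":
    # Hypothetical toy data: (outlook, windy) -> did we play outside?
    rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
    labels = ["yes", "yes", "no", "no"]
    print(information_gain(rows, labels, 0))  # split on outlook
    print(information_gain(rows, labels, 1))  # split on windy
```

A decision tree picks the split with the highest gain at each step. In the toy data, outlook separates the labels perfectly while windy tells you nothing, so a tree would split on outlook first.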
Entrepreneurship has no roadmap – Interview with Jorge Nuñez
Feb 25, 2018
Jorge Nuñez is an entrepreneur, investor and as Forbes puts it, an “idea man”. He shares his story of starting his company Remote Reactivation, from early days in event promotions to becoming a player in the world of dentistry. Entrepreneurship has no roadmap and Jorge combines technology, data, and his perseverance to succeed in an otherwise neglected field.
He leverages relationships, and the benefits of networking. He describes keys to his success, such as understanding who the decision maker is, his belief in the 80-20 rule, and leveraging data to improve his client’s work-life balance.
Story time: ATMs Spitting Cash, Crypto Paradise in Puerto Rico and Uber Making Deals
Feb 12, 2018
Stories about ATMs Spitting Cash, Crypto Paradise in Puerto Rico and Uber Making Deals
Antonio reminisces about his high school days and attending 2600 meetings at the Citi Corp building while discussing the ATM hacks or jackpotting.
Did you know the “whistle” provided in Cap’n Crunch cereal boxes could be used to make phone calls?
Did Antonio find a glitch in the Starbucks Card method?
…
The last couple of days have been tough on cryptocurrency, but that didn’t stop the NYTimes from publishing some information on folks trying to make Puerto Rico a crypto utopia.
Are these cryptocurrency folks going to be the Rockefellers of our time?
…
Did Uber make the right call? They made a deal with a Hacker, and are paying the price for it. Maybe they should have just done a “bug bounty”
Can you read and understand this better than Alibaba or Microsoft’s AI? If you get sick will you call 311 or Yelp first?
Jan 28, 2018
Microsoft outscores Alibaba which outscores humans in reading and answering questions! Well, the truth is that humans also outscore Microsoft and Alibaba. It depends if you are using the Exact Match (EM) or F1-score scoring methodology. While we discuss some of the technical aspects of this, we do not lose sight of the socio-economic impact that these technologies can have on society.
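To see why the ranking can flip depending on the metric, here is a minimal sketch of the two scores as they are commonly computed for reading-comprehension answers. Exact Match only gives credit when the normalized answer matches the reference exactly, while F1 gives partial credit for overlapping words. The normalization steps (lowercasing, dropping punctuation and articles) are an assumption about the usual setup, not something stated in the episode.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("in Paris France", "Paris"))
    print(f1_score("in Paris France", "Paris"))
```

An answer that contains the right words plus a few extra ones scores zero on Exact Match but still earns partial credit on F1, which is how a system can beat a human on one metric and lose on the other.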
In data science, we often try to use other data to gain more insight into a particular problem or situation. In our second segment, we spend some time exploring an article where they use Yelp as a proxy for identifying foodborne illnesses in NYC.
Rebooting … Same purpose, different format
Jan 15, 2018
We start the new year with a book review of Peter Thiel’s Zero to One. As always, we add some levity by reading the 1-star reviews on Goodreads. We ask ourselves if there’s a Zero to One idea within Data Science. We come up short. Antonio loves libraries and defends them as one of the best institutions in our country. Do you agree? We finish the episode with a discussion of David Robinson’s blog post shared by one of our listeners – thanks Diane!
Episode has been archived here: http://traffic.libsyn.com/datascienceimposters/DSI_Episode_33.mp3
Are you misbehaving?
Dec 10, 2017
Imagine that you go into a sportswear store to buy a snowboard to join your friends for a trip in Vermont. You see a used snowboard for $100, you know the salesperson from high school, and they tell you that you can find the exact same thing 15 minutes down the road for $30 less. What would you do? Imagine if the snowboard is $500 instead of $100; would you drive 15 minutes to save $30? Antonio spent one flight from Dublin to New York and the following week finishing ‘Misbehaving’ by Richard Thaler. Antonio shares what he gained with Jordy.
Building a Machine Learning Platform – Interview with Dr. David Purdy
Nov 26, 2017
We start this episode addressing the ‘disease’ (our words, not his) of Impatient Data Science and how to cure it with a platform. We are excited to have David share his thoughts and practical experience building machine learning platforms.
Dr. David Purdy is currently a Senior Data Science Manager at Uber. Most recently, David architected Uber’s Machine Learning Platform and its real-time spatiotemporal forecasting platform which are the basis for driving Uber’s competitive advantage. Throughout his career, David has led the architecture of five such platforms. David holds a PhD in Statistics from UC Berkeley, and his career in data science and machine learning spans multiple industries including: finance, personalized medicine, transportation, and web search.
One idea that resonates well with us is the thought that you go from zero to something and iterate between something and the nth somethings until you get it right.
If you are interested in more details about Uber’s Michelangelo Machine Learning Platform, you can visit here: https://eng.uber.com/michelangelo/ In addition, you can send us any questions that you may have.
We invite a friend to join us this week (actually two weeks ago) to relive some of our favorite episodes. We invite you to join this casual conversation.
Next week we are excited to have a special guest on the show to discuss his experience building Machine Learning Platforms.
We got an IDEA, actually we got lots of ideas – Part II @RPI
Oct 26, 2017
In this episode, Dr. Bennett takes us back to school and teaches us a few things about machine learning, artificial intelligence, data analytics, and visualization. Along the way, we discuss how to incorporate teaching of these topics in colleges and high schools and some of the moral issues that may arise with artificial intelligence.
‘Dr. Kristin Bennett is the Associate Director of the Institute for Data Exploration and Application and a Professor in the Mathematical Sciences and Computer Science Departments at Rensselaer Polytechnic Institute. Her research focuses on extracting information from data using novel predictive or descriptive mathematical models and data visualizations … to support decision making … in science, engineering, public health and business. She has 25 years of experience and over 100 publications in these areas.’ Read more about Dr. Bennett here.
On the road: @RPI Homecoming 2017 (Part I)
Oct 22, 2017
We return to Rensselaer Polytechnic Institute (RPI) for the 2017 Homecoming Weekend. We share our experience and reminisce about good ol’ RPI. This episode is less structured and less ‘data-sciencey’ than most of our other episodes. We hope you enjoy this casual episode and tune back to Part II when we jump back into the depths of data science …
Here’s to old R.P.I. Her fame may never die, Here’s to old Rensselaer, she stands today without a peer, Here’s to those olden days, Here’s to those golden days, Here’s to the friends, we made at dear old R.P.I.
Alice and Bob have a secret
Oct 16, 2017
Alice and Bob have a secret that they want to share but they don’t want you or anyone else to know what it is. In this episode we talk about cryptography at a high-level. We touch on symmetry, hashing, and even steganography (not to be confused with calligraphy). How would you hide your secret? Is there a hidden message in this message?
The 7 things we had to discuss this week with @MickeyGarciaD
Oct 09, 2017
This week we are joined by Miguel Garcia. Miguel is a friend of the show and an avid listener. It’s good that he joined us since we had lots of topics to discuss including:
The Google Machine Learning tool – https://teachablemachine.withgoogle.com/
Language processing and Forensic Linguistics – Have your cake and eat it too – https://en.m.wikipedia.org/wiki/Manhunt:_Unabomber
The Trolley Problem and Driverless Cars – Inspired by Radiolab episode – check it out – http://www.radiolab.org/story/driverless-dilemma/
Tattoos that show dehydration and Knockout Bands – https://news.harvard.edu/gazette/story/2017/09/harvard-researchers-help-develop-smart-tattoos/
Pixel Head Phones – Live Action Translation – Here’s Google’s page: https://store.google.com/us/product/google_pixel_buds?hl=en-US
A little bit about our guest this week …
Miguel is a data analytics young professional from Puerto Rico currently working with Looker in NYC. Prior to this he worked at Etsy where he helped start the finance data analytics team and as an analyst at Goldman Sachs. He enjoys learning about data pipelines and data science workflows. He likes to spend his spare time cycling, hiking, and cooking. Find him on twitter as @MickeyGarciaD.
Programmer’s Paradise (in OOP)
Oct 01, 2017
As I walk through the valley of the shadow of code
I take a look at my programming style and realize there’s none of it ‘Cause I’ve been copyin’ and pastin’ so long That even my momma thinks that my mind is gone But I ain’t never crossed a problem that didn’t deserve it Me be treated like a newbie, you know that’s unheard of
This week, I get as nerdy as I can without losing half of my audience. Jordy seemed exhausted after this conversation. Let us know how you think we did.
Your credit score and identity may be in danger.
Sep 25, 2017
Jordy and Antonio meet to discuss the massive data breach at Equifax. In the United States and abroad, Equifax houses some of your most personal information allowing you to be recognized as a financially credible person to banks, companies, and others. With a data breach of this magnitude, what should you do? What happens next? We leave a question for you – is there a better way?
Catching a Fraud
Sep 14, 2017
In this episode, we learn how insurance companies are using a topic we talked about before – graphs – to determine which car insurance claims appear to be fraudulent. We dive into the paper, ‘An expert system for detecting automobile insurance fraud using social network analysis’, published by Lovro Šubelj, Štefan Furlan, and Marko Bajec, available at https://arxiv.org/pdf/1104.3904.pdf
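To give a feel for the graph idea, here is a minimal sketch that links people who took part in the same accident and then looks for pairs who keep turning up together. This is only an illustration of building a network from claims data; it is not the method from the paper, and the accident IDs, names, and function names are all made up.

```python
from collections import defaultdict
from itertools import combinations


def build_network(accidents):
    """Link every pair of people who took part in the same accident."""
    neighbors = defaultdict(set)
    for people in accidents.values():
        for a, b in combinations(sorted(set(people)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def repeat_pairs(accidents):
    """Pairs of people who show up together in more than one accident."""
    seen = defaultdict(set)
    for accident_id, people in accidents.items():
        for pair in combinations(sorted(set(people)), 2):
            seen[pair].add(accident_id)
    return {pair: ids for pair, ids in seen.items() if len(ids) > 1}


if __name__ == "__main__":
    # Hypothetical claims: accident id -> drivers and passengers involved.
    accidents = {
        "A1": ["ana", "ben", "cy"],
        "A2": ["ana", "ben"],
        "A3": ["dee", "eli"],
    }
    print(dict(build_network(accidents)))
    print(repeat_pairs(accidents))
```

Two strangers sharing one accident is ordinary. The same pair appearing in several supposedly unrelated accidents is the kind of pattern a claim-by-claim check would miss and a network view makes visible.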
Episode transcript below
Note: This is an automated transcript and still needs a human touch.
It’s orchestrated fraud. It’s all about it. And the more people you have in a collusion scheme, you know, in a scheme like that, the less likely it is to happen, right? If 100,000 people are in this scheme, then you say, well, is that really a scheme or part of it? I’m Antonio and I’m Jordy, and we’re the Data Science Imposters. Jordy, one of the things we told our listeners we would do is take difficult topics, take these complex areas, and break them down to a point where they are understandable to people: not only people who may not have the technical details, but maybe people who could be data scientists who work in this field but don’t deal with certain technologies or don’t deal with certain algorithms. Oh, great. That’s one of the beauties of this podcast: we try to break it down to a point where you can have a conversation over it. So one of the sites that I recently stumbled upon, and maybe not so recently, was arXiv, so arxiv.org. Okay, okay. And what it is, it seems like it’s a repository of a number of research papers on all different types of topics. And one of the things that I look for is, let’s see what they have in machine learning. Sure. And I remember in episode 8 we talked a little bit about graphs and networks, so that’s going to come up.
And a few new concepts come up, but the title of this article that I found is ‘An expert system for detecting automobile insurance fraud using social network analysis’. That sounds cool. Right. And they start off with the fact that, look, insurance is not like a sexy topic, right? No, it’s not. It’s, you know, one of those things with the content and then the probabilities and statistics. But the fact is that the insurance business is like a trillion-dollar industry. It is, by the number of people. Everyone needs insurance. Even if you aren’t getting the lot you need insurance. The house needs insurance. Yeah. And I think the biggest problem with something like fraud is that it affects all of us. Yes, yes. If there’s fraud, then, you know, your particular claim or what you’ve been paying on a monthly basis may go up because somewhere down the line they had to pay a claim of X millions of dollars. And this paper is by three people in Slovenia: Lovro Šubelj, Štefan Furlan, and Marko Bajec. Again, you can go to arxiv.org. The paper number is 1104.3904.
PDF, just in case; I’m just giving a full reference here. But so the first thing I think about this is when they say an expert system. When I think about an expert system, I’m thinking about a couple of things, but I want to put it on you, Jordy, to maybe give me an idea: when you hear expert system, what do you think about? Expert system or expert individuals? So expert system, I don’t know, because it’s so vague. When I hear expert system, I don’t hear in what field you are an expert system. So I’m not hearing that, right? So I would assume they’re just, let’s say, if there was a field that they, you know what, I don’t know. Right. So before the machine learning and AI craze, everything was either an expert system or a decision support system. People used those terms interchangeably. And also the idea of a rules-based system. So if you think about a rules-based system, what do you think? Well, you’re setting some parameters, and they can only make decisions based on what the parameters are set to, and those parameters are usually set by experts in whatever field it is. Got it. And that’s why the idea became an expert system. The people who are setting these parameters are known experts, so expert system just means that this has been written by a person from the field who knows what they’re talking about, and calibrated and configured by them: that idea that you have maybe a human element, someone tweaking these knobs and getting it right. So for detecting automobile insurance fraud using social network analysis, that’s what I want to dive into. So, social network analysis. You remember episode 8; I think we talked a little bit about Dijkstra’s algorithm. A graph or a social network or network is basically when you have components that you somehow relate to each other. Gotcha. Let’s say, Jordy, we work on the Data Science Imposters podcast together.
Jordy, you also went to a certain high school, and so you know certain people from that high school. I’m connected to those people by the fact that I’m connected to you. And then, you know, we can build relationships between different things using what we call networks or graphs. You don’t have to only do it by relationships between people, like something like Facebook would do, and that is what’s called a social network. You could also do it by relationships between web pages, and that’s what Google is famous for. Yeah, we did think of the number, so it was like a phone book example, like an address book kind of thing. Right. Right. OK. So social network analysis being the fact that, look, people are connected. And when you have an automobile claim, right, let’s say you do have an accident, you have a legitimate accident. What happens, right? What’s happening in that accident? Well, we’ve all been there, right? So there’s an accident, and hopefully everyone’s OK. The drivers get out of the vehicle and assess the damage. They probably pull over and exchange insurance information, and if they deem it necessary they’ll call the police so that you can file a police report. Everyone gets a copy. Everyone goes on their merry way. The next step, though, is you call your insurance company and let them know what’s happened. You give them your information and the information of the person you were with. They’re probably just going to say what’s Denish who’s or who carries insurance and you give them a report. Right. And so if you were taken out.
Let’s not even talk about the social network aspect of it, or the fact that people are connected in some way here, and I’ll talk about that in a second, but you have some data associated with this accident. Yes. That data being things like time, location. OK. Were you hit in the back? Were you hit in the front? Right, from the side? The number of passengers in the cars, the weather. Yeah, the weather is important. The city or state you were in. Were witnesses there? Did the police show up? Did the police issue a summons or a traffic violation? Maybe someone did an illegal move. Right. Yeah. So all of these attributes: one of the things the paper discusses is that you do have a lot of systems that use all of these attributes to make predictions about whether something looks anomalous, right, something is out of the ordinary. Right. But what you don’t have, and you know it’s interesting in the paper how they label it, is the social network that you get by connecting the fact that you have two drivers. Let’s say it’s a two-car accident. You have two drivers that are now connected in some way. Right. They were in an accident together. But not only do you have the drivers; you also have to connect the passengers. OK. Right. The passengers were also in this accident together. All right. So if I now connect, and this is actually an interesting problem, because let’s say I have one car with three people.
My first car has three people and my second car has more people. All of these people were in an accident together. Right. So technically I can connect them all to each other, or try to. Yeah. Well, no, we can say they were in one event together; they are connected by that event. Right. So the connection between the people is the event. Got it. Right. Or I can say the drivers should be connected directly, and then the passengers of each car should be connected to the driver of that car. Right. Or I could say that really I don’t care about the passengers. So even when we’re thinking about this graph, we can model it, we can draw it, in so many different ways. And one of the things that you see in this paper, if you do get a chance to read it, is that they talk a little bit about how you would model this scenario. Would you put the drivers together? Would you put the passengers together? I’d be curious to know, and I don’t know if they talk about it in the paper, what would instigate this type of investigation. Was there a time when insurance companies were figuring out that there is fraud, like the driver of car A and the passenger of car B are maybe cousins or high school classmates or something of the sort? And they go, listen, these two drivers aren’t related, so, all right, pay out the claim. But then they found out through back channels that there was some collusion that may have occurred.
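The modeling choices discussed above can be sketched in a few lines of Python. This is only an illustration: the record layout and the names `accident`, `event_edges`, `driver_passenger_edges`, and `drivers_only_edges` are made up for this example, and the paper's actual representation may differ.

```python
from itertools import combinations

# Hypothetical accident record: each car has one driver and some passengers.
accident = {
    "id": "A1",
    "cars": [
        {"driver": "d1", "passengers": ["p1", "p2"]},
        {"driver": "d2", "passengers": ["p3"]},
    ],
}

def event_edges(acc):
    """Option 1: link every participant to the accident itself (an event node)."""
    people = [car["driver"] for car in acc["cars"]]
    people += [p for car in acc["cars"] for p in car["passengers"]]
    return {(person, acc["id"]) for person in people}

def driver_passenger_edges(acc):
    """Option 2: link drivers to each other and passengers to their own driver."""
    drivers = sorted(car["driver"] for car in acc["cars"])
    edges = set(combinations(drivers, 2))
    for car in acc["cars"]:
        edges |= {(p, car["driver"]) for p in car["passengers"]}
    return edges

def drivers_only_edges(acc):
    """Option 3: ignore the passengers and link only the drivers."""
    drivers = sorted(car["driver"] for car in acc["cars"])
    return set(combinations(drivers, 2))
```

The same accident produces five edges under the first option, four under the second, and one under the third, which is why the modeling decision matters before any analysis starts.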
Was that happening? Yeah, I think so. What they were looking for were rings of insurance fraud, because insurance fraud doesn’t only happen in isolation; it has to happen with a number of people. And I think one of the things we’re seeing is that you wouldn’t have a scenario with just one driver in each car, because then you wouldn’t have a witness. Got it. That’s right. So you have these witnesses who are, quote unquote, independent. When we think about collusion, about someone else saying something that is not true, we think, all right, the likelihood of that is lower in a normal environment. Right. Collusion is one of those things where it’s sort of orchestrated fraud. It’s all about that. And the more people you have in a scheme like that, the less likely it is to happen, right? If 100,000 people are in the scheme, then you say, well, is that really a scheme? Yeah, yeah. All you need is one whistleblower to say, hey. Right. Which is why, and this is funny and way off track, it’s like Ocean’s 11. There are 11 people involved. No one’s going to snitch on anyone else. It’s rare, right? Right.
If you’re going to commit a crime, you’ve got to keep your circle of trust as small as you can. All right, well, this is not How to Commit a Crime 101. No, no, no. But it’s interesting. So in their paper they show the representation of the network, how it’s represented, and that’s pretty interesting. And they talk a little bit about the intention. When you represent something in these graphs, you have to sort of show the intention; you direct the graph in a certain way. When they talk about intention, one of the things they did is show who was at fault. Yeah. So if one driver was at fault or the other driver was at fault, they try to represent that in the graph. Yeah. So you can think about that part of it: it’s not an expert system; this is a computer algorithm developing these graphs and helping to get these things set up. How did they get this information? I’ve been in a couple of car accidents, nothing too major, and correct me if I’m wrong, but I just don’t remember handing over the information of the passengers that were with me at the time. Yeah. So it’s a number of things, right? There might be the police report; you may have a witness list. They did get this information, and they talk about the source of this information at the bottom of this paper, so let me just go to that. While you’re scrolling, I mean, think back. We’re not off base, right?
I mean, you’ve never had to give that information. I’m lucky that I have not been in an accident in a very long time, so I’m looking for that. I don’t recall having to. I don’t recall the last time that I had an accident where it went to insurance; you kind of handle it on the spot. So the data that they use is from between 1999 and 2008, collected in Slovenia. Wow. Yeah. And it consisted of 3,451 participants involved in 1,500 collisions. And it’s not only one data set; it’s two data sets where they combine the data, and they use what’s called labeled data. So they used some data that they knew was fraudulent. They were using some supervised learning techniques here, right? For some of the data, it was determined through investigation that it was fraudulent, and they were trying to determine, OK, given these characteristics of the fraudulent data, let’s see how much of the data not labeled as fraudulent is in fact fraudulent. Right. So that’s usually what you do in these scenarios. So what they talk about here, how I’ve read this, is that they formed these networks, right?
And then the next thing that they do with these networks is, all right, you form these networks out of all of your data, and then, based on characteristics of what happened during that day, you have an expert who will go in and say, yes, let’s say the fact that it was at three o’clock in the morning, this seems to have some impact on it being more or less fraudulent. So they would decide, yes or no, this could be associated with fraudulent activity, and they’d probably scrutinize it a little more. Right. And so they would take all of these characteristics around each claim, around each accident. Right. And then they would basically apply that and see, well, where else did we see these types of scenarios? And that’s usually what you do using these algorithms. You have some use case, some cases where you know that something was fraud, you know that someone was a bad actor, anything like this, and then you go ahead and say, well, let me extrapolate that information to my bigger set and see opportunities where I have not decided that this is fraud, but maybe it was, and we should look back into it. Gotcha, gotcha. And that’s a normal use case, a normal pattern that you will see when dealing with data. And what programming strategy does one use to go about a study like this? By programming strategy, you mean, like, we’ve spoken about so many different ones. What was the one that was one of my favorites? I’m terrible with the names, and I should have written this down when I walked in here. The genetic one, was that the genetic algorithms? Right. Yeah, that’s just one strategy, and then obviously there’s machine learning, which is the encompassing one. And then there are a couple of others, right? Right. Yeah.
So for this one, and again, I skimmed through this article, because the title is great and the descriptions are great, but once you get to the mathematics, and this is where we hopefully bridge that gap for people. Yeah. I skimmed through some of it because sometimes it’s more than I need. It’s not that it’s theoretical, but sometimes mathematics drives us away because it’s just all these random symbols and all of these random things going on. But essentially they use something like Google’s PageRank on the network, which gives them some indication of which of these things is most likely to be insurance fraud. Gotcha, gotcha. And really just giving things weights, with the probability of those other factors all weighed into this network. How do you get to this model? Obviously this is their own model, right? When they do research like this, usually what they’re doing is expanding an existing model, but they wanted to take it a step further and use this graph analysis. Interesting stuff. So I think it’s pretty interesting because it combines what we had talked about before with the graphs. It combines the fact that these systems don’t work alone just by using machine learning. It’s a combination. It’s humans deciding whether something looks fraudulent or not. It’s using historical data that we’ve labeled as fraudulent or not fraudulent.
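For anyone curious what "something like PageRank" looks like in code, here is a minimal sketch of plain PageRank by power iteration. It is not the paper's algorithm (the episode does not describe that in detail), and the `network` of driver ids is invented for illustration; each connection is listed in both directions to mimic an undirected graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain PageRank by power iteration on a dict {node: [neighbors]}.

    Every neighbor must also appear as a key of `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the small "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, neighbors in graph.items():
            if neighbors:
                # Split this node's rank evenly among its neighbors.
                share = damping * rank[node] / len(neighbors)
                for neighbor in neighbors:
                    new_rank[neighbor] += share
            else:
                # A node with no links spreads its rank over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical network: d1 shared an accident with each of the other drivers.
network = {"d1": ["d2", "d3", "d4"], "d2": ["d1"], "d3": ["d1"], "d4": ["d1"]}
scores = pagerank(network)
```

In this toy network the most connected driver, `d1`, ends up with the highest score. A real fraud model would mix in other evidence, such as the expert-chosen factors mentioned in the conversation.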
It is using this social network analysis, and it’s combining these things together. And when you think about the iterations of this technology and the approach, you can only imagine that the next step is, as you get more data, that some of the factors the humans were indicating become more and more robust, and you have a machine learning algorithm helping to guide that decision. Yeah. And to your point, you always need that human element to make this work, and you need that data and the historical experience to make these algorithms work and do their job, which is pretty awesome. In particular, I’m curious if this report has caused any reaction in the insurance industry, to change their ways of how they look at things. I think, you know, it’s a research paper, and there’s a lot of research like this being done. I think that some of the insurance companies around the world have their own similar models, and they’re all looking for a different way to approach the problem. Sure. And so sometimes this approach may be better and sometimes it might not be. With the insurance companies, one of the things that they do have is a lot of data; they have a lot of cases where this can be applicable to them, and most of the time they’re not looking for that final solution. They’re just looking for a smaller subset to look at. Yeah.
Because if you have an algorithm or a technique that just says, hey, look at 100 insurance claims instead of the ten thousand that are on your desk to look for fraudulent activity, and do a better job looking at these 100, then that’s a significant improvement overall for everyone. Yeah, but there are still those 9,900 that don’t get the attention they may warrant, right? And that might be fine or it might not be fine. Right. Because one of the things that I’ve had to deal with is getting people to recognize that, OK, if you gave me a thousand things to do and I skim all the 1,000, why don’t you just give me a hundred things to check and I’ll do a better job at those? And what I can do going forward is, let’s say we learned something from those 100 because we paid closer attention. We can build that back into the model, build that back into the chain, so that next time we don’t miss those and next time we do a better job. And when I advocate using machine learning, that’s how I advocate: I advocate a human in the loop, and I advocate iterating on the process to make it better. That makes sense. All right, well, look, that’s really what we wanted to talk about today. It’s a paper on insurance fraud using social network analysis. I found that interesting. I will not lie to you: I had to read it a number of times and have a few cups of coffee to get through some of the symbology associated with the mathematics behind it.
But it is interesting to think about how they had to do some modeling, how they used some humans, and how they embedded expert systems into this paper. Yep, and I’m sure in doing this research they’ve discovered ways to look further into it, or to give advice to insurance companies on how to look at fraud. I’m sure they’ll look at things like how many times this person has been in an accident, and now that gets added on to the other things to look at. Right. Right. So that becomes part of it, and that’s a good point. That’s just social network analysis. So I mentioned something, and let me just clarify this. For every accident, I would label my nodes, right? I would label each person as a node and each driver as a node. But if there was another accident with that driver, I’m not going to replicate that node; that driver is still that driver. Yeah. So I can see that if one driver was in 50 accidents out of 1,300, then I’d say something like, yeah, right. Or that, for the one driver that was in one accident, all of his passengers were in accidents as well. That’s a huge flag, and it’s a lot easier to tell once you draw this into a social network. And these folks used to do this stuff by hand, and there were the good insurance claims agents and then there were the ones who just passed it along.
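The "don’t replicate the node" idea can be sketched as follows: each person gets one entry no matter how many accidents they appear in, so repeat participants stand out. The `claims` data and the `threshold` value are invented for illustration and do not come from the paper.

```python
from collections import defaultdict

# Hypothetical claims: accident id -> ids of the people involved.
claims = {
    "A1": ["driver_1", "driver_2", "pass_1"],
    "A2": ["driver_1", "driver_3"],
    "A3": ["driver_1", "pass_1", "driver_4"],
}

def accidents_per_person(claims):
    """Map each person (one node per person) to the set of accidents they were in."""
    seen = defaultdict(set)
    for accident_id, participants in claims.items():
        for person in participants:
            seen[person].add(accident_id)
    return seen

def repeat_participants(claims, threshold=2):
    """Return people who appear in at least `threshold` accidents, with their counts."""
    counts = accidents_per_person(claims)
    return {person: len(ids) for person, ids in counts.items() if len(ids) >= threshold}
```

Here `repeat_participants(claims)` flags `driver_1` (three accidents) and `pass_1` (two accidents) as worth a closer look; it says nothing on its own about whether fraud occurred.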
But if you put this all in a tree, that may not be the right term, but in some sort of system or graph, you can pick up on these things, which is very cool. All right, guys, thanks again for listening to another episode of Data Science Imposters. You can always check us out at datascienceimposters.com. Check our Twitter account at @dsimposters, and send us an e-mail at Jordy or Antonio at datascienceimposters.com. Thanks for listening.
What is a terraformer? Interview with Ezinne Uzo-Okoro
Sep 10, 2017
This week we learn about ‘terraformers’ and the three biggest ways that this company leverages data science. Ezinne Uzo-Okoro is the founder of Terraformers.com, an online marketplace that empowers anyone to grow affordable fresh organic food anywhere and donate the excess to local food banks. Terraformers provides an entrepreneurial approach to improved nutrition, and feeding communities. Customers – anyone with at least 100 sq ft of yard or rooftop space – will democratize food access, and improve the economic trajectories of their own communities.
During her 13-year NASA career, Ezinne contributed to multifarious missions including a constellation of micro-satellites, and six launched spacecraft missions – Cassini, the Saturn orbiter; ExPRESS Logistics Carrier (ELC) sitting atop the International Space Station; Global Precipitation Measurement (GPM) observing hurricanes and precipitation on Earth; EFT-1 test mission for human exploration; Neutron star Interior Composition Explorer studying supernovae and neutron stars; and the Transiting Exoplanet Survey Satellite continuing the search for exoplanets. She is a technical leader in flight software development, and spacecraft systems engineering, and has made significant contributions in all phases of the NASA mission life-cycle from concept development and design to deployment and operations. She earned Computer Science and Aerospace Systems Engineering degrees and has received numerous NASA awards.
Listen closely, let me tell you a story #storytelling
Sep 07, 2017
I stepped closer but I didn’t want to make a sound… Storytelling is a powerful tool that we use to explain, warn, and motivate. In this episode, we talk about storytelling and invite our listeners to share their stories. Listen to Jordy’s story. Let us know what you think. Email us if you have a story to share for our story feature. If you need more inspiration for storytelling, check out The Moth (https://themoth.org/), which has been hosting story slams for some time now.
Real Data. Fantasy Football. w/ @TheCoogene
Sep 04, 2017
Football season is right around the corner with preseason games well underway. This week we spend time with @TheCoogene who teaches us all about fantasy football strategies, leagues, and choosing the right picks. He even gives us three ‘sleeper’ picks for those of you that participate in fantasy football. For those of you that care about the #DataScience, listen to get ideas on how to get your winning model. Fantasy football is a big business and using data over the competition will be a big advantage. We also talk a bit about gambling and the relation to fantasy football.
Hacking your (wireless) data – Interview w/ @s0lst1c3
Aug 28, 2017
Gabriel (@s0lst1c3), a security engineer, joins us this week to discuss wireless security and his talk at this year’s #Defcon. ‘DEF CON is one of the oldest continuously running hacker conventions around, and one of the largest’. Some great paraphrased snippets from the episode include ‘lawyers are like [law] hackers’, ‘getting domain admin access is like getting a star in Mario Bros.’, and ‘grandma gets her own wifi’.
Average Lies (Basic Statistics and Not Truths)
Aug 24, 2017
We explore basic statistics and how skeptical you should be about them. 100% of the people that have heard this podcast agree. (Hint: you should ask about our sample size) We look forward to hearing your thoughts on how Statistics have betrayed you.
Are you with me? The journey of a startup
Aug 21, 2017
This week we interview a technology and data company startup entrepreneur. Dwaine explains how a company is born out of an idea, evolves to meet the demands of customers, and survives through the hard work, sacrifice, and dedication of its team. We discuss how the financial crisis of 2007 and 2008, when the world seemed close to collapse, led to some people questioning their futures in the industry and others seeing opportunities. One of our listeners said: “… [I] thought I’d be a deer in headlights listening to you guys. However, I thought it was not only informative, but really engaging & reliable even to someone like me, who isn’t familiar with the lingo and industry.”
Using Evolution and Genomics to Solve Problems
Aug 16, 2017
Take a look at NASA’s evolved antenna picture on our site. This is not your grandfather’s antenna. This design was created by a computer.
If you can agree that, in nature, the strongest survive, then you could have developed the theory behind this next topic. In the early 1970s, John Holland used what he knew about evolution and genetics to propose a new way to solve optimization problems. These became known as genetic algorithms, one family of evolutionary algorithms.
The algorithms go through the following stages:
1. Natural selection – deciding which solutions live and which go away
2. Reproduction or cross over – mixing and combining solutions
3. Mutations – randomly changing components or ordering of solutions
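The three stages above can be sketched in a short Python program. This is a toy under stated assumptions: the objective (count the 1s in a list of bits), the function names, and every parameter value are invented for illustration and are not from the episode or from NASA's antenna work.

```python
import random

def fitness(bits):
    """Toy objective (an assumption): the number of 1s in the bit list."""
    return sum(bits)

def evolve(length=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # 1. Natural selection: the fitter half lives, the rest goes away.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            # 2. Crossover: splice two parents together at a random cut point.
            mom, dad = rng.sample(survivors, 2)
            cut = rng.randint(1, length - 1)
            child = mom[:cut] + dad[cut:]
            # 3. Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

best = evolve()
```

Because the survivors are carried over unchanged, the best solution never gets worse from one generation to the next. Real problems such as timetabling need a problem-specific encoding and fitness function.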
Examples include:
1. Timetabling – scheduling resources across multiple constraints
2. Traveling salesman – example: what’s the best route for a UPS truck to take during any given day?
3. Playing games – can a computer evolve a strategy only knowing the rules of the game?
Becoming a Data Scientist with Renee Teate (@BecomingDataSci)
Aug 14, 2017
Renee Teate – an accomplished data scientist, creator of the ‘Becoming A Data Scientist’ podcast and website, and the voice behind @BecomingDataSci twitter name – joins us to discuss her own journey to become a data scientist, her quest to get others to join the field of Data Science, and what excites her about the future. Renee was an exceptional guest with great stories, good insight, and the stamina to keep Jordy and Antonio focused for an hour.
Here are a few hashtags to describe moments in this episode, including #HarrisonburgNotHarrison #JMU #RosettaStone #Skynet #AI
Check out these resources if you’re interested in more …
If you tell me who your closest friends are, I can tell you who you are
Aug 09, 2017
We’ve heard sayings like ‘if you tell me who your closest friends are, I can tell you who you are’ – the idea is that you and your friends have such similarities allowing you to form a group or cluster that may be different than strangers.
In this episode, we explore how we would group similar items together (clustering) using similarities (or distances). The unsupervised clustering algorithms (i.e., we do not provide labeled examples indicating an existing grouping) that we discuss are k-Means and DBSCAN.
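As a rough illustration of grouping by distance, here is a tiny k-means sketch in plain Python. The sample `points` are invented, and starting from the first k points is a simplification made for this example; library implementations choose starting centers more carefully.

```python
def kmeans(points, k, iterations=10):
    """Tiny k-means on 2-D points, starting from the first k points."""
    centers = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical data: two loose groups of "friends" on a 2-D plane.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centers, clusters = kmeans(points, k=2)
```

No labels are supplied anywhere; the two groups fall out of the distances alone, which is what makes the method unsupervised.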
If Only I Had A Brain – Artificial Neural Networks
Aug 07, 2017
Antonio and Jordy talk about artificial neural networks, algorithms that today seem synonymous with Artificial Intelligence. First introduced in the late 1950s, these algorithms mimic how the brain works and are now being used extensively. We casually explore the history of ANNs, how these algorithms work, and what they can do. We expect that this will be one of many conversations about artificial neural networks.
Keep the conversation going on Twitter @dsimposters.
Ep. 0 – Who? What? Why? The Data Science Imposters Introduction
Aug 02, 2017
We have danced around an introduction of ourselves and the show. Now that we have published 10 episodes, have a few more being edited, and have some great guests lined up we want to introduce ourselves and our thoughts about the show.
@BecomingDataSci said it best. If you haven’t already, check out her Twitter page and site.
Ep. 10 – 1 way to lose $1,000 in a week: Cryptocurrencies
Jul 30, 2017
Antonio plunges into the world of cryptocurrencies and buys enough Ether and Litecoin to lose $1,000 in a week. Bitcoin, Ether, and Litecoin cryptocurrencies are built on blockchain technology. Blockchain technology is like a decentralized ledger. While this is an exciting new frontier for the technology and these cryptocurrencies, there is still room for improvement. Hacks and thefts have threatened these cryptocurrencies and have created some skepticism about the underlying technology. Financial companies are optimistic and betting on blockchain technology’s success to improve their own processes.
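To make the ledger idea a little more concrete, here is a minimal hash-chain sketch in Python. It only shows how each entry is tied to the one before it. It leaves out everything that makes a real blockchain decentralized (networking, consensus, mining), and the function names and block layout are assumptions made for this example.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # placeholder "previous hash" for the first block

def block_hash(index, data, prev_hash):
    """SHA-256 over the block contents, including the previous block's hash."""
    payload = json.dumps({"index": index, "data": data, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def add_block(chain, data):
    """Append a block that points at the hash of the current last block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV
    index = len(chain)
    chain.append({"index": index, "data": data, "prev": prev_hash,
                  "hash": block_hash(index, data, prev_hash)})

def is_valid(chain):
    """Check that every block links to its predecessor and hashes correctly."""
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i else GENESIS_PREV
        if block["prev"] != expected_prev:
            return False
        if block["hash"] != block_hash(block["index"], block["data"], block["prev"]):
            return False
    return True

ledger = []
add_block(ledger, "Antonio buys 1 LTC")
add_block(ledger, "Antonio buys 2 ETH")
```

Changing an earlier entry changes its hash, so `is_valid` fails unless every later block is rewritten as well. That is the property a shared ledger relies on.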
Listen to our podcast, read these articles, and let us know what you think about the technology.
Credit: The screenshot for Litecoin pricing is from https://coinmarketcap.com/currencies/litecoin/ and the screenshot for Ether pricing is https://www.coindesk.com/ethereum-price/
Ep. 9 – Web scraping, APIs, and Programming
Jul 24, 2017
Jordy and Antonio discuss web scraping, APIs (application programming interfaces), and the benefits of programming solutions. They also talk about the challenges of combining data across multiple data sources (even if those data sets are open).
Beautiful Soup is a popular library for web scraping in Python. The following is an article that you will find helpful if you are interested in starting with the library: Web Scraping with Beautiful Soup.
The list goes on and on and only continues to grow.
Episode is archived here: http://traffic.libsyn.com/datascienceimposters/DSI_Episode_09.mp3
Ep. 8 – Getting bad directions from Mr. Dijkstra
Jul 20, 2017
In this episode, Antonio begins by discussing the seven-hour car ride to Cape Cod, MA and the eight hours spent on the return trip. This leads into a discussion about how Google and other companies use graphs to give us directions. Here’s a simplified graph created with Graph Online.
You can also use a free software package, GraphViz, to create these graphs. It requires a little bit more work but allows a lot more customization.
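Since the episode is named after him, here is a minimal sketch of Dijkstra's shortest-path algorithm using Python's standard `heapq` module. The `roads` graph and its travel times (in hours) are made up for illustration; real routing services use far larger graphs and live traffic data.

```python
import heapq

def dijkstra(graph, start):
    """Shortest travel time from `start` to every reachable node.

    `graph` maps node -> {neighbor: non-negative edge weight}.
    """
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry; a shorter route was already found
        for neighbor, weight in graph.get(node, {}).items():
            new_dist = d + weight
            if new_dist < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_dist
                heapq.heappush(heap, (new_dist, neighbor))
    return dist

# Hypothetical road network with travel times in hours.
roads = {
    "Home": {"A": 2, "B": 5},
    "A": {"B": 1, "Cape": 7},
    "B": {"Cape": 3},
    "Cape": {},
}
times = dijkstra(roads, "Home")
```

With these made-up numbers the best route is Home, A, B, Cape at 6 hours, even though Home to B looks more direct on paper. The answer is only as good as the edge weights, which is one way to end up with bad directions.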
Book Recommendations
I highly recommend this book. It is technical but accessible to a wider audience than most of these books. (You can skip the coding sections and still get a good amount from this book)
Jason, a friend of the show, joins us in this episode as we discuss the ‘Money Ball’ of Health Care. We also dive into Jordy’s BMI, learn that there are now three categories of obesity within the BMI range, reveal that Antonio is 52 years old according to his ‘Vitality’ age, and find out what Jason gets from City MD that he does not get from his regular doctor. We also wonder how we can contribute to the Project Baseline Study without being one of the 10,000 selected volunteers.