Machine learning research requires running expensive, long-running experiments, where even a slight miscalibration can cost millions of dollars in underutilized compute. Once a model is trained, deployment, production monitoring, and observability each present unique operational challenges.
Chris Van Pelt is a co-founder and the Chief Information Security Officer of Weights & Biases. The company set the industry standard for experiment tracking and visualization, and has expanded that expertise into a comprehensive suite of MLOps tooling, including model management, deployment, and monitoring.
Chris joins us today to discuss the state of the machine learning ecosystem at large, as well as some of their more recent work around production LLM tracing and monitoring.
Sean’s been an academic, startup founder, and Googler. He has published works covering a wide range of topics from information visualization to quantum computing. Currently, Sean is Head of Marketing and Developer Relations at Skyflow and host of the podcast Partially Redacted, a podcast about privacy and security engineering. You can connect with Sean on Twitter @seanfalconer.
De-scoping Your AWS Services from Data Residency Requirements
Dec 11, 2023
From the widely recognized GDPR in Europe to Brazil’s LGPD regulations, and the more recent introduction of India’s DPDP law, over 100 countries now have some form of privacy regulation in place. What’s common among many of these regulations is the concept of data residency – the physical location of your data. However, each region’s requirements bring their own unique nuances, encompassing restrictions on data transfer, data storage locations, and individual data rights.
Navigating this complex sphere of privacy regulations is a huge burden for many companies born in the cloud. Their data simply ends up everywhere, and tracking down the locations, adhering to local laws, and even storing and using it locally is enormously complex and expensive.
Over the past year, I’ve engaged with numerous companies eager to expand their businesses into new markets, such as Europe and Australia. However, they’ve encountered a significant roadblock: the absence of a robust technology solution that addresses the data residency requirements of these regions. As a result, they face the expensive and nightmarish scenario of duplicating their cloud infrastructure for each new region, which not only hampers operational efficiency but also prevents their data analysts and scientists from running analytics globally.
In this blog post, I offer a solution to this pressing technology and business challenge by introducing a PII data privacy vault. This architectural approach to data privacy effectively removes the burden of data residency, compliance, and data security responsibilities from your infrastructure, providing a seamless path for global expansion and data management.
Let’s dive in.
Data Residency and Barriers to Expansion
To grasp the intricacies of regulatory compliance in the context of global expansion, it’s important to understand a few key concepts.
Compliance
Compliance denotes a business’s adherence to the laws and regulations governing data privacy and protection. These regulations are contingent on the geographic location of the customer whose data is being collected. Ensuring compliance is imperative for legal reasons as it shields businesses from financial penalties, license revocations, and the erosion of customer trust.
Data Residency
Data residency pertains to the physical location where customer data is stored. For instance, a website may serve customers in the EU, but their data could be hosted on a server located in Chicago. Different countries and regions have precise laws dictating how customer data should be handled, processed, stored, and safeguarded, making data residency a critical consideration.
Varying Regulations
The complexity surrounding data residency and compliance obligations primarily arises from the diversity of regulations worldwide. For instance, the European Union (EU) has GDPR, Brazil follows LGPD, and the United States enforces a patchwork of state-specific laws like CCPA in California and CTDPA in Connecticut. These regulations diverge significantly in terms of their stipulations and penalties.
Barriers to Global Expansion
The disparities in regulations and compliance requirements often pose formidable obstacles for companies striving to attain a global presence. Navigating diverse regulatory frameworks demands significant time, resources, and expertise. The resulting complexity frequently dissuades businesses from venturing into new markets, thereby constraining opportunities for global expansion.
We’ve looked at the problem; now let’s explore an approach to addressing these challenges.
What is a Data Privacy Vault?
A data privacy vault isolates, protects, and governs access to sensitive customer data. Within the vault, confidential information is securely stored, while abstract and non-sensitive tokens, serving as references, are retained in conventional cloud storage. This means that only non-sensitive tokenized data is accessible to other systems, ensuring the utmost protection and compliance.
In a recent IEEE article, the authors made the case that this architectural approach to data privacy is the future of privacy engineering. Just as any modern system likely contains backend services, a database, and a warehouse, all modern systems need a data privacy vault to safely store, handle, and use sensitive customer PII.
Traditional PII management versus a data privacy vault (source: IEEE).
Let’s take a look at a specific example for a simple web application. In the image below, a phone number is being collected by a front-end application. For effective de-scoping, it’s ideal to initiate the de-identification process at the earliest stage in the data lifecycle. In this scenario, the phone number is stored directly within the vault during collection at the front end.
Example of vault architecture for collecting sensitive customer PII.
Within the vault, the phone number, alongside any other personally identifiable information (PII), is stored within a robust and isolated environment, segregated from your organization’s existing infrastructure. All downstream services, ranging from application databases to data warehouses, analytics platforms, and logging systems, interact solely with tokenized (de-identified) representations of the data. Queries and algorithmic operations that require the actual PII execute directly within the vault.
Access to de-tokenize or re-identify data is controlled through a zero trust model. Policy-based rules control who sees what, when, where, and for how long on a row and column level.
Controlling access to vault data based on who is requesting the data.
The vault combines the principles of isolation and zero trust with privacy-enhancing technologies and governance controls to insulate your systems from ever touching PII directly. This places your AWS components beyond the scope of regulatory compliance, providing a higher level of data protection and adherence to data residency requirements.
Your AWS Services Handle Only De-identified Data
Let’s assume we have a simple application infrastructure as shown below with AWS Amplify providing the web server infrastructure, DynamoDB for application storage, and Redshift for warehousing.
Example web application infrastructure running on AWS.
Without a vault in place, everything within our AWS account is under compliance and security scope.
By introducing the vault as shown below (in this example, the collection of PII is handled directly from the vault), we de-scope all our AWS services. The services are only ever handling de-identified data, including the warehouse.
Many analytical operations can be performed on de-identified data, provided the tokens are consistently generated. A warehouse doesn’t need access to someone’s name; it only needs a consistently generated representation of the name to execute counts, group bys, and joins, as the sketch below illustrates.
Example of de-scoping AWS services with a data privacy vault.
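To see why consistency matters, here’s a toy Python sketch. It’s not how a vault implements tokenization (a real vault keeps the mapping isolated and governed), but it shows the key property: deterministic tokens preserve equality, which is all counts, group bys, and joins need.

import hashlib
import hmac
from collections import Counter

# Toy illustration only: deterministic de-identification via HMAC.
# A real vault's tokenization engine works differently, but the key
# property is the same: equal inputs yield equal tokens.
SECRET = b"known-only-to-the-vault"

def consistent_token(value: str) -> str:
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:12]

names = ["Alice", "Bob", "Alice", "Carol", "Alice"]
tokens = [consistent_token(n) for n in names]

# The warehouse can count and group without ever seeing a name
print(Counter(tokens))  # Alice's token appears 3 times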
Storing PII to Different Regionalized Vaults
With Skyflow, a data privacy vault company, you can host vaults in various global regions and route sensitive data to a specific regional vault for storage and use. For instance, consider how the following application architecture meets data residency requirements across multiple regions:
Using regional multiple vaults to comply with data residency requirements.
Your company’s site collects customer PII during account creation.
On the client side, the website detects the customer’s location.
Detecting that the customer is in the EU, the client-side code uses Skyflow’s SDK to collect the PII data and store it in your company’s data privacy vault in Frankfurt, Germany. Note: For customers based in the US, the PII data is instead routed to the data privacy vault in the US (in this case, Virginia).
The EU-based customer’s sensitive PII is stored in the EU-based data privacy vault, and Skyflow responds with de-identified data.
The client-side code sends the account request, now with de-identified data, to the server.
The server processes the request, storing the data (now de-identified and tokenized) in cloud storage in the “Oregon, US” region.
At the end of the week, your company’s Redshift instance in Tokyo, Japan, loads the data (already de-identified and tokenized) from cloud storage to perform analytics.
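To make the flow concrete, here’s a hedged sketch of the client-side routing described above. The vault URLs, table name, and SDK call shapes are illustrative assumptions rather than exact Skyflow APIs.

from skyflow.service_account import generate_bearer_token, is_expired
from skyflow.vault import Client, Configuration

bearer_token = ''

def token_provider():
    # Reuse the bearer token until it expires (per Skyflow's auth examples)
    global bearer_token
    if not is_expired(bearer_token):
        return bearer_token
    bearer_token, _ = generate_bearer_token('<YOUR_CREDENTIALS_FILE_PATH>')
    return bearer_token

# Illustrative regional vault endpoints (placeholders, not real URLs)
VAULT_URL_BY_REGION = {
    "EU": "https://<frankfurt-vault>.vault.skyflowapis.com",
    "US": "https://<virginia-vault>.vault.skyflowapis.com",
}

def store_pii(customer_region: str, pii: dict) -> dict:
    # Route the record to the vault in the customer's region; default to US
    vault_url = VAULT_URL_BY_REGION.get(customer_region, VAULT_URL_BY_REGION["US"])
    client = Client(Configuration('<YOUR_VAULT_ID>', vault_url, token_provider))
    # The response contains de-identified tokens that are safe to send
    # onward to the application server and cloud storage
    return client.insert({"records": [{"table": "users", "fields": pii}]})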
Deploying multiple vaults situated in different regions streamlines the management of your sensitive data, ensuring compliance with data residency requirements across all your markets.
The data privacy vault architecture significantly simplifies the complexities associated with data residency and compliance. Furthermore, by exempting Redshift (or any warehouse) from the compliance responsibilities tied to data residency, global analytics operations continue seamlessly within a single warehouse instance.
Final Thoughts
Compliance regulations, with their stringent data residency stipulations, require businesses to maintain rigorous standards for data localization, protection, privacy, and security. Adhering to these regulations is essential to mitigating the risks of breaches, penalties, and reputational damage. However, enterprises operating across multiple global regions, serving diverse customer bases, are left to navigate multiple regulatory landscapes at once.
Using data privacy vaults as your core infrastructure for customer PII offers a streamlined solution to simplify global compliance, particularly concerning AWS services and cloud storage.
With a data privacy vault, organizations gain the ability to centralize the security of all sensitive data, effectively removing AWS and cloud storage from their compliance scope. By deploying data privacy vaults in various regions, companies can ensure that sensitive data storage and transmission align with the specific laws and regulations of each operational jurisdiction, thereby enhancing their overall compliance and security posture.
If you have thoughts on this or questions about this approach, please reach out to me on LinkedIn.
Hugging Face was founded in 2016 and has grown to become one of the most prominent ML platforms. It’s commonly used to develop and disseminate state-of-the-art ML models and is a central hub for researchers and developers.
Sayak Paul is a Machine Learning Engineer at Hugging Face and a Google Developer Expert. He joins the show today to talk about how he entered the ML field, diffusion model training, the transformer-based architecture, and more.
The Data Cloud’s Cheese and Diamond Problem
Dec 04, 2023
In any given week, if you search the news for “data breach”, you’ll see headlines like the ones below.
Companies like MGM and Caesars spend millions of dollars on firewalls, SIEMs, HSMs, and a whole smorgasbord of cybersecurity tools and yet, they can’t protect your social security number.
From hotels and casinos to some of the most innovative technology companies in the world, why is it that companies with seemingly endless financial and talent resources can’t get a handle on their data security challenges?
I believe this is due to a fundamental misunderstanding about the nature of data that started over 40 years ago.
Back in the 1980s, as computers found their way more and more into businesses, we lived in a disconnected world. To steal someone’s data, you had to physically steal the box the data lived on. As a consequence, we assumed that all data is created equal, that it’s all simply ones and zeros. But this is wrong: not all data is created equal. Some data is special and needs to be treated that way.
In this blog post, I share my thoughts on what I refer to as the “Cheese and Diamond Problem” and how it has led to the data security challenges companies face today. I also explore an alternative approach, a new way of thinking, a privacy-by-engineering approach that moves us toward a world where security is the default rather than bolted on.
The Cheese and Diamond Problem
Imagine that in my house I have cheese and I have diamonds. As a gracious host, I want guests of my home to be able to access my cheese. They should be able to freely go into the refrigerator and help themselves to some delicious cheese and perhaps a cracker.
However, I don’t want just anyone to touch my diamonds. Perhaps my diamonds even have sentimental value because it’s a diamond ring that’s been passed down through many generations in my family. Clearly the diamond is special.
Yet, if I store my diamonds in the refrigerator next to my cheese, controlling access to the diamonds becomes much more challenging. By co-locating these very different objects, my refrigerator alone can’t ensure that my wife has access to both the diamonds and the cheese while my guests have access only to the cheese.
The rules of engagement for something like diamonds are completely different than the rules of engagement for cheese. We all understand this distinction when it comes to physical objects.
This is exactly why my passport and my children’s birth certificates aren’t in the junk drawer in my kitchen with my batteries and my flashlights. If someone breaks into my home and steals my batteries, it’s not that big a deal, but if someone steals my daughter’s birth certificate, then I not only feel like I’ve failed as a parent, but the information on her birth certificate is also now compromised forever. I can’t simply replace her date of birth.
Despite all of us intuitively understanding that some physical objects are different, that they’re special, we somehow miss this point when we work with data. We don’t apply this thinking to Personally Identifiable Information (PII). We treat it like any other form of transactional or application data. We stuff it in a database, pass it around, make a million copies, and this leads to a whole host of problems.
The PII Replication Problem
Let’s consider a simple example.
In the diagram below, which represents an abstraction of a modern system, a phone number is being collected in the front end of the application, perhaps during account creation. That phone number ends up being passed downstream through each node and edge of the graph and at each node, we potentially end up with a copy of the phone number.
We store it in our database, in the warehouse, but we may also end up with a copy in our log files and the backups of all these systems. Instead of just having one copy of the phone number, we now have many copies and we need to protect all those locations and control access consistently wherever the data is stored.
Imagine that instead of having one copy of your passport that you keep in a secure location, you made 10,000 copies and then distributed them all over the world. Suddenly keeping your passport safe becomes a much harder problem in all 10,000 locations than if you have one copy secure in your home.
But this is exactly what we do with data.
We copy it everywhere and then attempt to batten down the hatches across all these systems, keeping the policies and controls about who can see what, when, and where in sync. Additionally, because of the Cheese and Diamond Problem, we can’t adequately govern access to the data: intermixing our data conflates the rules of engagement about who has access to what. This quickly becomes an intractable problem because businesses don’t know what they’re storing or where it is, leading to the world we live in now, where major corporations suffer data breaches on a regular basis.
Not All Data is Equal
Businesses are collecting and processing more data than ever. With the explosion of generative AI, as much as we are in an AI revolution, we are also in a data revolution. We can’t have powerful LLMs without access to massive data.
Companies leverage their data to drive business decisions, product direction, help serve customers better, and even create new types of consumer experiences. However, as discussed, not all data is created equal; some data, like PII, is special.
Over time, we’ve recognized that other forms of data like encryption keys, secrets, and identity are special and need to be treated that way. There was a time when we stored secrets in our application code or database. We eventually realized that was a bad idea and moved them into secret managers.
Approaches to managing different types of sensitive data.
Despite this progress, we are still left without an accepted standard for the storage and management of sensitive PII data. PII deserves the same type of special handling. You shouldn’t be contaminating your database with customer PII.
Luckily, there’s a solution to this problem, originally pioneered by companies like Netflix, Google, Apple, and Goldman Sachs, and now touted by the IEEE as the future of privacy engineering: the PII data privacy vault.
The PII Data Privacy Vault
A data privacy vault isolates, protects, and governs access to sensitive customer data (i.e. PII) while also keeping it usable. With a vault approach, you remove PII from your existing infrastructure, effectively de-scoping it from the responsibility of compliance and data security.
A vault is a first principles architectural approach to data privacy and security, facilitating workflows like:
PII storage and management for regulated industries
PCI storage and payment orchestration
Data residency compliance
Privacy-preserving analytics
Privacy-preserving AI
Let’s go back to our example from earlier where we were collecting a phone number from the front end of an application.
In the vault world, the phone number is sent directly to the vault from the front end. From a security perspective, we ideally want to de-identify sensitive data as early in the life cycle as possible. The real phone number only exists within the vault, which acts as a single source of truth that’s isolated and protected outside of the existing systems.
Example of using a data privacy vault to de-scope an application.
The vault securely stores the phone number and generates a de-identified reference in the form of a token that gets passed back to the front end. The token has no mathematical connection to the original data, so it can’t be reverse engineered to reveal the original value.
This way, even if someone steals the data, as happened in the Capital One data breach, the tokenized data carries no value. In fact, Capital One was fined only because they failed to tokenize all regulated data; some records were merely encrypted, and those were the records that were compromised.
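A toy sketch makes the point (this assumes nothing about Skyflow’s actual implementation): a token is just a random reference whose mapping lives only inside the vault, so there’s nothing to decrypt or reverse engineer.

import secrets

def tokenize(value: str, vault_store: dict) -> str:
    # The token is random and value-independent; the mapping back to
    # the real value exists only inside the vault's protected store
    token = f"tok_{secrets.token_hex(8)}"
    vault_store[token] = value
    return token

vault_store = {}  # stands in for the vault's isolated storage
token = tokenize("+1-555-867-5309", vault_store)
print(token)  # e.g. tok_3f9a1c2b7d4e0a55, useless to a thief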
Revealing Sensitive Data
While it’s great to securely store sensitive data, if we simply lock it up and throw away the key, it’s not super useful. We store all this customer PII so we can use it.
For example, we may need to reveal some of the data to a customer support agent, an IT administrator, a data analyst, or the owner of the data. In these cases, if we absolutely need to reveal some of the data, we want to re-identify it as late as possible, for example during render. We also want to limit what a user has access to based on the operations they need to perform with the data. While I might be able to see my full phone number, a customer support agent likely needs only the last four digits, and an analyst may need only the area code for geo-based analytics.
The vault facilitates all of these use cases through a zero trust model where no one and nothing has access to data without explicit policies in place. The policies are built bottom-up, granting access to specific columns and rows of PII. This lets you control who sees what, when, where, for how long, and in what format, as in the illustrative example below.
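As an illustration, expressed as Python data rather than Skyflow’s actual policy syntax, the rules for a phone number column might look like this:

# Illustrative pseudo-policies: each role gets only the columns, rows,
# and output formats it needs to do its job
policies = {
    "account_owner": {
        "columns": {"phone_number": "plain_text"},
        "rows": "user_id = $current_user",
    },
    "support_agent": {
        "columns": {"phone_number": "mask_all_but_last_four"},
        "rows": "user_id IN assigned_tickets",
    },
    "data_analyst": {
        "columns": {"phone_number": "area_code_only"},
        "rows": "*",
    },
}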
Let’s consider the situation where we have a user logging into an application and navigating to their account page. On the account page, we want to show the user their name, email, phone number, and home address based on the information they registered with us.
In the application database, we’ll have a table similar to the one shown below where the actual PII has been replaced by de-identified tokens.
Example of users table within the application database.
As in the non-vault world, the application queries the application database for the user record associated with the logged-in user. The record is passed to the front end, and the front end exchanges the tokens for a representation of the original values, depending on the policies in place.
In the image below, the front end already has the tokenized data but needs to authenticate with the vault, attaching the identity of the logged-in user so that access is restricted based on the contextual information of the user’s identity. This is known as context-aware authorization.
Once authenticated and authorized, the front end can directly call the data privacy vault to reveal the true values of the user’s account information. But the front end only has access to this singular row of data and it’s limited to the few columns needed to render the information on the account page.
Example of revealing sensitive data for a single record.
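Here’s a hedged sketch of that exchange, using the general shape of Skyflow’s Python SDK (method and field names may differ from the current API, and the tokens and token provider are placeholders):

from skyflow.vault import Client, Configuration

def token_provider():
    # Placeholder: returns a valid Skyflow bearer token (see Skyflow's auth docs)
    return '<BEARER_TOKEN>'

client = Client(Configuration('<YOUR_VAULT_ID>', '<YOUR_VAULT_URL>', token_provider))

# Exchange the row's tokens for their original values. What comes back
# (full value, masked value, or an authorization error) depends on the
# policies attached to the logged-in user's identity.
response = client.detokenize({
    "records": [
        {"token": "<NAME_TOKEN>"},
        {"token": "<EMAIL_TOKEN>"},
        {"token": "<PHONE_TOKEN>"},
        {"token": "<ADDRESS_TOKEN>"},
    ]
})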
Sharing Sensitive Data
No modern application exists in a silo. Most applications need to share customer PII with third-party services to send emails or SMS messages, issue payments, or support other workflows. The vault architecture supports this as well, by using the vault as a proxy to the third-party service.
In this case, instead of calling a third-party API directly, you call the data privacy vault with the de-identified data. The vault knows how to re-identify the PII securely within its environment, and then securely share that with the third-party service.
An example of this flow for sending HIPAA-compliant forms of communication is shown below. The backend server calls the vault directly with tokenized data, and the vault then shares the actual sensitive data with the third-party communication service.
Example of using a vault to send HIPAA compliant communication.
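In sketch form, reusing the client from the previous example (the connection URL and request fields are illustrative assumptions, not a real Skyflow route):

from skyflow.vault import ConnectionConfig, RequestMethod

# The server sends only the token; the vault detokenizes it internally
# and forwards the real phone number to the communication provider
connection_config = ConnectionConfig(
    '<YOUR_CONNECTION_URL>',  # vault-side route to the SMS provider
    RequestMethod.POST,
    requestHeader={'Content-Type': 'application/json'},
    requestBody={
        'phone_number': '<PHONE_TOKEN>',  # detokenized inside the vault
        'message': 'Your test results are ready in the patient portal.',
    },
)
response = client.invoke_connection(connection_config)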
Final Thoughts
We’ve come a long way since building business applications in the 1980s, but we’ve failed to evolve our thinking regarding how we secure and manage customer PII. Point solutions like firewalls, encryption, and tokenization alone aren’t enough to address the fundamental problem. We need a new approach to cut to the root of the Cheese and Diamond Problem.
Not all data is the same; PII belongs in a data privacy vault.
The data privacy vault provides such an approach.
It’s an architectural approach to data privacy where security is the default. Multiple techniques, like polymorphic encryption, confidential computing, tokenization, and data governance, combine with the principles of isolation and zero trust to give you all the tools you need to store and use PII securely, without exposing your systems to the underlying data.
If you have comments or questions about this approach, please connect with me on LinkedIn. Thanks for reading!
Building a Privacy-Preserving LLM-Based Chatbot
Nov 27, 2023
As Large Language Models (LLMs) and generative AI continue to grow more sophisticated and available, many organizations are starting to build, fine-tune, and customize LLMs based on their internal data and documents. This can bring incredible efficiency and reliability to data-driven decision-making processes. However, this practice comes with its share of challenges, primarily around data privacy, protection, and governance.
Let’s consider the construction of the LLM itself, which is trained on a massive amount of data collected from public and private sources. Without careful anonymization and filtering, sensitive data — such as PII or intellectual property — may be inadvertently included in the training set, potentially leading to a privacy breach.
Furthermore, privacy concerns are introduced when interacting with LLMs, as users might input sensitive data, such as names, addresses, or even confidential business information. If these inputs aren’t handled properly, the misuse or exposure of this information is a genuine risk.
In this post, we’ll explore how to work with LLMs in a privacy-preserving way when building an LLM-based chatbot. As we walk through the technology from end-to-end, we’ll highlight the most acute data privacy concerns and we’ll show how using a data privacy vault addresses those concerns.
Let’s start by taking a closer look at the problem we need to solve.
The problem: Protecting sensitive information from exposure by a chatbot
Consider a company that uses an LLM-based chatbot for its internal operations. The LLM for the chatbot was built by modifying a pre-existing base model with embeddings created from internal company documents. The chatbot provides an easy-to-use interface that lets non-technical users within the company access information from internal data and documents.
The company has a sensitive internal project called “Project Titan.” Project Titan is so important and so sensitive that only people working on Project Titan know about it. In fact, the team often says: the first rule of Project Titan is don’t talk about Project Titan. Naturally, the team wants to take advantage of the internal chatbot and also include Project Titan specific information to speed up creation of design documents, documentation, and press releases. However, they need to control who can see details about this sensitive project.
What we have is a tangible and pressing privacy concern that sits at the intersection of AI and data. These challenges appear extremely difficult to solve in a scalable and production-ready way. Simply having a private version of the LLM doesn’t address the core issue of data access.
The proposed solution: Sensitive data de-identification and fine-grained access control
Ultimately, we need to identify the key points where sensitive data must be de-identified during the process of building (or fine-tuning) the LLM and the end user’s interaction with the LLM-based chatbot. After careful analysis, we’ve identified that there are two key points in the process where we need to de-identify (and later re-identify) sensitive data:
Before ingestion: When documents from Project Titan are used to create embeddings, the project name, any PII, and anything else sensitive to the project must be de-identified. This de-identification should occur as part of the ETL pipeline prior to data ingestion into the LLM.
During use: When a user inputs data to the chatbot, any sensitive data included in that input must also be de-identified.
You can de-identify sensitive data using the polymorphic encryption and tokenization engine that’s included with Skyflow Data Privacy Vault. This covers detection of PII, but also terms you define within a sensitive data dictionary, like intellectual property (e.g., “Project Titan”).
Of course, only Project Titan team members who use the chatbot should be able to access the sensitive project data. Therefore, when the chatbot forms a response, we’ll rely on Skyflow’s governance engine (which provides fine-grained access control) and detokenization API to retrieve the sensitive data from the data privacy vault, making it available only to authorized end users.
Before we dive into the technical implementation, let’s go through a brief overview of foundational LLM concepts. If you’re already familiar with these concepts, you can skip the next section.
A brief primer on LLMs
LLMs are sophisticated artificial intelligence (AI) systems designed to analyze, generate, and work with human language. Built on advanced machine learning architectures, they are trained on vast quantities of text data, enabling them to generate text that is convincingly human-like in its coherence and relevance.
LLMs leverage a technology called transformers — one example is GPT, which stands for Generative Pre-Trained Transformer — to predict or generate a piece of text when given input or context. LLMs learn from patterns in the data they are trained on and then apply these learnings to understand newly given content or to generate new content.
Despite their benefits, LLMs pose potential challenges in terms of privacy, data security, and ethical considerations. This is because LLMs can inadvertently memorize sensitive information from their training data or generate inappropriate content if not properly regulated or supervised. Therefore, the use of LLMs necessitates effective strategies for data handling, governance, and preserving user privacy.
A technical overview of the solution
When embarking on any LLM project, we need to start with a model. Many open-source LLMs have been released in recent months, each with its specific area of focus. Instead of building an entire LLM model from scratch, many developers choose a pre-built model and then adjust the model with vector embeddings generated from domain-specific data.
Vector embeddings encapsulate the semantic relationship between words and help algorithms understand context. The embeddings act as an additional contextual knowledge base to help augment the facts known by the base model.
In our case, we’ll begin with an existing model from Hugging Face, and then customize it with embeddings. Hugging Face provides ML infrastructure services as well as open-source models and datasets.
In addition to the Hugging Face model, we’ll use the following additional tools to build out our privacy-preserving LLM-based ETL pipeline and chatbot:
LangChain, an open-source Python library that chains together components typically used for building applications (such as chatbots) powered by LLMs
Snowflake, which we’ll use for internal document and data storage
Snowpipe, which we’ll use with Snowflake for automated data loading
Chroma, an AI-native, open-source database for vector embeddings
Streamlit, an open-source framework for building AI/ML-related applications using Python
RetrievalQA, a question-answering chain in LangChain which gets documents from a Retriever and then uses a QA chain to answer questions from those documents
The following diagram shows the high-level ETL and embeddings data flow:
Example of the ETL and embeddings data flow.
The ETL and embeddings flows from end to end are:
ETL
Start with source data, which may contain sensitive data.
Send data to Skyflow Data Privacy Vault for de-identification.
Use Snowpipe to load clean data into Snowflake.
Create vector embeddings
Load documents from Snowflake into LangChain.
Create vector embeddings with LangChain.
Store embeddings in Chroma.
Once the model has been customized with the Project Titan information, the user interaction and inference flow is as follows:
User interaction and inference information flow
1. Chat UI input
Accept user input via Streamlit’s chat UI.
Send user input to Skyflow for de-identification.
2. Retrieve embeddings
Get the embeddings from Chroma and attach to RetrievalQA.
3. Inference
Send clean data to RetrievalQA.
Use QA chain in RetrievalQA to answer the user’s question.
4. Chat UI response
Send RetrievalQA’s response to Skyflow for detokenization.
Send re-identified data to Streamlit for display to the end user.
Now that we’re clear on the high-level process, let’s dive in and take a closer look at each step.
ETL: Cleaning the source data
Cleaning the source data with Skyflow Data Privacy Vault is fairly straightforward and I’ve covered some of this in a prior post. In this case, we need to process all the source documents for Project Titan available in an AWS S3 bucket.
Skyflow will store the raw files, de-identify PII and IP, and save the clean files to another S3 bucket.
import boto3

from skyflow.errors import SkyflowError
from skyflow.service_account import generate_bearer_token, is_expired
from skyflow.vault import Client, ConnectionConfig, Configuration, RequestMethod

# Authentication to the Skyflow API
bearerToken = ''

def tokenProvider():
    global bearerToken
    if not is_expired(bearerToken):
        return bearerToken
    bearerToken, _ = generate_bearer_token('<YOUR_CREDENTIALS_FILE_PATH>')
    return bearerToken

def processTrainingData(trainingData):
    try:
        # Vault connection configuration
        config = Configuration('<YOUR_VAULT_ID>', '<YOUR_VAULT_URL>', tokenProvider)

        # Define the connection API endpoint
        connectionConfig = ConnectionConfig(
            '<YOUR_CONNECTION_URL>',
            RequestMethod.POST,
            requestHeader={
                'Content-Type': 'application/json',
                'Authorization': '<YOUR_CONNECTION_BASIC_AUTH>'
            },
            requestBody={
                'trainingData': trainingData
            }
        )

        # Connect to the vault
        client = Client(config)

        # Call the Skyflow API to de-identify the training data
        response = client.invoke_connection(connectionConfig)

        # Define the S3 bucket name and key for the clean file
        bucketName = "clean-data-bucket"
        fileKey = "{timestamp}-{generated-uuid}"

        # Write the de-identified data to a file in memory
        fileContents = response.training_data.encode("UTF-8")

        # Upload the file to S3
        s3 = boto3.client("s3")
        s3.put_object(Bucket=bucketName, Key=fileKey, Body=fileContents)
    except SkyflowError as e:
        print('Error Occurred:', e)
Next, we’ll configure Snowpipe to detect new documents in our S3 bucket and load that data into Snowflake. To do this, we’ll need to create the following in Snowflake:
CREATE OR REPLACE TABLE custom_training_data (
  training_text BINARY
);

CREATE OR REPLACE FILE FORMAT training_data_json_format
  TYPE = JSON;

-- Auto-ingest requires an external stage; this one points at the
-- clean data bucket populated in the previous step.
CREATE OR REPLACE STAGE training_data_stage
  URL = 's3://clean-data-bucket/'
  FILE_FORMAT = training_data_json_format;

CREATE PIPE custom_training_data_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO custom_training_data
  FROM (SELECT $1:records.fields.training_text
        FROM @training_data_stage t)
  ON_ERROR = 'continue';
With that, we have raw data that goes through a de-identification process, and we store only the clean, de-identified data in Snowflake. Any sensitive data related to Project Titan is now obscured from the LLM, but because of Skyflow’s polymorphic encryption and tokenization, the de-identified data has referential integrity, meaning we can return the data to its original form when interacting with the chatbot.
Creating vector embeddings: Customizing our LLM
Now that we have our de-identified text data stored in Snowflake, we’re confident that all information related to Project Titan has been properly concealed. The next step is to create embeddings of these documents.
We’ll use the Instructor model hosted on Hugging Face as our embedding model. We store our embeddings in Chroma, a vector database built expressly for this purpose. This allows for downstream retrieval and search over the textual data stored in our vector database.
The code below imports the chat model, loads the embedding model, and creates the vector store.
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

model_id = "hkunlp/instructor-large"
embed_model = HuggingFaceEmbeddings(model_name=model_id)

vector_store = Chroma("langchain_store", embed_model)
Next, we need to load all documents and add them to the vector store. For this, we use the Snowflake document loader in LangChain.
from langchain.document_loaders import SnowflakeLoader

import settings as s

QUERY = "select training_text as source from custom_training_data"
snowflake_loader = SnowflakeLoader(
    query=QUERY,
    user=s.SNOWFLAKE_USER,
    password=s.SNOWFLAKE_PASS,
    account=s.SNOWFLAKE_ACCOUNT,
    warehouse=s.SNOWFLAKE_WAREHOUSE,
    role=s.SNOWFLAKE_ROLE,
    database=s.SNOWFLAKE_DATABASE,
    schema=s.SNOWFLAKE_SCHEMA,
    metadata_columns=["source"],
)
training_documents = snowflake_loader.load()
vector_store.add_documents(training_documents)
With the training document and vector store created, we create the question-answering chain.
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0.2, model_name='gpt-3.5-turbo'),
    chain_type="stuff",
    retriever=vector_store.as_retriever())

result = qa.run("What is Project Titan?")
This question (“What is Project Titan?”) will fail because the model doesn’t actually know about Project Titan, it knows about a de-identified version of the string “Project Titan”.
To issue a query like this, the query needs to be first sent through Skyflow to de-identify the string and then the de-identified version is passed to the model. We’ll tackle this next as we start to put the pieces together for our chat UI.
Chat UI Input: Preserving privacy of user-supplied data
We’re ready to focus on the chatbot UI aspect of our project, dealing with accepting and processing user input as well as returning results with Project Titan data detokenized when needed.
For this portion of the project, we will use Streamlit for our UI. The code below creates a simple chatbot UI with Streamlit.
import openai
import streamlit as st

st.title("Acme Corp Assistant")

# Initialize the chat messages history
if "messages" not in st.session_state.keys():
    st.session_state.messages = [
        {"role": "assistant", "content": "Hello! \nHow can I help?"}
    ]

# Prompt for user input and save it to the history
if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})

# Display the prior chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

# If the last message is not from the assistant, we need to generate a new response
if st.session_state.messages[-1]["role"] != "assistant":
    # Generate a response
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = "TODO"
            message = {"role": "assistant", "content": response}
            st.session_state.messages.append(message)
Our simple chat UI looks like this:
As you can see, the UI accepts user input, but doesn’t currently integrate with our LLM. Next, we need to send the user input to Skyflow for de-identification before we use RetrievalQA to answer the user’s question. Let’s start with accepting and processing our input data.
To detect and de-identify plaintext sensitive data with Skyflow, we can use the detect API endpoint with code similar to the following:
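A sketch under stated assumptions: it reuses the vault connection pattern (and tokenProvider) from the ETL snippet above, with a hypothetical detect endpoint and response field standing in for the real API.

def deIdentifyText(text):
    # Hypothetical detect connection: finds PII and dictionary terms
    # (like "Project Titan") in free text and replaces them with tokens
    connectionConfig = ConnectionConfig(
        '<YOUR_DETECT_CONNECTION_URL>',  # assumed endpoint; see Skyflow docs
        RequestMethod.POST,
        requestHeader={
            'Content-Type': 'application/json',
            'Authorization': '<YOUR_CONNECTION_BASIC_AUTH>'
        },
        requestBody={'text': text}
    )
    client = Client(Configuration('<YOUR_VAULT_ID>', '<YOUR_VAULT_URL>', tokenProvider))
    response = client.invoke_connection(connectionConfig)
    return response.text  # the de-identified input (assumed response field)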
Now that we’ve de-identified the user input data, we can send the question to RetrievalQA, which will then use a QA chain to answer the question from our documents.
We now have our response from RetrievalQA. However, we need to take one additional step before we can send it back to our user: detokenize (re-identify) our response through Skyflow’s detokenization API. This is fairly straightforward, similar to previous API calls to Skyflow.
Everything we need is encapsulated by the function performInference, which calls a function to reIdentifyText after the completion is returned.
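A minimal skeleton consistent with that description might look like the following, where reIdentifyText mirrors deIdentifyText through an assumed detokenization connection:

def reIdentifyText(text):
    # Assumed detokenization connection: swaps vault tokens in the
    # completion back to real values, subject to governance policies
    connectionConfig = ConnectionConfig(
        '<YOUR_DETOKENIZE_CONNECTION_URL>',  # assumed endpoint
        RequestMethod.POST,
        requestHeader={
            'Content-Type': 'application/json',
            'Authorization': '<YOUR_CONNECTION_BASIC_AUTH>'
        },
        requestBody={'text': text}
    )
    client = Client(Configuration('<YOUR_VAULT_ID>', '<YOUR_VAULT_URL>', tokenProvider))
    response = client.invoke_connection(connectionConfig)
    return response.text

def performInference(userInput):
    # De-identify the question, answer it over the de-identified
    # embeddings, then re-identify the completion for authorized users
    cleanInput = deIdentifyText(userInput)
    completion = qa.run(cleanInput)
    return reIdentifyText(completion)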
Who can see what and in which format is controlled by Skyflow’s governance engine. There’s too much to cover here, but if you want to learn more, see Introducing the Skyflow Data Governance Engine.
These final steps connect our entire application from end-to-end. Now, we need to update our UI code from above so that the response is correctly set.
# If the last message is not from the assistant, we need to generate a new response
if st.session_state.messages[-1]["role"] != "assistant":
    # Generate a response
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = performInference(st.session_state.messages[-1]["content"])
            message = {"role": "assistant", "content": response}
            st.session_state.messages.append(message)
With these pieces in place, here’s a quick demo of our privacy-preserving LLM-based chatbot in action:
Example of the privacy-preserving bot in action.
Tying it all together
In this article, we walked through the general steps to construct a privacy-preserving LLM-based chatbot. With organizations increasingly using LLM-based applications in their businesses and operations, the need to preserve data privacy has become acute. Concerns about protecting the privacy and security of sensitive data are the biggest adoption blocker preventing many companies from making full use of AI with their datasets.
Solving this problem requires identifying the key points where sensitive data might enter your system and need to be de-identified. When working with LLMs, those points occur during model training — both when building an LLM or customizing one — and at the user input stage. You can use Skyflow Data Privacy Vault to implement effective de-identification and data governance for LLM-based AI tools like chatbots.
Building an LLM-based chatbot requires the use of several tools to ensure that data is handled in a manner that preserves privacy. Taking privacy-preserving measures is critical to prevent the misuse or exposure of sensitive information. By using the tools and methods we’ve demonstrated here, companies can leverage AI’s benefits and promote efficient data-driven decision-making while prioritizing data privacy and protection.
Cloud-based software development platforms such as GitHub Codespaces continue to grow in popularity. These platforms are attractive to enterprise organizations because they can be managed centrally with security controls. However, many, if not most, developers prefer a local IDE.
Daytona is aiming to bridge that gap. It’s a layer between a local IDE and a backend server, so developers can work locally while interfacing invisibly with a remote environment. Ivan Burazin is the CEO and Co-Founder at Daytona, and he joins the show today to talk about how Daytona works, Spotify as an inspiration for his product, and more.
Jordi Mon Companys is a product manager and marketer who specializes in software delivery, developer experience, cloud native, and open source. He has developed his career at companies like GitLab, Weaveworks, Harness, and other platform and devtool providers. His interests range from software supply chain security to open source innovation. You can reach out to him on Twitter at @jordimonpmm.
Knowledge graphs are an intuitive way to define relationships between objects, events, situations, and concepts. Their ability to encode this information makes them an attractive database paradigm.
Hume is a graph-based analysis solution developed by GraphAware. It represents data as a network of interconnected entities and provides analysis capabilities to extract insights from the data. Luanne Misquitta is VP of Engineering at GraphAware and she joins the show today to talk about graph databases, and the engineering of Hume.
Starting her career as a software developer, Jocelyn Houle is now a Senior Director of Product Management at Securiti.ai, a unified data protection and governance platform. Before that, she was an Operating Partner at Capital One Ventures, investing in data and AI startups. Jocelyn has been a founder of two startups and a full-lifecycle technical product manager at large companies like Fannie Mae, Microsoft, and Capital One. Follow Jocelyn on LinkedIn or Twitter @jocelynbyrne.
One Snowflake, Multiple Vaults: A Solution to Data Residency
Nov 20, 2023
Data residency requirements, which govern where sensitive data can be stored or processed in the cloud (or on an on-prem server), are a common feature of many modern data protection laws. Because of data residency requirements, the location of sensitive data has significant regulatory compliance implications in countries and regions around the world.
In this post, we’ll look at the challenges of managing data residency with Snowflake. We’ll start by examining how Snowflake Cloud Regions address data residency challenges, and consider the compliance implications of this approach — especially when loading data from cloud storage. Then, we’ll look at how to simplify data residency compliance using one or more regional data privacy vaults.
Let’s begin with a deeper dive into data residency, and how it impacts compliance.
The implications of data residency on compliance
When you work with personally identifiable information (PII), where you store and process this information has a direct impact on your legal compliance requirements. Some jurisdictions have regulations that govern the protection and privacy of their residents’ PII, restricting how and where it’s used by businesses and other organizations.
For example, the personal data (i.e., PII) of European Union residents cannot be transferred outside the EU without appropriate safeguards.
The laws of each jurisdiction impact how you transmit, manage, process, and store sensitive data in that jurisdiction. Because data residency dictates where, geographically, data is stored in the cloud, it becomes a critical concern in cloud environments that handle sensitive data.
Choose your cloud region carefully
Cloud service providers have data centers located in multiple regions around the world. When businesses sign up for cloud services and configure storage regions and other tooling, they select specific regions where their data is stored.
For many businesses, the selection of regions and locations for data storage is an afterthought.
But, treating this decision as an afterthought is a costly mistake that can come back to haunt you if you’re handling sensitive data. That’s because choosing storage regions is a weighty decision that can have a long-term impact on compliance, and on your business operations.
Snowflake Cloud Regions: a data residency solution?
Snowflake Cloud Regions let you choose the geographic location where your Snowflake data is stored across the data centers provided by the Snowflake-supported public cloud providers — AWS, GCP, and Azure. Each cloud provider offers a set of regions across the globe, with specific geographic data center locations in each cloud provider region.
If your company uses Snowflake Cloud Regions, you have your choice of providers, as well as regions where your data can be stored. When you create an account to deploy and set up Snowflake, whichever region you select becomes the primary location for data storage and for data processing resources.
At first glance, it might seem like Snowflake Cloud Regions provides a simple, effective solution to your data residency and compliance concerns. But for global companies who need global analytics, it isn’t that simple. That’s because, as noted in the Snowflake Cloud Regions documentation:
Each Snowflake account is hosted in a single region. If you wish to use Snowflake across multiple regions, you must maintain a Snowflake account in each of the desired regions.
This means that for each region where your business operates that has data residency requirements, you’ll need a different Snowflake account hosted in that region. Compliance becomes increasingly complex as you scale globally to more and more regions around the world. With this approach, running global analytics operations across different accounts to get a comprehensive view of your business can be a massive and ongoing challenge.
Instead of managing multiple Snowflake accounts with multiple Snowflake instances distributed in various regions around the world, you’d rather maintain a Snowflake instance in a single region to support global data operations. However, you still need to consider the need to honor data residency requirements for sensitive data so you can uphold your compliance obligations and safeguard customer trust.
For example, if you collect the personal data (PII) of customers located in the EU, but your Snowflake instance is located somewhere else, then you need to think through the privacy and compliance impact of storing and processing that data.
Loading data from cloud storage into Snowflake
Snowflake also lets businesses load data from cloud storage services like AWS S3, Google Cloud Storage, and Microsoft Azure Blob Storage, regardless of which cloud platform hosts the business’s Snowflake account. This can present additional challenges when working to ensure data residency compliance.
For example, let’s say that your company collects PII from both US and EU customers using its website. And, let’s say that this sensitive data is then stored in a Google Cloud Storage bucket that’s located in the AUSTRALIA-SOUTHEAST1 (Sydney) region.
How does transmitting this PII data to Australia, and then storing it in Australia, affect your compliance with regulations like the EU’s GDPR?
The answer is: doing this likely puts you out of compliance with GDPR. This is just one example of how the location where sensitive data is stored — and where it’s processed and replicated — complicates the compliance requirements faced by businesses that handle sensitive PII.
Businesses that handle PII must ensure regulatory compliance by aligning their choice of cloud storage regions with the data residency requirements of markets where they operate.
And beyond compliance issues, businesses should also consider data transfer costs. Transferring data between cloud storage regions can incur significant additional costs, especially if your company is frequently transferring large volumes of data. So, we not only have compliance concerns with cross-border transfers of PII, we also have a cost concern.
So, to briefly recap our problem:
Countries and regions have their own laws and regulations that govern how to handle their residents’ sensitive data (PII).
The geographic location where your business stores and processes sensitive data impacts whether you’re compliant with the data residency requirements of the markets where you operate.
If you use Snowflake to perform analytics on PII, then the complexity of meeting your compliance obligations will depend on the location of your Snowflake account.
If you load PII data into Snowflake from cloud storage, then your compliance obligations are also impacted by the location of your cloud storage.
So, how can we meet data residency requirements, support global analytics operations, and remove the operational overhead of managing multiple Snowflake accounts and instances?
We can solve our data residency problems and protect sensitive data with one or more data privacy vaults.
How a data privacy vault simplifies data privacy
A data privacy vault isolates, protects, and governs access to sensitive customer data. Sensitive data is stored in the vault, while opaque tokens that serve as references to this data are stored in traditional cloud storage or used in data warehouses. A data privacy vault can store sensitive data in a specific geographic location, and tightly controls access to this data. Other systems only have access to non-sensitive tokenized data.
In the example architecture shown below, a phone number is collected by a front end application. Ideally, we should de-identify (i.e., tokenize) this sensitive information as early in the data lifecycle as possible. A data privacy vault lets us do just that.
This phone number, along with any other PII, is stored securely in the vault, which is isolated outside of your company’s existing infrastructure. Any downstream services — the application database, data warehouse, analytics, any logs, etc. — store only a token representation of the data, and are removed from the scope of compliance:
Example of reducing compliance scope with a data privacy vault
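For example, with illustrative field values and token formats (not Skyflow’s actual output), the same customer record looks like this on either side of the vault boundary:

# What the front end collects; the real values live only in the vault
raw_record = {
    "name": "Jane Diaz",
    "email": "jane.diaz@example.com",
    "phone_number": "+1-555-867-5309",
}

# What the application database, cloud storage, and Snowflake store
tokenized_record = {
    "name": "tok_9f2c41d8a03b",
    "email": "x7kq2w@token.example",
    "phone_number": "+1-472-555-0137",  # format-preserving token
}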
Snowflake handles only de-identified data
Because no sensitive data is stored outside the data privacy vault, your compliance scope is restricted to just the vault. This removes the compliance burden from your Snowflake instance.
Example pipeline where sensitive data is isolated and protected within a data privacy vault
To satisfy data residency requirements, we can extend this approach by using multiple regional data privacy vaults placed near customers whose data is subject to these requirements. With sensitive data stored in these data privacy vaults, Snowflake contains only de-identified, tokenized data. It no longer matters if you operate a single global instance of Snowflake or multiple Snowflake accounts across different regions because data residency concerns no longer apply to your Snowflake instances.
Compliance with data residency requirements now depends solely on where your data privacy vaults are located. You no longer need to worry about data residency for all the different parts of your data tech stack, including cloud storage and Snowflake. All sensitive data goes into your data privacy vaults, and these vaults become the only component of your architecture subject to data residency requirements.
Store PII in a data privacy vault in a specific region
With Skyflow Data Privacy Vault you can host your vaults in a wide variety of regions around the world. You can also route sensitive data to a data privacy vault located in a specific region for storage.
For example, consider how the application architecture shown below supports data residency requirements from multiple regions:
Using vaults to satisfy multiple data residency requirements for one Snowflake instance
Your company’s e-commerce site collects customer PII whenever a customer places an order.
On the client side, the website detects the customer’s location.
Detecting that the customer is in the EU, the client-side code uses Skyflow’s API to send the PII data to your company’s data privacy vault in Frankfurt, Germany. Note: For customers based in the US, the PII data is instead routed to the data privacy vault in the US (in this case, Virginia).
This EU-based customer’s sensitive PII is stored in the EU-based data privacy vault, and Skyflow’s API responds with tokenized data.
The client-side code sends the customer order request, now with tokenized data, to the server.
The server processes the order, storing the data (now de-identified and tokenized) in cloud storage in the “Oregon, US” region.
At the end of the week, your company’s Snowflake instance in Tokyo, Japan, loads the data (already de-identified and tokenized) from cloud storage to perform analytics.
By using multiple vaults located in different regions around the world, you can easily manage all of your sensitive data to meet various data residency compliance obligations across each of your global markets.
The data privacy vault architectural pattern vastly simplifies the challenges of data residency and compliance. Additionally, by de-scoping Snowflake from the compliance burden of data residency, global analytics executes as normal — within a single Snowflake instance.
Final thoughts
Compliance regulations and their data residency requirements require that businesses uphold stringent standards for data localization, protection, privacy, and security to reduce their risk of breaches, penalties, and reputational damage. However, businesses with customers (and data) located in a variety of global regions face the added challenge of managing multiple regulations across jurisdictions.
Data privacy vaults let businesses simplify their global compliance obligations around data residency as they relate to Snowflake and cloud storage. By isolating and securing all sensitive data in one or more vaults, companies remove Snowflake and cloud storage from their compliance footprint. And by deploying vaults in different regions, they can help ensure that sensitive data is stored and transmitted according to the laws and regulations of each specific region where they operate.
Speechlab and Realtime Translation with Ivan Galea
Nov 02, 2023
Speech technology has been around for a long time, but in the last 12 months it’s undergone a quantum leap. New speech synthesis models are able to produce speech that’s often indistinguishable from real speech. I’m sure many listeners have heard deep fakes where computer speech perfectly mimics the voice of famous actors or public figures. A major factor in driving the ongoing advances is generative AI.
Speechlab is at the forefront of using new AI techniques for realtime dubbing, which is the process of converting speech from one language into another. For the interested listener, we recommend hearing the examples with President Obama speaking Spanish or Elon Musk speaking Japanese in this YouTube video. Ivan Galea is the Co-founder and President at Speechlab and he joins the show to talk about how we’re on the cusp of reaching the holy grail of speech technology – real time dubbing – and how this will erase barriers to communication and likely transform the world.
This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His best-selling book, Architecting for Scale (O’Reilly Media), is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments.
Lee hosts the podcast Modern Digital Business, an engaging and informative show produced for people looking to build and grow their digital business with the help of modern applications and processes developed for today's fast-moving business environment. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com, and see all his content at leeatchison.com.
If you’re a developer, there’s a good chance you’ve experimented with coding assistants like GitHub Copilot. Many developers have even fully integrated these tools into their workflows. One way these tools accelerate development is by autocompleting entire blocks of code. The AI achieves this by having awareness of the surrounding code. It understands context. However, in many cases the context available to an AI is limited. This restricts the AI’s ability to suggest more sweeping changes to a codebase, or even to refactor an entire application.
Quinn Slack is the CEO of Sourcegraph. He is now hard at work on the challenge of giving more context to AI – to make it aware of entire codebases, dependencies, error logs, and other data. Quinn joins the show today to talk about what it takes to move beyond code autocomplete, how to develop the next generation of coding AI, and what the future looks like for software engineers and programming languages.
Josh Goldberg is an independent full time open source developer in the TypeScript ecosystem. He works on projects that help developers write better TypeScript more easily, most notably on typescript-eslint: the tooling that enables ESLint and Prettier to run on TypeScript code. Josh regularly contributes to open source projects in the ecosystem such as ESLint and TypeScript.
Josh is a Microsoft MVP for developer technologies and the author of the acclaimed Learning TypeScript (O’Reilly), a cherished resource for any developer seeking to learn TypeScript without any prior experience outside of JavaScript. Josh regularly presents talks and workshops at bootcamps, conferences, and meetups to share knowledge on TypeScript, static analysis, open source, and general frontend and web development.
You can find Josh on: Bluesky, Fosstodon, Twitter, Twitch, YouTube, and joshuakgoldberg.com.
Stack Overflow in the AI era with Ellen Brandenberger
Oct 03, 2023
When Stack Overflow launched in 2008, it lowered the barrier to writing complex software. It solved the longstanding problem of accessing accurate and reliable programming knowledge by offering a collaborative space where programmers could ask questions, share insights, and receive high-quality answers from a community of experts.
Generative AI has changed the way programmers want to consume this knowledge. It has also opened new possibilities for getting personalized, real-time responses. Stack Overflow has decided to put a fifth of its organizational effort into generative AI to improve the user experience of the website.
Ellen Brandenberger leads the Product Innovation team at Stack Overflow and she joins us in this episode.
AI for Software Delivery with Birgitta Böckeler
Aug 15, 2023
AI-assisted software delivery refers to the utilization of artificial intelligence to assist, enhance, or automate various phases of the software development lifecycle. AI can be used in numerous aspects of software development, from requirements gathering to code generation to testing and monitoring. The overarching aim is to streamline software delivery, reduce errors and, ideally, reduce the time and costs associated with software development.
Birgitta Böckeler is the Global Lead for AI-assisted Software Delivery at Thoughtworks and she joins us in this episode. We discuss how the latest advances in large language models are revolutionizing software development.
Jordi Mon Companys is a product manager and marketer who specializes in software delivery, developer experience, cloud native, and open source. He has developed his career at companies like GitLab, Weaveworks, Harness, and other platform and devtool providers. His interests range from software supply chain security to open source innovation. You can reach out to him on Twitter at @jordimonpmm.
Generative pre-trained transformer models, or GPT models, have countless applications and are being rapidly deployed across a wide range of domains.
However, using GPT models without appropriate safeguards can lead to leakage of sensitive data. This concern underscores the critical need for privacy and data protection.
Skyflow LLM Privacy Vault prevents sensitive data from reaching GPTs. Amruta Moktali is the Chief Product Officer at Skyflow and she joins us today. We discuss generative AI, how the technology is different from other AI approaches, and how we can use this technology in a safe and ethical manner.
Data Investing and the MAD with Matt Turck
Mar 10, 2023
There are many types of early stage funding available from friends and family to seed to series A. Some firms invest across a wide set of technologies and seek only to provide capital. Others are in it for the long haul – they focus on specific areas of technology and develop both long term relationships and deep expertise over time.
Today, we are interviewing Matt Turck of FirstMark Capital, who is in it for the long haul and whose portfolio companies include Dataiku, Crossbeam, Ada, Cockroach Labs, ClickHouse, and more. We talk about Matt's career, his investment point of view, founding the Data-driven NYC community, and the recent release of the 2023 MAD, an industry resource for understanding the machine learning, AI, and data landscape.
Be sure to check out the show notes for links to the MAD.
This episode is hosted by Jocelyn Houle. Follow Jocelyn on LinkedIn or on Twitter @jocelynbyrne.
Surviving ChatGPT with Christian Hubicki
Feb 24, 2023
ChatGPT is an artificial intelligence language model developed by OpenAI. It is part of the GPT (Generative Pre-trained Transformer) family of models, which are designed to generate human-like text based on input prompts. ChatGPT is specifically trained to carry out conversational tasks, such as answering questions, completing sentences, and engaging in dialogue. It has been pre-trained on a large corpus of text data and fine-tuned on specific tasks to improve its performance. As a result, ChatGPT can generate responses that are often coherent, relevant, and natural-sounding.
Christian Hubicki is an Assistant Professor in the Robotics Department at Florida State University. He joins us today to discuss ChatGPT and its implications. We also discuss the future of artificial intelligence in general.
This show is hosted by Sean Falconer. Sean is the Head of Developer Relations and Marketing @Skyflow. Follow Sean at @seanfalconer
Automatic Database Tuning with Andy Pavlo
Sep 23, 2022
The default configuration in most databases is meant for broad compatibility rather than performance. Database tuning is a process in which the configurations of a database are modified to achieve optimal performance. Databases have hundreds of configuration knobs that control various factors, such as the amount of memory to use for caches or how often the data is written to the storage.
The problem with these knobs is that they are not standardized (i.e., two databases may have a different name for the same knob), not independent (i.e., changing one knob can impact others), and not universal (i.e., what works for one application may be suboptimal for another).
In reality, information about the effects of the knobs typically comes only from (expensive) experience.
OtterTune is automatic database tuning software that promises to overcome these problems. It uses machine learning to tune the configuration knobs of your database automatically to improve performance.
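To see why automating this is valuable, consider a toy tuner that randomly samples knob configurations and keeps the best one. This is deliberately naive and is not OtterTune's actual algorithm (OtterTune uses machine learning models trained on past tuning sessions); the benchmark here is a stub standing in for applying a configuration and measuring a real workload:

```python
import random

# Candidate values for two illustrative knobs. Real databases expose
# hundreds of knobs whose effects interact with one another.
KNOB_SPACE = {
    "buffer_pool_mb": [256, 512, 1024, 2048, 4096],
    "checkpoint_interval_s": [30, 60, 120, 300],
}

def run_benchmark(config: dict) -> float:
    """Stand-in for applying a config, restarting the database, and
    measuring workload throughput; here it is just a noisy fake objective."""
    score = config["buffer_pool_mb"] ** 0.5 - 0.01 * config["checkpoint_interval_s"]
    return score + random.gauss(0, 0.5)

best_config, best_score = None, float("-inf")
for _ in range(20):  # each trial is expensive on a real system
    config = {knob: random.choice(values) for knob, values in KNOB_SPACE.items()}
    score = run_benchmark(config)
    if score > best_score:
        best_config, best_score = config, score

print(best_config, round(best_score, 2))
```

Because each real trial means restarting the database and replaying a workload, trials are expensive, which is exactly why learning from previous tuning sessions beats naive search.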
In this episode, we interview Andy Pavlo. Andy is a Database Professor at Carnegie Mellon and Co-Founder of OtterTune.
Practical Machine Learning in JavaScript with Charlie Gerard
Sep 18, 2022
Originally published on January 1, 2022.
Charlie Gerard is an incredibly productive developer. In addition to being the author of Practical Machine Learning in JavaScript, her website charliegerard.dev has a long list of really interesting side projects exploring the intersection of human computer interaction, computer vision, interactivity, and art. In this episode we touch on some of these projects and broadly explore how practical it is to bring interesting HCI concepts into one’s work.
At Lyft, Ketan Umare worked on Flyte, an orchestration system for machine learning. Flyte provides reliability and APIs for machine learning workflows, and is used at companies outside of Lyft such as Spotify.
Since leaving Lyft, Ketan founded Union.ai, a company focused on productionizing Flyte as a service. He joins the show to talk about the architecture and usage of Flyte, as well as how he is formulating the company around it.
Ad-free Search on Neeva with Darin Fisher
Jan 25, 2022
Historically, search engines made money by showing sponsored ads alongside organic results. As the idiom goes, if you’re not paying for something, you are the product. Neeva is a new take on search engines. When you search at neeva.com, you get the type of result you’d expect from a search engine minus any advertising. In this episode, I speak with Darin Fisher, Software Engineer at Neeva. We discuss the motivation, implementation, and mobile experience for searching with Neeva.
Practical Machine Learning in JavaScript with Charlie Gerard
Jan 04, 2022
Charlie Gerard is an incredibly productive developer. In addition to being the author of Practical Machine Learning in JavaScript, her website charliegerard.dev has a long list of really interesting side projects exploring the intersection of human computer interaction, computer vision, interactivity, and art. In this episode we touch on some of these projects and broadly explore how practical it is to bring interesting HCI concepts into one’s work.
Responsibly Deploy AI in Production with Anupam Datta
Nov 30, 2021
Once a machine learning model is trained and validated, it often feels like a major milestone has been achieved. In reality, it's more like the first leg of a relay race. Deploying ML to production bears many similarities to a typical software release process, but brings several novel challenges, like failing to generalize as expected or model drift.
AI quality management is one of the biggest challenges in AI today. In this episode, I interview Anupam Datta, a co-founder of TruEra. TruEra has a solution aimed at helping with AI performance, monitoring, and model explainability. We talk about some of the challenges of modern machine learning deployment in production and how companies are succeeding with ML Ops.
Learning Tensorflow.js with Gant Laborde
Nov 09, 2021
Machine learning models must first be trained. That training results in a model which must be serialized or packaged up in some way as a deployment artifact. A popular deployment path is using Tensorflow.js to take advantage of the portability of JavaScript, allowing your model to be run on a web server or client.
Gant Laborde is Chief Innovation Officer at Infinite Red, a React Native consulting team and the author of Learning TensorFlow.js: Powerful Machine Learning in JavaScript from O’Reilly. In this interview, we explore use cases for Tensorflow.js.
No Code AI for Video Analytics with Alex Thiele
Oct 22, 2021
Imagine a world where you own some sort of building, whether that's a grocery store, a restaurant, or a factory, and you want to know how many people are in each section of the store, how long the average person waited to be seated, or how long it took the average factory worker to complete an assembly task.
Today, these systems either don't use AI, relying instead on a mix of sensors and buttons to track certain actions, or they use AI in a way that's highly specific to one use case and hard to modify for new use cases that come down the line.
This is where BrainFrame comes in. BrainFrame is a tool that connects to all your on-prem cameras and lets you easily leverage AI models and business logic. Alex Thiele is the CTO of Aotu, the company that makes BrainFrame, and he joins me today to talk about BrainFrame and the vision for a future where computer vision can be run by anyone.
This episode is hosted by David Cohen. David is a Software Engineering Lead at LinkedIn where he works on backend applications and APIs that power their enterprise data systems. In his free time, he is an AI enthusiast and enjoys talking about all things Software. You can contact him on LinkedIn or Twitter.
Virtual Agents for IT and HR with Dan Turchin
Sep 22, 2021
The dream of machines with artificial general intelligence is entirely plausible in the future, yet well beyond the reach of today’s cutting edge technology. However, a virtual agent need not win in Alan Turing’s Imitation Game to be useful. Modern technology can deliver on some of the promises of narrow intelligence for accomplishing specific tasks.
PeopleReign has created a virtual agent for IT and HR employee service. This agent’s goal is not to replace a human agent but to augment them by handling some requests and elegantly handing off to a human in other cases. In this episode, I speak with Dan Turchin, CEO of PeopleReign about their virtual agent and the future of work.
Autonomous Driving Infrastructure with Vinoj Kumar
Sep 20, 2021
Interest in autonomous vehicles dates back to the 1920s. It wasn't until the 1980s that the first truly autonomous vehicle prototypes began to appear. The first DARPA Grand Challenge took place in 2004, offering competitors $1 million to complete a 150-mile course through the Mojave desert. The prize was not claimed.
Since then, rapid progress has begun in autonomous driving fueled by advances in sensor technology, software, and the hardware which runs it. Infrastructure has become a serious consideration for autonomous vehicle companies. In this episode, I speak with Vinoj Kumar, VP of Infrastructure at Cruise, the San Francisco company building an all-electric self-driving rideshare and delivery service. They’re tackling the infinite longtail scenarios of city driving and helping Walmart with self-driving grocery delivery.
Sust Global: Taking Action Against Climate Change with Josh Gilbert
Jul 20, 2021
Governments, consumers, and companies across the world are becoming more aware of and attentive to the risks and causes of climate change. From recycling to using solar power, people are looking for ways to reduce their carbon footprint. Markets like the financial sector, governments, and consulting are looking for ways to understand climate data to make smart decisions and manage risk.
The company Sust Global was founded as a way to deliver sustainable change and climate resilient action. Sust Global uses an AI-powered platform that combines climate science, satellite-derived data, and geospatial data sets to quantify climate change. Companies can use this analysis to evaluate risk to assets, better understand future commodities like metal, and plan for future supply chain challenges and climate perils.
In this episode we talk to Josh Gilbert, CEO at Sust Global. Josh explains Sust Global’s mission and product, and discusses how companies use Sust Global to prepare and respond to climate change.
Machine Learning: The Great Stagnation with Mark Saroufim
Jun 04, 2021
Mark Saroufim is the author of an article entitled “Machine Learning: The Great Stagnation”. Mark is a PyTorch Partner Engineer with Facebook AI. He has spent his entire career developing machine learning and artificial intelligence products. Before joining Facebook to do PyTorch engineering with external partners, Mark was a Machine Learning Engineer at Graphcore. Before that he founded Yuri.ai. Mark has also published “The Robot Overlord Manual” which “will teach you all the software, math and ML you’ll need to start building robots at home.” In this episode we discuss machine learning subjects and his experience developing cutting edge software.
Data Management Systems and Artificial Intelligence with Arun Kumar
May 27, 2021
Arun Kumar is an Assistant Professor in the Department of Computer Science and Engineering and the Halicioglu Data Science Institute at the University of California, San Diego. His primary research interests are in data management and systems for machine learning/artificial intelligence-based data analytics.
Systems and ideas based on his research have been released as part of the Apache MADlib open-source library, shipped as part of products from Cloudera, IBM, Oracle, and Pivotal, and used internally by Facebook, Google, LogicBlox, Microsoft, and other companies.
Arun did his undergrad in Computer Science and Engineering at the Indian Institute of Technology, Madras, and then his MS and PhD in Computer Science at the University of Wisconsin-Madison, where his thesis research explores problems at the intersection of data management and machine learning, with a focus on problems related to usability, developability, performance, and scalability. In this episode he joins us to discuss data management systems and artificial intelligence.
BaseTen: Creating Machine Learning APIs with Tuhin Srivastava and Amir Haghighat
May 20, 2021
Application Programming Interfaces (APIs) are interfaces that enable multiple software applications to send and retrieve data from one another. They are commonly used for retrieving, saving, editing, or deleting data from databases, transmitting data between apps, and embedding third-party services into apps.
The company BaseTen helps companies build and deploy machine learning APIs and applications. Using pre-existing ML models, or choosing from BaseTen's library of pretrained models, BaseTen helps you instantly deploy API endpoints powered by those models to use in your applications. These APIs easily scale and integrate with existing data sources. BaseTen's serverless infrastructure enables chaining model outputs and pre- and post-processing code. They also provide a drag-and-drop UI builder to create custom UIs for the applications, all without learning React.
In this episode, we talk with Tuhin Srivastava and Amir Haghighat, founders at BaseTen. Tuhin previously founded Shape, and also worked as a Data Scientist at Gumroad. We discuss machine learning API development, scaling ML-driven applications, and the capabilities of BaseTen’s technology.
Botpress: Natural Language Processing with Sylvain Perron
May 07, 2021
Natural Language Processing (NLP) is a branch of artificial intelligence concerned with giving computers the ability to understand text and spoken words. “Understanding” includes intent, sentiment, and what’s important in the message. NLP powers things like voice-operated software, digital assistants, customer service chat bots, and many other academic, consumer and enterprise tools.
The company Botpress provides open-source developer tools to create NLP tools for process and FAQ automation. They use the latest NLP models for domain-specific, contextual and goal-oriented conversations. This technology is free and available through simple API routes. They also maintain integrations with popular messaging services like Facebook Messenger, Slack, and Microsoft Teams. For other proprietary systems, they provide a raw Messaging API.
In this episode we talk to Sylvain Perron, CEO of Botpress. Sylvain was previously a Director of Engineering at Protorisk Limited and a Software Developer at ArcBees before that. We discuss the current advances in Natural Language Processing and how NLP powers Botpress.
MindsDB: Automated Machine Learning with Jorge Torres
Apr 06, 2021
Using artificial intelligence and machine learning in a product or database is traditionally difficult because it involves a lot of manual setup, specialized training, and a clear understanding of the various ML models and algorithms. You need to develop the right ML model for your data, train the model, evaluate it, optimize it, analyze it for outliers and anomalies, assemble confidence ranges of the predictions and feature importance, and eventually deploy it to make predictions. An emerging field in AI, called Automated Machine Learning (AutoML), lowers these barriers to entry by using AI to automate much of this process.
One of the market leaders in AutoML is MindsDB. Their service lets business users and developers make predictions on top of data at its source. Rather than make expensive copies of databases, a process that creates complex infrastructures, MindsDB trains and deploys models right inside the database. The results of their ML models can be queried with standard SQL statements and integrated into other applications as easily as querying any other database.
In this episode we learn more about the progress that has been made in AutoML to simplify incorporating machine learning throughout organizations. We discuss the current features available from MindsDB, the difference their product has made for companies trying to leverage AI, and the future of AutoML and artificial intelligence generally.
Creation Labs: Self Driving Trucks with Jakub Langr
Mar 30, 2021
Creation Labs is helping bring Europe one step closer to fully autonomous long-haul trucking. They have developed an AI Driver Assistance System (AIDAS) that retrofits to any commercial vehicle, starting with VW Crafters and MAN TGE trucks. Their system uses camera hardware mounted to the vehicle to capture video data that is processed with computer vision to understand the context on the road. This piece of the system was developed by leading experts in computer vision. While the computer interprets what is happening on the road, data is sent to a processing system that can control the vehicle's brake, throttle, and steering.
The AIDAS system currently augments a driver’s role but does not replace the need for one yet. However, the difference between great drivers and bad drivers is around a 30% difference in fuel efficiency, according to Creation Labs. They have trained their systems with data from the best drivers in order to lower fuel costs for vehicles driven by their AIDAS. They’ve also built their system using the highest standards of safety.
Jakub Langr is the CEO of Creation Labs and an Oxford-educated data scientist with 10 years of industry experience. He discusses Creation Labs’s vision for the future and the impact their incredible product is having on customers’ profit margins and emissions.
Pinecone: Vector Database with Edo Liberty
Mar 15, 2021
Vectors are the foundational mathematical building blocks of Machine Learning. Machine Learning models must transform input data into vectors to perform their operations, creating what is known as a vector embedding. Since data is not stored in vector form, an ML application must perform significant work to transform data in different formats into a form that ML models can understand. This can be computationally intensive and hard to scale, especially for the high-dimensional vectors used in complex models.
Pinecone is a managed database built specifically for working with vector data. Pinecone is serverless and API-driven, which means engineers and data scientists can focus on building their ML application or performing analysis without worrying about the underlying data infrastructure.
Edo Liberty is the founder and CEO of Pinecone. Prior to Pinecone, he led the creation of Amazon SageMaker at AWS. He joins the show today to talk about the fundamental importance of vectors in machine learning, how Pinecone built a vector-centric database, and why data infrastructure improvements are key to unlocking the next generation of AI applications.
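To illustrate what a vector database does at its core, here is a minimal brute-force similarity search over toy embeddings. Pinecone's value is doing this at scale, with managed infrastructure and approximate nearest-neighbor indexes, which this sketch does not attempt; the embeddings and item names below are made up for illustration:

```python
import numpy as np

# Toy "embeddings": in practice these come from a trained model
# (a sentence transformer, a recommendation tower, and so on).
items = {
    "red shirt":  np.array([0.9, 0.1, 0.0]),
    "blue shirt": np.array([0.8, 0.3, 0.1]),
    "toaster":    np.array([0.0, 0.2, 0.9]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based similarity between two vectors, in [-1, 1].
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.85, 0.2, 0.05])  # embedding of a new query item
ranked = sorted(items.items(),
                key=lambda kv: cosine_similarity(query, kv[1]),
                reverse=True)
print(ranked[0][0])  # nearest neighbor: "red shirt"
```

Brute-force comparison like this scales linearly with the number of items; the hard engineering problem a vector database solves is answering the same question over billions of high-dimensional vectors in milliseconds.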
OctoML: Automated Deep Learning Engineering with Jason Knight and Luis Ceze
Feb 09, 2021
The incredible advances in machine learning research in recent years often take time to propagate out into usage in the field. One reason for this is that such "state-of-the-art" results for machine learning performance rely on handwritten, idiosyncratic optimizations for specific hardware models or operating contexts. When developers are building ML-powered systems to deploy in the cloud and at the edge, their goal is to ensure the model delivers the best possible functionality and end-user experience. Importantly, their hardware and software stack may require different optimizations to achieve that goal.
OctoML provides a SaaS product called the Octomizer to help developers and AIOps teams deploy ML models most efficiently on any hardware, in any context. The Octomizer deploys its own ML models to analyze your model topology, and optimize, benchmark, and package the model for deployment. The Octomizer generates insights about model performance over different hardware stacks and helps you choose the deployment format that works best for your organization.
Luis Ceze is the Co-Founder and CEO of OctoML. Luis is a founder of the Apache TVM project, which is the basis for OctoML's technology. He is also a professor of Computer Science at the University of Washington. Jason Knight is co-founder and CPO at OctoML. Luis and Jason join the show today to talk about how OctoML is automating deep learning engineering, why it's so important to consider hardware when building deep learning systems, and how the field of deep learning is evolving.
Embedded Software Engineering is the practice of building software that controls embedded systems, that is, machines or devices other than standard computers. Embedded systems appear in a variety of applications, from small microcontrollers, to consumer electronics, to large-scale machines such as cars, airplanes, and machine tools.
iRobot is a consumer robotics company that applies embedded engineering to build robots that perform common household tasks. Its flagship product is the Roomba, perhaps one of the most well-known autonomous consumer robots on the market today. iRobot’s engineers work at the intersection of software and hardware, and work in a variety of domains from electrical engineering to AI.
Chris Svec is a Software Engineering Manager at iRobot. He started his career designing x86 chips and later moved up the hardware/software stack into embedded software. He joins the show today to talk about iRobot, the design process for embedded systems, and the future of embedded systems programming.
Reinforcement Learning and Robotics with Nathan Lambert
Jan 27, 2021
Reinforcement learning is a paradigm in machine learning that uses incentives, or "reinforcement", to drive learning. The learner is conceptualized as an intelligent agent working within a system of rewards and penalties in order to solve a novel problem. The agent is designed to maximize rewards while pursuing a solution by trial and error.
Programming a system to respond to the complex and unpredictable "real world" is one of the principal challenges in robotics engineering. One field which is finding new applications for reinforcement learning is the study of MEMS devices: robots or other electronic devices built at the micrometer scale. The use of reinforcement learning in microscopic devices poses a challenging engineering problem, due to constraints on power usage and computational power.
Nathan Lambert is a PhD student at Berkeley who works with the Berkeley Autonomous Microsystems Lab. He has also worked at Facebook AI Research and Tesla. He joins the show today to talk about the application of reinforcement learning to robotics and how deep learning is changing the MEMS device landscape.
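As a concrete, minimal instance of the reward-driven loop described above, here is tabular Q-learning on a toy corridor world. Real robotics problems involve continuous states and deep function approximators, but the reward-and-update structure is the same; the environment here is invented purely for illustration:

```python
import random

# A 1-D corridor: states 0..4, reward only for reaching the right end.
N_STATES = 5
ACTIONS = [-1, +1]  # move left or move right
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for _ in range(500):  # training episodes
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        reward = 1.0 if s_next == N_STATES - 1 else 0.0
        # Q-learning update: nudge the estimate toward reward + discounted future value.
        best_next = max(q[(s_next, act)] for act in ACTIONS)
        q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])
        s = s_next

# The learned policy should prefer moving right (+1) from every non-terminal state.
print([max(ACTIONS, key=lambda act: q[(s, act)]) for s in range(N_STATES - 1)])
```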
Machine Learning Carbon Capture with Diego Saez-Gil
Jan 21, 2021
Companies can have a negative impact on the environment by outputting excess carbon. Many companies want to reduce their net carbon impact to zero, which can be done by investing in forests. Pachama is a marketplace for forest investments. Pachama uses satellites, imaging, machine learning, and other techniques to determine how much carbon is being absorbed by different forests. Diego Saez-Gil is a founder of Pachama, and joins the show to talk through how Pachama works and the long-term goals of the company.
TensorFlow Lite is an open source deep learning framework for on-device inference. TensorFlow Lite was designed to improve the viability of machine learning applications on phones, sensors, and other IoT devices. Pete Warden works on TensorFlow Lite at Google and joins the show to talk about the world of machine learning applications and the necessary frameworks and devices necessary to build them.
WebAssembly on IoT with Jonathan Beri (Repeat)
Jan 05, 2021
Originally published July 30, 2019
“Internet of Things” is a term used to describe the increasing connectivity and intelligence of physical objects within our lives.
IoT has manifested within enterprises under the term “Industrial IoT,” as wireless connectivity and machine learning have started to improve devices such as centrifuges, conveyor belts, and factory robotics. In the consumer space, IoT has moved slower than many people expected, and it remains to be seen when we will have widespread computation within consumer devices such as microwaves, washing machines, and lightswitches.
IoT computers have different constraints than general purpose computers. Security, reliability, battery life, power consumption, and cost structures are very different in IoT devices than in your laptop or smartphone. One technology that could solve some of the problems within IoT is WebAssembly, a newer binary instruction format for executable programs.
Jonathan Beri is a software engineer and the organizer of the San Francisco WebAssembly Meetup. He has significant experience in the IoT industry, and joins the show to discuss the state of WebAssembly, the surrounding technologies, and their impact on IoT.
Drishti: Deep Learning for Manufacturing with Krish Chaudhury (Repeat)
Dec 28, 2020
Originally published April 17, 2019
Drishti is a company focused on improving manufacturing workflows using computer vision.
A manufacturing environment consists of assembly lines. A line is composed of sequential stations along that manufacturing line. At each station on the assembly line, a worker performs an operation on the item that is being manufactured. This type of workflow is used for the manufacturing of cars, laptops, stereo equipment, and many other technology products.
With Drishti, the manufacturing process is augmented by adding a camera at each station. Camera footage is used to train a machine learning model for each station on the assembly line. That machine learning model is used to ensure the accuracy and performance of each task that is being conducted on the assembly line.
Krish Chaudhury is the CTO at Drishti. From 2005 to 2015 he led image processing and computer vision projects at Google before joining Flipkart, where he worked on image science and deep learning for another four years. Krish had spent more than twenty years working on image and vision related problems when he co-founded Drishti.
In today’s episode, we discuss the science and application of computer vision, as well as the future of manufacturing technology and the business strategy of Drishti.
Niantic Real World with Paul Franceus (Repeat)
Dec 22, 2020
Originally published June 21, 2019
Niantic is the company behind Pokemon Go, an augmented reality game where users walk around in the real world and catch Pokemon which appear on their screen.
The idea for augmented reality has existed for a long time. But the technology to bring augmented reality to the mass market has appeared only recently. Improved mobile technology makes it possible for a smartphone to display rendered 3-D images over a video stream without running out of battery.
Ingress was the first game to come out of Niantic, followed by Pokemon Go, but there are other games on the way. Niantic is also working on the Niantic Real World platform, a “planet-scale” AR platform that will allow independent developers to build multiplayer augmented reality experiences that are as dynamic and entertaining as Pokemon Go.
Paul Franceus is an engineer at Niantic, and he joins the show to describe his experience building and launching Pokemon Go, as well as abstracting the technology from Pokemon Go and opening up the Niantic Real World platform to developers.
Practical AI with Chris Benson (Repeat)
Dec 17, 2020
Originally published December 9, 2019
Machine learning algorithms have existed for decades. But in the last ten years, several advancements in software and hardware have caused dramatic growth in the viability of applications based on machine learning.
Smartphones generate large quantities of data about how humans move through the world. Software-as-a-service companies generate data about how these humans interact with businesses. Cheap cloud infrastructure allows for the storage of these high volumes of data. Machine learning frameworks such as Apache Spark, TensorFlow, and PyTorch allow developers to easily train statistical models.
These models are deployed back to the smartphones and the software-as-a-service companies, improving people's ability to move through the world and gain utility from their business transactions. And as humans interact more with their computers, they generate more data, which is used to create better models and deliver higher consumer utility.
The combination of smartphones, cloud computing, machine learning algorithms, and distributed computing frameworks is often referred to as “artificial intelligence.” Chris Benson is the host of the podcast Practical AI, and he joins the show to talk about the modern applications of artificial intelligence, and the stories he is covering on Practical AI. On his podcast, Chris talks about everything within the umbrella of AI, from high level stories to low level implementation details.
Kubeflow: TensorFlow on Kubernetes with David Aronchick (Repeat)
Dec 15, 2020
Originally published January 25, 2019
When TensorFlow came out of Google, the machine learning community converged around it. TensorFlow is a framework for building machine learning models, but the lifecycle of a machine learning model has a scope that is bigger than just creating a model. Machine learning developers also need to have a testing and deployment process for continuous delivery of models.
The continuous delivery process for machine learning models is like the continuous delivery process for microservices, but can be more complicated. A developer testing a model on their local machine is working with a smaller data set than what they will have access to when it is deployed. A machine learning engineer needs to be conscious of versioning and auditability.
Kubeflow is a machine learning toolkit for Kubernetes based on Google's internal machine learning pipelines. Google open sourced Kubernetes and TensorFlow, and the projects have users including AWS and Microsoft. David Aronchick is the head of open source machine learning strategy at Microsoft, and he joins the show to talk about the problems that Kubeflow solves for developers, and the evolving strategies for cloud providers.
David was previously on the show when he worked at Google, and in this episode he provides some useful discussion about how open source software presents a great opportunity for the cloud providers to collaborate with each other in a positive sum relationship.
Hedge Fund Artificial Intelligence with Xander Dunn (Repeat)
Dec 09, 2020
Originally published April 3, 2017
A hedge fund is a collection of investors that make bets on the future. The “hedge” refers to the fact that the investors often try to diversify their strategies so that the direction of their bets are less correlated, and they can be successful in a variety of future scenarios. Engineering-focused hedge funds have used what might be called “machine learning” for a long time to predict what will happen in the future.
Numerai is a hedge fund that crowdsources its investment strategies by allowing anyone to train models against Numerai’s data. A model that succeeds in a simulated environment will be adopted by Numerai and used within its real money portfolio. The engineers who create the models are rewarded in proportion to how well the models perform.
Xander Dunn is a software engineer at Numerai and in this episode he explains what a hedge fund is, why the traditional strategies are not optimal, and how Numerai creates the right incentive structure to crowdsource market intelligence. This interview was fun and thought provoking–Numerai is one of those companies that makes me very excited about the future.
Rosebud: Artificially Generated Media with Dzmitry Pletnikau
Nov 30, 2020
For several years, we have had the ability to create artificially generated text articles. More recently, audio and video synthesis have been feasible for artificial intelligence. Rosebud is a company that creates animated virtual characters that can speak. Users can generate real or fictional presenters easily with Rosebud. Dzmitry Pletnikau is an engineer with Rosebud and joins the show to talk about the technology and engineering behind the company.
Computer Architecture with Dave Patterson Holiday Repeat
Nov 27, 2020
Originally published November 7, 2018
An instruction set defines a low-level programming language for moving information throughout a computer. In the early 1970s, the prevalent instruction set languages used a large vocabulary of different instructions. One justification for a large instruction set was that it would give a programmer more freedom to express the logic of their programs.
Many of these instructions were rarely used. Think of your favorite programming language (or your favorite human language). What percentage of words in the vocabulary do you need to communicate effectively? We sometimes call these language features “syntactic sugar”. They add expressivity to a language, but may not improve functionality or efficiency.
These extra language features can have a cost.
Dave Patterson and John Hennessy created the RISC architecture: Reduced Instruction Set Computer architecture. RISC proposed shrinking the instruction set so that the most important instructions could be optimized. Programs would become more efficient, easier to analyze, and easier to debug.
Dave Patterson’s first paper on RISC was rejected. He continued to research the architecture and advocate for it. Eventually RISC became widely accepted, and Dave won a Turing Award together with John Hennessy.
Dave joins the show to talk about his work on RISC and his continued work in computer science research to the present. He is involved in the Berkeley RISELab and works at Google on the Tensor Processing Unit.
Machine learning is an ocean of new scientific breakthroughs and applications that will change our lives. It was inspiring to hear Dave talk about the changing nature of computing, from cloud computing to security to hardware design.
Cruise: Self-Driving Engineering with Mo Elshenawy Holiday Repeat
Nov 26, 2020
Originally published October 1, 2019
The development of self-driving cars is one of the biggest technological changes that is under way.
Across the world, thousands of engineers are working on developing self-driving cars. Although it still seems far away, self-driving cars are starting to feel like an inevitability. This is especially true if you spend much time in downtown San Francisco, where you will see a self-driving car being tested every day. Much of the time, that self-driving car will be operated by Cruise.
Cruise is a company that is building a self-driving car service. The company has hundreds of engineers working across the stack, from computer vision algorithms to automotive hardware. Cruise’s engineering requires engineers who can work with cloud tools as well as low-latency devices. It also requires product developers and managers to lead these different teams.
The field of self-driving is very new. There is not much literature available on how to build a self-driving car. There is even less literature on how to manage a team of engineers that are building, testing, and deploying software and hardware for real cars that are driving around the streets of San Francisco.
Mo Elshenawy is VP of engineering at Cruise, and he joins the show to talk about the engineering that is required to develop fully self-driving car technology, as well as how to structure teams to align the roles of product design, software engineering, testing, machine learning, and hardware.
Full disclosure: Cruise is a sponsor of Software Engineering Daily.
Model Deployment and Serving with Chaoyu Yang
Nov 04, 2020
Newer machine learning tooling is often focused on streamlining the workflows and developer experience. One such tool is BentoML. BentoML is a workflow that allows data scientists and developers to ship models more effectively. Chaoyu Yang is the creator of BentoML and he joins the show to talk about why he created Bento and the engineering behind the project.
Humanloop: NLP Model Engineering with Raza Habib
Nov 03, 2020
Data labeling is a major bottleneck in training and deploying machine learning, especially for NLP. But new tools for training models with humans in the loop can drastically reduce how much data is required. Humanloop is a platform for annotating text and training NLP models with much less labeled data. Raza Habib, founder of Humanloop, joins the show to talk about NLP workflows and his work on Humanloop.
Federated Learning with Mike Lee Williams
Oct 23, 2020
Federated learning is machine learning without a centralized data source. Federated Learning enables mobile phones or edge servers to collaboratively learn a shared prediction model while keeping all the training data on device. Mike Lee Williams is an expert in federated learning, and he joins the show to give an overview of the subject and share his thoughts on its applications.
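A minimal sketch of the idea, using federated averaging on a toy linear model: each simulated device trains locally on its own private data, and the server only ever aggregates model weights, never raw data. This illustrates the concept rather than any production federated learning framework:

```python
import numpy as np

# Each "device" holds its own private data for a simple linear model y = w * x.
rng = np.random.default_rng(0)
devices = []
for _ in range(5):
    x = rng.normal(size=100)
    y = 3.0 * x + rng.normal(scale=0.1, size=100)  # the true weight is 3.0
    devices.append((x, y))

w_global = 0.0
for _ in range(20):  # federated rounds
    local_weights = []
    for x, y in devices:
        w = w_global  # every device starts from the shared model
        for _ in range(10):  # local SGD steps; raw data never leaves the device
            grad = np.mean((w * x - y) * x)  # gradient of squared error (up to a constant)
            w -= 0.1 * grad
        local_weights.append(w)
    # The server aggregates weights only, never the underlying data.
    w_global = float(np.mean(local_weights))

print(round(w_global, 3))  # converges toward 3.0
```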
Machine learning models require training data, and training data needs to be labeled. Raw images and text can be labeled using a training data platform like Labelbox. Labelbox is a system of labeling tools that enables a human workforce to create data that is ready to be consumed by machine learning training algorithms. The Labelbox team joins the show today to discuss training data and how to label it.
Roboflow: Computer Vision Models with Brad Dwyer
Oct 13, 2020
Training a computer vision model is not easy. Bottlenecks in the development process make it even harder. Ad hoc code, inconsistent data sets, and other workflow issues hamper the ability to streamline models. Roboflow is a company built to simplify and streamline these model training workflows. Brad Dwyer is a founder of Roboflow and joins the show to talk about model development and his company.
Aquarium: Dataset Quality Improvement with Peter Gao
Oct 02, 2020
Machine learning models are only as good as the datasets they’re trained on. Aquarium is a system that helps machine learning teams make better models by improving their dataset quality. Model improvement is often made by curating high quality datasets, and Aquarium helps make that a reality. Peter Gao works on Aquarium, and he joins the show to talk through modern machine learning and the role of Aquarium.
Elementary Robotics with Arye Barnehama
Sep 17, 2020
Factories require quality assurance work. That QA work can be accomplished by a camera-equipped robot using computer vision. This allows for sophisticated inspection techniques that do not require as much manual effort on the part of a human.
Arye Barnehama is a founder of Elementary Robotics, a company that makes these kinds of robots. Arye joins the show to talk through the engineering of Elementary Robotics, and his vision for the future of the factory floor.
Robotic Process Automation with Antti Karjalainen
Sep 04, 2020
Robotic process automation involves the scripting and automation of highly repeatable tasks. RPA tools such as UiPath paved the way for a newer wave of automation, including Robot Framework, an open source system for RPA.
Antti Karjalainen is the CEO of Robocorp, a company that provides an RPA tool suite for developers. Antti joins the show to talk through the definition of RPA, common RPA tasks, and what he is building with Robocorp.
Hyperparameter Tuning with Richard Liaw
Aug 28, 2020
Hyperparameters define the strategy for exploring the space in which a machine learning model is developed. Whereas a model's parameters are learned from the training data, its hyperparameters are set before training and control how that training proceeds, for example the learning rate, the batch size, or the capacity of the model.
A different set of hyperparameters will yield a different model. Thus, it is important to try different hyperparameter configurations to see which models end up performing better for a given application. Hyperparameter tuning is an art and a science.
Richard Liaw is an engineer and researcher, and the creator of Tune, a library for scalable hyperparameter tuning. Richard joins the show to talk through hyperparameters and the software that he has built for tuning them.
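As a flavor of what Tune looks like in practice, here is a minimal sketch using Ray Tune's classic tune.run API (newer Ray releases have since introduced a Tuner-based API), with a stubbed training function standing in for a real model:

```python
from ray import tune

def train_model(config):
    # Stand-in for a real training loop; report one score per trial.
    score = -(config["lr"] - 0.01) ** 2  # pretend lr=0.01 is optimal
    tune.report(score=score)

# Grid search launches one trial per candidate learning rate.
analysis = tune.run(
    train_model,
    config={"lr": tune.grid_search([0.001, 0.01, 0.1])},
)
print(analysis.get_best_config(metric="score", mode="max"))
```

Beyond grid search, Tune's scalable schedulers and search algorithms can run many such trials in parallel across a cluster and stop unpromising configurations early.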
Machine Learning Labeling and Tooling with Lukas Biewald
Aug 26, 2020
CrowdFlower was a company started in 2007 by Lukas Biewald, an entrepreneur and computer scientist. CrowdFlower solved some of the data labeling problems that were not being solved by Amazon Mechanical Turk. A decade after starting CrowdFlower, the company was sold for several hundred million dollars.
Today, data labeling has only grown in volume and scope. But Lukas has moved on to a different part of the machine learning stack: tooling for hyperparameter search and machine learning monitoring.
Lukas Biewald joins the show to talk about the problems he was solving with CrowdFlower, the solutions that he developed as part of that company, and the efforts with his current focus: Weights and Biases, a machine learning tooling company.
ParlAI: Facebook Dialogue Platform with Stephen Roller
Aug 20, 2020
Chatbots are useful for developing well-defined applications such as first-contact customer support, sales, and troubleshooting. But the potential for chatbots is so much greater. Over the last five years, there have been numerous platforms that have arisen to allow for better, more streamlined chatbot creation.
Dialogue software enables the creation of sophisticated chatbots. ParlAI is a dialogue platform built inside of Facebook. It allows for the development of dialogue models within Facebook. These chatbots can “remember” information from session to session, and continually learn from user input.
Stephen Roller is an engineer who helped build ParlAI, and he joins the show to discuss the history of chatbot applications and what the Facebook team is trying to accomplish with the development of ParlAI.
SuperAnnotate: Image Annotation Platform with Vahan and Tigran Petrosyan
Aug 19, 2020
Image annotation is necessary for building supervised learning models for computer vision. An image annotation platform streamlines the annotation of these images. Well-known annotation platforms include Scale AI, Amazon Mechanical Turk, and CrowdFlower.
There are also large consulting-like companies that will annotate images in bulk for you. If you have an application that requires lots of annotation, such as self-driving cars, then you might be compelled to outsource this annotation to such a company.
SuperAnnotate is an image annotation platform that can be used by these image annotation outsourcing firms. This episode explores SuperAnnotate, and the growing niche of image annotation. Vahan and Tigran Petrosyan are the founders of SuperAnnotate, and join the show for today’s interview.
Drug Simulations with Bryan Vicknair and Jason Walsh
Jul 29, 2020
Drug trials can lead to new therapeutics and preventative medications being discovered and placed on the market. Unfortunately, these drug trials typically require animal testing. This means animals are killed or harmed as a result of needing to verify that a drug will not kill humans.
Animal testing is currently unavoidable, but the extent to which testing needs to occur can be reduced by using machine learning models that simulate the effects of a drug on the human body. If the simulated effect is sufficiently negative, the animal test doesn't need to be run, and no animals need to be harmed.
Bryan Vicknair and Jason Walsh work at VeriSIM Life, a company which makes software simulations of animals. These simulations can be used to model drug testing, and change the workflow for drug trials. They join the show to talk through the mechanics of drug testing, and how VeriSIM Life fits into that workflow.
Netflix runs all of its infrastructure on Amazon Web Services. This includes business logic, data infrastructure, and machine learning. By tightly coupling itself to AWS, Netflix has been able to move faster and have strong defaults about engineering decisions. And today, AWS has such an expanse of services that it can be used as a platform to build custom tools.
Metaflow is an open source machine learning platform built on top of AWS that allows engineers at Netflix to build directed acyclic graphs for training models. These DAGs get deployed to AWS as Step Functions, a serverless orchestration platform.
Savin Goyal is a machine learning engineer with Netflix, and he joins the show to talk about the machine learning challenges within Netflix, and his experience working on Metaflow. We also talk about DAG systems such as AWS Step Functions and Airflow.
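A minimal sketch of what a Metaflow DAG looks like: each @step method is a node, and Metaflow persists instance attributes between steps and can execute the graph locally or remotely. The flow below is a stub for illustration, not Netflix's actual training code:

```python
from metaflow import FlowSpec, step

class TrainFlow(FlowSpec):
    """Each @step is a node in the DAG; Metaflow checkpoints the
    instance attributes (self.*) between steps."""

    @step
    def start(self):
        self.data = list(range(10))  # stand-in for loading a dataset
        self.next(self.train)

    @step
    def train(self):
        self.model = sum(self.data) / len(self.data)  # stand-in for training
        self.next(self.end)

    @step
    def end(self):
        print("trained model:", self.model)

if __name__ == "__main__":
    TrainFlow()
```

Running `python train_flow.py run` executes the DAG locally; per the episode, flows like this can also be scheduled on AWS Step Functions.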
Determined AI: Machine Learning Ops with Neil Conway
Jul 08, 2020
Developing machine learning models is not easy. From the perspective of the machine learning researcher, there is the iterative process of tuning hyperparameters and selecting relevant features. From the perspective of the operations engineer, there is a handoff from development to production, and the management of GPU clusters to parallelize model training.
In the last five years, machine learning has become easier to use thanks to point solutions: TensorFlow, cloud provider tools, Spark, and Jupyter Notebooks. But every company works differently, and there are few hard and fast rules for the workflows around machine learning operations.
Determined AI is a platform that provides a means for collaborating around data prep, model development and training, and model deployment. Neil Conway is a co-founder of Determined, and he joins the show to discuss the challenges around machine learning operations, and what he has built with Determined.
Deepgram: End-to-End Speech Recognition with Scott Stephenson
Jul 03, 2020
Deepgram is an end-to-end deep learning platform for speech recognition. Unlike the general purpose APIs from Google or Amazon, Deepgram models are custom-trained for each customer. Whether the customer is a call center, a podcasting company, or a sales department, Deepgram can work with them to build something specific to their use case.
Sound data is incredibly rich. Consider all the features in a voice recording: volume, intonation, inflection. And once the speech is transcribed, there are many more features that can be discovered from the text transcription.
Scott Stephenson is the CEO of Deepgram, and he joins the show to talk through end-to-end deep learning for speech, as well as the dynamics of the business and the deployment strategy for working with customers.
Cresta: Speech ML for Calls with Zayd Enam
Jun 29, 2020
At a customer service center, thousands of hours of audio are generated. This audio provides a wealth of information to transcribe and analyze. With the additional data of the most successful customer service representatives, machine learning models can be trained to identify which speech patterns are associated with a successful worker.
By identifying these speaking patterns, a customer service center can continuously improve, with the different representatives learning the different patterns. The same is true for other speech-based tasks, such as sales calls.
Cresta is a company that builds systems to ingest high volumes of speech data in order to discover features that correlate with high performance human workers. Zayd Enam is a co-founder of Cresta, and joins the show to talk about the domain of speech data and what he and his team are building at Cresta.
Traces: Video Recognition with Veronica Yurchuk and Kostyantyn Shysh (Summer Break Repeat)
Jun 25, 2020
Originally published October 8, 2019. We are taking a few weeks off. We’ll be back soon with new episodes.
Video surveillance impacts human lives every day.
On most days, we do not feel the impact of video surveillance. But the effects of video surveillance have tremendous potential. It can be used to solve crimes and find missing children. It can be used to intimidate journalists and empower dictators. Like any piece of technology, video surveillance can be used for good or evil.
Video recognition lets us make better use of video feeds. A stream of raw video doesn’t provide much utility if we can’t easily model its contents. Without video recognition, we must have a human sitting in front of the video to manually understand what is going on in that video.
Veronica Yurchuk and Kosh Shysh are the founders of Traces.ai, a company building video recognition technology focused on safety, anonymity, and positive usage. They join the show to discuss the field of video analysis, and their vision for how video will shape our lives in the future.
Stripe Machine Learning Infrastructure with Rob Story and Kelley Rivoire (Summer Break Repeat)
Jun 16, 2020
Originally published June 13, 2019. We are taking a few weeks off. We’ll be back soon with new episodes.
Machine learning allows software to improve as that software consumes more data.
Machine learning is a tool that every software engineer wants to be able to use. Because machine learning is so broadly applicable, software companies want to make the tools more accessible to the developers across the organization.
There are many steps that an engineer must go through to use machine learning, and each additional step reduces the chances that the engineer will actually get their model into production.
An engineer who wants to build machine learning into their application needs access to data sets. They need to join those data sets, and load them into a machine (or multiple machines) where their model can be trained. Once the model is trained, it needs to be tested on additional data to ensure quality. If the initial model quality is insufficient, the engineer might need to tweak the training parameters.
Once a model is accurate enough, the engineer needs to deploy that model. After deployment, the model might need to be updated with new data later on. If the model is processing sensitive or financially relevant data, a provenance process might be necessary to allow for an audit trail of decisions that have been made by the model.
Rob Story and Kelley Rivoire are engineers working on machine learning infrastructure at Stripe. After recognizing the difficulties that engineers faced in creating and deploying machine learning models, Stripe engineers built out Railyard, an API for machine learning workloads within the company.
Rob and Kelley join the show to discuss data engineering and machine learning at Stripe, and their work on Railyard.
Architects of Intelligence with Martin Ford (Summer Break Repeat)
Jun 15, 2020
Originally published January 31, 2019. We are taking a few weeks off. We’ll be back soon with new episodes.
Artificial intelligence is reshaping every aspect of our lives, from transportation to agriculture to dating. Someday, we may even create a superintelligence–a computer system that is demonstrably smarter than humans. But there is widespread disagreement on how soon we could build a superintelligence. There is not even a broad consensus on how we can define the term “intelligence”.
Information technology is improving so rapidly we are losing the ability to forecast the near future. Even the most well-informed politicians and business people are constantly surprised by technological changes, and the downstream impact on society. Today, the most accurate guidance on the pace of technology comes from the scientists and the engineers who are building the tools of our future.
Architects of Intelligence is a privileged look at how AI is developing. Martin Ford surveys these different AI experts with similar questions. How will China’s adoption of AI differ from that of the US? What is the difference between the human brain and that of a computer? What are the low-hanging fruit applications of AI that we have yet to build?
Martin joins the show to talk about his new book. In our conversation, Martin synthesizes ideas from these different researchers, and describes the key areas of disagreement from across the field.
Cruise is an autonomous car company with a development cycle that is highly dependent on testing its cars–both in the wild and in simulation. The testing cycle typically requires cars to drive around gathering data, and that data to subsequently be integrated into a simulated system called Matrix.
With COVID-19, the ability to run tests in the wild has been severely dampened. Cruise cannot put as many cars on the road, and has had to shift much of its testing to rely more heavily on the simulations. Therefore, the simulated environments must be made very accurate, including the autonomous agents such as pedestrians and cars.
Tom Boyd is VP of Simulation at Cruise. He joins the show to talk about the testing workflow at Cruise, how the company builds simulation-based infrastructure, and his work managing simulation at the company.
Tecton: Machine Learning Platform from Uber with Kevin Stumpf
Jun 03, 2020
Machine learning workflows have had a problem for a long time: taking a model from the prototyping step and putting it into production is not an easy task. A data scientist who is developing a model is often working with different tools, or a smaller data set, or different hardware than the environment which that model will be deployed to.
This problem existed at Uber just as it does at many other companies. Models were difficult to release, iterations were complicated, and collaboration between engineers could never reach a point that resembled a harmonious “DevOps”-like workflow. To address these problems, Uber developed an internal system called Michelangelo.
Some of the engineers working on Michelangelo within Uber realized that there was a business opportunity in taking the Michelangelo work and turning it into a product company. Thus, Tecton was born. Tecton is a machine learning platform focused on solving the same problems that existed within Uber. Kevin Stumpf is the CTO at Tecton, and he joins the show to talk about the machine learning problems of Uber, and his current work at Tecton.
Edge Machine Learning with Zach Shelby
May 26, 2020
Devices on the edge are becoming more useful with improvements in the machine learning ecosystem. TensorFlow Lite allows machine learning models to run on microcontrollers and other devices with only kilobytes of memory. Microcontrollers are very low-cost, tiny computational devices. They are cheap, and they are everywhere.
The low-energy embedded systems community and the machine learning community have come together with a collaborative effort called tinyML. tinyML represents the improvements of microcontrollers, lighter weight frameworks, better deployment mechanisms, and greater power efficiency.
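To make that lighter-weight tooling concrete, here is a minimal sketch of exporting a model with the TensorFlow Lite converter. The tiny Keras model is a stand-in, and the final hop onto a microcontroller (via TensorFlow Lite Micro, typically as a C byte array compiled into firmware) is not shown.

```python
# A minimal sketch: convert a (stand-in) Keras model to a TensorFlow Lite
# flatbuffer, the format that TensorFlow Lite Micro runs on microcontrollers.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(3,)),
    tf.keras.layers.Dense(1),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # shrink/quantize where possible

with open("model.tflite", "wb") as f:
    f.write(converter.convert())
```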
Zach Shelby is the CEO of Edge Impulse, a company that makes a platform called Edge Impulse Studio. Edge Impulse Studio provides a UI for data collection, training, and device management. As someone creating a platform for edge machine learning usability, Zach was a great person to talk to about the state of edge machine learning and his work building a company in the space.
Rasa: Conversational AI with Tom Bocklisch
Apr 24, 2020
Chatbots became widely popular around 2016 with the growth of chat platforms like Slack and voice interfaces such as Amazon Alexa. As chatbots came into use, so did the infrastructure that enabled chatbots. NLP APIs and complete chatbot frameworks came out to make it easier for people to build chatbots.
The first suite of chatbot frameworks were largely built around rule-based state machine systems. These systems work well for a narrow set of use cases, but fall over when it comes to chatbot models that are more complex. Rasa was started in 2015, amidst the chatbot fever.
Since then, Rasa has developed a system that allows a chatbot developer to train their bot through a system called interactive learning. With interactive learning, I can deploy my bot, spend some time talking to it, and give that bot labeled feedback on its interactions with me. Rasa has open source tools for natural language understanding, dialogue management, and other components needed by a chatbot developer.
Tom Bocklisch works at Rasa, and he joins the show to give some background on the field of chatbots and how Rasa has evolved over time.
Snorkel: Training Dataset Management with Braden Hancock
Apr 09, 2020
Machine learning models require the use of training data, and that data needs to be labeled. Today, we have high-quality data infrastructure tools such as TensorFlow, but we don't have large, high-quality data sets. For many applications, the state of the art is to manually label training examples and feed them into the training process.
Snorkel is a system for scaling the creation of labeled training data. In Snorkel, human subject matter experts create labeling functions, and these functions are applied to large quantities of data in order to label it.
For example, if I want to generate training data about spam emails, I don’t have to hire 1000 email experts to look at emails and determine if they are spam or not. I can hire just a few email experts, and have them define labeling functions that can indicate whether an email is spam. If that doesn’t make sense, don’t worry. We discuss it in more detail in this episode.
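For a more concrete picture, here is a small sketch in the style of Snorkel's labeling functions. The imports follow Snorkel's published tutorials, but the data, heuristics, and label values below are invented for illustration.

```python
# Sketch of Snorkel-style weak supervision for the spam example above.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_wire_transfer(x):
    # One expert heuristic: wire-transfer requests are usually spam.
    return SPAM if "wire transfer" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_known_sender(x):
    return HAM if x.sender.endswith("@mycompany.com") else ABSTAIN

@labeling_function()
def lf_shouting(x):
    return SPAM if x.text.isupper() else ABSTAIN

df = pd.DataFrame({
    "text": ["Please send a wire transfer today", "Lunch at noon?", "WIN A FREE PRIZE"],
    "sender": ["unknown@spam.biz", "alice@mycompany.com", "promo@prizes.example"],
})

# Apply every labeling function to every example, then let the label model
# resolve conflicts and overlaps into probabilistic training labels.
L = PandasLFApplier([lf_wire_transfer, lf_known_sender, lf_shouting]).apply(df)
label_model = LabelModel(cardinality=2)
label_model.fit(L)
print(label_model.predict(L))
```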
Braden Hancock works on Snorkel, and he joins the show to talk about the labeling problems in machine learning, and how Snorkel helps alleviate those problems. We have done many shows on machine learning in the past, which you can find on SoftwareDaily.com. Also, if you are interested in writing about machine learning, we have a new writing feature that you can check out by going to SoftwareDaily.com/write.
Descript is a software product for editing podcasts and video.
Descript is a deceptively powerful tool, and its software architecture includes novel usage of transcription APIs, text-to-speech, speech-to-text, and other domain-specific machine learning applications. Some of the most popular podcasts and YouTube channels use Descript as their editing tool because it provides a set of features that are not found in other editing tools such as Adobe Premiere or a digital audio workstation.
Descript is an example of the downstream impact of machine learning tools becoming more accessible. Even though the company only has a small team of machine learning engineers, these engineers are extremely productive due to the combination of APIs, cloud computing, and frameworks like TensorFlow.
Descript was founded by Andrew Mason, who also founded Groupon and Detour, and Andrew joins the show to describe the technology behind Descript and the story of how it was built. It is a remarkable story of creative entrepreneurship, with numerous takeaways for both engineers and business founders.
Machine learning applications are widely deployed across the software industry.
Most of these applications use supervised learning, a process in which labeled data sets are used to find correlations between the labels and the trends in that underlying data. But supervised learning is only one application of machine learning. Another broad set of machine learning methods is described by the term “reinforcement learning.”
Reinforcement learning involves an agent interacting with its environment. As the model interacts with the environment, it learns to make better decisions over time based on a reward function. Newer AI applications will need to operate in increasingly dynamic environments, and react to changes in those environments, which makes reinforcement learning a useful technique.
Reinforcement learning has several attributes that make it a distinctly different engineering problem than supervised learning. Reinforcement learning relies on simulation and distributed training to rapidly examine how different model parameters could affect the performance of a model in different scenarios.
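To ground the agent/environment/reward loop, here is a minimal tabular Q-learning sketch. The `env` object is an assumption: any toy environment exposing `reset()` returning a state and `step(action)` returning `(next_state, reward, done)`.

```python
# Minimal tabular Q-learning: a sketch of the reward-driven loop, not any
# particular library's API. `env` is an assumed Gym-like toy environment.
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Explore occasionally; otherwise act greedily on current estimates.
            if np.random.rand() < eps:
                action = np.random.randint(n_actions)
            else:
                action = int(Q[state].argmax())
            next_state, reward, done = env.step(action)
            # Nudge the estimate toward reward plus discounted future value.
            Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
            state = next_state
    return Q
```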
Ray is an open source project for distributed applications. Although Ray was designed with reinforcement learning in mind, the potential use cases go beyond machine learning, and could be as influential and broadly applicable as distributed systems projects like Apache Spark or Kubernetes. Ray is a project from the Berkeley RISE Lab, the same place that gave rise to Spark, Mesos, and Alluxio.
The RISE Lab is led by Ion Stoica, a professor of computer science at Berkeley. He is also the co-founder of Anyscale, a company started to commercialize Ray by offering tools and services for enterprises looking to adopt Ray. Ion Stoica returns to the show to discuss reinforcement learning, distributed computing, and the Ray project.
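To give a flavor of Ray's programming model, here is a small sketch that fans simulated rollouts out across a cluster and gathers the results. The rollout body is a placeholder; `ray.init`, `@ray.remote`, `.remote()`, and `ray.get` are the core primitives.

```python
# Fan out many rollouts in parallel with Ray, then collect the results.
import random
import ray

ray.init()

@ray.remote
def rollout(seed: int) -> float:
    # Stand-in for simulating one episode under a given seed/parameter.
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(1_000))

futures = [rollout.remote(seed) for seed in range(100)]  # runs concurrently
print(max(ray.get(futures)))
```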
If you enjoy the show, you can find all of our past episodes about machine learning, data, and the RISE Lab by going to SoftwareDaily.com and searching for the technologies or companies you are curious about. And if there is a subject that you want to hear covered, feel free to leave a comment on the episode, or send us a tweet @software_daily.
Machine learning algorithms have existed for decades. But in the last ten years, several advancements in software and hardware have caused dramatic growth in the viability of applications based on machine learning.
Smartphones generate large quantities of data about how humans move through the world. Software-as-a-service companies generate data about how these humans interact with businesses. Cheap cloud infrastructure allows for the storage of these high volumes of data. Machine learning frameworks such as Apache Spark, TensorFlow, and PyTorch allow developers to easily train statistical models.
These models are deployed back to the smartphones and the software-as-a-service companies, improving people’s ability to move through the world and gain utility from their business transactions. And as humans interact more with their computers, they generate more data, which is used to create better models and deliver higher consumer utility.
The combination of smartphones, cloud computing, machine learning algorithms, and distributed computing frameworks is often referred to as “artificial intelligence.” Chris Benson is the host of the podcast Practical AI, and he joins the show to talk about the modern applications of artificial intelligence, and the stories he is covering on Practical AI. On his podcast, Chris talks about everything within the umbrella of AI, from high level stories to low level implementation details.
We are hiring a content writer and also an operations lead. Both of these are part-time positions working closely with Jeff and Erika. If you are interested in working with us, send an email to jeff@softwareengineeringdaily.com.
Future of Computing with John Hennessy Holiday Repeat
Nov 26, 2019
Originally published June 7, 2018
Moore’s Law states that the number of transistors in a dense integrated circuit doubles about every two years. Moore’s Law is less like a “law” and more like an observation or a prediction.
Moore’s Law is ending. We can no longer fit an increasing number of transistors in the same amount of space at a highly predictable rate. Dennard scaling is also coming to an end. Dennard scaling is the observation that as transistors get smaller, the power density stays constant.
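The arithmetic behind Moore’s Law is easy to sketch. The starting figure below is the Intel 4004’s roughly 2,300 transistors from 1971; the projection is illustrative, not a claim about any specific chip.

```python
# Doubling every two years means counts scale as 2 ** (years / 2).
def projected_transistors(start=2_300, start_year=1971, year=2021):
    return start * 2 ** ((year - start_year) / 2)

print(f"{projected_transistors():.2e}")  # ~7.7e10, in the neighborhood of recent flagship chips
```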
These changes in hardware trends have downstream effects for software engineers. Most importantly–power consumption becomes much more important.
As a software engineer, how does power consumption affect you? It means that inefficient software will either run more slowly or cost more money relative to our expectations in the past. Whereas software engineers writing code 15 years ago could comfortably project that their code would get significantly cheaper to run over time due to hardware advances, the story is more complicated today.
Why is Moore’s Law ending? And what kinds of predictable advances in technology can we still expect?
John Hennessy is the chairman of Alphabet. In 2017, he won a Turing Award (along with David Patterson) for his work on the RISC (Reduced Instruction Set Computer) architecture. From 2000 to 2016, he was the president of Stanford University.
John joins the show to explore the future of computing. While we may not have the predictable benefits of Moore’s Law and Dennard scaling, we now have machine learning. It is hard to plot the advances of machine learning on any one chart (as we explored in a recent episode with OpenAI). But we can say empirically that machine learning is working quite well in production.
If machine learning offers us such strong advances in computing, how can we change our hardware design process to make machine learning more efficient?
As machine learning training workloads eat up more resources in a data center, engineers are developing domain-specific chips which are optimized for those machine learning workloads. The Tensor Processing Unit (TPU) from Google is one such example. John mentioned that chips could become even more specialized within the domain of machine learning. You could imagine a chip that is specifically designed for an LSTM machine learning model.
There are other domains where we could see specialized chips–drones, self-driving cars, wearable computers. In this episode, John describes his perspective on the future of computing, and offers a framework for how engineers can adapt to that future.
Incident Response Machine Learning with Chris Riley
Nov 12, 2019
Software bugs cause unexpected problems at every company.
Some problems are small. A website goes down in the middle of the night, and the outage triggers a phone call to an engineer who has to wake up and fix the problem. Other problems can be significantly larger. When a major problem occurs, it can cause millions of dollars in losses and requires hours of work to fix.
When software unexpectedly breaks, it is called an incident. To triage these incidents, an engineer uses a combination of tools, including Slack, GitHub, cloud providers, and continuous deployment systems. These different tools emit updates that can be received by an incident response platform, which allows the on-call engineer to have the information they need centralized so they can more easily work through the incident.
On-call rotation means that different people will be responsible for dealing with different incidents that occur. When an incident happens, the engineer who is currently on call may not be aware that a similar incident happened last week. It would be easier for the new engineer to triage the issue if they had insights about how the incident was managed the first time.
Chris Riley is a DevOps advocate with Splunk. He joins the show to discuss the application of machine learning to incident response. We discuss the different data points that are created during an incident, and how that data can be used to build models for different types of incidents, which can generate information to help the engineer respond appropriately to an incident. Full disclosure: Splunk is a sponsor of Software Engineering Daily.
Traces: Video Recognition with Veronica Yurchuk and Kostyantyn Shysh
Oct 08, 2019
Video surveillance impacts human lives every day.
On most days, we do not feel the impact of video surveillance. But the effects of video surveillance have tremendous potential. It can be used to solve crimes and find missing children. It can be used to intimidate journalists and empower dictators. Like any piece of technology, video surveillance can be used for good or evil.
Video recognition lets us make better use of video feeds. A stream of raw video doesn’t provide much utility if we can’t easily model its contents. Without video recognition, we must have a human sitting in front of the video to manually understand what is going on in that video.
Veronica Yurchuk and Kosh Shysh are the founders of Traces.ai, a company building video recognition technology focused on safety, anonymity, and positive usage. They join the show to discuss the field of video analysis, and their vision for how video will shape our lives in the future.
We are hiring a head of growth. If you like Software Engineering Daily and consider yourself competent in sales, marketing, and strategy, send me an email: jeff@softwareengineeringdaily.com
FindCollabs is a place to build open source software.
The SEDaily app for iOS and Android includes all 1000 of our old episodes, as well as related links, greatest hits, and topics. Subscribe for ad-free episodes.
Cruise: Self-Driving Engineering with Mo Elshenawy
Oct 01, 2019
The development of self-driving cars is one of the biggest technological changes that is under way.
Across the world, thousands of engineers are working on developing self-driving cars. Although it still seems far away, self-driving cars are starting to feel like an inevitability. This is especially true if you spend much time in downtown San Francisco, where you will see a self-driving car being tested every day. Much of the time, that self-driving car will be operated by Cruise.
Cruise is a company that is building a self-driving car service. The company has hundreds of engineers working across the stack, from computer vision algorithms to automotive hardware. Cruise’s engineering requires engineers who can work with cloud tools as well as low-latency devices. It also requires product developers and managers to lead these different teams.
The field of self-driving is very new. There is not much literature available on how to build a self-driving car. There is even less literature on how to manage a team of engineers that are building, testing, and deploying software and hardware for real cars that are driving around the streets of San Francisco.
Mo Elshenawy is VP of engineering at Cruise, and he joins the show to talk about the engineering that is required to develop fully self-driving car technology, as well as how to structure teams to align the roles of product design, software engineering, testing, machine learning, and hardware.
Full disclosure: Cruise is a sponsor of Software Engineering Daily.
People.ai: Machine Learning for Sales with Andrey Akselrod
Aug 07, 2019
A large sales organization has hundreds of sales people. Each of those sales people manages a set of accounts they are trying to close deals with. Sales people are overseen by managers who ensure that they are performing well. Directors and VPs ensure the scalability and health of the overall sales organization.
The sales lifecycle mostly takes place within a piece of software called a CRM: customer relationship management. This tool documents the interactions between sales people and accounts. CRMs have been around for many years, and although CRM software is a useful repository of data, it does not fulfill all the needs of a salesperson.
People.ai is a system of machine learning tools built around the sales tooling ecosystem. People.ai helps a sales organization avoid manual data entry, understand areas of potential improvement, and decide which high-value sales leads to pursue. Andrey Akselrod is the CTO at People.ai, and he joins the show to discuss the potential applications of machine learning in the domain of sales, and the engineering work that his company has done.
FindCollabs is a place to find collaborators and build projects. We recently launched GitHub integrations, so it’s easier than ever to find collaborators for your open source projects. And if you are looking for some people to start a project with, FindCollabs has topic rooms that allow you to find other people who are interested in a particular technology, so that you can find people who are curious about React, or cryptocurrencies, or Kubernetes, or whatever you want to build with.
Podsheets is an open source podcast hosting platform that we recently launched. We are building Podsheets with the learnings from Software Engineering Daily, and our goal is to be the best place to host and monetize your podcast. If you have been thinking about starting a podcast, check out podsheets.com.
New SEDaily app for iOS and for Android. It includes all 1000 of our old episodes, as well as related links, greatest hits, and topics. You can comment on episodes and have discussions with other members of the community. I’ll be commenting on each episode, so if you hear an episode that you have some commentary on, jump onto the app, or onto SoftwareDaily.com, to share your thoughts. And you can become a paid subscriber for ad-free episodes at softwareengineeringdaily.com/subscribe. Altalogy is the company that has been developing much of the software for the newest app, and if you are looking for a company to help you with your mobile and web development, I recommend checking them out.
WebAssembly on IoT with Jonathan Beri
Jul 30, 2019
“Internet of Things” is a term used to describe the increasing connectivity and intelligence of physical objects within our lives.
IoT has manifested within enterprises under the term “Industrial IoT,” as wireless connectivity and machine learning have started to improve devices such as centrifuges, conveyor belts, and factory robotics. In the consumer space, IoT has moved more slowly than many people expected, and it remains to be seen when we will have widespread computation within consumer devices such as microwaves, washing machines, and light switches.
IoT computers have different constraints than general purpose computers. Security, reliability, battery life, power consumption, and cost structures are very different in IoT devices than in your laptop or smartphone. One technology that could solve some of the problems within IoT is WebAssembly, a newer binary instruction format for executable programs.
Jonathan Beri is a software engineer and the organizer of the San Francisco WebAssembly Meetup. He has significant experience in the IoT industry, and joins the show to discuss the state of WebAssembly, the surrounding technologies, and their impact on IoT.
Afresh: Grocery Store Software with Volodymyr Kuleshov
Jun 26, 2019
A grocery store contains fruit, vegetables, meat, bread, and other items that can expire. In order to keep these items in stock, the store must be aware of how much food has been sold and what has gone bad. When a food item is low in stock, the store needs to order more of that food from a central distribution system.
Managing food inventory is not simple. Some kinds of meat might expire faster than others. Avocados do not become ripe at the same rate as apples. In order to keep the shelves stocked, there are manual workflows for checking the inventory and ordering new inventory.
Afresh is a company that builds software for grocery stores.
Afresh works with grocery chains that have a central distribution center. These grocery stores already have some software. At the back of the store, inventory management systems maintain records of the items that the store has on the shelves. At the front of the store, checkout systems detect what has been sold and help to update inventory. When the inventory is running low, the store can order more inventory from the central distribution center, so that trucks can deliver more inventory.
Afresh improves the operational intelligence of these stores by detecting spoilage among items that are prone to expiration, such as fruit. Volodymyr Kuleshov is the CTO and co-founder of Afresh, and he joins the show to discuss the technical challenges of a grocery store, and the software that Afresh is building to make groceries more intelligent.
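As a toy illustration of what spoilage-aware ordering might look like, consider the sketch below. The rule and the numbers are invented; this is not Afresh’s actual model.

```python
# Hypothetical reorder rule: discount current stock by expected spoilage
# before the next delivery, then order the shortfall.
def order_quantity(on_hand, daily_sales, spoilage_rate, lead_time_days, target_days=4):
    usable = on_hand * (1 - spoilage_rate) ** lead_time_days  # stock that survives until the truck arrives
    needed = daily_sales * (lead_time_days + target_days)     # demand to cover through the next cycle
    return max(0, round(needed - usable))

print(order_quantity(on_hand=40, daily_sales=12, spoilage_rate=0.15, lead_time_days=2))  # 43
```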
ANNOUNCEMENTS
FindCollabs is a place to find collaborators and build projects. FindCollabs is the company I am building, and we are having an online hackathon with $2500 in prizes. If you are working on a project, or you are looking for other programmers to build a project or start a company with, check out FindCollabs. I’ve been interviewing people from some of these projects on the FindCollabs podcast, so if you want to learn more about the community you can hear that podcast.
New Software Daily app for iOS. It includes all 1000 of our old episodes, as well as related links, greatest hits, and topics. You can comment on episodes and have discussions with other members of the community. And you can become a paid subscriber for ad-free episodes at softwareengineeringdaily.com/subscribe. Altalogy is the company that has been developing much of the software for the newest app, and if you are looking for a company to help you with your mobile and web development, I recommend checking them out.
We are hiring two interns for software engineering and business development! If you are interested in either position, send an email with your resume to jeff@softwareengineeringdaily.com with “Internship” in the subject line.
Niantic Real World with Paul Franceus
Jun 21, 2019
Niantic is the company behind Pokemon Go, an augmented reality game where users walk around in the real world and catch Pokemon which appear on their screen.
The idea for augmented reality has existed for a long time. But the technology to bring augmented reality to the mass market has appeared only recently. Improved mobile technology makes it possible for a smartphone to display rendered 3-D images over a video stream without running out of battery.
Ingress was the first game to come out of Niantic, followed by Pokemon Go, but there are other games on the way. Niantic is also working on the Niantic Real World platform, a “planet-scale” AR platform that will allow independent developers to build multiplayer augmented reality experiences that are as dynamic and entertaining as Pokemon Go.
Paul Franceus is an engineer at Niantic, and he joins the show to describe his experience building and launching Pokemon Go, as well as abstracting the technology from Pokemon Go and opening up the Niantic Real World platform to developers.
Stripe Machine Learning Infrastructure with Rob Story and Kelley Rivoire
Jun 13, 2019
Machine learning allows software to improve as that software consumes more data.
Machine learning is a tool that every software engineer wants to be able to use. Because machine learning is so broadly applicable, software companies want to make the tools more accessible to the developers across the organization.
There are many steps that an engineer must go through to use machine learning, and each additional step reduces the chances that the engineer will actually get their model into production.
An engineer who wants to build machine learning into their application needs access to data sets. They need to join those data sets, and load them into a machine (or multiple machines) where their model can be trained. Once the model is trained, it needs to be tested on additional data to ensure quality. If the initial model quality is insufficient, the engineer might need to tweak the training parameters.
Once a model is accurate enough, the engineer needs to deploy that model. After deployment, the model might need to be updated with new data later on. If the model is processing sensitive or financially relevant data, a provenance process might be necessary to allow for an audit trail of decisions that have been made by the model.
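Railyard itself is internal to Stripe, so as a stand-in, here is a minimal scikit-learn sketch of the train/evaluate/iterate loop described above; the synthetic data and the quality bar are illustrative.

```python
# Sketch of the loop: train, test on held-out data, tweak parameters, repeat.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)  # stands in for joined data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for c in (0.01, 0.1, 1.0):  # tweak a training parameter until quality is sufficient
    model = LogisticRegression(C=c, max_iter=1_000).fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    print(f"C={c}: held-out accuracy {score:.3f}")
    if score >= 0.9:  # illustrative quality bar; deployment would follow
        break
```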
Rob Story and Kelley Rivoire are engineers working on machine learning infrastructure at Stripe. After recognizing the difficulties that engineers faced in creating and deploying machine learning models, Stripe engineers built out Railyard, an API for machine learning workloads within the company.
Rob and Kelley join the show to discuss data engineering and machine learning at Stripe, and their work on Railyard.
Augmented Reality Gaming with Tony Godar
May 28, 2019
Augmented reality applications can be used on smartphones and dedicated AR headsets. On smartphones, ARCore (Google) and ARKit (Apple) allow developers to build for the camera on a user’s smartphone. AR headsets such as Microsoft HoloLens and Magic Leap allow for a futuristic augmented reality headset experience.
The most prominent use of augmented reality today is gaming, with a notable example being Niantic’s Pokemon Go. Tony Godar is a software engineer who works on augmented and virtual reality applications. He joins the show to talk about his day job working on virtual reality experiences, and an AR game he built called ARhythm.
Drishti is a company focused on improving manufacturing workflows using computer vision.
A manufacturing environment consists of assembly lines. A line is composed of sequential stations; at each station, a worker performs an operation on the item that is being manufactured. This type of workflow is used for the manufacturing of cars, laptops, stereo equipment, and many other technology products.
With Drishti, the manufacturing process is augmented by adding a camera at each station. Camera footage is used to train a machine learning model for each station on the assembly line. That machine learning model is used to ensure the accuracy and performance of each task that is being conducted on the assembly line.
Krish Chaudhury is the CTO at Drishti. From 2005 to 2015 he led image processing and computer vision projects at Google before joining Flipkart, where he worked on image science and deep learning for another four years. Krish had spent more than twenty years working on image and vision related problems when he co-founded Drishti.
In today’s episode, we discuss the science and application of computer vision, as well as the future of manufacturing technology and the business strategy of Drishti.
Until Google DeepMind came into the field, protein structure prediction was dominated by academics.
Protein structure prediction is the process of predicting how a protein will fold by looking at genetic code. Protein structure prediction is a perfect field to approach through the application of deep learning, because the inputs are high-dimensional and there are plentiful sets of labeled data. Protein structure deep learning is a field in which many different approaches are taken, often involving supervised learning and reinforcement learning.
Mohammed Al Quraishi is a systems biologist at Harvard. His background spans computer engineering, statistics, and genetics. In his work, Mohammed explores the interplay between biology and computer systems.
One area of Mohammed’s focus is protein structure prediction. In a blog post last year, Mohammed gave a brief history of protein structure prediction and described the significance of DeepMind entering the field. DeepMind’s AlphaFold technology surpassed all other competitors in the most recent CASP protein structure competition.
Mohammed joins the show to discuss biology, academia, deep learning, and DeepMind.
Data sets can be modeled in a row-wise, relational format. When two data sets share a common field, they can be combined in a procedure called a join. A join combines the data of two data sets into a single data set that is often bigger than the two initial data sets combined. In fact, this new data set is often so much bigger that it creates problems for machine learning engineers.
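A toy example makes the blowup concrete; assuming pandas, joining two small tables on a shared key already produces more rows than either input.

```python
import pandas as pd

ratings = pd.DataFrame({"user_id": [1, 1, 2], "rating": [5, 3, 4]})
purchases = pd.DataFrame({"user_id": [1, 1, 1, 2], "item": ["a", "b", "c", "d"]})

joined = ratings.merge(purchases, on="user_id")
# User 1 matches 2 x 3 ways and user 2 matches 1 x 1, so the result has
# 7 rows, more than either 3- or 4-row input; redundancy compounds at scale.
print(len(joined))  # 7
```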
Arun Kumar is an assistant professor at UC San Diego. He joins the show to discuss the modern lifecycle of machine learning models, and the gaps in the tooling.
Arun’s research into improving processing of joined data sets has been adopted by companies such as Google. Some of that research has been adapted into open source machine learning tools that improve the performance of machine learning jobs with minimal code required.
Energy Market Machine Learning with Minh Dang and Corey Noone
Mar 11, 2019
The demand for electricity is based on the consumption of the electrical grid at a given time. The supply of electricity is based on how much energy is being produced or stored on the grid at a given time. Because these sources of supply and demand fluctuate rapidly but predictably, energy markets present profit opportunities for traders.
Minh Dang and Corey Noone are engineers with Advanced Microgrid Solutions, a company that builds software to help traders capture better opportunities in the energy markets. Minh and Corey join the show to talk about how their company builds and deploys machine learning models for market prediction.
We discussed data infrastructure, machine learning model deployments, and the dynamics of the energy markets.
Zoox Self-Driving with Ethan Dreyfuss
Feb 20, 2019
Zoox is a full-stack self-driving car company. Zoox engineers work on everything a self-driving car company needs, from the physical car itself to the algorithms running on the car to the ride-hailing system which the company plans to use to drive riders around. Since starting in 2014, Zoox has grown to over 500 employees.
Ethan Dreyfuss is a software infrastructure engineer at Zoox. He joins the show to discuss scaling an engineering team for self-driving. Machine learning was a big part of our conversation, because there are so many different approaches that an engineering team can take when it comes to machine learning for cars.
Can you take computer vision algorithms from academic papers and apply them to cars? Can you use the computer vision APIs from the cloud providers for anything useful? What about physical world mapping companies like Mapillary? How do you do data labeling, and data management? And how do you manage the interactions across the stack, from mechanical engineering to user interface design?
We touched on some of these areas, but barely scratched the surface of the self-driving car domain.
Store2Vec: DoorDash Recommendations with Mitchell Koch
Feb 19, 2019
DoorDash is a food delivery company where users find restaurants to order from. When a user opens the DoorDash app, the user can search for types of food or specific restaurants from the search bar or they can scroll through the feed section and look at recommendations that the app gives them within their local geographic area.
Recommendations is a classic computer science problem. Much like sorting, or mapping, or scheduling, we will probably never “solve” recommendations. We will adapt our recommendation systems based on discoveries in computer science and software engineering.
One pattern that has been utilized recently by software engineers in many different areas is the “word2vec”-style strategy of embedding entities in a vector space and then finding relationships between them. If you have never heard of the word2vec algorithm, you can listen to the episode we did with computer scientist and venture capitalist Adrian Colyer or listen to this episode in which we will describe the algorithm with a few brief examples.
Store2vec is a strategy used by DoorDash to model restaurants in vector space and find relationships between them in order to generate recommendations. Mitchell Koch is a senior data scientist with DoorDash, and he joins the show to discuss the application of store2vec, and the more general strategy of word2vec-like systems. This episode is also a great companion to our episode about data infrastructure at DoorDash.
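The core trick can be sketched under stated assumptions: treat each user session as a “sentence” whose “words” are store IDs, and train an off-the-shelf word2vec implementation (gensim below) on those sequences. The store names are invented, and DoorDash’s production pipeline surely differs.

```python
from gensim.models import Word2Vec

sessions = [  # each session: stores a user viewed together
    ["thai_palace", "pho_house", "sushi_go"],
    ["pho_house", "sushi_go", "ramen_bar"],
    ["burger_shack", "pizza_planet"],
]

model = Word2Vec(sentences=sessions, vector_size=32, window=5, min_count=1, epochs=50)
# Stores that co-occur in sessions land near each other in vector space,
# which is what powers "similar restaurants" style recommendations.
print(model.wv.most_similar("pho_house", topn=2))
```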
Architects of Intelligence with Martin Ford
Jan 31, 2019
Artificial intelligence is reshaping every aspect of our lives, from transportation to agriculture to dating. Someday, we may even create a superintelligence–a computer system that is demonstrably smarter than humans. But there is widespread disagreement on how soon we could build a superintelligence. There is not even a broad consensus on how we can define the term “intelligence”.
Information technology is improving so rapidly we are losing the ability to forecast the near future. Even the most well-informed politicians and business people are constantly surprised by technological changes, and the downstream impact on society. Today, the most accurate guidance on the pace of technology comes from the scientists and the engineers who are building the tools of our future.
Architects of Intelligence is a privileged look at how AI is developing. Martin Ford surveys these different AI experts with similar questions. How will China’s adoption of AI differ from that of the US? What is the difference between the human brain and that of a computer? What are the low-hanging fruit applications of AI that we have yet to build?
Martin joins the show to talk about his new book. In our conversation, Martin synthesizes ideas from these different researchers, and describes the key areas of disagreement from across the field.
To find all 900 of our old episodes, including past episodes with authors and artificial intelligence researchers, check out the Software Engineering Daily app in the iOS and Android app stores. Whether or not you are a software engineer, we have lots of content about technology, business, and culture. In our app, you can also become a paid subscriber and get ad-free episodes–and you can have conversations with other members of the Software Engineering Daily community.
Kubeflow: TensorFlow on Kubernetes with David Aronchick
Jan 25, 2019
When TensorFlow came out of Google, the machine learning community converged around it. TensorFlow is a framework for building machine learning models, but the lifecycle of a machine learning model has a scope that is bigger than just creating a model. Machine learning developers also need to have a testing and deployment process for continuous delivery of models.
The continuous delivery process for machine learning models is like the continuous delivery process for microservices, but can be more complicated. A developer testing a model on their local machine is working with a smaller data set than what they will have access to when it is deployed. A machine learning engineer needs to be conscious of versioning and auditability.
Kubeflow is a machine learning toolkit for Kubernetes based on Google’s internal machine learning pipelines. Google open sourced Kubernetes and TensorFlow, and both projects have users at AWS and Microsoft. David Aronchick is the head of open source machine learning strategy at Microsoft, and he joins the show to talk about the problems that Kubeflow solves for developers, and the evolving strategies for cloud providers.
David was previously on the show when he worked at Google, and in this episode he provides some useful discussion about how open source software presents a great opportunity for the cloud providers to collaborate with each other in a positive sum relationship.
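For a flavor of what a pipeline definition looks like, here is a minimal sketch using the Kubeflow Pipelines SDK’s v2-style decorators. Component bodies are placeholders, and exact SDK details vary across versions.

```python
from kfp import compiler, dsl

@dsl.component
def train_model(learning_rate: float) -> str:
    # Placeholder for real TensorFlow training code.
    return f"trained with lr={learning_rate}"

@dsl.pipeline(name="training-pipeline")
def training_pipeline(learning_rate: float = 0.01):
    train_model(learning_rate=learning_rate)

# Compile to a spec the Kubeflow Pipelines backend runs on Kubernetes.
compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```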
Robots are making their way into every area of our lives. Security robots roll around industrial parks at night, monitoring the area for intruders. Amazon robots tirelessly move packages around in warehouses, reducing the time and cost of logistics. Self-driving cars have become a ubiquitous presence in cities like San Francisco.
For a hacker in a dorm room, or a researcher in a small lab, how do you get started with robotics? There are drones and other small options like AWS DeepRacer–but what is the equivalent of the Raspberry Pi for large, human-sized robots?
Zach Allen is the founder of Slate Robotics, a company that makes large, human-sized robots that are at a low enough cost to be accessible to tinkerers, researchers, and prototype builders. Zach joins the show to talk about the state of robotics and why he started a robot company.
Word2Vec with Adrian Colyer Holiday Repeat
Dec 28, 2018
Originally posted on 13 September 2017.
Machines understand the world through mathematical representations. In order to train a machine learning model, we need to describe everything in terms of numbers. Images, words, and sounds are too abstract for a computer. But a series of numbers is a representation that we can all agree on, whether we are a computer or a human.
In recent shows, we have explored how to train machine learning models to understand images and video. Today, we explore words. You might be thinking–”isn’t a word easy to understand? Can’t you just take the dictionary definition?” A dictionary definition does not capture the richness of a word. Dictionaries do not give you a way to measure similarity between one word and all other words in a given language.
Word2vec is a system for defining words in terms of the words that appear close to that word. For example, the sentence “Howard is sitting in a Starbucks cafe drinking a cup of coffee” gives an obvious indication that the words “cafe,” “cup,” and “coffee” are all related. With enough sentences like that, we can start to understand the entire language.
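The training signal can be sketched directly from that sentence: word2vec slides a window over the text and treats each (center word, nearby word) pair as a training example.

```python
# Extract word2vec-style (center, context) pairs from the example sentence.
sentence = "howard is sitting in a starbucks cafe drinking a cup of coffee".split()
window = 2

pairs = [
    (sentence[i], sentence[j])
    for i in range(len(sentence))
    for j in range(max(0, i - window), min(len(sentence), i + window + 1))
    if i != j
]

# "coffee" is trained against its neighbors, which is how "cafe", "cup",
# and "coffee" end up with similar vectors across a large corpus.
print([ctx for center, ctx in pairs if center == "coffee"])  # ['cup', 'of']
```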
Adrian Colyer is a venture capitalist with Accel, and blogs about technical topics such as word2vec. We talked about word2vec specifically, and the deep learning space more generally. We also explored how the rapidly improving tools around deep learning are changing the venture investment landscape.
Self-Driving Deep Learning with Lex Fridman Holiday Repeat
Dec 27, 2018
Originally posted on 28 July 2017.
Self-driving cars are here. Fully autonomous systems like Waymo are being piloted in less complex circumstances. Human-in-the-loop systems like Tesla Autopilot navigate for drivers when it is safe to do so, and let the human take control in ambiguous circumstances.
Computers are great at memorization, but not yet great at reasoning. We cannot enumerate to a computer every single circumstance that a car might find itself in. The computer needs to perceive its surroundings, plan how to take action, execute control over the situation, and respond to changing circumstances inside and outside of the car.
Lex Fridman has worked on autonomous vehicles with companies like Google and Tesla. He recently taught a class on deep learning for semi-autonomous vehicles at MIT, which is freely available online. There was so much ground to cover in this conversation. Most of the conversation was higher level. How do you even approach the problem? What is the hardware and software architecture of a car?
I enjoyed talking to Lex, and if you want to hear more from him check out his podcast Take It Uneasy, which is about jiu jitsu, judo, wrestling, and learning.
Poker Artificial Intelligence with Noam Brown Holiday Repeat
Nov 21, 2018
Originally posted on May 12, 2015.
Humans have now been defeated by computers at heads-up no-limit hold’em poker.
Some people thought this wouldn’t be possible. Sure, we can teach a computer to beat a human at Go or Chess. Those games have a smaller decision space. There is no hidden information. There is no bluffing. Poker must be different! It is too human to be automated.
The game space of poker is different than that of Go. It has 10^160 different situations–which is more than the number of atoms in the universe. And the game space keeps getting bigger as the stack sizes of the two competitors get bigger.
But it is still possible for a computer to beat a human at calculating game theory optimal decisions–if you approach the problem correctly.
Libratus was developed by CMU professor Tuomas Sandholm, along with my guest today Noam Brown. The Libratus team taught their AI the rules of poker, they gave it a reward function (to win as much money as possible), and they told it to optimize that reward function. Then they had Libratus train itself with simulations.
After enough training, Libratus was ready to crush human competitors, which it did in hilarious, entertaining fashion. There is a video from Engadget on YouTube about the AI competing against professional humans.
In this episode, Noam Brown explains how they built Libratus, what it means for poker players, and what the implications are for humanity–if we can automate poker, what can’t we automate?
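Libratus builds on counterfactual regret minimization; a far simpler cousin, regret matching on rock-paper-scissors, sketches the core idea of a reward function driving self-play toward equilibrium.

```python
# Regret matching on rock-paper-scissors: a toy relative of the CFR family
# behind poker AIs, not Libratus itself.
import random

ACTIONS = 3  # 0=rock, 1=paper, 2=scissors

def payoff(a, b):
    d = (a - b) % 3
    return 0 if d == 0 else (1 if d == 1 else -1)

def strategy(regrets):
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1 / ACTIONS] * ACTIONS

regrets = [0.0] * ACTIONS
strategy_sum = [0.0] * ACTIONS
for _ in range(100_000):
    probs = strategy(regrets)
    strategy_sum = [s + p for s, p in zip(strategy_sum, probs)]
    me = random.choices(range(ACTIONS), weights=probs)[0]   # self-play: both sides
    opp = random.choices(range(ACTIONS), weights=probs)[0]  # draw from one strategy
    for a in range(ACTIONS):  # regret: how much better each alternative would have done
        regrets[a] += payoff(a, opp) - payoff(me, opp)

total = sum(strategy_sum)
print([round(s / total, 3) for s in strategy_sum])  # tends toward the uniform equilibrium
```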
Stay tuned at the end of this episode for the Indeed Prime tip on hiring developers.
Reflow: Distributed Incremental Processing with Marius Eriksen
Nov 16, 2018
The volume of data in the world is always increasing. The cost of storing that data is always decreasing. And the means for processing that data is always evolving.
Sensors, cameras, and other small computers gather large quantities of data from the physical world around us. User analytics tools gather information about how we are interacting with the Internet. Logging servers collect terabytes of records about how our systems are performing.
From the popularity of MapReduce, to the rise of open source distributed processing frameworks like Spark and Flink, to the wide variety of cloud services like BigQuery: there is an endless set of choices for how to analyze gigantic sets of data.
Machine learning training and inference add another dimension to the modern data engineering stack. Whereas tools like Spark and BigQuery are great for performing ad-hoc queries, systems like TensorFlow are optimized for the model training and deployment process.
Stitching together these tools allows a developer to compose workflows for how data pipelines progress through a data engineering system. One popular tool for this is Apache Airflow, which was created in 2014 and is widely used at companies like Airbnb.
Over the next few years, we will see a proliferation of new tools in the world of data engineering–and for good reason. There is a wealth of opportunity for companies to leverage their data to make better decisions, and potentially to clean and offer their internal data as APIs and pre-trained machine learning models.
Today, there are a vast number of enterprises that are modernizing their software development process with Kubernetes, cloud providers, and continuous delivery. Eventually, these enterprises will improve their complex software architecture, and will move from a defensive position to an offensive one. These enterprises will shift their modernization efforts from “DevOps” to “DataOps”, and thousands of software vendors will be ready to sell them new software for modernizing their data platform.
There is no consensus on the best way to build and run a “data platform”. Nearly every company we have talked to on the show has a different definition and a different architecture for their “data platform”: DoorDash, Dremio, Prisma, Uber, MapR, Snowflake, Confluent, Databricks…
We don’t expect to have a concise answer for how to run a data platform any time soon–but on the bright side, data infrastructure seems to be improving. Companies are increasingly able to ask questions about their data and get quick answers, in contrast to the data breadlines that were so prevalent five years ago.
Today we cover yet another approach to large scale data processing.
Reflow is a system for incremental data processing in the cloud. Reflow includes a functional, domain specific language for writing workflow programs, a runtime for evaluating those programs incrementally, and a scheduler for dynamically provisioning resources for those workflows. Reflow was created for large bioinformatics workloads, but should be broadly applicable to scientific and engineering computing workloads.
Reflow evaluates programs incrementally. Whenever the input data or the program changes, only the outputs that depend on the changes are recomputed. This minimizes the amount of recomputation that needs to be performed across a computational graph.
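The evaluation idea can be sketched without Reflow’s DSL: key each step’s output by a hash of the step and its inputs, and recompute only when something changes. This toy memoization is illustrative, not Reflow’s implementation.

```python
import hashlib
import json

cache = {}

def step(fn, *inputs):
    # Content-address the result by the step name and its inputs.
    key = hashlib.sha256(json.dumps([fn.__name__, inputs], default=str).encode()).hexdigest()
    if key not in cache:  # recompute only when the step or its inputs changed
        cache[key] = fn(*inputs)
    return cache[key]

def align(reads): return f"aligned({reads})"
def call_variants(aligned): return f"variants({aligned})"

# Rerunning with the same input hits the cache end to end; changing
# "sample1.fastq" invalidates only the steps downstream of the change.
result = step(call_variants, step(align, "sample1.fastq"))
print(result)
```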
Marius Eriksen is the creator of Reflow and an engineer at GRAIL. He joins the show to discuss the motivation for a new data processing system–which involves explaining why workloads in bioinformatics are different than in some other domains.
Computer Architecture with Dave Patterson
Nov 07, 2018
An instruction set defines a low level programming language for moving information throughout a computer. In the early 1970s, the prevalent instruction set languages used a large vocabulary of different instructions. One justification for a large instruction set was that it would give a programmer more freedom to express the logic of their programs.
Many of these instructions were rarely used. Think of your favorite programming language (or your favorite human language). What percentage of words in the vocabulary do you need to communicate effectively? We sometimes call these language features “syntactic sugar”. They add expressivity to a language, but may not improve functionality or efficiency.
These extra language features can have a cost.
Dave Patterson and John Hennessy created the RISC architecture: the Reduced Instruction Set Computer architecture. RISC proposed reducing the size of the instruction set so that the important instructions could be optimized for. Programs would become more efficient, easier to analyze, and easier to debug.
Dave Patterson’s first paper on RISC was rejected. He continued to research the architecture and advocate for it. Eventually RISC became widely accepted, and Dave won a Turing Award together with John Hennessy.
Dave joins the show to talk about his work on RISC and his continued work in computer science research to the present. He is involved in the Berkeley RISELab and works at Google on the Tensor Processing Unit.
Machine learning is an ocean of new scientific breakthroughs and applications that will change our lives. It was inspiring to hear Dave talk about the changing nature of computing, from cloud computing to security to hardware design.
Diffbot: Knowledge Graph API with Mike Tung
Oct 31, 2018
Google Search allows humans to find and access information across the web. A human enters an unstructured query into the search box, the search engine provides several links as a result, and the human clicks on one of those links. That link brings up a web page, which is a set of unstructured data. Humans can read and understand news articles, videos, and Wikipedia pages.
Google Search solves the problem of organizing and distributing all of the unstructured data across the web, for humans to consume. Diffbot is a company with a goal of solving a related, but distinctly different problem: how to derive structure from the unstructured web, understand relationships within that structure, and allow machines to utilize those relationships through APIs.
Mike Tung is the founder of Diffbot. He joins the show to talk about the last decade that he has spent building artificial intelligence applications, from his research at Stanford to a mature, widely used product in Diffbot. I have built a few applications with Diffbot, and I encourage anyone who is a tinkerer or prototype builder to play around with it. It’s an API for accessing web pages as structured data.
Diffbot crawls the entire web, parsing websites, using NLP and NLU to comprehend those pages, and using probabilistic estimations to draw relationships between entities. It’s an ambitious product, and Mike has been working on it for a long time. I enjoyed our conversation.
We recently launched a new podcast: Fintech Daily! Fintech Daily is about payments, cryptocurrencies, trading, and the intersection between finance and technology. You can find it on fintechdaily.co or Apple and Google podcasts. We are looking for other hosts who want to participate. If you are interested in becoming a host, send us an email: host@fintechdaily.co
Drift: Sales Bot Engineering with David Cancel
Oct 30, 2018
David Cancel has started five companies, most recently Drift. Drift is a conversational marketing and sales platform. David has a depth of engineering skills and a breadth of business experience that make him an amazing source of knowledge. In today’s episode, David discusses topics ranging from the technical details of making a machine learning-driven sales platform to the battle scars from his early career, when he spent a lot of time building products that people did not want. He has found success by focusing on building software that the market has shown a desire for.
Chatbots were a popular, trendy subject a few years ago. The success of chatbots manifested in them fading into the background, and becoming a subtle, increasing part of our everyday interactions. Not every online interaction can be replaced by a chatbot, but many online interactions can be made more efficient by using chatbots. Chatbots can serve well-defined information, like product features, or the hours of operation of a business. When a chatbot gets a question that it cannot answer, the bot can route the conversation to a human.
When a customer lands on a web page of a company using Drift, they see a chat box appear in the corner of the screen. The customer is able to communicate through that chat box with a bot that represents the company. The customer can learn about the product, schedule a call with a salesperson, and get other useful utilities from the Drift sales bot.
The Drift chatbot messaging system is handled by Elixir and Erlang. Erlang is widely known as the messaging language that was used to scale WhatsApp while maintaining high availability. On the backend, Java services take the interactions from the Driftbot and pull them into a CRM, which allows sales and marketing people to manage information about the customers that are interacting with the chatbot. David gives lots more detail around the engineering stack, the deployment model, and his thoughts on the business and modern engineering.
Google Brain is an engineering team focused on deep learning research and applications. One growing area of interest within Google Brain is generative models. A generative model uses neural networks and a large data set to create new data similar to the data that the network has seen before.
One approach to making use of generative models is GANs: generative adversarial networks. GANs can use a generative model (which creates new examples) together with a discriminator model (which can classify examples).
As an example, let’s take the task of generating new pictures of cats. We want an artificial cat picture generator. First, we train a discriminator by feeding it millions of example pictures of cats. We now have a model that can tell what a cat looks like. Next, we make a model that generates completely random images. We feed those randomly generated images to the discriminator. The discriminator outputs a “loss” for these random images. Loss is a metric that represents how far off a given image is from something the discriminator would recognize as a cat. Finally, we feed this “loss” back into the generative model, which adjusts its weights in a way that reduces the loss. Over time, the generator gets better and better at reducing loss, until the discriminator starts believing that some of these semi-random images are actually cats.
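To make that loop concrete, here is a minimal GAN training sketch in PyTorch. It is illustrative only, not code from Magenta: the “real” images are random vectors standing in for cat photos, and the two optimization steps alternate exactly as described above.

```python
import torch
import torch.nn as nn

# Minimal GAN sketch. "Real" images here are random vectors standing in
# for cat photos; a real system would use image tensors and conv layers.
image_dim, noise_dim = 64, 16
G = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(), nn.Linear(128, image_dim))
D = nn.Sequential(nn.Linear(image_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

for step in range(1000):
    real = torch.randn(32, image_dim)     # stand-in for a batch of cat photos
    fake = G(torch.randn(32, noise_dim))  # generator's attempt at cats

    # Train the discriminator: real images should score 1, fakes 0.
    d_loss = loss_fn(D(real), torch.ones(32, 1)) + \
             loss_fn(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Train the generator: its loss is low when D scores fakes as real.
    # This is the "loss fed back into the generative model" from above.
    g_loss = loss_fn(D(fake), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```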
Generative model systems have produced useful applications, such as object detection, image editing, and text-to-image generation. Today’s guest Doug Eck works on the Magenta team at Google Brain. Magenta uses applications of deep learning to produce tools and experiments around music, art, and creativity.
Real Estate Machine Learning with Or Hiltch
Sep 11, 2018
Stock traders have access to high volumes of information to help them make decisions on whether to buy an asset. A trader who is considering buying a share of Google stock can find charts, reports, and statistical tools to help with their decision. There are a variety of machine learning products to help a technical investor create models of how a stock price might change in the future.
Real estate investors do not have access to the same data and tooling. Most people who invest in apartment buildings are using a combination of experience, news, and basic reports.
Real estate data is very different from stock data. Real estate assets are not fungible–each one is arguably unique, whereas one share of Google stock is the same as another share. But there are commonalities between real estate assets.
Just like collaborative filtering can be applied to find a new movie that is similar to the ones you have watched on Netflix, comparable analysis can be used to find an apartment building that is very similar to another apartment building which recently appreciated in asset value.
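As a toy sketch of comparable analysis, each building can be embedded as a feature vector and queried for its nearest neighbors. The features and numbers below are invented for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Invented example data: each building as (units, year built, avg rent $).
buildings = np.array([
    [120, 1985, 2400],
    [118, 1987, 2350],
    [300, 2005, 3100],
    [45,  1960, 1500],
], dtype=float)

# Scale features so no single column dominates the distance metric.
scaled = (buildings - buildings.mean(axis=0)) / buildings.std(axis=0)

nn = NearestNeighbors(n_neighbors=2).fit(scaled)
dist, idx = nn.kneighbors(scaled[:1])  # comparables for building 0
print(idx)  # building 0 itself, then its closest comparable (building 1)
```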
Skyline.ai is a company that is building tools and machine learning models for real estate investors. Or Hiltch is the CTO at Skyline.ai and he joins the show to explain how to apply machine learning to real estate investing. He also describes the mostly serverless architecture of the company. This is one of the first companies we have talked to that is so heavily on managed services and functions-as-a-service.
RideOS: Fleet Management with Rohan Paranjpe
Aug 31, 2018
Self-driving transportation will be widely deployed at some point in the future. How far off is that future? Estimates vary widely: maybe you will summon a self-driving Uber in New York within 5 years, or maybe it will take 20 years to work out all of the legal and engineering challenges.
Between now and the self-driving future, there will be a long span of time where cars are semi-autonomous. Maybe your car is allowed to drive itself in certain areas of the city. Maybe your car can theoretically drive itself in 99% of conditions, but the law requires you to be behind the wheel until the algorithms get just a little bit better.
While we wait for self-driving to be widely deployed to consumers, a lot could change in the market. We know about Uber, Lyft, Waymo, Tesla and Cruise. But what about the classic car companies like Ford, Mercedes Benz, and Volkswagen? These companies are great at making cars, and they have hired teams of engineers working on self-driving.
But self-driving functionality is not the only piece of software you need to compete as a transportation company. You also need to build a marketplace for your autonomous vehicles, because in the future, far fewer people will want to own a car. Customers will want to use transportation as a service.
RideOS is a company that is building fleet management and navigation software. If you run a company that is building autonomous cars, you need to solve the problem of making an autonomous, safe robot that can drive you around.
Building an autonomous car is hard, but to go to market as a next-generation transportation company, you also need fleet management software, so you can deploy your cars in an Uber-like transportation system. And you need navigation software so that your cars know how to drive around.
RideOS lets a car company like Ford focus on building cars by providing a set of SDKs and cloud services for managing and navigating fleets of cars. Rohan Paranjpe joins today’s show to talk about the world of self-driving cars. Rohan worked at Tesla and Uber before joining RideOS, so he has a well-informed perspective on a few directions the self-driving car market might go in.
Stitch Fix Engineering with Cathy Polinsky
Aug 23, 2018
Stitch Fix is a company that recommends packages of clothing based on a set of preferences that the user defines and updates over time. Stitch Fix’s software platform includes the website, data engineering infrastructure, and warehouse software. Stitch Fix has over 5000 employees, including a large team of engineers.
Cathy Polinsky is the CTO of Stitch Fix. In today’s show Cathy describes how the infrastructure has changed as the company has grown–including the process of moving the platform from Heroku to AWS, and the experience of scaling and refactoring a large monolithic database. Cathy also talked about the management structure, the hiring process, and engineering compensation at Stitch Fix.
DoorDash Engineering with Raghav Ramesh
Aug 16, 2018
DoorDash is a last mile logistics company that connects customers with their favorite national and local businesses. When a customer orders from a restaurant, DoorDash needs to identify the ideal driver for picking up the order from the restaurant and dropping it off with the customer.
This process of matching an order to a driver takes in many different factors. Let’s say I order spaghetti from an Italian restaurant. How long does the spaghetti take to prepare? How much traffic is there in different areas of the city? Who are the different drivers who could potentially pick the spaghetti up? Are there other orders near the Italian restaurant, that we could co-schedule the spaghetti delivery with?
In order to perform this matching of drivers and orders, DoorDash builds machine learning models that take into account historical data. In today’s episode, Raghav Ramesh explains how DoorDash’s data platform works, and how that data is used to build machine learning models. We also explore the machine learning model release process—which involves backtesting, shadowing, and gradual rollout.
Self-Driving Engineering with George Hotz
Aug 08, 2018
In the smartphone market there are two dominant operating systems: one closed source (iPhone) and one open source (Android). The market for self-driving cars could play out the same way, with a company like Tesla becoming the closed source iPhone of cars, and a company like Comma.ai developing the open source Android of self-driving cars.
George Hotz is the CEO of Comma.ai. Comma makes hardware devices that allow “normal” cars to be augmented with advanced cruise control and lane assist features. This means you can take your own car–for example, a Toyota Prius–and outfit it with something similar to Tesla Autopilot. Comma’s hardware devices cost under $1000 to order online.
George joins the show to explain how the Comma hardware and software stack works in detail–from the low level interface with a car’s CAN bus to the high level machine learning infrastructure.
Users who purchase the Comma.ai hardware drive around with a camera facing out of the front of their windshield. This video is used to orient the state of the car in space. The video from that camera also gets saved and uploaded to Comma’s servers. Comma can use this video together with labeled events from the user’s driving experience to crowdsource their model for self-driving.
For example, if a user is driving down a long stretch of highway, and they turn on the Comma.ai driving assistance, the car will start driving itself and the video capture will begin. If the car begins to swerve into another lane, the user will take over for the car and the Comma system will disengage. This “disengagement” event gets labeled as such, and when that data makes it back to Comma’s servers, Comma can use the data to update their models.
George is very good at explaining complex engineering topics, and is also quite entertaining and open to discussing the technology as well as other competitors in the autonomous car space. I have not been able to get many other people on the show to talk about autonomous cars, so this was quite refreshing! I hope to do more in the future.
“Bots” are becoming increasingly relevant to our everyday interactions with technology. A bot sometimes mediates the interactions of two people. Examples of bots include automated reply systems, intelligent chat bots, classification systems, and prediction machines. These systems are often powered by machine learning systems that are black boxes to the user.
Today’s guest Rob May argues that these systems should be auditable and accountable, and that using a blockchain-based identity system for bots is a viable solution to the machine learning auditability problem.
Rob is the CEO of Talla, a knowledge base provider for business teams. The Botchain project was spun out of Talla as a solution to the problem of bot identity.
In this episode, we talk about Botchain and the application of blockchain to bot identity, the current state of ICOs, and the viability of utility token ecosystems. Botchain has its own cryptotoken called “Botcoin.”
Machine Learning Deployments with Diego Oppenheimer
Jul 13, 2018
Machine learning models allow our applications to perform highly accurate inferences. A model can be used to classify a picture as a cat, or to predict what movie I might want to watch. But before a machine learning model can be used to make these inferences, the model must be trained and deployed.
In the training process, a machine learning model consumes a data set and learns from it. The training process can consume significant resources. After the training process is over, you have a trained model that you need to get into production. This is known as the “deployment” step.
Deployment can be a hard problem. You are taking a program from a training environment to a production environment. A lot can change between these two environments. In production, your model is running on a different machine–which can lead to compatibility issues. If your model serves a high volume of requests, it might need to scale up. In production, you also need caching, and monitoring, and logging.
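At its smallest, a deployment is just a trained model loaded behind a network endpoint. The sketch below uses Flask with a hypothetical model.pkl artifact; everything listed above (scaling, caching, monitoring, logging) is what production adds on top of this core:

```python
import pickle
from flask import Flask, jsonify, request

# A minimal sketch of the "deployment" step: a trained model loaded once
# and served over HTTP. The model file and request shape are hypothetical;
# a scikit-learn-style predict() interface is assumed.

app = Flask(__name__)

with open("model.pkl", "rb") as f:  # artifact produced by the training step
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    return jsonify({"prediction": model.predict([features]).tolist()})

if __name__ == "__main__":
    app.run(port=8080)
```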
Large companies like Netflix, Uber, and Facebook have built their own internal systems to control the pipeline of getting a model from training into production. Companies that are newer to machine learning can struggle with this deployment process, and they usually don’t have the resources to build a machine learning platform of their own the way Netflix has.
Diego Oppenheimer is the CEO of Algorithmia, a company that has built a system for automating machine learning deployments. This is the second cool product that Algorithmia has built, the first being the algorithm marketplace that we covered in an episode a few years ago.
In today’s show, Diego describes the challenges of deploying a machine learning model into production, and how that product was a natural complement to the algorithms marketplace. Full disclosure: Algorithmia is a sponsor of Software Engineering Daily.
Machine Learning Stroke Identification with David Golan
Jul 05, 2018
When a patient comes into the hospital with stroke symptoms, the hospital will give that patient a CAT scan, a 3-dimensional image of the patient’s brain. The CAT scan needs to be examined by a radiologist, and the radiologist will decide whether to refer the patient to an interventionist–a surgeon who can perform an operation to lower the risk of long-term damage to the patient’s brain function.
After getting the CAT scan, the patient might wait for hours before a radiologist has a chance to look at the scan. In that period of time, the patient’s brain function might be rapidly degrading. To speed up this workflow, a company called Viz.ai built a machine learning model that can recognize whether a patient is at high risk of stroke consequences or not.
Many people have predicted that radiologists will be automated away by machine learning in the coming years. This episode presents a much more realistic perspective: first of all, we don’t have nearly enough radiologists, so if we can create automated radiologists that would be a very good thing; second of all, even in this workflow with a cutting-edge machine learning radiologist, you still need the human radiologist in the loop.
David Golan is the CTO at Viz.ai, and in today’s show he explains why he is working on a system for automated stroke identification, and the engineering challenges in building that system.
Digital Evolution with Joel Lehman, Dusan Misevic, and Jeff Clune
Jun 15, 2018
Evolutionary algorithms can generate surprising, effective solutions to our problems.
Evolutionary algorithms are often let loose within a simulated environment. The algorithm is given a function to optimize for, and the engineers expect that algorithm to evolve a solution that optimizes for the objective function given the constraints of the simulated environment. But sometimes these results are not exactly what we are looking for.
For example, imagine an evolutionary algorithm that tries to evolve a creature that does a flip within a simulated physics engine that mirrors the real world.
You could imagine all sorts of evolutionary traits. Maybe the creature will evolve to have legs that are like springs, and let the creature jump high enough to do a flip. Maybe the creature will develop normal legs with strong muscles that propel the creature high enough to flip. But you wouldn’t expect the creature to evolve to be extremely tall–so tall that the creature can merely lean over fast enough so that the top of its body flips upside down. In one experiment, this is exactly what happened.
In another, similar experiment, the evolving creature discovered a bug in the physics engine of the simulated environment. This creature was able to exploit the problem with this physics engine to be able to move in ways that would not be possible in our real-world physical universe.
Evolutionary algorithms sometimes evolve solutions in ways that we don’t expect. Researchers usually throw those results away, because they don’t contribute to the result that the researchers are looking for. The consequence is that lots of interesting anecdotes get lost.
Joel Lehman, Dusan Misevic, and Jeff Clune are the lead authors of the paper “The Surprising Creativity of Digital Evolution.” The paper was a collection of anecdotes about strange results within the world of digital evolution. They join the show to describe what digital evolution is and some of the strange results that they surveyed in their paper.
Joel and Jeff are engineers at Uber’s artificial intelligence division–so this topic has applicable importance to them. Machine learning is all about evolution within simulated environments, and developing safe algorithms for AI requires an understanding of what can go wrong in a poorly defined evolutionary system.
Future of Computing with John Hennessy
Jun 07, 2018
Moore’s Law states that the number of transistors in a dense integrated circuit doubles about every two years. Moore’s Law is less like a “law” and more like an observation or a prediction.
Moore’s Law is ending. We can no longer fit an increasing number of transistors into the same amount of space at a highly predictable rate. Dennard scaling is also coming to an end. Dennard scaling is the observation that as transistors get smaller, power density stays constant.
These changes in hardware trends have downstream effects for software engineers. Most importantly, power consumption becomes a much bigger constraint.
As a software engineer, how does power consumption affect you? It means that inefficient software will either run more slowly or cost more money relative to our expectations in the past. Whereas software engineers writing code 15 years ago could comfortably project that their code would get significantly cheaper to run over time due to hardware advances, the story is more complicated today.
Why is Moore’s Law ending? And what kinds of predictable advances in technology can we still expect?
John Hennessy is the chairman of Alphabet. In 2017, he won a Turing Award (along with David Patterson) for his work on the RISC (Reduced Instruction Set Computer) architecture. From 2000 to 2016, he was the president of Stanford University.
John joins the show to explore the future of computing. While we may not have the predictable benefits of Moore’s Law and Dennard scaling, we now have machine learning. It is hard to plot the advances of machine learning on any one chart (as we explored in a recent episode with OpenAI). But we can say empirically that machine learning is working quite well in production.
If machine learning offers us such strong advances in computing, how can we change our hardware design process to make machine learning more efficient?
As machine learning training workloads eat up more resources in a data center, engineers are developing domain specific chips which are optimized for those machine learning workloads. The Tensor Processing Unit (TPU) from Google is one such example. John mentioned that chips could become even more specialized within the domain of machine learning. You could imagine a chip that is specifically designed for an LSTM machine learning model.
There are other domains where we could see specialized chips–drones, self-driving cars, wearable computers. In this episode, John describes his perspective on the future of computing and offers some framework for how engineers can adapt to that future.
OpenAI: Compute and Safety with Dario Amodei
Jun 04, 2018
Applications of artificial intelligence are permeating our everyday lives. We notice it in small ways–improvements to speech recognition; better quality products being recommended to us; goods and services that have dropped in price because of more intelligent production.
But what can we quantitatively say about the rate at which artificial intelligence is improving? How fast are models advancing? Do the different fields in artificial intelligence all advance together, or are they improving separately from each other? In other words, if the accuracy of a speech recognition model doubles, does that mean that the accuracy of image recognition will double also?
It’s hard to know the answer to these questions.
The largest machine learning training runs today consume 300,000 times the compute of the largest runs in 2012. This does not necessarily mean that models are 300,000 times better–today’s training algorithms could simply be less efficient than yesterday’s, and therefore consume more compute.
We can observe from empirical data that models tend to get better with more data. Models also tend to get better with more compute. How much better do they get? That varies from application to application, from speech recognition to language translation. But models do seem to improve with more compute and more data.
Dario Amodei works at OpenAI, where he leads the AI safety team. In a post called “AI and Compute,” Dario observed that the compute consumed by the largest machine learning training runs is increasing exponentially–doubling every 3.5 months. In this episode, Dario discusses the implications of this increased consumption of compute in the training process.
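As a back-of-the-envelope check on how those two numbers relate, a 300,000x increase at one doubling every 3.5 months works out to a little over five years:

```python
import math

# Back-of-the-envelope check: how long does a 300,000x increase take at
# one doubling every 3.5 months?
doublings = math.log2(300_000)  # ~18.2 doublings
months = doublings * 3.5        # ~63.7 months
print(f"{doublings:.1f} doublings ~ {months / 12:.1f} years")  # ~5.3 years
```

That lines up with the span between 2012 and the time of this episode.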
Dario’s focus is AI safety. AI safety encompasses both the prevention of accidents and the prevention of deliberate malicious AI application.
Today, humans are dying in autonomous car crashes–this is an accident. The reward functions of social networks are being exploited by botnets and fake, salacious news–this is malicious. The dangers of AI are already affecting our lives on the axes of accidents and malice.
There will be more accidents, and more malicious applications–the question is what to do about it. What general strategies can be devised to improve AI safety? After Dario and I talk about the increased consumption of compute by training algorithms, we explore the implications of this increase for safety researchers.
A sample of the human voice is a rich piece of unstructured data. Voice recordings can be turned into visualizations called spectrograms. Machine learning models can be trained to identify features of these spectrograms. Using this kind of analytic strategy, breakthroughs in voice analysis are happening at an amazing pace.
Rita Singh researches voice at Carnegie Mellon University. Her work studies the high volume of latent data that is available in the human voice. As she explains, just a small fragment of a human voice can be used to identify who a speaker is. Your voice is as distinctive as your fingerprint.
Your voice can also reveal medical conditions. Features of the human voice can be strongly correlated with psychiatric symptom severity, and potentially heart disease, cancer, and other illnesses. The human voice can even suggest a person’s physique–your height, weight, and facial features.
In this episode, Rita explains the machine learning techniques that she uses to uncover the hidden richness of the human voice.
Machine Learning with Data Skeptic and Second Spectrum at Telesign
May 19, 2018
Data Skeptic is a podcast about machine learning, data science, and how software affects our lives. The first guest on today’s episode is Kyle Polich, the host of Data Skeptic. Kyle is one of the best explainers of machine learning concepts I have met, and for this episode, he presented some material that is perfect for this audience: machine learning for software engineers.
Second Spectrum is a company that analyzes data from professional sports, turning that data into visualizations, reports, and futuristic sports viewing experiences. We had a previous show about Second Spectrum where we went into the company in detail–it was an excellent show, so I wanted to have Kevin Squire, an engineer from Second Spectrum, come on the show to talk about how the company builds machine learning tools to analyze sports data. If you have not seen any of the visualizations from Second Spectrum, stop what you are doing and watch a video on it!
This year we have had three Software Engineering Daily Meetups: in New York, Boston, and Los Angeles. At each of these Meetups, listeners from the SE Daily community got to meet each other and talk about software–what they are building and what they are excited about. I was happy to be in attendance at each of these, and I am posting the talks given by our presenters. The audio quality is not perfect on these, but there are also no ads.
Thanks to Telesign for graciously providing a space and some delicious food for our Meetup. Telesign has beautiful offices in Los Angeles, and they make SMS, voice, and data solutions. If you are looking for secure and reliable communications APIs, check them out.
We’d love to have you as part of our community. We will have more Meetups eventually, and you can be notified of these by signing up for our newsletter. Come to SoftwareDaily.com and get involved with the discussion of episodes and software projects. You can also check out our open source projects–the mobile apps, and our website.
Deep Learning Topologies with Yinyin Liu
May 10, 2018
Algorithms for building neural networks have existed for decades. For a long time, neural networks were not widely used. Recent changes to the cost of compute and the size of our data have made neural networks extremely useful. Our smartphones generate terabytes of useful data. Lower storage costs make it economical to keep that data. Cloud computing has democratized the ability to do large scale machine learning on specialized deep learning hardware.
Over the last few years, these trends have been driving widespread use of deep learning, in which neural nets with a large series of layers are used to create powerful results in various fields of classification and prediction. Neural networks are a tool for making sense of unstructured data–text, images, sound waves, and videos.
“Unstructured” data is data with high volume or high dimensionality. For example, an image has a huge collection of pixels, and each pixel has a color value. One way to think about image classification is that you are finding correlations between those pixels. A certain cluster of pixels might represent an edge. After doing edge detection on pixels, you have a collection of edges. Then you can find correlations between those edges, and build up higher levels of abstraction.
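The first rung of that abstraction ladder can be written down by hand. The sketch below applies a classic Sobel kernel with SciPy to a toy image; the early layers of a convolutional network learn filters of this flavor (and much richer ones) on their own:

```python
import numpy as np
from scipy.signal import convolve2d

# Toy image: dark on the left, bright on the right, so there is one
# vertical edge down the middle.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# The classic Sobel kernel for vertical edges: it responds where a
# cluster of pixels changes brightness from left to right.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

edges = convolve2d(image, sobel_x, mode="valid")
print(edges)  # strong response along the column where dark meets bright
```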
Yinyin Liu is a principal engineer and head of data science at the Intel AI products group. She studies techniques for building neural networks. Each different configuration of a neural network for a given problem is called a “topology.” Engineers are always looking at new topologies for solving a deep learning application–such as natural language processing.
In this episode, Yinyin describes what a deep learning topology is and describes topologies for natural language processing. We also talk about the opportunities and the bottlenecks in deep learning–including why the tools are so immature, and what it will take to make the tooling better.
Keybase is a platform for managing public key infrastructure. Keybase’s products simplify the complicated process of associating your identity with a public key. Keybase is the subject of the first half of today’s show. Michael Maxim, an engineer from Keybase gives an overview of how the technology works and what kinds of applications Keybase unlocks.
The second half of today’s show is about Clarifai. Clarifai is an AI platform that provides image recognition APIs as a service. Habib Talavati explains how Clarifai’s infrastructure processes requests, and the opportunities for improving the efficiency of that infrastructure.
Thanks to Datadog for graciously providing a space for our Meetup, and for being a sponsor of SE Daily. You can sign up for Datadog and get a free t-shirt by going to softwareengineeringdaily.com/datadog.
TensorFlow Applications with Rajat Monga
Apr 26, 2018
Rajat Monga is a director of engineering at Google where he works on TensorFlow. TensorFlow is a framework for numerical computation developed at Google.
The majority of TensorFlow users are building machine learning applications such as image recognition, recommendation systems, and natural language processing–but TensorFlow is actually applicable to a broader range of scientific computation than just machine learning. TensorFlow has APIs for decision trees, support vector machines, and linear algebra libraries.
The current focus of the TensorFlow team is usability. There are thousands of engineers building data-intensive applications with TensorFlow, but Rajat and the rest of the TensorFlow team would like to see millions more. In today’s show, Rajat and I discussed how TensorFlow is becoming more usable, as well as some of the developments in TensorFlow around edge computing, TensorFlow Hub, and TensorFlow.js, which allows TensorFlow to run in the browser.
Scale Self-Driving with Alexandr Wang
Feb 27, 2018
The easiest way to train a computer to recognize a picture of a cat is to show the computer a million labeled images of cats. The easiest way to train a computer to recognize a stop sign is to show the computer a million labeled stop signs.
Supervised machine learning systems require labeled data. Today, most of that labeling needs to be done by humans. When a large tech company decides to “build a machine learning model,” that often requires a massive amount of effort to get labeled data.
Hundreds of thousands of knowledge workers around the world earn their income from labeling tasks. An example task might be “label all of the pedestrians in this intersection.” You receive a picture of a crowded intersection, and your task is to circle all the pedestrians. You have now created a piece of labeled data.
Scale API is a company that turns API requests into human tasks. Their most recent release is an API for labeling data that has been generated from sensors. As self-driving cars emerge onto our streets, the sensors on these cars generate LIDAR, radar, and camera data. The cars will interpret that data in real-time using their machine learning models, and then they will send that data to the cloud so that the data can be processed offline to improve the machine learning models of every car on the road.
The first step in that processing pipeline is the labeling–which is the focus of today’s conversation. Alexandr Wang is the CEO of Scale, and he joins the show to discuss self-driving cars, labeling, and the company he co-founded.
A few notes before we get started. We just launched the Software Daily job board. To check it out, go to softwaredaily.com/jobs. You can post jobs, you can apply for jobs, and it’s all free. If you are looking to hire, or looking for a job, I recommend checking it out. And if you are looking for an internship, you can use the job board to apply for an internship at Software Engineering Daily.
Also, Meetups for Software Engineering Daily are being planned! Go to softwareengineeringdaily.com/meetup if you want to register for an upcoming Meetup. In March, I’ll be visiting Datadog in New York and Hubspot in Boston, and in April I’ll be at Telesign in LA.
Machine Learning Deployments with Kinnary Jangla
Feb 14, 2018
Pinterest is a visual feed of ideas, products, clothing, and recipes. Millions of users browse Pinterest to find images and text that are tailored to their interests.
Like most companies, Pinterest started with a large monolithic application that served all requests. As Pinterest’s engineering resources expanded, some of the architecture was broken up into microservices and Dockerized, which makes the system easier to reason about.
To serve users with better feeds, Pinterest built a machine learning pipeline using Kafka, Spark, and Presto. User events are generated from the frontend, logged onto Kafka, and aggregated to build machine learning models. These models are deployed into Docker containers much like the production microservices.
Kinnary Jangla is a senior software engineer at Pinterest, and she joins the show to talk about her experiences at the company–breaking up the monolith, architecting a machine learning pipeline, and deploying those models into production.
Training a deep learning model involves operations over tensors. A tensor is a multi-dimensional array of numbers. For several years, GPUs were used for these linear algebra calculations. That’s because graphics chips are built to efficiently process matrix operations.
Tensor processing consists of linear algebra operations that are similar in some ways to graphics processing–but not identical. Deep learning workloads do not run as efficiently on conventional GPUs as they would on chips built specifically for deep learning.
In order to train deep learning models faster, new hardware needs to be designed with tensor processing in mind.
Xin Wang is a data scientist with the artificial intelligence products group at Intel. He joins today’s show to discuss deep learning hardware and Flexpoint, a way to improve the efficiency of space that tensors take up on a chip. Xin presented his work at NIPS, the Neural Information Processing Systems conference, and we talked about what he saw at NIPS that excited him. Full disclosure: Intel, where Xin works, is a sponsor of Software Engineering Daily.
A modern farm has hundreds of sensors to monitor the soil health, and robotic machinery to reap the vegetables. A modern shipping yard has hundreds of computers working together to orchestrate and analyze the freight that is coming in from overseas. A modern factory has temperature gauges and smart security cameras to ensure workplace safety.
All of these devices could be considered “edge” devices.
Over the last decade, these edge devices have mostly been used to gather data and save it to an on-premise server, or to the cloud. Today, as the required volumes of data and compute scale, we look for ways to better utilize our resources. We can start to deploy more application logic to these edge devices, and build a more sophisticated relationship between our powerful cloud servers and the less powerful edge devices.
The soil sensors at the farm are recording long time series of chemical levels. The pressure sensors in a centrifuge are recording months and years of data. The cameras are recording terabytes of video. These huge data sets are correlated with labeled events–such as crop yields.
With these large volumes of data, we can construct models for responding to future events. Deep learning can be used to improve systems over time. The models can be trained in the cloud and deployed to devices at the edge.
Aran Khanna is an AI engineer with Amazon Web Services, and he joins the show to discuss workloads at the cloud and at the edge–how work can be distributed between the two places, and the tools that can be used to build edge deep learning systems more easily.
To find all of our shows about machine learning and edge computing, as well as links to learn more about the topics described in the show, download the Software Engineering Daily app for iOS or Android. These apps have all 650 of our episodes in a searchable format–we have recommendations, categories, related links, and discussions around the episodes. It’s all free and also open source–if you are interested in getting involved in our open source community, we have lots of people working on the project and we do our best to be friendly and inviting to new people coming in looking for their first open-source project. You can find that project at Github.com/softwareengineeringdaily
Training the Machines with Russell Smith
Nov 17, 2017
Automation is changing the labor market.
To automate a task, someone needs to put in the work to describe the task correctly to a computer. For some tasks, the reward for automation is tremendous–for example, putting together mobile phones. In China, companies like FOXCONN are investing time and money into programming the instructions for how to assemble your phone. Robots execute those instructions.
FOXCONN spends millions of dollars deploying these robots, but it is a worthwhile expense. Once FOXCONN pays off the capital investment in those robots, they have a tireless workforce that can build phones all day long. Humans require training, rest, and psychological considerations. And with robots, the error rate is lower. Your smartphone runs your life, and you do not want the liability of human imperfection involved in constructing that phone.
As we race towards an automated future, the manual tasks that get automated first depend on their economic value. The manual labor costs of smartphone construction are a massive expense for corporations. This is also true for truck driving, food service, and package delivery. The savings that will be reaped from automating these tasks are tremendous–regardless of how we automate them.
There are two ways of building automated systems: rule-based systems and machine learning.
With rule-based systems, we can describe to the computer exactly what we want it to do–like following a recipe. With machine learning, we can train the computer by giving it examples and let the computer derive its own understanding of how to automate a task.
Both approaches to automation have difficulties. A rule-based approach requires us to enumerate every single detail to the machine. This might work well in a highly controlled environment like a manufacturing facility. But rule-based systems don’t work well in the real world, where there are so many unexpected events, like snowstorms.
As we reported in a previous episode about how to build self-driving cars, engineers still don’t quite know what the right mix of rule-based systems and machine learning techniques is for autonomous vehicles. But we will continue to pour money into solving this problem, because figuring out how to train the machine is worth the investment.
The routine tasks in our world will be automated given enough time. How soon something will be automated depends on how expensive that task is when it is performed by a human, and how hard it is to design an artificial narrow intelligence to perform the task instead of a human.
Manual software testing is another type of work that is being automated today.
If I am building a mobile app to play podcast episodes, and I make a change to the user interface, I want to have manual quality assurance (QA) testers run through tests that I describe to them, to make sure my change did not break anything. QA tests describe high level application functionality. Can the user register and log in? Can the user press the play button and listen to a podcast episode on my app?
Unit tests are not good enough, because unit tests only verify the logic and the application state from the point of view of the computer itself. Manual QA tests ensure that the quality of the user experience was not impacted.
With so many different device types, operating systems, and browsers, I need my QA test to be executed in all of the different target QA environments. This requires lots of manual testers. If I want manual testing for every deployment I push, that manual testing can get expensive.
RainforestQA is a platform for QA testing that turns manual testing into automated testing. The manual test procedures are recorded, processed by computer vision, and turned into automated tests. RainforestQA hires human workers from Amazon Mechanical Turk to execute the well-defined manual tests, and the recorded manual procedure is used to train the machines that can execute the same task in the future.
Russell Smith is the CTO and co-founder of RainforestQA, and he joins the show to explain how RainforestQA works: the engineering infrastructure, the process of recruiting workers from Mechanical Turk, and the machine learning system for taking manual tasks and automating them.
Machine learning models can be built by plotting points in space and optimizing a function based off of those points.
For example, I can plot every person in the United States in a 3-dimensional space: age, geographic location, and yearly salary. Then I can fit a function that minimizes the distance between the function and each of those data points. Once I define that function, you can give me your age and a geographic location, and I can predict your salary.
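Here is a minimal NumPy version of that example, with made-up numbers: embed each person as a point, then solve for the linear function that minimizes squared distance to the data.

```python
import numpy as np

# Made-up example data. Geographic location is simplified to a numeric
# region code so every person embeds as a point in space.
#             age, region code
X = np.array([[25, 1], [32, 1], [45, 2], [51, 2], [38, 3]], dtype=float)
y = np.array([48_000, 62_000, 90_000, 101_000, 75_000], dtype=float)

# Add a bias column and solve for weights minimizing ||Xw - y||^2.
Xb = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

def predict(age, region):
    return np.array([age, region, 1.0]) @ w

print(predict(40, 2))  # predicted salary for a 40-year-old in region 2
```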
Plotting these points in space is called embedding. By embedding a rich data set, and then experimenting with different functions, we can build a model that makes predictions based on those data sets. Yufeng Guo is a developer advocate at Google working on CloudML. In this show, we described two separate examples for preparing data, embedding the data points, and iterating on the function in order to train the model.
In a future episode, Yufeng will discuss CloudML and more advanced concepts of machine learning.
Sports Deep Learning with Yu-Han Chang and Jeff Su
Sep 29, 2017
A basketball game gives off endless amounts of data. Cameras from all angles capture the players making their way around the court, dribbling, passing, and shooting. With computer vision, a computer can build a well-defined understanding of what a sport looks like. With other machine learning techniques, the computer can make predictions by combining historical data with a game that is going on right now.
Second Spectrum is a company that builds products for analyzing sports. At major basketball arenas, Second Spectrum cameras sit above the court, recording the game and feeding that information to the cloud. Second Spectrum’s servers crunch on the raw data, processing it through computer vision and putting it into deep learning models. The output can be utilized by teams, coaches, and fans.
Yu-Han Chang and Jeff Su are co-founders of Second Spectrum. They join the show to describe the data pipeline of Second Spectrum from the cameras on the basketball court to the entertaining visualizations. After talking to them, I am convinced that machine learning will completely change how sports are played–and will probably open up a platform for new sports to be invented.
The iOS app is the first project to come out of the Software Engineering Daily Open Source Project. There are more projects on the way, and we are looking for contributors–if you want to help build a better SE Daily experience, check out github.com/softwareengineeringdaily. We are working on an Android app, the iOS app, a recommendation system, and a web frontend. Help us build a new way to consume software engineering content at github.com/softwareengineeringdaily.
Deep Learning Systems with Milena Marinova
Sep 19, 2017
The applications that demand deep learning range from self-driving cars to healthcare, but the way that models are developed and trained is similar. A model is trained in the cloud and deployed to a device. The device engages with the real world, gathering more data. That data is sent back to the cloud, where it can improve the model.
From the processor level to the software frameworks at the top of the stack, the impact of deep learning is so significant that it is driving changes everywhere. At the hardware level, new chips are being designed to perform the matrix calculations at the heart of a neural net. At the software level, programmers are empowered by new frameworks like Neon and TensorFlow. In between the programmer and the hardware, middleware can transform software models into representations that can execute with better performance.
Milena Marinova is the senior director of AI solutions at the Intel AI products group, and joins the show today to talk about modern applications of machine learning and how those translate into Intel’s business strategy around hardware, software, and cloud.
Full disclosure: Intel is a sponsor of Software Engineering Daily.
Question of the Week: What is your favorite continuous delivery or continuous integration tool? Email jeff@softwareengineeringdaily.com and a winner will be chosen at random to receive a Software Engineering Daily hoodie.
If I have a picture of a dog, and I want to search the Internet for pictures that look like that dog, how can I do that?
I need an algorithm that builds an index of all the pictures on the Internet. For each image, that index stores mathematical features that describe the image, represented as a matrix of numbers. Then I can run the same algorithm on the picture of my dog, which produces another matrix of numbers, and compare the matrix representing my dog picture to the matrices of all the pictures on the Internet.
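The comparison step might look like the sketch below, assuming each image has already been reduced to a feature vector (for example, by the penultimate layer of a neural network):

```python
import numpy as np

# Random stand-ins: 10k indexed images and one query, each already
# reduced to a 128-dimensional feature vector by some upstream model.
index = np.random.rand(10_000, 128)
query = np.random.rand(128)

# Normalize rows so a dot product equals cosine similarity.
index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)

scores = index_n @ query_n
top5 = np.argsort(scores)[-5:][::-1]  # ids of the 5 most similar images
print(top5, scores[top5])
```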
This is what Google and Facebook do–and we covered this topic in our episode about similarity search a few weeks ago. Today, we evaluate a similar problem: searching images within Squarespace. Squarespace is a platform where users can easily build their own website for blogging, e-commerce, or anything else.
Neel Vadoothker is a machine learning engineer at Squarespace, and he joins the show to talk about how and why he built a visual similarity search engine.
If you like this episode, we have done many other shows about machine learning. You can check out our back catalog by going to softwareengineeringdaily.com or by downloading the Software Engineering Daily app for iOS, where you can listen to all of our old episodes, and easily discover new topics that might interest you. You can upvote the episodes you like and get recommendations based on your listening history. With 600 episodes, it is hard to find the episodes that appeal to you, and we hope the app helps with that.
Machines understand the world through mathematical representations. In order to train a machine learning model, we need to describe everything in terms of numbers. Images, words, and sounds are too abstract for a computer. But a series of numbers is a representation that we can all agree on, whether we are a computer or a human.
In recent shows, we have explored how to train machine learning models to understand images and video. Today, we explore words. You might be thinking–”isn’t a word easy to understand? Can’t you just take the dictionary definition?” A dictionary definition does not capture the richness of a word. Dictionaries do not give you a way to measure similarity between one word and all other words in a given language.
Word2vec is a system for defining words in terms of the words that appear close to that word. For example, the sentence “Howard is sitting in a Starbucks cafe drinking a cup of coffee” gives an obvious indication that the words “cafe,” “cup,” and “coffee” are all related. With enough sentences like that, we can start to understand the entire language.
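As a sketch of the idea, the gensim library provides a word2vec implementation. With a real corpus in place of these two toy sentences, related words end up with nearby vectors:

```python
from gensim.models import Word2Vec

# Two toy sentences; results on a corpus this small are noise, but the
# API is the same one you would use on a real corpus.
sentences = [
    "howard is sitting in a starbucks cafe drinking a cup of coffee".split(),
    "she ordered a cup of coffee at the cafe".split(),
]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=50)
print(model.wv.most_similar("coffee", topn=3))  # neighbors in vector space
```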
Adrian Colyer is a venture capitalist with Accel, and blogs about technical topics such as word2vec. We talked about word2vec specifically, and the deep learning space more generally. We also explored how the rapidly improving tools around deep learning are changing the venture investment landscape.
If you like this episode, we have done many other shows about machine learning with guests like Matt Zeiler, the founder of Clarifai, and Francois Chollet, the creator of Keras.
Artificial Intelligence APIs with Simon Chan
Sep 05, 2017
Software companies that have been around for a decade have a ton of data. Modern machine learning techniques are able to turn that data into extremely useful models. Salesforce users have been entering petabytes of data into the company’s CRM tool since 1999. With its Einstein suite of products, Salesforce is using that data to build new product features and APIs.
Simon Chan is the senior director of product management with Einstein. He oversees the efforts to give longtime Salesforce customers new value, and the efforts to build brand new APIs for image recognition and recommendation systems, which can form the backbone of entirely new businesses.
Companies spend billions of dollars on sales and marketing, and I wanted to understand where the best opportunities for Salesforce were. Simon and I spent much of our time exploring higher level applications, but we got to lower level engineering eventually.
There are 600 episodes of Software Engineering Daily, and it can be hard to find the shows that will interest you. If you have an iPhone and you listen to a lot of Software Engineering Daily, check out the Software Engineering Daily mobile app in the iOS App Store. Every episode can be accessed through the app, and we give you recommendations based on the ones you have already heard.
Automation will make healthcare more efficient and less prone to error. Today, machine learning is already being used to diagnose diabetic retinopathy and improve radiology accuracy. Someday, an AI assistant will help a doctor work through a complicated differential diagnosis.
Our hospitals look roughly the same today as they did ten years ago, because getting new technology into the hands of doctors and nurses is a slow process–just ask anyone who has tried to sell software in the healthcare space. But technological advancement in healthcare is inevitable.
Cosima Gretton is a medical doctor and a product manager with KariusDX, a company that is building diagnostic tools for infectious diseases. She writes about the future of healthcare, exploring the ways that workflows will change and how human biases could impact the diagnostic process–even in the presence of sophisticated AI.
Querying a search index for objects similar to a given object is a common problem. A user who has just read a great news article might want to read articles similar to it. A user who has just taken a picture of a dog might want to search for dog photos similar to it. In both of these cases, the query object is turned into a vector and compared to the vectors representing the objects in the search index.
Facebook contains a lot of news articles and a lot of dog pictures. How do you index and query all that information efficiently? Much of that data is unlabeled. How can you use deep learning to classify entities and add more richness to the vectors?
Jeff Johnson is a research engineer at Facebook. He joins the show to discuss how similarity search works at scale, including how to represent that data and the tradeoffs of this kind of search engine across speed, memory usage, and accuracy.
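Facebook has open sourced this line of work as the FAISS library. The minimal sketch below, with random vectors standing in for learned embeddings, shows the speed-versus-accuracy tradeoff in miniature:

```python
import numpy as np
import faiss  # Facebook's open source similarity search library

# Random stand-ins for learned embeddings.
d = 128
vectors = np.random.rand(100_000, d).astype("float32")
query = np.random.rand(1, d).astype("float32")

# Exact search: accurate, but scans every vector.
exact = faiss.IndexFlatL2(d)
exact.add(vectors)
distances, ids = exact.search(query, 5)  # top 5 neighbors
print(ids)

# Approximate search: an inverted-file index clusters the vectors and
# probes only a few clusters per query, trading accuracy for speed
# and memory.
quantizer = faiss.IndexFlatL2(d)
approx = faiss.IndexIVFFlat(quantizer, d, 256)
approx.train(vectors)  # learn cluster centroids
approx.add(vectors)
approx.nprobe = 8      # clusters probed per query: the accuracy knob
distances, ids = approx.search(query, 5)
print(ids)
```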
Self-Driving Deep Learning with Lex Fridman
Jul 28, 2017
Self-driving cars are here. Fully autonomous systems like Waymo are being piloted in less complex circumstances. Human-in-the-loop systems like Tesla Autopilot handle driving when it is safe to do so, and let the human take control in ambiguous circumstances.
Computers are great at memorization, but not yet great at reasoning. We cannot enumerate to a computer every single circumstance that a car might find itself in. The computer needs to perceive its surroundings, plan how to take action, execute control over the situation, and respond to changing circumstances inside and outside of the car.
Lex Fridman has worked on autonomous vehicles with companies like Google and Tesla. He recently taught a class on deep learning for semi-autonomous vehicles at MIT, which is freely available online. There was so much ground to cover in this conversation. Most of the conversation was higher level. How do you even approach the problem? What is the hardware and software architecture of a car?
I enjoyed talking to Lex, and if you want to hear more from him check out his podcast Take It Uneasy, which is about jiu jitsu, judo, wrestling, and learning.
Instacart Data Science with Jeremy Stanley
Jun 29, 2017
Instacart is a grocery delivery service. Customers log onto the website or mobile app and pick their groceries. Shoppers at the store get those groceries off the shelves. Drivers pick up the groceries and drive them to the customer. This is an infinitely complex set of logistics problems, paired with a rich data set given by the popularity of Instacart.
Jeremy Stanley is the VP of data science for Instacart. In this episode, he explains how Instacart’s 4-sided marketplace business is constructed, and how the different data science teams break down problems like finding the fastest route to groceries within a store, finding the best path to delivering groceries from a store to a user, and personalizing recommendations so people can find new items to try.
Are you looking for old episodes of Software Engineering Daily, but don’t know how to find the ones that are interesting to you? Check out our new topic feeds, in iTunes or wherever you find your podcasts. We’ve sorted all 500 of our old episodes into categories like business, blockchain, cloud engineering, JavaScript, machine learning, and greatest hits. Whatever specific area of software you are curious about, we have a feed for you. Check the show notes for more details.
Distributed Deep Learning with Will Constable
Jun 14, 2017
Deep learning allows engineers to build models that can make decisions based on training data. These models improve over time using stochastic gradient descent. When a model gets big enough, the training must be broken up across multiple machines. Two strategies for doing this are “model parallelism,” which divides the model across machines, and “data parallelism,” which divides the data across multiple copies of the model.
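As a toy illustration of data parallelism, here is a sketch that simulates the workers in a single process; a real system would exchange gradients over a network, and the model, learning rate, and shard count here are all illustrative.

```python
# A toy sketch of data parallelism for a linear model, with the workers
# simulated in one process rather than spread across machines.
import numpy as np

def gradient(w, X, y):
    """Gradient of mean squared error for a linear model."""
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(1024, 8)), rng.normal(size=8)
y = X @ true_w

w = np.zeros(8)
n_workers, lr = 4, 0.1
for step in range(100):
    # Each "worker" computes a gradient on its own shard of the batch...
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [gradient(w, Xs, ys) for Xs, ys in shards]
    # ...and the averaged gradient updates every copy of the model.
    w -= lr * np.mean(grads, axis=0)

print(np.allclose(w, true_w, atol=1e-3))  # the copies converge on one model
```

With equal-sized shards, averaging the per-worker gradients reproduces the full-batch gradient exactly, which is why data parallelism scales so naturally for losses that are means over examples.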
Distributed deep learning brings together two advanced software engineering concepts: distributed systems and deep learning. In this episode, Will Constable, the head of distributed deep learning algorithms at Intel Nervana, joins the show to give us a refresher on deep learning and explain how to parallelize training a model.
Full disclosure: Intel is a sponsor of Software Engineering Daily, and if you want to find out more about Intel Nervana including other interviews and job postings, go to softwareengineeringdaily.com/intel. Intel Nervana is looking for great engineers at all levels of the stack, and in this episode we’ll dive into some of the problems the Intel Nervana team is solving.
Video Object Segmentation with the DAVIS Challenge Team
Jun 05, 2017
Video object segmentation allows computer vision to identify objects as they move through space in a video. The DAVIS challenge is a contest among machine learning researchers working off of a shared dataset of annotated videos.
The organizers of the DAVIS challenge join the show today to explain how video object segmentation models are trained and how different competitors take part in the DAVIS challenge. A good companion to this episode is our discussion of Convolutional Neural Networks with Matt Zeiler.
Software Engineering Daily is looking for sponsors for Q3. If your company has a product or service to promote, or if you are hiring, Software Engineering Daily reaches 23,000 developers listening daily. Send me an email: jeff@softwareengineeringdaily.com
Poker Artificial Intelligence with Noam Brown
May 12, 2017
Humans have now been defeated by computers at heads-up no-limit hold’em poker.
Some people thought this wouldn’t be possible. Sure, we can teach a computer to beat a human at Go or Chess. Those games have a smaller decision space. There is no hidden information. There is no bluffing. Poker must be different! It is too human to be automated.
The game space of poker is different from that of Go. Heads-up no-limit hold’em has about 10^160 different situations, far more than the number of atoms in the observable universe (roughly 10^80). And the game space keeps getting bigger as the stack sizes of the two competitors get bigger.
But it is still possible for a computer to beat a human at calculating game theory optimal decisions–if you approach the problem correctly.
Libratus was developed by CMU professor Tuomas Sandholm, along with my guest today Noam Brown. The Libratus team taught their AI the rules of poker, they gave it a reward function (to win as much money as possible), and they told it to optimize that reward function. Then they had Libratus train itself through self-play simulations.
After enough training, Libratus was ready to crush human competitors, which it did in hilarious, entertaining fashion. There is a video from Engadget on YouTube about the AI competing against professional humans.
In this episode, Noam Brown explains how they built Libratus, what it means for poker players, and what the implications are for humanity–if we can automate poker, what can’t we automate?
Stay tuned at the end of this episode for the Indeed Prime tip on hiring developers.
Convolutional Neural Networks with Matt Zeiler
May 10, 2017
Convolutional neural networks are a machine learning tool that uses layers of convolution and pooling to process and classify inputs. CNNs are useful for identifying objects in images and video. In this episode, we focus on the application of convolutional neural networks to image and video recognition and classification.
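For readers who want to see the two core operations spelled out, here is a minimal sketch of a single convolution and max-pooling pass, assuming stride 1 and no padding; the image and kernel are illustrative, and production frameworks vectorize these loops heavily.

```python
# A minimal sketch of the two core CNN operations on a single-channel image.
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over the image, producing one dot product per position."""
    kh, kw = kernel.shape
    h, w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Downsample by keeping the maximum in each size x size window."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    cropped = feature_map[:h * size, :w * size]
    return cropped.reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(8, 8)
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])  # crude vertical-edge detector
features = np.maximum(conv2d(image, edge_kernel), 0)  # convolution, then ReLU
print(max_pool(features).shape)                       # pooled feature map: (3, 3)
```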
Matt Zeiler is the CEO of Clarifai, an API for image and video recognition. Matt takes us through the basics of a convolutional neural network–you don’t need any background in machine learning to understand the content of the episode. He also discusses the subjective aspects of image and video recognition, and some of the tactics Clarifai has explored. This is far from a solved problem.
Matt also discusses the infrastructure of Clarifai–how they use Kubernetes, how models are deployed, and how models are updated.
Google Brain Music Generation with Doug Eck
May 01, 2017
Most popular music today uses a computer as the central instrument. A single musician is often selecting the instruments, programming the drum loops, composing the melodies, and mixing the track to get the right overall atmosphere.
With so much work to do on each song, popular musicians need to simplify–the result is that pop music today consists of simple melodies without much chord progression.
Magenta is a project out of Google Brain to design algorithms that learn how to generate art and music. One goal of Magenta is to advance the state of the art in machine intelligence for music and art generation. Another goal is to build a community of artists, coders, and machine learning researchers who can collaborate.
Engineers today are happy to outsource server management to a cloud service provider. Similarly, a musician can use Magenta to generate a melody, freeing her to focus on other aspects of a song, such as instrumentation.
Doug Eck is a research scientist at Google. In today’s episode, we explore the Magenta project and the future of music.
Software Engineering Daily is having our third Meetup, Wednesday May 3rd at Galvanize in San Francisco. The theme of this Meetup is Fraud and Risk in Software. We will have great food, engaging speakers, and a friendly, intellectual atmosphere. To find out more, go to softwareengineeringdaily.com/meetup. We would love to get your feedback on Software Engineering Daily. Please fill out the listener survey, available on softwareengineeringdaily.com/survey.
Hedge Fund Artificial Intelligence with Xander Dunn
Apr 03, 2017
A hedge fund is a collection of investors that make bets on the future. The “hedge” refers to the fact that the investors often try to diversify their strategies so that the directions of their bets are less correlated, and they can be successful in a variety of future scenarios. Engineering-focused hedge funds have used what might be called “machine learning” for a long time to predict what will happen in the future.
Numerai is a hedge fund that crowdsources its investment strategies by allowing anyone to train models against Numerai’s data. A model that succeeds in a simulated environment will be adopted by Numerai and used within its real money portfolio. The engineers who create the models are rewarded in proportion to how well the models perform.
Xander Dunn is a software engineer at Numerai and in this episode he explains what a hedge fund is, why the traditional strategies are not optimal, and how Numerai creates the right incentive structure to crowdsource market intelligence. This interview was fun and thought provoking–Numerai is one of those companies that makes me very excited about the future.
Multiagent systems involve the interaction of autonomous agents that may be acting independently or in collaboration with each other. Examples of these systems include financial markets, robot soccer matches, and automated warehouses. Today’s guest Peter Stone is a professor of computer science who specializes in multiagent systems and robotics.
In this episode, we discuss some of the canonical problems of multiagent systems, which overlap with the canonical problems of distributed systems. For example, the problem of coordinating among agents with varying levels of trust resembles the problem of establishing consistency across servers in a database cluster.
Peter has recently contributed to the 100-year study of artificial intelligence, so we also had a chance to discuss the opportunities and roadblocks for AI in the near future. And since Peter teaches computer science at my alma mater, UT Austin, I had to ask him a few questions about the curriculum.
Biological Machine Learning with Jason Knight
Mar 20, 2017
Biology research is complex. The sample size of a biological data set is often too small to make confident judgments about the biological system being studied.
During Jason Knight’s PhD research, the RNA sequencing data he was studying could not support strong conclusions about the gene regulatory networks he was trying to understand.
After working in academia and then at Human Longevity, Inc., Jason came to the conclusion that the best way to work toward biology breakthroughs was to work on the computer systems that enable those breakthroughs. He went to work at Nervana Systems on hardware and software for deep learning. Nervana was subsequently acquired by Intel. In this episode, we discuss how machine learning can be applied to biology today, and how industrial research and development is key to enabling more breakthroughs in the future.
The main lesson I took away from this show is that while we have seen phenomenal breakthroughs in certain areas of health, like image recognition applied to diabetic retinopathy or skin cancer, the challenge of reverse engineering our genome to understand how nucleic acids fit together into humans is still out of reach. Improving the hardware used for deep learning will be necessary to tackle these kinds of informational challenges.
Stripe Machine Learning with Michael Manapat
Mar 17, 2017
Every company that deals with payments deals with fraud. The question is not whether fraud will occur on your system, but rather how much of it you can detect and prevent. If a payments company flags too many transactions as fraudulent, then legitimate transactions will accidentally get flagged as well. But if you don’t reject enough of the fraudulent transactions, you might not be able to make any money at all.
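That tradeoff is easy to see in code with scikit-learn’s precision_recall_curve. The sketch below uses synthetic labels and scores as stand-ins for a real fraud model’s outputs; it is not Stripe’s system.

```python
# A hedged sketch of the flagging tradeoff: raising the decision threshold
# flags fewer transactions, trading recall for precision.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)                         # 1 = fraudulent
scores = np.clip(y_true * 0.4 + rng.random(1000) * 0.6, 0, 1)  # noisy model scores

precision, recall, thresholds = precision_recall_curve(y_true, scores)
for t, p, r in zip(thresholds[::20], precision[::20], recall[::20]):
    # Higher thresholds: precision rises (fewer legitimate charges blocked)
    # while recall falls (more fraud slips through).
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```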
Because fraud detection is such a difficult optimization problem, it is a good fit for machine learning. Today’s guest Michael Manapat works on machine learning fraud detection at Stripe.
This conversation explores aspects of both data science and data engineering. Michael seems to benefit from having a depth of knowledge in both aspects of the data pipeline, which made me question whether data science and data engineering are roles that an engineering organization wants to separate.
This is the third in a series of episodes about Stripe engineering. Throughout these episodes, we’ve tried to give a picture for how Stripe’s engineering culture works. We hope to do more experimental series like this in the future. Please give us feedback for what you think of the format by sending us email, joining the Slack group, or filling out our listener survey. All of these things are available on softwareengineeringdaily.com.
Machine Learning is Hard with Zayd Enam
Feb 16, 2017
Machine learning frameworks like Torch and TensorFlow have made the job of a machine learning engineer much easier. But machine learning is still hard. Debugging a machine learning model is a slow, messy process.
A bug in a machine learning model does not always mean a complete failure. Your model could continue to deliver usable results even in the presence of a mistaken implementation. Perhaps you made a mistake when cleaning your data, leading to an incorrectly trained model.
It is a general rule in computer science that partial failures are harder to fix than complete failures. In this episode, Zayd Enam describes the different dimensions on which a machine learning model can develop an error. Zayd is a machine learning researcher at the Stanford AI Lab, so I also asked him about AI risk, job displacement, and academia versus industry.
Deep learning uses neural networks to identify patterns. Neural networks sequence “layers” of computation, and they can be trained with unsupervised learning, supervised learning, or reinforcement learning. Deep learning has taken off in the last few years, but it has been around for much longer.
Adam Gibson founded Skymind, the company behind Deeplearning4j. Deeplearning4j is a distributed deep learning library for Scala and Java. It integrates with Hadoop and Spark, and is specifically designed to run in business environments on distributed GPUs and CPUs. Adam joins the show today to discuss the history and future of deep learning.
Go Data Science with Daniel Whitenack
Feb 09, 2017
Data science is typically done by engineers writing code in Python, R, or another scripting language. Lots of engineers know these languages, and their ecosystems have great library support. But these languages have some issues around deployment, reproducibility, and other areas. The programming language Golang presents an appealing alternative for data scientists.
Daniel Whitenack transitioned from doing most of his data science work in Python to writing code in Golang. In this episode, Daniel explains the workflow of a data scientist and discusses why Go is useful. We also talk about the blurry line between data science and data engineering, and how Pachyderm is useful for versioning and reproducibility. Daniel works at Pachyderm, and listeners who are more curious about it can check out the episode I did with Pachyderm founder Joe Doliner.
Translation is a classic problem in computer science. How do you translate a sentence from one human language into another? This seems like a problem that computers are well-suited to solve. Languages follow well-defined rules, and we have lots of sample data to train our machine learning models.
And yet, the problem has not been solved–largely because languages don’t always follow rules. We have idioms and subtle contextual clues that make it hard to provide a computer with hard and fast rules for translation.
Unbabel is a company whose solution to translation puts a human in the loop to correct the error-prone translations that computers often make. In this episode, Vasco Pedro joins the show to explain Unbabel’s approach to translation, its technology stack, and the business applications for translation.
Medical Machine Learning with Razik Yousfi and Leo Grady
Jan 17, 2017
Medical imaging is used to understand what is going on inside the human body and prescribe treatment. With new image processing and machine learning techniques, the traditional medical imaging techniques such as CT scans can be enriched to get a more sophisticated diagnosis.
HeartFlow uses data from a standard CT scan to model a human heart and understand blockages of blood flow using simulations of fluid dynamics. In today’s episode, Razik Yousfi and Leo Grady from HeartFlow describe the data processing pipeline for the company and what their technology stack looks like.
Python Data Visualization with Jake VanderPlas
Jan 16, 2017
Data visualization tools are required to translate the findings of data scientists into charts, graphs, and pictures. Understanding how to utilize these tools and display data is necessary for a data scientist to communicate with people in other domains. In this episode, Srini Kadamati hosts a discussion with Jake VanderPlas about the Python ecosystem for data science and the different attempts at creating a data visualization library.
Jake VanderPlas is the Director of Research for Physical Sciences at the University of Washington’s eScience institute, where he also received his PhD in Astronomy. In addition to contributing to many Python data science libraries like scikit-learn, scipy, numpy, and matplotlib, he’s written multiple books that have been published by O’Reilly and has given many talks on data science tools and techniques. He’s also the co-creator of the Altair project, which is a declarative data visualization library for Python built on the Vega-Lite visualization grammar.
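As a small, hedged example of Altair’s declarative style (the dataframe here is made up), notice how each column is mapped to a visual channel rather than drawn with imperative plotting commands; that mapping is the core idea of the Vega-Lite grammar.

```python
# A minimal Altair example: declare mappings from data to visual channels.
import altair as alt
import pandas as pd

df = pd.DataFrame({
    "flipper_length": [181, 195, 210, 224, 190, 217],
    "body_mass": [3750, 3800, 4450, 5700, 3650, 4900],
    "species": ["A", "A", "B", "B", "A", "B"],
})

# Each .encode() argument declares what a visual channel means; Altair and
# Vega-Lite work out axes, scales, and legends from the declaration.
chart = (
    alt.Chart(df)
    .mark_point()
    .encode(x="flipper_length:Q", y="body_mass:Q", color="species:N")
)
chart.save("scatter.html")
```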
PANCAKE STACK Data Engineering with Chris Fregly
Oct 17, 2016
Data engineering is the software engineering that enables data scientists to work effectively. In today’s episode, we explore the different sides of data engineering: the data science algorithms that need to run, and the software architectures that enable those algorithms to run smoothly.
The PANCAKE STACK is a 12-letter acronym that Chris Fregly gave to a collection of data engineering technologies including Presto, Cassandra, Kafka, Elasticsearch, and Spark. These days, Chris travels around the world giving workshops on how to deploy and use the PANCAKE STACK. Before that, he was an engineer at Netflix, where he received an Emmy for Streaming Engineering Excellence.
Scikit-learn is a set of machine learning tools in Python that provides easy-to-use interfaces for building predictive models. In a previous episode with Per Harald Borgen about Machine Learning For Sales, he illustrated how easy it is to get up and running and productive with scikit-learn, even if you are not a machine learning expert. Srini Kadamati hosts today’s show and interviews Andreas Mueller, a core committer to scikit-learn. Srini and Andreas discuss the background and implementation of scikit-learn and walk through some prototypical workflows for using it.
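A sketch of that prototypical workflow is below. The dataset and model choice are illustrative rather than taken from the episode, but the fit/predict interface is the uniform API that nearly every scikit-learn estimator shares.

```python
# A minimal sketch of the standard scikit-learn workflow:
# split the data, fit an estimator, predict, and score.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)                  # every estimator trains via .fit()
predictions = model.predict(X_test)          # ...and predicts via .predict()
print(accuracy_score(y_test, predictions))
```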
Music Deep Learning with Feynman Liang
Sep 02, 2016
Machine learning can be used to generate music. In the case of Feynman Liang’s research project BachBot, the machine learning model is seeded with the music of famous composer Bach. The music that BachBot creates sounds remarkably similar to Bach, although it has been generated by an algorithm, not by a human.
BachBot is a research project on computational creativity. Feynman Liang created BachBot using Python machine learning tools to build a long short-term memory (LSTM) model. Our conversation explores artificial intelligence, music, and his approach to this research project.
You have probably read a news article that was written by a machine. When earnings reports come out, or a series of sports events like the Olympics occurs, there are so many small stories that need to be written that a news organization like the Associated Press would have to use all of its resources to write enough content to cover it all.
Wordsmith is a tool for automated content generation, and today’s guest Robbie Allen is the CEO of Automated Insights, the company that makes Wordsmith. He talks today about the wide range of uses for automated content, as well as how to engineer a product that takes data from a spreadsheet and turns it into a human-readable sentence.
Artificial Intelligence with Oren Etzioni
Aug 29, 2016
Research in artificial intelligence takes place mostly at universities and large corporations, but both of these types of institutions have constraints that cause the research to proceed a certain way. In a university, basic research might be hindered by lack of funding. At a big corporation, the researcher might be encouraged to study a domain that is not squarely in the interest of public good–such as targeted advertising.
Oren Etzioni is the CEO of the Allen Institute for Artificial Intelligence, and in this episode we discuss AI research–from the doomful premonitions of Nick Bostrom to the unbridled optimism of Ray Kurzweil, as well as the realities of how AI research actually proceeds. Projects at the Allen Institute are defined and structured to solve problems in an intelligent, scalable fashion, so that engineering can proceed steadily from a local maximum of a problem domain toward the global maximum. The Allen Institute seeks to bridge the gap between university and corporate research by providing ample funding for open source AI research for the common good.
TensorFlow in Practice with Rajat Monga
Aug 18, 2016
TensorFlow is Google’s open source machine learning library. Rajat Monga is the engineering director for TensorFlow. In this episode, we cover how to use TensorFlow, including an example of how to build a machine learning model to identify whether a picture contains a cat or not. TensorFlow was built with the mission of simplifying the process of deploying a machine learning model from research to production, so we also talk about that, as well as how TensorFlow can be used effectively in combination with Google’s open-source cluster manager, Kubernetes.
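As a hedged sketch of the cat-or-not example in today’s tf.keras API (the 2016-era episode predates this API and used lower-level TensorFlow code; the image size and layer widths here are illustrative, not tuned):

```python
# A hedged sketch of a cat / not-cat binary image classifier in tf.keras.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),                 # small RGB images
    tf.keras.layers.Conv2D(32, 3, activation="relu"),  # learn local features
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),    # P(picture contains a cat)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(cat_and_not_cat_images, labels, epochs=5)  # given a labeled dataset
```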
Data Validation is the process of ensuring that data is accurate. In many software domains, an application is pulling in large quantities of data from external sources. That data will eventually be exposed to users, and it needs to be correct. Radius Intelligence is a company that aggregates data on small businesses. In order to ensure that business addresses and phone numbers are correct, Radius uses human data validation to ensure that their machine-gathered data is correct. On today’s episode, Srini Kadamati interviews Dan Morris about human data validation, and how it fits into a machine learning pipeline.
Machine Learning for Sales with Per Harald Borgen
Aug 16, 2016
Machine learning has become much simpler. Similar to how Ruby on Rails made web development approachable, scikit-learn takes away many of the frustrating aspects of machine learning, and lets the developer focus on building functionality with high-level APIs.
Per Harald Borgen is a developer at Xeneta. He started programming fairly recently, but has already built a machine learning application that cuts down on the time his sales team has to spend qualifying leads. What I found most interesting about this episode was how a single developer used machine learning to solve a simple business problem and deliver solid value. This is in contrast to how many of us think about machine learning: as an intimidating domain that requires a large team to build anything meaningful.
Phone Spam with Truecaller CTO Umut Alp
Jun 08, 2016
The war against spam has been going on for decades. Email spam blockers and ad blockers help protect us from unwanted messages in our communication and browsing experience. These spam prevention tools are powered by machine learning, which catches most of the emails and ads that we don’t want to see. Truecaller is a company that is bringing this quality of spam detection to our phone call systems.
Umut Alp is the CTO of Truecaller, and he joins the show today to break down the engineering problems of preventing telephone call spam. Users install Truecaller on their phones, and the software allows them to report when they have received a spam call. Using this reporting mechanism, along with other learning algorithms, Truecaller is able to learn what types of calls it should block from reaching your phone. Today on Software Engineering Daily, we discuss cell phone spam prevention.
Machine Learning in Healthcare with David Kale
Mar 08, 2016
“Building a model to predict disease and deploying that in the wild – the bar for success is much higher there than, say, deciding what ad to show you.”
Diagnosing illness today requires the trained eye of a doctor. With machine learning, we might someday be able to diagnose illness using only a data set. Today on Software Engineering Daily, we are joined by David Kale, a researcher at the intersection of machine learning and clinical data. We discuss the machine learning and research techniques he is using to diagnose illnesses using neural networks, and we also talk about the challenges of performing data science in hospitals, where the data is mostly confidential. David will also be presenting at Strata + Hadoop World in San Jose. We’re partnering with O’Reilly to support this conference – if you want to go to Strata, you can save 20% off a ticket with our code PCSED.
Questions
What kind of work does a data scientist at a children’s hospital do?
Where is machine learning actually improving healthcare?
What types of data are present in the intensive care unit?
Can you give me an example of how you used an LSTM to make a prediction?
What were the results of your recurrent neural network experiments?
Do you think that deep learning is overhyped right now?
Data Science at Monsanto with Tim Williamson
Feb 29, 2016
“Nothing’s cool unless you call it ‘as a service.’ ”
Monsanto is a company that is known for its chemical and biological engineering. It is less well known for its data science and software engineering teams. Tim Williamson is a data scientist at Monsanto, and on today’s show he talked about how he and a small group of engineers at Monsanto dramatically shifted the culture around data science-driven genetic engineering.
In this episode, Tim explains how useful graph databases are for modeling genetic lineages, and talks about how Monsanto manages simulations and experiments on their genomics software pipeline. Tim also talks about how just a few engineers can create a cultural shift within a large company like Monsanto using the leverage that software provides.
Questions
Why is data science important to Monsanto?
How will data science be used in the future to improve food production?
What are a genomics pipeline and a breeding cycle?
Can you use simulations to improve genetic predictions?
Why are graph databases useful for Monsanto?
What is ancestry-as-a-service?
Are there any agri-tech companies or products that are really exciting to you?
Is it realistic or desirable to move to a meat-free nutrition model?
Deep Learning and Keras with François Chollet
Jan 29, 2016
“I definitely think we can try to abstract away the first principles of intelligence and then try to go from these principles to an intelligent machine that might look nothing like the brain.”
Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation. In this episode, François discusses the state of deep learning, and explains why the field is experiencing a Cambrian explosion that may eventually taper off. He explains the need for Keras and why its simplicity and ease of use make it a useful deep learning library for developers to experiment and build with.
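A minimal sketch of that modularity is below; the layer sizes are arbitrary, and this is not code from the episode. The point is that a Keras model is a stack of interchangeable layer objects, so experimenting means swapping layers in and out.

```python
# A minimal, illustrative Keras model. In multi-backend Keras, the
# KERAS_BACKEND environment variable selects whether this same script
# runs on TensorFlow or Theano.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(64, activation="relu", input_shape=(100,)),  # swap layers in and out
    Dense(64, activation="relu"),                      # to experiment quickly
    Dense(10, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```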
François Chollet is the author of Keras and the founder of Wysp, a learning platform for artists. He currently works for Google as a deep learning engineer and researcher.
Questions
Do you try to design intelligent machines using the human brain as a blueprint?
How has the structure of software engineering teams changed to accommodate the addition of machine learning?
What are the best practices for deploying machine learning systems developed by data scientists into production?
Why do neural network developers need to be able to perform fast experimentation?
Why is modularity important to a deep learning library?
How does Keras interface with the GPU?
What are the interesting trends you notice in machine learning?
Machine Learning for Businesses with Joshua Bloom
Jan 19, 2016
“You’ve got software engineers who are interested in machine learning, and think what they need to do is just bring in another module and then that will solve their problem. It’s particularly important for those people to understand that this is a different type of beast.”
Machine learning is something that many businesses are starting to tack onto their existing processes. Yet adding machine learning capabilities after the fact is often a fool’s errand. Joshua argues that machine learning cannot be an afterthought, but rather must be custom developed to suit the specific problem or question that each company is trying to answer. His company, Wise.io, tackles the challenge of helping businesses build ground-up machine learning applications that generate accurate predictions for use in an array of business processes.
Joshua Bloom is the cofounder and CTO of Wise.io. He is also an astrophysicist, and a professor of astronomy at UC Berkeley.
Questions
What is a machine learning system?
What is the broader impact of this improved ease of use of machine learning algorithms?
How do you think data scientists are stratified?
What does Wise.io do?
How do you abstract away machine learning implementations at large organizations with enterprise software systems?
What is it about machine learning systems that give rise to weak contracts between abstraction levels?
“You don’t mind if failures slow things down, but it’s very important that failures do not stop forward progress.”
TensorFlow is an open source machine learning library intended to bring large-scale, distributed machine learning and deep learning to everyone. Google recently released the framework to the public as a second-generation API, having learned from the successes and failures of DistBelief.
Greg Corrado is a senior research scientist and tech lead at Google, where he focuses on the research areas of machine intelligence, machine perception and natural language processing.
Questions
From the end-user’s point of view, how does Smart Reply work?
How can teams blend research and engineering to make better products?
How did the DistBelief project shape TensorFlow?
How does TensorFlow differ from streaming frameworks that are more generalized, like Spark or Storm?
Why would I want to do machine learning on my phone?
How is TensorFlow fault tolerant?
What are things the open source community should dive into in TensorFlow, to fix and improve it?
Data Science at Spotify with Boxun Zhang
Dec 11, 2015
“I normally try to sit together or very close to a product team or engineering team. And by doing so, I get very close to the source of all kinds of challenging problems.”
Spotify is a streaming music service that uses data science and machine learning to implement product features such as recommendation systems and music categorization, but also to answer internal questions.
Boxun Zhang is a data scientist at Spotify where he focuses on understanding user behavior within the product.
Questions
What is the overlap between distributed systems and data science?
How has Spotify’s big data architecture evolved over time?
As a data scientist do you need to understand this big data architecture well?
What were the benefits of starting to use Kafka?
What kinds of data science problems do you tackle at Spotify?
Could you describe what a random forest is?
Why are there so many streaming systems, and what do you use at Spotify?
How will data science change moving towards the future?
Learning Machines with Richard Golden
Dec 08, 2015
“When I was a graduate student, I was sitting in the office of my advisor in electrical engineering and he said, ‘Look out that window – you see a Volkswagen, I see a realization of a random variable.’ ”
Richard Golden is the host of Learning Machines 101, a podcast that covers artificial intelligence and machine learning topics. Dr. Golden is also a full-time Professor of Cognitive Science and Electrical Engineering at UT Dallas.
Questions
What is machine learning?
What are the fundamental concepts to build artificial intelligence?
How do you define a rule in the domain of machine learning?
How can a machine learning system estimate the probability of something it has not seen?
Could you explain how ML could be applied to real world healthcare scenarios?
What is a neural network?
What is the difference between natural and artificial intelligence?
Bridging Data Science and Engineering with Greg Lamp
Oct 05, 2015
Current infrastructure makes it difficult for data scientists to share analytical models with the software engineers who need to integrate them.
Yhat is an enterprise software company tackling the challenge of how data science gets done. Their products enable companies and users to easily deploy data science environments and translate analytical models into production code.
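As a hedged sketch of the general pattern, not Yhat’s actual product, wrapping a trained model in an HTTP endpoint can be as small as a Flask app; every name here (the route, the model file, the payload shape) is hypothetical.

```python
# A minimal sketch of serving a trained model over HTTP with Flask.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:  # a model the data scientist trained offline
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [1.0, 2.0, 3.0]}.
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]  # assumes a scikit-learn-style model
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```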
Greg Lamp is the Co-founder and CTO of Yhat and previously worked as a product manager in financial services. Yhat was part of the Y Combinator winter 2015 class.
Questions
At a software company, what is the typical relationship between data scientists and software engineers?
Does Yhat turn data scientists’ models into HTTP endpoints?
What was the most counterintuitive advice you received at Y Combinator?
What is the moonshot goal for Yhat?
Is it easier to teach data science to an engineer or engineering to a data scientist?
Data science competitions are an effective way to crowdsource the best solutions for challenging datasets.
Kaggle is a platform for data scientists to collaborate and compete on machine learning problems with the opportunity to win money from the competitions’ sponsors.
Ben Hamner is the co-founder and CTO of Kaggle.
Questions
What is Kaggle?
How does the experience of an individual competitor compare to the experience of a data science team?
What is Kaggle’s tech stack?
Do companies collect too much data?
How do you use machine learning to convert neural patterns into control signals?