• Document Your Machine Learning Project in a Smart Way
    by Bildea Ana on 1 October 2022 at 06:38

    Step-by-step tutorial on how to use Sphinx with Vertex AI pipelines.

    Photo by @sigmund, unsplash.com

    The purpose of this article is to share the procedure for using Sphinx to auto-generate the documentation of your machine learning project. I'm going to use advanced features of Sphinx, such as the addition of logos, notes, images, and markdown documents. I am also going to show the Python package you need so that Sphinx can extract the docstrings present in your Vertex pipelines.

    Some context

    Let's get started! As you may know, having up-to-date documentation for your machine learning projects is vital for both the production and proof-of-concept phases. Why is it vital? Because it helps you clarify and simplify your modules, collaborate with your team, quickly integrate a new team member, evolve faster, and share with the business owners. Personally, I have experienced many cases in which, due to time-to-market constraints, the documentation was ignored, and this turned out to be fatal once the project was released to production. Therefore I advise you to sidestep any manual procedure for generating your documentation, as such procedures always end up desynchronized and time-consuming.

    So, before publishing your project, take some time to check its readability. In my case, I tend to use the following files:

    - README — an easy-to-read file that provides an introduction and general information on the project, such as its purpose, technical information, and software components
    - LICENSE — a file that mentions the license
    - CONTRIBUTING — a step-by-step procedure for contributors to follow
    - USAGE — a file that explains how to use the project
    - CHANGELOG — a file that tracks the changes and the released versions of the project

    Please note that the most important file is the README. The contribution and usage information can be added directly to the README file. The changelog file can be added later on, before releasing the project to production.
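    Laid out at the root of a repository, these files might look like the sketch below (the project name and the src/ folder are illustrative, not from the original post):

```
my-ml-project/
├── README.md          # introduction, purpose, technical overview
├── LICENSE            # license terms
├── CONTRIBUTING.md    # step-by-step guide for contributors
├── USAGE.md           # how to use the project
├── CHANGELOG.md       # released versions and changes
└── src/               # project code
```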
    To edit the files you can use Markdown, plain text, or reStructuredText. Below is an overview of the process we are going to describe.

    Overview of the process (Image by the author)

    What is Sphinx?

    Sphinx is a powerful, easy-to-use, open-source auto-generator tool widely used by the Python community. It is able to generate excellent structured documentation. There exist a few alternatives, such as MkDocs, Doxygen, and pdoc, but Sphinx remains a complete and easy-to-use strong competitor. Its main features:

    - support for several output formats: HTML, PDF, plain text, EPUB, TeX, etc.
    - automatic generation of the documentation
    - automatic link generation
    - multi-language support
    - various extensions available

    Steps:
    I. Set up the environment
    II. Create a virtual environment
    III. Install Sphinx
    IV. Set up Sphinx
    V. Build the documentation

    I. Set up the environment

    - Python 3
    - a local virtual machine or Vertex AI Workbench (a Jupyter notebook running in a virtual environment with Python 3)
    - virtualenv
    - kfx — an extension for the Kubeflow Pipelines SDK
    - MyST parser — a flavor of Markdown
    - a Python project containing Vertex AI pipeline code

    Let's use an end-to-end open-source example of a Vertex AI pipeline under the Apache-2.0 license. The project is a good example because it uses Vertex pipelines and doesn't use a documentation generator. First, clone the source code and go to the vertex-pipelines-end-to-end-samples directory:

        git clone https://github.com/GoogleCloudPlatform/vertex-pipelines-end-to-end-samples.git
        cd vertex-pipelines-end-to-end-samples

    II. Create a virtual environment and activate it
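    The virtual-environment step can be sketched with standard tooling (the environment name "sphinx-venv" is my choice, not from the original post):

```shell
# Create and activate an isolated environment for the Sphinx tooling.
# The environment name "sphinx-venv" is an assumption, not from the post.
python3 -m venv sphinx-venv
. sphinx-venv/bin/activate
python --version
```

    Run deactivate later to leave the environment.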
    III. Install Sphinx

    Create a file requirements-sphinx.txt and add:

        myst-parser==0.15
        requests==2.28.1
        sphinx==4.5.0
        sphinx-click==4.3.0
        sphinx-me==0.3
        sphinx-rtd-theme==1.0.0
        rst2pdf==0.99
        kfx

    Install Sphinx and the extensions listed in requirements-sphinx.txt at once:

        pip install -r requirements-sphinx.txt

    Create a docs directory (if it doesn't exist) to store the Sphinx layout:

        mkdir docs
        cd docs

    Generate the initial directory structure with the sphinx-quickstart command:

        sphinx-quickstart

    Choose separate source and build directories, then set the project name, author name, project release, and project language. You should obtain the following tree structure:

        docs/
        ├── build/
        ├── make.bat
        ├── Makefile
        └── source/
            ├── conf.py
            ├── index.rst
            ├── _static/
            └── _templates/

    As you can see, we chose to separate the build and the source directories. A few explanations about the content:

    - The build/ directory is meant to hold the generated documentation. It is empty for now, as we don't yet have any generated documentation.
    - The make.bat (Windows) and Makefile (Unix) files are scripts that simplify the generation of the documentation.
    - source/conf.py is the configuration file of the Sphinx project. It contains the default configuration keys and the configuration you specified to sphinx-quickstart.
    - source/index.rst is the root document of the project. It contains the table of contents tree (toctree) directive, where you should list all the modules you want to include in your documentation.
    - The _static directory contains custom stylesheets and other static files.
    - The _templates directory stores the Sphinx templates.

    IV. Set up Sphinx

    Identify the Python modules: /pipelines

    The /pipelines directory contains the Python code we want to include in the Sphinx documentation. Note that Sphinx sees the submodules present in the pipelines package only if you add an __init__.py file to the /pipelines directory.

    Generate the Sphinx sources

    Use sphinx-apidoc to build your API documentation (be sure you are at the root of the project).
    The created Sphinx sources are stored in docs/source/pipelines:

        sphinx-apidoc -f -o docs/source/pipelines pipelines/

    You can check that the generated .rst files were created in docs/source/pipelines.

    Copy the markdown files to docs/source

    Copy the README.md, CONTRIBUTING.md, and USAGE.md files into the Sphinx source directory (docs/source/) automatically. Add the following lines to docs/Makefile to synchronize the markdown files:

        COPY_README = ../README.md
        COPY_CONTRIBUTING = ../CONTRIBUTING.md
        COPY_USAGE = ../USAGE.md

        # synchronize MD files
        $(shell cp -f $(COPY_README) $(SOURCEDIR))
        $(shell cp -f $(COPY_CONTRIBUTING) $(SOURCEDIR))
        $(shell cp -f $(COPY_USAGE) $(SOURCEDIR))

    Edit index.rst

    Use the note directive for information you want to highlight:

        .. note:: Sphinx with Vertex AI.

    Use the image directive to add an image. The recommended image width is between 400 and 800 pixels:

        .. image:: ../images/xgboost_architecture.png
           :align: center
           :width: 800px
           :alt: alternate text

    Under the toctree directive, list all the documents you want included in the final documentation (README, modules):

        .. toctree::
           :maxdepth: 2
           :caption: Contents:

           README
           pipelines/modules
           CONTRIBUTING
           USAGE

    Edit conf.py — the main configuration file of Sphinx

    Define the path so Sphinx can find your modules:

        import os
        import sys

        sys.path.insert(0, os.path.abspath("../.."))

    Add your extensions (each extension needs to be listed only once):

        extensions = [
            "sphinx.ext.duration",
            "sphinx.ext.doctest",
            "sphinx.ext.viewcode",
            "sphinx.ext.autosummary",
            "sphinx.ext.intersphinx",
            "sphinx.ext.todo",
            "sphinx.ext.coverage",
            "sphinx_rtd_theme",
            "sphinx_click",
            "myst_parser",
        ]

    List the file types to be parsed:

        source_suffix = {
            ".rst": "restructuredtext",
            ".md": "markdown",
        }

    Specify the HTML theme:

        html_theme = "sphinx_rtd_theme"

    To add a logo, be sure the image is present in source/_static. I have used the Vertex AI logo.
    Then you can define the logo path:

        html_logo = "_static/vertex.png"

    List the external links present in the markdown files:

        intersphinx_mapping = {
            "python": ("https://python.readthedocs.org/en/latest/", None)
        }

    V. Build the documentation

    To generate the HTML documentation with Sphinx, go to /docs and run:

        make html

    Use Firefox to open the HTML page:

        firefox docs/build/html/index.html

    If you managed to go through all the steps, you should now see a much more appealing HTML page. The kfx extension enables Sphinx to read the Kubeflow components, function names, parameters, and docstrings.

    Automate the build of the documentation using the Makefile present at the root of the project. Edit the Makefile and add the following targets (remember that Makefile recipe lines must be indented with a real tab):

        create-sphinx-sources:
            cd docs; make clean; cd ..; rm -r docs/source/pipelines; sphinx-apidoc -f -o docs/source/pipelines pipelines/

        generate-doc:
            @ $(MAKE) create-sphinx-sources && \
            cd docs; make html

    Then call make generate-doc:

        make generate-doc

    We have reached the end of our journey with Sphinx. I hope you found the content useful!

    Summary

    We have seen how to use Sphinx, a powerful tool, to generate documentation for your machine learning project. We customized the documentation with logos, images, and markdown content. Of course, Sphinx comes with plenty of other extensions you can use to make your documentation even more appealing.

    Document Your Machine Learning Project in a Smart Way was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

  • Simple Probabilistic Inference in a Manufacturing Context
    by Somik Raha on 1 October 2022 at 01:00

    TL;DR: This post applies probabilistic inference to a long-established mechanical engineering problem. If you don't care much about theory and epistemology (how we got here), just read The Problem and proceed to Fitting prior knowledge to a Beta distribution. If you are in a bigger rush, just read The Problem and open the model to figure it out.

    How does one learn from data? With the explosion of data in so many business contexts, data science is no longer an optional discipline. With advanced statistical methods packaged in pretty libraries and commoditized for machine learning, it can be all too easy to miss the foundations of probability theory that are at the heart of how we learn from data. Those foundations are over 250 years old, and are both intuitive and philosophical. Understanding them helps us be better practitioners of data science, machine learning, and experimentation.

    To do this, I am going to draw on a workshop I recently taught titled "The Magic of Probability" (public domain bilingual slides in English and Kannada here) under the auspices of the Dr. R. Venkatram Memorial Lecture Series at Bangalore Institute of Technology, my alma mater. The participants were mostly senior faculty members across all disciplines of engineering. As part of a deep dive, I asked for someone to volunteer a problem of inference. The professors of Mechanical Engineering gave me a great one to work with, and the goal of this article is to use that example to illustrate simple probabilistic inference. You can easily replace the manufacturing example with other data-rich contexts. When extending this to A/B tests, see this article.

    The Problem: Non-conformance of manufactured components

    A particular component being manufactured has a tolerance for acceptance of 0.5 mm above or below 25 mm. If it falls outside this range, the component is called non-conforming and is rejected from the batch.
    The folks running the manufacturing shop believe that they will see a range of 4% to 6% non-conforming parts in each batch. When they run the next batch and count the number of non-conforming parts, how should they update their belief about non-conforming parts? What is the probability of the next part manufactured being non-conforming? What probability should they ascribe to being below an upper limit on the non-conformance level? How many batches should be run in order to reach a 90% confidence level of being below that upper limit? Further, what annual operating profit should they forecast, and what is the probability of meeting a target operating profit?

    Classical Probability Theory to the rescue

    In 1763, a great friendship changed the course of how humans do inference. The Reverend Thomas Bayes had passed away two years earlier, leaving behind his unpublished work on probability. His dear friend, the mathematician Richard Price, published this work as An Essay towards solving a Problem in the Doctrine of Chances.

    (Left) Rev. Thomas Bayes, Public Domain | (Right) Bayes' friend Richard Price, Public Domain.

    This work carried two important advances. The first was the derivation of Bayes' theorem from conditional probability. The second was the introduction of the Beta distribution. Pierre-Simon Laplace independently worked out the same theorem and the Beta distribution, and also worked out much of what we call probability theory today.

    The Beta distribution starts with a coin-toss metaphor: a toss has only two possible outcomes, heads or tails. Hence, the probability of k successes in n trials is given by the binomial distribution (bi = "two"). This was already known before Bayes and Laplace came on the scene. They took inspiration from the discrete binomial distribution and, by applying calculus, took it to continuous-distribution land.
    They retained k and n but called them alpha (the number of successes, k) and beta (the number of failures, n − k), which became the shape parameters of this new distribution, which they called the "Beta" distribution.

    The Beta distribution set to (alpha = 1, beta = 1) produces a Uniform Distribution. Image produced by author.

    The amazing thing about this distribution is that it shape-shifts in a way that matches our common sense. If you were to start your probabilistic inference by believing that you had seen one success in two trials, then alpha = 1 and beta = 2 − 1 = 1. Beta(1,1) is in fact the Uniform Distribution. Changing alpha and beta changes the distribution and lets us express different beliefs.

    The shape-shifting Beta Distribution, Public Domain.

    Bayes suggested "with a great deal of doubt" that the Uniform Distribution was the prior probability distribution to express ignorance about the correct prior distribution. Laplace had no such hesitation and asserted this was the way to go when we felt each outcome was equally likely. Further, Laplace provided a rule of succession, which says that we must use the mean of the distribution to place a probability on the next coin landing heads (or the next trial being a success).

    Such was Laplace's impact with his magnum opus, Memoir on the Probability of the Causes of Events, that most people in the West stopped focusing on probability theory, believing there wasn't much more to be advanced there. The Russians didn't get that memo, and so they continued to think about probability, producing fundamental advances like Markov processes.

    For our purposes, though, the next big jump in classical probability came with the work of E. T. Jaynes, followed by Ronald A. Howard. Before we go there, did you notice an important detail?
    The x-axis of the first graph in this section says "long-run fraction of heads", not "probability of heads." This is an important detail, because one cannot have a probability distribution on a probability — that is not interpretable. Where did this thought come from?

    Like Bayes, Jaynes' seminal work was never published in his lifetime. His student Larry Bretthorst published Probability Theory: The Logic of Science after his passing. Jaynes' class notes were a huge influence on the work of my teacher, Ronald A. Howard, the co-founder of Decision Analysis.

    Jaynes introduced the concept of a reasoning robot that would use principles of logic we would agree with. He wrote in the book cited above: "In order to direct attention to constructive things and away from controversial irrelevancies, we shall invent an imaginary being. Its brain is to be designed by us, so that it reasons according to certain definite rules. These rules will be deduced from simple desiderata which, it appears to us, would be desirable in human brains; i.e. we think that a rational person, on discovering that they were violating one of these desiderata, would wish to revise their thinking."

    "Our robot is going to reason about propositions. As already indicated above, we shall denote various propositions by italicized capital letters, {A, B, C, etc.}, and for the time being we must require that any proposition used must have, to the robot, an unambiguous meaning and must be of the simple, definite logical type that must be either true or false."

    Jaynes' robot is the ancestor of Howard's clairvoyant [3], an imaginary being that does not understand models but can answer factual questions about the future. The implication: we can only place probabilities on distinctions that are clear and have not a trace of uncertainty in them.
    In some early writings, you will see the Beta distribution formulated on the "probability of heads." A probability distribution on the "probability of heads" would not be interpretable in any meaningful way. Hence, the edit that Ronald Howard provided in his seminal 1970 paper, Perspectives on Inference, is to reframe the distinction as the long-run fraction of heads (or successes), a question that the clairvoyant can answer.

    The Beta distribution has a most interesting property. As we find more evidence, we can simply update alpha and beta, since they correspond to the number of successes and the number of failures, to obtain the updated probability distribution on the distinction of interest. Here is a simple example of different configurations of alpha and beta (S = number of successes, N = number of tosses):

    Updating the Beta(1,1) prior based on observations, Image created by author.

    We can use this distribution to do our inference. I have prepared a public domain Google sheet (US version, India English version, India Kannada version) that you can play with after making a copy. I will use this sheet to explain the rest of the theory.

    Finally, the last major advance we need for decision-analytic inference is Howard's notion of a personal probability. He introduced this in 1966 in the classic paper Decision Analysis: Applied Decision Theory. [5] While most people look at this paper as the one that kickstarted the field of Decision Analysis (it was also the first time the term "Decision Analysis" was coined), its contribution to probability theory is not as widely appreciated. Until that point, probability was an academic pursuit, not used in professional decision-making. This paper brought to the fore a key idea: that the construct of probability should be used to capture how you "feel" about uncertainty. On page 104, Howard discusses responses to why people haven't heard of Decision Analysis before.
    One of them is that "the idea of probability as a state of mind and not of things is only now regaining its proper place in the world of thought." In the paper, Howard goes to great lengths to show the rigor with which a good prior is to be constructed, which he calls an "almost psychoanalytic process of prior measurement," and whose validity is established when the decision maker says, "yes, this is what I truly believe." In a recent conversation with him, he much preferred the term "personal probability/prior" to a "psychoanalytic prior," and I do too.

    Fitting prior knowledge to a Beta distribution

    Remember that we started with a distribution of non-conformance (4% to 6%)? As an exercise for the reader, refer to the mean and variance of the Beta distribution and derive the formulae for alpha and beta in terms of the mean and variance.

    The first two formulas are from Wikipedia; the third and fourth are derived with basic algebra. Image created by author.

    How do we find the mean and variance of our prior assessment? We assessed the following percentile/probability pairs (the personal prior) from our mechanical engineering experts:

    Snapshot from spreadsheet, Image created by author.

    The interpretation of the above is that there is only a 10% chance of the non-conformance rate being below 4%, and a 10% chance of it being above 6%. There is a 50-50 shot of being above or below 5%. A rule of thumb is to assign weights of 25%/50%/25% to the 10th/50th/90th percentiles. If you'd like to read more about the theory behind it, see [1][2][3]. This shortcut makes it easy to compute the mean as:

        Mean = 25% x 4% + 50% x 5% + 25% x 6% = 5%

    We can similarly calculate the variance using the standard formula, which yields the alpha and beta shape parameters. As you can see, the worksheet shows the equivalent number of successes and tosses.
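    The moment-matching fit can be sketched in a few lines of Python (my sketch; the article does this in a spreadsheet):

```python
# Fit a Beta prior by moment matching. The 25/50/25 weights on the assessed
# 10th/50th/90th percentiles follow the discretization shortcut above.
p10, p50, p90 = 0.04, 0.05, 0.06
w10, w50, w90 = 0.25, 0.50, 0.25

mean = w10 * p10 + w50 * p50 + w90 * p90
var = (w10 * (p10 - mean) ** 2
       + w50 * (p50 - mean) ** 2
       + w90 * (p90 - mean) ** 2)

# Invert the Beta mean/variance formulas:
#   mean = a / (a + b),  var = a*b / ((a + b)**2 * (a + b + 1))
nu = mean * (1 - mean) / var - 1   # equivalent number of tosses, a + b
alpha = mean * nu                  # equivalent number of successes
beta = (1 - mean) * nu             # equivalent number of failures

print(round(alpha), "successes in", round(nu), "tosses")
```

    Rounding gives the same 47 successes in 949 tosses that the worksheet reports.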
    Providing an input of 4%-5%-6% as our prior belief is the same as saying, "we have a strength of belief that is equivalent to seeing 47 successes in 949 tosses." This framing allows our experts to cross-check whether 47 out of 949 tosses makes intuitive sense to them.

    We can also discretize the fitted Beta distribution and compare it with the original inputs:

    Comparing the fitted distribution to the original inputs, snapshot from spreadsheet, Image created by author.

    Updating with Observations

    Now that we have the prior, we can easily update it with our observations. We ask the following question:

    Input for observations, snapshot from spreadsheet, Image created by author.

    The new alpha (successes) and beta (failures) parameters are simply the previous alpha plus the new successes, and the previous beta plus the new failures, respectively. This is shown in the section below:

    The section of the model which shows the updating of the beta distribution, snapshot from spreadsheet, Image created by author.

    This can now be visualized in different ways. First, we can discretize the posterior distribution and compare it with the inputs:

    Comparing the posterior with the prior, snapshot from spreadsheet, Image created by author.

    We see that the posterior distribution is left-shifted.
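    The conjugate update can be sketched as follows (my sketch; the batch counts of 30 non-conforming parts in 1,000 are the ones used later in the article):

```python
# Update the fitted Beta prior with one batch of observations.
alpha_prior, beta_prior = 47.45, 901.55   # from the moment-matching fit
nonconforming, trials = 30, 1000

alpha_post = alpha_prior + nonconforming
beta_post = beta_prior + (trials - nonconforming)

# Laplace's rule of succession: the probability that the next part is
# non-conforming is the posterior mean.
p_next = alpha_post / (alpha_post + beta_post)
prior_mean = alpha_prior / (alpha_prior + beta_prior)
print(round(prior_mean, 4), "->", round(p_next, 4))
```

    The posterior mean falls from about 5% to about 4%, which is exactly the left shift seen in the charts.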
    We can also see this in the visualizations that follow:

    Snapshot from spreadsheet, Image created by author.

    Answering Probability Questions

    First, by Laplace's rule of succession, we can answer the question: what is the probability of the next component being non-conforming?

    Snapshot from spreadsheet, Image created by author.

    This was arrived at by simply dividing the number of posterior successes (posterior alpha) by the number of posterior trials (posterior alpha + posterior beta), i.e. by taking the mean of the posterior distribution. Since we have the posterior cumulative distribution, we can easily read it to answer probability questions.

    Next, we are interested in the probability of being below the target non-conformance level. We can answer this easily by reading the cumulative distribution function at the target level. In the example below, we do the readout against both the prior and the posterior.

    Probability of being below non-conformance target, before and after testing current batch, snapshot from spreadsheet, Image created by author.

    As we can see, our observations made us far more confident about being below the non-conformance level. Note that this inference is good as far as it goes. One critique that can be leveled here is that we have taken all the data at face value and not discounted for the broader context in which this data may appear (for instance, how many batches are we going to see over the year?). Using that information would lead us to introduce a posterior scale power (a.k.a. data scale power) that would temper our inference from the data.

    The posterior scale power, or data scale power, can be thought of as the answer to the question: "how many trials (successes) do I need to see in this test/batch to count as one trial (success)?" The worksheet sets the data scale power to 1 by default, which means all of the data is taken at face value and fully used. The problem with this is that we can make up our minds too quickly.
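    Tempering the update with a scale power is a one-line change to the sketch above (my sketch, using the same prior and batch counts):

```python
# Temper the update with a data scale power: a scale power of k treats
# every k trials as one trial and every k successes as one success.
alpha_prior, beta_prior = 47.45, 901.55
nonconforming, trials = 30, 1000
scale_power = 10

alpha_post = alpha_prior + nonconforming / scale_power
beta_post = beta_prior + (trials - nonconforming) / scale_power
tempered_mean = alpha_post / (alpha_post + beta_post)

print(round(tempered_mean, 4))  # stays close to the 5% prior mean
```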
    A data scale power of 10, which implies that we will take every 10 trials as 1 trial and every 10 successes as 1 success, immediately changes our conclusion. As we can see below, the needle barely moves from the prior, as we are now treating the 30 successes in 1,000 trials as 3 successes in 100 trials (dividing by 10).

    Reading probabilities from distributions, snapshot from spreadsheet, Image created by author.

    Looking at the above, we quickly realize that we need to run more batches in order to gain more confidence, as it should be. Let's say we ran 5 batches of 1,000 components each and saw the same proportion as 30 successes over 1,000 trials: we saw 30 x 5 = 150 successes over 1,000 x 5 = 5,000 trials. We now see close to a 90% confidence level that we will be below the 5% target non-conformance level.

    Snapshot from spreadsheet, Image created by author.

    Now, a key question is: what is a principled way of setting a data scale power? Let's say we want the forecast to be valid annually. One principle we can use is the proportion of the batches used for inference out of the total batches to be manufactured over the year. Say our plan was to manufacture 50 batches and we have used 5 batches for inference; then we can set our data scale power to 50/5 (= 10). Another way to interpret the data scale power is that we have to dilute the data by 10 times in order to interpret it for the entire year.

    Let's now turn to the final forecasting question on operating economics.

    Operating Economics

    It is very easy to place an economic model on top of the forecasting work we have already done. By taking as inputs the price and cost of each component, the number of batches to be processed in a year, and the number of components in each batch, we can get a distribution of the number of non-conforming components by multiplying the total components manufactured (e.g.
    50,000) by each item in the non-conformance posterior distribution that we produced in the prior section. We can then directly calculate the loss distribution by multiplying the non-conformance forecast by the cost of each component. The operating profit can also be calculated easily, by computing the net revenue of the conforming components and subtracting the loss from the non-conforming components.

    Forecasting operating profit. Also available: US Version and Kannada Version. Snapshot from spreadsheet, Image created by author.

    Further, as the screenshot above shows, we can calculate the probability of exceeding the target operating profit, which is the same as the probability of being below the implied non-conformance rate at that profit target, a value we read off the cumulative distribution function of the posterior (of non-conformance) in the previous section.

    Additional Thoughts

    This is a simple model to show how we can get started using probability in forecasting. There are limitations to this approach; one important limitation is that we are treating the price, the cost, the number of batches run, and the number of components produced as fixed. These might all be uncertain, and when that happens the model has to become a little more sophisticated. The reader is referred to the Tornado Diagram tool for building more sophisticated economic models that handle multi-factor uncertainty.

    Further, the beta-binomial updating model works only if we assume stationarity in the process of making the parts, meaning there is no drift. The field of Statistical Process Control [4] deals with drift, and that is beyond the scope of this article.

    Thanks to Dr. Brad Powley for reviewing this article, and to Anmol Mandhania for helpful comments. Mistakes are mine.

    References

    [1] Miller III, Allen C., and Thomas R. Rice. "Discrete approximations of probability distributions." Management Science 29, no. 3 (1983): 352-362. See p. 8.
    [2] McNamee, Peter, and John Nunzio Celona. Decision Analysis for the Professional.
    SmartOrg, Incorporated, 2007. Free online PDF of the book. See page 36 in the chapter "Encoding Probabilities."
    [3] Howard, Ronald A. "The foundations of decision analysis." IEEE Transactions on Systems Science and Cybernetics 4, no. 3 (1968): 211-219.
    [4] Wheeler, D. J., and D. S. Chambers. Understanding Statistical Process Control. 1992.
    [5] Howard, Ronald A. Decision Analysis: Applied Decision Theory. Stanford Research Institute, 1966.

    Simple Probabilistic Inference in a Manufacturing Context was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

  • Instead of Virtue Signaling, I Signed Up to Write 300 GOTV Postcards
    by Monica P. on 1 October 2022 at 00:59

    Except I didn’t write them; Python did it for me. Continue reading on Towards Data Science »

  • Causal Python: 3 Simple Techniques to Jump-Start Your Causal Inference Journey Today
    by Aleksander Molak on 30 September 2022 at 20:11

    Learn 3 techniques for causal effect identification and implement them in Python without losing months, weeks, or days on research. Continue reading on Towards Data Science »

  • How To Get The Current Time In Python
    by Giorgos Myrianthous on 30 September 2022 at 16:40

    Computing the current date and time programmatically with Python. Continue reading on Towards Data Science »

  • What I learned building platforms at Stitch Fix
    by Stefan Krawczyk on 30 September 2022 at 16:31

    What I Learned Building Platforms at Stitch Fix

    Five lessons learned while building platforms for data scientists.

    A blueprint for a platform. Image from pixabay.

    Note: this post originally appeared on my substack.

    Why build a platform?

    Picture this. You're an individual contributor working at some company that requires you to write "code" to get your job done. I'm trying to cast a wide net here: you could be a full-stack data scientist at Stitch Fix creating models that plug back into the business, or a software engineer at a startup writing product features; basically anyone who has to develop some "software" whose work somehow moves the business forward. In general, it is easy to get started and deliver value to the business, since things are relatively simple at first. But consistently delivering value over time is hard. You can easily reach terminal velocity and end up spending all your time keeping your prior efforts running, or fighting their details to expand and do more, instead of moving your business forward. So how do you prevent this? At some point you need to start building abstractions to reduce maintenance costs and increase your development velocity; this is, after all, what all the big tech companies do internally. What these abstractions build out is a platform, i.e. something you build on top of. Now, building good platforms isn't that straightforward, especially as businesses grow and scale.

    I was lucky enough to spend the last six years focusing on "engineering for data science" and learning to build great platforms for the world-class data science team at Stitch Fix. During this time, I saw lots of platform successes and failures first hand. Now, there is plenty of material available on what types of platforms have been built (see any big tech company's blog) and how to think about building a software product (e.g.
    building an MVP), but very little on how to start a platform and build one out. In this post I will synthesize my major learnings about how to build platforms into five lessons. My hope is that these five lessons will come in handy for anyone trying to build a platform, especially in the data/ML space.

    Background Context

    When I joined the data platform team back in 2016, Jeff Magnusson had just written Engineers Shouldn't Write ETL. I was excited to build out capabilities for data scientists who were operating in a no-hand-off model. At the time it was an avant-garde way to run a data science department (if you haven't read either post, it's worth the read). At a high level, the platform team operated without product managers and had to come up with platform capabilities to move data scientists forward, who in turn moved the Stitch Fix business forward. As cheesy as it sounds, what Jeff Magnusson wrote is true: 'Engineers should see themselves as being "Tony Stark's tailor", building the armor that prevents data scientists from falling into pitfalls that yield unscalable or unreliable solutions.' We really did get to dream big with the tooling we built. Now, how did things work out in practice? Well, some people's ideas and efforts flopped hard and others were smashing successes, hence the motivation for this post.

    Before we go further, a quick word on nomenclature. I will use the term "platform" in a loose, metaphorical sense: it is anything you build on top of. So if you provide a web service API, a library, a UI, etc., that other people use to construct things on top of, then you are building a platform. I'm also being liberal with the term "API" to cover all of the UX of your platform unless noted otherwise.

    Lessons Learned

    Here I'll present five lessons.
While the lessons can be read independently of one another, I highly recommend reading them in order.

Lesson 1: Focus on adoption, not completeness

Everyone wants to build the perfect platform for their stakeholders, with all the bells and whistles attached. While this is well-intentioned, such builders commonly fall into a trap: building too much with no early adopters. For those familiar with the terms MVP & PMF, this is basically what this lesson is about.

Let me put this into context. The Stitch Fix Data Platform team operated without product managers, so each platform team had to figure out what to build and who to build it for. A simple solution here could be "just hire a PM", but (1) technical ones are hard to find (especially back in 2016) and (2) it would go against how we wanted to operate. A lot of engineers had to learn the hard way that they couldn't just build something in isolation; going off for a quarter and then going "tada" 🎊 wasn't going to guarantee anyone would use what you were building. In fact, that was a recipe to get you fired!

Why might this happen? Well, if you have a vision for a platform that fulfills a wide array of use cases, it's tempting to build for all use cases from the very beginning. This is an arduous process, and it takes a long time before you arrive at something usable. My metaphor to describe this is: if you want to build a house (which represents your platform), you generally start with the foundations, and then build upwards, adding walls, the ceiling, and then, once the exterior is done, the internals. The house isn't livable or usable until everything is completed. If you build a platform this way, it's very easy to go away for a long time and not have anything to show for it. Worse yet, you waste a lot of effort building a house that no one wants, e.g. one with only one bathroom, only to discover your end users need a bathroom for each room.

Images to help grok my metaphor. On the left, "building up vertically". On the right, "all at once". Image by author.

So instead, one should try to find a way to build up "vertically" one room at a time, so that it's habitable and someone can make use of it before the entire "house" is completed. Yep, go ahead, try to picture a house where only the structure for one room exists, and that room is functional — that's the image I'm going for. While we might not build a house like this in the real world, we can always build software like this, so bear with me. That said, modular construction is all the rage these days, so maybe I am onto something with this metaphor… By building one room at a time, you will get faster validation and have time to pivot/correct as you fill out the rest of the house.

Now, this doesn't solve the problem of knowing what room to build first, and thus who to build for first. Remember there is a human side to building a platform. Determining who, and getting their commitment, arguably can make or break your project. Here are two patterns that I saw work well:

- Adopt existing user tooling
- Partner closely with a team and a specific use case

Adopt existing user tooling

The data scientists Stitch Fix hired were a capable lot. If there was a gap in some area of the platform, you could be sure that data scientists filled that void themselves and built something. As a team determining its own product roadmap, we were on the hunt for capabilities to build and extend. Inheriting homegrown tooling/frameworks/software made a lot of sense. Why? Adoption was all but guaranteed — the platform team only had to polish and generalize. If they built a shack that worked for them, then coming in and doing a remodel gave you a very specific set of parameters to work with. One caveat with this approach is that you need to see a bigger vision than what their solution currently provides, e.g. more capabilities, or supporting more users, else you'll be doing a remodel for little likely benefit.

For example, there was a homegrown tool that one of the teams had come up with for their own particular business context. It was a configuration-driven approach to standardize their team's model training pipelines. They had built it because they needed it to solve some very specific pain points they were having. We did not partner in building it because we were not in a place to support such an endeavor at the time (we were even skeptical of it). Fast forward a year, and suddenly more data science teams heard about it and wanted to start using it. The problem was that it was very coupled to the context of the originating team, who had little incentive to support other teams using it. A perfect problem for a platform team to step in and own! Importantly, we could see a grander vision for it and how it could serve more use cases. See this post for the outcome and extensions we added.

I like this approach in particular because:

- You didn't have to spend time iterating yourself to determine what to build for people to adopt it. Win.
- You got someone else to prove its value. Win.
- You then have good reason to inherit it and improve it. Win.

Note: inheriting can get political at times, especially when the person building it doesn't want to give it up.
If there are clear platform responsibility boundaries in place, this isn't a hard pill to swallow; but if it's a surprise to the creator, then the options are to have them transfer to platform, or simply to have a hard conversation… In general, however, this should be a win-win for everyone involved:

- a win for the team that created the tool, because they are now unburdened by its maintenance.
- a win for you, because you can take over the tool and take adoption and capabilities further than they otherwise would have gone.
- a win for the business, because it's not wasting resources on speculative efforts.

Partnering very closely with a team and a specific use case

I recall one conversation with a platform engineer. They were balking at feedback that they should be able to deliver something sooner for people to get their hands on. "No, that's not possible, that will take two months" (or something to that effect). I agreed: yes, this is a challenge, but if you think about it long enough, there generally are ways for any platform project to be chunked so as to show incremental value and bring a stakeholder along.

Showing incremental value is important; it helps keep you aligned with the stakeholders/users you're targeting. It is also a good way to de-risk projects. When building platforms you have technological risk to mitigate, i.e. proving that the "how" will actually work, and adoption risk, i.e. will someone actually use what I've built? With our house-building metaphor, this is what I mean by figuring out how to build a habitable room without completing the entire house. You want to bring your stakeholder along from architecture diagrams, to showing sample materials, to building something that minimally works for their use case.

Practically speaking, a way to frame delivering incremental value is to do time-boxed prototyping and make go/no-go decisions based on the results.
It is far better to pay a small price here and learn to kill a project early, versus making a large investment without mitigating the key risks to success. Do this by targeting a specific, narrow use case, then determining how to broaden the appeal by expanding the platform "horizontally" to support wider use cases.

For example, when we set out to build out our capability to capture a machine learning model and enable deployment with no work on the part of the data scientist, we partnered very closely with a team that was embarking on a new initiative. You could think of them as a "design partner". They had a narrow use case: they wanted to track what models were built and then selectively deploy those models in batch. This enabled us to focus narrowly on two parts: saving their models, and owning a batch job operator that they could insert into their offline workflows for model prediction. Constraining it to a team which had a deadline gave us some clear constraints with which to deliver incrementally: first the API to save models, and then the job to orchestrate batch predictions. Because we had a vision to support other teams with these capabilities, we knew not to over-index on engineering towards this one team. By working closely with them we ensured we got adoption early, which provided valuable feedback on our intended APIs and batch prediction functionality. In turn, they got a partner that supported and heard their concerns, and was aligned to ensure that they were successful.

As an astute reader, you might be thinking this just sounds like agile project management applied to building a platform. My answer is yes, you're basically right, but many a platform engineer likely hasn't had this sort of framing or mentorship to see the connection, especially in a world where product managers would do this type of thing for you.

Lesson 2: Your users are not all equal

As engineers we love building for possibilities. It's very easy for us to want to ensure that anyone can do anything with the platforms that we provide. Why is that? Well, I'm stereotyping here, but we generally want to be egalitarian and treat every user that we're building for equally, in terms of providing support and functionality.

That is a mistake.

Two facts:

- Users you build for will fall on a spectrum (a bell curve, if you will) of abilities. There will be average users, as well as outlier users. Outlier users are your most sophisticated users.
- Features you add to the platform do not contribute equally to development and maintenance costs.

In my experience, the outlier users want your platform to support more complex capabilities/needs, because they want you to support their more sophisticated desires. This generally means higher development and maintenance costs for you to implement such a feature. So you really have to ask yourself, should I:

(1) design for this feature at all?
(2) actually spend time building and maintaining it?
Or (3), push back and tell that user they should build it themselves?

You might be thinking that what I'm talking about is simply a case of over-engineering. While, yes, this does have that flavor, over-engineering has more to do with what the solution is, versus actually deciding whether you should support some functionality in the platform or not. Using our house-building metaphor: should you build in some sophisticated custom home automation system because someone wants voice-activated lights, or should you just tell the user to figure out how to provide that feature themselves?

Unless you're looking to build a completely new platform and searching for a customer, or there are otherwise compelling business reasons to do so, you as a platform builder should learn to say no (in a nice way, of course). In my experience, more often than not, these features end up being related to speculative efforts.
I found it is better to wait it out and ensure that the effort proves to be valuable first, before determining whether it should be supported. Remember, these asks come from sophisticated end users, so they very likely can get by supporting it themselves. Note: if you take this strategy, it can feed into the "adopt existing user tooling" strategy from Lesson 1.

Lesson 3: Abstract away the internals of your system

Over time, less and less infrastructure/tooling is being built within organizations, as the maturity of technology providers in whatever domain you're in has grown. Invariably you, as a platform builder, will integrate with some third-party vendor, e.g. AWS, GCP, an MLOps vendor, etc. It's very tempting, especially if the vendor solves the exact problem you want to solve, to straight up expose their API to the users you're building the platform for, since it's a quick way to deliver some value.

Exposing APIs like this to an end user is a great recipe for:

- Vendor lock-in.
- Painful migrations.

Why? You have just given up your ability to control the API of your users.

Instead, provide your own version of that API. This should take the form of a lightweight wrapper that encapsulates the vendor API. Now, it's easy to do this poorly and couple your API with the underlying API, e.g. by using the same verbiage, same data structures, etc. Your design goal should be to ensure your API does not leak what you're using underneath. That way, you retain the ability to change the vendor without forcing users to migrate, because you retain the degrees of freedom you need to do it without requiring users to change their code. This is also a good way to simplify the experience of using a vendor API, as you can lower your users' cognitive burden by making common decisions on their behalf, e.g. how things are named, structured, or stored.

For example, we integrated an observability vendor into our systems at Stitch Fix.
Exposing their python client API directly would have meant that if we ever wanted to change or migrate away, it would be difficult to do so. Instead, we wrapped their API in our own client library, being sure to use in-house nomenclature and API data structures. That way we could easily swap this vendor out if we wanted to in the future.

Note: this isn't an unreasonable approach to take with your sister platform teams either, if you use their APIs. Some rhetorical questions to think about: do you want to control your own destiny, or be coupled to their goals and system design?

Lesson 4: Live your users' life cycle

If you operate with product managers, then they should ostensibly know and be aware of your users' life cycle, to help guide you as you build your platform. As we had no product managers at Stitch Fix, we were forced to do this ourselves, hence this lesson. Now, even if you do have product managers, my guess is that they will still appreciate you taking on a bit of this burden.

The capabilities and experiences that you provide for your end users have downstream effects over time. While it can be easy to gloss over the intricacies of your users' workflows, especially if they stretch past your platform, doing so will inevitably result in tenant and community issues (to use our housing metaphor).

Tenant issues are generally small problems, like when simultaneous faucet usage reduces everyone's water pressure. These problems only require some small tweaks to fix/mitigate. E.g. you made it super easy to launch parameterized jobs, and people clog up your cluster with work, in addition to your cloud expenses jumping up. What's the quick fix here? Perhaps you ensure jobs are always tagged with a user and an SLA, so you can quickly identify who is using all your cloud resources, and use that information to decide where to route tasks based on priority. Or just identify who you need to talk to in order to kill their jobs.

"Community issues" are bigger problems.
For example, say you've built an awesome house (platform) that can support many tenants (users), but street parking around it is minimal; you didn't account for this. Anytime someone (i.e. a potential user) wants to visit the house, they struggle to park their car and have to walk a long way. If not fixed quickly, these issues can really hurt your platform.

To illustrate this point: say you focused on making one aspect of a user's workflow really easy with your platform, but you neglected how it fits into their bigger picture. For example, you might have just increased the total amount of work someone needs to do to get something to production, because their development work isn't directly translatable to your production platform system. In that case, your platform solution that was initially met with enthusiasm turns into dread, because there is a particular sticking point that your end users hit time and again. A smoking gun indicating that this is happening is when end users come up with their own tooling to get around the problem.

So what should you do? Walk in the shoes of the end user, and take a macro view of how what you are providing fits into the work they need to get done. Here are a few approaches to mitigate problems:

- Be an end user: actually use your platform and get things to production on it.
- Model the hypothetical: draw the flow chart of your users' workflows and then think about the ramifications of whatever platform feature you're providing (works for every situation).
- Bring in an end user: bring a user on for an internal rotation — they should be able to understand and explain this to you and your team (bring someone in to help be a better voice for your users).
- Build relationships: build deep enough trust and relationships with your peers such that you can ask blunt questions like "what do you hate about your workflow?", or "if there was something you wouldn't have to do in getting X to production, what would it be?".
Sometimes your users are just anchored and resigned to the fact that they can't change the world, when in fact they can, by giving you feedback. Other times they don't feel safe enough to give you the feedback you really need, so you'll have to build trust for that to happen.

If you do the above for long enough, you can start to intuit what's going to happen more easily, and thus determine what extra features you might need, or what potential issues to anticipate and plan for.

Lesson 5: The two-layer API trick

In this lesson, I put forward my high-level framing of how I think when I set out to build a platform. This is essentially the playbook I came up with to help deliver successful platforms at Stitch Fix. I concede it might not always be possible to follow this approach, due to tight requirements or the nature of your platform, but as you build higher-level abstractions you should be able to apply this way of thinking. As you go through this lesson, you'll hopefully see connections with the prior four lessons. But first, some motivation.

Motivation

(1) Remember the sophisticated user of your platform who asks for that complex feature? Since you said "no, go build it yourself", they will likely go ahead and do so. But if they're successful, you're going to want to inherit their code, right? Wouldn't you want to make that a simpler process if you could?

(2) It is easy to write very coupled, non-generalizable code when providing platform capabilities, i.e. code that's hard to break apart and extend/reuse. This isn't a bad thing if you're getting started and need to get something out there, but it becomes a problem when you want to extend your platform. In my experience, especially if you don't have time for "tech debt" projects, it's easy for such coupled code to snowball and thus significantly impact your team's delivery of work.

(3) In Lesson 3, the focus is on not leaking vendor API details. I think that's a good approach, in effect you create two layers of APIs, but it's quite focused on the micro problem of vendor API encapsulation. How can we extend that thinking further and provide ourselves with some framing for our entire platform?

Two layers of APIs

To help with maintaining and growing a platform, you should think about building two layers of APIs:

- A bottom layer that allows one to build "anything", but in a bounded way.
- A second, higher-level layer that provides a less cognitively taxing, opinionated way to do something.

Using our house-building analogy, the lower layer represents the house's foundation, plumbing, and electrical; it bounds the shape and surface area of the house. The higher-level API corresponds to what a room is: its features and layout. E.g., for your users you have placed the refrigerator, stove, and sink to form a kitchen triangle, because for anyone doing cooking that's a pretty good setup. Then, if someone wants something more complex in their room, we've made it easy to take the walls off and get access to the plumbing and electrical so they can rearrange it how they want instead.

Let's expand on these two layers more concretely.

Two-layer API in relation to the housing metaphor. Image by author.

What is this bottom API layer?

The purpose of this "low-level API" is that you can express anything you want your platform to do, i.e. this API captures base-level primitives. That is, this is your base capability layer; using it means you can control all the minutiae. The goal of this layer is not to expose it to your end users per se. Instead, the goal is to make you define for yourself a clear foundation (pun intended, with our house-building metaphor) to build off of. Therefore you should consider yourself the primary target of this layer.
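To make the two layers concrete, here is a minimal sketch of what they might look like for the data-writing and model-saving examples discussed in this lesson. Every function name and convention below is a hypothetical illustration, not the actual Stitch Fix API:

```python
import json
import pickle
from pathlib import Path

# --- Lower layer: bounded, base-level primitives. The caller controls
# --- all the minutiae: paths, formats, naming.

def write_bytes(path: Path, payload: bytes) -> None:
    """Write raw bytes, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)

def write_json(path: Path, obj: dict) -> None:
    """Write a dict as JSON, built on the bytes primitive."""
    write_bytes(path, json.dumps(obj).encode("utf-8"))

# --- Higher layer: an opinionated convention built solely on top of
# --- the layer below. Average users only ever call save_model().

def save_model(model, name: str, version: str, root: Path = Path("models")) -> Path:
    """Save a model under a fixed layout (<root>/<name>/<version>/) so end
    users don't have to decide on file names, locations, or formats."""
    base = root / name / version
    write_bytes(base / "model.pkl", pickle.dumps(model))
    write_json(base / "meta.json", {"name": name, "version": version, "format": "pickle"})
    return base
```

Average users call `save_model` and never think about layout; a sophisticated user who needs, say, a different serialization format can drop down to `write_bytes` directly, without the platform having to support that case explicitly.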
For example, this layer could have APIs for reading and writing data in various formats, where using it requires making decisions about file names, locations, which function to use for which format, etc.

What is this second API layer?

The purpose of this "higher-level API" is to provide a simple experience for your average user, built solely on top of your lower-level API. You are essentially baking a convention into this API to simplify the user's platform experience, since you have made some lower-level API decisions for them. For example, building off the example for the lower layer, this layer could expose simple APIs for saving machine learning model objects. This is a simpler API because you've already made decisions on file name conventions, location, format, etc. for saving that model, so your platform end user doesn't have to.

The goal of this layer is to be the main API interface for your platform end users. Ideally they can get everything they need done with it. But if they need to do something more complex, and it doesn't exist in this higher-level API, they can drop down to the lower-level API you have provided and build what they need themselves.

Why two layers should work

By forcing yourself to think about two layers, you:

- Make it harder for you and your team to couple concerns together, since, by design, you're forcing yourself to determine how a more opinionated capability (higher-level API) on your platform decomposes into base-level primitives (lower-level API).
- Can more easily bound how the platform takes shape, because you define your base foundation layer. This helps provide support for the more sophisticated users, who can peel back the opinionated layer and do more complex things without you having to explicitly support that. By enabling more complex users in this way, you have time to think about whether you should support their more complex use case in a first-class manner (see how this can feed into the "adopt existing user tooling" part of Lesson 1).

Now, some of you might balk at the idea of supporting two APIs, as it sounds like a whole lot of work on API development, maintenance, and versioning. To that I say: yes, but you're largely going to be paying that cost anyway if you're following good documentation and API versioning practices. Whether an API is internal or external to your team shouldn't really change much, except how and where you communicate. If you take the alternative approach of building a single API layer, your initial costs might be lower, but the future maintenance and development costs are going to be much higher; you should expect that your platform needs to change over time, e.g. security-related updates, major library versions, new features, etc. My argument here is that it'll be easier to do so with two API layers than with a single one.

Two brief examples

To help crystallize this, let's look at two examples of this two-layer API thinking in action.

Example 1

When we introduced our configuration-based approach to training models, it was built on top of our model envelope approach to capturing models and enabling deployment. So if someone didn't want to use our configuration approach to creating a model, they could still make use of the model envelope benefits by dropping down to use that API.

Example 2

At Stitch Fix we made it easy to build FastAPI web services, but users did not actually have to know or care that they were using FastAPI.
That's because they were using a higher-level opinionated API that enabled them to just focus on writing python functions, which would then be turned into web service endpoints running on a web server; they didn't need to configure the FastAPI web service by writing that code themselves, because it was already taken care of for them. This functionality was built on top of FastAPI as the base foundational layer. Should a user want more functionality than the upper opinionated layer could provide, they were able to invoke the lower-level FastAPI API directly instead.

Summary

Thanks for reading! In case you've been skimming, here's what I want you to take home. To build platforms:

- Build for a particular vertical/use case first and deliver incremental value, where either you inherit something that works, or you target a specific team that will adopt your work as soon as it's ready.
- Don't build for every user equally. Let sophisticated users fend for themselves until it's proven that you should invest your time in them.
- Don't leak underlying vendor/implementation details if you can avoid it. Provide your own thin wrapper around underlying APIs to ensure you have more options under your control when you have to make platform changes.
- Live your users' life cycles. Remember that you provide for and shape the experience of users on your platform, so don't forget the macro context and the implications of your UX; drink your own champagne/eat your own dog food so that you can foresee and understand the resonating impacts of what you provide.
- Think about providing two layers of APIs to keep your platform development nimble:
  (i) a bounded foundational API layer: what are the base-level primitives/capabilities you want your platform to provide, and thus what's a good base for you to build on top of?
  (ii) an opinionated higher-level API layer: this layer should be much simpler to use for the average user than your lower foundational API layer. To handle more complex cases, it should still be possible for more advanced users to drop down to your bounded lower-level foundational API.

If you disagree, or have questions or comments, I'd love to hear them below.

To close

I am thrilled to share the insights I've garnered from my time at Stitch Fix with you (hopefully they've been useful!). Since I left, however, I have not just been editing this blog post. I've been scheming about building a platform myself. Stay tuned!

Also, special thanks to Elijah, Chip, and Indy, who gave valuable feedback on a few drafts of this post; errors and omissions are all mine.

Before you go: links you might be interested in

📣 Follow me on: LinkedIn, Twitter

⭐ Check out GitHub — stitchfix/hamilton: a scalable general-purpose micro-framework for defining dataflows. You can use it to build dataframes, numpy matrices, python objects, ML models, etc.

🤓 Read some blogs:

- Configuration Driven Machine Learning Pipelines | Stitch Fix Technology — Multithreaded
- Deployment for Free — A Machine Learning Platform for Stitch Fix's Data Scientists
- Functions and DAGs: Hamilton, a General Purpose Micro-Framework for Pandas Dataframe Generation
- Aggressively Helpful Platform Teams | Stitch Fix Technology — Multithreaded (not authored by me)

📝 Sign up for my upcoming model deployment & inference class (TBD when exactly).

What I learned building platforms at Stitch Fix was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

  • Three Must Haves for Machine Learning Monitoring
    by Yotam Oren on 30 September 2022 at 16:30

    A guide for data scientists evaluating solutions

Monitoring is critical to the success of machine learning models deployed in production systems. Because ML models are not static pieces of code but, rather, dynamic predictors that depend on data, hyperparameters, evaluation metrics, and many other variables, it is vital to have insight into the training, validation, deployment, and inference processes in order to prevent model drift, predictive stasis, and a host of additional issues. However, not all monitoring solutions are created equal. In this post, I highlight three must-haves for machine learning monitoring, which will hopefully serve you well whether you are deciding to build or buy a solution.

Source: Mona

Complete Process Visibility

First, models have to be evaluated in the context of the business function they aim to serve. That function is often realized multiple steps downstream from the model. Outside of the lab, many AI-driven applications involve multiple models working in tandem. Furthermore, the behavior of the models will likely depend on data transformations that happen multiple steps upstream. Thus, a monitoring solution that focuses on single-model behavior will not capture the holistic picture of model performance as it relates to the global business context, and will fail to uncover many of the issues that begin or end outside of the model. A proper assessment of ML model viability only comes from complete process visibility — having insight into the entire dataflow, metadata, context, and overarching business processes on which the modeling is predicated.

For example, as part of a credit approval application, a bank may deploy a suite of models which assess creditworthiness, screen for potential fraud, and dynamically allocate trending offers and promos. A simple monitoring system might be able to evaluate any one of these models individually, but solving the overall business problem demands an understanding of the interplay between them.
While they may have divergent modeling goals, each of these models rests upon a shared foundation of training data, context, and business metadata. Thus, an effective monitoring solution will take all of these disparate pieces into account and generate unified insights which harness this shared information. These might include identifying niche and underutilized customer segments in the training data distribution, flagging potential instances of concept and data drift, understanding the aggregate model impact on business KPIs, and more.

The best monitoring solutions are able to extend to all process stages, including the ones which do not involve a model component.

Source: Mona

Automatic, Granular Insights

A common misconception is that a monitoring solution should simply enable visualization and troubleshooting of the common metrics associated with an ML model in production. While this is helpful, visualization and troubleshooting imply that you are already in "investigation mode". Worse yet, you might be "fire-fighting" after the business complained that a KPI dropped (and asked "what's wrong with the model?").

So, how about being more proactive? How about detecting issues weeks or even longer before overall performance declines?

You should expect your monitoring solution to automatically detect problems while they are still small, in granular segments of data, allowing you ample time to take corrective or preemptive action. The meaning of "automatically" deserves some further elaboration here. Some monitoring tools will provide dashboards that allow you to manually investigate subsegments of data to see what's performing well and what's not.
However, this sort of facile introspection requires painstaking manual intervention and misses the greater point: a true monitoring solution should be able to intrinsically detect anomalies via its own mechanisms, without relying on an individual to supply a hypothesis of their own.

The more granular you get, the more you should pay attention to noise reduction. It's expected that single anomalies will propagate issues in multiple places. Monitoring truly succeeds by detecting the root causes of issues, not just by flagging surface-level data discrepancies and the like.

Source: Mona

Total Configurability

Different ML systems have different data and flows, different business cycles, different success indicators, and different types of models. You should seriously doubt "plug-and-play" monitoring solutions.

A complete ML monitoring solution has to be configurable for any problem and across all of its components. It should be able to take in any model metric, any unstructured log, and, really, any piece of tabular data, and make it easy to:

- Construct and continuously update a single performance database
- Create and customize dynamic visualizations and reports
- Set up and tweak automatic, granular insights and notifications

One simple example of the need for configurability lies in the contrast between systems in which you can gain (near) real-time feedback on model fidelity (e.g., consumer recommendation systems) and ones in which the feedback loop requires human intervention and more time (e.g., fraud detection, credit scoring, and more).

Most enterprise ML teams are working on a variety of ML projects addressing very different business problems. Consequently, monitoring requirements are broad and require nuance and flexibility to accommodate the differences. If you are on one of those teams, you may have established strong data science standards and a unified stack for data preparation, model development, and deployment. Now, will you be able to monitor and govern your systems with unified standards and a single solution? You should definitely expect to do so.

Conclusion

Given the ever-increasing hype around machine learning, there exist many solutions which will take an ML model and provide superficial insights into its feature behavior, output distributions, and basic performance metrics. However, solutions which exhibit complete process visibility, proactive and intelligent insights, and total configurability are much, much rarer. Yet it is these three attributes which are key to squeezing the highest performance and downstream business impact out of ML models. Therefore, it's crucial to evaluate any monitoring solution through the lens of these three must-haves and ensure that it provides not only model visibility but a more global and complete understanding of the business context.

Originally published at https://www.monalabs.io.

Three Must Haves for Machine Learning Monitoring was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.