3. How much time should be invested in data preparation before training an algorithm?
“Data preparation is the most crucial and time-consuming step, and often the most underestimated. On average, data scientists dedicate 80% of their time to this task. The process demands time and patience, but it is fundamental to training reliable models. We’re talking about artificial ‘intelligence’, of course, but we still need something substantial and accurate to fuel it. Many factors can lead to erroneous results: missing or incomplete data, uncorrelated batches, typos, and so on. The data must therefore be cleaned, that is, prepared and filtered to ensure its accuracy, relevance, and format consistency.
To create a reliable algorithm, we need a data pipeline: a computer architecture that organizes and transfers data from all the sources of information a company uses (databases, applications, Excel spreadsheets, etc.) so that it can be exploited downstream. This pipeline must therefore be robust and consistent, with as little manual input as possible. To err is human, which is why automating data entry is the approach to prioritize. Ideally, companies should no longer rely on sprawling Excel files to store and manage their databases, since these are neither a reliable source of information nor a stable format.”
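To make the cleaning step described above more concrete, here is a minimal sketch of typical preparation on a tabular dataset using pandas. The file name, column names, and cleaning rules are hypothetical placeholders, not taken from the interview, and would differ for every project.

```python
import pandas as pd

# Hypothetical example: clean a raw export before using it to train a model.
# The file name and column names are placeholders.
df = pd.read_csv("raw_sales_export.csv")

# Drop exact duplicate rows, which often appear when data is entered manually.
df = df.drop_duplicates()

# Enforce consistent formats: dates as datetimes, amounts as numbers.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Normalize free-text fields to reduce the impact of typos and casing differences.
df["region"] = df["region"].str.strip().str.lower()

# Handle missing or incomplete records: drop rows missing essential fields,
# and flag the rest so their reliability can be traced later.
df = df.dropna(subset=["order_date", "amount"])
df["needs_review"] = df["region"].isna()

df.to_csv("clean_sales_export.csv", index=False)
```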
4. How much data do artificial intelligence projects require?
“It’s a recurring question, and unfortunately it has no definitive answer. Of course, AI doesn’t invent anything: it only looks into the past to find correlations, propose solutions, and predict results. Because every business and every project varies so much, the amount of data required will not always be the same. It’s always better to have too much than not enough, though, and it’s never too late to start storing and classifying your company’s data. Quality, however, is paramount: if you have plenty of data but cannot demonstrate its traceability and reliability, you may end up with inconclusive results.
The beauty of AI is that it can always be trained and optimized with newer and better data. The earlier you start your process, the more your model will be able to evolve over time and offer interesting results.
Conversely, some artificial intelligence projects do not require a ton of historical data. In computer vision, for example, we can rapidly generate information with cameras. In these types of projects, the images produced must be labelled and categorized, which is a whole different kind of preparatory work, since it requires human intervention.”
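As an illustration of that labelling step, here is a small hypothetical sketch of one common way image annotations are recorded: a simple CSV mapping each captured image to the category a human annotator assigned it. The file names and categories are invented for the example.

```python
import csv

# Hypothetical labelling output: each captured image is assigned a category
# by a human annotator. File names and classes are placeholders.
labels = [
    ("frame_0001.jpg", "defect"),
    ("frame_0002.jpg", "ok"),
    ("frame_0003.jpg", "defect"),
]

with open("labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "label"])
    writer.writerows(labels)
```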