1/18/2024

Tabular analysis

An adventure in unleashing large language models (LLMs) on tabular Kaggle competitions

This blog is co-authored by Aparna Dhinakaran, CPO and Co-Founder of Arize AI, and Christopher Brown, CEO and Founder of Decision Patterns.

✨ Want to try it yourself? Follow along with this blog's accompanying Colab.

There are two distinct groups in the ML ecosystem. One works with highly organized data collected in tables: the tabular-focused data scientist. The other works on deep learning applications, including vision, audio, and large language models (LLMs). For the purposes of this piece, we call the former the "tabular" or "traditional" group and the latter the "LLM" group. Each group uses its own techniques and models that have, in large part, developed separately.

With the recent successes of large language models such as OpenAI's GPT-4, we wanted to see if we could use modern LLMs to help make predictions on tabular datasets. To demonstrate the efficacy of the approach, we submitted results to several blind Kaggle competitions (including the popular "House Prices – Advanced Regression Techniques" competition). The typical Kaggle competition supplies tabular data and is dominated by traditional ML approaches. However, we found that with little background knowledge and none of the data cleaning or feature development required by traditional methods, LLMs were able to return results with real predictive power. LLM predictions were not competitive with the leading models produced by lengthy and extensive tabular methods, but they were strong enough to place well above the median score on the leaderboard rankings. We expect this to be the beginning of a number of techniques that use LLMs on tabular data, and we would not be surprised to see the use of LLMs on tabular data widen and compete favorably with more traditional model development processes.

Included in this write-up is the first approach we have seen that merges traditional tabular datasets and XGBoost models with LLMs using latent structure embeddings, allowing the tabular approaches to work off of the numerical "features" produced internally by the LLM (a minimal sketch of this idea appears at the end of this section). To date, we haven't seen an LLM used this way and hope this is the beginning of something exciting!

Challenges of Applying Deep Learning to Tabular Data

The typical machine learning application involves cleaning a narrow set of data (typically collected, held, or acquired by an organization) and training a model on it. At a high level, the process can be thought of as developing a "context" of which only one specific type of question can be asked. When that type of question arises, the ML model produces one or more predictions. Further improving models comes from three areas: adding more data, improving methods, or acquiring more and different features. The last is often the most interesting, as the data scientist is always asking herself, "What different data can I get to make my predictions better?" Partitioning, boosting, and bagging models have been developed for tabular data and do exceedingly well in this domain. Despite much effort, deep learning has not proven as effective in this area.
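The embeddings-plus-XGBoost merge mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' actual pipeline: the embedding model ("text-embedding-ada-002"), the file names, and the "SalePrice" target column are assumptions standing in for whatever the post's Colab uses, and a production version would batch the embedding calls and tune the booster.

```python
# A minimal sketch of the idea: serialize each tabular row to text, let an
# LLM embedding model turn it into a dense numeric vector, and train a stock
# XGBoost model on those vectors with no manual cleaning or feature work.
# The embedding model, column names, and CSV paths below are hypothetical.

import numpy as np
import pandas as pd
import xgboost as xgb
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def row_to_text(row: pd.Series) -> str:
    """Serialize one tabular row into a plain-text description the LLM can read."""
    return "; ".join(f"{col}: {val}" for col, val in row.items())


def embed_rows(df: pd.DataFrame) -> np.ndarray:
    """Fetch one embedding vector per row; these become the model's features."""
    texts = [row_to_text(row) for _, row in df.iterrows()]
    response = client.embeddings.create(
        model="text-embedding-ada-002", input=texts
    )
    return np.array([item.embedding for item in response.data])


# Hypothetical train/test frames with a "SalePrice" regression target,
# mirroring the House Prices competition mentioned above.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train = embed_rows(train.drop(columns=["SalePrice"]))
y_train = train["SalePrice"].values
X_test = embed_rows(test)

# Gradient-boosted trees trained directly on the LLM's latent features.
model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

The design point is that the LLM supplies the dense numerical representation of each row, so the gradient-boosted trees never touch the raw, uncleaned columns; the tabular model simply works off of the features the LLM produced internally.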