# Enterprise Generative AI Cost Factors

**Source:** [https://www.youtube.com/watch?v=7gMg98Hf3uM](https://www.youtube.com/watch?v=7gMg98Hf3uM)
**Duration:** 00:19:19

## Key Points
- Enterprise generative AI costs go far beyond a simple chatbot subscription, requiring careful evaluation of data security, compliance, and production‑grade platforms.
- Seven major cost drivers must be considered when scaling LLMs: the specific use case, model size, pre‑training from scratch, inference compute, fine‑tuning, hosting infrastructure, and deployment model (cloud SaaS vs. on‑prem).
- The choice of use case dramatically influences required compute and pricing, so companies should treat AI purchases like vehicle selections—matching features to needs rather than expecting a one‑size‑fits‑all quote.
- Conducting a pilot with a capable vendor lets enterprises identify pain points, test multiple models, and determine the most effective solution in terms of efficacy, speed, and cost.
- While consumer tools like ChatGPT are cheap and convenient for personal tasks, enterprise deployments demand robust, secure, and customizable solutions that carry significantly higher but justifiable expenses.
## Sections

- [00:00:00](https://www.youtube.com/watch?v=7gMg98Hf3uM&t=0s) **Enterprise Generative AI Cost Factors** - The speaker explains that beyond cheap consumer subscriptions, enterprises must assess use cases, model size, data security, and vendor partnerships to grasp the true total cost of deploying large language models.

## Full Transcript
Today we're going to talk about the true cost of generative AI for the enterprise, specifically focusing on large language models. It's important for an enterprise to consider all of the costs beyond simply subscribing to a chatbot LLM like ChatGPT.

I'd like to start with a quick story. Last week I was at a wedding and no one could find the best man. We were at the rehearsal dinner, and all of a sudden he came out of the back room with his laptop: he'd been in there writing his best man speech using ChatGPT. And you know what? It came out fantastic. For a consumer use like writing a last-minute speech or a funny poem, a consumer chatbot is awesome, and that's why for the consumer use case, spending under $25 a month for access feels like a great deal, because it is.
But here we're comparing enterprise use versus consumer use. For the enterprise, we have to safeguard what we're putting into production, because we're dealing with sensitive, confidential, and proprietary data. It's essential for the business to evaluate working with a platform partner or vendor that is geared toward the enterprise, and this comes with very different cost factors than the consumer side. Today we're going to touch on seven of these important cost factors that influence how to scale generative AI across the enterprise:

1. **Use case** - what is it that you actually want to do with generative AI?
2. **Model size** - what type and parameter size of model are you leveraging?
3. **Pre-training** - are you looking to build an LLM from scratch?
4. **Inferencing** - the cost of generating a response using the LLM.
5. **Tuning** - the cost of adapting the pre-trained model to do new tasks.
6. **Hosting** - the cost of deploying and maintaining that model.
7. **Deployment** - are you going to deploy in the cloud, as SaaS, or on premises?

These are all areas to consider, and the first we're going to cover is use case. I can't tell you how often sellers and customers come to me and ask me to create a blanket statement for what generative AI is going to cost their enterprise, and what I say to that is that this is very similar to walking into a car dealership and asking how much a vehicle is going to be. Different use cases will require different methods and are going to drive different amounts of compute. We need to understand: are you looking for a convertible, a truck, off-roading capability, a leather interior? We need some specifics.

My recommendation here is to work with a partner or vendor that lets you participate in a pilot, so that you can identify all of your pain points and first see whether generative AI even makes sense as a solution. This will give you the opportunity to really work out what you need to test and evaluate for your enterprise: play around with different models and see what delivers the best efficacy at the lowest cost and the fastest speed. If you have access to an entire workbench of models and numerous tuning methods, you won't be locked in; you can do what's right for your enterprise and truly customize. Now we'll move on to our second cost driver, which is evaluating model size.
Now, when we talk about model size, the size and complexity of the generative AI model can really impact pricing. The larger the model is, the more parameters it has, and that's going to drive compute and other resources, so what we'll find is that vendors offer different pricing tiers based on model size. We can look at some examples: we have our smallest model here, FLAN, at 11 billion parameters; we have Granite, a middle-tier size, at 13 billion; and then we have the largest of them all, Llama 2, at 70 billion parameters. What's important to know is that different models are going to serve different use cases: some are better for language translation, others are better for Q&A, and as you move across different models you have the opportunity to assess what's going to best suit your use case.

Something to look out for when assessing a vendor or partner is their stance on model access: specifically, are they locking you into one model for every use case, or do you have the option to select what works best for you? Another thing to assess is whether they're continuously innovating on their own proprietary models. We've found that innovation at the model level can actually provide you with some key advantages when it comes to domain-specific task generation and experimenting with different parameter sizes.
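To make the tiered-pricing idea concrete, here is a minimal sketch in Python. The model names mirror the examples above, but the per-token rates are hypothetical placeholders, not any vendor's actual prices.

```python
# Hypothetical per-token pricing tiers by model size; these rates are
# illustrative placeholders, not real vendor prices.
PRICE_PER_1K_TOKENS = {
    "flan-11b": 0.0006,     # smallest tier
    "granite-13b": 0.0018,  # middle tier
    "llama-2-70b": 0.0100,  # largest tier
}

def tier_cost(model: str, tokens: int) -> float:
    """Cost of processing `tokens` tokens on the given model tier."""
    return tokens / 1000 * PRICE_PER_1K_TOKENS[model]

# The same workload costs roughly 17x more on the largest tier here.
for model in PRICE_PER_1K_TOKENS:
    print(f"{model}: ${tier_cost(model, 1_000_000):.2f} per million tokens")
```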
Now we're going to go to our third cost driver, which is pre-training: the process of building and training a foundation model from scratch. This has been cost prohibitive for a lot of enterprises, because it requires a tremendous amount of compute, time, and effort. While it does give enterprises control over the data used to train the LLM, it comes at a price. We can look at something everyone's familiar with, GPT-3: training it took over a thousand GPUs running over a 30-day period, and that particular 30-day period cost over $4.6 million. So we can see that this is very, very expensive, and it's really why we only see a few key players that have emerged in the marketplace to take on the challenge of pre-training LLMs from scratch.
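As a back-of-envelope sanity check on those figures, the arithmetic below reconstructs a cost in that ballpark. The hourly GPU rate is an assumed placeholder chosen for illustration; real cloud rates vary widely by provider and hardware.

```python
# Back-of-envelope pre-training cost using the GPT-3 figures quoted above.
# The per-GPU-hour rate is an assumption for illustration only.
gpus = 1_000
days = 30
assumed_rate_per_gpu_hour = 6.40  # USD; hypothetical cloud rate

gpu_hours = gpus * days * 24                              # 720,000 GPU-hours
cost = gpu_hours * assumed_rate_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours -> ~${cost / 1e6:.1f}M")  # ~$4.6M
```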
If you're not going to pre-train, you can certainly leverage and take advantage of an LLM that's already been pre-trained. So now we'll move on to the next cost factor of working with a pre-trained LLM, which is inferencing.
When we talk about inferencing, we're talking about the process the model uses to understand the prompt and the process it uses to generate a response. Essentially, this is how the model figures out what it is that you want and then uses its own knowledge to create the answer. Inferencing operates on a discrete unit of information that we call a token. This is a common industry term, where one token roughly equates to 3/4 of a word. You can expand that out: 100 tokens would equate to roughly 75 words, and if you're still trying to benchmark that, the entire works of Shakespeare would come out to roughly 900,000 words. Now, the size of a token can vary depending on which tokenizer you use (the tokenizer is the tool that actually converts text to tokens), but 3/4 of a word is a rough rule of thumb to go by. When we talk about the cost of a single inference, it's important to note that it includes the number of tokens in both the prompt and the completion, which is the output, so it covers both of those.
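Here's a small sketch of that rule of thumb in Python. The 3/4-word-per-token ratio comes straight from the transcript; the word counts in the example call are made up for illustration.

```python
# Rule of thumb from above: 1 token ~ 3/4 of a word.
def words_to_tokens(words: int) -> int:
    return round(words / 0.75)

def billed_tokens(prompt_words: int, completion_words: int) -> int:
    # A single inference is billed on prompt AND completion tokens.
    return words_to_tokens(prompt_words) + words_to_tokens(completion_words)

print(words_to_tokens(75))      # ~100 tokens
print(billed_tokens(150, 300))  # ~600 tokens billed for one call
```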
There's one other term that's important when we're covering inferencing, and that is prompt engineering. Prompt engineering is how we interact with the prompt itself: it's an industry term for the methodology used to craft effective prompts, with the ultimate goal of eliciting a desired response from the LLM. What's important to note here is that it does not touch the parameters of the model itself. It's more like choosing the right words and formatting how you ask the question to help the model better understand, and it's a really cost-effective way to achieve tailored results without extensive model alteration. This is different from tuning, because it does not require high compute resources or any of the hosting.
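As a quick illustration of what "choosing the right words and formatting" can mean in practice, here is a hedged example; both prompts and the email placeholder are invented for this sketch.

```python
# Prompt engineering sketch: same task, reworded for a clearer response.
# No model parameters change; only the input text does.
vague_prompt = "summarize this email"

engineered_prompt = (
    "You are a support analyst. Summarize the customer email below in "
    "exactly three bullet points of under 15 words each, then suggest "
    "one concrete next action.\n\n"
    "Email:\n{email_text}"  # placeholder filled in at request time
)
```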
Now we're moving on to cost factor number five, which is tuning. When we talk about tuning, we're talking about the process of adjusting the internal settings, or parameters, of the model itself to improve performance. Tuning is measured in hours, and you'll often see different hourly rates depending on which model size you're using; we talked earlier about how different parameter sizes lead to different cost increases.
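A minimal sketch of that hours-times-rate structure, assuming made-up hourly rates per model tier:

```python
# Hypothetical hourly tuning rates by model tier; placeholder numbers only.
TUNING_RATE_PER_HOUR = {"small": 2.50, "medium": 6.00, "large": 20.00}

def tuning_cost(tier: str, hours: float) -> float:
    """Tuning is billed by the hour, with the rate set by model size."""
    return TUNING_RATE_PER_HOUR[tier] * hours

print(f"${tuning_cost('large', 40):.2f}")  # 40 tuning hours on a large model
```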
Now, when you're making the decision to tune, there are typically two reasons why you would choose to do so. Reason number one may be to achieve better performance from your base model: you'll evaluate whether tuning the model on a large amount of labeled data could enhance performance, and when we do this we see that you can actually optimize the model by bringing in your own data. The other reason may be to evaluate whether you can lower the cost at scale by deploying a smaller model than the one you initially used. One thing that's really important to keep in mind here is that the cost of labeled data acquisition is a significant factor. When we talk about tuning, there are two main methodologies: fine-tuning, and parameter-efficient fine-tuning, otherwise known as PEFT.
To cover the difference between these two: when we talk about fine-tuning, we're talking about extensive adaptation of the model itself. You're tuning all of the parameters and changing them, so you're generating a forked version of the base model that you then have to go on and host, and it requires hundreds of thousands of labeled data points for you to bring in. This is ideal for highly specialized tasks where performance is critical. Parameter-efficient fine-tuning, by contrast, aims to achieve task-specific performance without the high costs associated with extensive fine-tuning, and it achieves this by avoiding changes to the model itself. Here you can think of it as tuning smaller models by adding additional parameters rather than altering what already exists, and it needs more on the order of hundreds to thousands of labeled data examples. There are different cost-effective ways to apply parameter-efficient fine-tuning; some of the types you may have heard of include prefix tuning, prompt tuning, p-tuning, and LoRA, but these are all methods of parameter-efficient fine-tuning.
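As one concrete example of a PEFT method, here is a minimal LoRA sketch using the Hugging Face peft library, assuming peft and transformers are installed; the base model and the target modules chosen here are illustrative assumptions, not a recommendation.

```python
# Minimal LoRA (a PEFT method) sketch with Hugging Face `peft`.
# Model choice and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # rank of the added low-rank matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
# Only the small added adapter weights train; the base model stays frozen.
model.print_trainable_parameters()
```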
Let's drive this home with an analogy. Say you buy a home and move in. Congratulations, everything's perfect. But after a couple of months you discover that it actually snows quite a bit more in your area than you thought it would, and it gets a lot colder. Initially you had windows that served you quite well in the summer, but now that it's winter they no longer work; it's really drafty. So you decide to completely change the structure of your windows: you put in new windows that provide insulation, and you even get some really nice curtains to go along with them. On top of that, it snows so much more that you now have to go buy a ton of new equipment: a snow plow, a snow shovel, snow boots, snow tires, and you actually build yourself a garage to store all of it. This is an example of fine-tuning, in the sense that you are making structural changes to the architecture of your house, just as fine-tuning makes structural changes to the model's underlying parameters.

If we look at what this would mean for PEFT: here, perhaps you're not going to do anything to change the underlying structure. Maybe instead of rebuilding your windows, you put a towel under them to block the draft. Instead of building a garage to house all of your new snow equipment, maybe you reuse something you already have and use your broom to scrape the snow away. It's helpful to consider that different methods make sense for different use cases. Perhaps in some, as we mentioned earlier, you can get everything you need in your output from prompt engineering alone. As you need to make your models tuned for more specific use cases, it's helpful to have different methods of doing so, and to work with a partner or vendor that gives you the ability to explore different parameter-efficient fine-tuning methods, as well as fine-tuning for when you really need it, because then you have the advantage of selecting the most cost-effective method for your needs.
Now we're going to move on to cost factor six: when do you need to host a model? There are different circumstances that would require you to actually host a model in order to go back and interact with it in the ways we discussed before, such as inferencing. If you're going to make an LLM available for your use, there are two ways to go about it: hosting it, or using an inference API, each of which becomes relevant depending on whether or not you're going to fine-tune a model.

If you're not fine-tuning a model, and you're using some of those earlier methods we discussed, like parameter-efficient fine-tuning or prompt engineering, this is when you can use an API for inferencing. Essentially, this means you stay consistent with the cost factors we described earlier, with the token as the unit driving price, cost, and compute, and the LLM is pre-deployed for you as you need it. As I mentioned, the cost of API inference is based on the number of tokens processed in the prompt plus the completion of that prompt via the API call. Again, this is used when you're not making changes to the underlying model: you're not fine-tuning, and you're not bringing in something new that isn't already hosted by your platform.

Hosting becomes relevant when you are fine-tuning or bringing your own model. At that point you're required to host that model, because you're essentially creating a forked version or bringing something new in, and in this case the LLM is made available for deployment by the platform provider taking it in. You could think of it as, rather than phoning a friend (because they don't have access to a phone), you build them a room in your house, and that's where they stay for easy access whenever you need to chat with them. When you're hosting a new forked version of the model, on top of what you pay in tokens for prompting, you also have to factor in the hours required to host the model. You pay for hours based on the amount of time you want to interact with the model, so if this is something you need to interact with all the time, you'd be paying the hourly cost for 24x7 access.

Again, it's very important to realize that different use cases and circumstances are going to require different methods of connecting to your model. It's helpful to have a vendor or partner that allows you to interact with your model in numerous ways: if you do need to host it with fine-tuning, to have that option, but if you can use API inference, to have the ability to access it that way as well.
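To make the two cost structures comparable, here's a rough sketch with invented prices; both the per-token rate and the hourly hosting rate are assumptions, and the break-even point will differ for every vendor and workload.

```python
# Sketch: per-token API inference vs. hourly hosting, with assumed prices.
api_price_per_1k_tokens = 0.002  # USD; hypothetical
hosting_price_per_hour = 6.00    # USD; hypothetical 24x7 dedicated instance

monthly_tokens = 50_000_000      # prompt + completion tokens per month

api_cost = monthly_tokens / 1000 * api_price_per_1k_tokens
hosting_cost = hosting_price_per_hour * 24 * 30  # always-on for a month

print(f"API: ${api_cost:,.0f}/mo vs hosting: ${hosting_cost:,.0f}/mo")
# With these numbers, hosting only wins at much higher token volumes.
```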
Now we've made it to our seventh cost factor, which is deployment. When we think about deployment, every industry has different standards and every business has different needs, so we're referring to where you're putting the platform, and the cost of using generative AI can vary significantly depending on whether you choose SaaS or an on-premises deployment.

When we talk about SaaS, there are some benefits from a cost standpoint. First of all, you're paying a subscription fee, which is often a predictable and managed cost structure: you pay a recurring fee to access the AI service. You also take a different approach to infrastructure: you're not going out and procuring your own hardware and data centers, so you avoid that aspect of the cost. When it comes to scalability, you have the ability to increase or decrease usage as needed, and all of the maintenance and updates to the infrastructure are included. The big thing here is that when it comes to generative AI, a lot of people are concerned about acquiring GPUs. With SaaS you don't need to go out and procure your own GPUs; the SaaS providers share those GPU resources across multiple users, so it ends up being more cost effective for the end user.

On the other end, we have on-premises. For some industries there are regulations that require you to do things on premises and don't allow you to host your data in the cloud, and there are solutions out there for on-premises deployments as well. What's important to note here, though, is that you are required to purchase and maintain those GPUs, the amount of which will be contingent on the amount of compute required based on your inferencing, tuning, and model selection. But you do have the benefit of full control over the architecture and how your data is deployed: there are no black boxes. So again, depending on your use case and industry, you might have different needs, but the recommendation here is to find a partner or vendor that can meet you where you're at and provide you with the opportunity to leverage generative AI, whether in the cloud or on premises.
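For the on-premises path, a rough sizing sketch like the one below can help frame the GPU purchase; every number here is an assumed placeholder, since real throughput depends on the model, hardware, and workload.

```python
import math

# Rough on-prem GPU sizing sketch; all figures are illustrative assumptions.
tokens_per_second_per_gpu = 1_500  # assumed throughput for the chosen model
peak_tokens_per_second = 12_000    # assumed peak inference demand

gpus_needed = math.ceil(peak_tokens_per_second / tokens_per_second_per_gpu)
print(gpus_needed)  # 8 GPUs at peak, before redundancy or tuning workloads
```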
Thank you for watching. If you liked this video and want to see more like it, please like and subscribe, and if you have any questions, please drop them in the comments below.