Databricks launches first open source instruction-following LLM Dolly 2.0

Ali Ghodsi, co-founder and Chief Executive Officer at Databricks

Databricks, the data lakehouse and AI company, has announced the launch of Dolly 2.0, the first open source, instruction-following LLM fine-tuned on a human-generated instruction dataset licensed for commercial use. Dolly 2.0 is a 12B-parameter language model based on the EleutherAI Pythia model family and fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset crowdsourced among Databricks employees.

What does Databricks’ Dolly 2.0 bring to the market?

The release comes only two weeks after the launch of Dolly, an LLM that was trained in under 30 minutes for less than $30 to exhibit ChatGPT-like human interactivity. To create the dataset on which Dolly 2.0 is trained, Databricks incentivised over 5,000 of its employees by gamifying the process. As a result, it collected more than 15,000 records within a week.

As part of the company’s continuing commitment to open source, Databricks is also releasing the dataset (databricks-dolly-15k) that Dolly 2.0 was trained on. It contains 15,000 records generated by thousands of Databricks employees and, to the best of the company’s knowledge, is the first open source, human-generated instruction dataset specifically designed to make large language models exhibit the magical interactivity of ChatGPT.

These training records are natural, expressive and designed to represent a wide range of the behaviours outlined in the original InstructGPT paper, from brainstorming and content generation to information extraction and summarisation. The fact that Dolly 2.0 is trained exclusively on databricks-dolly-15k is further evidence that the level of effort and expense necessary to build powerful AI tech is orders of magnitude less than previously imagined.
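
To make the idea of an instruction-following record more concrete, here is a minimal Python sketch of how one such record might be represented and turned into a training prompt. The field names (instruction, context, response, category), the "### Instruction:" template and the sample record are all assumptions made up for illustration. They are not taken from this article and are not a real entry from databricks-dolly-15k.

```python
from dataclasses import dataclass


@dataclass
class InstructionRecord:
    """One human-written instruction/response pair (field names are assumed)."""
    instruction: str   # what the annotator asks the model to do
    response: str      # the human-written answer
    category: str      # behaviour type, e.g. "summarization" (assumed label)
    context: str = ""  # optional reference text the instruction refers to


def format_prompt(record: InstructionRecord) -> str:
    """Render a record as one training string (hypothetical template)."""
    parts = ["### Instruction:", record.instruction]
    if record.context:
        # Only include the input section when the record carries reference text.
        parts += ["", "### Input:", record.context]
    parts += ["", "### Response:", record.response]
    return "\n".join(parts)


# Invented sample record, for illustration only.
example = InstructionRecord(
    instruction="Summarise the passage in one sentence.",
    context="Dolly 2.0 is a 12B parameter language model fine-tuned on "
            "a human-generated instruction dataset.",
    response="Dolly 2.0 is a 12B model tuned on human-written instructions.",
    category="summarization",
)

if __name__ == "__main__":
    print(format_prompt(example))
```

Records without reference text, such as a brainstorming request, would simply skip the input section under this template.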

What does Dolly 2.0 launch mean for Databricks?

“With Dolly 2.0, any firm can create, own, and customise a powerful LLM that understands how to talk to people. Dolly 2.0 is available for commercial use without paying for API access or sharing data with third parties,” said Ali Ghodsi, co-founder and CEO at Databricks.

“Dolly 2.0 is the next step in Databricks’ mission to help every organisation harness the power of large language models. It is also a response to customer feedback which has stressed the importance of companies owning their models, and being able to manage tradeoffs in terms of model quality, cost and behaviour,” Ghodsi added.

For additional technical details, please visit the official dataset repository on Hugging Face.