Shades of Open Source - Understanding The Many Meanings of "Open"

17 Jun 2024

Open source has evolved from a few pioneering transparent projects into the backbone of modern development across the industry. As a result, many projects now use the term "open source" to convey a positive impression. However, with a wide range of development practices and open-source licenses, the meaning of "open source" can vary significantly.

In this article, I aim to explore the true value of openness and identify what is and isn't genuinely open. Additionally, I will discuss the different levels of openness that projects may adopt, helping you navigate the diverse landscape of open-source projects more effectively.

The Value of Open Source

The value of open source manifests in various ways. One significant advantage is transparency: you can read and understand the code you are running, which matters most when that code processes sensitive data. Open source code also enables you to make repairs or enhancements to the software you use in your business or project.

However, for projects aspiring to become foundational standards for others to build upon, users seek more than just transparency; they seek certainty. This includes assurance that the project will not undergo sudden changes that could disrupt everything built on top of it, and that it will continue to be actively developed and maintained for the foreseeable future.

In this context, the approach to open source becomes crucial.

In Apache We Trust

It's this kind of certainty that underscores the vital role of the Apache Software Foundation (ASF). Many first encountered Apache through its pioneering project, the open-source Apache HTTP Server, a web server that remains ubiquitous in web operations today. The ASF was initially created to hold the intellectual property and assets of the Apache project, and it has since evolved into a cornerstone for open-source projects worldwide.

The ASF enforces strict standards for diverse contributions, independence, and activity in its projects, ensuring they can withstand the test of time as standards in software development.

Many open-source projects strive to become Apache projects to gain the community credibility necessary for adoption as standard software building blocks, such as Apache Tomcat for Java web applications, Apache Arrow for in-memory data representation, and Apache Parquet for data file formatting, among others.

Other organizations, like the Linux Foundation, also host and guide open-source projects, independently managing assets and providing oversight. However, they often do not adhere to the same rigorous independence standards as the ASF. This is a significant reason why the Apache brand has become the gold standard for the independence of open-source projects. In essence, one could say, "In Apache we trust."

Why Does Independence Matter?

In reality, independence isn't always crucial. Many open-source standards in web development, like React, are not Apache projects and are heavily directed by their creators, such as Meta. However, a web framework like React isn't responsible for the interoperability of web applications.

Instead, long-standing standards like REST and HTTP serve as the glue that connects web applications across various backend languages, frontend frameworks, and more.

In the realm of data, standards are still emerging. Some notable standards are Apache Arrow and Apache Arrow Flight for data representation in memory and data transfer, and Apache Parquet for how datasets are persisted on the file system for analytics. As datasets grow larger, there is a need for standards on how datasets spanning multiple files are represented (table formats) and how these datasets are tracked, governed, and discovered by different tools (metadata catalogs).

In the world of table formats, there are three competing standards: Apache Iceberg, Apache Hudi, and Delta Lake, with two out of the three being Apache projects (and there is also Apache XTable for interoperability between these and future formats). For catalogs, options include Nessie, Gravitino, Polaris, and Unity Catalog, all of which are open source but not yet Apache projects.

When a particular standard significantly impacts how businesses must build their enterprises to interoperate with the broader ecosystem, there is greater pressure for independence. This is because the lack of assured independence can pose potential risks to ecosystem partners.

The Pros and Cons of Vendor Dependence

Many popular open-source projects are beloved and closely tied to particular vendors. For example, web frameworks like React and Angular are associated with Meta and Google, respectively. Database software like MongoDB, Elasticsearch, and Redis is also tied to specific commercial entities but is widely used and praised for its functionality. When there is a clear driver of a project, it can offer some benefits:

  • Agility in development: With more top-down direction, new features can be delivered quickly.

  • Financial support: Projects that are central to a commercial entity's business often receive substantial financial backing for their development.

However, there are clear risks when the underlying project is intended to be a standard that many commercial enterprises need to build and stake their business on:

  • Rapid changes: A project steered by one entity can make large changes quickly, but these changes can be disruptive, creating intense migration challenges for users and businesses dependent on it. For example, Angular 2 was a complete rewrite of its predecessor, AngularJS, forcing businesses that had built on AngularJS to essentially rewrite their applications.

  • Narrow feedback: A project driven by one entity may receive a lot of feedback from its customers but may factor in less input from the broader ecosystem. This can lead to new features that favor the main driver, which can be problematic if the project is supposed to be the foundation for an entire ecosystem of tool interoperability.

Independence isn't the be-all and end-all for open source projects, but the more a project represents a standard format whose value lies in its ecosystem, the more independence should matter.

Blurred Lines Made Less Blurry

Beyond unexpected changes, licensing shifts, and an uneven playing field for the ecosystem, there are other practices to be cautious of under the guise of being open. One strategy used to avoid some traditional licensing conflicts is to offer two versions of a project: an open-source version and a proprietary version controlled by a commercial entity. The proprietary version often receives new or exclusive features first.

This practice, in itself, isn't inherently bad. Many businesses maintain commercial proprietary forks of open-source projects, but usually, the commercial version has a different name than the open-source project. For example, in the world of data catalogs, Dremio is the main developer of Nessie, and Snowflake drives Polaris.

Both aim to become community-driven projects over time but will also drive integrated features in their respective commercial products under different names. For instance, if you set up your own Nessie catalog, it has a distinct name compared to the Dremio Enterprise Catalog (formerly Arctic) integrated into Dremio Cloud.

The Dremio Enterprise Catalog is powered by Nessie but has additional features, so the different names prevent confusion about available features or which documentation to reference.

In contrast, Databricks maintains internal forks of Spark, Delta Lake, and Unity Catalog, using the same names for both the open-source versions and the features specific to the Databricks platform. While they do provide separate documentation, online discussions often reflect confusion about how to use features in the open-source versions that only exist on the Databricks platform.

This creates a "muddying of the waters" between what is open and what is proprietary. This isn't an issue if you are a Databricks user, but it can be quite confusing for those who want to use these tools outside of the Databricks ecosystem.

Closed Does Not Mean Bad

To clarify, the fact that a project does not adhere to the highest standards of openness or is even proprietary does not diminish the quality of the project's code, the skills of its developers, or the value it can provide to its users. However, openness can serve as a signal of certainty, fostering ecosystems for standards that benefit from a growing network effect.

Independent actors within these ecosystems feel more comfortable building upon such projects, which is particularly important for standards that affect how systems communicate with each other.