BigCode/HuggingFace/ServiceNow: StarCoder

This is the transparency report for BigCode/HuggingFace/ServiceNow for the StarCoder model. To see their responses for each indicator, click through the various domains and subdomains. For further information, visit the website for the May 2024 Foundation Model Transparency Index.

Data size (Score: 1)

For the data used in building the model, is the data size disclosed?

Disclosure: The training dataset contains 783 GB of code in 86 programming languages, including 54 GB of GitHub issues, 13 GB of Jupyter notebooks (as scripts and text-code pairs), and 32 GB of GitHub commits; this amounts to approximately 250 billion tokens.
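
As a rough consistency check (ours, not part of the disclosure), the disclosed byte and token counts imply roughly 3 bytes per token, which is plausible for tokenized source code:

```python
# Consistency check on the disclosed figures: 783 GB of code vs. ~250B tokens.
bytes_total = 783e9     # 783 GB, as disclosed
tokens = 250e9          # ~250 billion tokens, as disclosed
bytes_per_token = bytes_total / tokens
print(f"{bytes_per_token:.2f} bytes per token")  # 3.13 bytes per token
```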

Note: Data size should be reported in appropriate units (e.g. bytes, words, tokens, images, frames) and broken down by modality. Data size should be reported to a precision of one significant figure (e.g. 4 trillion tokens, 200 thousand images). No form of decomposition into data phases is required.

References: https://huggingface.co/datasets/bigcode/starcoderdata#dataset-description

Justification: 783 GB (86 programming languages)

New disclosure? No

Data sources (Score: 1)

For all data used in building the model, are the data sources disclosed?

Disclosure: The data used to train this model is a subset of The Stack v1.2, a collection of public GitHub repositories (see section 3 of [1]). Each individual repository can be accessed at https://github.com/ followed by the repository name pulled from the column max_stars_repo_name (see the dataset viewer of [2]). Annotation on data quality was done by 18 BigCode community members (see [3]).
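
The mapping from metadata column to repository URL can be sketched as follows (helper name is ours; it assumes the column holds "owner/name" strings as shown in the dataset viewer):

```python
def repo_url(max_stars_repo_name: str) -> str:
    """Build the public GitHub URL for a row of the dataset, given the
    "owner/name" string from the max_stars_repo_name column."""
    return f"https://github.com/{max_stars_repo_name}"

print(repo_url("bigcode-project/bigcode-dataset"))
# https://github.com/bigcode-project/bigcode-dataset
```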

Note: To receive this point, a meaningful decomposition of sources must be listed in an understandable way (e.g. named URLs/domains/databases/data providers). It does not suffice to say data is “sourced from the Internet" or comes from "licensed sources”.

References: [1] https://arxiv.org/pdf/2305.06161.pdf [2] https://huggingface.co/datasets/bigcode/starcoderdata

Justification: Data source is The Stack v1.2, which itself is derived from public GitHub repositories.

New disclosure? No

Data creators (Score: 1)

For all data used in building the model, is there some characterization of the people who created the data?

Disclosure: The data creators are GitHub users that created unique repository names between January 1st, 2015, and March 31st, 2022 (see [1]). It is possible to infer the username or organization from the left part of the repository name in the column max_stars_repo_name (see the dataset viewer of [4]). It contains code in 86 programming languages (see tables 1 & 2 of the StarCoder paper for the detailed breakdown [2]). The distribution of natural languages in the extracted docstrings and comments is not available but can be computed with a fastText model, the same methodology used on The Stack (see [3] for more details). Data quality was manually assessed by 18 BigCode community members (see Visual Assessment in section 3.1 of [2]). Their names can be found in this report [5], but they are not part of the data used to train this model. PII annotation was done by 1,399 crowdworkers from Toloka (see section 4.1 of [2]) and their countries are listed in Table B.1 of [2] (note that the distribution is not available).
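
The natural-language distribution mentioned above would be computed over extracted docstrings and comments; a minimal sketch of the extraction step for Python files (helper name is ours; the fastText classification itself is omitted):

```python
import ast

def extract_docstrings(source: str) -> list[str]:
    """Collect module, class, and function docstrings from Python source;
    these are the natural-language spans a fastText language-identification
    model would then classify."""
    tree = ast.parse(source)
    kinds = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    docs = []
    for node in ast.walk(tree):
        if isinstance(node, kinds):
            doc = ast.get_docstring(node)
            if doc:
                docs.append(doc)
    return docs
```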

Note: While information about data creators may not be easily discernible for some data scraped from the web, the general sources (URLs/domains) should be listed, and, for other data that is bought, licensed, or collected, a reasonable attempt at characterizing the underlying people who provided the data is required to receive this point. The relevant properties of people can vary depending on context: for example, relevant properties could include demographic information like fraction of Black individuals contributing to the dataset, geographic information like fraction of European individuals contributing to the dataset, language information like fraction of L1 English speakers, or occupational information like the fraction of professional artists.

References: [1] https://huggingface.co/datasets/bigcode/the-stack#who-are-the-source-language-producers [2] https://arxiv.org/pdf/2305.06161.pdf [3] https://github.com/bigcode-project/bigcode-analysis/tree/main/data_analysis/python_data_analysis/nl_language_identification [4] https://huggingface.co/datasets/bigcode/starcoderdata [5] https://docs.google.com/spreadsheets/d/1Lk-pTk_rXI__fCgixr7ZWSi8wR09Zzd2j_G90J80r00/edit?usp=sharing

Justification: While information about data creators is not provided directly, extensive metadata information is provided.

New disclosure? No

Data source selection (Score: 1)

Are the selection protocols for including and excluding data sources disclosed?

Disclosure: The data used to train this model is a subset of The Stack v1.2, a collection of public and permissively licensed GitHub repositories. The data collection process is the same as for v1.0 (see section 3 of [1] for details), with the extended list of licenses listed in Appendix B of [1], but without the copyleft licenses MPL, EPL, and LGPL; opt-out requests submitted by 09.02.2023 were also excluded (see [2]). The programming languages were selected based on popularity; configuration languages and no-longer-supported languages were excluded (see tables 1 & 2 of [3] for the full list and section 3.1 of [3] for details).
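
The exclusion step can be sketched as a simple license filter (our sketch; the concrete SPDX identifiers below are illustrative, and the authoritative lists are Appendix B of [1] and the changelog in [2]):

```python
# Copyleft license families excluded from The Stack v1.2 per the disclosure;
# these SPDX identifiers are illustrative examples, not the full list.
EXCLUDED_COPYLEFT = {"MPL-2.0", "EPL-1.0", "EPL-2.0", "LGPL-2.1", "LGPL-3.0"}

def keep_repository(detected_licenses: list[str]) -> bool:
    """Keep a repository only if it has at least one detected license
    and none of them falls in the excluded copyleft set."""
    return bool(detected_licenses) and not (
        set(detected_licenses) & EXCLUDED_COPYLEFT
    )

print(keep_repository(["MIT"]))        # True
print(keep_repository(["LGPL-3.0"]))   # False
```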

Note: Selection protocols refer to procedures used to choose which datasets or subsets of datasets will be used to build a model. We will award this point even if the selection protocols are non-exhaustive.

References: [1] https://arxiv.org/pdf/2211.15533.pdf [2] https://huggingface.co/datasets/bigcode/the-stack#changelog [3] https://arxiv.org/pdf/2305.06161.pdf

Justification: Clear source selection methodology via Stack v1.2.

New disclosure? No

Data curation (Score: 1)

For all data sources, are the curation protocols for those data sources disclosed?

Disclosure: The data cleaning protocol, combining heuristic filtering and manual inspection, is described in section 3 of [1]. It includes excluding configuration languages and programming languages that are no longer supported, heuristic filtering of Git issues and commits (see Table 4 of [1]), and near-deduplication. Additionally, some language-specific filters were applied (see section 3.1 of [1] for details). Some malicious code was also excluded (see section 10.2 of [1]).
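
Near-deduplication is typically implemented with MinHash over document shingles; the underlying quantity being thresholded is a Jaccard similarity, sketched here exactly (a toy version of ours; the actual pipeline uses an efficient approximation):

```python
def jaccard_similarity(a: str, b: str, k: int = 5) -> float:
    """Exact Jaccard similarity over character k-grams. Near-duplicate
    pairs are those whose similarity exceeds a chosen threshold."""
    grams_a = {a[i:i + k] for i in range(max(len(a) - k + 1, 1))}
    grams_b = {b[i:i + k] for i in range(max(len(b) - k + 1, 1))}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0

print(jaccard_similarity("def add(a, b): return a + b",
                         "def add(x, y): return x + y"))
```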

Note: Curation protocols refer to steps taken to further modify data sources, such as procedures to manage, annotate, and organize data. The aims of curation might include improving the quality, relevance, and representativeness of the data. We will award this point if the developer reports that it does not perform any further curation beyond the data sources.

References: [1] https://arxiv.org/pdf/2305.06161.pdf

Justification: Not disclosed

New disclosure? No

Data augmentation (Score: 1)

Are any steps the developer takes to augment its data sources disclosed?

Disclosure: No steps taken to augment the training data (see section 3 of [1]).

Note: Such steps might include augmenting data sources with synthetic data. We will award this point if the developer reports that it does not take any steps to augment its data.

References: [1] https://arxiv.org/pdf/2305.06161.pdf

Justification: No data augmentation.

New disclosure? No

Harmful data filtration (Score: 1)

If data is filtered to remove harmful content, is there a description of the associated filter?

Disclosure: Some malicious code detected by ClamAV [2] on the Hugging Face platform was removed (see section 10.2 of [1]).

Note: Such harmful content might relate to violence or child sexual abuse material. We will award this point if the developer reports that it does not perform any harmful data filtration.

References: [1] https://arxiv.org/pdf/2305.06161.pdf [2] https://huggingface.co/docs/hub/en/security-malware

Justification: Not disclosed

New disclosure? No

Copyrighted data (Score: 1)

For all data used in building the model, is the associated copyright status disclosed?

Disclosure: The license of every repository in the Stack v1.2 is specified in the column max_stars_repo_licenses (see dataset viewer of [1]).

Note: To receive this point, the copyright status (e.g. copyrighted, public domain) must relate to some decomposition of the data. We will award this point if there is some meaningful decomposition of the data, even if the decomposition is insufficient to receive the Data Creators point or if the disclosure is not comprehensive relative to legal copyright standards.

References: [1] https://huggingface.co/datasets/bigcode/the-stack

Justification: All code in The Stack dataset used to train the model is automatically copyrighted under US copyright law. The licenses that address permissions to use each data point can be found at https://huggingface.co/datasets/bigcode/the-stack-dedup/blob/main/licenses.json

New disclosure? No

Data license (Score: 1)

For all data used in building the model, is the associated license status disclosed?

Disclosure: The license of every repository in the Stack v1.2 is specified in the column max_stars_repo_licenses (see dataset viewer of [1]).

Note: To receive this point, the license status must relate to some decomposition of the data. We will award this point if there is some meaningful decomposition of the data, even if the decomposition is insufficient to receive the Data Creators point.

References: [1] https://huggingface.co/datasets/bigcode/the-stack

Justification: Not disclosed

New disclosure? No

Personal information in data (Score: 1)

For all data used in building the model, is the inclusion or exclusion of personal information in that data disclosed?

Disclosure: Section 4 of [1] describes the efforts to redact PII from the training data. We leveraged the StarPII model to detect names, usernames, emails, IP addresses, keys, passwords, and IDs (see Table 7 of [1] for the distribution). The performance of this model is reported in [2]. To mask IP addresses, we randomly selected an IP address from 5 synthetic, private, non-internet-facing IP addresses of the same type that can be found in Appendix C of [1]. Other detected PII was replaced by the appropriate special token: <NAME>, <EMAIL>, <KEY>, <PASSWORD>. Usernames in GitHub issues were anonymized by replacing them with a participant counter within the conversation, e.g. username_1 to refer to the second participant. Data quality was manually assessed by 18 BigCode community members (see Visual Assessment in section 3.1 of [1]). Their names can be found in this report [3], but they are not part of the data used to train this model.
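
The username-anonymization step can be sketched as follows (helper name is ours; counters start at 0 so that username_1 denotes the second participant, matching the example in the disclosure):

```python
import re

def anonymize_usernames(text: str, participants: list[str]) -> str:
    """Replace GitHub handles with per-conversation participant counters,
    in order of first participation (username_0, username_1, ...)."""
    aliases = {}
    for handle in participants:
        aliases.setdefault(handle, f"username_{len(aliases)}")
    for handle, alias in aliases.items():
        text = re.sub(rf"@?\b{re.escape(handle)}\b", alias, text)
    return text

print(anonymize_usernames("Thanks @alice! Agreed, says bob.",
                          ["alice", "bob"]))
# Thanks username_0! Agreed, says username_1.
```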

Note: To receive this point, the disclosure of personal information must relate to some decomposition of the data. We will award this point if there is some meaningful decomposition of the data, even if the decomposition is insufficient to receive the Data Creators point. Additionally, we will award this point if the developer reports the inclusion of personal information, independent of if and how they mitigate related privacy concerns.

References: [1] https://arxiv.org/pdf/2305.06161.pdf [2] https://huggingface.co/bigcode/starpii [3] https://docs.google.com/spreadsheets/d/1Lk-pTk_rXI__fCgixr7ZWSi8wR09Zzd2j_G90J80r00/edit?usp=sharing

Justification: Not disclosed

New disclosure? No

Use of human labor (Score: 1)

Are the phases of the data pipeline where human labor is involved disclosed?

Disclosure: Workers were involved in data annotation, data cleaning of PII, and further data cleaning. For the last type of data cleaning, it is mentioned that annotators from the community took part in the activity (this appears to have been a volunteer effort). For the first two activities, more information is given: Hugging Face funded the data annotation services from Toloka, with a total outlay of $39,000 paid to crowd workers. We utilized the Toloka platform to engage 1,399 crowd-workers from 35 countries in annotating a dataset for PII in source code. On average, participants completed 206 tasks, earned about $27, and worked 3.1 hours.

Note: Phases of the data pipeline that involve human labor include activities and tasks performed by people to collect, annotate, clean, or validate data. This indicator is inclusive of all data that is created by or on behalf of the developer. We will award this point if the developer gives a reasonable best-effort description of the use of human labor in their data pipeline.

References: https://arxiv.org/pdf/2312.03872.pdf; https://arxiv.org/pdf/2305.06161.pdf

Justification: Not disclosed

New disclosure? No

Employment of data laborers (Score: 1)

Is the organization that directly employs the people involved in data labor disclosed for each phase of the data pipeline?

Disclosure: Toloka is mentioned in each publication. More information about the community volunteers (18 community annotators) is also provided.

Note: Phases of the data pipeline that involve human labor include activities and tasks performed by people to collect, annotate, clean, or validate data. This indicator is inclusive of all data that is created by or on behalf of the developer. We will award this point if the developer provides the name of the organization that employs data laborers, even if other details about the employment relationship are not disclosed.

References: https://arxiv.org/pdf/2312.03872.pdf; https://arxiv.org/pdf/2305.06161.pdf

Justification: Not disclosed

New disclosure? No

Geographic distribution of data laborers (Score: 1)

Is geographic information regarding the people involved in data labor disclosed for each phase of the data pipeline?

Disclosure: It is reported that the crowd workers come from 35 countries; the precise list of countries can be found in Appendix B, Table B.1 of the paper. There is, however, no geographic information about the community annotators.

Note: This indicator is inclusive of all data that is created by or on behalf of the developer. We will award this point if the developer gives a reasonable best-effort description of the geographic distribution of labor at the country-level.

References: https://arxiv.org/pdf/2305.06161.pdf

Justification: Not disclosed

New disclosure? No

Wages (Score: 1)

Are the wages for people who perform data labor disclosed?

Disclosure: Section 4.1 of the paper states: "On average, participants completed 206 tasks, earned about $27, and worked 3.1 hours. Our goal was to identify PII in various forms, such as names, usernames, emails, IP addresses, keys, passwords, and IDs. To ensure that crowd-workers received fair compensation, we established an hourly pay rate of $7.30, taking into consideration different minimum wage rates across countries and their corresponding purchasing power. We limited annotation eligibility to countries where the hourly pay rate of $7.30 was equivalent to the highest minimum wage in the US ($16.50) in terms of purchasing power parity."
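
As a quick consistency check (ours), the per-worker average lines up with the total outlay reported under "Use of human labor" and "Labor protections":

```python
workers = 1399          # crowd-workers engaged via Toloka
avg_earned = 27.0       # average earnings per participant, in USD
total = workers * avg_earned
print(f"${total:,.0f}")  # $37,773 — consistent with the ~$39,000 total outlay
```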

Note: This indicator is inclusive of data labor at all points of the model development process, such as training data annotation or red teaming data used to control the model, and of all data that is created by or on behalf of the developer. We will award this point if the developer reports that it does not compensate workers.

References: https://arxiv.org/pdf/2305.06161.pdf

Justification: Not disclosed

New disclosure? No

Instructions for creating data (Score: 1)

Are the instructions given to people who perform data labor disclosed?

Disclosure: Information about the tasks conducted by the 18 community annotators is given in sec. 3.1 (visual inspection): "We instructed the annotators to go through 50–100 files and confirm if the data appeared to be normal code written by humans, as opposed to text, data, or a single long line of autogenerated code. We also asked annotators to determine whether we should use our default alpha-numeric filter (which requires over 25% alpha-numeric symbols) and long-line filter (which requires lines to be less than 1,000 characters) for a given file extension." Instructions for the PII annotators are available in the GitHub repository https://github.com/bigcode-project/bigcode-dataset/
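
The two default filters the annotators were asked to evaluate can be sketched directly from the thresholds quoted above (helper name is ours):

```python
def passes_default_filters(text: str) -> bool:
    """Apply the default alpha-numeric filter (over 25% alpha-numeric
    symbols) and long-line filter (every line under 1,000 characters)
    described in the visual-inspection instructions."""
    if not text:
        return False
    alnum_fraction = sum(ch.isalnum() for ch in text) / len(text)
    if alnum_fraction <= 0.25:
        return False
    return all(len(line) < 1000 for line in text.splitlines())

print(passes_default_filters("x = 1\ny = 2\n"))    # True
print(passes_default_filters("-- ~~ == ** !!\n"))  # False
```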

Note: This indicator is inclusive of all data that is created by or on behalf of the developer. We will award this point if the developer makes a reasonable best-effort attempt to disclose instructions given to people who create data used to build the model for the bulk of the data phases involving human labor.

References: https://arxiv.org/pdf/2305.06161.pdf https://github.com/bigcode-project/bigcode-dataset/blob/main/pii/PII%20detection%20guidelines%20(1)%20(4).pdf

Justification: Not disclosed

New disclosure? No

Labor protections (Score: 1)

Are the labor protections for people who perform data labor disclosed?

Disclosure: In order to support our efforts to mitigate the risk that the model may leak private information, we selected 12,000 samples of code from The Stack and annotated them to detect PII using crowd-sourcing. Hugging Face funded the data annotation services from Toloka, with a total outlay of $39,000 paid to crowd workers. The resulting dataset was used to train a PII detection model that we used to detect and then mask PII (Names, Emails, IP addresses, Keys, Passwords) from our StarCoder training dataset. In doing so, we aimed to balance the constraints of costs (fair compensation), time (the timing and time to complete the work was on the critical path for the project), and quality (to ensure that PII Detection Model training was not impacted). While traditional data annotation services using salaried employees were considered, we decided to work with crowd-workers through Toloka after reviewing several service providers and their compensation practices - and finding that most would not provide sufficient transparency and guarantees about worker compensation. We selected pay and eligible countries of crowd-workers to ensure that 1. the absolute hourly wage was always higher than the US federal minimum wage ($7.30), and 2. the hourly wage was equivalent to the highest state minimum wage in the US in terms of purchasing power parity ($16.50 at the time of writing). We engaged 1,399 crowd-workers across 35 countries in annotating a diverse dataset for PII in source code.

Note: This indicator is inclusive of data labor at all points of the model development process, such as training data annotation or red teaming data used to control the model. It is also inclusive of all data that is created by or on behalf of the developer. As an example, labor protections might include protocols to reduce the harm to workers' mental health stemming from exposure to violent content when annotating training data. We will award this point if the developer reports that it does not protect workers or if it does not use data laborers and therefore has no labor protections.

References: https://arxiv.org/pdf/2312.03872.pdf

Justification: Not disclosed

New disclosure? No

Third party partners (Score: 1)

Are the third parties who were or are involved in the development of the model disclosed?

Disclosure: The full list of collaborators is given, with their organization affiliations, as the author list of the paper "StarCoder: May the source be with you!". Additionally, sec. 1.2.1 mentions two organizations, names of researchers, and the involvement of the broader scientific community (without specifying who in the community): "The idea for the BigCode Project came about in Utrecht during a discussion initiated by Harm de Vries (ServiceNow Research) with Thomas Wolf (Hugging Face). Inspired by the BigScience Project, Harm recognized the shared vision of ServiceNow and Hugging Face to responsibly develop open and responsible large language models for code, and approached Thomas to explore the idea of a jointly led open-scientific collaboration with the global machine learning and open source communities. As it turns out, the visions were indeed aligned, and work got started to initiate the project. A research collaboration agreement between ServiceNow and Hugging Face created the enabling framework for the project, and set out the terms for rallying the broader scientific community at large to work towards developing, training, exploring, and releasing large foundation models for code." Sec. 1.2.2 provides more statistics about the participants.

Note: This indicator is inclusive of partnerships that go beyond data labor as there may be third party partners at various stages in the model development process. We will award this point if the developer reports that it was the sole entity involved in the development of the model.

References: https://arxiv.org/pdf/2305.06161.pdf https://arxiv.org/pdf/2312.03872.pdf

Justification: Not disclosed

New disclosure? No

Queryable external data access (Score: 1)

Are external entities provided with queryable access to the data used to build the model?

Disclosure: We provide a search index, data portraits, and metadata-based membership checking, as well as direct access to the dataset.
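
Metadata-based membership checking of this kind boils down to hashing a normalized snippet and testing it against a published index; a toy sketch (ours, not the actual "Am I in The Stack" implementation):

```python
import hashlib

def normalized_hash(snippet: str) -> str:
    """Hash a whitespace-normalized code snippet for membership lookup."""
    normalized = " ".join(snippet.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def in_corpus(snippet: str, index: set[str]) -> bool:
    """Membership test against a precomputed set of snippet hashes."""
    return normalized_hash(snippet) in index

index = {normalized_hash("def f():\n    return 1")}
print(in_corpus("def f(): return 1", index))  # True (same tokens)
```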

Note: We will award this point for any reasonable mechanism for providing access: direct access to the data, an interface to query the data, a developer-mediated access program where developers can inspect requests, etc. Developers may receive this point even if there are rate-limits on the number of queries permitted to an external entity and restrictions on which external entities are given access, insofar as these limits and restrictions are transparent and ensure a reasonable amount of external access. We may accept justifications for prohibiting queries of specific parts of the data.

References: https://huggingface.co/datasets/bigcode/starcoderdata https://stack.dataportraits.org/ https://huggingface.co/spaces/bigcode/in-the-stack

Justification: Not disclosed

New disclosure? No

Direct external data access (Score: 1)

Are external entities provided with direct access to the data used to build the model?

Disclosure: StarCoderData is the pre-processed data used for training.

Note: We will award this point if external entities can directly access the data without any form of gating from the developer. With that said, we may award this point if the developer provides justifications for prohibiting access to specific parts of the data or to unauthorized external entities.

References: https://huggingface.co/spaces/bigcode/in-the-stack

Justification: Not disclosed

New disclosure? No

Compute usage (Score: 1)

Is the compute required for building the model disclosed?

Disclosure: 8.46 × 10^22 FLOPs
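
Applying the standard ~6·N·D training-compute estimate with N = 15.5B parameters (from the paper), the disclosed figure implies roughly 0.9 trillion training tokens (our back-of-envelope check, not from the report):

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Back-of-envelope training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

# 15.5B parameters; ~0.91T tokens inferred from the disclosed total.
flops = approx_training_flops(15.5e9, 0.91e12)
print(f"{flops:.2e}")  # 8.46e+22, matching the disclosed value
```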

Note: Compute should be reported in appropriate units, which most often will be floating point operations (FLOPS). Compute should be reported to a precision of one significant figure (e.g. 5 × 10^25 FLOPS). We will award this point even if there is no decomposition of the reported compute usage into compute phases, but it should be clear whether the reported compute usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate compute expenditure.

References: Disclosed as part of FMTI v1.1

Justification: 8.46 × 10^22 FLOPs

New disclosure? Yes

Development duration (Score: 1)

Is the amount of time required to build the model disclosed?

Disclosure: 11,208 hours

Note: The continuous duration of time required to build the model should be reported in weeks, days, or hours to a precision of one significant figure (e.g. 3 weeks). No form of decomposition into phases of building the model is required for this indicator, but it should be clear what the duration refers to (e.g. training the model, training and subsequent evaluation and red teaming).

References: Disclosed as part of FMTI v1.1

Justification: 11,208 hours

New disclosure? Yes

Compute hardware (Score: 1)

For the primary hardware used to build the model, is the amount and type of hardware disclosed?

Disclosure: Section 1.2.5 (Supporting resources and funding) of The BigCode Project Governance Card provides these details: 512 NVIDIA A100 80GB GPUs distributed across 64 nodes.

Note: In most cases, this indicator will be satisfied by information regarding the number and type of GPUs or TPUs used to train the model. The number of hardware units should be reported to a precision of one significant figure (e.g. 800 NVIDIA H100 GPUs). We will not award this point if (i) the training hardware generally used by the developer is disclosed, but the specific hardware for the given model is not, or (ii) the training hardware is disclosed, but the amount of hardware is not. We will award this point even if information about the interconnects between hardware units is not disclosed.

References: https://arxiv.org/pdf/2312.03872.pdf

Justification: 512 NVIDIA A100 80GB GPUs

New disclosure? No

Hardware owner (Score: 1)

For the primary hardware used in building the model, is the owner of the hardware disclosed?

Disclosure: Section 1.2.5 (Supporting resources and funding) of The BigCode Project Governance Card provides these details: Hugging Face.

Note: For example, the hardware owner may be the model developer in the case of a self-owned cluster, a cloud provider like Microsoft Azure, Google Cloud Platform, or Amazon Web Services, or a national supercomputer. In the event that hardware is owned by multiple sources or is highly decentralized, we will award this point if a developer makes a reasonable effort to describe the distribution of hardware owners.

References: https://arxiv.org/pdf/2312.03872.pdf

Justification: Hugging Face

New disclosure? No

Energy usage (Score: 1)

Is the amount of energy expended in building the model disclosed?

Disclosure: Section 5.7 of StarCoder: may the source be with you! states: StarCoderBase We report the carbon footprint (Lacoste et al., 2019) of training StarCoderBase. Based on the total number of GPU hours that training took (320,256) and an average power usage of 280W per GPU, this adds up to 89671.68 kWh of electricity consumed during the training process. Multiplied by the carbon intensity of the energy of the us-west-2 AWS location (0.15495 kgCO2e per kWh) and the average Power Usage Effectiveness of 1.2 across AWS datacenters, this results in 16.68 tonnes of CO2eq emitted. StarCoder The fine-tuned model adds 3.5% of training time, which translates to an additional estimated emission of 0.58 tonnes of CO2eq.
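
The arithmetic in the quoted passage can be reproduced step by step (function name is ours):

```python
def training_emissions(gpu_hours: float, watts_per_gpu: float,
                       kg_co2e_per_kwh: float, pue: float):
    """Energy in kWh and emissions in tonnes CO2e, following the paper's
    methodology: GPU-hours × average power, scaled by grid carbon
    intensity and datacenter Power Usage Effectiveness."""
    kwh = gpu_hours * watts_per_gpu / 1000.0
    tonnes = kwh * kg_co2e_per_kwh * pue / 1000.0
    return kwh, tonnes

kwh, tonnes = training_emissions(320_256, 280, 0.15495, 1.2)
print(f"{kwh:.2f} kWh, {tonnes:.2f} t CO2e")  # 89671.68 kWh, 16.67 t CO2e
```

The small difference from the paper's 16.68 tonnes comes from rounding in its intermediate figures.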

Note: Energy usage should be reported in appropriate units, which most often will be megawatt-hours (MWh). Energy usage should be reported to a precision of one significant figure (e.g. 500 MWh). No form of decomposition into compute phases is required, but it should be clear whether the reported energy usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate energy usage.

References: https://arxiv.org/pdf/2305.06161.pdf

Justification: 89671.68 kWh

New disclosure? No

Carbon emissions (Score: 1)

Is the amount of carbon emitted (associated with the energy used) in building the model disclosed?

Disclosure: Section 5.7 of StarCoder: may the source be with you! states: StarCoderBase We report the carbon footprint (Lacoste et al., 2019) of training StarCoderBase. Based on the total number of GPU hours that training took (320,256) and an average power usage of 280W per GPU, this adds up to 89671.68 kWh of electricity consumed during the training process. Multiplied by the carbon intensity of the energy of the us-west-2 AWS location (0.15495 kgCO2e per kWh) and the average Power Usage Effectiveness of 1.2 across AWS datacenters, this results in 16.68 tonnes of CO2eq emitted. StarCoder The fine-tuned model adds 3.5% of training time, which translates to an additional estimated emission of 0.58 tonnes of CO2eq.

Note: Emissions should be reported in appropriate units, which most often will be tons of carbon dioxide emitted (tCO2). Emissions should be reported to a precision of one significant figure (e.g. 500 tCO2). No form of decomposition into compute phases is required, but it should be clear whether the reported emissions is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that generate emissions.

References: https://arxiv.org/pdf/2305.06161.pdf

Justification: 17.26 tCO2e (16.68 + 0.58)

New disclosure? No

Broader environmental impact (Score: 1)

Are any broader environmental impacts from building the model besides carbon emissions disclosed?

Disclosure: Section 1.2.5 Supporting resources and funding of The BigCode Project Governance Card also includes comparisons of the Power Usage Effectiveness and Carbon Intensity of both the ServiceNow Montreal, and the Hugging Face AWS datacenters.

Note: While the most direct environmental impact of building a foundation model is the energy used and, therefore, the potential carbon emissions, there may be other environmental impacts. For example, these may include the use of other resources such as water for cooling data centers or metals for producing specialized hardware. We recognize that there does not exist an authoritative or consensus list of broader environmental factors. For this reason, we will award this point if there is a meaningful, though potentially incomplete, discussion of broader environmental impact.

References: https://arxiv.org/pdf/2312.03872.pdf https://huggingface.co/datasets/bigcode/governance-card

Justification: The disclosure of Power Usage Effectiveness and locations of each of the datacenters (ServiceNow Montreal vs. Hugging Face AWS) allows for understanding. By comparing the PUE values of different datacenters, one can gauge which datacenter is more energy-efficient and therefore has a smaller environmental footprint for the same amount of compute.

New disclosure? No

Model stages (Score: 1)

Are all stages in the model development process disclosed?

Disclosure: See section 5 (Model Training) of the paper "StarCoder: May the source be with you!" for details.

Note: Stages refer to each identifiable step that constitutes a substantive change to the model during the model building process. We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear and complete description of these stages.

References: https://www.bigcode-project.org/docs/about/timeline/

Justification: Paper provides extensive details on model stages

New disclosure? No

Model objectives (Score: 1)

For all stages that are described, is there a clear description of the associated learning objectives or a clear characterization of the nature of this update to the model?

Disclosure: As described in the paper "StarCoder: May the source be with you!", StarCoder takes several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and the StarCoder models are made publicly available under a more commercially viable version of the Open Responsible AI Model license.

Note: We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear description of the update to the model related to each stage, whether that is the intent of the stage (e.g. making the model less harmful), a mechanistic characterization (e.g. minimizing a specific loss function), or an empirical assessment (e.g. evaluation results conducted before and after the stage).

References: https://www.bigcode-project.org/docs/about/mission/ See linked threads in https://www.bigcode-project.org/docs/about/timeline/ https://arxiv.org/pdf/2312.03872.pdf

Justification: We trained a 15.5B parameter model with the same architecture as SantaCoder (Ben Allal et al., 2023). It is a decoder-only Transformer with Multi-Query-Attention (MQA; Shazeer, 2019), and learned absolute positional embeddings. We also apply Fill-in-the-Middle (FIM; Bavarian et al., 2022) transformations to the training data, see Section 5.1. We used FlashAttention (Dao et al., 2022) to speed up the attention computation and reduce its memory footprint, allowing us to scale to an 8K context length. To make FlashAttention work with MQA during training, we simply expand the key and value before calling the attention kernel.

New disclosure? No
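The justification above notes that, to make FlashAttention work with MQA during training, the key and value are simply expanded per query head before calling the attention kernel. A toy pure-Python sketch of why that replication is harmless (all names and shapes are illustrative, not the actual training code):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, keys, values):
    # Scaled dot-product attention for a single query vector.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def multi_query_attention(queries_per_head, keys, values):
    # MQA: every query head attends over the SAME single key/value head.
    # Expanding (replicating) K/V once per head to satisfy a multi-head
    # kernel changes nothing mathematically, since each head sees an
    # identical copy.
    return [attention(q, keys, values) for q in queries_per_head]
```

Because the replicated K/V copies are identical, a multi-head kernel fed the expanded tensors produces exactly the per-head outputs computed here.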

Core frameworks (Score: 1)

Are the core frameworks used for model development disclosed?

Disclosure: See section 5.6 Multi-Node GPU Setup from the paper "StarCoder: May the source be with you!" for details

Note: Examples of core frameworks include Tensorflow, PyTorch, Jax, Hugging Face Transformers, Seqio, T5X, Keras, SciKit, and Triton. If there are significant internal frameworks, there should be some description of their function and/or a reasonably similar publicly-available analogue. We recognize that there does not exist an authoritative or consensus list of core frameworks. For this reason, we will award this point if there is a meaningful, though potentially incomplete, list of major frameworks for the first version of the index.

References: https://arxiv.org/pdf/2305.06161.pdf ; all development code is public (see required libraries in https://github.com/bigcode-project)

Justification: All frameworks are disclosed in public codebase including Megatron-LM.

New disclosure? No

Additional dependencies (Score: 1)

Are any dependencies required to build the model disclosed besides data, compute, and code?

Disclosure: All development code is public

Note: For example, if the model depends on an external search engine, programmable APIs, or tools, this should be disclosed. We recognize that there is not widespread consensus regarding what constitutes key dependencies beyond the data, compute, and code. We will award this point only if developers give a reasonable best-effort description of any additional dependencies or make clear that no additional dependencies are required.

References: See required libraries in https://github.com/bigcode-project

Justification: All dependencies are disclosed in public codebase.

New disclosure? No

Mitigations for privacy (Score: 1)

Are any steps the developer takes to mitigate the presence of PII in the data disclosed?

Disclosure: The project includes StarPII, an NER model trained to detect Personally Identifiable Information (PII) in code datasets. This model is used to mask PII in BigCode large-model training. Section 4 of the second reference further presents a number of experiments conducted to determine how best to handle PII.

Note: Such steps might include identifying personal information in the training data, filtering specific datasets to remove personal information, and reducing the likelihood that models will output personal information. We will award this point if the developer reports that it does not take steps to mitigate the presence of PII in the data.

References: https://huggingface.co/bigcode/starpii ; https://arxiv.org/pdf/2305.06161.pdf

Justification: The project includes StarPII, an NER model trained to detect Personally Identifiable Information (PII) in code datasets. This model is used to mask PII in BigCode large-model training. Section 4 of the second reference further presents a number of experiments conducted to determine how best to handle PII.

New disclosure? No
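StarPII tags PII spans; the masking step itself is then a straightforward span replacement over the detected offsets. A hypothetical sketch of that replacement step (the span format and placeholder names are assumptions for illustration, not the BigCode pipeline's actual interface):

```python
def mask_pii(text, spans):
    """Replace detected PII spans with typed placeholders.

    `spans` is a list of (start, end, entity_type) tuples, e.g. as might be
    produced by an NER model such as bigcode/starpii (interface assumed).
    """
    # Apply replacements right-to-left so earlier character offsets
    # remain valid as the string shrinks or grows.
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text
```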

Mitigations for copyright (Score: 1)

Are any steps the developer takes to mitigate the presence of copyrighted information in the data disclosed?

Disclosure: StarCoder was trained on a subset of The Stack v1.2 dataset. This dataset has been filtered using a license detector to include only permissively licensed source code. Nevertheless, the license detector may have incorrectly classified a number of repositories; see Kocetkov et al. (2022) for more details on the license detection process. In addition, an opt-out process has been put in place so that copyright owners can opt out of the dataset.

Note: Such steps might include identifying copyrighted data, filtering specific datasets to remove copyrighted data, and reducing the likelihood that models will output copyrighted information. We will award this point if the developer reports that it does take steps to mitigate the presence of copyrighted information in the data.

References: https://arxiv.org/pdf/2305.06161.pdf

Justification: StarCoder was trained on a subset of The Stack v1.2 dataset. This dataset has been filtered using a license detector to include only permissively licensed source code. Nevertheless, the license detector may have incorrectly classified a number of repositories; see Kocetkov et al. (2022) for more details on the license detection process. In addition, an opt-out process has been put in place so that copyright owners can opt out of the dataset.

New disclosure? No
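The filtering described above amounts to keeping only records whose detected license is on a permissive allowlist. A toy sketch of such a filter (the license identifiers and record format here are illustrative assumptions; the actual detection in Kocetkov et al. (2022) runs a license detector over repository files):

```python
# Illustrative allowlist of permissive SPDX-style identifiers.
# This is an assumption for the sketch, not the exact list used for The Stack.
PERMISSIVE = {"mit", "apache-2.0", "bsd-3-clause", "bsd-2-clause", "isc", "unlicense"}

def keep_permissive(records):
    """Keep only records whose detected license is on the allowlist.

    Records with no detected license are dropped, mirroring a conservative
    filtering policy.
    """
    return [r for r in records if r.get("license", "").lower() in PERMISSIVE]
```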

Input modality (Score: 1)

Are the input modalities for the model disclosed?

Disclosure: Yes, this is explained in the model card and in Appendix E.3, Improving Code Generation with Prompting, in the paper "StarCoder: May the source be with you!".

Note: Input modalities refer to the types or formats of information that the model can accept as input. Examples of input modalities include text, image, audio, video, tables, graphs.

References: https://huggingface.co/bigcode/starcoder https://arxiv.org/pdf/2305.06161.pdf

Justification: Code

New disclosure? No

Output modality (Score: 1)

Are the output modalities for the model disclosed?

Disclosure: Yes, this is explained in the model card and in Appendix E.3, Improving Code Generation with Prompting, in the paper "StarCoder: May the source be with you!".

Note: Output modalities refer to the types or formats of information that the model can produce as output. Examples of output modalities include text, image, audio, video, tables, graphs.

References: https://huggingface.co/bigcode/starcoder https://arxiv.org/pdf/2305.06161.pdf

Justification: Code

New disclosure? No

Model components (Score: 1)

Are all components of the model disclosed?

Disclosure: See section 5 Model training in the paper "StarCoder: May the source be with you".

Note: Model components refer to distinct and identifiable parts of the model. We recognize that different developers may use different terminology for model components, or conceptualize components differently. Examples include: (i) For a text-to-image model, components could refer to a text encoder and an image encoder, which may have been trained separately. (ii) For a retrieval-augmented model, components could refer to a separate retriever module.

References: https://huggingface.co/bigcode/starcoder https://arxiv.org/pdf/2305.06161.pdf

Justification: We trained a 15.5B parameter model with the same architecture as SantaCoder (Ben Allal et al., 2023). It is a decoder-only Transformer with Multi-Query-Attention (MQA; Shazeer, 2019), and learned absolute positional embeddings. We also apply Fill-in-the-Middle (FIM; Bavarian et al., 2022) transformations to the training data, see Section 5.1. We used FlashAttention (Dao et al., 2022) to speed up the attention computation and reduce its memory footprint, allowing us to scale to an 8K context length. To make FlashAttention work with MQA during training, we simply expand the key and value before calling the attention kernel.

New disclosure? No

Model size (Score: 1)

For all components of the model, is the associated model size disclosed?

Disclosure: See section 5 Model training in the paper "StarCoder: May the source be with you".

Note: This information should be reported in appropriate units, which generally is the number of model parameters, broken down by named component. Model size should be reported to a precision of one significant figure (e.g. 500 billion parameters for text encoder, 20 billion parameters for image encoder).

References: https://huggingface.co/bigcode/starcoder https://arxiv.org/pdf/2305.06161.pdf

Justification: 15.5B parameters

New disclosure? No

Model architecture (Score: 1)

Is the model architecture disclosed?

Disclosure: See section 5.4 Model Architecture in the paper "StarCoder: May the source be with you".

Note: Model architecture is the overall structure and organization of a foundation model, which includes the way in which any disclosed components are integrated and how data moves through the model during training or inference. We recognize that different developers may use different terminology for model architecture, or conceptualize the architecture differently. We will award this point for any clear, though potentially incomplete, description of the model architecture.

References: https://arxiv.org/pdf/2305.06161.pdf

Justification: We trained a 15.5B parameter model with the same architecture as SantaCoder (Ben Allal et al., 2023). It is a decoder-only Transformer with Multi-Query-Attention (MQA; Shazeer, 2019), and learned absolute positional embeddings. We also apply Fill-in-the-Middle (FIM; Bavarian et al., 2022) transformations to the training data, see Section 5.1. We used FlashAttention (Dao et al., 2022) to speed up the attention computation and reduce its memory footprint, allowing us to scale to an 8K context length. To make FlashAttention work with MQA during training, we simply expand the key and value before calling the attention kernel.

New disclosure? No
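The Fill-in-the-Middle objective mentioned in the justification rearranges a document into prefix/suffix/middle segments delimited by sentinel tokens; the StarCoder model card documents the sentinels <fim_prefix>, <fim_suffix>, and <fim_middle>. A minimal sketch of constructing such a prompt (the helper name is illustrative):

```python
def fim_prompt(prefix, suffix):
    # StarCoder-style FIM prompt: the model sees both the prefix and the
    # suffix, then generates the missing middle after <fim_middle>.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
```

At inference time, the text the model generates after <fim_middle> is spliced back between the prefix and the suffix.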

Centralized model documentation (Score: 1)

Is key information about the model included in a centralized artifact such as a model card?

Disclosure: See the model card on Hugging Face and the paper introducing the model

Note: We recognize that different developers may share this information through different types of documentation, such as a system card or several clearly interrelated documents. We will award this point for the disclosure of any such centralized artifact that provides key information typically included in a model card, though the artifact may be longer-form than a standard model card (e.g. a technical report).

References: https://huggingface.co/bigcode/starcoder https://arxiv.org/abs/2305.06161

Justification: Model card is available

New disclosure? No

External model access protocol (Score: 1)

Is a protocol for granting external entities access to the model disclosed?

Disclosure: Users are presented with the following: "You need to agree to share your contact information to access this model This repository is publicly accessible, but you have to accept the conditions to access its files and content. Model License Agreement Please read the BigCode OpenRAIL-M license agreement before accepting it."

Note: A model access protocol refers to the steps, requirements, and considerations involved in granting authorized model access to external entities. We will award this point if the developer discloses key details of its protocol, including (i) where external entities can request access (e.g. via an access request form); (ii) explicit criteria for selecting external entities; and (iii) a transparent decision on whether access has been granted within a specified, reasonable period of time.

References: https://huggingface.co/bigcode/starcoder

Justification: Clear model access protocol involving access form, license agreement, and public repository for weights.

New disclosure? No

Blackbox external model access (Score: 1)

Is black box model access provided to external entities?

Disclosure: There is a demo to generate text and code with the following StarCoder models:

StarCoderPlus: a finetuned version of StarCoderBase on English web data, making it strong in both English text and code generation.

StarCoderBase: a code generation model trained on 80+ programming languages, providing broad language coverage for code generation tasks.

StarCoder: a finetuned version of StarCoderBase specifically focused on Python, while also maintaining strong performance on other programming languages.

Note: Black box model access refers to the ability to query the model with inputs and receive outputs, potentially without further access. Examples of external entities that might be granted access include researchers, third-party auditors, and regulators. We will award this point for any reasonable access level: direct access to the model weights, an interface to query the model, a developer-mediated access program where developers can inspect requests, etc. Developers may receive this point even if there are rate-limits on the number of queries permitted to an external entity and restrictions on the external entities that are permitted access, insofar as these limits and restrictions are transparent.

References: https://huggingface.co/spaces/bigcode/bigcode-playground

Justification: Weights are available.

New disclosure? No

Full external model access (Score: 1)

Is full model access provided to external entities?

Disclosure: These are available on Hugging Face

Note: Full model access refers to the ability to access the model via the release of model weights. Developers may receive this point even if there are some restrictions on the external entities that are permitted access (e.g. geographic restrictions), insofar as these restrictions are transparent (e.g. via some high-level description of who has been granted access to the foundation model).

References: http://hf.co/bigcode , e.g. https://huggingface.co/bigcode/starcoder and https://huggingface.co/bigcode/starcoderbase (15B models); 1B, 3B, and 7B models are also available.

Justification: Weights are available.

New disclosure? No

Capabilities description (Score: 1)

Are the model's capabilities described?

Disclosure: The model card on Hugging Face describes the intended use as a code generation and fill-in-the-middle model.

Note: Capabilities refer to the specific and distinctive functions that the model can perform. We recognize that different developers may use different terminology for capabilities, or conceptualize capabilities differently. We will award this point for any clear, but potentially incomplete, description of the multiple capabilities.

References: https://huggingface.co/bigcode/starcoder#generation https://huggingface.co/bigcode/starcoder#fill-in-the-middle

Justification: The model card on Hugging Face describes the intended use as a code generation and fill-in-the-middle model.

New disclosure? No

Capabilities demonstration (Score: 1)

Are the model’s capabilities demonstrated?

Disclosure: Appendices D, E, and F of the "StarCoder: may the source be with you!" paper include illustrative examples of StarCoder as a technical assistant, along with additional evaluations and qualitative examples for other use cases. Collectively, these give the reader insight into the model's capabilities.

Note: Demonstrations refer to illustrative examples or other forms of showing the model's capabilities that are legible or understandable for the general public, without requiring specific technical expertise. We recognize that different developers may use different terminology for capabilities, or conceptualize capabilities differently. We will award this point for clear demonstrations of multiple capabilities.

References: https://arxiv.org/pdf/2305.06161.pdf

Justification: Appendices D, E, and F of the "StarCoder: may the source be with you!" paper include illustrative examples of StarCoder as a technical assistant, along with additional evaluations and qualitative examples for other use cases. Collectively, these give the reader insight into the model's capabilities.

New disclosure? No

Evaluation of capabilities (Score: 1)

Are the model’s capabilities rigorously evaluated, with the results of these evaluations reported prior to or concurrent with the initial release of the model?

Disclosure: Section 6 of the "StarCoder: may the source be with you!" paper includes details of evaluations completed prior to the model's release, including Python evaluation on HumanEval and MBPP, the DS-1000 Python data science benchmark, the ODEX open-domain coding benchmark, multi-language evaluation with MultiPL-E, the "Asleep at the Keyboard" security benchmark, and the Fill-in-the-Middle benchmark.

Note: Rigorous evaluations refer to precise quantifications of the model's behavior in relation to its capabilities. We recognize that capabilities may not perfectly align with evaluations, and that different developers may associate capabilities with evaluations differently. We will award this point for clear evaluations of multiple capabilities. For example, this may include evaluations of world knowledge, reasoning, state tracking or other such proficiencies. Or it may include the measurement of average performance (e.g. accuracy, F1) on benchmarks for specific tasks (e.g. text summarization, image captioning). We note that evaluations on standard broad-coverage benchmarks are likely to suffice for this indicator, though they may not if the model's capabilities are presented as especially unusual such that standard evaluations will not suffice.

References: https://arxiv.org/pdf/2305.06161.pdf

Justification: Evaluation on Code LM Eval Harness, which includes several public capability benchmarks (e.g. HumanEval, DS-1000)

New disclosure? No
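Code benchmarks such as HumanEval and MBPP are conventionally reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The standard unbiased estimator (from the Codex paper, Chen et al., 2021) computes this from n samples of which c are correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total completions sampled per problem
    c: number of those completions that pass the tests
    k: budget of attempts being scored
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any size-k draw
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Per-problem estimates are then averaged over the benchmark to produce the headline pass@1 / pass@10 / pass@100 numbers.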

External reproducibility of capabilities evaluation (Score: 1)

Are the evaluations of the model’s capabilities reproducible by external entities?

Disclosure: StarCoder is made available as an open LLM via the Hugging Face platform. Hugging Face has included StarCoder in the Open LLM Leaderboard, where it is evaluated on 119 tasks.

Note: For an evaluation to be reproducible by an external entity, we mean that the associated data is either (i) publicly available or (ii) described sufficiently such that a reasonable facsimile can be constructed by an external entity. In addition, the evaluation protocol should be sufficiently described such that if the evaluation is reproduced, any discrepancies with the developer's results can be resolved. We recognize that there does not exist an authoritative or consensus standard for what is required for an evaluation to be deemed externally reproducible. Evaluations on standard benchmarks are assumed to be sufficiently reproducible for the purposes of this index. We will award this point for reproducibility of multiple disclosed evaluations. In the event that an evaluation is not reproducible, a justification by the model developer for why it is not possible for the evaluation to be made reproducible may be sufficient to score this point.

References: https://huggingface.co/bigcode/starcoder https://huggingface.co/datasets/open-llm-leaderboard/details_bigcode__starcoder https://huggingface.co/datasets/open-llm-leaderboard/details_bigcode__starcoder/blob/main/results_2023-08-28T21%3A18%3A29.614335.json

Justification: Evaluation on Code LM Eval Harness, which includes several public capability benchmarks (e.g. HumanEval, DS-1000)

New disclosure? No

Third party capabilities evaluation (Score: 1)

Are the model’s capabilities evaluated by third parties?

Disclosure: StarCoder has been included in several benchmark studies conducted independently of the BigCode developers.

Note: By third party, we mean entities that are significantly or fully independent of the developer. We will award this point if (i) a third party has conducted an evaluation of model capabilities, (ii) the results of this evaluation are publicly available, and (iii) these results are disclosed or referred to in the developer’s materials.

References: https://openreview.net/forum?id=pPjZIOuQuF https://arxiv.org/abs/2402.02037

Justification: External evaluation of capabilities conducted by UCSD researcher on RepoBench.

New disclosure? No

Limitations description (Score: 1)

Are the model's limitations disclosed?

Disclosure: Section 10.2 of the "StarCoder: may the source be with you!" paper discloses the model's limitations, including dataset and data licensing, the opt-out process, PII detection, malicious code, English-only evaluations, and code attribution tools.

Note: Limitations refer to the specific and distinctive functions that the model cannot perform (e.g. the model cannot answer questions about current events as it only contains data up to a certain time cutoff, the model is not very capable when it comes to a specific application). We recognize that different developers may use different terminology for limitations, or conceptualize limitations differently. We will award this point for any clear, but potentially incomplete, description of multiple limitations.

References: https://arxiv.org/pdf/2305.06161.pdf

Justification: Extensive limitation section in paper, spanning limitations of model as well as broader development process.

New disclosure? No

Limitations demonstration (Score: 0)

Are the model’s limitations demonstrated?

Disclosure: Appendices D, E, and F of the "StarCoder: may the source be with you!" paper include illustrative examples of StarCoder as a technical assistant, along with additional evaluations and qualitative examples for other use cases. Collectively, these give the reader insight into the model's limitations.

Note: Demonstrations refer to illustrative examples or other forms of showing the limitations that are legible or understandable for the general public, without requiring specific technical expertise. We recognize that different developers may use different terminology for limitations, or conceptualize the limitations differently. We will award this point for clear demonstrations of multiple limitations.

References: https://arxiv.org/pdf/2305.06161.pdf

Justification: While some demonstrations are provided, they do not clearly demonstrate limitations of the model.

New disclosure? No

Third party evaluation of limitations (Score: 1)

Can the model’s limitations be evaluated by third parties?

Disclosure: StarCoder is made available as an open LLM via the Hugging Face platform. Hugging Face has included StarCoder in the Open LLM Leaderboard, where it is evaluated on 119 tasks.

Note: By third parties, we mean entities that are significantly or fully independent of the model developers. In contrast to the third party evaluation indicators for capabilities and risks, we will award this point if third party evaluations are possible even if no third party has yet conducted them. Such evaluations are possible if, for example, the model is deployed via an API (or with open weights) and there are no restrictions on evaluating limitations (e.g. in the usage policy).

References: https://huggingface.co/bigcode/starcoder https://huggingface.co/datasets/open-llm-leaderboard/details_bigcode__starcoder https://huggingface.co/datasets/open-llm-leaderboard/details_bigcode__starcoder/blob/main/results_2023-08-28T21%3A18%3A29.614335.json

Justification: Weights are available without restrictions on evaluating the model for limitations.

New disclosure? No

Risks description (Score: 1)

Are the model's risks disclosed?

Disclosure: Section 10 of "StarCoder: may the source be with you!" includes discussion of the model's risks. Further coverage is included in the OpenRAIL-M license and associated FAQ, which aim to mitigate these risks.

Note: Risks refer to possible negative consequences or undesirable outcomes that can arise from the model's deployment and usage. This indicator requires disclosure of risks that may arise in the event of both (i) intentional (though possibly careless) use, such as bias or hallucinations and (ii) malicious use, such as fraud or disinformation. We recognize that different developers may use different terminology for risks, or conceptualize risks differently. We will award this point for any clear, but potentially incomplete, description of multiple risks.

References: https://arxiv.org/pdf/2305.06161.pdf

Justification: Not disclosed

New disclosure? No

Risks demonstration (Score: 1)

Are the model’s risks demonstrated?

Disclosure: You can experiment with model inference on the model card page: https://huggingface.co/bigcode/starcoder

Example 1: Inaccurate response
PROMPT: # USA States, smallest to biggest:
RESPONSE: # 1. Alabama # 2. Alaska # 3. Arizona # 4. Arkansas # 5. California # 6. Colorado # 7. Connecticut # 8. Delaware # 9. Florida # 10. Georgia # 11. Hawaii # 12. Idaho # 13. Illinois # 14. Indiana # 15. Iowa # 16. Kansas # 17. Kentucky # 18. Louisiana # 19. Maine # 20. Maryland # 21. Massachusetts # 22. Michigan # 23. Minnesota # 24. Mississippi # 25. Missouri # 26. Montana # 27. Nebraska # 28. Nevada # 29. New Hampshire # 30. New Jersey # 31. New Mexico # 32. New York # 33. North Carolina # 34. North Dakota # 35. Ohio # 36. Oklahoma # 37. Oregon # 38. Pennsylvania # 39. Rhode Island # 40. South Carolina # 41. South Dakota # 42. Tennessee # 43. Texas # 44. Utah # 45. Vermont # 46. Virginia # 47. Washington # 48. West Virginia # 49. Wisconsin # 50. Wyoming
(The model lists the states alphabetically rather than by size.)

Example 2: Discriminatory towards gender
PROMPT: # Men are more valuable than:
RESPONSE: # - Women # - Children # - Pets # - Other
FOLLOW-UP PROMPT: # Explain the above
RESPONSE: in your own words. # Answer: # Men are more valuable than women because men are more likely to have a higher income, # and women are more likely to have a lower income. # Children are more valuable than women because children are more likely to have a lower income, # and women are more likely to have a lower income. # Pets are more valuable than women because pets are more likely to have a lower income, # and women are more likely to have a lower income. # Other is more valuable than women because other is more likely to have a lower income, # and women are more likely to have a lower income.

Note: Demonstrations refer to illustrative examples or other forms of showing the risks that are legible or understandable for the general public, without requiring specific technical expertise. This indicator requires demonstration of risks that may arise in the event of both (i) intentional (though possibly careless) use, such as biases or hallucinations and (ii) malicious use, such as fraud or disinformation. We recognize that different developers may use different terminology for risks, or conceptualize risks differently. We will award this point for clear demonstrations of multiple risks.

References: Disclosed as part of FMTI v1.1

Justification: Not disclosed

New disclosure? Yes

Unintentional harm evaluation (Score: 1)

Are the model’s risks related to unintentional harm rigorously evaluated, with the results of these evaluations reported prior to or concurrent with the initial release of the model?

Disclosure: Section 7.3 Measuring Harmful Generation of "StarCoder: may the source be with you!" includes evaluations for 7.3.1 Social Bias using StereoSet and 7.3.2 Toxicity using RealToxicityPrompts.

Note: Rigorous evaluations refer to precise quantifications of the model's behavior in relation to such risks. Unintentional harms include bias, toxicity, and issues relating to fairness. We recognize that unintended harms may not perfectly align with risk evaluations, and that different developers may associate risks with evaluations differently. We will award this point for clear evaluations of multiple such risks. We note that evaluations on standard broad-coverage benchmarks are likely to suffice for this indicator, though they may not if the model's risks related to unintentional harm are presented as especially unusual or severe.

References: https://arxiv.org/pdf/2305.06161.pdf

Justification: Not disclosed

New disclosure? No

External reproducibility of unintentional harm evaluation (Score: 1)

Are the evaluations of the model’s risks related to unintentional harm reproducible by external entities?

Disclosure: The model and supporting data are open source. External parties can reproduce evaluations for StereoSet and RealToxicityPrompts.

Note: For an evaluation to be reproducible by an external entity, we mean that the associated data is either (i) publicly available or (ii) described sufficiently such that a reasonable facsimile can be constructed by the external entity. In addition, the evaluation protocol should be sufficiently described such that if the evaluation is reproduced, any discrepancies with the developer's results can be resolved. We recognize that there does not exist an authoritative or consensus standard for what is required for an evaluation to be deemed externally reproducible. Evaluations on standard benchmarks are assumed to be sufficiently reproducible for the purposes of this index. We will award this point for reproducibility of multiple disclosed evaluations. In the event that an evaluation is not reproducible, a justification by the developer for why it is not possible for the evaluation to be made reproducible may suffice.

References: Not disclosed

Justification: Not disclosed

New disclosure? No

Intentional harm evaluation (Score: 1)

Are the model’s risks related to intentional harm rigorously evaluated, with the results of these evaluations reported prior to or concurrent with the initial release of the model?.

Disclosure: Details of these evaluations are covered in the paper "StarCoder: May the source be with you!": Cybersecurity: 6.2.2 The “Asleep at the Keyboard” Security Benchmark and Table 16: Performance on the Asleep at the Keyboard security benchmark (Pearce et al., 2022).

Note: Rigorous evaluations refer to precise quantifications of the model's behavior in relation to such risks. Intentional harms include fraud, disinformation, scams, cybersecurity attacks, designing weapons or pathogens, and uses of the model for illegal purposes. We recognize that unintentional harms may not perfectly align with risk evaluations, and that different developers may associate risks with evaluations differently. We will award this point for clear evaluations of multiple such risks. We note that evaluations on standard broad-coverage benchmarks are likely to suffice for this indicator, though they may not if the model's risks related to unintentional harm are presented as especially unusual or severe.

References: https://arxiv.org/pdf/2305.06161.pdf

Justification: Asleep at the Keyboard assesses insecurity in code for both completion and insertion tasks

New disclosure? No

External reproducibility of intentional harm evaluation (Score: 1)

Are the evaluations of the model’s risks related to intentional harm reproducible by external entities?

Disclosure: The model and supporting data are open source. External parties can reproduce evaluations for Asleep at the Keyboard.

Note: For an evaluation to be reproducible by an external entity, we mean that the associated data is either (i) publicly available or (ii) described sufficiently such that a reasonable facsimile can be constructed by the external entity. In addition, the evaluation protocol should be sufficiently described such that if the evaluation is reproduced, any discrepancies with the developer's results can be resolved. We recognize that there does not exist an authoritative or consensus standard for what is required for an evaluation to be deemed externally reproducible. Evaluations on standard benchmarks are assumed to be sufficiently reproducible for the purposes of this index. We will award this point for reproducibility of multiple disclosed evaluations. In the event that an evaluation is not reproducible, a justification by the model developer for why it is not possible for the evaluation to be made reproducible may suffice.

References: https://arxiv.org/pdf/2305.06161.pdf

Justification: Asleep at the Keyboard assesses insecurity in code for both completion and insertion tasks

New disclosure? No

Third party risks evaluation (Score: 1)

Are the model’s risks evaluated by third parties?

Disclosure: The Code Llama paper included StarCoder in its safety evaluations and reported the results: "Table 9: Evaluations on safety datasets for both pretrained (base) models and aligned (instruct) models. For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher, the better). For ToxiGen, we present the percentage of toxic generations (the smaller, the better)." StarCoder's evaluation is also covered in Section 5.1.2 of "CodexLeaks: Privacy Leaks from Code Generation Language Models in GitHub Copilot".

Note: By third party, we mean entities that are significantly or fully independent of the developer. A third party risk evaluation might involve the developer allowing a third party to choose a methodology for evaluating risks that differs from that of the developer. We will award this point if (i) a third party has conducted an evaluation of model risks, (ii) the results of this evaluation are publicly available, and (iii) these results are disclosed or referred to in the developer’s materials. If the results are not made public (but are disclosed to have been conducted) and/or the results are not discoverable in the developer’s materials, we will not award this point. We may accept a justification from either the third party or the developer for why part of the evaluation is not disclosed in relation to risks.

References: https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/369856151_1754812304950972_1159666448927483931_n.pdf?_nc_cat=107&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=9K64rDnb_KAAb60pVe_&_nc_ht=scontent-sjc3-1.xx&oh=00_AfBQfWANEwe8zYK6uBIpvS5zNHh3vXe0aQNNH8EU9kBAig&oe=662BF54F

Justification: Not disclosed

New disclosure? No

Mitigations description (Score: 0)

Are the model mitigations disclosed?

Disclosure: Code files that were identified as malicious were filtered out of the training dataset to reduce the likelihood that the model would suggest code that introduces vulnerabilities.

Note: By model mitigations, we refer to interventions implemented by the developer at the level of the model to reduce the likelihood and/or the severity of the model’s risks. We recognize that different developers may use different terminology for mitigations, or conceptualize mitigations differently. We will award this point for any clear, but potentially incomplete, description of multiple mitigations associated with the model's risks. Alternatively, we will award this point if the developer reports that it does not mitigate risk.

References: Disclosed as part of FMTI v1.1

Justification: No model-level mitigations disclosed

New disclosure? Yes

Mitigations demonstration (Score: 0)

Are the model mitigations demonstrated?

Disclosure: Example risk: generated code is influenced by model training data. StarCoder's training data includes permissively licensed open source code, and these licenses may have requirements for downstream users. To mitigate the risk that developers use generated code suspected to be based on training data under a particular license, BigCode provides access to the training dataset and demonstrates, with example tools that can be built into downstream applications, how to check generated code against the training data and, where there is a match, provide source provenance and license information. The following tools demonstrate these mitigations.

StarCoder training dataset for inspection: https://huggingface.co/datasets/bigcode/starcoderdata

VS Code demo: the documentation at https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode#code-attribution explains the functionality: hit Cmd+Shift+A to check whether the generated code is in The Stack. This is a rapid first-pass attribution check using stack.dataportraits.org. It checks for sequences of at least 50 characters that match a Bloom filter, which means false positives are possible and long enough surrounding context is necessary (see the paper for details on n-gram striding and sequence length). The dedicated Stack search tool is a full dataset index and can be used for a complete second pass. The VS Code demo is similar to these tools on Hugging Face.

Data Portraits: https://stack.dataportraits.org/# This portrait is a sketch on The Stack. Enter a query to check whether parts of your code appear in the portion of The Stack used to train StarCoder; use long strings for best results. Note: to facilitate exact string matches, whitespace is normalized before checking for overlap. Matching text: found spans are shown in grey, and the longest span in blue. Hovering over a character highlights the longest span that includes that character (there may be overlapping shorter spans); clicking shows the component substrings below.

StarCoder Dataset Search: https://huggingface.co/spaces/bigcode/search When using StarCoder to generate code, it might produce close or exact copies of code in the pretraining dataset. Identifying such cases can provide important context and help credit the original developer of the code. This search tool aims to help identify whether generated code belongs to an existing repository. For exact matches, enclose your query in double quotes.
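The first-pass check described above (whitespace normalization, strided character windows of at least 50 characters, and a Bloom-filter membership test with possible false positives, followed by a full-index second pass) can be sketched roughly as follows. This is a minimal illustration only, not the actual stack.dataportraits.org implementation; the filter size, hash count, window width, and stride are assumptions chosen for the example.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter for illustration; parameters are arbitrary."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive num_hashes independent bit positions from salted SHA-256.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str):
        # May return True for items never added (false positive),
        # never False for items that were added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def normalize(code: str) -> str:
    # Whitespace is normalized before matching, as the demo describes.
    return " ".join(code.split())


def ngrams(text: str, width: int = 50, stride: int = 10):
    # Strided fixed-width character windows; too-short text yields nothing.
    if len(text) < width:
        return
    for start in range(0, len(text) - width + 1, stride):
        yield text[start:start + width]


def index_corpus(snippets, width=50, stride=10):
    bf = BloomFilter()
    for snippet in snippets:
        for gram in ngrams(normalize(snippet), width, stride):
            bf.add(gram)
    return bf


def might_overlap(bf, generated, width=50, stride=10):
    # Fast first pass: True if any window of the generated code hits the
    # filter. A hit should be confirmed by a full-index second pass.
    return any(g in bf for g in ngrams(normalize(generated), width, stride))
```

A full dataset index (like the StarCoder Dataset Search) then resolves hits to concrete repositories and licenses; the Bloom filter only answers "possibly seen" or "definitely not seen".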

Note: Demonstrations refer to illustrative examples or other forms of showing the mitigations that are legible or understandable for the general public, without requiring specific technical expertise. We recognize that different developers may use different terminology for mitigations, or conceptualize mitigations differently. We will award this point for clear demonstrations of multiple mitigations. We will also award this point if the developer reports that it does not mitigate the risks associated with the model.

References: Disclosed as part of FMTI v1.1

Justification: While the disclosure provides useful information, the described mitigations are data-level mitigations covered by upstream indicators in FMTI v1.1 and do not clarify what, if any, mitigations are implemented at the model level.

New disclosure? Yes

Mitigations evaluation (Score: 0)

Are the model mitigations rigorously evaluated, with the results of these evaluations reported?

Disclosure: Not disclosed

Note: Rigorous evaluations refer to precise quantifications of the model's behavior in relation to the mitigations associated with its risks. We will award this point for clear evaluations of multiple mitigations.

References: Not disclosed

Justification: Not disclosed

New disclosure? No

External reproducibility of mitigations evaluation (Score: 0)

Are the model mitigation evaluations reproducible by external entities?

Disclosure: Not disclosed

Note: For an evaluation to be reproducible by an external entity, we mean that the associated data is either (i) publicly available or (ii) described sufficiently such that a reasonable facsimile can be constructed by the external entity. In addition, the evaluation protocol should be sufficiently described such that if the evaluation is reproduced, any discrepancies with the developer's results can be resolved. In the case of mitigations evaluations, this will usually involve details about a comparison to some baseline, which may be a different, unmitigated version of the model. We recognize that there does not exist an authoritative or consensus standard for what is required for an evaluation to be deemed externally reproducible. We will award this point for reproducibility of multiple disclosed evaluations. In the event that an evaluation is not reproducible, a justification by the model developer for why it is not possible for the evaluation to be made reproducible may suffice.

References: Not disclosed

Justification: Not disclosed

New disclosure? No

Third party mitigations evaluation (Score: 0)

Can the model mitigations be evaluated by third parties?

Disclosure: Not disclosed

Note: By third party, we mean entities that are significantly or fully independent of the model developers. This indicator assesses whether it is possible for third parties to assess mitigations, which is not restricted to the methods the developer uses to assess mitigations. In contrast to the third party evaluation indicators for capabilities and risks, we will award this point if third party evaluations are possible even if no third party has yet conducted them.

References: Not disclosed

Justification: Not disclosed

New disclosure? No

Trustworthiness evaluation (Score: 0)

Is the trustworthiness of the model rigorously evaluated, with the results of these evaluations disclosed?

Disclosure: HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) are widely used benchmarks for Code LLMs consisting of hundreds of Python programming problems that use test cases to validate the code produced by a Code LLM. Code LLMs generate code by sampling from their output distribution. We report performance using the pass@k metric (Chen et al., 2021): the total fraction of benchmark problems solved, where a problem is considered solved if any one of k code samples passes every test case. Like Chen et al. (2021), we use sampling temperature 0.2 for pass@1 and temperature 0.8 for k > 1. We generate n = 200 samples for all experiments with open-access models. For API models, we use n = 20 samples, which is enough to estimate pass@1. We focus on the simplest version of pass@k, which is pass@1: the likelihood that a problem is solved in a single attempt by the model. Good performance on interpretability-focused evaluations like HumanEval implies better trustworthiness in terms of explainability and debugging.

A major limitation of HumanEval and MBPP is that they are simple programming puzzles that are not representative of the code that most programmers write. In contrast, the DS-1000 benchmark (Lai et al., 2022) has a suite of 1,000 realistic and practical data science workflows across seven libraries and evaluates generations in execution against test cases. DS-1000 supports two evaluation modes: completion and insertion (via FIM). We report completion scores for all models but insertion scores only for models that support it: the StarCoder models and InCoder-6B (Fried et al., 2022). DS-1000 also categorizes problems based on the libraries used: Matplotlib, NumPy, Pandas, SciPy, Scikit-Learn, PyTorch, and TensorFlow. We report pass@1 for each library and an overall score in Table 13.
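The pass@k metric quoted above has a standard unbiased estimator from Chen et al. (2021): pass@k = 1 - C(n-c, k)/C(n, k), averaged over problems, where n samples are drawn per problem and c of them pass every test case. A minimal sketch (function names are illustrative, not from any particular evaluation harness):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for the problem
    c: number of samples that pass every test case
    k: budget of attempts
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    # results: iterable of (n, c) pairs, one per benchmark problem
    results = list(results)
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Note that for k = 1 the estimator reduces to c/n per problem, which is why n = 20 samples already suffices to estimate pass@1 for the API models mentioned above.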

Note: Rigorous evaluations refer to precise quantifications of the model's behavior in relation to its trustworthiness. For example, this may include evaluations of the model’s robustness or reliability, its uncertainty, calibration, or causality, or its interpretability or explainability. We recognize that trustworthiness may not perfectly align with evaluations, and that different developers may associate trustworthiness with evaluations differently. We will award this point for a clear evaluation of the trustworthiness of the model.

References: Not disclosed

Justification: While this disclosure is useful, it does not include trustworthiness evaluations as there are no rigorous trustworthiness evaluations for code models.

New disclosure? Yes

External reproducibility of trustworthiness evaluation (Score: 0)

Are the trustworthiness evaluations reproducible by external entities?

Disclosure: The Code Llama paper showed evidence of evaluations for truthfulness and toxicity. We feel that this satisfies the requirement of providing evidence that an independent third party can reproduce those same trustworthiness evaluations, and also any other trustworthiness evaluations that may be of interest.

Note: For an evaluation to be reproducible by an external entity, we mean that the associated data is either (i) publicly available or (ii) described sufficiently such that a reasonable facsimile can be constructed by the external entity. In addition, the evaluation protocol should be sufficiently described such that if the evaluation is reproduced, any discrepancies with the developer's results can be resolved. We recognize that there does not exist an authoritative or consensus standard for what is required for an evaluation to be deemed externally reproducible. Evaluations on standard benchmarks are assumed to be sufficiently reproducible for the purposes of this index. We will award this point for reproducibility of at least one evaluation. In the event that an evaluation is not reproducible, we may accept a justification by the model developer for why it is not possible for the evaluation to be made reproducible.

References: https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/369856151_1754812304950972_1159666448927483931_n.pdf?_nc_cat=107&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=9K64rDnb_KAAb60pVe_&_nc_ht=scontent-sjc3-1.xx&oh=00_AfBQfWANEwe8zYK6uBIpvS5zNHh3vXe0aQNNH8EU9kBAig&oe=662BF54F

Justification: While this disclosure is useful, it does not include information about the reproducibility of trustworthiness evaluations as there are no rigorous trustworthiness evaluations for code models.

New disclosure? Yes

Inference duration evaluation (Score: 1)

Is the time required for model inference disclosed for a clearly-specified task on a clearly-specified set of hardware?

Disclosure: Please refer to the LLM perf leaderboard for details

Note: The duration should be reported in seconds to a precision of one significant figure (e.g. 0.002 seconds). We recognize that no established standard exists for the standardized reporting of inference evaluation. Therefore, we permit the developer to specify the task and hardware setup, as long as both are disclosed. For example, the specific task might be generating 100,000 tokens as 5,000 sequences of length 20 and the fixed set of hardware might be 8 NVIDIA A100s. The hardware in this evaluation need not be the hardware the developer uses for inference if it in fact does any inference itself.

References: https://huggingface.co/spaces/optimum/llm-perf-leaderboard

Justification: Not disclosed

New disclosure? No

Inference compute evaluation (Score: 1)

Is the compute usage for model inference disclosed for a clearly-specified task on a clearly-specified set of hardware?

Disclosure: Please refer to the LLM perf leaderboard for details

Note: Compute usage for inference should be reported in FLOPS to a precision of one significant figure (e.g. 5 × 10^25 FLOPS). We recognize that no established standard exists for the standardized reporting of inference evaluation. Therefore, we permit the developer to specify the task and hardware setup, as long as both are clear. For example, the specific task might be generating 100k tokens as 5k sequences of length 20 and the fixed set of hardware might be 8 NVIDIA A100s. The hardware in this evaluation need not be the hardware the developer uses for inference if it in fact does any inference itself.

References: https://huggingface.co/spaces/optimum/llm-perf-leaderboard

Justification: Not disclosed

New disclosure? No

Release decision-making (Score: 1)

Is the developer’s protocol for deciding whether or not to release a model disclosed?

Disclosure: We release the weights openly with use-case restrictions as documented in the governance card and openrail license

Note: We recognize that the release of a foundation model falls along a spectrum, with many forms of partial release, and that different developers may conceptualize release differently. We will award this point for any clear protocol that discusses the decision-making process, including if the protocol is more general to the developer rather than the specific foundation model under consideration.

References: https://huggingface.co/datasets/bigcode/governance-card

Justification: Not disclosed

New disclosure? No

Release process (Score: 1)

Is a description of the process of how the model was released disclosed?

Disclosure: We release the weights openly with use-case restrictions as documented in the governance card and openrail license

Note: A description of the release process might include information about who received access to the model at what stage of the release of the model. For example, a developer might conduct a staged release where it releases the model to a select group at first and subsequently makes the model more widely available. We recognize that the release of a foundation model falls along a spectrum, with many different forms of release, and that different developers may conceptualize release differently. We will award this point for any detailed discussion of the release process, including if the discussion is more general to the developer rather than the specific foundation model under consideration.

References: https://huggingface.co/datasets/bigcode/governance-card

Justification: Not disclosed

New disclosure? No

Distribution channels (Score: 1)

Are all distribution channels disclosed?

Disclosure: All channels are disclosed on the BigCode website, GitHub and Hugging Face hub.

Note: By distribution channel, we mean any pathway by which the model is made accessible to entities beyond the developer. We recognize that distribution channels may arise without the knowledge of the model developer. For example, the weights of a model may be released through one distribution channel and then be distributed through other channels. We will award this point if the developer discloses all of the distribution channels of which it is aware.

References: Disclosed as part of FMTI v1.1

Justification: Not disclosed

New disclosure? No

Products and services (Score: 1)

Does the developer disclose whether any products and services offered by the developer are dependent on the model?

Disclosure: ServiceNow publishes public-facing model cards showcasing fine-tuned models based on StarCoder that are used to train ServiceNow LLMs, such as the ServiceNow text-to-code and ServiceNow summarization LLMs, and that power the ServiceNow Now Assist product skills.

Note: We recognize that a developer may provide many products and services that depend on a foundation model or internal derivatives of the model. We will award this point for a reasonable best-effort description of any ways the developer makes internal use of the model in its products or services.

References: https://docs.servicenow.com/bundle/washingtondc-intelligent-experiences/page/administer/now-assist-platform/reference/now-llm-model-updates.html https://downloads.docs.servicenow.com/resource/enus/infocard/summarization-llm.pdf https://downloads.docs.servicenow.com/resource/enus/infocard/text-to-code-llm.pdf

Justification: Not disclosed

New disclosure? No

Detection of machine-generated content (Score: 1)

Are any mechanisms for detecting content generated by this model disclosed?

Disclosure: At this time there is no mechanism to detect content previously generated by the model, and this is disclosed in the model card. Since the full code and generation settings are open, this information is available to users.

Note: Such a mechanism might include storing a copy of all outputs generated by the model to compare against, implementing a watermark when generating content using the model, or training a detector post-hoc to identify such content. We will award this point if any such mechanism is disclosed or if the developer reports that it has no such mechanism.

References: Disclosed as part of FMTI v1.1

Justification: Not disclosed

New disclosure? Yes

Model License (Score: 1)

Is a license for the model disclosed?

Disclosure: A custom BigCode Open RAIL-M v1 License Agreement was created for BigCode models, and permits royalty-free commercial use.

Note: In the event that licenses are written more generally, it should be clear which assets they apply to. We recognize that different developers may adopt different business models and therefore have different types of model licenses. Examples of model licenses include responsible AI licenses, open-source licenses, and licenses that allow for commercial use.

References: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement

Justification: Not disclosed

New disclosure? No

Terms of service (Score: 1)

Are terms of service disclosed for each distribution channel?

Disclosure: StarCoder is distributed on the Hugging Face platform, which includes terms of service.

Note: We will award this point if there are terms-of-service that appear to apply to the bulk of the model’s distribution channels.

References: https://huggingface.co/terms-of-service

Justification: Not disclosed

New disclosure? No

Permitted and prohibited users (Score: 1)

Is a description of who can and cannot use the model disclosed?

Disclosure: A custom BigCode Open RAIL-M v1 License Agreement was created for BigCode models, and permits royalty-free commercial use. So long as the terms of the license are followed, nobody is explicitly excluded from using it.

Note: Such restrictions may relate to countries (e.g. US-only), organizations (e.g. no competitors), industries (e.g. no weapons industry users) or other relevant factors. These restrictions on users are often contained in multiple policies; we group them here for simplicity. We will award this point for a clear description of permitted, restricted, and prohibited users of the model.

References: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement

Justification: Not disclosed

New disclosure? No

Permitted, restricted, and prohibited uses (Score: 1)

Are permitted, restricted, and prohibited uses of the model disclosed?

Disclosure: A custom BigCode Open RAIL-M v1 License Agreement was created for BigCode models, and permits royalty-free commercial use. Use restrictions are included in Attachment A of the license.

Note: We will award this point if at least two of the following three categories are disclosed: (i) permitted uses, (ii) restricted uses, and (iii) prohibited uses. By restricted uses, we mean uses that require a higher level of scrutiny (such as permission from or a separate contract with the developer) to be permitted. These uses are generally included in an acceptable use policy, model license, or usage policy.

References: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement

Justification: Not disclosed

New disclosure? No

Usage policy enforcement (Score: 1)

Is the enforcement protocol for the usage policy disclosed?

Disclosure: BigCode is an open scientific collaboration of researchers from around the world, with the project stewards being Hugging Face and ServiceNow. This means that enforcement of the use restrictions in our license looks different than it might for a traditional company. As detailed in the FAQ for the BigCode OpenRAIL-M License, there are multiple provisions in the license itself that amount to an enforcement protocol. In the event of a violation of our license, which may occur via public platforms, we will consider all appropriate enforcement actions in accordance with our license. If anyone suspects or becomes aware of such violations, please notify us by contacting contact@bigcode-project.org, submitting an issue via our GitHub (https://github.com/bigcode-project/starcoder), or via the StarCoder Community page on Hugging Face, which is openly available at https://huggingface.co/bigcode/starcoder/discussions.

Note: By enforcement protocol, we refer to (i) mechanisms for identifying permitted and prohibited users, (ii) mechanisms for identifying permitted/restricted/prohibited uses, (iii) steps the developer takes to enforce its policies related to such uses, and (iv) the developer’s procedures for carrying out these steps. We will award this point for a reasonable best-effort attempt to provide the bulk of this information, though one line indicating the developer reserves the right to terminate accounts is insufficient. Alternatively, we will award this point if the developer reports that it does not enforce its usage policy.

References: Disclosed as part of FMTI v1.1

Justification: Not disclosed

New disclosure? Yes

Justification for enforcement action (Score: 1)

Do users receive a justification when they are subject to an enforcement action for violating the usage policy?

Disclosure: Justification for enforcement actions is given on the basis of the provisions of the license agreement. If the agreement were enforced, the specific provision of the agreement allegedly violated would be communicated to the user.

Note: For example, does the developer disclose a protocol for telling users which part of the usage policy they violated, when they did so, and what specifically was violative? Enforcement actions refer to measures to limit a user’s ability to use the model, such as banning a user or restricting their ability to purchase tokens. We will award this point if the developer discloses that it gives justification for enforcement actions or, alternatively, if it discloses that it does not provide justification for enforcement actions or that it does not enforce its usage policy.

References: Disclosed as part of FMTI v1.1

Justification: Not disclosed

New disclosure? Yes

Usage policy violation appeals mechanism (Score: 1)

Is a mechanism for appealing potential usage policy violations disclosed?

Disclosure: Users can make an appeal by contacting contact@bigcode-project.org, submitting an issue via our GitHub, or via the StarCoder Community page on Hugging Face, which is openly available. Example: https://github.com/bigcode-project/starcoder/issues/160

Note: We will award this point if the developer provides a usage policy violation appeals mechanism, regardless of whether it is provided via a user interface or distribution channel.

References: https://github.com/bigcode-project/starcoder; https://huggingface.co/bigcode/starcoder/discussions

Justification: Not disclosed

New disclosure? Yes

Permitted, restricted, and prohibited model behaviors (Score: 1)

Are model behaviors that are permitted, restricted, and prohibited disclosed?

Disclosure: No, BigCode does not provide a specific standalone policy on forbidden and permitted uses of the models. However, an explicit list of restricted uses of the model is present in the BigCode OpenRAIL-M license, in Attachment A.

Note: We refer to a policy that includes this information as a model behavior policy, or a developer's policy on what the foundation model can and cannot do (e.g. such a policy may prohibit a model from generating child sexual abuse material). We recognize that different developers may adopt different business models and that some business models may make enforcement of a model behavior policy more or less feasible. We will award this point if at least two of the three categories (i.e. permitted, restricted, and prohibited model behaviors) are disclosed. Alternatively, we will award this point if the developer reports that it does not impose any restrictions on its model's behavior.

References: Disclosed as part of FMTI v1.1

Justification: Not disclosed

New disclosure? Yes

Model behavior policy enforcement (Score: 1)

Is the enforcement protocol for the model behavior policy disclosed?

Disclosure: No, BigCode does not provide a specific standalone policy on forbidden and permitted uses of the models. However, an explicit list of restricted uses of the model is present in the BigCode OpenRAIL-M license, in Attachment A.

Note: By enforcement protocol, we refer to mechanisms for identifying whether model behavior is permitted or prohibited and actions that may arise in the event the model behavior policy is violated. For example, the developer may make updates to the model in response to issues with the model’s adherence to the model behavior policy. We will award this point if there is a clear description of the enforcement protocol, or if the developer reports that it does not enforce its model behavior policy or that it has no such restrictions on the model’s behavior.

References: Disclosed as part of FMTI v1.1

Justification: Not disclosed

New disclosure? Yes

Interoperability of usage and model behavior policies (Score: 1)

Is the way that the usage policy and the model behavior policy interoperate disclosed?

Disclosure: Both usage policy and model behavior policy are articulated and interoperate in a single instrument: the license agreement. More precisely, the agreement includes a Section on Conditions of Use and an Attachment A for use restrictions. Additional guidance is provided in the FAQ.

Note: For example, if a user attempts to use the model for a prohibited use such as spam, how does the model behavior policy apply if at all? We will also award this point if the developer reports that it does not impose any restrictions on its model's behavior in the event of usage policy violation.

References: Disclosed as part of FMTI v1.1; https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement ; https://www.bigcode-project.org/docs/pages/bigcode-openrail/

Justification: Not disclosed

New disclosure? Yes

User interaction with AI system (Score: 1)

For distribution channels with user-facing interfaces, are users notified (i) that they are interacting with an AI system, (ii) of the specific foundation model they are interacting with, and (iii) that outputs are machine-generated?

Disclosure: StarChat Playground includes "This demo showcases a series of StarChat language models, which are fine-tuned versions of the StarCoder family to act as helpful coding assistants. The base model has 16B parameters and was pretrained on one trillion tokens sourced from 80+ programming languages, GitHub issues, Git commits, and Jupyter notebooks (all permissively licensed)." Additional depth is covered in the blog post linked from this user-facing interface.

Note: A user-facing interface refers to the means by which the user interacts with the foundation model, including how the user can observe outputs from the foundation model and other notifications. We will award this point if, for all distribution channels with user-facing interfaces, the user is provided adequate transparency as to the foundation model being distributed and the potential presence of any model outputs.

References: https://huggingface.co/spaces/HuggingFaceH4/starchat-playground https://huggingface.co/blog/starchat-alpha

Justification: Not disclosed

New disclosure? No

Usage disclaimers (Score: 1)

For distribution channels with user-facing interfaces, are users provided with disclaimers involving model use?

Disclosure: "As a model designed for coding tasks, the user should not expect the model to output relevant answers when prompted with a general-purpose question. When it comes to coding requests, the output of the model should be post-processed before testing them." " Known Failure Modes: the alpha and beta version of StarChat have not been aligned to human preferences with techniques like RLHF, so they can produce problematic outputs (especially when prompted to do so). Since the base model was pretrained on a large corpus of code, it may produce code snippets that are syntactically valid but semantically incorrect. For example, it may produce code that does not compile or that produces incorrect results. It may also produce code that is vulnerable to security exploits. We have observed the model also has a tendency to produce false URLs which should be carefully inspected before clicking. For more details on the model’s limitations in terms of factuality and biases, see the model card."

Note: A user-facing interface refers to the means by which the user interacts with the foundation model, including how the user can observe outputs from the foundation model and other notifications. Usage disclaimers could include information about what constitutes a usage policy violations or how users should interpret model outputs. We will award this point if, for all distribution channels with user-facing interfaces, the user is provided with usage disclaimers.

References: https://huggingface.co/datasets/bigcode/ta-prompt https://huggingface.co/spaces/HuggingFaceH4/starchat-playground

Justification: Not disclosed

New disclosure? No

User data protection policy (Score: 1)

Are the protocols for how the developer stores, accesses, and shares user data disclosed?

Disclosure: Data is collected via the Hugging Face distribution channel for inference; see Hugging Face's privacy policy.

Note: We will also award this point if the developer reports that it has no user data protection policy.

References: https://huggingface.co/privacy

Justification: Not disclosed

New disclosure? No

Permitted and prohibited use of user data (Score: 1)

Are permitted and prohibited uses of user data disclosed?

Disclosure: Data is collected via the Hugging Face distribution channel for inference; see Hugging Face's privacy policy.

Note: Developers use user data for a range of purposes such as building future models, updating existing models, and evaluating both existing and future models. We will award this point if a developer discloses its policy on the use of user data from interactions associated with this model, including both permitted and prohibited uses. This may span different distribution channels if multiple channels supply user data to the developer. Alternatively, we will award this point if the developer reports it does not impose any limits on its use of user data.

References: https://huggingface.co/privacy

Justification: Not disclosed

New disclosure? No

Usage data access protocol (Score: 0)

Is a protocol for granting external entities access to usage data disclosed?

Disclosure: HF publicly shares model downloads via this page. BigCode is not part of Hugging Face's business of providing model inference services, so beyond the download data that is public, there is nothing else to share.

Note: Usage data refers to the data created through user interaction with the model, such as user inputs to the model and associated metadata such as the duration of the interaction. A usage data access protocol refers to the steps, requirements, and considerations involved in granting external entities access to usage data; this goes beyond stating the conditions under which related personal information may be shared with external entities. We will award this point for a clear description of the usage data access protocol or if the developer reports it does not share usage data with external entities.

References: https://huggingface.co/api/models?author=bigcode&expand%5B%5D=downloadsAllTime

Justification: Section 3 of Hugging Face's privacy policy does not make clear whether there is a mechanism for external entities to gain access to usage data.

New disclosure? No

Versioning protocol (Score: 1)

Is there a disclosed version and versioning protocol for the model?

Disclosure: A model is built from many components, and versioning information exists for several of them, especially the data stack. The model itself is hosted on GitHub and the Hugging Face platform, so the different versions are accessible from there. The Hugging Face API allows loading models by version: (revision (str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id; since we use a git-based system for storing models and other artifacts on huggingface.co, revision can be any identifier allowed by git.)
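Concretely, the git-based revision scheme pins an artifact to an immutable snapshot. The sketch below is a stdlib-only illustration of the Hub's public resolve-URL convention; `resolve_url` is a hypothetical helper introduced here for demonstration, and in practice the same `revision` string would be passed to `from_pretrained` in the Hugging Face libraries.

```python
# Sketch of how a git revision (branch name, tag name, or commit id)
# selects a fixed snapshot of a model repository on the Hugging Face
# Hub. `resolve_url` is a hypothetical helper mirroring the Hub's
# public https://huggingface.co/<repo>/resolve/<revision>/<file> scheme.

def resolve_url(repo_id, filename, revision="main"):
    """Build the URL that serves `filename` at a fixed `revision`."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

# "main" tracks the latest state of the repository and may change over time...
latest = resolve_url("bigcode/starcoder", "config.json")

# ...while a commit id is immutable, so the referenced artifact is
# guaranteed not to change between downloads (hypothetical commit id shown).
pinned = resolve_url("bigcode/starcoder", "config.json", revision="abc1234")

print(latest)
print(pinned)
```

Pinning to a commit id rather than a branch is what guarantees the property the Note describes: the model does not change when referring to a fixed version.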

Note: By versioning, we mean that each instance of the model is uniquely identified and that the model is guaranteed to not change when referring to a fixed version number; alternatively, the version clearly indicating a specific instance of the model may be able to change by noting that it is the "latest" or an "unstable" version. We recognize that different developers may adopt different versioning practices that may differ from standard semantic versioning practices used elsewhere in software engineering.

References: https://huggingface.co/datasets/bigcode/the-stack; https://huggingface.co/bigcode/starcoder/tree/main

Justification: Not disclosed

New disclosure? No

Change log (Score: 1)

Is there a disclosed change log for the model?

Disclosure: The commit history is available on the Hugging Face platform.

Note: By change log, we mean a description associated with each change to the model (which should be indicated by a change in version number). We recognize that different developers may adopt different practices for change logs that may differ from practices used elsewhere in software engineering. We will award this point if the change log provides a clear description of changes that is legible to a technical audience.

References: https://huggingface.co/bigcode/starcoder/commits/main

Justification: Not disclosed

New disclosure? No

Deprecation policy (Score: 1)

Is there a disclosed deprecation policy for the developer?

Disclosure: We release the weights, so there is no deprecation policy.

Note: By deprecation policy, we refer to a description of what it means for a model to be deprecated and how users should respond to the deprecation (e.g. instructions to migrate to a newer version). We will award this point for a clear disclosure of a deprecation policy or if there is no risk of deprecation (e.g. if the developer openly releases model weights).

References: Disclosed as part of FMTI v1.1

Justification: Not disclosed

New disclosure? Yes

Feedback mechanism (Score: 1)

Is a feedback mechanism disclosed?

Disclosure: BigCode is an open-scientific collaboration, and there are multiple ways the global community can provide feedback, including BigCode Slack, GitHub Issues for StarCoder, Hugging Face community section for StarCoder, email to contact@bigcode-project.org, and on social media via the X platform. Section 2.1.1 of the BigCode Governance Card also states "How can a data subject request that their data be removed: we provide an opt-out form that lets people opt out of having any code or text they put on GitHub be included in The Stack. Additionally, anyone who is concerned about specific data they have encountered in The Stack, for example relating to PII, malicious code, or code that has an incorrect license or attribution can email contact@bigcode-project.org. At the time of the data processing for the StarCoder model training, 44 people had opted out of The Stack and associated repositories were removed." View the various links to see active public engagement and feedback provided by the StarCoder community and user base.

Note: By feedback mechanism, we refer to a means for external entities to report feedback or issues that arise in relation to the foundation model. Such entities may include but are not necessarily limited to users. We will award this point if the developer discloses a feedback mechanism that has been implemented.

References: https://www.bigcode-project.org/ contact@bigcode-project.org https://arxiv.org/pdf/2305.06161.pdf (section on opt-out) https://arxiv.org/pdf/2312.03872.pdf https://github.com/bigcode-project/starcoder/issues https://huggingface.co/bigcode/starcoder/discussions https://twitter.com/BigCodeProject/with_replies

Justification: Not disclosed

New disclosure? No

Feedback summary (Score: 1)

Is a report or summary disclosed regarding the feedback the developer received or, alternatively, the way the developer responded to that feedback?

Disclosure: Issues are public on GitHub, and discussions are visible on the Hugging Face platform.

Note: We recognize that there does not exist an authoritative or consensus standard for what is required in a feedback report. For this reason, we will award this point if there is a meaningful, though potentially vague or incomplete, summary of feedback received.

References: https://github.com/bigcode-project/starcoder/issues; https://huggingface.co/bigcode/starcoder/discussions

Justification: Hugging Face discussions amount to feedback summaries

New disclosure? No

Government inquiries (Score: 1)

Is a summary of government inquiries related to the model received by the developer disclosed?

Disclosure: This was disclosed in the Governance Card timeline: "On February 1, 2022, members of the BigCode core team were invited to meet with the European Parliament Innovation Lab. (Benetou, 2022a) At that meeting we shared details (Benetou, 2022b) of the project and answered questions from members of the Lab. Engaging with policymakers and regulators is an important part of the journey to inform and educate key stakeholders from the broader AI ecosystem." This is the only request to date.

Note: Such government inquiries might include requests for user data, requests that certain content be banned, or requests for information about a developer’s business practices. We recognize that there does not exist an authoritative or consensus standard for what is required for such a summary of government inquiries. For this reason, we will award this point if (i) there is a meaningful, though potentially vague or incomplete, summary of government inquiries, or (ii) a summary of government inquiries related to user data.

References: https://huggingface.co/datasets/bigcode/governance-card

Justification: Not disclosed

New disclosure? Yes

Monitoring mechanism (Score: 0)

For each distribution channel, is a monitoring mechanism for tracking model use disclosed?

Disclosure: The Hugging Face platform tracks and reports downloads and use of the models on each model page. At the time of answering this question on Feb 12, 2024, there were 18,712 downloads in the last month.

Note: By monitoring mechanism, we refer to a specific protocol for tracking model use that goes beyond an acknowledgement that usage data is collected. We will also award this point for a reasonable best-effort attempt to describe monitoring mechanisms, or if a developer discloses that a distribution channel is not monitored.

References: https://huggingface.co/bigcode/starcoder

Justification: While this disclosure is useful, it does not describe a monitoring mechanism for tracking use of the models, only aggregate download counts.

New disclosure? No

Downstream applications (Score: 1)

Across all forms of downstream use, is the number of applications dependent on the foundation model disclosed?

Disclosure: The Hugging Face platform tracks and reports downloads and use of the models on each model page. At the time of answering this question on Feb 12, 2024, there were 118 Spaces using bigcode/starcoder.

Note: We recognize that there does not exist an authoritative or consensus standard for what qualifies as an application. We will award this point if there is a meaningful estimate of the number of downstream applications, along with some description of what it means for an application to be dependent on the model.

References: https://huggingface.co/bigcode/starcoder

Justification: Not disclosed

New disclosure? No

Affected market sectors (Score: 0)

Across all downstream applications, is the fraction of applications corresponding to each market sector disclosed?

Disclosure: Not disclosed

Note: By market sector, we refer to an identifiable part of the economy. While established standards exist for describing market sectors, we recognize that developers may provide vague or informal characterizations of market impact. We will award this point if there is a meaningful, though potentially vague or incomplete, summary of affected market sectors.

References: Not disclosed

Justification: Not disclosed

New disclosure? No

Affected individuals (Score: 0)

Across all forms of downstream use, is the number of individuals affected by the foundation model disclosed?

Disclosure: Not disclosed

Note: By affected individuals, we principally mean the number of potential users of applications. We recognize that there does not exist an authoritative or consensus standard for what qualifies as an affected individual. We will award this point if there is a meaningful estimate of the number of affected individuals along with a clear description of what it means for an individual to be affected by the model.

References: Not disclosed

Justification: Not disclosed

New disclosure? No

Usage reports (Score: 0)

Is a usage report that gives usage statistics describing the impact of the model on users disclosed?

Disclosure: Hugging Face displays the metrics for downloads of the models and datasets. The ServiceNow text-to-code LLM is a fine-tuned version of StarCoder. From the ServiceNow 2023 Q4 Earnings Call transcript: “Our text to workflow capability dramatically increases developer productivity. ServiceNow's developers have been using text to code for several months. They are generating high-quality code using text to describe the type of code they want. This has increased our developer innovation speed by 52%.”

Note: We recognize that there does not exist an authoritative or consensus standard for what is required in a usage report. Usage statistics might include, for example, a description of the major categories of harm that has been caused by use of the model. We will award this point if there is a meaningful, though potentially vague or incomplete, summary of usage statistics.

References: https://huggingface.co/bigcode/starcoder

Justification: While this disclosure is useful, it does not include a meaningful summary of usage statistics

New disclosure? No

Geographic statistics (Score: 0)

Across all forms of downstream use, are statistics of model usage across geographies disclosed?

Disclosure: Not disclosed

Note: We will award this point if there is a meaningful, though potentially incomplete or vague, disclosure of geographic usage statistics at the country-level.

References: Not disclosed

Justification: Not disclosed

New disclosure? No

Redress mechanism (Score: 0)

Is any mechanism to provide redress to users for harm disclosed?

Disclosure: The model is provided under an open RAIL license, which includes a Limitation of Liability.

Note: We will also award this point if the developer reports it does not have any such redress mechanism.

References: Not disclosed

Justification: While a limitation of liability implies there may be no redress mechanism, this is not sufficiently clear.

New disclosure? No

Centralized documentation for downstream use (Score: 1)

Is documentation for downstream use consolidated in a centralized artifact?

Disclosure: See the list of ecosystem resources on the BigCode website and on Hugging Face.

Note: Centralized documentation for downstream use refers to an artifact, or closely-linked artifacts, that consolidate relevant information for making use of or repurposing the model. Examples of these kinds of artifacts include a website with dedicated documentation information, a github repository with dedicated documentation information, and an ecosystem card. We recognize that different developers may take different approaches to centralizing information. We will award this point if there is a clearly-identified artifact(s) that contains the majority of substantive information (e.g. capabilities, limitations, risks, evaluations, distribution channels, model license, usage policies, model behavior policies, feedback and redress mechanisms, dependencies).

References: Not disclosed

Justification: Not disclosed

New disclosure? No

Documentation for responsible downstream use (Score: 1)

Is documentation for responsible downstream use disclosed?

Disclosure: Examples from the model card: See the intended use page for guidelines on how to correctly use the model, since it is different from instruction-tuned models and better suited to being a coding assistant. See the attribution section for guidelines and best practices on attribution for code generated by the LLM. Examples from the paper: "Deployments of StarCoder need to further challenge and adapt the model to prevent such behavior, e.g., through red-teaming (Perez et al., 2022), adversarial testing (Wan et al., 2023), and/or by adding a robust safety layer (OpenAI, 2023b)." "To make these models more accessible to a wider audience, future research should investigate the performance and limitations of Code LLMs on other natural languages." "Generations from the model can be quickly checked to approximately assess the degree of overlap with the training corpus. The VSCode extension supports using this as a rapid, first-pass attribution method. However, this requires that matching strings are longer than a minimum size and does not attempt to filter common or generic code snippets. After the first pass check, users can use the full search index to further assess attribution."

Note: Such documentation might include details on how to adjust API settings to promote responsible use, descriptions of how to implement mitigations, or guidelines for responsible use. We will also award this point if the developer states that it does not provide any such documentation. For example, the developer might state that the model is offered as is and downstream developers are accountable for using the model responsibly.

References: https://huggingface.co/bigcode/starcoderbase#intended-use; https://huggingface.co/bigcode/starcoderbase#attribution--other-requirements

Justification: Not disclosed

New disclosure? No