ETHICS: Data Scientist

Roberto Alamilla
Dec 11, 2020

While there is no single definition of data science, it can be broadly thought of as the scientific, computational and analytical methods used to process and extract information from data. It is closely associated with ‘big data’, machine learning (artificial intelligence), and data pipelines (automated systems that capture and process data). Data science brings together several fields, including maths, statistics and computer science, and often relies on third-party platforms (AWS, GCP, VMware, etc.).

The ethical significance of data science, and its implications for industries and the wider public, is constantly evolving. As data science methods become more common within statistical and actuarial fields, there are both opportunities and challenges for individuals working in data science (‘practitioners’). For example, managing privacy, fairness and bias can be difficult and complex when using algorithmic methods. Additionally, public perceptions are still developing around many aspects of data science, including the use of artificial intelligence (AI) in systems and decision making, and ‘big data’ sources about people, such as social media and mobile phone data. Keeping up with these challenging ethical issues is therefore important. While agreement on good practice was found across many aspects, some of the more complex issues will require further thought and input from the professions.

Five principles you may want to consider:

1. Seek to enhance the value of data science for society.

As the impact that data science can have on society could be significant, an important ethical consideration is what the potential implications could be for society as a whole. A common principle within ethical frameworks discussing data science and AI is for practitioners to seek outcomes within their work which support the improvement of public wellbeing. This could involve practitioners seeking to share the benefits of data science and balancing this with the wellbeing of potentially affected individuals.

Considering the potential impact that models have on decisions (especially in relation to people):

• Understanding how the models will be used, what impact there may be and for whom

• Understanding who the relevant stakeholders are

• Understanding the potential biases, errors, assumptions and risks inherent in predictive modeling

  • Inviting peer and bias review
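One way to make a bias review concrete is to compare how often a model gives a favorable outcome to different groups. The sketch below is only an illustration under stated assumptions: the function names, the group labels and the example data are hypothetical, and the gap in positive-prediction rates (often called demographic parity difference) is just one of several possible fairness measures, not a complete review.

```python
from collections import defaultdict


def positive_rate_by_group(predictions, groups):
    """Return the share of positive (1) predictions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = positive_rate_by_group(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical example data: 1 = approved, 0 = declined.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = positive_rate_by_group(preds, groups)
gap = demographic_parity_gap(preds, groups)
```

A large gap does not prove unfairness on its own, but it is a reasonable prompt to ask the questions listed above and to invite a peer review.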

Seeking to act in the public interest:

• Choosing or seeking work that is considered to have a benefit to the public/society

• Understanding how the benefits may be distributed across society

• Considering who or what could be affected by outcomes

• Thinking about what could mitigate any risks for unfairness

  • Leaders encouraging a culture which values social justice and fairness

2. Avoid harm.

Data science has the potential to cause harm and this ethical consideration therefore focuses on how practitioners can avoid this by working in a manner that respects the privacy, equality and autonomy of individuals and groups, and speaking up about potential harm or ethical violations. Practitioners may be subject to legal and regulatory obligations in relation to the privacy of individuals, relevant to the jurisdiction in which they are working, as well as regulatory obligations to speak up about harm or violations of legal requirements. This can also be applied to work relating to businesses, animals or the environment, with consideration of commercial rights, animal welfare and the protection of environmental resources.

Using data that is ‘ethically sourced’, for example where data subjects have knowingly given their data and the perceptions of data subjects have been considered:

• Understanding and asking questions about the origin of the data, such as how it was collected

• Investigating the attached consents and legal uses for the data

• Considering the privacy, dignity and fair treatment of individuals when selecting data

  • Leaders being aware of and understanding the implications of using ‘big data’

Building privacy and ethics into work, as well as legal and regulatory requirements:

• Following good data handling practices, including data security, and considering ways to reduce the need for collecting, storing and utilizing personally identifiable information

• Considering the impact of deriving demographics or linking with other data

  • Being transparent and providing evidence of privacy considerations to the public and/or regulators
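As an illustration of reducing reliance on personally identifiable information, the sketch below drops direct identifiers from a record and replaces the linking key with a keyed hash. The field names, the `strip_direct_identifiers` helper and the secret key are hypothetical. This is pseudonymization rather than anonymization, so the output can still count as personal data in some jurisdictions and the key must be stored securely.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_direct_identifiers(record, id_field, drop_fields, secret_key):
    """Drop fields that are not needed and pseudonymize the linking key."""
    cleaned = {k: v for k, v in record.items() if k not in drop_fields}
    cleaned[id_field] = pseudonymize(str(record[id_field]), secret_key)
    return cleaned


# Hypothetical record and key, for illustration only.
key = b"replace-with-a-securely-stored-key"
record = {"email": "person@example.com", "name": "A. Person", "age_band": "30-39"}

safe = strip_direct_identifiers(
    record, id_field="email", drop_fields={"name"}, secret_key=key
)
```

Because the same input and key always give the same token, records can still be linked across tables without exposing the original identifier.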

Embedding ethical risk management into work:

• Using ethical assessment checkpoints

• Seeking independent or domain-expert advice on assumptions and risks

• Appropriately monitoring and communicating ethical risks and potential biases to relevant stakeholders

• Regular ethical reviews of models/algorithms

• Applying organizational or industrial ethical policies

  • Leaders putting clear ethical governance in place for data science work

Educating the workforce to recognize the risks for harm and/or unfairness:

• Communicating to others what data science can and cannot do

• Helping non-technical individuals understand the ethical, professional and technical issues relevant to a project

  • Leaders ensuring that their organization as a whole understands the ethical principles and policies surrounding data science.

Considering the impact on the environment and its resources:

• Fully understanding error, bias and risks associated with models applied to environmental problems

  • Evaluating the energy-cost of storing and processing large volumes of data, and considering different options such as cloud computing, aggregating data and regularly reviewing the need for data

3. Apply and maintain professional competence.

This ethical principle expects data science practitioners to apply best practice and comply with all relevant legal and regulatory requirements, as well as applicable professional body codes.

Professional competence involves fully understanding the sources of error and bias in data, using ‘clean’ data (e.g. edited for missing, inconsistent or erroneous values), and supporting work with robust statistical and algorithmic methods that are appropriate to the question being asked. Practitioners should also thoroughly assess and balance the benefits of the work against the risks it poses, and keep models under regular review.

Complying with relevant professional and regulatory practices:

• Investigating what the relevant professional and legal/regulatory requirements are to the work in which you are involved

• Ensuring that the work complies with these requirements and justifying this to relevant authorities

  • Being aware of any evolving legal requirements and communicating these appropriately to relevant stakeholders

Ensuring the business’s ethics policies, procedures and governance are applied to data science work:

• Providing executive staff with enough information on the advantages and limitations of the work to make decisions about the use of models

• Understanding and communicating to stakeholders and decision-makers the ethical risks of the project, such as bias, uncertainty, quality issues, individual/commercial harm, methodology assumptions and disadvantages of chosen methods

• Leaders setting clear lines of responsibility

  • Educating the workforce appropriately about relevant policies and procedures

Following best practice for data analysis:

• Using appropriate statistical and algorithmic methods for the question being addressed

  • Leaders taking the level of uncertainty in work into account when using data science in any decision making

Monitoring and maintaining models:

• Having clear owners for models

• Monitoring the performance of models

• Having regular reviews of models

  • Having an agreed action plan for updating or retiring models that are no longer fit for purpose
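The monitoring points above can be sketched as a small review record that names an owner and flags a model when its accuracy falls too far below an agreed baseline. The class name, the fields and the 0.05 tolerance are assumptions chosen for illustration. Accuracy is used only because it is simple, and a real review would likely track several metrics.

```python
from dataclasses import dataclass


@dataclass
class ModelReview:
    model_name: str
    owner: str                 # the person accountable for the model
    baseline_accuracy: float   # accuracy agreed at sign-off
    max_drop: float = 0.05     # hypothetical tolerance before a review is triggered

    def check(self, y_true, y_pred):
        """Return (accuracy, needs_review) for a batch of labeled outcomes."""
        if not y_true or len(y_true) != len(y_pred):
            raise ValueError("y_true and y_pred must be non-empty and equal length")
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        accuracy = correct / len(y_true)
        needs_review = accuracy < self.baseline_accuracy - self.max_drop
        return accuracy, needs_review


# Hypothetical model and recent outcomes.
review = ModelReview("churn-model", owner="model.owner@example.com", baseline_accuracy=0.9)
accuracy, needs_review = review.check([1, 0, 1, 1], [1, 0, 0, 0])
```

When `needs_review` is true, the agreed action plan for updating or retiring the model would apply.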

Validating and improving work:

• Thoroughly quality-assuring data and analysis, including for bias, error, uncertainty

• Encouraging challenge from others involved in the project

• Sharing appropriate code, diagnostics and results with colleagues/customers to ensure they sufficiently understand the models

• Publishing results publicly where possible

  • Seeking feedback from peers, academia or subject matter experts

4. Seek to preserve or increase trustworthiness.

The public’s trust and confidence in the work of data scientists can be affected by the way ethical principles are applied. Practitioners can help to increase the trustworthiness of their work by considering ethical principles throughout all stages of a project. This is another recurring theme that encourages practitioners to be transparent and honest when communicating about the way data is used. Transparency can include fully explaining how algorithms are being used, if and why any decisions have been delegated, and being open about the risks and biases. Engaging widely with a diverse range of stakeholders and considering public perceptions, both from the outset and throughout projects, can help to build trustworthiness and ensure that potential biases are understood.

Engaging with appropriate ethical bodies:

  • Leaders in this area may wish to communicate with ethical bodies and the public about how data is used and handled in relation to data science work

Avoiding unnecessary complexity in methods to improve transparency:

• Considering simpler models and documenting the performance of different models tested

  • Disclosing the reasons for any improvements in accuracy if one model is favored over another
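A minimal sketch of this idea is to record every model that was tested and prefer the simplest one whose score is within an agreed tolerance of the best. The `choose_model` helper, the complexity ranks, the scores and the 0.01 tolerance are all hypothetical. How to rank complexity is a judgment call that should itself be documented.

```python
def choose_model(results, tolerance=0.01):
    """Pick the least complex model whose score is within `tolerance` of the best.

    `results` is a list of dicts with keys "name", "complexity" (lower is
    simpler) and "score" (higher is better).
    """
    best_score = max(r["score"] for r in results)
    candidates = [r for r in results if best_score - r["score"] <= tolerance]
    return min(candidates, key=lambda r: r["complexity"])


# Hypothetical log of the models tested on one project.
tested = [
    {"name": "logistic regression", "complexity": 1, "score": 0.840},
    {"name": "gradient boosting", "complexity": 3, "score": 0.845},
    {"name": "neural network", "complexity": 5, "score": 0.800},
]

chosen = choose_model(tested)
```

Keeping the full `tested` list alongside the chosen model gives reviewers a record of what was tried and why a more complex option was or was not preferred.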

Building trust through transparent communication with the public:

• Communicating the ways in which data is used and the benefits to customers/wider public

• Taking accountability for involvement in the work

• Considering the views of the public and data subjects along with associated consents and legal uses of data

• Publishing and sharing methods with relevant internal stakeholders, customers and the wider public, clearly referencing limitations, bias and uncertainty

  • Being honest and objective in communications

5. Maintain accountability and oversight.

Another key issue in data ethics around automation and AI is the question of how practitioners maintain human accountability and oversight within their work. Being accountable can include being mindful of how and when to delegate any decision making to systems, and having governance in place to ensure systems deliver the intended objectives. When deciding to delegate any decision making, it would be useful to fully understand and explain the potential implications of doing so, as the work could lead to introducing advanced AI systems which do not have adequate governance. Practitioners should note that delegating any decisions to these systems does not remove any of their individual responsibilities.

Maintaining human oversight of automated solutions:

• Implementing model governance by deciding how to monitor the model over time, such as setting review points and sign-off

• Agreeing where responsibility lies for models in production (approval, reviews, longer-term quality assurance)

  • Agreeing a review process

Clearly defining operating constraints:

• Adding clear constraints within systems and corresponding documentation to control and communicate what systems can and cannot do
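One possible way to encode such constraints is a wrapper that only lets the model decide when the inputs fall inside the ranges it was validated on, and otherwise refers the case to a person. The `guarded_predict` function, the field names and the ranges below are hypothetical, and the stand-in model is a plain function used only for illustration.

```python
def guarded_predict(model, features, valid_ranges):
    """Return an automated decision only when inputs are within validated ranges."""
    for name, (low, high) in valid_ranges.items():
        value = features.get(name)
        if value is None or not (low <= value <= high):
            return {
                "decision": None,
                "status": "refer_to_human",
                "reason": f"{name} outside validated range",
            }
    return {"decision": model(features), "status": "automated", "reason": ""}


# Hypothetical stand-in model and validated input ranges.
def toy_model(features):
    return "approve" if features["income"] >= 30000 else "decline"


ranges = {"age": (18, 80), "income": (0, 200000)}

inside = guarded_predict(toy_model, {"age": 40, "income": 50000}, ranges)
outside = guarded_predict(toy_model, {"age": 95, "income": 50000}, ranges)
```

The `valid_ranges` dictionary doubles as documentation of what the system can and cannot do, which keeps the code and the written constraints in step.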

Public involvement:

• Using public engagement and involvement to feed into projects, where appropriate

  • Checking what public concerns may be

Governance and accountability structures:

• Having ethics oversight committees where appropriate

  • Having clear processes for accountability so that stakeholders have a clear route to any redress

Data science gives one the ability to find patterns in data and to create new data products for the greater social good, and it is generally considered to be ethically neutral. It does not come with its own view of what is correct or incorrect, or of what is good or bad in using it. While data science has no value framework, organizations do have value systems in place. By asking and seeking answers to various ethical questions, one can ensure it is used in a way that aligns with the organization’s ethics and values.

The future will be increasingly driven by machine learning, and data science sits at the center of that future. Machines are fueled by the data they are trained on. Every advertisement, every self-driving car and every medical diagnosis provided by a machine will be based on certain data. Data ethics is a rapidly evolving field of study. Increasingly, those collecting, sharing and working with data are examining the ethics of their methods, and in some cases they are being forced to confront those ethics in the face of public criticism. A failure to handle data ethically can severely impact people and could lead to a loss of trust in projects, products or organizations.

More principles will indeed be needed as more powerful technologies become available. Data scientists, data engineers, database administrators and everyone associated with controlling data should have a voice in the ethical debate about how data should be used. Organizations should openly address these difficulties in both formal and informal settings.

