I. Tech Company Layoffs Dataset Analysis
1. Background
According to estimates from the layoff tracker Layoffs.fyi, more than 400,000 tech jobs have been cut since the start of 2022.
A recently released AI talent report finds that nearly one in two of the world's top AI researchers was trained in China; as competition has intensified, American software engineers have been pushed into "hard mode," sitting through as many as 12 interview rounds for a single offer.[1]
This dataset contains 1,400+ records of tech company layoffs spanning March 2020 to January 2024.

2. Data Description

Field | Description |
---|---|
Company | Company name |
Location_HQ | Headquarters city |
Country | Country |
Continent | Continent |
Laid_Off | Number of employees laid off |
Date_layoffs | Layoff date |
Percentage | Layoff percentage |
Company_Size_before_Layoffs | Company size before layoffs |
Company_Size_after_layoffs | Company size after layoffs |
Industry | Industry |
Stage | Funding stage |
Money_Raised_in_$_mil | Money raised (millions of USD) |
Year | Record year |

Funding stage values:

Stage | Description |
---|---|
Series A-Z | Series A through Z |
Post IPO | Publicly listed |
Acquired | Acquired |
Private Equity | Private equity |
3. Data Sources
www.kaggle.com/datasets/ul...
layoffs.fyi/
4. Questions to Explore
- Global and regional layoff trends
- Layoffs by industry
- Relationship between company size and layoffs
- Funding stage and layoff risk
II. Data Inspection
1. Load the Data
python
import pandas as pd

data = pd.read_excel('tech_layoffs.xlsx')
data.head()
| | # | Company | Location_HQ | Country | Continent | Laid_Off | Date_layoffs | Percentage | Company_Size_before_Layoffs | Company_Size_after_layoffs | Industry | Stage | Money_Raised_in__mil | Year | lat | lng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ShareChat | Bengaluru | India | Asia | 200 | 2023-12-20 | 15.0 | 1333 | 1133 | Consumer | Series H | 1700 | 2023 | 12.97194 | 77.59369 |
| 1 | 4 | InSightec | Haifa | Israel | Asia | 100 | 2023-12-19 | 20.0 | 500 | 400 | Healthcare | Unknown | 733 | 2023 | 32.81841 | 34.98850 |
| 2 | 6 | Enphase Energy | San Francisco Bay Area | USA | North America | 350 | 2023-12-18 | 10.0 | 3500 | 3150 | Energy | Post-IPO | 116 | 2023 | 37.54827 | -121.98857 |
| 3 | 7 | Udaan | Bengaluru | India | Asia | 100 | 2023-12-18 | 10.0 | 1000 | 900 | Retail | Unknown | 1500 | 2023 | 12.97194 | 77.59369 |
| 4 | 14 | Cruise | San Francisco Bay Area | USA | North America | 900 | 2023-12-14 | 24.0 | 3750 | 2850 | Transportation | Acquired | $15000 | 2023 | 37.77493 | -122.41942 |
2. Convert the Date Field
Inspection shows that Date_layoffs holds the layoff date; convert this field to the datetime type.
python
help(pd.to_datetime)
text
Help on function to_datetime in module pandas.core.tools.datetimes:
to_datetime(arg: 'DatetimeScalarOrArrayConvertible', errors: 'str' = 'raise', dayfirst: 'bool' = False, yearfirst: 'bool' = False, utc: 'bool | None' = None, format: 'str | None' = None, exact: 'bool' = True, unit: 'str | None' = None, infer_datetime_format: 'bool' = False, origin='unix', cache: 'bool' = True) -> 'DatetimeIndex | Series | DatetimeScalar | NaTType | None'
Convert argument to datetime.
Parameters
----------
arg : int, float, str, datetime, list, tuple, 1-d array, Series, DataFrame/dict-like
The object to convert to a datetime.
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
- If 'raise', then invalid parsing will raise an exception.
- If 'coerce', then invalid parsing will be set as NaT.
- If 'ignore', then invalid parsing will return the input.
dayfirst : bool, default False
Specify a date parse order if `arg` is str or its list-likes.
If True, parses dates with the day first, eg 10/11/12 is parsed as
2012-11-10.
Warning: dayfirst=True is not strict, but will prefer to parse
with day first (this is a known bug, based on dateutil behavior).
yearfirst : bool, default False
Specify a date parse order if `arg` is str or its list-likes.
- If True parses dates with the year first, eg 10/11/12 is parsed as
2010-11-12.
- If both dayfirst and yearfirst are True, yearfirst is preceded (same
as dateutil).
Warning: yearfirst=True is not strict, but will prefer to parse
with year first (this is a known bug, based on dateutil behavior).
utc : bool, default None
Return UTC DatetimeIndex if True (converting any tz-aware
datetime.datetime objects as well).
format : str, default None
The strftime to parse time, eg "%d/%m/%Y", note that "%f" will parse
all the way up to nanoseconds.
See strftime documentation for more information on choices:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.
exact : bool, True by default
Behaves as:
- If True, require an exact format match.
- If False, allow the format to match anywhere in the target string.
unit : str, default 'ns'
The unit of the arg (D,s,ms,us,ns) denote the unit, which is an
integer or float number. This will be based off the origin.
Example, with unit='ms' and origin='unix' (the default), this
would calculate the number of milliseconds to the unix epoch start.
infer_datetime_format : bool, default False
If True and no `format` is given, attempt to infer the format of the
datetime strings based on the first non-NaN element,
and if it can be inferred, switch to a faster method of parsing them.
In some cases this can increase the parsing speed by ~5-10x.
origin : scalar, default 'unix'
Define the reference date. The numeric values would be parsed as number
of units (defined by `unit`) since this reference date.
- If 'unix' (or POSIX) time; origin is set to 1970-01-01.
- If 'julian', unit must be 'D', and origin is set to beginning of
Julian Calendar. Julian day number 0 is assigned to the day starting
at noon on January 1, 4713 BC.
- If Timestamp convertible, origin is set to Timestamp identified by
origin.
cache : bool, default True
If True, use a cache of unique, converted dates to apply the datetime
conversion. May produce significant speed-up when parsing duplicate
date strings, especially ones with timezone offsets. The cache is only
used when there are at least 50 values. The presence of out-of-bounds
values will render the cache unusable and may slow down parsing.
.. versionchanged:: 0.25.0
- changed default value from False to True.
Returns
-------
datetime
If parsing succeeded.
Return type depends on input:
- list-like: DatetimeIndex
- Series: Series of datetime64 dtype
- scalar: Timestamp
In case when it is not possible to return designated types (e.g. when
any element of input is before Timestamp.min or after Timestamp.max)
return will have datetime.datetime type (or corresponding
array/Series).
See Also
--------
DataFrame.astype : Cast argument to a specified dtype.
to_timedelta : Convert argument to timedelta.
convert_dtypes : Convert dtypes.
Examples
--------
Assembling a datetime from multiple columns of a DataFrame. The keys can be
common abbreviations like ['year', 'month', 'day', 'minute', 'second',
'ms', 'us', 'ns']) or plurals of the same
>>> df = pd.DataFrame({'year': [2015, 2016],
... 'month': [2, 3],
... 'day': [4, 5]})
>>> pd.to_datetime(df)
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
If a date does not meet the `timestamp limitations
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
#timeseries-timestamp-limits>`_, passing errors='ignore'
will return the original input instead of raising any exception.
Passing errors='coerce' will force an out-of-bounds date to NaT,
in addition to forcing non-dates (or non-parseable dates) to NaT.
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
datetime.datetime(1300, 1, 1, 0, 0)
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='coerce')
NaT
Passing infer_datetime_format=True can often-times speedup a parsing
if its not an ISO8601 format exactly, but in a regular format.
>>> s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000'] * 1000)
>>> s.head()
0 3/11/2000
1 3/12/2000
2 3/13/2000
3 3/11/2000
4 3/12/2000
dtype: object
>>> %timeit pd.to_datetime(s, infer_datetime_format=True) # doctest: +SKIP
100 loops, best of 3: 10.4 ms per loop
>>> %timeit pd.to_datetime(s, infer_datetime_format=False) # doctest: +SKIP
1 loop, best of 3: 471 ms per loop
Using a unix epoch time
>>> pd.to_datetime(1490195805, unit='s')
Timestamp('2017-03-22 15:16:45')
>>> pd.to_datetime(1490195805433502912, unit='ns')
Timestamp('2017-03-22 15:16:45.433502912')
.. warning:: For float arg, precision rounding might happen. To prevent
unexpected behavior use a fixed-width exact type.
Using a non-unix epoch origin
>>> pd.to_datetime([1, 2, 3], unit='D',
... origin=pd.Timestamp('1960-01-01'))
DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'],
dtype='datetime64[ns]', freq=None)
In case input is list-like and the elements of input are of mixed
timezones, return will have object type Index if utc=False.
>>> pd.to_datetime(['2018-10-26 12:00 -0530', '2018-10-26 12:00 -0500'])
Index([2018-10-26 12:00:00-05:30, 2018-10-26 12:00:00-05:00], dtype='object')
>>> pd.to_datetime(['2018-10-26 12:00 -0530', '2018-10-26 12:00 -0500'],
... utc=True)
DatetimeIndex(['2018-10-26 17:30:00+00:00', '2018-10-26 17:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
python
data['Date_layoffs']=pd.to_datetime(data['Date_layoffs'])
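The conversion above assumes every value parses cleanly. As a hedged sketch on toy strings (not the real dataset), `errors="coerce"` from the docstring above maps unparseable entries to `NaT` instead of raising:

```python
import pandas as pd

# Toy strings, not the real dataset: one value is deliberately malformed
s = pd.Series(["2023-12-20", "2023-12-19", "not-a-date"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(s, errors="coerce")
print(parsed.dtype)         # datetime64[ns]
print(parsed.isna().sum())  # 1
```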
python
data.head()
| | # | Company | Location_HQ | Country | Continent | Laid_Off | Date_layoffs | Percentage | Company_Size_before_Layoffs | Company_Size_after_layoffs | Industry | Stage | Money_Raised_in__mil | Year | lat | lng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ShareChat | Bengaluru | India | Asia | 200 | 2023-12-20 | 15.0 | 1333 | 1133 | Consumer | Series H | 1700 | 2023 | 12.97194 | 77.59369 |
| 1 | 4 | InSightec | Haifa | Israel | Asia | 100 | 2023-12-19 | 20.0 | 500 | 400 | Healthcare | Unknown | 733 | 2023 | 32.81841 | 34.98850 |
| 2 | 6 | Enphase Energy | San Francisco Bay Area | USA | North America | 350 | 2023-12-18 | 10.0 | 3500 | 3150 | Energy | Post-IPO | 116 | 2023 | 37.54827 | -121.98857 |
| 3 | 7 | Udaan | Bengaluru | India | Asia | 100 | 2023-12-18 | 10.0 | 1000 | 900 | Retail | Unknown | 1500 | 2023 | 12.97194 | 77.59369 |
| 4 | 14 | Cruise | San Francisco Bay Area | USA | North America | 900 | 2023-12-14 | 24.0 | 3750 | 2850 | Transportation | Acquired | $15000 | 2023 | 37.77493 | -122.41942 |
3. Drop Meaningless Columns
From data.info() we can see:
- There are no missing values (all 1418 entries are non-null).
- The #, lat, and lng columns carry no analytical meaning, so we drop them.
- drop returns a new DataFrame when inplace=False (the default); with inplace=True it performs the operation in place and returns None.
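The inplace semantics described above can be illustrated with a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Default inplace=False: returns a new DataFrame, the original keeps "c"
dropped = df.drop(columns=["c"])
print("c" in df.columns, "c" in dropped.columns)  # True False

# inplace=True: mutates df and returns None
result = df.drop(columns=["c"], inplace=True)
print(result is None, "c" in df.columns)  # True False
```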
python
data.info()
text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1418 entries, 0 to 1417
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 # 1418 non-null int64
1 Company 1418 non-null object
2 Location_HQ 1418 non-null object
3 Country 1418 non-null object
4 Continent 1418 non-null object
5 Laid_Off 1418 non-null int64
6 Date_layoffs 1418 non-null datetime64[ns]
7 Percentage 1418 non-null float64
8 Company_Size_before_Layoffs 1418 non-null int64
9 Company_Size_after_layoffs 1418 non-null int64
10 Industry 1418 non-null object
11 Stage 1418 non-null object
12 Money_Raised_in_$_mil 1418 non-null object
13 Year 1418 non-null int64
14 lat 1418 non-null float64
15 lng 1418 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(5), object(7)
memory usage: 177.4+ KB
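The info() output also shows that Money_Raised_in_$_mil is stored as object (the head above even contains a "$15000" entry). A hedged cleaning sketch on toy values, assuming the only noise is a leading "$":

```python
import pandas as pd

# Toy values mimicking the mixed formatting of Money_Raised_in_$_mil
raised = pd.Series(["1700", "733", "$15000"])

# Strip the "$" prefix and coerce to numeric; unparseable entries become NaN
cleaned = pd.to_numeric(raised.str.lstrip("$"), errors="coerce")
print(cleaned.tolist())  # [1700.0, 733.0, 15000.0]
```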
python
help(pd.core.frame.DataFrame.drop)
text
Help on function drop in module pandas.core.frame:
drop(self, labels=None, axis: 'Axis' = 0, index=None, columns=None, level: 'Level | None' = None, inplace: 'bool' = False, errors: 'str' = 'raise')
Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding
axis, or by specifying directly index or column names. When using a
multi-index, labels on different levels can be removed by specifying
the level. See the `user guide <advanced.shown_levels>`
for more information about the now unused levels.
Parameters
----------
labels : single label or list-like
Index or column labels to drop.
axis : {0 or 'index', 1 or 'columns'}, default 0
Whether to drop labels from the index (0 or 'index') or
columns (1 or 'columns').
index : single label or list-like
Alternative to specifying axis (``labels, axis=0``
is equivalent to ``index=labels``).
columns : single label or list-like
Alternative to specifying axis (``labels, axis=1``
is equivalent to ``columns=labels``).
level : int or level name, optional
For MultiIndex, level from which the labels will be removed.
inplace : bool, default False
If False, return a copy. Otherwise, do operation
inplace and return None.
errors : {'ignore', 'raise'}, default 'raise'
If 'ignore', suppress error and only existing labels are
dropped.
Returns
-------
DataFrame or None
DataFrame without the removed index or column labels or
None if ``inplace=True``.
Raises
------
KeyError
If any of the labels is not found in the selected axis.
See Also
--------
DataFrame.loc : Label-location based indexer for selection by label.
DataFrame.dropna : Return DataFrame with labels on given axis omitted
where (all or any) data are missing.
DataFrame.drop_duplicates : Return DataFrame with duplicate rows
removed, optionally only considering certain columns.
Series.drop : Return Series with specified index labels removed.
Examples
--------
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
... columns=['A', 'B', 'C', 'D'])
>>> df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Drop columns
>>> df.drop(['B', 'C'], axis=1)
A D
0 0 3
1 4 7
2 8 11
>>> df.drop(columns=['B', 'C'])
A D
0 0 3
1 4 7
2 8 11
Drop a row by index
>>> df.drop([0, 1])
A B C D
2 8 9 10 11
Drop columns and/or rows of MultiIndex DataFrame
>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
... ['speed', 'weight', 'length']],
... codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
... [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],
... data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
... [250, 150], [1.5, 0.8], [320, 250],
... [1, 0.8], [0.3, 0.2]])
>>> df
big small
lama speed 45.0 30.0
weight 200.0 100.0
length 1.5 1.0
cow speed 30.0 20.0
weight 250.0 150.0
length 1.5 0.8
falcon speed 320.0 250.0
weight 1.0 0.8
length 0.3 0.2
>>> df.drop(index='cow', columns='small')
big
lama speed 45.0
weight 200.0
length 1.5
falcon speed 320.0
weight 1.0
length 0.3
>>> df.drop(index='length', level=1)
big small
lama speed 45.0 30.0
weight 200.0 100.0
cow speed 30.0 20.0
weight 250.0 150.0
falcon speed 320.0 250.0
weight 1.0 0.8
python
# Drop the three columns, modifying data in place
data.drop(['#', 'lat', 'lng'], axis=1, inplace=True)
III. Analysis
1. Layoffs over Time
python
import matplotlib.pyplot as plt
import seaborn as sns

# relplot creates its own figure, so size it via height/aspect; a preceding
# plt.figure() call would only leave an empty figure behind
g = sns.relplot(data=data, x="Date_layoffs", y="Laid_Off", kind="line",
                errorbar=None, height=6, aspect=2)
g.ax.xaxis.set_major_locator(plt.MaxNLocator(12))
plt.xticks(rotation=45)
plt.show()
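Daily records are noisy; one common alternative is to resample to monthly totals before plotting. A sketch on toy rows (hypothetical values, same column names as the dataset):

```python
import pandas as pd

# Toy records with the dataset's column names; values are made up
toy = pd.DataFrame({
    "Date_layoffs": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-10"]),
    "Laid_Off": [100, 50, 200],
})

# "MS" = month start; sum layoffs within each calendar month
monthly = toy.set_index("Date_layoffs")["Laid_Off"].resample("MS").sum()
print(monthly.tolist())  # [150, 200]
```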

2. Layoffs over Time by Funding Stage
Layoff volumes differ sharply across funding stages, and are also closely tied to time.
python
sns.relplot(data=data, x="Date_layoffs", y="Laid_Off", hue="Stage", kind="line")
text
<seaborn.axisgrid.FacetGrid at 0x1b0d7bae3a0>

3. Pre-Layoff Company Size vs. Layoffs (Scatter Plot)
python
sns.relplot(x="Company_Size_before_Layoffs", y="Laid_Off", data=data, kind="scatter", hue = "Company_Size_before_Layoffs")
plt.xticks(rotation=45)
plt.show()
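To put a number on the trend the scatter plot suggests, one can compute a correlation. A sketch on hypothetical size/layoff pairs (not taken from the dataset):

```python
import pandas as pd

# Hypothetical size/layoff pairs, not real dataset rows
toy = pd.DataFrame({
    "Company_Size_before_Layoffs": [500, 1000, 3500, 3750],
    "Laid_Off": [100, 100, 350, 900],
})

# Pearson correlation between pre-layoff headcount and layoff count
corr = toy["Company_Size_before_Layoffs"].corr(toy["Laid_Off"])
print(corr > 0)  # True
```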

4. Total Layoffs by Year (Top 10)
Note: the data covers only five calendar years, so nlargest(10, "Laid_Off") returns all of them.
python
top10layoffs_year = data.groupby("Year").agg({"Laid_Off": "sum"}).nlargest(10, "Laid_Off")
top10layoffs_year
| Year | Laid_Off |
|---|---|
| 2023 | 177026 |
| 2022 | 129031 |
| 2020 | 61960 |
| 2021 | 6790 |
| 2024 | 4355 |
5. Top 5 Companies by Layoffs per Year
python
top5layoffs_company = data.groupby(["Year", "Company"]).agg({"Laid_Off": "sum"}).reset_index()
top5layoffs_company = top5layoffs_company.groupby("Year").apply(lambda x: x.nlargest(5, "Laid_Off")).reset_index(drop=True)
top5layoffs_company
| | Year | Company | Laid_Off |
|---|---|---|---|
| 0 | 2020 | Uber | 7525 |
| 1 | 2020 | Groupon | 2800 |
| 2 | 2020 | Swiggy | 2250 |
| 3 | 2020 | Airbnb | 1900 |
| 4 | 2020 | PaisaBazaar | 1500 |
| 5 | 2021 | Katerra | 2434 |
| 6 | 2021 | Zillow | 2000 |
| 7 | 2021 | Better.com | 900 |
| 8 | 2021 | Dropbox | 315 |
| 9 | 2021 | Delivery Hero | 300 |
| 10 | 2022 | Meta | 11000 |
| 11 | 2022 | Amazon | 10150 |
| 12 | 2022 | Cisco | 4100 |
| 13 | 2022 | Peloton | 4084 |
| 14 | 2022 | Carvana | 4000 |
| 15 | 2023 | Amazon | 17000 |
| 16 | 2023 | Google | 12000 |
| 17 | 2023 | Meta | 10000 |
| 18 | 2023 | Microsoft | 10000 |
| 19 | 2023 | Ericsson | 8500 |
| 20 | 2024 | Unity | 1800 |
| 21 | 2024 | Google | 1000 |
| 22 | 2024 | Twitch | 500 |
| 23 | 2024 | Frontdesk | 200 |
| 24 | 2024 | Discord | 170 |
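The same per-year top-k can also be obtained without groupby().apply(): sort once, then keep the first k rows of each group. A sketch on toy rows (top-2 per year instead of top-5):

```python
import pandas as pd

# Toy records; same idea as top5layoffs_company but with top-2 per year
toy = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2021, 2021],
    "Company": ["A", "B", "C", "D", "E"],
    "Laid_Off": [300, 100, 200, 50, 400],
})

# Sort descending once, then take the first 2 rows of each Year group
top2 = (toy.sort_values("Laid_Off", ascending=False)
           .groupby("Year")
           .head(2))
print(sorted(top2["Company"].tolist()))  # ['A', 'C', 'D', 'E']
```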
python
sns.set(style="whitegrid")
sns.set_palette("bright")
plt.figure(figsize=(12, 8))
sns.barplot(data=top5layoffs_company, x="Laid_Off", y="Company", hue="Year", dodge=True)
plt.xlabel("Number of Layoffs")
plt.ylabel("Company")
plt.title("Top 5 Companies with Highest Layoffs per Year (2020-2024)")
plt.legend(title="Year")
plt.tight_layout()
plt.show()
