I. Tech Company Layoffs Dataset Analysis
1. Background
According to estimates from the layoff tracker Layoffs.fyi, more than 400,000 tech jobs have been cut since the start of 2022.
A recently released AI talent report finds that nearly one in two of the world's top AI researchers was trained in China; as competition has intensified, American software engineers have been pushed into "hard mode," sitting through as many as 12 interview rounds for a single offer.[1]
This dataset contains 1,400+ records of tech company layoffs spanning March 2020 to January 2024.

2. Data Description

Field | Description |
---|---|
Company | Company name |
Location_HQ | Headquarters city |
Country | Country |
Continent | Continent |
Laid_Off | Number of employees laid off |
Date_layoffs | Layoff date |
Percentage | Layoff percentage |
Company_Size_before_Layoffs | Company size before layoffs |
Company_Size_after_layoffs | Company size after layoffs |
Industry | Industry |
Stage | Funding stage |
Money_Raised_in_$_mil | Money raised (millions of USD) |
Year | Record year |

Funding stage values:

Stage | Description |
---|---|
Series A-Z | Series A through Z |
Post IPO | Publicly listed |
Acquired | Acquired |
Private Equity | Private equity |
3. Data Sources
www.kaggle.com/datasets/ul...
layoffs.fyi/
4. Questions to Explore
- Global and regional layoff trends
- Layoffs by industry
- Relationship between company size and layoffs
- Funding stage and layoff risk
II. Data Inspection
1. Load the Data
python
import pandas as pd

data = pd.read_excel('tech_layoffs.xlsx')
data.head()
| | # | Company | Location_HQ | Country | Continent | Laid_Off | Date_layoffs | Percentage | Company_Size_before_Layoffs | Company_Size_after_layoffs | Industry | Stage | Money_Raised_in__mil | Year | lat | lng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ShareChat | Bengaluru | India | Asia | 200 | 2023-12-20 | 15.0 | 1333 | 1133 | Consumer | Series H | 1700 | 2023 | 12.97194 | 77.59369 |
| 1 | 4 | InSightec | Haifa | Israel | Asia | 100 | 2023-12-19 | 20.0 | 500 | 400 | Healthcare | Unknown | 733 | 2023 | 32.81841 | 34.98850 |
| 2 | 6 | Enphase Energy | San Francisco Bay Area | USA | North America | 350 | 2023-12-18 | 10.0 | 3500 | 3150 | Energy | Post-IPO | 116 | 2023 | 37.54827 | -121.98857 |
| 3 | 7 | Udaan | Bengaluru | India | Asia | 100 | 2023-12-18 | 10.0 | 1000 | 900 | Retail | Unknown | 1500 | 2023 | 12.97194 | 77.59369 |
| 4 | 14 | Cruise | San Francisco Bay Area | USA | North America | 900 | 2023-12-14 | 24.0 | 3750 | 2850 | Transportation | Acquired | $15000 | 2023 | 37.77493 | -122.41942 |
2. Convert the Date Field
Inspection shows that Date_layoffs holds the layoff date; convert this field to the datetime type.
python
help(pd.to_datetime)
text
Help on function to_datetime in module pandas.core.tools.datetimes:
to_datetime(arg: 'DatetimeScalarOrArrayConvertible', errors: 'str' = 'raise', dayfirst: 'bool' = False, yearfirst: 'bool' = False, utc: 'bool | None' = None, format: 'str | None' = None, exact: 'bool' = True, unit: 'str | None' = None, infer_datetime_format: 'bool' = False, origin='unix', cache: 'bool' = True) -> 'DatetimeIndex | Series | DatetimeScalar | NaTType | None'
Convert argument to datetime.
Parameters
----------
arg : int, float, str, datetime, list, tuple, 1-d array, Series, DataFrame/dict-like
The object to convert to a datetime.
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
- If 'raise', then invalid parsing will raise an exception.
- If 'coerce', then invalid parsing will be set as NaT.
- If 'ignore', then invalid parsing will return the input.
dayfirst : bool, default False
Specify a date parse order if `arg` is str or its list-likes.
If True, parses dates with the day first, eg 10/11/12 is parsed as
2012-11-10.
Warning: dayfirst=True is not strict, but will prefer to parse
with day first (this is a known bug, based on dateutil behavior).
yearfirst : bool, default False
Specify a date parse order if `arg` is str or its list-likes.
- If True parses dates with the year first, eg 10/11/12 is parsed as
2010-11-12.
- If both dayfirst and yearfirst are True, yearfirst is preceded (same
as dateutil).
Warning: yearfirst=True is not strict, but will prefer to parse
with year first (this is a known bug, based on dateutil behavior).
utc : bool, default None
Return UTC DatetimeIndex if True (converting any tz-aware
datetime.datetime objects as well).
format : str, default None
The strftime to parse time, eg "%d/%m/%Y", note that "%f" will parse
all the way up to nanoseconds.
See strftime documentation for more information on choices:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.
exact : bool, True by default
Behaves as:
- If True, require an exact format match.
- If False, allow the format to match anywhere in the target string.
unit : str, default 'ns'
The unit of the arg (D,s,ms,us,ns) denote the unit, which is an
integer or float number. This will be based off the origin.
Example, with unit='ms' and origin='unix' (the default), this
would calculate the number of milliseconds to the unix epoch start.
infer_datetime_format : bool, default False
If True and no `format` is given, attempt to infer the format of the
datetime strings based on the first non-NaN element,
and if it can be inferred, switch to a faster method of parsing them.
In some cases this can increase the parsing speed by ~5-10x.
origin : scalar, default 'unix'
Define the reference date. The numeric values would be parsed as number
of units (defined by `unit`) since this reference date.
- If 'unix' (or POSIX) time; origin is set to 1970-01-01.
- If 'julian', unit must be 'D', and origin is set to beginning of
Julian Calendar. Julian day number 0 is assigned to the day starting
at noon on January 1, 4713 BC.
- If Timestamp convertible, origin is set to Timestamp identified by
origin.
cache : bool, default True
If True, use a cache of unique, converted dates to apply the datetime
conversion. May produce significant speed-up when parsing duplicate
date strings, especially ones with timezone offsets. The cache is only
used when there are at least 50 values. The presence of out-of-bounds
values will render the cache unusable and may slow down parsing.
.. versionchanged:: 0.25.0
- changed default value from False to True.
Returns
-------
datetime
If parsing succeeded.
Return type depends on input:
- list-like: DatetimeIndex
- Series: Series of datetime64 dtype
- scalar: Timestamp
In case when it is not possible to return designated types (e.g. when
any element of input is before Timestamp.min or after Timestamp.max)
return will have datetime.datetime type (or corresponding
array/Series).
See Also
--------
DataFrame.astype : Cast argument to a specified dtype.
to_timedelta : Convert argument to timedelta.
convert_dtypes : Convert dtypes.
Examples
--------
Assembling a datetime from multiple columns of a DataFrame. The keys can be
common abbreviations like ['year', 'month', 'day', 'minute', 'second',
'ms', 'us', 'ns']) or plurals of the same
>>> df = pd.DataFrame({'year': [2015, 2016],
... 'month': [2, 3],
... 'day': [4, 5]})
>>> pd.to_datetime(df)
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
If a date does not meet the `timestamp limitations
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
#timeseries-timestamp-limits>`_, passing errors='ignore'
will return the original input instead of raising any exception.
Passing errors='coerce' will force an out-of-bounds date to NaT,
in addition to forcing non-dates (or non-parseable dates) to NaT.
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
datetime.datetime(1300, 1, 1, 0, 0)
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='coerce')
NaT
Passing infer_datetime_format=True can often-times speedup a parsing
if its not an ISO8601 format exactly, but in a regular format.
>>> s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000'] * 1000)
>>> s.head()
0 3/11/2000
1 3/12/2000
2 3/13/2000
3 3/11/2000
4 3/12/2000
dtype: object
>>> %timeit pd.to_datetime(s, infer_datetime_format=True) # doctest: +SKIP
100 loops, best of 3: 10.4 ms per loop
>>> %timeit pd.to_datetime(s, infer_datetime_format=False) # doctest: +SKIP
1 loop, best of 3: 471 ms per loop
Using a unix epoch time
>>> pd.to_datetime(1490195805, unit='s')
Timestamp('2017-03-22 15:16:45')
>>> pd.to_datetime(1490195805433502912, unit='ns')
Timestamp('2017-03-22 15:16:45.433502912')
.. warning:: For float arg, precision rounding might happen. To prevent
unexpected behavior use a fixed-width exact type.
Using a non-unix epoch origin
>>> pd.to_datetime([1, 2, 3], unit='D',
... origin=pd.Timestamp('1960-01-01'))
DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'],
dtype='datetime64[ns]', freq=None)
In case input is list-like and the elements of input are of mixed
timezones, return will have object type Index if utc=False.
>>> pd.to_datetime(['2018-10-26 12:00 -0530', '2018-10-26 12:00 -0500'])
Index([2018-10-26 12:00:00-05:30, 2018-10-26 12:00:00-05:00], dtype='object')
>>> pd.to_datetime(['2018-10-26 12:00 -0530', '2018-10-26 12:00 -0500'],
... utc=True)
DatetimeIndex(['2018-10-26 17:30:00+00:00', '2018-10-26 17:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
python
data['Date_layoffs']=pd.to_datetime(data['Date_layoffs'])
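The conversion above assumes every value parses cleanly. As a hedged sketch on toy strings (not the real dataset), `errors="coerce"` from the docstring above maps unparseable entries to `NaT` instead of raising:

```python
import pandas as pd

# Toy strings, not the real dataset: one value is deliberately malformed
s = pd.Series(["2023-12-20", "2023-12-19", "not-a-date"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(s, errors="coerce")
print(parsed.dtype)         # datetime64[ns]
print(parsed.isna().sum())  # 1
```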
python
data.head()
| | # | Company | Location_HQ | Country | Continent | Laid_Off | Date_layoffs | Percentage | Company_Size_before_Layoffs | Company_Size_after_layoffs | Industry | Stage | Money_Raised_in__mil | Year | lat | lng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ShareChat | Bengaluru | India | Asia | 200 | 2023-12-20 | 15.0 | 1333 | 1133 | Consumer | Series H | 1700 | 2023 | 12.97194 | 77.59369 |
| 1 | 4 | InSightec | Haifa | Israel | Asia | 100 | 2023-12-19 | 20.0 | 500 | 400 | Healthcare | Unknown | 733 | 2023 | 32.81841 | 34.98850 |
| 2 | 6 | Enphase Energy | San Francisco Bay Area | USA | North America | 350 | 2023-12-18 | 10.0 | 3500 | 3150 | Energy | Post-IPO | 116 | 2023 | 37.54827 | -121.98857 |
| 3 | 7 | Udaan | Bengaluru | India | Asia | 100 | 2023-12-18 | 10.0 | 1000 | 900 | Retail | Unknown | 1500 | 2023 | 12.97194 | 77.59369 |
| 4 | 14 | Cruise | San Francisco Bay Area | USA | North America | 900 | 2023-12-14 | 24.0 | 3750 | 2850 | Transportation | Acquired | $15000 | 2023 | 37.77493 | -122.41942 |
3. Drop Meaningless Columns
From data.info() we can see:
- There are no missing values (all 1418 entries are non-null).
- The #, lat, and lng columns carry no analytical meaning, so we drop them.
- drop returns a new DataFrame when inplace=False (the default); with inplace=True it performs the operation in place and returns None.
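The inplace semantics described above can be illustrated with a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Default inplace=False: returns a new DataFrame, the original keeps "c"
dropped = df.drop(columns=["c"])
print("c" in df.columns, "c" in dropped.columns)  # True False

# inplace=True: mutates df and returns None
result = df.drop(columns=["c"], inplace=True)
print(result is None, "c" in df.columns)  # True False
```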
python
data.info()
text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1418 entries, 0 to 1417
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 # 1418 non-null int64
1 Company 1418 non-null object
2 Location_HQ 1418 non-null object
3 Country 1418 non-null object
4 Continent 1418 non-null object
5 Laid_Off 1418 non-null int64
6 Date_layoffs 1418 non-null datetime64[ns]
7 Percentage 1418 non-null float64
8 Company_Size_before_Layoffs 1418 non-null int64
9 Company_Size_after_layoffs 1418 non-null int64
10 Industry 1418 non-null object
11 Stage 1418 non-null object
12 Money_Raised_in_$_mil 1418 non-null object
13 Year 1418 non-null int64
14 lat 1418 non-null float64
15 lng 1418 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(5), object(7)
memory usage: 177.4+ KB
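The info() output also shows that Money_Raised_in_$_mil is stored as object (the head above even contains a "$15000" entry). A hedged cleaning sketch on toy values, assuming the only noise is a leading "$":

```python
import pandas as pd

# Toy values mimicking the mixed formatting of Money_Raised_in_$_mil
raised = pd.Series(["1700", "733", "$15000"])

# Strip the "$" prefix and coerce to numeric; unparseable entries become NaN
cleaned = pd.to_numeric(raised.str.lstrip("$"), errors="coerce")
print(cleaned.tolist())  # [1700.0, 733.0, 15000.0]
```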
python
help(pd.core.frame.DataFrame.drop)
text
Help on function drop in module pandas.core.frame:
drop(self, labels=None, axis: 'Axis' = 0, index=None, columns=None, level: 'Level | None' = None, inplace: 'bool' = False, errors: 'str' = 'raise')
Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding
axis, or by specifying directly index or column names. When using a
multi-index, labels on different levels can be removed by specifying
the level. See the `user guide <advanced.shown_levels>`
for more information about the now unused levels.
Parameters
----------
labels : single label or list-like
Index or column labels to drop.
axis : {0 or 'index', 1 or 'columns'}, default 0
Whether to drop labels from the index (0 or 'index') or
columns (1 or 'columns').
index : single label or list-like
Alternative to specifying axis (``labels, axis=0``
is equivalent to ``index=labels``).
columns : single label or list-like
Alternative to specifying axis (``labels, axis=1``
is equivalent to ``columns=labels``).
level : int or level name, optional
For MultiIndex, level from which the labels will be removed.
inplace : bool, default False
If False, return a copy. Otherwise, do operation
inplace and return None.
errors : {'ignore', 'raise'}, default 'raise'
If 'ignore', suppress error and only existing labels are
dropped.
Returns
-------
DataFrame or None
DataFrame without the removed index or column labels or
None if ``inplace=True``.
Raises
------
KeyError
If any of the labels is not found in the selected axis.
See Also
--------
DataFrame.loc : Label-location based indexer for selection by label.
DataFrame.dropna : Return DataFrame with labels on given axis omitted
where (all or any) data are missing.
DataFrame.drop_duplicates : Return DataFrame with duplicate rows
removed, optionally only considering certain columns.
Series.drop : Return Series with specified index labels removed.
Examples
--------
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
... columns=['A', 'B', 'C', 'D'])
>>> df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Drop columns
>>> df.drop(['B', 'C'], axis=1)
A D
0 0 3
1 4 7
2 8 11
>>> df.drop(columns=['B', 'C'])
A D
0 0 3
1 4 7
2 8 11
Drop a row by index
>>> df.drop([0, 1])
A B C D
2 8 9 10 11
Drop columns and/or rows of MultiIndex DataFrame
>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
... ['speed', 'weight', 'length']],
... codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
... [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],
... data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
... [250, 150], [1.5, 0.8], [320, 250],
... [1, 0.8], [0.3, 0.2]])
>>> df
big small
lama speed 45.0 30.0
weight 200.0 100.0
length 1.5 1.0
cow speed 30.0 20.0
weight 250.0 150.0
length 1.5 0.8
falcon speed 320.0 250.0
weight 1.0 0.8
length 0.3 0.2
>>> df.drop(index='cow', columns='small')
big
lama speed 45.0
weight 200.0
length 1.5
falcon speed 320.0
weight 1.0
length 0.3
>>> df.drop(index='length', level=1)
big small
lama speed 45.0 30.0
weight 200.0 100.0
cow speed 30.0 20.0
weight 250.0 150.0
falcon speed 320.0 250.0
weight 1.0 0.8
python
# Drop the three columns, modifying data in place
data.drop(['#', 'lat', 'lng'], axis=1, inplace=True)
III. Analysis
1. Layoffs over Time
python
import matplotlib.pyplot as plt
import seaborn as sns

# relplot creates its own figure, so size it via height/aspect; a preceding
# plt.figure() call would only leave an empty figure behind
g = sns.relplot(data=data, x="Date_layoffs", y="Laid_Off", kind="line",
                errorbar=None, height=6, aspect=2)
g.ax.xaxis.set_major_locator(plt.MaxNLocator(12))
plt.xticks(rotation=45)
plt.show()
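Daily records are noisy; one common alternative is to resample to monthly totals before plotting. A sketch on toy rows (hypothetical values, same column names as the dataset):

```python
import pandas as pd

# Toy records with the dataset's column names; values are made up
toy = pd.DataFrame({
    "Date_layoffs": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-10"]),
    "Laid_Off": [100, 50, 200],
})

# "MS" = month start; sum layoffs within each calendar month
monthly = toy.set_index("Date_layoffs")["Laid_Off"].resample("MS").sum()
print(monthly.tolist())  # [150, 200]
```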

2. Layoffs over Time by Funding Stage
Layoff volumes differ sharply across funding stages, and are also closely tied to time.
python
sns.relplot(data=data, x="Date_layoffs", y="Laid_Off", hue="Stage", kind="line")
text
<seaborn.axisgrid.FacetGrid at 0x1b0d7bae3a0>

3. Pre-Layoff Company Size vs. Layoffs (Scatter Plot)
python
sns.relplot(x="Company_Size_before_Layoffs", y="Laid_Off", data=data, kind="scatter", hue = "Company_Size_before_Layoffs")
plt.xticks(rotation=45)
plt.show()
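To put a number on the trend the scatter plot suggests, one can compute a correlation. A sketch on hypothetical size/layoff pairs (not taken from the dataset):

```python
import pandas as pd

# Hypothetical size/layoff pairs, not real dataset rows
toy = pd.DataFrame({
    "Company_Size_before_Layoffs": [500, 1000, 3500, 3750],
    "Laid_Off": [100, 100, 350, 900],
})

# Pearson correlation between pre-layoff headcount and layoff count
corr = toy["Company_Size_before_Layoffs"].corr(toy["Laid_Off"])
print(corr > 0)  # True
```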

4. Total Layoffs by Year (Top 10)
Note: the data covers only five calendar years, so nlargest(10, "Laid_Off") returns all of them.
python
top10layoffs_year = data.groupby("Year").agg({"Laid_Off": "sum"}).nlargest(10, "Laid_Off")
top10layoffs_year
| Year | Laid_Off |
|---|---|
| 2023 | 177026 |
| 2022 | 129031 |
| 2020 | 61960 |
| 2021 | 6790 |
| 2024 | 4355 |
5. Top 5 Companies by Layoffs per Year
python
top5layoffs_company = data.groupby(["Year", "Company"]).agg({"Laid_Off": "sum"}).reset_index()
top5layoffs_company = top5layoffs_company.groupby("Year").apply(lambda x: x.nlargest(5, "Laid_Off")).reset_index(drop=True)
top5layoffs_company
| | Year | Company | Laid_Off |
|---|---|---|---|
| 0 | 2020 | Uber | 7525 |
| 1 | 2020 | Groupon | 2800 |
| 2 | 2020 | Swiggy | 2250 |
| 3 | 2020 | Airbnb | 1900 |
| 4 | 2020 | PaisaBazaar | 1500 |
| 5 | 2021 | Katerra | 2434 |
| 6 | 2021 | Zillow | 2000 |
| 7 | 2021 | Better.com | 900 |
| 8 | 2021 | Dropbox | 315 |
| 9 | 2021 | Delivery Hero | 300 |
| 10 | 2022 | Meta | 11000 |
| 11 | 2022 | Amazon | 10150 |
| 12 | 2022 | Cisco | 4100 |
| 13 | 2022 | Peloton | 4084 |
| 14 | 2022 | Carvana | 4000 |
| 15 | 2023 | Amazon | 17000 |
| 16 | 2023 | Google | 12000 |
| 17 | 2023 | Meta | 10000 |
| 18 | 2023 | Microsoft | 10000 |
| 19 | 2023 | Ericsson | 8500 |
| 20 | 2024 | Unity | 1800 |
| 21 | 2024 | Google | 1000 |
| 22 | 2024 | Twitch | 500 |
| 23 | 2024 | Frontdesk | 200 |
| 24 | 2024 | Discord | 170 |
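The same per-year top-k can also be obtained without groupby().apply(): sort once, then keep the first k rows of each group. A sketch on toy rows (top-2 per year instead of top-5):

```python
import pandas as pd

# Toy records; same idea as top5layoffs_company but with top-2 per year
toy = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2021, 2021],
    "Company": ["A", "B", "C", "D", "E"],
    "Laid_Off": [300, 100, 200, 50, 400],
})

# Sort descending once, then take the first 2 rows of each Year group
top2 = (toy.sort_values("Laid_Off", ascending=False)
           .groupby("Year")
           .head(2))
print(sorted(top2["Company"].tolist()))  # ['A', 'C', 'D', 'E']
```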
python
sns.set(style="whitegrid")
sns.set_palette("bright")
plt.figure(figsize=(12, 8))
sns.barplot(data=top5layoffs_company, x="Laid_Off", y="Company", hue="Year", dodge=True)
plt.xlabel("Number of Layoffs")
plt.ylabel("Company")
plt.title("Top 5 Companies with Highest Layoffs per Year (2020-2024)")
plt.legend(title="Year")
plt.tight_layout()
plt.show()
