Computer Science Assignment Writing Service | Data Science Assignment

1. In this question, we will load and inspect our dataset.

(a) Load the dataset fifa22.csv and display the first 5 rows. It is okay if not all of the columns are visible.
(b) What is the unit of analysis in this dataset?
(c) How many observations and how many features are in the dataset?
(d) The “gender” column provides the gender of the player, with binary options of “M” for male and “F” for female. How many male players and how many female players are in this dataset?
(e) This dataset includes all playable characters in the video game FIFA 2022. Do you think this dataset is representative of the real-world population of professional football/soccer players? Briefly explain why or why not.
(f) One last piece of data cleaning remains before we get started with analysis. Our dataset includes many missing values, particularly for female players. This means that we don’t want to drop all rows where any value is NaN, as that would disproportionately remove women from our dataset. Instead, we want to be more targeted: drop only rows where the “passing” column contains a missing (NaN) value. The final dataframe should have 17,450 observations. Show your code for dropping these rows and display the shape of the final dataset.
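
The difference between dropping every row with any NaN and the targeted drop asked for in (f) can be sketched on a tiny made-up table. This is only an illustration under stated assumptions: the five rows below are invented and stand in for fifa22.csv, which is not reproduced here; the column names "gender" and "passing" come from the questions above, while the "skill" values are there only to create extra missing cells.

```python
import numpy as np
import pandas as pd

# Toy stand-in for fifa22.csv (invented values, NOT the real data).
# With the real file, this would instead be: df = pd.read_csv("fifa22.csv")
df = pd.DataFrame({
    "gender": ["M", "M", "F", "F", "M"],
    "passing": [70.0, np.nan, 65.0, 80.0, 55.0],
    "skill": [60.0, 72.0, np.nan, np.nan, 58.0],
})

print(df.head())                    # (a) first 5 rows
print(df.shape)                     # (c) (observations, features)
print(df["gender"].value_counts())  # (d) number of players per gender

# Dropping rows with ANY missing value also removes rows that are only
# missing "skill", which in this toy table are all of the "F" rows.
dropped_any = df.dropna()

# (f) Targeted drop: remove only rows where "passing" is NaN.
cleaned = df.dropna(subset=["passing"])
print(cleaned.shape)
```

On the real dataset the same `dropna(subset=["passing"])` call would be followed by checking that the shape reports 17,450 rows, as stated in (f).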

2. In this question, we will perform multiple linear regression using the statsmodels package.

(a) Use the statsmodels package to estimate a multiple regression evaluating the effect on rank of four features: passing, attacking, defending, and skill. Display the output.
(b) How much of the variance in rank is explained by our features?
(c) Which, if any, of our features are significant at the 1% level?
(d) Holding passing, attacking, and defending constant, a 1-unit increase in “skill” is associated with what change in rank?

3. Now that we’ve gotten to know our data a little bit, we will use scikit-learn and a test/train split to see how well our model, using the same dependent variable (DV) and independent variables (IVs) as Q2, can predict a player’s rank.

(a) Based on the statsmodels output from Q2, do you expect that these four features (passing, attacking, defending, and skill) will do a pretty good or pretty bad job at predicting rank for out-of-sample data? Briefly explain why or why not.
(b) Create an X dataframe with just four features: passing, attacking, defending, and skill. Create a Y dataframe (or series) with just the “rank” variable. Display the first five rows of each.
(c) Create a test/train split where 25% of the data is held out for testing. Use a random seed of 123 (i.e., set the random state to this value). To show your code has worked, display the first 5 rows of the X training data.
(d) Use scikit-learn to train a linear regression using only the training data. Display the intercept and coefficients for your trained model (coefficients do not need to be labeled).
(e) Compare the coefficients estimated by both regression models. How does the coefficient for “attacking” change (if at all) when it is estimated in Q2 (using statsmodels and the full dataset) vs when it is estimated in Q3 (using scikit-learn and just training data)?
(f) Use your trained scikit-learn regression model to predict rank values for the hold-out set of X test data. Display at least the first three predicted values (in a format of your choice).
(g) Display a scatterplot in which the horizontal axis shows the actual value of the Y test data and the vertical axis displays the predicted Y values for the X test data.
(h) Calculate and display the Root Mean Squared Error (RMSE) for this model. Provide a brief interpretation of what this means in terms of the “average error” of the model.
(i) Reflecting on any of the analyses conducted above, do you feel that this model does a good job or a bad job of predicting player rank?
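
The split, fit, predict, and RMSE steps in parts (b) through (h) might look roughly like the sketch below. As before, the data are synthetic and invented only to make the example self-contained, so the printed intercept, coefficients, and RMSE are not results for the real dataset. The RMSE is computed by hand with NumPy rather than through a scikit-learn helper, and the scatterplot from (g) is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (invented for illustration, NOT fifa22.csv).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "passing": rng.normal(60, 10, n),
    "attacking": rng.normal(55, 12, n),
    "defending": rng.normal(50, 15, n),
    "skill": rng.normal(58, 11, n),
})
y = (
    20 + 0.3 * X["passing"] + 0.2 * X["attacking"]
    + 0.1 * X["defending"] + 0.4 * X["skill"] + rng.normal(0, 3, n)
).rename("rank")

# (b) First five rows of the features and the target.
print(X.head())
print(y.head())

# (c) Hold out 25% of the rows for testing, with the seed from the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123
)
print(X_train.head())

# (d) Fit on the training rows only.
reg = LinearRegression().fit(X_train, y_train)
print(reg.intercept_, reg.coef_)

# (f) Predict rank for the held-out rows and show the first three values.
y_pred = reg.predict(X_test)
print(y_pred[:3])

# (h) RMSE: square root of the mean squared prediction error,
# expressed in the same units as rank.
rmse = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
print(rmse)
```

Because the RMSE is in the units of the target, it can be read as a typical size of the prediction error in rank, with large misses weighted more heavily than small ones.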
