COMP 330 Assignment #5
1 Description
In this assignment, you will be implementing a regularized, logistic regression to classify text documents. The implementation will be in Python, on top of Spark. To handle the large data set that we will be
giving you, it is necessary to use Amazon AWS.
You will be asked to perform three subtasks: (1) data preparation, (2) learning (which will be done via
gradient descent) and (3) evaluation of the learned model.
Note: It is important to complete HW 5 and Lab 5 before you really get going on this assignment. HW
5 will give you an opportunity to try out gradient descent for learning a model, and Lab 5 will give you
some experience with writing efficient NumPy code, both of which will be important for making your A5
experience less challenging!
2 Data
You will be dealing with a data set that consists of around 170,000 text documents and a test/evaluation
data set that consists of 18,700 text documents. All but around 6,000 of these text documents are Wikipedia
pages; the remaining documents are descriptions of Australian court cases and rulings. At the highest level,
your task is to build a classifier that can automatically figure out whether a text document is an Australian
court case.
We have prepared three data sets for your use.
1. The Training Data Set (1.9 GB of text). This is the set you will use to train your logistic regression
model:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt
or as direct S3 address, so you can use it in a Spark job:
s3://chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt
2. The Testing Data Set (200 MB of text). This is the set you will use to evaluate your model:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/TestingDataOneLinePerDoc.txt
or as direct S3 address, so you can use it in a Spark job:
s3://chrisjermainebucket/comp330 A5/TestingDataOneLinePerDoc.txt
3. The Small Data Set (37.5 MB of text). This is for you to use for training and testing of your model on
a smaller data set:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/SmallTrainingDataOneLinePerDoc.txt
Some Data Details to Be Aware Of. You should download and look at the SmallTrainingDataOneLinePerDoc.txt
file before you begin. You’ll see that the contents are sort of a pseudo-XML, where each text document
begins with a <doc id = ... > tag and ends with </doc>. Each document is contained on a single
line of text.
Note that all of the Australian legal cases begin with something like <doc id = "AU1222" ...>;
that is, the doc id for an Australian legal case always starts with AU. You will be trying to figure out whether a
document is an Australian legal case by looking only at the contents of the document.
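If it helps, here is a minimal sketch of one way to pull the doc id out of a line and turn it into a 0/1 label. It assumes the id is surrounded by straight double quotes, which you should verify against the small file first; the regular expression and the function name are purely illustrative.

import re

# Each line looks roughly like: <doc id = "AU1222" ...> document text </doc>
# (assumed format; check SmallTrainingDataOneLinePerDoc.txt for the exact quoting).
ID_PATTERN = re.compile(r'id\s*=\s*"([^"]+)"')

def label_of(line):
    """Return 1 if this document is an Australian court case (doc id starts
    with AU), and 0 otherwise."""
    match = ID_PATTERN.search(line)
    doc_id = match.group(1) if match else ""
    return 1 if doc_id.startswith("AU") else 0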
3 The Tasks
There are three separate tasks that you need to complete to finish the assignment. As usual, it makes
sense to implement these and run them on the small data set before moving to the larger one.
3.1 Task 1
First, you need to write Spark code that builds a dictionary that includes the 20,000 most frequent words
in the training corpus. This dictionary is essentially an RDD that has the word as the key, and the relative
frequency position of the word as the value. For example, the value is zero for the most frequent word, and
19,999 for the least frequent word in the dictionary.
To get credit for this task, give us the frequency position of the words “applicant”, “and”, “attack”,
“protein”, and “car”. These should be values from 0 to 19,999, or -1 if the word is not in the dictionary
because it is not in the top 20,000.
Note that accomplishing this will require you to use a variant of your A4 solution. If you do not trust
your A4 solution and would like mine, you can post a private request on Piazza.
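To make the expected shape of the dictionary concrete, here is a minimal PySpark sketch. It assumes that each line of the input file is one document and uses a simple letters-only tokenization; the path, variable names, and tokenizer are illustrative, not required.

import re
from pyspark import SparkContext

sc = SparkContext()

# Path to one of the data files listed above (start with the small one).
training_path = "s3://chrisjermainebucket/comp330 A5/SmallTrainingDataOneLinePerDoc.txt"
corpus = sc.textFile(training_path)

# A simple tokenization: keep runs of letters, lower-cased.
regex = re.compile("[^a-zA-Z]")
all_words = corpus.flatMap(lambda line: regex.sub(" ", line).lower().split())

# Count each word across the corpus and keep the 20,000 most frequent.
counts = all_words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
top_words = counts.top(20000, key=lambda pair: pair[1])

# The dictionary RDD: (word, frequency position), where 0 is the most frequent word.
dictionary = sc.parallelize(range(20000)).map(lambda i: (top_words[i][0], i))

# Frequency positions for the words to report, -1 if not in the top 20,000.
lookup = dict(dictionary.collect())
for w in ["applicant", "and", "attack", "protein", "car"]:
    print(w, lookup.get(w, -1))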
3.2 Task 2
Next, you will convert each of the documents in the training set to a TF-IDF vector. You will then use
a gradient descent algorithm to learn a logistic regression model that can decide whether a document is
describing an Australian court case or not. Your model should use L2 regularization; you can play with
things a bit to determine the parameter controlling the extent of the regularization. We have enough
data that you may find the regularization is not too important (that is, you may get good
results with a very small weight given to the regularization constant).
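As a rough illustration of what the per-document conversion might look like (not the only way to do it), here is a small NumPy sketch. It assumes you have already mapped each document to the list of dictionary positions of its words; NUM_WORDS and the helper names are illustrative.

import numpy as np

NUM_WORDS = 20000

def tf_vector(positions):
    """Term-frequency vector for one document, given the dictionary positions
    of its words (words outside the top 20,000 already dropped)."""
    counts = np.zeros(NUM_WORDS)
    np.add.at(counts, np.array(positions, dtype=int), 1)   # count occurrences per position
    return counts / max(len(positions), 1)                  # normalize by document length

def idf_vector(tf_rdd, num_docs):
    """IDF over the corpus: log(num_docs / number of documents containing the word)."""
    doc_freq = tf_rdd.map(lambda tf: (tf > 0).astype(float)).reduce(lambda a, b: a + b)
    return np.log(num_docs / np.maximum(doc_freq, 1.0))

# The TF-IDF vector for a document is then the elementwise product tf * idf.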
I am going to ask that you not just look up the gradient descent algorithm on the Internet and implement
it. Start with the LLH function from class, and then derive your own gradient descent algorithm. We can
help with this if you get stuck.
At the end of each iteration, compute the LLH of your model. You should run your gradient descent
until the change in LLH across iterations is very small.
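Purely to illustrate the structure of the iteration and the convergence check (the derivation of the update itself is still yours to do, starting from the class LLH), here is a NumPy sketch that uses the standard gradient of an L2-regularized logistic LLH. This is an assumed formulation and may differ in details from the one presented in class; on the full data set you would compute the same sums with Spark map/reduce rather than holding x in a single NumPy matrix.

import numpy as np

def train(x, y, lam=1e-4, learning_rate=0.01, tol=1e-6, max_iter=500):
    """Batch gradient ascent on an L2-regularized logistic LLH (assumed form).
    x: (n, d) matrix of centered/normalized TF-IDF vectors; y: length-n 0/1 labels."""
    n, d = x.shape
    r = np.zeros(d)                                   # regression coefficients
    old_llh = -np.inf
    for iteration in range(max_iter):
        scores = x.dot(r)
        p = 1.0 / (1.0 + np.exp(-scores))             # predicted probabilities
        # LLH of the logistic model minus the L2 penalty; this np.exp is one
        # place where the overflow issues discussed in Section 4 can show up.
        llh = np.sum(y * scores - np.log(1.0 + np.exp(scores))) - lam * r.dot(r)
        grad = x.T.dot(y - p) - 2.0 * lam * r         # gradient of the penalized LLH
        r = r + learning_rate * grad                  # ascent step; tune the rate
        if abs(llh - old_llh) < tol:                  # stop when the LLH barely changes
            break
        old_llh = llh
    return r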
Once you have completed this task, you will get credit by (a) writing up your gradient update formula,
and (b) giving us the fifty words with the largest regression coefficients, that is, the fifty words that are
most strongly associated with an Australian court case.
3.3 Task 3
Now that you have trained your model, it is time to evaluate it. Here, you will use your model to predict
whether or not each of the testing points corresponds to an Australian court case. To get credit for this task,
you need to compute for us the F1 score obtained by your classifier; we will use the F1 score as
one of the ways in which we grade your Task 3 submission.
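For reference, here is a minimal sketch of computing the F1 score from the predicted and true labels, assuming 1 means “Australian court case”; how you obtain the predictions and which cutoff you use is up to you.

import numpy as np

def f1_score(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall), with the positive
    class (1) meaning 'Australian court case'."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))        # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))        # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))        # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)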
Also, I am going to ask you to actually look at the text for three of the false positives that your model
produced (that is, Wikipedia articles that your model thought were Australian court cases). Write a paragraph
describing why you think your model was fooled. Were the bad documents about Australia? The
legal system?
If you don’t have three false positives, just use the ones that you had (if any).
4 Important Considerations
Some notes regarding training and implementation. As you implement and evaluate your gradient descent algorithm, here are a few things to keep in mind.
1. To get good accuracy, you will need to center and normalize your data. That is, transform your data so
that the mean of each dimension is zero and the standard deviation is one: subtract the mean
vector from each data point, and then divide the result by the vector of standard deviations computed
over the data set (see the sketch after this list).
2. When classifying new data, a data point whose dot product with the vector of regression coefficients is positive
is a “yes”, and a negative one is a “no” (see slide 15 in the GLM lecture). You will be trying to maximize the
F1 of your classifier, and you can often increase the F1 by choosing a cutoff between “yes”
and “no” other than zero. Another thing that you can do is to add another dimension whose value is
one in each data point (we discussed this in class). The learning process will then choose a regression
coefficient for this special dimension that tends to balance the “yes” and “no” nicely at a cutoff of zero.
However, some students in the past have reported that this can increase the training time.
3. Students sometimes face overflow problems, both when computing the LLH and when computing the
gradient update. Some things that you can do to avoid this are: (1) use np.exp(), which seems to
be quite robust, and (2) transform your data so that the standard deviation is smaller than one; if you
have problems with a standard deviation of one, you might try 10^-2 or even 10^-5. You may need to
experiment a bit. Such are the wonderful aspects of implementing data science algorithms in the real
world!
4. If you find that your training takes more than a few hours to run to convergence on the largest data set,
it likely means that you are doing something inherently slow that you can speed up by looking
at your code carefully. One thing: there is no problem with first training your model on a small sample
of the large data set (say, 10% of the documents), then using the result as an initialization and continuing
training on the full data set. This can speed up the process of reaching convergence.
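As referenced in item 1 above, here is a minimal sketch of centering and normalizing a matrix of feature vectors with NumPy; the target standard deviation is a parameter so that you can also try the smaller values suggested in item 3. The function name is illustrative.

import numpy as np

def normalize(x, target_std=1.0):
    """Center each dimension to mean zero and scale it to target_std.
    x: (n, d) matrix of TF-IDF vectors. Dimensions with zero variance are
    left at zero after centering."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0                 # avoid dividing by zero
    return (x - mean) / std * target_std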
Big data, small data, and grading. The first two tasks are worth three points, the last four points. Since it
can be challenging to run everything on a large data set, we’ll offer you a small data option. If you train your
model on TestingDataOneLinePerDoc.txt, and then test it on SmallTrainingDataOneLinePerDoc.txt,
we’ll take off 0.5 points on Task 2 and 0.5 points on Task 3. This means you can still get an A, and
you don’t have to deal with the big data set. For the possibility of getting full credit, you can train
your model on the quite large TrainingDataOneLinePerDoc.txt data set, and then test it
on TestingDataOneLinePerDoc.txt.
4.1 Machines to Use
If you decide to try for full credit on the big data set, you will need to run your Spark jobs on three to five
machines as workers, each having around 8 cores. If you are not trying for the full credit, you can likely
get away with running on a smaller cluster. Remember, the costs WILL ADD UP QUICKLY IF YOU
FORGET TO SHUT OFF YOUR MACHINES. Be very careful, and shut down your cluster as soon as
you are done working. You can always create a new one easily when you begin your work again.
4.2 Turnin
Create a single document that has results for all three tasks. Make sure to be very clear whether you
tried the big data or small data option. Turn in this document as well as all of your code. Please zip up all
of your code and your document (use .gz or .zip only, please!), or else attach each piece of code as well as
your document to your submission individually. Do NOT turn in anything other than your Python code and
your document.
請(qǐng)加QQ:99515681  郵箱:99515681@qq.com   WX:codinghelp













 

標(biāo)簽:

掃一掃在手機(jī)打開(kāi)當(dāng)前頁(yè)
  • 上一篇:AIC2100代寫(xiě)、Python設(shè)計(jì)程序代做
  • 下一篇:COMP3334代做、代寫(xiě)Python程序語(yǔ)言
  • 無(wú)相關(guān)信息
    昆明生活資訊

    昆明圖文信息
    蝴蝶泉(4A)-大理旅游
    蝴蝶泉(4A)-大理旅游
    油炸竹蟲(chóng)
    油炸竹蟲(chóng)
    酸筍煮魚(yú)(雞)
    酸筍煮魚(yú)(雞)
    竹筒飯
    竹筒飯
    香茅草烤魚(yú)
    香茅草烤魚(yú)
    檸檬烤魚(yú)
    檸檬烤魚(yú)
    昆明西山國(guó)家級(jí)風(fēng)景名勝區(qū)
    昆明西山國(guó)家級(jí)風(fēng)景名勝區(qū)
    昆明旅游索道攻略
    昆明旅游索道攻略
  • 短信驗(yàn)證碼平臺(tái) 理財(cái) WPS下載

    關(guān)于我們 | 打賞支持 | 廣告服務(wù) | 聯(lián)系我們 | 網(wǎng)站地圖 | 免責(zé)聲明 | 幫助中心 | 友情鏈接 |

    Copyright © 2025 kmw.cc Inc. All Rights Reserved. 昆明網(wǎng) 版權(quán)所有
    ICP備06013414號(hào)-3 公安備 42010502001045

    久久久久久精品无码人妻_青春草无码精品视频在线观_无码精品国产VA在线观看_国产色无码专区在线观看

    人妻丰满熟妇av无码区app| 18黄暴禁片在线观看| 免费在线精品视频| 亚洲国产精品久久久久婷蜜芽 | 成人在线观看黄| 日韩久久久久久久久久久久| 午夜av中文字幕| 污片在线免费看| 久久久久免费精品| 成人观看免费完整观看| 亚洲一区二区福利视频| 久久久久国产精品熟女影院| 精品国产免费av| 激情伊人五月天| www午夜视频| 欧美日韩一区二区在线免费观看| 一级做a爱视频| 青青青青草视频| 中文精品无码中文字幕无码专区 | 欧美v在线观看| 奇米精品一区二区三区| 99精品视频国产| 天堂中文av在线| 欧美视频国产视频| 国产乱码一区二区三区四区| 天天爽人人爽夜夜爽| 亚洲色图久久久| av污在线观看| 粉色视频免费看| 色婷婷综合在线观看| 国产精品99久久久久久大便| 日本特级黄色大片| 污色网站在线观看| 911福利视频| 欧美国产在线一区| 男女激烈动态图| 中文字幕第100页| www.色就是色| 免费黄色日本网站| 国产激情在线观看视频| 国产一区二区三区小说| koreanbj精品视频一区| aaa毛片在线观看| 成人污网站在线观看| 欧美久久在线观看| 国产xxxxx视频| 亚洲网中文字幕| 欧美午夜aaaaaa免费视频| 午夜免费福利小电影| 国产h视频在线播放| 亚洲男人天堂色| 亚洲黄色片免费看| 国产日韩亚洲欧美在线| 国产人妻互换一区二区| 99在线免费视频观看| 激情图片qvod| 国产av天堂无码一区二区三区| 国产奶头好大揉着好爽视频| 蜜桃网站在线观看| 波多野结衣乳巨码无在线| 欧美在线观看黄| 色综合久久久久无码专区| av女优在线播放| 日韩一级片播放| 特色特色大片在线| 男人揉女人奶房视频60分| 黄色av网址在线播放| 污视频网站观看| 成人免费观看cn| 欧洲精品一区二区三区久久| 久久久久久久久久伊人| 妓院一钑片免看黄大片| 久久精品国产精品亚洲精品色| 五月天开心婷婷| 日韩中文字幕在线视频观看 | 成年人视频观看| 日本a级片免费观看| 伊人成人222| 日本中文字幕亚洲| 亚洲制服中文字幕| 丝袜老师办公室里做好紧好爽| 黄色一级视频片| av在线免费看片| 欧美a v在线播放| 天天干天天操天天干天天操| 91视频最新入口| 韩国一区二区av| 亚洲一区二区偷拍| 成人3d动漫一区二区三区| 国产美女永久无遮挡| 亚洲色图偷拍视频| 乱子伦视频在线看| 亚洲欧美日韩三级| 国产 日韩 亚洲 欧美| 国产精品无码乱伦| av五月天在线| 欧美黑人经典片免费观看| 日韩手机在线观看视频| 男人女人黄一级| 99在线免费视频观看| 手机在线视频一区| 欧在线一二三四区| 一区二区免费av| 成人在线观看a| 欧美精品一区二区三区三州| 经典三级在线视频| 五月天婷婷在线观看视频| 嫩草av久久伊人妇女超级a| 中文字幕日韩综合| 日日噜噜夜夜狠狠| 青青草综合在线| 999在线精品视频| 亚洲36d大奶网| 成人亚洲精品777777大片| 免费在线观看亚洲视频| 精品无码国产一区二区三区av| 免费在线观看毛片网站| 亚洲娇小娇小娇小| 亚洲精品怡红院| 茄子视频成人免费观看| www.射射射| 国产 日韩 亚洲 欧美| 97超碰在线人人| 一本一道久久a久久综合蜜桃| 国产成人一区二区三区别| 97公开免费视频| 国产l精品国产亚洲区久久| 久久国产精品网| xxxx18hd亚洲hd捆绑| av无码久久久久久不卡网站| 日韩一级性生活片| 五月天视频在线观看| 日本美女视频一区| 超碰97在线看| 色姑娘综合天天| 日韩小视频网站| 激情六月天婷婷| 国产又爽又黄ai换脸| 爽爽爽在线观看| 亚洲综合激情五月| 中文字幕在线乱| 91免费版看片| 野外做受又硬又粗又大视频√| 五月天激情视频在线观看| 亚洲天堂网一区| 国产三级国产精品国产专区50| 欧美日韩性生活片| 精品久久久久久久免费人妻| 亚洲欧美另类动漫| 亚洲天堂一区二区在线观看| 日本女优爱爱视频| 欧美性大战久久久久xxx| 国产精品无码av在线播放| 日韩a在线播放| 亚洲一区在线不卡| 91免费视频污| 污污污污污污www网站免费| 美脚丝袜脚交一区二区| 每日在线更新av| 亚洲国产高清av| av电影一区二区三区| 精品无码一区二区三区爱欲| 亚洲精品无码久久久久久| 另类小说第一页| 日本美女久久久| 国产av国片精品| 污网站免费在线| 黄色片免费在线观看视频| 国产肥臀一区二区福利视频| 在线黄色免费看| 国产一线二线三线女| 久久午夜夜伦鲁鲁一区二区| 97超碰免费观看| 欧美色图另类小说| 九九热99视频| 黄页网站大全在线观看| 天天干天天干天天干天天干天天干| 国产在线青青草| 福利视频999| 黄页网站大全在线观看| 91香蕉视频导航| 9色porny| 一区二区三区视频网| 人妻激情另类乱人伦人妻| 国产精品国产三级国产专区51| 免费的av在线| www成人免费| 天天干天天操天天玩| 黄网站色视频免费观看| 丁香婷婷激情网| 搞av.com| 天堂在线精品视频| 99热成人精品热久久66| 热这里只有精品| 国产精品乱码久久久久| 国产一区 在线播放| 国产福利在线免费| 黑人巨大国产9丨视频| 日韩av资源在线| 日本大片免费看|