56:824:719 sec11 directed (independent) study [aka "data (science) project"] [current syllabus https://theaok.github.io/dirStu]

make no mistake, this is not a walk in the park: the bar is high. To get an A, the paper has to be "publishable" at the end of the semester, or at least "publishable" after 1 set of easily doable revisions as per my final comments
PhD students graded differently; bar higher: need to be research sophisticated!
Fall 2018; Thu 2.30-5.30 321 Cooper, computer lab in the back of the first fl [note, you can also use the lab outside of the class time--just stop by my office and ask me for the key; this semester i am here especially on wed and thu]

note: we may have additional labs
instructor
  • Adam Okulicz-Kozaryn adam.okulicz.kozaryn@gmail.com
  • office: 321 Cooper St, room 302; office hours: TBA, and by appointment
  • this semester always at school on Wed and Thu; usually whole day; stop by
assistant
  • Shourjya Deb shourjyadb@gmail.com [not here until late Oct]
  • office: 321 Cooper St (3rd floor); office hours Wed 3-4, and by appointment

    prerequisites

    You need to be comfortable using a computer. Knowledge of Stata and data-management/computer science is helpful but not necessary. We will cover the basics.

    course description

    Essentially, this is a simplified and applied version of the first half of my data management class (https://sites.google.com/site/adamokuliczkozaryn/datman) plus (especially in the second half of the class) working on a publishable paper (hence the name: "directed/independent study"). We will focus only on doing utilitarian things with data (no fancy stuff from the data management class). We will focus far more on interpretation of results and brainstorming than on technicalities.
    parts from data management course description
    Recently, there has been more focus on (and even more need for) computer science in social science. The reverse is true to some degree, too: computer scientists are doing social science these days, e.g. http://arxiv.org/pdf/1409.8578.pdf

    Unfortunately, the main focus in social science is still on theory and data analysis, while data management is overlooked. Yet data management is not only a fundamental part of social science research but also the part that takes most of the time. This class aims to fill this gap. It is an applied class with a hands-on approach. You will see many exercises and tutorials.

    This is an applied research class that aims to teach computer tools for social science research that can automate the process and increase academic productivity. Much of this class is writing computer programs; if you do not like programming, this class is not for you... But you may not yet know whether you like it, and you may start liking it in this class: it has often happened before!

    We will use Stata only.

    The class covers the principles and practical techniques of data cleaning, data organization, quality control, and automation of research tasks. Topics covered include data types, labeling, recoding, data documentation, merging datasets, reshaping, and basic programming structures such as macros and loops.
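    To give a flavor of the topics just listed, here is a minimal sketch of recoding, labeling, a macro, and a loop. It is only an illustration, not code from the class materials: it assumes the auto example dataset that ships with Stata, and the variable name rep_good and the fair/good grouping are made up for this example.

```stata
* Sketch only: uses auto.dta, the example dataset shipped with Stata
sysuse auto, clear

* recoding: collapse the 5-point repair record into two groups
* (the grouping and the name rep_good are made up for this example)
recode rep78 (1/3 = 0 "fair") (4/5 = 1 "good"), generate(rep_good)

* labeling: describe the new variable
label variable rep_good "Repair record is good (4-5)"

* quality control: cross-check the new variable against the original
tabulate rep78 rep_good, missing

* macro and loop: run the same commands for several variables
* (run the whole block at once; local macros vanish between separate runs)
local vars price mpg weight
foreach v of local vars {
    quietly summarize `v'
    display "`v': mean = " %9.2f r(mean)
}
```

    Cross-tabulating the old and new variable right after a recode is a cheap way to catch mistakes, including what happened to missing values.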

    learning objectives/outcomes

  • learn the basics of computer programming
  • learn the practice of data management (there will be some theory, too)
  • learn how to conduct reproducible research
  • learn how to automate research by programming
  • demonstrate mastery of the material by writing code for a project/paper using learned techniques; you may cowrite code (up to 2 people) but then the project should be 2 times better than a single-authored paper

    required textbooks and materials

    There are no required textbooks. All required materials (code, readings) will be provided.

    recommended course materials

  • Most of the class is based on: Mitchell "Data Management Using Stata: A Practical Handbook" http://www.Stata.com/bookstore/dmus.html
  • A similar book, but with a focus on organization, is: Long "The Workflow of Data Analysis Using Stata" http://www.Stata.com/bookstore/wdaus.html
  • Programming, specifically, is covered in Chris Baum "An Introduction to Stata Programming" https://www.stata.com/bookstore/introduction-stata-programming/
  • If you are a beginner you may use: "A Gentle Introduction to Stata, 3rd Edition" https://www.stata.com/bookstore/gentle-introduction-to-stata/
  • Also for beginners: "Statistics with STATA: Version 12" by L Hamilton; good but overpriced https://www.stata.com/bookstore/statistics-with-stata/
  • There is actually no need to buy books; there are many excellent free on-line resources:
    UCLA is the best website: https://stats.idre.ucla.edu/stata/
    UCLA for data management: https://stats.idre.ucla.edu/stata/seminars/stata-data-management/
    and many more links here: http://www.Stata.com/links/resources1.html
  • MORE RESOURCES (not all necessarily recommended but some may be useful): some website listing resources (skip, you can read at home)

    software

  • We will use Stata version 15 (Intercooled/IC or higher: SE or MP).
  • Some lucky people can download it for free at https://software.rutgers.edu; but, i think, you have to be an RU employee
  • Free on apps: http://apps.rutgers.edu (not apps.camden.rutgers.edu) (somewhat clunky; good for computing enthusiasts). See general instructions at https://oirt.rutgers.edu/software/remotexserver/ QGIS is also at apps.rutgers. First make sure you have it enabled: go to http://netid.rutgers.edu, on the left click "service activation", and activate "apps cloud service". Then connect to apps.rutgers: go to https://apps.rutgers.edu. To copy files, install http://winscp.net, run it, and connect to: Host name: "apps.rutgers.edu"; User name: "your Rutgers NetID"; Password: "your Rutgers password"
  • You can buy your own Stata 15 IC/Intercooled perpetual license for $200 https://www.stata.com/order/new/edu/gradplans/campus-gradplan
  • NOTE: there is not much difference between versions... if you have Stata >=12 you are fine; if you have Stata 8, it is quite old... I update every second version; new version is out every 1-2 years...
  • Stata is cross-platform: linux, mac, windows
    auxiliary software

    GIT
  • when submitting ps in git just have one file that you will be updating for each ps!! [can just call it ps.do]
  • set up a repo on github.com or bitbucket.org or something else
  • that way we will be more productive: i will be able to give you much more comments and suggestions and offer more help writing code
  • it is also much more fun! and this is how code must be written if you are serious about it; to get started just go to one of the above websites
  • let me do a quick demo in github
  • howto get started with GIT (very quick HOWTO on basic use of github.com):
  • sign up or login at github.com
  • may see on your right 'New repository' button: click it; or may need to go to 'repositories', click 'new'; then pick some name for your repository, keep selected 'public', important!: must check 'Initialize this repository with a README', and click button 'Create repository'
  • now can simply hit button 'Upload files' and choose your dofile, important: add some meaningful commit message, say: 'first try on importing and exporting data, submitted as ps1' and hit 'Commit changes' button
  • then hit 'Settings' towards the top, and then on the left select 'Collaborators' tab and add me "theaok" and hit 'Add collaborator', that's it!
  • then I will download it, edit, and upload back
  • then you can click my commit message and see the so called diff--the difference between your version and my version
  • then you can download this latest version first, edit it, and upload it back when done--don't forget about a meaningful commit message--can keep on uploading newer versions as many times as you like
  • note: when you click the file, you can then click 'History' and see how the file evolved over time :)
  • below are general references on how to get started using it fully, probably the first two are most useful
  • http://www.sitepoint.com/git-for-beginners/
  • http://rogerdudler.github.io/git-guide/
  • http://stackoverflow.com/questions/315911/git-for-beginners-the-definitive-practical-guide
  • https://backlogtool.com/git-guide/en/intro/intro1_1.html
  • more about GIT
  • Tech Talk: Linus Torvalds on git http://www.youtube.com/watch?v=4XpnKHJAok8
  • a guide to git on windows http://nathanj.github.com/gitguide/tour.html
  • An introduction to git from The Chronicle of Higher Ed http://chronicle.com/blogs/profhacker/a-gentle-introduction-to-version-control/23064
  • a general paper about workflow (incl latex, git, emacs) http://www.kieranhealy.org/files/misc/workflow-apps.pdf
    requirements

  • Strictly speaking, this is advice rather than a requirement, but in practice it really is a requirement, as it is virtually impossible to succeed otherwise! Ask many questions, and ask often. This is a software class. It is different from all other classes! You will get stuck often, and whenever stuck, email me, as opposed to pulling your hair out! And stop by my office, too.
  • There are 6 problem sets (ps), each due the week after being posted (unless indicated otherwise; some ps will be due in 2 weeks). You will be asked to write some computer code that does something that we covered in class to your data. You may work in groups (<=2), but indicate who you worked with, and the more people in the group, the better/longer the code should be.
  • Students will write an empirical paper/report/etc on any topic using one or more of the techniques covered in this course. A typical paper will be 5 to 20 double-spaced pages. I will give you comments and help with the paper, and it is a good opportunity to produce a paper. I will also grade the code that you wrote to produce the results in your paper. You will submit not only the paper, but also the code that produced the results in the paper; in fact, you can just submit the code. Ideally, the paper should be submitted to a professional journal for publication.

    grading

  • problem sets 60% (6ps x 10%)
  • empirical paper (code, too; incl presentation (cool code and some cool output, esp graphs): 2 x 5%) 40%

    min    max    grade
    90.0   100.0  A
    85.0   89.9   B+
    80.0   84.9   B
    75.0   79.9   C+
    70.0   74.9   C
    0      69.9   F

    calendar

    warning! don't get behind: learning curve may be steep

    tentative: the most up-to-date calendar is always on the website: the url is at the top of this document
    (university calendar: http://scheduling.rutgers.edu/calendar.shtml)

    calendar is continuously updated: see timestamps on slides, best save or print them at the beginning of the class (i will not print for you); almost all changes will be minor; i will tell you if there is any bigger change; the further the class ahead the less updated it is

    when printing handouts you can print multiple slides per sheet (i like 6) http://kb2.adobe.com/cps/332/332720.html#main_Print multiple

    [*] means bonus (extra/not required)

    sep13 introduction
    Overview of the class material and policies. We will fire up Stata and have a look at Stata's text editor. And go over step-by-step GIT above https://theaok.github.io/dirStu#gitSta
  • ps0.pdf
  • over next week: very important! and if you haven't used Stata, familiarize yourself with it: see above links, especially ucla website; the learning curve may be steep soon!
  • intro.pdf   intro.do
  • replication.pdf
  • if time: let's discuss your research interests and data for this class (ps0)
  • [*] Data revolution! Interesting articles from the Economist: http://www.economist.com/node/15557443 and everything in one pdf file: https://www.emc.com/collateral/analyst-reports/ar-the-economist-data-data-everywhere.pdf
  • [*] "The end of theory" from Wired magazine http://www.wired.com/science/discoveries/magazine/16-07/pb_theory

  • data management

    sep20 data reading/saving (formats/conversion) and manipulation vid
    We will talk about different basic data formats, conversion between them, and how they can be imported/exported to/from Stata
  • ps1.pdf and ps2.pdf
  • readAndManipulate.pdf   readAndManipulate.do
  • https://stats.idre.ucla.edu/stata/seminars/notes/stata-class-notesentering-data/
  • https://stats.idre.ucla.edu/stata/seminars/notes/stata-class-notesexploring-data
  • https://stats.idre.ucla.edu/stata/seminars/notes/stata-class-notesmodifying-data
  • [*] if time: flip the class: we will flip the last half an hour or so
  • [*] Mitchell ch1-5,7
  • [*] This is for next week really, but it would help if you start looking at these asap, will also help with understanding better ps2: https://www.stata.com/manuals/u22.pdf and https://www.princeton.edu/~otorres/Merge101.pdf
  • [!] NOTE: around this time we need to get your project going: you need to have your own data and be reasonably comfortable with it so that you can be productive with it and we can work remotely on it; typically, we'll need to meet a few times around this time! there will be assignments due and we will not slow down!
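    A minimal sketch of the sep20 topic (converting between formats and basic manipulation). It is only an illustration, not the class dofile readAndManipulate.do: it assumes the auto example dataset that ships with Stata, and the file names auto_copy.csv and auto_small.dta are made up; both files are written to your current working folder.

```stata
* Sketch only: writes auto_copy.csv and auto_small.dta to the current folder
* (both file names are made up for this example)
sysuse auto, clear

* export to a comma-separated text file, then read it back in
export delimited using "auto_copy.csv", replace
import delimited using "auto_copy.csv", clear
describe

* manipulation: keep a few variables, drop incomplete rows
keep make price mpg
drop if missing(price)

* save in Stata's own format
save "auto_small.dta", replace
```

    Running describe right after an import is worth the habit: it shows whether numbers came in as numbers or as strings.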
  • sep27 combining data vid
    This class covers the key command for this class: merge.
  • ps3.pdf
  • start with a look at git repos
  • start with keep/drop (s36) from last class
  • mergeAppendReshape.pdf   mergeAppendReshape.do
  • merge conceptual setup: https://www.stata.com/manuals/u22.pdf and https://www.princeton.edu/~otorres/Merge101.pdf
  • merge practice: (make sure you run examples that start on p12!): https://www.stata.com/manuals14/dmerge.pdf
  • a quick overview of what we have done so far and doing today http://dss.princeton.edu/training/DataPrep101.pdf
  • reshape (also see "help reshape", as usual...) https://stats.idre.ucla.edu/stata/modules/reshaping-data-wide-to-long/    https://stats.idre.ucla.edu/stata/modules/reshaping-data-long-to-wide/    
  • [*] Mitchell ALL
  • flip the class
  • oct4 Slow down, make sure everybody got merge and read/manipulation basics, discussion, brainstorm: have a look at what you have so far in github.
    oct11 continue with merge: digest, practice; and: organization, documentation vid
    merge again: Start with a look at your repos, eg Rachel
    We already covered key topics, so we make sure today that everything is crystal clear. Also, labeling data, variables, and values. Keeping your data organized.
  • if needed: revisit the code from past
  • organize.pdf organize.do
  • [*] https://stats.idre.ucla.edu/stata/modules/labeling-data/
  • [*] Scott Long "The Workflow of Data Analysis Using Stata" ALL
  • [*] Mitchell ch5
  • wrap up what we did so far, review, revisit, maybe do http://dss.princeton.edu/training/DataPrep101.pdf; new module next week!
  • flip the class: flip the last 45min or so; maybe: exercises from last class; tutorials from links from last class
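    A minimal sketch of merge and reshape, the key commands from the sep27 and oct11 classes. It is only an illustration, not the class dofile mergeAppendReshape.do: the two tiny datasets and all variable names (id, income, happy, inc2016, inc2017) are made up for this example.

```stata
* Sketch only: all data and variable names below are made up
* (run the whole block at once; tempfiles vanish between separate runs)

* first dataset: income by person
clear
input id income
1 30
2 45
3 50
end
tempfile people
save `people'

* second dataset: happiness by person (id 3 is absent, id 4 is extra)
clear
input id happy
1 7
2 8
4 6
end

* one-to-one merge on id; always inspect _merge afterwards
merge 1:1 id using `people'
tabulate _merge
list, sepby(_merge)

* reshape: wide (one row per person) to long (one row per person-year)
clear
input id inc2016 inc2017
1 30 32
2 45 47
end
reshape long inc, i(id) j(year)
list

* and back to wide
reshape wide inc, i(id) j(year)
list
```

    In this sketch _merge should show two matched ids (1 and 2), one id found only in the data in memory (4), and one found only in the saved file (3); unexpected unmatched rows in your own data usually point to a problem with the key variable.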

  • oct18 graphics and exporting results vid
    Graphics is critical in understanding data, and understanding data is critical in data management.

    Results from Stata are data too, and need to manage them too!
  • ps4.pdf
  • graphics.pdf   graphics.do
  • exp.do and see at home http://dss.princeton.edu/training/Outreg2.pdf AND great resource per regressions: just follow these examples in your dofile! https://stats.idre.ucla.edu/stata/webbooks/reg/
  • oct25 presentations: go over your graph code, practice graphics; and if time: some data management theory
  • we will pick up with exporting results from the previous class
  • please make sure you added some graphs to github or jupyter, i will add graphs there too, and we will spend most of the class discussing graphs for each of you
  • in class: let's try for your data: hist, perc; tab, plot sort; gr matrix; scatter, mlab(UA); bar charts
  • theory.pdf try to do as much as we can and need: possibly just get to key slides based on scott long in first part as for theory for soc sci; and quick look at IT theory from the box in plos one article
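    A minimal sketch of the graph types listed for the in-class practice above. It is only an illustration, not the class dofile graphics.do: it assumes the auto example dataset that ships with Stata, so swap in your own variables (the label variable make stands in for whatever identifies your observations).

```stata
* Sketch only: uses auto.dta, the example dataset shipped with Stata
sysuse auto, clear

* histogram with percentages on the vertical axis
histogram mpg, percent

* one-way table with a quick text bar chart, sorted by frequency
tabulate rep78, plot sort

* scatterplot matrix for a first look at several variables
graph matrix price mpg weight

* scatterplot with each point labeled (here by car make)
scatter price mpg, mlabel(make)

* bar chart of a mean over groups
graph bar (mean) price, over(foreign)
```

    Each graph command replaces the previous graph window, so run them one at a time when exploring.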

  • part2: working on paper: directed study/project part

    Having covered Stata, we will now focus on producing research, work more one-on-one, and spend class time on discussions and brainstorming; note: we will spend most of each class discussing your research, so be prepared and bring something new to each class; you may also want to have a brief presentation of what you have accomplished since last week
    nov1 theory; discuss final project; discuss your research vid
  • ps5.pdf
  • presentations of ps4: graphs
  • we will focus on research questions and hypotheses for your projects; and data, variables and execution of testing of your hypotheses (also see ps5)
  • nov8
  • ideally bring and present a draft of your ps5
  • also, while not covering new stata material, data and stata questions and discussions are also a part of this second part of the class
  • nov15
  • ps6.pdf
  • do quick theory from last class
  • final_project.pdf (esp sec: inline response and activism v science)
  • ps5 presentations
  • continue discussing your projects: again please do take into account discussions from last week and improve your papers accordingly; also, as always, in this second part of the course, be prepared to present and discuss improvements in your papers AND new ideas/directions; also, bring any questions you may have
  • nov20 Tues!(Thanksgiving change of schedule)
  • pick up with final_project.pdf; maybe esp: lit rev
  • discuss our comments from last week and your responses to them (ideally, you may bring and present a draft of your inline response to them)
  • continue discussing your projects: again please do take into account discussions from last week and improve your papers accordingly; also, as always, in this second part of the course, be prepared to present and discuss improvements in your papers AND new ideas/directions; also, bring any questions you may have
  • nov29 presentations of ps6, and final project discussions
  • make sure you record all comments (verbal and written) and copy-paste (verbatim!) into next assignment and respond to them
  • if time revisit final_project.pdf
  • dec6: last class!
  • final project presentations 15min max; this is really important: i will give you a bunch of comments by email and verbally, and so may others: please save these comments and respond to all of them inline at the beginning of your final project!
  • final project discussions: focus on Rachel and Sarah
  • wrap.pdf
  • final project due on wed dec12 at 10pm
  • when submitting final project pdf, don't forget about the stata code!
  • final_project.pdf
  • just to be safe, delete the data you have posted online, you never know: someone may be picky about it

    rules

    do not share or link to class videos! These videocasts and podcasts are the exclusive copyrighted property of Rutgers University and the Professor teaching the course. Rutgers University and the Professor grant you a license only to replay them for your own personal use during the course. Sharing them with others (including other students), reproducing, distributing, or posting any part of them elsewhere -- including but not limited to any internet site -- will be treated as a copyright violation and an offense against the honesty provisions of the Code of Student Conduct. Furthermore, for Law Students, this will be reported by the Law School to the licensing authorities in any jurisdiction in which you may apply to the bar.

    attendance: Attendance is recommended. Be advised that you are responsible for any material covered in the class, whether or not it was in the readings or lecture notes. You are also responsible for any announcements made in class. For most students, attendance is simply essential to learning the material. If you do need to miss a class, be sure to consult with a fellow student to learn what transpired.

    incompletes: Generally speaking, the material in this course is best learned as a single unit. I will grant incompletes only in cases where a substantial change in life circumstances occurs that is beyond the control of the student, and only with appropriate documentation.

    study groups. You are encouraged to form a regular study group. Many students over the years have found the study groups to be very helpful. Study groups are permitted and encouraged to work on the problem sets together. However, each individual student should write up his or her own answer to hand in, based on his or her own understanding of the material. Do not hand in a copy of another person’s problem set, even a member of your own group. Writing up your own answer helps you to internalize the group discussions and is a crucial step in the learning process.

    Academic Integrity. I am very serious about this. Make no mistake--I may appear accommodating and informal--but I am extremely strict about academic integrity. Violations of academic integrity include cheating on tests or handing in assignments that do not reflect your own work and/or the work of a study group in which you actively participated. Handing in your own work that was performed not for this class (e.g. other class, any other project) is cheating, too. I have a policy of zero tolerance for cheating. Violations will be referred to the appropriate university authorities.

    For more information see http://fas.camden.rutgers.edu/student-experience/academic-integrity-policy

    Accommodating Students with Disabilities. Any student with a disability affecting performance in the class should contact the disability office ASAP: http://learn.camden.rutgers.edu/disability/disabilities.html